Whelp.. One or more devices error has finally hit me

orddie · Jan 27, 2017

Lots of good info on the forums, and I thank you all for it thus far. I'm mainly looking for confirmation. I moved the drives from one case to another, and SOME of the reviews of the case say... it killed my hard drive. So I did a stupid thing and moved the drive to another port in the case to see if the error moves. Apparently FreeNAS is smart enough to not care where the drive is... it's still da0.

The error started last night. While running overnight, it climbed to 2.5k errors, so I rebooted to confirm the drive was seated correctly. Brought it back up, saw the errors climb again, shut down and moved the drive. Below is the output after booting the server the 3rd time.

Code:

[root@freenas] ~# zpool status -v
  pool: Vmware
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
		attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
		using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 2h58m with 0 errors on Sat Jan 21 10:58:15 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		Vmware										  DEGRADED	 0	 0	 0
		  raidz1-0									  DEGRADED	 0	 0	 0
			gptid/a830a765-ae8d-11e6-b089-001b21a7c63c  DEGRADED	 0	 0	33  too many errors
			gptid/a89e4ab3-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/a90d594d-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/a979fb28-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
		  raidz1-1									  ONLINE	   0	 0	 0
			gptid/a9e56fc8-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/aa5673dc-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/aac48004-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/ab336f0d-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Thu Dec 29 03:46:41 2016
config:

		NAME										  STATE	 READ WRITE CKSUM
		freenas-boot								  ONLINE	   0	 0	 0
		  gptid/87474e74-b21a-11e6-a53f-001b21a7c63c  ONLINE	   0	 0	 0

I'm going to disable my hourly S.M.A.R.T. test and run a long test on the drive, but here is the short output:

Code:

[root@freenas] ~# smartctl -a /dev/da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:			   HITACHI
Product:			  HUC109045CSS600
Revision:			 A2B0
Compliance:		   SPC-4
User Capacity:		450,098,159,616 bytes [450 GB]
Logical block size:   512 bytes
Rotation Rate:		10020 rpm
Form Factor:		  2.5 inches
Logical Unit id:	  0x5000cca0162c6f04
Serial number:		KMGTEJXF
Device type:		  disk
Transport protocol:   SAS (SPL-3)
Local Time is:		Fri Jan 27 09:08:20 2017 EST
SMART support is:	 Available - device has SMART capability.
SMART support is:	 Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:	 36 C
Drive Trip Temperature:		85 C

Manufactured in week 50 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  11
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  76
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 1951387957067776

Error counter log:
		   Errors Corrected by		   Total   Correction	 Gigabytes	Total
			   ECC		  rereads/	errors   algorithm	  processed	uncorrected
		   fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:		  0	44227		 0	 44227	 143934		407.599		   0
write:		 0   262436		 0	262436	  83490	   1158.285		   0
verify:		0		0		 0		 0	  27039		  0.000		   0

Non-medium error count:		0

SMART Self-test log
Num  Test			  Status				 segment  LifeTime  LBA_first_err [SK ASC ASQ]
	 Description							  number   (hours)
# 1  Background short  Completed				   -	1575				 - [-   -	-]
# 2  Background short  Completed				   -	1574				 - [-   -	-]
# 3  Background short  Completed				   -	1573				 - [-   -	-]
# 4  Background short  Completed				   -	1572				 - [-   -	-]
# 5  Background short  Completed				   -	1571				 - [-   -	-]
# 6  Background short  Completed				   -	1570				 - [-   -	-]
# 7  Background short  Completed				   -	1569				 - [-   -	-]
# 8  Background long   Completed				   -	1568				 - [-   -	-]
# 9  Background short  Completed				   -	1566				 - [-   -	-]
#10  Background short  Completed				   -	1565				 - [-   -	-]
#11  Background short  Completed				   -	1564				 - [-   -	-]
#12  Background short  Completed				   -	1563				 - [-   -	-]
#13  Background short  Completed				   -	1562				 - [-   -	-]
#14  Background short  Completed				   -	1561				 - [-   -	-]
#15  Background short  Completed				   -	1560				 - [-   -	-]
#16  Background short  Completed				   -	1559				 - [-   -	-]
#17  Background short  Completed				   -	1558				 - [-   -	-]
#18  Background short  Completed				   -	1557				 - [-   -	-]
#19  Background short  Completed				   -	1556				 - [-   -	-]
#20  Background short  Completed				   -	1555				 - [-   -	-]

Long (extended) Self Test duration: 3670 seconds [61.2 minutes]

So... I gotta replace the SPIN drive, which is okay. But I would THINK SMART would tell me something!
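For anyone who wants to pull the failing member out of `zpool status` output without reading the table by eye, here is a minimal Python sketch. It runs on a few rows copied from the output above. The function names (`parse_count`, `devices_with_errors`) are made up for illustration, and treating the `K` suffix as a multiple of 1024 is an assumption about how zpool abbreviates counters, so values parsed from a suffixed figure are approximate.

```python
import re

# Rows copied from the `zpool status -v` output posted in this thread.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        Vmware                                          DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/a830a765-ae8d-11e6-b089-001b21a7c63c  DEGRADED     0     0    33  too many errors
            gptid/a89e4ab3-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
"""

STATES = "ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED"
# name, state (not captured), then the READ, WRITE and CKSUM columns.
ROW = re.compile(r"^\s*(\S+)\s+(?:" + STATES + r")\s+(\S+)\s+(\S+)\s+(\S+)")


def parse_count(text):
    """Convert a counter such as '33' or '5.00K' to an integer.

    Assumption: suffixes are powers of 1024; the displayed value is
    rounded, so the result is only approximate for suffixed counters.
    """
    multipliers = {"K": 1024, "M": 1024**2, "G": 1024**3}
    if text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for every row with a nonzero counter."""
    found = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank, or non-table line
        name = match.group(1)
        read, write, cksum = (parse_count(g) for g in match.groups()[1:])
        if read or write or cksum:
            found.append((name, read, write, cksum))
    return found


print(devices_with_errors(SAMPLE))
```

On a live system the same function could be fed the text of `zpool status -v` captured however you prefer; the sketch deliberately avoids running any command itself.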

Spearfoot · Jan 27, 2017

orddie said:
So... I gotta replace the SPIN drive, which is okay. But I would THINK SMART would tell me something!

Indeed! I've had drives suddenly die without any kind of warning from SMART tests... but you're seeing a bunch of FreeNAS errors w/ no SMART warnings at all. I'm not as familiar with the SMART output from SAS drives; is this section our clue?

Code:

Error counter log:
		 Errors Corrected by		 Total Correction	Gigabytes	Total
			 ECC		 rereads/	errors algorithm	 processed	uncorrected
		 fast | delayed rewrites corrected invocations [10^9 bytes] errors
read:		 0	44227		0	44227	143934		407.599		 0
write:		0 262436		0	262436	 83490	 1158.285		 0
verify:		0		0		0		0	 27039		 0.000		 0

Did you try replacing the cable? :)

orddie · Jan 27, 2017

Spearfoot said:
Indeed! I've had drives suddenly die without any kind of warning from SMART tests... but you're seeing a bunch of FreeNAS errors w/ no SMART warnings at all. I'm not as familiar with the SMART output from SAS drives; is this section our clue?

Code:

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    44227         0     44227     143934        407.599           0
write:         0   262436         0    262436      83490       1158.285           0
verify:        0        0         0         0      27039          0.000           0

Did you try replacing the cable? :)

Remember, I moved the drive into another port / slot in the case, so I would have expected the error to move to a different drive if it were the cable / port on the RAID card.

to answer your question, no I did not try and replace the cable.

Spearfoot · Jan 27, 2017

orddie said:
Remember, I moved the drive into another port / slot in the case, so I would have expected the error to move to a different drive if it were the cable / port on the RAID card.

to answer your question, no I did not try and replace the cable.

Doooh!

orddie · Jan 27, 2017

so.. long test is done! looks like it passed?

Code:

smartctl -t long /dev/da0

Code:



[root@freenas] ~# smartctl -a /dev/da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:			   HITACHI
Product:			  HUC109045CSS600
Revision:			 A2B0
Compliance:		   SPC-4
User Capacity:		450,098,159,616 bytes [450 GB]
Logical block size:   512 bytes
Rotation Rate:		10020 rpm
Form Factor:		  2.5 inches
Logical Unit id:	  0x5000cca0162c6f04
Serial number:		KMGTEJXF
Device type:		  disk
Transport protocol:   SAS (SPL-3)
Local Time is:		Fri Jan 27 11:14:51 2017 EST
SMART support is:	 Available - device has SMART capability.
SMART support is:	 Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:	 34 C
Drive Trip Temperature:		85 C

Manufactured in week 50 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  11
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  76
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 1952980047757312

Error counter log:
		   Errors Corrected by		   Total   Correction	 Gigabytes	Total
			   ECC		  rereads/	errors   algorithm	  processed	uncorrected
		   fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:		  0	44363		 0	 44363	 144697		408.321		   0
write:		 0   263471		 0	263471	  83533	   1160.020		   0
verify:		0		0		 0		 0	  27640		  0.000		   0

Non-medium error count:		0

SMART Self-test log
Num  Test			  Status				 segment  LifeTime  LBA_first_err [SK ASC ASQ]
	 Description							  number   (hours)
# 1  Background long   Completed				   -	1578				 - [-   -	-]
# 2  Background short  Completed				   -	1575				 - [-   -	-]
# 3  Background short  Completed				   -	1574				 - [-   -	-]
# 4  Background short  Completed				   -	1573				 - [-   -	-]
# 5  Background short  Completed				   -	1572				 - [-   -	-]
# 6  Background short  Completed				   -	1571				 - [-   -	-]
# 7  Background short  Completed				   -	1570				 - [-   -	-]
# 8  Background short  Completed				   -	1569				 - [-   -	-]
# 9  Background long   Completed				   -	1568				 - [-   -	-]
#10  Background short  Completed				   -	1566				 - [-   -	-]
#11  Background short  Completed				   -	1565				 - [-   -	-]
#12  Background short  Completed				   -	1564				 - [-   -	-]
#13  Background short  Completed				   -	1563				 - [-   -	-]
#14  Background short  Completed				   -	1562				 - [-   -	-]
#15  Background short  Completed				   -	1561				 - [-   -	-]
#16  Background short  Completed				   -	1560				 - [-   -	-]
#17  Background short  Completed				   -	1559				 - [-   -	-]
#18  Background short  Completed				   -	1558				 - [-   -	-]
#19  Background short  Completed				   -	1557				 - [-   -	-]
#20  Background short  Completed				   -	1556				 - [-   -	-]

Long (extended) Self Test duration: 3670 seconds [61.2 minutes]
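The two `smartctl -a /dev/da0` outputs in this thread (09:08 and 11:14 on Jan 27) can be compared counter by counter. A rough Python sketch, with the rows copied from the posts above; `parse_error_counters` and `counter_delta` are hypothetical helper names, and the column positions are taken from the table header as posted. The sketch only reports how much the counters changed; it does not say what rate of corrected errors is normal for this drive, which the thread does not establish.

```python
# Error counter rows copied from the two smartctl outputs for /dev/da0.
BEFORE = """\
read:          0    44227         0     44227     143934        407.599           0
write:         0   262436         0    262436      83490       1158.285           0
verify:        0        0         0         0      27039          0.000           0
"""

AFTER = """\
read:          0    44363         0     44363     144697        408.321           0
write:         0   263471         0    263471      83533       1160.020           0
verify:        0        0         0         0      27640          0.000           0
"""


def parse_error_counters(text):
    """Return {operation: {"corrected": int, "uncorrected": int}}.

    Columns after the label: ECC fast, ECC delayed, rereads/rewrites,
    total errors corrected, algorithm invocations, gigabytes processed,
    total uncorrected errors.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0] in ("read:", "write:", "verify:"):
            counters[fields[0].rstrip(":")] = {
                "corrected": int(fields[4]),
                "uncorrected": int(fields[7]),
            }
    return counters


def counter_delta(before_text, after_text):
    """Subtract the earlier snapshot from the later one, per operation."""
    before = parse_error_counters(before_text)
    after = parse_error_counters(after_text)
    return {
        op: {key: after[op][key] - before[op][key] for key in after[op]}
        for op in after
    }


for op, change in counter_delta(BEFORE, AFTER).items():
    print(op, change)
```

In both snapshots the "Total uncorrected errors" column stays at 0, which is consistent with SMART reporting the drive as OK while ZFS keeps counting checksum errors.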

rs225 · Jan 27, 2017

If this problem is following the drive, then it could be a drive electronics problem (or possibly power). It may not show up in SMART, because those are generally looking for errors either on the cable, or in the drive mechanics. This problem might be right in the middle. Only ZFS checksums seem to be detecting it.

orddie · Jan 27, 2017

rs225 said:
If this problem is following the drive, then it could be a drive electronics problem (or possibly power). It may not show up in SMART, because those are generally looking for errors either on the cable, or in the drive mechanics. This problem might be right in the middle. Only ZFS checksums seem to be detecting it.

either way.. replace the drive.. yes?

rs225 · Jan 27, 2017

Replace the drive, but if the problem continues, then the drive isn't bad after all. :)

Arwen · Jan 27, 2017

Uh, I would think running hourly short SMART tests would be both overkill and potentially causing a little un-needed wear.
My own opinion is that even daily short SMART tests would be too much. Weekly or several times a week perhaps.

orddie · Jan 28, 2017

rs225 said:
Replace the drive, but if the problem continues, then the drive isn't bad after all. :)

HA.. Thanks!

Arwen said:
Uh, I would think running hourly short SMART tests would be both overkill and potentially causing a little un-needed wear.
My own opinion is that even daily short SMART tests would be too much. Weekly or several times a week perhaps.

we have different views on how important storing porn is

SweetAndLow · Jan 28, 2017

Are you sure the drive that is degraded is da0? Double-check using glabel status.

danb35 · Jan 28, 2017

Arwen said:
Uh, I would think running hourly short SMART tests would be both overkill and potentially causing a little un-needed wear.

I doubt they'd cause any significant wear (there's going to be plenty of I/O going on anyway), but agree that hourly SMART tests are complete overkill. I run short tests daily, but that's admittedly on the "high-frequency" end of the recommended range.

orddie · Jan 28, 2017

SweetAndLow said:
Are you sure the drive that is degraded is da0? Double-check using glabel status.

What started as da0 moved to da1, most likely when I moved the HD to a different slot. I think I ran the long test on the wrong drive. Will run a long test on the correct one now.

Code:

[root@freenas] ~# zpool status -v
  pool: Vmware
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
		attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
		using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 2h58m with 0 errors on Sat Jan 21 10:58:15 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		Vmware										  DEGRADED	 0	 0	 0
		  raidz1-0									  DEGRADED	 0	 0	 0
			gptid/a830a765-ae8d-11e6-b089-001b21a7c63c  DEGRADED	 0	 0 5.00K  too many errors
			gptid/a89e4ab3-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/a90d594d-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/a979fb28-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
		  raidz1-1									  ONLINE	   0	 0	 0
			gptid/a9e56fc8-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/aa5673dc-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/aac48004-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0
			gptid/ab336f0d-ae8d-11e6-b089-001b21a7c63c  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Thu Dec 29 03:46:41 2016
config:

		NAME										  STATE	 READ WRITE CKSUM
		freenas-boot								  ONLINE	   0	 0	 0
		  gptid/87474e74-b21a-11e6-a53f-001b21a7c63c  ONLINE	   0	 0	 0

errors: No known data errors
[root@freenas] ~# glabel status | grep gptid/a830a765-ae8d-11e6-b089-001b21a7c63c
gptid/a830a765-ae8d-11e6-b089-001b21a7c63c	 N/A  da0p1
[root@freenas] ~# smartctl -a /dev/da1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:			   HITACHI
Product:			  HUC109045CSS600
Revision:			 A2B0
Compliance:		   SPC-4
User Capacity:		450,098,159,616 bytes [450 GB]
Logical block size:   512 bytes
Rotation Rate:		10020 rpm
Form Factor:		  2.5 inches
Logical Unit id:	  0x5000cca0160794a8
Serial number:		KMG457JF
Device type:		  disk
Transport protocol:   SAS (SPL-3)
Local Time is:		Sat Jan 28 08:56:36 2017 EST
SMART support is:	 Available - device has SMART capability.
SMART support is:	 Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:	 31 C
Drive Trip Temperature:		85 C

Manufactured in week 50 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  11
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  73
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 1932938170073088

Error counter log:
		   Errors Corrected by		   Total   Correction	 Gigabytes	Total
			   ECC		  rereads/	errors   algorithm	  processed	uncorrected
		   fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:		  0   130372		 0	130372	 837320		398.966		   0
write:		 0   288857		 0	288857	  98277	   1163.567		   0
verify:		0		0		 0		 0	 141324		  0.000		   0

Non-medium error count:		0

SMART Self-test log
Num  Test			  Status				 segment  LifeTime  LBA_first_err [SK ASC ASQ]
	 Description							  number   (hours)
# 1  Background short  Completed				   -	1573				 - [-   -	-]
# 2  Background short  Completed				   -	1572				 - [-   -	-]
# 3  Background short  Completed				   -	1571				 - [-   -	-]
# 4  Background short  Completed				   -	1570				 - [-   -	-]
# 5  Background short  Completed				   -	1569				 - [-   -	-]
# 6  Background short  Completed				   -	1568				 - [-   -	-]
# 7  Background short  Completed				   -	1567				 - [-   -	-]
# 8  Background short  Completed				   -	1566				 - [-   -	-]
# 9  Background short  Completed				   -	1565				 - [-   -	-]
#10  Background short  Completed				   -	1564				 - [-   -	-]
#11  Background short  Completed				   -	1563				 - [-   -	-]
#12  Background short  Completed				   -	1562				 - [-   -	-]
#13  Background short  Completed				   -	1561				 - [-   -	-]
#14  Background short  Completed				   -	1560				 - [-   -	-]
#15  Background short  Completed				   -	1559				 - [-   -	-]
#16  Background short  Completed				   -	1558				 - [-   -	-]
#17  Background short  Completed				   -	1557				 - [-   -	-]
#18  Background short  Completed				   -	1556				 - [-   -	-]
#19  Background short  Completed				   -	1555				 - [-   -	-]
#20  Background short  Completed				   -	1554				 - [-   -	-]

Long (extended) Self Test duration: 3670 seconds [61.2 minutes]
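The gptid-to-device lookup done with `glabel status | grep` above can also be scripted, which helps avoid testing the wrong disk. A small Python sketch, using the single line of `glabel status` output from this post; `label_to_disk` is a made-up helper name, and stripping a trailing `pN` partition suffix to get the disk name is an assumption that fits the `da0p1` form shown here.

```python
import re

# Line copied from the `glabel status` output posted above.
SAMPLE = "gptid/a830a765-ae8d-11e6-b089-001b21a7c63c     N/A  da0p1\n"


def label_to_disk(glabel_output, label):
    """Return the disk (e.g. 'da0') that carries the given label, or None."""
    for line in glabel_output.splitlines():
        fields = line.split()
        # Expected columns: Name, Status, Components.
        if len(fields) == 3 and fields[0] == label:
            # Drop a trailing partition suffix such as "p1".
            return re.sub(r"p\d+$", "", fields[2])
    return None


disk = label_to_disk(SAMPLE, "gptid/a830a765-ae8d-11e6-b089-001b21a7c63c")
print(disk)
```

With the sample line this resolves the degraded gptid to da0, so `/dev/da0` is the device to pass to smartctl, not `/dev/da1`.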

SweetAndLow · Jan 28, 2017

da0 looks like the correct drive. I just wanted to double-check because the da numbers are assigned in detection order and have nothing to do with drive or slot location.

orddie · Jan 28, 2017

SweetAndLow said:
da0 looks like the correct drive. I just wanted to double-check because the da numbers are assigned in detection order and have nothing to do with drive or slot location.

ah. I see my mistake. thanks!

Robert Trevellyan · Jan 28, 2017

danb35 said:
I doubt they'd cause any significant wear (there's going to be plenty of I/O going on anyway), but agree that hourly SMART tests are complete overkill

It's not just the extra wear, but the fact that you fill up the logs and age out results that you might find useful.
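To put a number on the "age out" point: the self-test logs posted in this thread show 20 entries each, and with hourly tests the oldest entry is only about 20 power-on hours behind the newest (1555 to 1575 in the first output). A back-of-the-envelope Python sketch, assuming the visible log length of 20 is the limit for this drive and that tests run at a fixed interval; `history_hours` is just an illustrative name.

```python
# Assumption: the log keeps 20 entries, as seen in the smartctl outputs above.
LOG_ENTRIES = 20


def history_hours(interval_hours, entries=LOG_ENTRIES):
    """Hours of test history retained when one test runs every interval_hours."""
    return interval_hours * entries


for name, interval in (("hourly", 1), ("daily", 24), ("weekly", 168)):
    hours = history_hours(interval)
    print(f"{name}: {hours} hours of history (~{hours / 24:.1f} days)")
```

Under these assumptions an hourly schedule keeps less than a day of results, while a daily schedule keeps about 20 days.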
