Whelp.. One or more devices error has finally hit me

Status
Not open for further replies.

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
Lots of good info on the forums, and I thank you all for it. I'm mainly looking for confirmation. I moved the drives from one case to another, and SOME of the reviews of the new case say "it killed my hard drive". So I did a stupid thing and moved the drive to another port in the case to see if the error moves. Apparently FreeNAS is smart enough not to care where the drive is... it's still da0.


The errors started last night. While running overnight, the pool reported about 2.5k errors, so I rebooted to confirm the drive was seated correctly. I brought it back up, saw the errors climb again, shut down, and moved the drive. Below is the output after booting the server the third time.

Code:
[root@freenas] ~# zpool status -v
  pool: Vmware
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
		attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
		using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 2h58m with 0 errors on Sat Jan 21 10:58:15 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vmware                                          DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/a830a765-ae8d-11e6-b089-001b21a7c63c  DEGRADED     0     0    33  too many errors
            gptid/a89e4ab3-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/a90d594d-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/a979fb28-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            gptid/a9e56fc8-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/aa5673dc-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/aac48004-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/ab336f0d-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Thu Dec 29 03:46:41 2016
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/87474e74-b21a-11e6-a53f-001b21a7c63c  ONLINE       0     0     0





I'm going to disable my hourly S.M.A.R.T. tests and run a long test on the drive, but here is the output from the short tests:

Code:
[root@freenas] ~# smartctl -a /dev/da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:			   HITACHI
Product:			  HUC109045CSS600
Revision:			 A2B0
Compliance:		   SPC-4
User Capacity:		450,098,159,616 bytes [450 GB]
Logical block size:   512 bytes
Rotation Rate:		10020 rpm
Form Factor:		  2.5 inches
Logical Unit id:	  0x5000cca0162c6f04
Serial number:		KMGTEJXF
Device type:		  disk
Transport protocol:   SAS (SPL-3)
Local Time is:		Fri Jan 27 09:08:20 2017 EST
SMART support is:	 Available - device has SMART capability.
SMART support is:	 Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:	 36 C
Drive Trip Temperature:		85 C

Manufactured in week 50 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  11
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  76
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 1951387957067776

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    44227         0     44227     143934        407.599           0
write:         0   262436         0    262436      83490       1158.285           0
verify:        0        0         0         0      27039          0.000           0

Non-medium error count:		0

SMART Self-test log
Num  Test			  Status				 segment  LifeTime  LBA_first_err [SK ASC ASQ]
	 Description							  number   (hours)
# 1  Background short  Completed				   -	1575				 - [-   -	-]
# 2  Background short  Completed				   -	1574				 - [-   -	-]
# 3  Background short  Completed				   -	1573				 - [-   -	-]
# 4  Background short  Completed				   -	1572				 - [-   -	-]
# 5  Background short  Completed				   -	1571				 - [-   -	-]
# 6  Background short  Completed				   -	1570				 - [-   -	-]
# 7  Background short  Completed				   -	1569				 - [-   -	-]
# 8  Background long   Completed				   -	1568				 - [-   -	-]
# 9  Background short  Completed				   -	1566				 - [-   -	-]
#10  Background short  Completed				   -	1565				 - [-   -	-]
#11  Background short  Completed				   -	1564				 - [-   -	-]
#12  Background short  Completed				   -	1563				 - [-   -	-]
#13  Background short  Completed				   -	1562				 - [-   -	-]
#14  Background short  Completed				   -	1561				 - [-   -	-]
#15  Background short  Completed				   -	1560				 - [-   -	-]
#16  Background short  Completed				   -	1559				 - [-   -	-]
#17  Background short  Completed				   -	1558				 - [-   -	-]
#18  Background short  Completed				   -	1557				 - [-   -	-]
#19  Background short  Completed				   -	1556				 - [-   -	-]
#20  Background short  Completed				   -	1555				 - [-   -	-]

Long (extended) Self Test duration: 3670 seconds [61.2 minutes]




So... I gotta replace the SPIN drive, which is okay. But I would THINK SMART would tell me something!
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Indeed! I've had drives suddenly die without any kind of warning from SMART tests... but you're seeing a bunch of FreeNAS errors w/ no SMART warnings at all. I'm not as familiar with the SMART output from SAS drives; is this section our clue?
Code:
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    44227         0     44227     143934        407.599           0
write:         0   262436         0    262436      83490       1158.285           0
verify:        0        0         0         0      27039          0.000           0

Did you try replacing the cable? :)
 

orddie

Remember, I moved the drive into another port/slot in the case, so I would have expected the error to move to a different drive if it were the cable or the port on the RAID card.

To answer your question: no, I did not try replacing the cable.
 

Spearfoot

Doooh! :rolleyes:
 

orddie

So.. the long test is done! Looks like it passed?

Code:
smartctl -t long /dev/da0


Code:


[root@freenas] ~# smartctl -a /dev/da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:			   HITACHI
Product:			  HUC109045CSS600
Revision:			 A2B0
Compliance:		   SPC-4
User Capacity:		450,098,159,616 bytes [450 GB]
Logical block size:   512 bytes
Rotation Rate:		10020 rpm
Form Factor:		  2.5 inches
Logical Unit id:	  0x5000cca0162c6f04
Serial number:		KMGTEJXF
Device type:		  disk
Transport protocol:   SAS (SPL-3)
Local Time is:		Fri Jan 27 11:14:51 2017 EST
SMART support is:	 Available - device has SMART capability.
SMART support is:	 Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:	 34 C
Drive Trip Temperature:		85 C

Manufactured in week 50 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  11
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  76
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 1952980047757312

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    44363         0     44363     144697        408.321           0
write:         0   263471         0    263471      83533       1160.020           0
verify:        0        0         0         0      27640          0.000           0

Non-medium error count:		0

SMART Self-test log
Num  Test			  Status				 segment  LifeTime  LBA_first_err [SK ASC ASQ]
	 Description							  number   (hours)
# 1  Background long   Completed				   -	1578				 - [-   -	-]
# 2  Background short  Completed				   -	1575				 - [-   -	-]
# 3  Background short  Completed				   -	1574				 - [-   -	-]
# 4  Background short  Completed				   -	1573				 - [-   -	-]
# 5  Background short  Completed				   -	1572				 - [-   -	-]
# 6  Background short  Completed				   -	1571				 - [-   -	-]
# 7  Background short  Completed				   -	1570				 - [-   -	-]
# 8  Background short  Completed				   -	1569				 - [-   -	-]
# 9  Background long   Completed				   -	1568				 - [-   -	-]
#10  Background short  Completed				   -	1566				 - [-   -	-]
#11  Background short  Completed				   -	1565				 - [-   -	-]
#12  Background short  Completed				   -	1564				 - [-   -	-]
#13  Background short  Completed				   -	1563				 - [-   -	-]
#14  Background short  Completed				   -	1562				 - [-   -	-]
#15  Background short  Completed				   -	1561				 - [-   -	-]
#16  Background short  Completed				   -	1560				 - [-   -	-]
#17  Background short  Completed				   -	1559				 - [-   -	-]
#18  Background short  Completed				   -	1558				 - [-   -	-]
#19  Background short  Completed				   -	1557				 - [-   -	-]
#20  Background short  Completed				   -	1556				 - [-   -	-]

Long (extended) Self Test duration: 3670 seconds [61.2 minutes]
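
Since the health status and self-test log look the same in both runs, one way to compare the two outputs is to diff the error counter rows. This is only a minimal sketch: the rows are pasted in by hand from the two `smartctl -a /dev/da0` outputs above, and the field names (`ecc_delayed`, `total_uncorrected`, and so on) are my own labels for smartctl's column headers, not anything smartctl defines.

```python
import re

# Rows copied from the two `smartctl -a /dev/da0` outputs (whitespace normalized).
BEFORE = """\
read:   0  44227  0  44227  143934  407.599  0
write:  0 262436  0 262436   83490 1158.285  0
verify: 0      0  0      0   27039    0.000  0
"""
AFTER = """\
read:   0  44363  0  44363  144697  408.321  0
write:  0 263471  0 263471   83533 1160.020  0
verify: 0      0  0      0   27640    0.000  0
"""

# Hypothetical labels for the seven columns of the error counter log.
FIELDS = ("ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
          "algorithm_invocations", "gigabytes_processed", "total_uncorrected")


def parse_error_counters(text):
    """Return {operation: {field: value}} for the read/write/verify rows."""
    rows = {}
    for line in text.splitlines():
        match = re.match(r"\s*(read|write|verify):\s+(.*)", line)
        if not match:
            continue
        values = match.group(2).split()
        if len(values) != len(FIELDS):
            continue  # skip anything that does not look like a counter row
        rows[match.group(1)] = {
            name: float(v) if name == "gigabytes_processed" else int(v)
            for name, v in zip(FIELDS, values)
        }
    return rows


def counter_deltas(before, after):
    """Subtract the earlier snapshot from the later one, field by field."""
    return {
        op: {name: round(after[op][name] - before[op][name], 3) for name in FIELDS}
        for op in before
    }


deltas = counter_deltas(parse_error_counters(BEFORE), parse_error_counters(AFTER))
for op, delta in deltas.items():
    print(op, "corrected:", delta["total_corrected"],
          "uncorrected:", delta["total_uncorrected"])
```

Between the two runs the corrected counts rose by 136 (read) and 1035 (write), while the uncorrected column stayed at 0 for every row, so the drive itself reports nothing it could not correct.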

 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
If this problem is following the drive, then it could be a drive electronics problem (or possibly power). It may not show up in SMART, because SMART is generally looking for errors either on the cable or in the drive mechanics. This problem might be right in the middle. Only ZFS checksums seem to be detecting it.
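
A minimal sketch of pulling the devices with non-zero READ/WRITE/CKSUM counters out of saved `zpool status` text. The sample rows are copied from the outputs in this thread; the function names are hypothetical, and treating the `K` suffix as a multiple of 1024 is an assumption, so abbreviated counts should be read as approximate.

```python
import re

# A few rows copied from the `zpool status -v` output in this thread.
SAMPLE = """\
NAME                                            STATE     READ WRITE CKSUM
Vmware                                          DEGRADED     0     0     0
  raidz1-0                                      DEGRADED     0     0     0
    gptid/a830a765-ae8d-11e6-b089-001b21a7c63c  DEGRADED     0     0 5.00K  too many errors
    gptid/a89e4ab3-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
"""

# Assumption: abbreviated counters use powers of 1024.
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def parse_count(token):
    """Convert a counter such as '33' or '5.00K' to an (approximate) integer."""
    if token[-1] in SUFFIXES:
        return int(float(token[:-1]) * SUFFIXES[token[-1]])
    return int(token)


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    flagged = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines and free text do not match a known state
        name, state = match.group(1), match.group(2)
        read, write, cksum = (parse_count(t) for t in match.groups()[2:])
        if read or write or cksum:
            flagged.append((name, state, read, write, cksum))
    return flagged


for row in devices_with_errors(SAMPLE):
    print(row)
```

On the sample this flags only the one gptid, with READ and WRITE at 0 and a large CKSUM count, which matches the point that only ZFS checksums are catching the problem.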
 

orddie

Either way... replace the drive, yes?
 

rs225

Replace the drive, but if the problem continues, then the drive isn't bad after all. :)
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Uh, I would think running hourly short SMART tests would be overkill and could cause a little unneeded wear.
My own opinion is that even daily short SMART tests would be too much. Weekly, or several times a week, perhaps.
 

orddie

rs225 said: "Replace the drive, but if the problem continues, then the drive isn't bad after all. :)"
HA.. Thanks!

Arwen said: "Uh, I would think running hourly short SMART tests would be overkill and could cause a little unneeded wear. My own opinion is that even daily short SMART tests would be too much. Weekly, or several times a week, perhaps."
We have different views on how important storing porn is.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Are you sure the drive that is degraded is da0? Double check using glabel status.

 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Arwen said: "Uh, I would think running hourly short SMART tests would be overkill and could cause a little unneeded wear."
I doubt they'd cause any significant wear (there's going to be plenty of I/O going on anyway), but agree that hourly SMART tests are complete overkill. I run short tests daily, but that's admittedly on the "high-frequency" end of the recommended range.
 

orddie

SweetAndLow said: "Are you sure the drive that is degraded is da0? Double check using glabel status."

What started as da0 moved to da1, most likely when I moved the HD to a different slot. I think I ran the long test on the wrong drive; I'll run a long test on the correct one now.

Code:
[root@freenas] ~# zpool status -v
  pool: Vmware
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
		attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
		using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 2h58m with 0 errors on Sat Jan 21 10:58:15 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vmware                                          DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/a830a765-ae8d-11e6-b089-001b21a7c63c  DEGRADED     0     0 5.00K  too many errors
            gptid/a89e4ab3-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/a90d594d-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/a979fb28-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            gptid/a9e56fc8-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/aa5673dc-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/aac48004-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0
            gptid/ab336f0d-ae8d-11e6-b089-001b21a7c63c  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Thu Dec 29 03:46:41 2016
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/87474e74-b21a-11e6-a53f-001b21a7c63c  ONLINE       0     0     0

errors: No known data errors
[root@freenas] ~# glabel status | grep gptid/a830a765-ae8d-11e6-b089-001b21a7c63c
gptid/a830a765-ae8d-11e6-b089-001b21a7c63c	 N/A  da0p1
[root@freenas] ~# smartctl -a /dev/da1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:			   HITACHI
Product:			  HUC109045CSS600
Revision:			 A2B0
Compliance:		   SPC-4
User Capacity:		450,098,159,616 bytes [450 GB]
Logical block size:   512 bytes
Rotation Rate:		10020 rpm
Form Factor:		  2.5 inches
Logical Unit id:	  0x5000cca0160794a8
Serial number:		KMG457JF
Device type:		  disk
Transport protocol:   SAS (SPL-3)
Local Time is:		Sat Jan 28 08:56:36 2017 EST
SMART support is:	 Available - device has SMART capability.
SMART support is:	 Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:	 31 C
Drive Trip Temperature:		85 C

Manufactured in week 50 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  11
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  73
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 1932938170073088

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   130372         0    130372     837320        398.966           0
write:         0   288857         0    288857      98277       1163.567           0
verify:        0        0         0         0     141324          0.000           0

Non-medium error count:		0

SMART Self-test log
Num  Test			  Status				 segment  LifeTime  LBA_first_err [SK ASC ASQ]
	 Description							  number   (hours)
# 1  Background short  Completed				   -	1573				 - [-   -	-]
# 2  Background short  Completed				   -	1572				 - [-   -	-]
# 3  Background short  Completed				   -	1571				 - [-   -	-]
# 4  Background short  Completed				   -	1570				 - [-   -	-]
# 5  Background short  Completed				   -	1569				 - [-   -	-]
# 6  Background short  Completed				   -	1568				 - [-   -	-]
# 7  Background short  Completed				   -	1567				 - [-   -	-]
# 8  Background short  Completed				   -	1566				 - [-   -	-]
# 9  Background short  Completed				   -	1565				 - [-   -	-]
#10  Background short  Completed				   -	1564				 - [-   -	-]
#11  Background short  Completed				   -	1563				 - [-   -	-]
#12  Background short  Completed				   -	1562				 - [-   -	-]
#13  Background short  Completed				   -	1561				 - [-   -	-]
#14  Background short  Completed				   -	1560				 - [-   -	-]
#15  Background short  Completed				   -	1559				 - [-   -	-]
#16  Background short  Completed				   -	1558				 - [-   -	-]
#17  Background short  Completed				   -	1557				 - [-   -	-]
#18  Background short  Completed				   -	1556				 - [-   -	-]
#19  Background short  Completed				   -	1555				 - [-   -	-]
#20  Background short  Completed				   -	1554				 - [-   -	-]

Long (extended) Self Test duration: 3670 seconds [61.2 minutes]

 

SweetAndLow

da0 looks like the correct drive. I just wanted to double-check, because the da numbers are handed out as the system detects the drives and have nothing to do with the physical drive or slot location.
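
A minimal sketch of the lookup done by hand above: turning saved `glabel status` output into a gptid-to-disk map. The data row is the one from this thread; the header line is the usual `glabel status` header, and the function name is hypothetical. Stripping a trailing `pN` to get from the partition to the disk is an assumption that fits GPT partition names like `da0p1`.

```python
import re

# Header plus the single row shown earlier in the thread.
GLABEL_OUTPUT = """\
                                      Name  Status  Components
gptid/a830a765-ae8d-11e6-b089-001b21a7c63c     N/A  da0p1
"""


def label_to_disk(glabel_text):
    """Map each label to the disk behind its component (da0p1 -> da0)."""
    mapping = {}
    for line in glabel_text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] == "Name":
            continue  # skip the header and anything malformed
        label, _status, component = parts
        # Assumption: GPT partitions are named <disk>p<number>.
        mapping[label] = re.sub(r"p\d+$", "", component)
    return mapping


disks = label_to_disk(GLABEL_OUTPUT)
print(disks["gptid/a830a765-ae8d-11e6-b089-001b21a7c63c"])
```

Because the da number can change between boots, the serial number that `smartctl -a` reports for that device is the safer thing to match against the label on the physical drive before pulling it.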

 

orddie

Ah, I see my mistake. Thanks!
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
danb35 said: "I doubt they'd cause any significant wear (there's going to be plenty of I/O going on anyway), but agree that hourly SMART tests are complete overkill"
It's not just the extra wear, but the fact that you fill up the logs and age out results that you might find useful.
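
To put rough numbers on that, here is a small sketch of how much history the self-test log keeps at different test intervals. It assumes the log holds 20 entries, which is what the smartctl outputs in this thread show; the function name is made up for illustration.

```python
# Entries visible in the smartctl self-test logs posted in this thread.
SELF_TEST_LOG_ENTRIES = 20


def hours_of_history(test_interval_hours, log_entries=SELF_TEST_LOG_ENTRIES):
    """Approximate power-on hours covered by a full log at a fixed test interval."""
    return test_interval_hours * log_entries


for label, interval in (("hourly", 1), ("daily", 24), ("weekly", 24 * 7)):
    hours = hours_of_history(interval)
    print(f"{label:>6}: about {hours} h of history ({hours / 24:.1f} days)")
```

With hourly short tests a full log covers less than a day, so an earlier long test result is pushed out of the log within about 20 hours.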
 