SOLVED Various SCSI sense errors during scrubbing

Status
Not open for further replies.

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Is the behavior the same on a different OS?

I have booted the machine off of a USB stick with the latest Ubuntu installer. I have installed the package zfsutils-linux, imported the zpool and started a scrub. I will report back.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I have booted the machine off of a USB stick with the latest Ubuntu installer. I have installed the package zfsutils-linux, imported the zpool and started a scrub. I will report back.

The scrub mysteriously finishes in 19 to 20 minutes (I scrubbed twice). I see no errors in dmesg or in the log at /var/log/syslog.

Code:
  pool: ultraman
 state: ONLINE
  scan: scrub repaired 0 in 0h19m with 0 errors on Mon Aug 28 11:51:56 2017
...
...


zpool history leaves me none the wiser (it reveals nothing wrong):
Code:
...
...
2017-08-28.10:44:49 zpool import -f ultraman
2017-08-28.10:45:20 zpool clear ultraman
2017-08-28.10:45:37 zpool scrub ultraman
2017-08-28.11:32:26 zpool scrub ultraman
2017-08-28.12:41:35 zpool export ultraman
...
...
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Perhaps you should tabulate which drives in which bays have which firmware?

I made a script listing the most interesting (in this regard) data from each drive. I think I have finally found something (I'm excited!).

The drive with the highest occurrence of errors (da3/WX11DC61YDPV) is the ONLY drive of device type WDC WD60EFRX-68L0BN1. But wait, there's more: the 3 drives I have previously pulled from the server because of a high frequency of errors are ALL of that same type (WDC WD60EFRX-68L0BN1).

I have offlined the disk I suspect (da3/WX11DC61YDPV) and started a scrub.

These are the data (from smartctl) for all the disks in the system:
Code:
/dev/da0
Model Family:	 Western Digital Gold
Device Model:	 WDC WD6002FRYZ-01WD5B0
Firmware Version: 01.01M02
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da1
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da2
Model Family:	 Western Digital Gold
Device Model:	 WDC WD6002FRYZ-01WD5B0
Firmware Version: 01.01M02
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da3
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68L0BN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da4
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da5
Device Model:	 WDC WD6002FFWX-68TZ4N0
Firmware Version: 83.H0A83
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da6
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da7
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da8
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da9
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da10
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da11
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da12
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da13
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da14
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da15
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da16
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da17
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da18
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da19
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da20
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da21
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da22
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68MYMN1
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
/dev/da23
Model Family:	 Western Digital Gold
Device Model:	 WDC WD6002FRYZ-01WD5B0
Firmware Version: 01.01M02
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
-----------
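
A script along these lines could produce a listing like the one above and make the odd model stand out. This is a minimal sketch, not the script actually used: the helper names `smart_summary` and `count_models` are made up for illustration, and it assumes the disks show up as /dev/da0 to /dev/da23 and that `smartctl -i` prints the four fields shown above.

```shell
#!/bin/sh
# Hypothetical helper: keep only the identity fields from `smartctl -i` output on stdin.
smart_summary() {
    grep -E '^(Model Family|Device Model|Firmware Version|User Capacity):'
}

# Hypothetical helper: count drives per "Device Model" from concatenated
# `smartctl -i` output on stdin, most common model first.
count_models() {
    sed -n 's/^Device Model:[[:space:]]*//p' | sort | uniq -c | sort -rn
}

# Usage on the server (assumed device names; the globs skip partitions such as da0p1):
# for d in /dev/da[0-9] /dev/da[0-9][0-9]; do
#     echo "$d"; smartctl -i "$d" | smart_summary; echo "-----------"
# done
#
# for d in /dev/da[0-9] /dev/da[0-9][0-9]; do smartctl -i "$d"; done | count_models
```

A model that appears only once in the `count_models` output, like WDC WD60EFRX-68L0BN1 here, is worth a closer look.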
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Armed with my new theory of disks of type (WD60EFRX-68L0BN1) being the culprit, I have found several other threads on the forum describing the same symptoms on the same type of disk. I have posted to the other threads hoping to bring people with the same issue together.

Maybe I should start a new thread focusing on the type of disk (WD60EFRX-68L0BN1)?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I think that is a good idea. So much hardware swapped, when it could've been the disks/firmware the whole time :(
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Maybe I should start a new thread focusing on the type of disk (WD60EFRX-68L0BN1)?
Feel free to compile whatever you find in a thread. I'd be interested in finding out what exactly is going on.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I have offlined the disk I suspect (da3/WX11DC61YDPV) and started a scrub.

The scrub finished with 0 errors and no entries for SCSI errors in dmesg. This happened only once before, when I ran on 12 degraded mirrors (12, not 24 disks):
Code:
scan: scrub repaired 0 in 14h52m with 0 errors on Tue Aug 29 03:47:00 2017


I'm now running on a degraded mirror. I will add a spare disk of type WD60EFRX-68L0BN1 (The type of disk which I suspect is causing the problems) to the pool. Having done that, I expect errors to occur again.

I will order a 6TB WD disk for replacement, making sure it is NOT of type WD60EFRX-68L0BN1. The same model of disk, in type WD60EFRX-68MYMN1, appears not to cause problems in my configuration.

Unsolved: Scrubbing a pool with a disk of type WD60EFRX-68L0BN1 in it will cause SCSI errors. BUT errors (few in comparison) will also appear on drives of other types (see the previously posted list of disks in my zpool). Having no disk of type WD60EFRX-68L0BN1 in the zpool results in error-free scrubs. That is, there appear to be side effects of the problems caused by disks of type WD60EFRX-68L0BN1.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I'm aware of three types of the WD60EFRX.

This is the status of each type in regards to this thread:
WD60EFRX-68L0BN1 (Bad in current system)
WD60EFRX-68MYMN1 (No problems in current system)
WD60EFRX-68TGBN1 (Have not tried in current system)

Does anyone know of other types of WD60EFRX (This should really be a new thread)?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I'm aware of three types of the WD60EFRX.

This is the status of each type in regards to this thread:
WD60EFRX-68L0BN1 (Bad in current system)
WD60EFRX-68MYMN1 (No problems in current system)
WD60EFRX-68TGBN1 (Have not tried in current system)

Does anyone know of other types of WD60EFRX (This should really be a new thread)?

68MYMN1 is the firmware after the update on this page: https://support.wdc.com/knowledgebase/answer.aspx?ID=11756
  • Verify that the CCC in the reported model has changed from "0" to "1".
  • WDC WD60EFRX-68MYMN0 changes to WDC WD60EFRX-68MYMN1

Perhaps try that firmware update procedure. 68L0BN1 could be an older firmware than 68MYMN1, since L comes before M ;)
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Perhaps try that firmware update procedure. 68L0BN1 could be an older firmware than 68MYMN1, since L comes before M ;)

It does not appear to be the case. I have updated my list of known drive types with rough dates of manufacture. The problematic drives are the newest.

The updated list of drive types:

I'm aware of three types of the WD60EFRX.

This is the status of each type in regards to this thread:
  • WD60EFRX-68L0BN1 (Bad in current system. Manufactured August/September 2016)
  • WD60EFRX-68MYMN1 (No problems in current system. Manufactured before June 2015)
  • WD60EFRX-68TGBN1 (Have not tried in current system. Manufactured before June 2015)
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Makes this a live issue. And would explain why you managed to find half a dozen people with possibly the same results. Good thing that most threads contain the SMART info, which contains the extended model number.

Might have to get WD to admit there is an issue.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Makes this a live issue. And would explain why you managed to find half a dozen people with possibly the same results.

I think I will create a new post regarding these drives once I can (hopefully) document the behaviour.

I'm planning to:
  1. Re-add a bad drive to my zpool (Expect errors to reappear, not only on the bad drive).
  2. Replace the bad drive with a new WD Red Pro (Expect errors on all drives to disappear).
  3. Create a zpool of only bad drives (Expect lots of errors. Only on the zpool of bad drives?).
  4. Move the bad zpool to another machine with the same backplanes (Expect errors to happen).
  5. Connect the zpool of bad drives directly to the motherboard, bypassing the backplanes (Expect errors to disappear, since I have another FreeNAS machine with a single bad disk that runs without errors off of an older backplane with no SAS expander).
Any suggestions of things to try or what to expect?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Single drive zpool?
 

Morpheus187

Explorer
Joined
Mar 11, 2016
Messages
61
I have the same disks in use as you

7x
Code:
Model Family:  Western Digital Red
Device Model:  WDC WD60EFRX-68L0BN1
LU WWN Device Id: 5 0014ee 0596e991f
Firmware Version: 82.00A82

and 1x
Code:
Model Family:  Western Digital Red
Device Model:  WDC WD60EFRX-68MYMN1
LU WWN Device Id: 5 0014ee 2b745e38b
Firmware Version: 82.00A82


Everything has been running fine for ~12,000 hours. I have had just one error so far, on da2 (which is a 68L0BN1 drive):

Code:
(da2:mpr0:0:2:0): READ(16). CDB: 88 00 00 00 00 01 37 a5 79 a0 00 00 00 38 00 00
(da2:mpr0:0:2:0): CAM status: SCSI Status Error
(da2:mpr0:0:2:0): SCSI status: Check Condition
(da2:mpr0:0:2:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da2:mpr0:0:2:0): Info: 0x137a579a0
(da2:mpr0:0:2:0): Error 5, Unretryable error


But I guess that's just a normal HDD hiccup. I will keep your findings in mind, though; maybe there is an issue with that disk. Maybe the issue only shows up with more than 8 drives?
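
For anyone reading error blocks like the one above: the CDB line can be decoded by hand. The sketch below assumes the standard READ(16) layout (byte 0 is the opcode 0x88, bytes 2-9 the big-endian LBA, bytes 10-13 the big-endian transfer length); `decode_read16` is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: decode a READ(16) CDB given as 16 space-separated hex bytes.
# Assumed layout: byte 0 = opcode 0x88, bytes 2-9 = LBA, bytes 10-13 = transfer length.
decode_read16() {
    set -- $1   # split the argument into one positional parameter per byte
    [ "$#" -eq 16 ] && [ "$1" = "88" ] || { echo "not a READ(16) CDB" >&2; return 1; }
    lba="$3$4$5$6$7$8$9${10}"
    len="${11}${12}${13}${14}"
    printf 'lba=0x%x blocks=%d\n' "$((0x$lba))" "$((0x$len))"
}

# The CDB from the dmesg excerpt above:
decode_read16 "88 00 00 00 00 01 37 a5 79 a0 00 00 00 38 00 00"
# prints: lba=0x137a579a0 blocks=56
```

The decoded LBA is the same value as the `Info: 0x137a579a0` line, so the error was reported at the first block of a 56-block read (28 KiB, assuming 512-byte logical sectors).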
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I have the same disks in use as you

...
But I guess that's just a normal HDD hiccup. I will keep your findings in mind, though; maybe there is an issue with that disk. Maybe the issue only shows up with more than 8 drives?

Hard drive hiccups are not normal.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
For testing purposes, I have installed a disk of type WD60EFRX-68L0BN1 ("bad") in the machine with the 24 drive zpool. The disk is not part of the zpool. It is connected through one of the backplanes. The disk is currently at /dev/da3.

I have started a run with badblocks on it to see what happens. If running badblocks on the drive results in no SCSI errors being thrown, would that not suggest that the error is ZFS related? Let me hear your thoughts.

Code:
[root@ultraman] ~# badblocks -p 4 -b 4096 -wsv /dev/da3

Checking for bad blocks in read-write mode
From block 0 to 1465130645
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
...
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
It could be related to how deep the queues get. dd will not do much to it.

You could also try solnet array tester
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I'm working on documenting the issue. I finished the 1st test. My expectations held true.
1. Re-add a bad drive to my zpool (Expect errors to reappear, not only on the bad drive).

I added a "bad" drive (da26) to the zpool and scrubbed. These were the SCSI errors reported in dmesg.
Code:
[root@ultraman] ~# dmesg | grep -i scsi\ status\ error | sort | uniq -c
   1 (da10:mps0:0:22:0): CAM status: SCSI Status Error
   1 (da11:mps0:0:23:0): CAM status: SCSI Status Error
   1 (da12:mps0:0:24:0): CAM status: SCSI Status Error
   1 (da16:mps0:0:28:0): CAM status: SCSI Status Error
   1 (da17:mps0:0:29:0): CAM status: SCSI Status Error
   1 (da19:mps0:0:31:0): CAM status: SCSI Status Error
   1 (da20:mps0:0:32:0): CAM status: SCSI Status Error
   1 (da22:mps0:0:34:0): CAM status: SCSI Status Error
   3 (da26:mps0:0:9:0): CAM status: SCSI Status Error
   1 (da7:mps0:0:19:0): CAM status: SCSI Status Error
   1 (da9:mps0:0:21:0): CAM status: SCSI Status Error


As I expected, most errors were reported on the "bad" drive (da26). But, errors were also reported on other drives.
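
To rank the devices by error count instead of by name, the same dmesg filter can be reduced to the device name before counting. A minimal sketch, assuming the log lines look exactly like the ones above; `count_scsi_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count "CAM status: SCSI Status Error" lines per device
# from dmesg-style text on stdin, most affected device first.
count_scsi_errors() {
    grep -i 'CAM status: SCSI Status Error' |
        sed -n 's/^(\(da[0-9]*\):.*/\1/p' |
        sort | uniq -c | sort -rn
}

# Usage on the server:
# dmesg | count_scsi_errors
```

Dropping the bus address from each line means a drive that moves to another bay between boots is still counted under one name, as long as its daN number stays the same.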

This is the state of the zpool after the scrub. The drive in mirror-5 with 3 in CKSUM (a2851364-184c-11e7-ae9d-0025901ef244) is drive da10 which had an error in dmesg. That drive is of the "good" type WD60EFRX-68MYMN1:
Code:
  pool: ultraman
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 1.04M in 15h11m with 0 errors on Wed Aug 30 23:55:26 2017
config:

	NAME                                            STATE     READ WRITE CKSUM
	ultraman                                        ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE       0     0     0
	    gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE       0     0     0
	  mirror-2                                      ONLINE       0     0     0
	    gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE       0     0     0
	    gptid/a20e4ec0-8c8d-11e7-ac97-0cc47a5312f0  ONLINE       0     0     0
	  mirror-3                                      ONLINE       0     0     0
	    gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE       0     0     0
	    gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	  mirror-4                                      ONLINE       0     0     0
	    gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	  mirror-5                                      ONLINE       0     0     0
	    gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE       0     0     3
	  mirror-6                                      ONLINE       0     0     0
	    gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	  mirror-7                                      ONLINE       0     0     0
	    gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	  mirror-8                                      ONLINE       0     0     0
	    gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	  mirror-9                                      ONLINE       0     0     0
	    gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	  mirror-10                                     ONLINE       0     0     0
	    gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	  mirror-11                                     ONLINE       0     0     0
	    gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE       0     0     0
	    gptid/d06f5f86-7b79-11e7-b91b-0cc47a5312f0  ONLINE       0     0     0

errors: No known data errors
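
Matching a gptid from `zpool status` to a daN device, as done for da10 above, can be scripted on FreeBSD/FreeNAS by filtering the output of `glabel status`. This is a sketch under the assumption that `glabel status` prints three columns (Name, Status, Components) with the partition, for example daNpM, in the third; `gptid_to_disk` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `glabel status` output on stdin and print the disk
# behind the given gptid (or gptid prefix), with the partition suffix removed.
# Assumes columns: Name, Status, Components.
gptid_to_disk() {
    awk -v id="$1" 'index($1, "gptid/" id) == 1 { sub(/p[0-9]+$/, "", $3); print $3 }'
}

# Usage on the server:
# glabel status | gptid_to_disk a2851364-184c-11e7-ae9d-0025901ef244
```

The resulting device name can then be compared against the per-device error counts from dmesg.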
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
For it to be a valid failure, it needs to be an 8 drive system... since WD only say that their drives are certified for 8 drive systems :(

They could wave it away if it were more than 8 drives, as "drive vibrations not being able to be handled" or something... and who knows... maybe this whole drive vibration thing and "8 drive limit" is not totally marketing bs.

https://www.wdc.com/products/internal-storage/wd-red.html

[Attachment: Screen Shot 2017-08-31 at 5.05.22 PM.png]
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
BTW, how are your drives mounted? In trays? Are you using all 4 screws?
 