SOLVED Various SCSI sense errors during scrubbing

Status
Not open for further replies.

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
For it to be a valid failure, it needs to be an 8 drive system... since WD only say that their drives are certified for 8 drive systems

Great point. However, the older version of the same drive (WD60EFRX-68MYMN1) does not appear to have this problem. The WD Red Pro (which I have ordered as a replacement) does not seem to have this limitation in its description.

In my list of things to try, there is an experiment with a zpool of all "bad" drives. We'll see what happens.

I could also try to connect the bad drive directly to the motherboard, and place it somewhere vibration proof? However, this would also change these factors: Backplane not used, HBA not used.

BTW, how are your drives mounted? In trays? Are you using all 4 screws?

All drives are mounted in trays like these. All are mounted with 4 screws to the trays. 6 screws can be used. Maybe I should use all 6 mounting holes.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
They could wave it away if it were more as "drive vibrations not being able to be handled" or something...

In my front backplane, I have 24 bays of which only 12 are in use. I could put the single "bad" drive on one side, and all the other drives on the other side. Would that not (at least intuitively) minimize vibrations on the drive?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I removed the single "bad" drive from the zpool and scrubbed. There was not a single SCSI error reported during that scrub. So, the issue is repeatable.

I then looked at the 11 drives with SCSI Errors from the scrub with the "bad" drive in the zpool.

Code:
   1 (da10:mps0:0:22:0): CAM status: SCSI Status Error
   1 (da11:mps0:0:23:0): CAM status: SCSI Status Error
   1 (da12:mps0:0:24:0): CAM status: SCSI Status Error
   1 (da16:mps0:0:28:0): CAM status: SCSI Status Error
   1 (da17:mps0:0:29:0): CAM status: SCSI Status Error
   1 (da19:mps0:0:31:0): CAM status: SCSI Status Error
   1 (da20:mps0:0:32:0): CAM status: SCSI Status Error
   1 (da22:mps0:0:34:0): CAM status: SCSI Status Error
   3 (da26:mps0:0:9:0): CAM status: SCSI Status Error
   1 (da7:mps0:0:19:0): CAM status: SCSI Status Error
   1 (da9:mps0:0:21:0): CAM status: SCSI Status Error


Drive da26 in the list (3 errors) is the "bad" drive of type WD60EFRX-68L0BN1. All the other 10 disks are of type WD60EFRX-68MYMN1 (Older type of the "bad" drive it seems). Each of these had a single error occur on them during the scrub.
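
For anyone who wants to reproduce this kind of per-device tally, here is a minimal Python sketch. It assumes the log lines look exactly like the ones in the listing above; the function name and the sample list are made up for illustration, and on a real system the lines would come from wherever the kernel messages are logged.

```python
import re
from collections import Counter

# Matches lines such as "(da26:mps0:0:9:0): CAM status: SCSI Status Error".
# The pattern is an assumption based on the listing in this post.
CAM_ERROR = re.compile(r"\((da\d+):[^)]*\): CAM status: SCSI Status Error")

def count_scsi_errors(lines):
    """Return a Counter mapping device name (e.g. 'da26') to error count."""
    counts = Counter()
    for line in lines:
        match = CAM_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Illustrative sample only; not the full log from the scrub.
sample = [
    "(da26:mps0:0:9:0): CAM status: SCSI Status Error",
    "(da7:mps0:0:19:0): CAM status: SCSI Status Error",
    "(da26:mps0:0:9:0): CAM status: SCSI Status Error",
    "(da26:mps0:0:9:0): CAM status: SCSI Status Error",
]

counts = count_scsi_errors(sample)
for device, n in counts.most_common():
    print(f"{n:4d} {device}")
```

Lines that do not match the pattern are skipped, so the whole log can be fed in unfiltered.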

Let's see a list of drives I have in the 24 disk zpool:
Code:
   1 Device Model:	 WDC WD6002FFWX-68TZ4N0
   3 Device Model:	 WDC WD6002FRYZ-01WD5B0
   1 Device Model:	 WDC WD60EFRX-68L0BN1
  19 Device Model:	 WDC WD60EFRX-68MYMN1


I'll ignore the "bad" disk, because errors always occur on that type of drive during a scrub (it's 100% likely). So now I have a zpool where 19 out of 23 disks are of type WD60EFRX-68MYMN1. That is, 19/23 = 83% of the drives are of the type where all errors occurred (ignoring the "bad" disk).

What is the likelihood of all 10 errors occurring only on drives of type WD60EFRX-68MYMN1? I think the calculation would be 0.83^10. The result is 15%. So what I'm seeing is relatively unlikely (15%), assuming all non-"bad" drives are equally likely to have SCSI errors happen on them during a scrub with a "bad" drive in the zpool.
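
The arithmetic can be checked with a short Python sketch. The first function is the calculation above: 10 independent errors, each landing on a random non-"bad" drive. The second is an alternative framing that is not in the original calculation: since each of the 10 affected drives had exactly one error, one could instead ask how likely it is that 10 distinct drives picked at random out of 23 are all among the 19 of type WD60EFRX-68MYMN1. Function names are made up for illustration.

```python
from math import comb

def p_independent(model_drives, total_drives, n_errors):
    """Probability that n_errors independent errors all land on the given model."""
    return (model_drives / total_drives) ** n_errors

def p_distinct_drives(model_drives, total_drives, n_drives_hit):
    """Probability that n_drives_hit distinct randomly chosen drives are all of the given model."""
    return comb(model_drives, n_drives_hit) / comb(total_drives, n_drives_hit)

p_indep = p_independent(19, 23, 10)
p_distinct = p_distinct_drives(19, 23, 10)

print(f"independent errors: {p_indep:.1%}")
print(f"distinct drives:    {p_distinct:.1%}")
```

Under the distinct-drives assumption the probability comes out lower than under the independent-errors assumption, so the observation would be even less likely to be chance. Both figures depend on the assumption that every non-"bad" drive is equally likely to be hit.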

I will re-add the "bad" drive to the zpool, run several scrubs, and see how the errors are distributed. My hypothesis is that only drives of model WD60EFRX (WD Red) will have errors, while WD6002FFWX (WD Red Pro) and WD6002FRYZ (WD Gold) will not be affected.

These are my expectations:
  • Many errors on the single "bad" drive of type WD60EFRX-68L0BN1 (WD Red).
  • Some errors on drives of type WD60EFRX-68MYMN1 (WD Red).
  • No errors on other types of drive (Red Pro & Gold).
Note to self: I should document which backplane each disk with errors is attached to. The issue could be mechanical (vibration). Maybe the "bad" drive only affects drives on the backplane it is itself attached to?
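
To check the expectations above after each scrub, the per-device counts can be grouped by drive model. This is a minimal sketch using the counts and models from the scrub described in this post; the helper name is made up. The same helper would work with a device-to-backplane mapping once that has been documented (that mapping is not known here, so it is not shown).

```python
from collections import Counter

# Error counts per device from the scrub listed earlier in this post.
errors_per_device = {
    "da7": 1, "da9": 1, "da10": 1, "da11": 1, "da12": 1, "da16": 1,
    "da17": 1, "da19": 1, "da20": 1, "da22": 1, "da26": 3,
}

# As described above: da26 is the "bad" drive, the other ten are the older revision.
model_of = {dev: "WD60EFRX-68MYMN1" for dev in errors_per_device}
model_of["da26"] = "WD60EFRX-68L0BN1"

def errors_by_group(errors, group_of):
    """Sum error counts per group (model, backplane, ...); unmapped devices go to 'unknown'."""
    totals = Counter()
    for dev, n in errors.items():
        totals[group_of.get(dev, "unknown")] += n
    return totals

by_model = errors_by_group(errors_per_device, model_of)
for model, n in by_model.most_common():
    print(f"{n:4d} {model}")
```

If the hypothesis holds, Red Pro and Gold models should never show up in this output across repeated scrubs.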
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
The problem is now solved. I get no SCSI errors when scrubbing, and have not for weeks.

I have removed all disks of type WD60EFRX-68L0BN1.

I have not seen any problems with type WD60EFRX-68MYMN1. As WD60EFRX-68L0BN1 seems to be the newer model, I suspect performance was sacrificed for lower manufacturing costs. Note that WD does not rate these drives for pools of the size I'm using. WD Red is for 1 to 8 bay RAIDs according to the datasheet.

From now on, I'll only use WD Gold disks in my zpools.
 

BlueMagician

Explorer
Joined
Apr 24, 2015
Messages
56
Not to crash/revive a thread unnecessarily, but I just wanted to add...

I have been haunted by a recurring CAM status / SCSI Status / SCSI sense MEDIUM ERROR (URE) on one drive, for a long time.

It always occurs during scrubs, same drive, same error, always resulting in a small fragment of data being (luckily) repaired by ZFS. I've replaced cables, and moved the drives to different ports - and the error always follows the drive.

The drive in question is (yes, you guessed it) the only 68L0BN1 in my pool.

I had kind of resigned myself to the fact that this error was just one of those things - but having found this thread, I think I'll be digging a little deeper and having a frank conversation with WD about an RMA...

Sorry you had to go through all your trauma, but thank you for the fantastic commentary and perseverance.

Simon.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
The drive in question is (yes, you guessed it) the only 68L0BN1 in my pool.

Interesting. It seems I'm not just imagining things.


I had kind of resigned myself to the fact that this error was just one of those things - but having found this thread, I think I'll be digging a little deeper and having a frank conversation with WD about an RMA...

How many drives do you have in your pool? WD says the drives are for 1-8 bay RAIDs.

Maybe we should write a post warning about WD60EFRX-68L0BN1 drives?
 

BlueMagician

Explorer
Joined
Apr 24, 2015
Messages
56
Interesting. It seems I'm not just imagining things.
The experiences of just two users are a little thin to jump to immediate conclusions, but I was very intrigued to read your findings - and it was enough to prompt me to log an advance RMA with WD, and specifically request a non-68L0BN1 as a replacement for my drive that's generating these MEDIUM ERRORS.

How many drives do you have in your pool? WD says the drives are for 1-8 bay RAIDs.
My full system spec is available in my signature as an expandable element when browsing the forum in Desktop mode.

But to answer your question, I am running just 6 x WD60EFRX's. So my usage couldn't be dismissed as 'out of support band'.

Apart from one 68MYMN1 that started failing SMART tests early in its life, showing physical sector issues (and was replaced with the 68L0BN1), none of the others have ever caused any problems.

So to reiterate, my system has three different revisions of WD60EFRX in it, all running the same firmware. And it's only the 68L0BN1 that's causing problems - and at almost every scrub cycle - regardless of what HBA port it's attached to.

Take from that what you will - but I'd be very interested to hear if anyone else has had a similar experience...
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
It’s more than a couple. I think Tobias found a few other threads where people had similar issues, and it turned out they were 68L0BN1 as well.

Beginning to look like an issue to me.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I am running just 6 x WD60EFRX's. So my usage couldn't be dismissed as 'out of support band'.

Important point. I have retired all my 68L0BN1s. Maybe I should set up some sort of experiment....

Note that in my setup, a single 68L0BN1 (the "bad" drive) also caused errors (fewer) to appear on drives of type 68MYMN1. I have seen no errors on disks of type 68MYMN1 since I removed all disks of type 68L0BN1.

Speculation: Maybe a wildly vibrating 68L0BN1 can throw off other drives in the same enclosure?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
For the record: I have not had a single error after having replaced all WD Red drives with WD Golds.
 

StarkJohan

Explorer
Joined
Mar 27, 2015
Messages
62
I'm looking for the firmware file to be able to upgrade my BPN-SAS2-826-EL1 from 55.07.23.00, as the FTP site seems to be gone or at least down. Can someone please provide the FW_55.14.18.0.zip, or newer if available? I'm having issues as my server has two of these backplanes with different firmware versions.
 

StarkJohan

Explorer
Joined
Mar 27, 2015
Messages
62
I'm answering this myself as someone else might run into the same issue.

I contacted Supermicro EU Support. They responded within a few days with links to the latest firmware "55.14.18.00" for my backplanes and a link to a Windows tool. As I prefer Linux for this kind of task, I managed to find the Linux version of the tool at the Supermicro "web ftp": https://www.supermicro.com/wftp/utility/ExpanderXtools_Lite/Linux/64bits/ It also has the Windows version. In addition, I received a Word document describing the process.

Note that v1.5 of the ExpanderXTools Lite (smc) with xflash is needed for the SAS2 backplanes. The newer g3xflash is not backwards compatible for flashing.

Also note that the BPN-SAS2-826-EL1 has two different revisions, 1.01 and 1.01A, which need different firmware when flashing.

The Linux tool has to be run as superuser. I also had to reboot between the fw flash and the mfg flash for some reason.
 

Attachments

  • SAS2 expander backplane firmware update.pdf
    1.7 MB · Views: 1,911