Resilvering notifications for no reason. WD60EFAX to blame?

metalliqaz

Dabbler
Joined
Aug 25, 2016
Messages
13
So, overnight I received a notification that the system was resilvering ("Pool SATA_ARRAY state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state."). Only a few minutes later the notification went away. I obviously did not replace any drives.

This system has a WD60EFAX drive in it. (I bought it before it was generally known to be DM-SMR) Could that have something to do with this behavior?
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Yup. There's a reason there's so much stink about WD sneaking them in. STH measured resilver times of around 9 days on SMR drives vs. around 12 hours on CMR. Any time a drive responds too slowly for normal pool performance and gets out of sync, ZFS will resilver it automatically by design. You probably want to seriously consider replacing it.
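If you want to catch it in the act next time, checking the pool directly will show it (using the pool name from your alert):

zpool status SATA_ARRAY

While one of these events is running you'd see something like "scan: resilver in progress" with the laggy disk knocked out of ONLINE; once it catches up, the scan line changes to "resilvered ... with 0 errors" and the alert clears.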

Also, your sig seems to indicate you have a stripe pool, and thus no redundancy anyway...?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Also, your sig seems to indicate you have a stripe pool, and thus no redundancy anyway...?

For those on mobile, here's the line:

1x WD Red 6TB + 3x WD Red 3TB RAIDZ (12TB)

Assuming you mean RAIDZ1, the only way you're getting "12TB usable" from this collection of drives is if you cut the 6TB into two 3TB partitions, and then built a RAIDZ1 out of "all five" of the 3TB chunks. Please tell me I'm misunderstanding the situation here.
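For the math behind that guess: RAIDZ1 spends one member's worth of space on parity, so a 5-wide RAIDZ1 of 3TB chunks would give (5 − 1) × 3TB = 12TB usable, which is the only way the sig's numbers work out.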
 

metalliqaz

Dabbler
Joined
Aug 25, 2016
Messages
13
Yup. There's a reason there's so much stink about WD sneaking them in. STH measured resilver times of around 9 days on SMR drives vs. around 12 hours on CMR. Any time a drive responds too slowly for normal pool performance and gets out of sync, ZFS will resilver it automatically by design. You probably want to seriously consider replacing it.

Also, your sig seems to indicate you have a stripe pool, and thus no redundancy anyway...?

It is a normal RAIDZ1 pool, initially four 3TB drives, but one failed. I replaced it with the WD60EFAX, which resilvered fine. So in reality, I'm only using half the drive for now. I originally intended to eventually replace all the drives, thus doubling the final capacity. Obviously I'm not going to buy any more WD60EFAX crap.
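To put numbers on it (my arithmetic, same RAIDZ1 rule as above): right now it's (4 − 1) × 3TB = 9TB usable, with the 6TB drive only contributing 3TB, and once all four members are 6TB drives the vdev can expand (with autoexpand enabled) to (4 − 1) × 6TB = 18TB.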
 

metalliqaz

Dabbler
Joined
Aug 25, 2016
Messages
13
Assuming you mean RAIDZ1, the only way you're getting "12TB usable" from this collection of drives is if you cut the 6TB into two 3TB partitions, and then built a RAIDZ1 out of "all five" of the 3TB chunks. Please tell me I'm misunderstanding the situation here.

I updated the signature. I explained it above. It's only 9TB usable.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I updated the signature. I explained it above. It's only 9TB usable.

Okay, that makes more sense. I figured either I was reading something wrong or you'd accidentally written the raw space before parity.

And yes, you definitely want to avoid SMR in the future. There's a (partial) list and a statement from iXsystems here:


A potentially relevant part is bullet point 4:

At least one of the WD Red DM-SMR models (the 4TB WD40EFAX with firmware rev 82.00A82) does have a ZFS compatibility issue which can cause it to enter a faulty state under heavy write loads, including resilvering. This was confirmed in our labs this week during testing, causing this drive model to be disqualified from our products. We expect that the other WD Red DM-SMR drives with the same firmware will have the same issue, but testing is still ongoing to validate that assumption.

If your SMR drive had a hiccup and went into a "took too long to respond" sort of state, that could have caused ZFS to kick it out momentarily; when it reappeared, ZFS resilvered the data and everything was good again. But in your case I'd pull a smartctl -a of all drives and take a look for any signs of early failure (e.g. non-zero values in Reallocated_Sector_Ct, Current_Pending_Sector, or Offline_Uncorrectable). If it was one of your other drives that choked up, you might need to buy two new (non-SMR) drives to get yourself back to full health and full speed.
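If you want to sweep all the disks in one go, something like this does it (a sketch; da0 through da3 are placeholders for your actual device names):

for disk in da0 da1 da2 da3; do
  echo "=== /dev/${disk} ==="
  # full SMART dump, filtered down to the three early-failure attributes above
  smartctl -a /dev/${disk} | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done

Anything non-zero in the RAW_VALUE column for those attributes is worth a closer look.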
 