SOLVED Disk replacement questions.

Joined
Mar 9, 2022
Messages
6
Hi all,

I've got a weird situation and I can't work out what I'm doing wrong.

First some information:
I'm running TrueNAS Core 12.0-U7, installed as FreeNAS a couple of years ago and kept up to date.
I have one pool, called Storage. It's a RAIDZ2 of 8 drives, 2TB each, installed in a Supermicro case with 24 front bays.
All this is connected to 3 LSI cards in the PCI slots.
It's also connected to a UPS, and as far as I know there was no power failure.

A couple of weeks ago I started getting emails that one disk was getting more and more errors. Two weeks ago it died, and TrueNAS reported a degraded state for my pool.
This drive was labeled da7 in TrueNAS disk management, which reported it as Faulted.

Ordered a replacement drive, WD 2TB NAS drive, same as before.
Tried to offline the faulted disk for replacement, but it didn't go offline. A quick Google search showed that a faulted drive is already effectively offline, so I could just continue with the replacement steps. I took the old drive out and put the new drive in. Clicked Replace, and the new drive showed up. Ticked the Force option (so it's formatted first, as I understand it) and it started formatting and adding the drive to the pool. It also started resilvering, and all disks were online. Great!

No.

When the resilvering was at about 15%, I got an email that I had a faulted drive: the new drive that I had just installed. Weird.
I let the resilvering finish, thinking/hoping this was some sort of fluke because the drives are more stressed during this process.
But no, when the resilvering was finished the new disk was still faulted.
I started a scrub, but nothing had changed when that was done.
I tried
Code:
zpool clear Storage
which seemed to do the trick: all drives showed online and nothing was degraded. But it started resilvering again, and when it hit about 15% I got the same email. Storage pool is degraded with 1 drive faulted (still the same new drive).
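
Worth noting that `zpool clear` only resets the error counters and the fault state; it does not fix whatever caused the errors, which fits the pattern of the drive faulting again partway through the next resilver. Before clearing, it can help to see which devices are actually unhealthy and what their counters look like. A minimal sketch, assuming plain sh and awk; the function name `unhealthy_devices` is made up for illustration, and it only filters the text that `zpool status` prints:

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and prints
# every leaf device that is not healthy, with its error counters.
unhealthy_devices() {
    awk '
        # Device rows look like: NAME STATE READ WRITE CKSUM [message]
        # Requiring 5+ fields skips the " state: DEGRADED" header line.
        NF >= 5 && $2 ~ /^(FAULTED|UNAVAIL|REMOVED|OFFLINE)$/ {
            printf "%s %s read=%s write=%s cksum=%s\n", $1, $2, $3, $4, $5
        }
    '
}

# Usage (pool name taken from this thread):
#   zpool status Storage | unhealthy_devices
```

If the write or checksum counters climb again right after a clear, the cause is still there and clearing again will not help.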

OK, at this point I was thinking the new drive was DOA (dead on arrival). Sent it back to the store and got a replacement (different serial number, so brand new). Since they accepted the return, I figured I was right about the first drive being DOA.

Got the new drive, did the same steps as above but I got the same result. New drive again faulted.
I also tried a different slot in my case. Different backplane and different LSI card. No luck.

At this point I have no idea what to do. Am I doing anything wrong here? Is there a bug in the software? Hoping you can answer this.

Thanks in advance!
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Ticked the Force option (so it's formatted first, as I understand it)
You understand it incorrectly; the force option will add it to the pool even if it's already been partitioned/formatted, which the system otherwise won't do. The system always partitions disks that are added to the pool, and "formatting" is kind of a foreign concept to ZFS (but to the extent it's relevant, it does that too). That checkbox was unnecessary in this case, but not the cause of your problem.

at this point I was thinking the new drive is a DOA
If you'd burned in and tested the drive before putting it into your pool, you could rule out this possibility. Since you didn't say you've done that, I'd recommend doing it. Start with a long SMART test; if that completes without errors, run through badblocks.
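
A rough sketch of that burn-in sequence, for anyone following along. The `smartctl` and `badblocks` flags are the standard ones, but the device path `/dev/da7`, the 4096-byte block size, and the helper names are assumptions for illustration. It defaults to a dry run that only prints each command, because `badblocks -w` overwrites the entire disk and must never be pointed at a drive that is in a pool:

```shell
#!/bin/sh
# Dry run by default: prints commands instead of executing them.
# Set DRY_RUN=0 only once you are sure the device holds no data.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: run one burn-in step at a time, since the long
# SMART test runs inside the drive for hours and must finish before
# the next step is started.
burn_in_step() {
    step="$1"; dev="$2"
    case "$step" in
        smart-start)  run smartctl -t long "$dev" ;;       # start long self-test
        smart-result) run smartctl -l selftest "$dev" ;;   # read self-test log
        badblocks)    run badblocks -b 4096 -ws "$dev" ;;  # DESTRUCTIVE write test
        smart-final)  run smartctl -a "$dev" ;;            # full report afterwards
        *) echo "unknown step: $step" >&2; return 1 ;;
    esac
}

# Usage: burn_in_step smart-start /dev/da7
```

Run `smart-result` only after the time estimate printed by the long test has passed, and compare the final SMART report against the first one for new reallocated or pending sectors.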
 
Joined
Mar 9, 2022
Messages
6
I have actually done that with the 2nd replacement drive. The long SMART test ran with no errors. I don't have the results on hand right now, I'm at work.
How can I start a badblocks test?

On another note, I was just thinking: this could also be an SMR/CMR issue. All my previous drives are 3 or 4 years old, and the new ones could be SMR.
I'm gonna check that as well when I get home.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
are you positive you have genuine LSI cards and didn't end up with fakes? I had fakes when I first was getting into freenas, and they just one day started giving me errors all over the place and I could never figure out why. replaced them with a true blue LSI card and poof. no errors.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Is the new drive a CMR or SMR one? The latter does not work with ZFS.
 
Joined
Mar 9, 2022
Messages
6
are you positive you have genuine LSI cards and didn't end up with fakes? I had fakes when I first was getting into freenas, and they just one day started giving me errors all over the place and I could never figure out why. replaced them with a true blue LSI card and poof. no errors.
Quite sure they're genuine. It's been running for four years now, maybe 4.5 years. Can't really imagine that's the issue.

Is the new drive a CMR or SMR one? The latter does not work with ZFS.

As it turns out, it's an SMR drive.
When the first drive failed I just quickly ordered a replacement, thinking WD RED is a good choice. I've had them for years without real problems.

Yesterday evening I contacted the store with my issue, asking if I can return this one for a WD RED PRO, which uses CMR.

I'm guessing it's gonna arrive early next week, I'll keep you posted.

But just to clarify, other than ordering the wrong drive, I did do everything correctly, right? With the offline and replacing in TrueNAS.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
WD RED is a good choice
this used to be true, but WD decided to try and stealthily torpedo the line. there are whole articles on it.
only WD Red Plus and Pro are CMR now; plain WD Red is mixed (like, under 2TB is CMR and 2TB and over is SMR, or something). you have to check the model number: the suffix tells you whether it's CMR or SMR.
and since they tried to sneak in bad tech once, they can't be trusted to not do it again.
I won't buy WD drives anymore if I can at all avoid them. Seagate came right out and said "SMR is not appropriate for NAS drives"
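
For the model-number check mentioned above, something like this could be a starting point. Big caveat: the list of SMR models below is an assumption based on the 2 to 6 TB WD Red "EFAX" drives that WD publicly identified as SMR in 2020; it is not complete or authoritative, so anything it does not recognize comes back as "unknown" and should be looked up in the current datasheet. The function name `wd_red_recording` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: classify a drive model string.
# The SMR list is an ASSUMPTION (WD Red 2-6 TB EFAX models); verify
# against the manufacturer's datasheet before relying on it.
wd_red_recording() {
    case "$1" in
        *WD20EFAX*|*WD30EFAX*|*WD40EFAX*|*WD60EFAX*) echo "SMR" ;;
        *) echo "unknown" ;;
    esac
}

# Usage, assuming a SATA drive at the hypothetical path /dev/da7
# (smartctl -i prints a "Device Model:" line for ATA devices):
#   model="$(smartctl -i /dev/da7 | awk -F': *' '/^Device Model/ {print $2}')"
#   wd_red_recording "$model"
```

Returning "unknown" rather than "CMR" for unlisted models is deliberate: a missing entry is not proof that a drive is CMR.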
The latter does not work with ZFS
technically SMR does "work" with ZFS...just very badly. the OP is lucky they got alerts so fast, some people only find out when the pool fails to resilver and dies 2 weeks later...
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
But just to clarify, other than ordering the wrong drive, I did do everything correctly, right? With the offline and replacing in TrueNAS.
seemed like. drive dead. replace drive. resilver.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
and since they tried to sneak in bad tech once, they can't be trusted to not do it again.
WD can absolutely be trusted… to be untrustworthy.
Sneaking SMR in the NAS line, while knowing that it failed under ZFS load, was only the beginning.
Then, there was the marketing department labelling 7200 rpm drives as "5400-rpm class".
And now WD proudly releases new Red Pro drives whose official endurance rating is lower than that of some consumer SSDs and barely allows reading the whole drive once a month. (Yes: Endurance rating on READS!)
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
And now WD proudly releases new Red Pro drives whose official endurance rating is lower than that of some consumer SSDs and barely allows reading the whole drive once a month. (Yes: Endurance rating on READS!)

watching the video, it seems like you are conflating things together in a way that isn't accurate.
the workload rating in HDDs is not the same as the TBW in SSDs, because SSDs have virtually unlimited reads, but HDDs still have to spin the platters to read.
you cannot compare TBW endurance and workload endurance directly, because there are more factors.

what it looks like is happening is that the HDD workload rating is not able to increase at the same rate as capacity, which would be a limiter in the technology, not necessarily anything WD is doing.

right in the comments on that page:
Seagate’s IronWolf Pro series 12-20TB Helium CMR drives are all also rated at a 300 Workload Rate. See the current datasheet pdf. [1]

it also looks like 300TB workload rating IS an actual improvement, with other drives shown being 180TB or whatever.
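
To put the workload figures above in perspective, here is the back-of-the-envelope arithmetic as a tiny sketch. It assumes the rating is in TB transferred per year (reads and writes combined); the 20 TB capacity in the example is only an illustrative value, and the function name is made up:

```shell
#!/bin/sh
# Hypothetical helper: how many times per month the whole drive can be
# read (or written) before exceeding the yearly workload rating.
#   $1 = workload rating in TB/year
#   $2 = drive capacity in TB
full_reads_per_month() {
    awk -v w="$1" -v c="$2" 'BEGIN { printf "%.2f\n", w / 12 / c }'
}

# Example: a 300 TB/year rating on a 20 TB drive
#   full_reads_per_month 300 20    # prints 1.25
```

So the "once a month" figure comes mostly from capacity growing while the rating stays flat: the same 300 TB/year spread over a 12 TB drive works out to about two full passes per month.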

WD definitely does still seem to be working on making everything more confusing than could possibly be needed.
 
Joined
Mar 9, 2022
Messages
6
I'll give you guys an update on my issue. It's fixed now.

Ordered a WD RED PRO drive, with CMR. Replaced the drive in the OS, resilvered, sweated for 6 hours. And finally I see my pool online with no errors.

Thanks for all the advice.

tl;dr for people who have the same issue: check whether your drive is SMR or CMR.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
WD sneaking SMR into NAS drives created so much PITA for everyone for absolutely no benefit to them. so dumb.
 