HDD to SSD pool questions

Joined
Jul 15, 2017
Messages
55
Have been having issues with my RAIDZ2 pool of 8x WD Red 3TB since I originally put my NAS box together in 2016. In that time I've had 3 disks SMART fail and swapped them out under warranty. However, now I've had a 4th disk SMART fail (but this time out of warranty) and am getting a bit fed up.

The box sits in my office at home and has a very light workload as a media server and storage for family data. The disks are always spinning using power setting 128 and experience max temps of <40C in the few weeks of summer we get here in the UK.

I really don't like these failure stats. Most of the disks failed at around 32,000-35,000 power-on hours, but one of the disks failed at only 23,000 hours. We've got drives in machines at work that have been spinning for 4+ years under significant load during the working day and very few have failed. Something ain't right here. I've done as much diagnosis as I can, and am now considering the possibility of a vibration issue in my Fractal Design Node 804 case.

With that (and my high read/write ratio) in mind, I'm thinking of swapping out my HDDs for SSDs one by one as they continue to fail. My questions are:

1. Are there any known issues with having SSDs running alongside HDDs in the pool?
2. Is it in any way possible to shrink the size of a pool without a rebuild? I don't need the capacity offered by the 3TB HDDs and could easily get by with just 2TB SSDs.

TIA.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
We've got drives in machines at work that have been spinning for 4+ years under significant load during the working day and very few have failed. Something ain't right here.
Could it be that you buy consumer (perhaps "NAS designated") drives at home and Enterprise rated drives for the office? It's no accident that those enterprise rated drives fail less frequently if that's the case.

I think 4 years tends to be the MTBF for consumer drives, so anything over that is bonus time for having kept them in good condition (but can't be guaranteed... as evidenced by the warranty period expiring).

now considering the possibility of a vibration issue in my Fractal Design Node 804 case.
Not impossible... perhaps not the most likely explanation.

1. Are there any known issues with having SSDs running alongside HDDs in the pool?
Pools/VDEVs run at speeds determined by the slowest member disks, so until all the HDDs are out, you'll see no performance benefit.

Some SSDs have a different block size, so they would need a different ashift value from the HDDs (which I don't think you can do inside a VDEV... not sure if you can even mix ashift values in a pool).
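If you want to sanity-check that before buying, something like the sketch below would do it. The pool name and device node are just placeholders, and on TrueNAS zdb may need to be pointed at the system's zpool.cache with -U.

```python
# Rough sketch: compare the pool's vdev ashift with a candidate SSD's sector size.
# "tank" and "/dev/ada8" are example names only -- substitute your own.
import re
import subprocess

def physical_sector_size(dev: str) -> int:
    """Parse the 'Sector Sizes' line from `smartctl -i`."""
    out = subprocess.run(["smartctl", "-i", dev], capture_output=True, text=True).stdout
    m = re.search(r"(\d+) bytes physical", out)
    return int(m.group(1)) if m else 512  # 512n drives report a single 512-byte size

def pool_ashifts(pool: str) -> list[int]:
    """Pull the ashift value(s) zdb reports for the pool's vdevs."""
    out = subprocess.run(["zdb", "-C", pool], capture_output=True, text=True).stdout
    return [int(v) for v in re.findall(r"ashift:\s*(\d+)", out)]

sector = physical_sector_size("/dev/ada8")
for a in pool_ashifts("tank"):
    # ashift is log2 of the block size: 9 -> 512 B, 12 -> 4096 B
    print(f"vdev ashift={a} ({2 ** a} B blocks) vs candidate SSD {sector} B physical sectors")
```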

2. Is it in any way possible to shrink the size of a pool without a rebuild? I don't need the capacity offered by the 3TB HDDs and could easily get by with just 2TB SSDs.
Growing is easy, shrinking not so much. A rebuild may actually be the simplest option to do that.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I really don't like these failure stats. Most of the disks failed at around 32,000-35,000 power-on hours, but one of the disks failed at only 23,000 hours. We've got drives in machines at work that have been spinning for 4+ years under significant load during the working day and very few have failed. Something ain't right here. I've done as much diagnosis as I can, and am now considering the possibility of a vibration issue in my Fractal Design Node 804 case.

2.6 years seems a bit low, but not entirely unusual. I start not trusting drives above roughly 40k hours. You mentioned the thermal environment. But on the topic of vibration, how protected is the system? Are there any potential external sources? Rail lines, pets, upstairs neighbor's heavy metal band, etc... Also, are the drives rated for use in a single chassis with 8 devices? It could be activity coupling.
With that (and my high read/write ratio) in mind, I'm thinking of swapping out my HDDs for SSDs one by one as they continue to fail. My questions are:

1. Are there any known issues with having SSDs running alongside HDDs in the pool?
2. Is it in any way possible to shrink the size of a pool without a rebuild? I don't need the capacity offered by the 3TB HDDs and could easily get by with just 2TB SSDs.

TIA.

1. I can't give you a definitive answer here. There might be some unusual geometry problems, as SSDs basically fabricate this info. But I'm under the impression it works.

2. No. You can expand a pool, but there is no provision for shrinking one. Moving from 3TB HDDs to 2TB SSDs will likely require you to build a second pool and migrate. There are some fiendishly large SSDs in the works. I've personally had the opportunity to test a 15TB U.2 NVMe device; just don't ask what it costs. Since you've stated you don't need all the space, have you considered moving to a smaller configuration? 4 x 8TB devices in RAIDZ2 perhaps? Might cut your vibration problem and save some electricity as well.
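For what it's worth, if the second-pool-and-migrate route does happen, the usual pattern is a recursive snapshot replicated with zfs send/receive. This is only a sketch: "tank" and "flash" are placeholder pool names, the new pool is assumed to already exist and be empty, and you'd want shares and services quiesced first.

```python
# Sketch of migrating everything from an old pool to a newly created one
# via a recursive snapshot and zfs send/receive. Pool names are placeholders.
import subprocess

OLD_POOL = "tank"    # existing 8x 3TB RAIDZ2 pool (example name)
NEW_POOL = "flash"   # freshly created SSD pool (example name)
SNAP = f"{OLD_POOL}@migrate"

# 1. Take a recursive snapshot of every dataset in the old pool.
subprocess.run(["zfs", "snapshot", "-r", SNAP], check=True)

# 2. Replicate the whole snapshot tree into the new pool.
send = subprocess.Popen(["zfs", "send", "-R", SNAP], stdout=subprocess.PIPE)
recv = subprocess.run(["zfs", "receive", "-F", NEW_POOL], stdin=send.stdout)
send.stdout.close()
if send.wait() != 0 or recv.returncode != 0:
    raise SystemExit("replication failed -- the old pool is untouched, so investigate and retry")
```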
 
Joined
Jul 15, 2017
Messages
55
Could it be that you buy consumer (perhaps "NAS designated") drives at home and Enterprise rated drives for the office? It's no accident that those enterprise rated drives fail less frequently if that's the case.

I think 4 years tends to be the MTBF for consumer drives, so anything over that is bonus time for having kept them in good condition (but can't be guaranteed... as evidenced by the warranty period expiring).

Thanks for your reply.

Good point. I'm running standard WD Red at home which have a 3 year warranty. We're definitely running WD Reds as well at work but they may be the Pro version with the 5 year warranty. Shall check tomorrow.

Pools/VDEVs run at speeds determined by the slowest member disks, so until all the HDDs are out, you'll see no performance benefit.

Some SSDs have a different block size, so they would need a different ashift value from the HDDs (which I don't think you can do inside a VDEV... not sure if you can even mix ashift values in a pool).

Not fussed about improving performance. The box performs really well for our needs as it stands.

Regarding the ashift issue: are you referring to the 512 / 4096 sector sizes they discuss here?

 
Joined
Jul 15, 2017
Messages
55
2.6 years seems a bit low, but not entirely unusual. I start not trusting drives above roughly 40k hours. You mentioned the thermal environment. But on the topic of vibration, how protected is the system? Are there any potential external sources? Rail lines, pets, upstairs neighbor's heavy metal band, etc... Also, are the drives rated for use in a single chassis with 8 devices? It could be activity coupling.

Thanks for your reply.

The box is actually well protected from vibration. No significant external vibrations. The case itself seems well designed to dampen vibrations between the disks within each of the two 4-bay cages, and I can't hear any nasty resonances etc. However, I can't rule out that being the problem.

Since you've stated you don't need all the space, have you considered moving to a smaller configuration? 4 x 8TB devices in RAIDZ2 perhaps? Might cut your vibration problem and save some electricity as well.

Wouldn't that require me to swap out all the drives for 8TB models at the same time though?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Wouldn't that require me to swap out all the drives for 8TB models at the same time though?

No, that would require a new pool; you'd have to migrate the data to cut the device count. I have no idea what your electrical costs are, or how much drives cost in the UK. I just wanted to point it out as an option for cutting device count. Fewer devices, less vibration, fewer things to fail. It also frees up ports for an SSD performance pool if you're so inclined, or, if you're using an HBA and have enough MB ports, allows you to ditch the HBA, which might save a few more watts...
 
Joined
Jul 15, 2017
Messages
55
The Node 804 case uses rubber grommets like my Define 7. I have read that using rubber grommets is not good for your hard drives. It makes sense to me anyway: don't isolate a hard drive and let it resonate; instead, use the mass of the case to suck up the vibration. I got appropriately sized metal washers and used them in place of the rubber grommets to mount my drives. I snug them up tight.

That makes no sense to me. The aim is to minimise vibrational transmission between disks. What you don't want is a sympathetic resonance being induced by multiple drives performing the same operation at the same time.

Where did you read this?
 
Joined
Jul 15, 2017
Messages
55
In regards to power, is there anything else on that breaker with high starting/running watts, like perhaps a fridge, freezer, toaster, or space heater that is clicking on/off?

Nope.

Otherwise, your sample size is small so it could just be random chance, even though a 12.5% annualized failure rate is above the industry norm.

There are large data centers that report 1%, 2%, 3% annualized failure rates. And those places use the cheapest drives they can get their hands on and rely on redundancy.

Places like Backblaze are reporting 6.5% failure rates for drives 7 years old.

Exactly. Something iffy. But as you say, the sample set is so small I can't really derive any significance.

I think I'll actually just slowly swap out for Ironwolf Pro 4TB disks. That's the easiest solution. They are 120 quid a pop and come with a 5 year warranty. That's only £6 per TB per warranty year.
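For anyone following the arithmetic, both numbers fall out of a couple of lines. The four-year observation window is an assumption, chosen because it reproduces the 12.5% figure quoted above.

```python
# Back-of-envelope check of the figures quoted in this thread.

# Annualised failure rate: failures divided by drive-years of operation.
failures, drives, years = 4, 8, 4                   # 4-year window assumed (matches the 12.5% quoted)
print(f"AFR ~= {failures / (drives * years):.1%}")  # 12.5%, vs the 1-3% and 6.5% rates mentioned above

# Warranty value of the 4TB option: £120 / 4 TB / 5-year warranty.
price_gbp, capacity_tb, warranty_years = 120, 4, 5
print(f"£{price_gbp / (capacity_tb * warranty_years):.0f} per TB per warranty year")  # £6
```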
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I think I'll actually just slowly swap out for Ironwolf Pro 4TB disks. That's the easiest solution. They are 120 quid a pop and come with a 5 year warranty. That's only £6 per TB per warranty year.

That sounds like a good plan. Part of the reason I suggested 4 x 8TB was that those drives are only $130 on sale here, but it would force a pool rebuild and all drives bought up front.

Take advantage of sales, etc... Once they're all replaced, you get to expand your pool. In the meantime, you can buy a drive ahead and do proper burn-in. I might also suggest avoiding drive homogeneity: if Ironwolfs are not on sale and Red Pros or Toshiba N300s are... You may even find a sale where a 6TB drive is cheap; again, ZFS will just use what it can and waste the space on the larger device. As long as you avoid SMR, ZFS will for the most part not care. By avoiding a homogeneous pool you can't be bitten by a single bad production batch or firmware.
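To put the mixed-size point in numbers, a rough approximation (it ignores padding, metadata and the TB-vs-TiB gap, so treat it as illustrative only):

```python
# In a RAIDZ2 vdev, usable space is governed by the smallest member drive,
# so a cheap larger drive simply donates unused capacity.

def raidz2_usable_tb(drive_sizes_tb):
    """Approximate usable capacity: (n - 2 parity drives) * smallest member."""
    return (len(drive_sizes_tb) - 2) * min(drive_sizes_tb)

print(raidz2_usable_tb([4] * 8))        # eight 4TB drives -> ~24 TB usable
print(raidz2_usable_tb([4] * 7 + [6]))  # swap in one bargain 6TB -> still ~24 TB usable
```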
 
Joined
Jul 15, 2017
Messages
55
Part of the reason I suggested 4 x 8TB was that those drives are only $130 on sale here, but it would force a pool rebuild and all drives bought up front.

If only kit were that cheap here...

As long as you avoid SMR, ZFS will for the most part not care.

Luckily my disks hail from before WD introduced SMR, but I am loath to buy WD disks again after the way they handled that debacle. I'd rather give my money to a competitor like Seagate even if WD Red Pros happen to be on sale. I know Seagate use the technology in some of their consumer product lines too, but for WD to silently introduce it into their NAS product line is pretty unforgivable in my book.

By avoiding a homogeneous pool you can't be bitten by a single bad production batch or firmware.

Yeah, thought I'd managed to do that the first time by buying from different suppliers. Perhaps not though.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
Maybe I didn't get it, but what SMART errors exactly did the drive(s) show?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
That makes no sense to me. The aim is to minimise vibrational transmission between disks. What you don't want is a sympathetic resonance being induced by multiple drives performing the same operation at the same time.

Where did you read this?
That was a fad going on years ago. While I understand what was being attempted, I didn't agree with it either. The case does not have enough mass to force the vibrations to remain within the hard drive alone. I prefer to use rubber bushings as this will minimize noise transmission between drives and the case.

As for drive failures, you know that is a hit-and-miss type of thing. Sometimes you have a drive that fails prematurely, sometimes you have a drive that lasts an extra year or two. Additionally, things like power on/off cycles or dirty power will cause an early death. And lastly, TrueNAS/FreeNAS tests the drives routinely for drive failures/data integrity, so if there are any issues they get flagged. The systems at your work may not do that, or not as intensively. If those systems only look for the status word "PASSED" from the hard drive status message then it's a terrible check. Remember that the goal of that message is to tell the user that the hard drive will likely fail in less than 24 hours. With the routine SMART testing that FreeNAS/TrueNAS performs, we far exceed that level of testing and find the failure indicators much earlier. And I hope this makes sense, I was interrupted about 6 times while trying to write this.
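To make the difference concrete, it's roughly the gap between smartctl's one-line verdict and its attribute table. The device node below is just an example, and this only skims a few attributes rather than replacing scheduled SMART tests.

```python
# The weak check vs. the more useful check, per the post above. /dev/ada0 is an example.
import subprocess

DEV = "/dev/ada0"

# Weak check: the drive's own overall verdict, which stays "PASSED" until it's nearly dead.
health = subprocess.run(["smartctl", "-H", DEV], capture_output=True, text=True).stdout
print("overall verdict PASSED:", "PASSED" in health)

# Stronger check: the raw attribute table, where reallocated/pending/uncorrectable
# counts start moving long before the overall status flips.
attrs = subprocess.run(["smartctl", "-A", DEV], capture_output=True, text=True).stdout
watch = ("Reallocated_Sector", "Current_Pending", "Offline_Uncorrectable", "Multi_Zone_Error_Rate")
for line in attrs.splitlines():
    if any(name in line for name in watch):
        print(line)
```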
 
Joined
Jul 15, 2017
Messages
55
Maybe I didn't get it, but what SMART errors exactly did the drive(s) show?

One of the disks that got RMAed at only 23000hrs:

[attached screenshot: SMART attribute output]


Older disk that was completely done by 36000hrs:

[attached screenshot: SMART attribute output]
 
Last edited:
Joined
Jul 15, 2017
Messages
55
That was a fad going on years ago. While I understand what was being attempted, I didn't agree with it either. The case does not have enough mass to force the vibrations to remain within the hard drive alone. I prefer to use rubber bushings as this will minimize noise transmission between drives and the case.

As for drive failures, you know that is a hit-and-miss type of thing. Sometimes you have a drive that fails prematurely, sometimes you have a drive that lasts an extra year or two. Additionally, things like power on/off cycles or dirty power will cause an early death. And lastly, TrueNAS/FreeNAS tests the drives routinely for drive failures/data integrity, so if there are any issues they get flagged. The systems at your work may not do that, or not as intensively. If those systems only look for the status word "PASSED" from the hard drive status message then it's a terrible check. Remember that the goal of that message is to tell the user that the hard drive will likely fail in less than 24 hours. With the routine SMART testing that FreeNAS/TrueNAS performs, we far exceed that level of testing and find the failure indicators much earlier. And I hope this makes sense, I was interrupted about 6 times while trying to write this.

Perfect sense, thanks. The vast bulk of the weight in my box is the disks. No way the case is heavy enough to meaningfully dampen anything.

Found out that the few disks we still have on-prem at work are enterprise models. Probably explains the better MTBF.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
The 200 Multi_Zone_Error_Rate? I don't know if that's a permanent error. I'm not sure you could claim warranty on that with a single count. I'd run an extended SMART test and watch it for a count increase before getting too worried.

Joined
Jul 15, 2017
Messages
55

Multi_Zone_Error_Rate is a write error. It'll get cleaned up by ZFS, but any SMART errors are grounds for an RMA.

WD swapped it out without any fuss.

EDIT:

As was pointed out by allpurpbox in a post they have since deleted, a single error doesn't mean the disk is totally done. I should have clarified that this particular disk went on to experience further errors over the next few days. However, I only have this single screenshot of the SMART output for that disk.

That said though, any consistently increasing error counts means a drive is unreliable and done in my book. I don't wait for heads to crash into platters or bearings to completely go before I consider a disk as failed. If it consistently errors during read/write operations then it isn't doing the job it's supposed to and it needs to go.
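In practice, "watch it for a count increase" can be as simple as recording the raw value, kicking off an extended test, and comparing afterwards. A rough sketch; the device node is a placeholder and the fixed sleep stands in for properly polling the self-test status.

```python
# Note the raw Multi_Zone_Error_Rate value, run an extended self-test, re-check later.
# /dev/ada3 is an example device node.
import subprocess
import time

DEV = "/dev/ada3"
ATTR = "Multi_Zone_Error_Rate"

def raw_value(attr: str) -> int:
    out = subprocess.run(["smartctl", "-A", DEV], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if attr in line:
            return int(line.split()[-1])  # RAW_VALUE is the last column for this attribute
    return 0

baseline = raw_value(ATTR)
subprocess.run(["smartctl", "-t", "long", DEV], check=True)  # start an extended self-test
time.sleep(8 * 3600)  # crude wait; `smartctl -a` shows the real remaining test time
latest = raw_value(ATTR)
if latest > baseline:
    print(f"{ATTR} is still climbing ({baseline} -> {latest}); time to replace/RMA")
else:
    print("no increase since the baseline")
```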
 
Last edited:

Mark Holtz

Contributor
Joined
Feb 3, 2015
Messages
124
Have been having issues with my RAIDZ2 pool of 8x WD Red 3TB since I originally put my NAS box together in 2016. In that time I've had 3 disks SMART fail and swapped them out under warranty. However, now I've had a 4th disk SMART fail (but this time out of warranty) and am getting a bit fed up.

My TrueNAS (formerly FreeNAS) server has been running continuously since mid-2016 with some minor exceptions such as a cross-country move in 2019, a half-dozen power outages, and reboots. I do have a UPS attached to my TrueNAS. I did experience a drive failure in mid-September 2021 after five years of continuous use, and during the replacement/resilvering process two more drives failed. Thankfully, it was not at the same time, and I had RAIDZ2 set up as part of ZFS.

The problem is that, for raw storage space, HDDs are going to be cheaper than SSDs. Checking Amazon, a 2TB NAS HDD is $60-$80, while a 2TB SSD goes for $180-$230. A 4TB SSD starts at $380, which is slightly higher than a 10TB NAS HDD.

Can you check your power supply?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
That said though, any consistently increasing error counts means a drive is unreliable and done in my book. I don't wait for heads to crash into platters or bearings to completely go before I consider a disk as failed. If it consistently errors during read/write operations then it isn't doing the job it's supposed to and it needs to go.

Ahhh... Ok. Completely agree. If it was counting upwards, it's toast.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
If all drives were purchased at the same time, same batch, same firmware, etc., it's not unlikely that you'll see multiple failures.

It could simply be a batch issue rather than environmental.

If you are concerned about it you can switch to enterprise, WD DC series, Seagate EXOS series, etc.

OPINION: NAS lines seem to have gone the way of a lot of prosumer/gamer/etc. gear, with jacked-up prices and cut quality. More and more these days it's seemingly cheaper, in stress and hands-on hours, to go with enterprise gear.

Switching to SSD is an option; however, consumer SSDs would certainly not be my first choice for long-term critical storage.
 