HDD to SSD pool questions

Joined
Jul 15, 2017
Messages
55
Have been having issues with my RAIDZ2 pool of 8x WD Red 3TB since I originally put my NAS box together in 2016. In that time I've had 3 disks SMART fail and swapped them out under warranty. However, now I've had a 4th disk SMART fail (but this time out of warranty) and am getting a bit fed up.

The box sits in my office at home and has a very light workload as a media server and storage for family data. The disks are always spinning using power setting 128 and experience max temps of <40C in the few weeks of summer we get here in the UK.

I really don't like these failure stats. Most of the disks failed at around 32,000-35,000 power-on hours, but one of the disks failed at only 23,000 hours. We've got drives in machines at work that have been spinning for 4+ years under significant load during the working day and very few have failed. Something ain't right here. I've done as much diagnosis as I can, and am now considering the possibility of a vibration issue in my Fractal Design Node 804 case.

With that (and my high read/write ratio) in mind, I'm thinking of swapping out my HDDs for SSDs one by one as they continue to fail. My questions are:

1. Are there any known issues with having SSDs running alongside HDDs in the pool?
2. Is it in any way possible to shrink the size of a pool without a rebuild? I don't need the capacity offered by the 3TB HDDs and could easily get by with just 2TB SSDs.

TIA.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
We've got drives in machines at work that have been spinning for 4+ years under significant load during the working day and very few have failed. Something ain't right here.
Could it be that you buy consumer (perhaps "NAS designated") drives at home and Enterprise rated drives for the office? It's no accident that those enterprise rated drives fail less frequently if that's the case.

I think 4 years tends to be the MTBF for consumer drives, so anything over that is bonus time for having kept them in good condition (but can't be guaranteed... as evidenced by the warranty period expiring).

now considering the possibility of a vibration issue in my Fractal Design Node 804 case.
Not impossible... perhaps not the most likely explanation.

1. Are there any known issues with having SSDs running alongside HDDs in the pool?
Pools/VDEVs run at speeds determined by the slowest member disks, so until all the HDDs are out, you'll see no performance benefit.

Some SSDs have a different block size, so they would need a different ashift value from the HDDs (which I don't think you can do inside a VDEV... not sure if you can even mix ashift values in a pool).
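If you want to sanity-check that before buying, something like the sketch below would do it. The pool name and device node are just placeholders, and on TrueNAS zdb may need to be pointed at the system's zpool.cache with -U.

```python
# Rough sketch: compare the pool's vdev ashift with a candidate SSD's sector size.
# "tank" and "/dev/ada8" are example names only -- substitute your own.
import re
import subprocess

def physical_sector_size(dev: str) -> int:
    """Parse the 'Sector Sizes' line from `smartctl -i`."""
    out = subprocess.run(["smartctl", "-i", dev], capture_output=True, text=True).stdout
    m = re.search(r"(\d+) bytes physical", out)
    return int(m.group(1)) if m else 512  # 512n drives report a single 512-byte size

def pool_ashifts(pool: str) -> list[int]:
    """Pull the ashift value(s) zdb reports for the pool's vdevs."""
    out = subprocess.run(["zdb", "-C", pool], capture_output=True, text=True).stdout
    return [int(v) for v in re.findall(r"ashift:\s*(\d+)", out)]

sector = physical_sector_size("/dev/ada8")
for a in pool_ashifts("tank"):
    # ashift is log2 of the block size: 9 -> 512 B, 12 -> 4096 B
    print(f"vdev ashift={a} ({2 ** a} B blocks) vs candidate SSD {sector} B physical sectors")
```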

2. Is it in any way possible to shrink the size of a pool without a rebuild? I don't need the capacity offered by the 3TB HDDs and could easily get by with just 2TB SSDs.
Growing is easy, shrinking not so much. A rebuild may actually be the simplest option to do that.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I really don't like these failure stats. Most of the disks failed at around 32,000-35,000 power-on hours, but one of the disks failed at only 23,000 hours. We've got drives in machines at work that have been spinning for 4+ years under significant load during the working day and very few have failed. Something ain't right here. I've done as much diagnosis as I can, and am now considering the possibility of a vibration issue in my Fractal Design Node 804 case.

2.6 years seems a bit low, but not entirely unusual. I start not trusting drives above roughly 40k hours. You mentioned the thermal environment. But on the topic of vibration, how protected is the system? Are there any potential external sources? Rail lines, pets, upstairs neighbor's heavy metal band, etc... Also, are the drives rated for use in a single chassis with 8 devices? It could be activity coupling.
With that (and my high read/write ratio) in mind, I'm thinking of swapping out my HDDs for SSDs one by one as they continue to fail. My questions are:

1. Are there any known issues with having SSDs running alongside HDDs in the pool?
2. Is it in any way possible to shrink the size of a pool without a rebuild? I don't need the capacity offered by the 3TB HDDs and could easily get by with just 2TB SSDs.

TIA.

1. I can't give you a definitive answer here. There might be some unusual geometry problems, as SSDs basically fabricate this info. But I'm under the impression it works.

2. No. You can expand a pool, but there is no provision for shrinking one. Moving from 3TB HDDs to 2TB SSDs will likely require you to build a second pool and migrate. There are some fiendishly large SSDs in the works. I've personally had the opportunity to test a 15TB U.2 NVMe device; just don't ask what it costs. Since you've stated you don't need all the space, have you considered moving to a smaller configuration? 4 x 8TB devices in RAIDZ2 perhaps? Might cut your vibration problem and save some electricity as well.
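For what it's worth, if the second-pool-and-migrate route does happen, the usual pattern is a recursive snapshot replicated with zfs send/receive. This is only a sketch: "tank" and "flash" are placeholder pool names, the new pool is assumed to already exist and be empty, and you'd want shares and services quiesced first.

```python
# Sketch of migrating everything from an old pool to a newly created one
# via a recursive snapshot and zfs send/receive. Pool names are placeholders.
import subprocess

OLD_POOL = "tank"    # existing 8x 3TB RAIDZ2 pool (example name)
NEW_POOL = "flash"   # freshly created SSD pool (example name)
SNAP = f"{OLD_POOL}@migrate"

# 1. Take a recursive snapshot of every dataset in the old pool.
subprocess.run(["zfs", "snapshot", "-r", SNAP], check=True)

# 2. Replicate the whole snapshot tree into the new pool.
send = subprocess.Popen(["zfs", "send", "-R", SNAP], stdout=subprocess.PIPE)
recv = subprocess.run(["zfs", "receive", "-F", NEW_POOL], stdin=send.stdout)
send.stdout.close()
if send.wait() != 0 or recv.returncode != 0:
    raise SystemExit("replication failed -- the old pool is untouched, so investigate and retry")
```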
 
Joined
Jul 15, 2017
Messages
55
Could it be that you buy consumer (perhaps "NAS designated") drives at home and Enterprise rated drives for the office? It's no accident that those enterprise rated drives fail less frequently if that's the case.

I think 4 years tends to be the MTBF for consumer drives, so anything over that is bonus time for having kept them in good condition (but can't be guaranteed... as evidenced by the warranty period expiring).

Thanks for your reply.

Good point. I'm running standard WD Red at home which have a 3 year warranty. We're definitely running WD Reds as well at work but they may be the Pro version with the 5 year warranty. Shall check tomorrow.

Pools/VDEVs run at speeds determined by the slowest member disks, so until all the HDDs are out, you'll see no performance benefit.

Some SSDs have a different block size, so they would need a different ashift value from the HDDs (which I don't think you can do inside a VDEV... not sure if you can even mix ashift values in a pool).

Not fussed about improving performance. The box performs really well for our needs as it stands.

Regarding the ashift issue: are you referring to the 512 / 4096 sector sizes they discuss here?

 
Joined
Jul 15, 2017
Messages
55
2.6 years seems a bit low, but not entirely unusual. I start not trusting drives above roughly 40k hours. You mentioned the thermal environment. But on the topic of vibration, how protected is the system? Are there any potential external sources? Rail lines, pets, upstairs neighbor's heavy metal band, etc... Also, are the drives rated for use in a single chassis with 8 devices? It could be activity coupling.

Thanks for your reply.

The box is actually well protected from vibration. No significant external vibrations. The case itself seems well designed to dampen vibrations between the disks within each of the two 4-bay cages, and I can't hear any nasty resonances etc. However, I can't rule out that being the problem.

Since you've stated you don't need all the space, have you considered moving to a smaller configuration? 4 x 8TB devices in RAIDZ2 perhaps? Might cut your vibration problem and save some electricity as well.

Wouldn't that require me to swap out all the drives for 8TB models at the same time though?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Wouldn't that require me to swap out all the drives for 8TB models at the same time though?

No, that would require a new pool; you'd have to migrate the data to cut the device count. I have no idea what your electrical costs are, or how much drives cost in the UK. I just wanted to point it out as an option for cutting device count. Fewer devices, less vibration, fewer things to fail. It also frees up ports for an SSD performance pool if you're so inclined, or, if you're using an HBA and have enough MB ports, allows you to ditch the HBA, which might save a few more watts...
 
Joined
Jul 15, 2017
Messages
55
The Node 804 case uses rubber grommets like my Define 7. I have read that using rubber grommets is not good for your hard drives. It makes sense to me anyway: don't isolate a hard drive and let it resonate; instead, use the mass of the case to suck up the vibration. I got appropriately sized metal washers and used them in place of the rubber grommets to mount my drives. I snug them up tight.

That makes no sense to me. The aim is to minimise vibrational transmission between disks. What you don't want is a sympathetic resonance being induced by multiple drives performing the same operation at the same time.

Where did you read this?
 
Joined
Jul 15, 2017
Messages
55
In regards to power, is there anything else on that breaker with high starting/running watts, like perhaps a fridge, freezer, toaster, or space heater that is clicking on/off?

Nope.

Otherwise, your sample size is small so it could just be random chance, even though a 12.5% annualized failure rate is above the industry norm.

There are large data centers that report 1%, 2%, 3% annualized failure rates. And those places use the cheapest drives they can get their hands on and rely on redundancy.

Places like Backblaze are reporting 6.5% failure rates for drives 7 years old.

Exactly. Something iffy. But as you say, the sample set is so small I can't really derive any significance.

I think I'll actually just slowly swap out for Ironwolf Pro 4TB disks. That's the easiest solution. They are 120 quid a pop and come with a 5 year warranty. That's only £6 per TB per warranty year.
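For anyone following the arithmetic, both numbers fall out of a couple of lines. The four-year observation window is an assumption, chosen because it reproduces the 12.5% figure quoted above.

```python
# Back-of-envelope check of the figures quoted in this thread.

# Annualised failure rate: failures divided by drive-years of operation.
failures, drives, years = 4, 8, 4                   # 4-year window assumed (matches the 12.5% quoted)
print(f"AFR ~= {failures / (drives * years):.1%}")  # 12.5%, vs the 1-3% and 6.5% rates mentioned above

# Warranty value of the 4TB option: £120 / 4 TB / 5-year warranty.
price_gbp, capacity_tb, warranty_years = 120, 4, 5
print(f"£{price_gbp / (capacity_tb * warranty_years):.0f} per TB per warranty year")  # £6
```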
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I think I'll actually just slowly swap out for Ironwolf Pro 4TB disks. That's the easiest solution. They are 120 quid a pop and come with a 5 year warranty. That's only £6 per TB per warranty year.

That sounds like a good plan. Part of the reason I suggested 4 x 8TB was that those drives are only $130 on sale here, but it would force a pool rebuild and all drives bought up front.

Take advantage of sales, etc... Once they're all replaced, you get to expand your pool. In the meantime, you can buy a drive ahead and do proper burn-in. I might also suggest avoiding drive homogeneity: if Ironwolfs are not on sale and Red Pros or Toshiba N300s are... You may even find a sale where a 6TB drive is cheap; again, ZFS will just use what it can and waste the space on the larger device. As long as you avoid SMR, ZFS will for the most part not care. By avoiding a homogeneous pool you can't be bitten by a single bad production batch or firmware.
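To put the mixed-size point in numbers, a rough approximation (it ignores padding, metadata and the TB-vs-TiB gap, so treat it as illustrative only):

```python
# In a RAIDZ2 vdev, usable space is governed by the smallest member drive,
# so a cheap larger drive simply donates unused capacity.

def raidz2_usable_tb(drive_sizes_tb):
    """Approximate usable capacity: (n - 2 parity drives) * smallest member."""
    return (len(drive_sizes_tb) - 2) * min(drive_sizes_tb)

print(raidz2_usable_tb([4] * 8))        # eight 4TB drives -> ~24 TB usable
print(raidz2_usable_tb([4] * 7 + [6]))  # swap in one bargain 6TB -> still ~24 TB usable
```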
 
Joined
Jul 15, 2017
Messages
55
Part of the reason I suggested 4 x 8TB was that those drives are only $130 on sale here, but it would force a pool rebuild and all drives bought up front.

If only kit were that cheap here...

As long as you avoid SMR, ZFS will for the most part not care.

Luckily my disks hail from before WD introduced SMR, but I am loath to buy WD disks again after the way they handled that debacle. I'd rather give my money to a competitor like Seagate even if WD Red Pros happen to be on sale. I know Seagate use the technology in some of their consumer product lines too, but for WD to silently introduce it into their NAS product line is pretty unforgivable in my book.

By avoiding a homogeneous pool you can't be bitten by a single bad production batch or firmware.

Yeah, thought I'd managed to do that the first time by buying from different suppliers. Perhaps not though.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
Maybe I didn't get it, but what SMART errors exactly did the drive(s) show?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
That makes no sense to me. The aim is to minimise vibrational transmission between disks. What you don't want is a sympathetic resonance being induced by multiple drives performing the same operation at the same time.

Where did you read this?
That was a fad going on years ago. While I understand what was being attempted, I didn't agree with it either. The case does not have enough mass to force the vibrations to remain within the hard drive alone. I prefer to use rubber bushings as this will minimize noise transmission between drives and the case.

As for drive failures, you know that is a hit-and-miss type of thing. Sometimes you have a drive that fails prematurely, sometimes you have a drive that lasts an extra year or two. Additionally, things like power on/off cycles or dirty power will cause an early death. And lastly, TrueNAS/FreeNAS tests the drives routinely for drive failures/data integrity, so if there are any issues they get flagged. The systems at your work may not do that, or not as intensively. If those systems only look for the status word "PASSED" from the hard drive status message then it's a terrible check. Remember that the goal of that message is to tell the user that the hard drive will likely fail in less than 24 hours. With the routine SMART testing that FreeNAS/TrueNAS performs, we far exceed that level of testing and find the failure indicators much earlier. And I hope this makes sense, I was interrupted about 6 times while trying to write this.
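To make the difference concrete, it's roughly the gap between smartctl's one-line verdict and its attribute table. The device node below is just an example, and this only skims a few attributes rather than replacing scheduled SMART tests.

```python
# The weak check vs. the more useful check, per the post above. /dev/ada0 is an example.
import subprocess

DEV = "/dev/ada0"

# Weak check: the drive's own overall verdict, which stays "PASSED" until it's nearly dead.
health = subprocess.run(["smartctl", "-H", DEV], capture_output=True, text=True).stdout
print("overall verdict PASSED:", "PASSED" in health)

# Stronger check: the raw attribute table, where reallocated/pending/uncorrectable
# counts start moving long before the overall status flips.
attrs = subprocess.run(["smartctl", "-A", DEV], capture_output=True, text=True).stdout
watch = ("Reallocated_Sector", "Current_Pending", "Offline_Uncorrectable", "Multi_Zone_Error_Rate")
for line in attrs.splitlines():
    if any(name in line for name in watch):
        print(line)
```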
 
Joined
Jul 15, 2017
Messages
55
Maybe I didn't get it, but what SMART errors exactly did the drive(s) show?

One of the disks that got RMAed at only 23000hrs:

[attached screenshot: SMART attribute output]


Older disk that was completely done by 36000hrs:

[attached screenshot: SMART attribute output]
 
Last edited:
Joined
Jul 15, 2017
Messages
55
That was a fad going on years ago. While I understand what was being attempted, I didn't agree with it either. The case does not have enough mass to force the vibrations to remain within the hard drive alone. I prefer to use rubber bushings as this will minimize noise transmission between drives and the case.

As for drive failures, you know that is a hit-and-miss type of thing. Sometimes you have a drive that fails prematurely, sometimes you have a drive that lasts an extra year or two. Additionally, things like power on/off cycles or dirty power will cause an early death. And lastly, TrueNAS/FreeNAS tests the drives routinely for drive failures/data integrity, so if there are any issues they get flagged. The systems at your work may not do that, or not as intensively. If those systems only look for the status word "PASSED" from the hard drive status message then it's a terrible check. Remember that the goal of that message is to tell the user that the hard drive will likely fail in less than 24 hours. With the routine SMART testing that FreeNAS/TrueNAS performs, we far exceed that level of testing and find the failure indicators much earlier. And I hope this makes sense, I was interrupted about 6 times while trying to write this.

Perfect sense, thanks. The vast bulk of the weight in my box is the disks. No way the case is heavy enough to meaningfully dampen anything.

Found out that the few disks we still have on-prem at work are enterprise models. Probably explains the better MTBF.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
The 200 Multi_Zone_Error_Rate? I don't know if that's a permanent error. I'm not sure you could claim warranty on that with a single count. I'd run an extended SMART test and watch it for a count increase before getting too worried.

Joined
Jul 15, 2017
Messages
55

Multi_Zone_Error_Rate is a write error. It'll get cleaned up by ZFS, but any SMART errors are grounds for an RMA.

WD swapped it out without any fuss.

EDIT:

As was pointed out by allpurpbox in a post they have since deleted, a single error doesn't mean the disk is totally done. I should have clarified that this particular disk went on to experience further errors over the next few days. However, I only have this single screenshot of the SMART output for that disk.

That said though, any consistently increasing error counts means a drive is unreliable and done in my book. I don't wait for heads to crash into platters or bearings to completely go before I consider a disk as failed. If it consistently errors during read/write operations then it isn't doing the job it's supposed to and it needs to go.
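In practice, "watch it for a count increase" can be as simple as recording the raw value, kicking off an extended test, and comparing afterwards. A rough sketch; the device node is a placeholder and the fixed sleep stands in for properly polling the self-test status.

```python
# Note the raw Multi_Zone_Error_Rate value, run an extended self-test, re-check later.
# /dev/ada3 is an example device node.
import subprocess
import time

DEV = "/dev/ada3"
ATTR = "Multi_Zone_Error_Rate"

def raw_value(attr: str) -> int:
    out = subprocess.run(["smartctl", "-A", DEV], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if attr in line:
            return int(line.split()[-1])  # RAW_VALUE is the last column for this attribute
    return 0

baseline = raw_value(ATTR)
subprocess.run(["smartctl", "-t", "long", DEV], check=True)  # start an extended self-test
time.sleep(8 * 3600)  # crude wait; `smartctl -a` shows the real remaining test time
latest = raw_value(ATTR)
if latest > baseline:
    print(f"{ATTR} is still climbing ({baseline} -> {latest}); time to replace/RMA")
else:
    print("no increase since the baseline")
```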
 
Last edited:

Mark Holtz

Contributor
Joined
Feb 3, 2015
Messages
124
Have been having issues with my RAIDZ2 pool of 8x WD Red 3TB since I originally put my NAS box together in 2016. In that time I've had 3 disks SMART fail and swapped them out under warranty. However, now I've had a 4th disk SMART fail (but this time out of warranty) and am getting a bit fed up.

My TrueNAS (formerly FreeNAS) server has been running continuously since mid-2016 with some minor exceptions such as a cross-country move in 2019, a half-dozen power outages, and reboots. I do have a UPS attached to my TrueNAS. I did experience a drive failure in mid-September 2021 after five years of continuous use, and during the replacement/resilvering process two more drives failed. Thankfully, it was not at the same time, and I had RAIDZ2 set up as part of ZFS.

The problem is that, for raw storage space, HDDs are going to be cheaper than SSDs. Checking Amazon, a 2TB NAS HDD is $60-$80, while a 2TB SSD goes for $180-$230. A 4TB SSD starts at $380, which is slightly higher than a 10TB NAS HDD.

Can you check your power supply?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
That said though, any consistently increasing error counts means a drive is unreliable and done in my book. I don't wait for heads to crash into platters or bearings to completely go before I consider a disk as failed. If it consistently errors during read/write operations then it isn't doing the job it's supposed to and it needs to go.

Ahhh... Ok. Completely agree. If it was counting upwards, it's toast.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
If all drives were purchased at the same time, same batch, same firmware, etc., it's not unlikely that you'll see multiple failures.

It could simply be a batch issue rather than environmental.

If you are concerned about it you can switch to enterprise, WD DC series, Seagate EXOS series, etc.

OPINION: NAS lines seem to have gone the way of a lot of prosumer/gamer/etc. gear, with jacked-up prices and cut quality. More and more these days it's seemingly cheaper, in stress and hands-on hours, to go with enterprise gear.

Switching to SSD is an option; however, consumer SSDs would certainly not be my first choice for long-term critical storage.
 