Best practice

Status
Not open for further replies.

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
  1. It's distributed over several drives
  2. NAND flash reliability has proven itself to be better than estimated
  3. No modern controllers rely on minimizing writes with compression for performance and reliability
 

Joined
Oct 8, 2016
Messages
48
  1. It's distributed over several drives
  2. NAND flash reliability has proven itself to be better than estimated
  3. No modern controllers rely on minimizing writes with compression for performance and reliability

1: False. With mirrors, all disks get the same writes. With parity RAID, all disks get written for every stripe. So with mirrors or parity RAID you are writing the same amount of data, at the same time, in the same way, on each SSD. There is a HUGE probability that multiple failures happen at the same time (or within a very short window).
2: False. No vendor can guarantee the reliability of every SSD. Look at the Intel DC S3700 specs [1]. Intel says 10 drive writes per day, for 5 years. Do you think Intel ran a 5-year-long test writing 24/7 to an SSD? Absolutely not; by the time such a test finished, that SSD would already be obsolete.


[1] http://download.intel.com/newsroom/kits/ssd/pdfs/Intel_SSD_DC_S3700_Product_Specification.pdf
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
With parity RAID, all disks get the write, but only 1/(n-z) of the overall amount of data written. You'll observe that drives are rated based on data written... not the number of writes. So, if you're writing 6GB of data to a 6-drive RAIDZ2, each drive only sees 6/(6-2) = 1.5GB of data written.
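
A rough sketch of that arithmetic (Python, purely illustrative; the function name and parameters are just for this example):

Code:
def per_drive_write(user_gb, n_drives, parity, mirror=False):
    # Mirrors copy every write to every drive; RAIDZ splits each stripe across
    # (n - parity) data drives, and each parity drive sees a similar share.
    # Metadata, padding and small-block overhead are ignored here.
    if mirror:
        return user_gb                        # every drive gets the full write
    return user_gb / (n_drives - parity)      # e.g. 6 / (6 - 2) = 1.5

print(per_drive_write(6, 6, 2))               # 6-wide RAIDZ2: 1.5 GB per drive
print(per_drive_write(6, 2, 0, mirror=True))  # 2-way mirror: 6 GB per drive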

SSDs also typically fail in a predictable way, from a media wearout perspective. They exhibit a slowly increasing relocation count as the weakest cells wear out. The good controllers typically render the drive read-only when the wear reaches a certain level.

Let's assume a 6-drive RAIDZ2 array of 800GB S3700s. Each drive is rated to 10 DWPD, or 8TB/day. You would be able to completely fill the array 10 times in a day for 5 years without going outside the rated drive expectancy... and there's no guarantee that the drive will up and die as soon as one byte over that limit is written. In certain big data workloads, perhaps that might be a limiting factor... but it's unlikely that's your use case.
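
For the endurance budget of that hypothetical pool, the numbers work out like this (all figures taken from the example above):

Code:
drive_gb      = 800        # S3700 capacity
dwpd_rating   = 10         # Intel's rating
n, z          = 6, 2       # 6-wide RAIDZ2
fills_per_day = 10         # "fill the array 10 times a day"

usable_gb        = (n - z) * drive_gb                    # 3200 GB of user space
per_drive_gb_day = fills_per_day * usable_gb / (n - z)   # 8000 GB/day per drive
per_drive_dwpd   = per_drive_gb_day / drive_gb           # 10.0, right at the rating
print(usable_gb, per_drive_gb_day, per_drive_dwpd)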

Finally, RAID isn't backup. If you're trying to build massive redundancy in and don't have an offline, offsite backup, you're doing it wrong.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Intel says 10 drive writes per day, for 5 years. Do you think Intel ran a 5-year-long test writing 24/7 to an SSD?
Of course not. Testing would be done by writing at a much higher rate to a small sample of NAND, then doing arithmetic on the results. But I doubt the real tests are anywhere near as simplistic as that.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Even at an atrociously slow 400MB/s, a (large) 2TB drive would do upwards of 17 DWPD, so endurance testing is easy to accelerate. The limiting factor is going to be the data retention part of the test, since that one would require simulated accelerated aging.
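
The arithmetic, for what it's worth (same assumptions as above):

Code:
write_mb_s = 400     # deliberately slow sustained write speed
drive_tb   = 2       # a large drive

tb_per_day = write_mb_s * 86_400 / 1_000_000   # ~34.6 TB written per day
dwpd       = tb_per_day / drive_tb             # ~17.3 drive writes per day
print(round(tb_per_day, 1), round(dwpd, 1))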
 
Joined
Oct 8, 2016
Messages
48
With parity RAID, all disks get the write, but only 1/(n-z) of the overall amount of data written. You'll observe that drives are rated based on data written... not the number of writes. So, if you're writing 6GB of data to a 6-drive RAIDZ2, each drive only sees 6/(6-2) = 1.5GB of data written.

So, in other words, a RAIDZ (1 or 2) would improve SSD reliability because writes are distributed across multiple disks;
on the other hand, a simple mirror could be risky because all disks would get the same write pattern and thus could fail at the same time.

SSDs also typically fail in a predictable way, from a media wearout perspective. They exhibit a slowly increasing relocation count as the weakest cells wear out. The good controllers typically render the drive read-only when the wear reaches a certain level.

I've always heard the opposite: SSDs tend to fail in a catastrophic way, and most of the time without any "warning" signs.

Let's assume a 6-drive RAIDZ2 array of 800GB S3700s. Each drive is rated to 10 DWPD, or 8TB/day. You would be able to completely fill the array 10 times in a day for 5 years without going outside the rated drive expectancy... and there's no guarantee that the drive will up and die as soon as one byte over that limit is written. In certain big data workloads, perhaps that might be a limiting factor... but it's unlikely that's your use case.

But there is also no guarantee that the drive would be able to reach that value of DWPD.

Finally, RAID isn't backup. If you're trying to build massive redundancy in and don't have an offline, offsite backup, you're doing it wrong.

I know, but I have to host VM images. As you can imagine, telling customers that they have lost a full day of data due to a wrong RAID configuration is not good for any business.

I'm trying to get the most reliable configuration possible, and after that I'll also have backups.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
So, in other words, a RAIDZ (1 or 2) would improve SSD reliability because writes are distributed across multiple disks;
on the other hand, a simple mirror could be risky because all disks would get the same write pattern and thus could fail at the same time.
No, because when you write a specific sector, you aren't always writing the same cell. The SSD's firmware has quite a bit of magic in it to do wear leveling in real time. Thus, it really doesn't matter what the write pattern is... it's getting evenly distributed across the memory cells.
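
A toy illustration of the idea (nothing like real FTL firmware, just to show why hammering one logical sector doesn't hammer one physical cell):

Code:
# Toy wear leveling: every rewrite of a logical sector is redirected to the
# least-worn erase block, so repeated writes to one LBA spread across the flash.
erase_counts = [0] * 8    # pretend flash with 8 erase blocks
mapping = {}              # logical sector -> physical block

def write(lba):
    target = min(range(len(erase_counts)), key=lambda b: erase_counts[b])
    erase_counts[target] += 1
    mapping[lba] = target

for _ in range(80):       # rewrite the same logical sector 80 times
    write(0)

print(erase_counts)       # wear ends up even: [10, 10, 10, 10, 10, 10, 10, 10]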

I've always heard the opposite: SSDs tend to fail in a catastrophic way, and most of the time without any "warning" signs.
I'm referring specifically to media wear-out. If the controller takes a dump, you're hosed... whether SSD or spinning rust.



But there is also no guarantee that the drive would be able to reach that value of DWPD.
I doubt Intel just pulls the DWPD number out of their collective arse. There's a lot of science and testing that goes into that number... and the reality is, the majority of the drives will probably last FAR longer than the minimum rating. Keep in mind the Tom's Hardware SSD endurance testing... 2.1PB before the drive failed. That's about 4.7 DWPD... for a consumer drive that's not even officially rated.
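
Back-of-the-envelope on that figure (the ~250GB capacity and the 5-year window are my assumptions, not from the article):

Code:
total_written_gb = 2_100_000   # ~2.1 PB before the drive failed
capacity_gb      = 250         # assumed consumer drive size
years            = 5           # assumed rating window

dwpd = total_written_gb / (capacity_gb * years * 365)
print(round(dwpd, 1))          # ~4.6, in the same ballpark as the 4.7 quoted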



I know, but I have to host VM images. As you can imagine, telling customers that they have lost a full day of data due to a wrong RAID configuration is not good for any business.

I'm trying to get the most reliable configuration possible, and after that I'll also have backups.
You have to decide what "good enough" is. As we all know, each "9" is a huge increase in cost. If you're going for 4, 5, 6 nines uptime... you'd better have fully redundant chassis, redundant datacenters, etc.
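
To put rough numbers on the nines (simple availability arithmetic):

Code:
minutes_per_year = 365 * 24 * 60

for nines in (3, 4, 5, 6):
    availability = 1 - 10 ** -nines            # e.g. 5 nines = 99.999%
    downtime_min = minutes_per_year * (1 - availability)
    print(f"{nines} nines: ~{downtime_min:.1f} minutes of downtime per year")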
 
Joined
Oct 8, 2016
Messages
48
OK, so since I have a ten-slot chassis, if I go with a 5-disk RAIDZ2 and, when needed, add another RAIDZ2, that should be OK and at the same time guarantee faster resilvering.

With ZFS it is not possible to grow a RAIDZ2 by adding single disks like with mdadm or a hardware controller, so I have to use multiple RAIDZ2 vdevs to keep costs low (starting immediately with an 8- or 10-SSD RAIDZ2 would be too expensive).
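
Roughly what that staged approach gives in usable space (assuming 800GB SSDs as in the S3700 example; swap in your own drive size):

Code:
ssd_gb = 800                     # assumed drive size

def raidz2_usable(drives, size):
    return (drives - 2) * size   # two drives' worth of parity per RAIDZ2 vdev

stage1 = raidz2_usable(5, ssd_gb)             # one 5-disk RAIDZ2 vdev
stage2 = stage1 + raidz2_usable(5, ssd_gb)    # add a second 5-disk vdev later
print(stage1, stage2)                         # 2400 GB now, 4800 GB after expansion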
 
Joined
Feb 2, 2016
Messages
574
Any suggestion to achieve max reliability?

Let's not even consider 'max reliability'. That's a fool's errand.

1. How much disk space is required?
2. What level of performance is required?
3. What are you promising your clients in terms of reliability?

We went from 10K SAS drives to cheap, consumer-grade SSDs for our XenServer VM storage. Four SSDs, mirrored stripe. Huge performance increase. FreeNAS snapshots and replicates the VMs throughout the day. If we lost all four primary drives, we could be back up and running on the replicated copies in under an hour. Probably a lot less but an hour is what we promise. We have a cold spare so a single drive failure is a non-event. We will replace the drives long before they wear out.

I'm not afraid of mirrors in a well-monitored environment with reliable backups. You could go triple but I wouldn't.

Cheers,
Matt
 
Joined
Oct 8, 2016
Messages
48
I really hate 2-way mirrors.
If you have to replace a drive, another drive failure will bring you down.

Performance-wise, an SSD RAID6 should not be too bad, as I'm moving from a SAS RAID6. Current performance (6x SAS RAID6) is good, so anything better than this would be OK, and even an SSD RAID7 would be way faster than this.

So performance is not an issue, but reliability is, and a RAID6 is much more reliable than a mirror.

Probably a 5-disk RAID6 plus hot spare is the way to go.
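
As a very rough way to compare the two layouts during a rebuild (purely illustrative; p is an assumed per-drive failure probability over the resilver window, and real failures are not independent):

Code:
from math import comb

p = 0.01   # assumed chance that any one drive dies while the pool is degraded

# 2-way mirror: after one drive fails, the pool is lost if its partner also fails.
mirror_loss = p

# 5-wide RAIDZ2: after one drive fails, the pool survives one more failure,
# so data is lost only if 2 or more of the remaining 4 drives also fail.
raidz2_loss = sum(comb(4, k) * p**k * (1 - p)**(4 - k) for k in range(2, 5))

print(mirror_loss, raidz2_loss)   # ~1e-2 vs ~6e-4 with these assumptions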
 