Raid Z1 - 4 drives = bad?

Trianian

Explorer
Joined
Feb 10, 2012
Messages
60
I've been buying up the components for my first RAID-Z1 home server and just got a good deal on four 2TB drives (Seagate ST2000DL003, with 4K sectors).

Now I'm searching the forums and it seems there are strong advisories against using four 4K drives in a ZFS RAID-Z1 build, but I haven't found a clear explanation as to why this is bad.

I guess I could go Z2, but I'd really like to use the space for storage. Is it really that bad to use a 4-drive system? Would the best option be to purchase a fifth drive? Why?

Thanks
 

Kimba

Dabbler
Joined
Feb 3, 2012
Messages
36
It has to do with your comfort level and the space you need.
A Z1 gives you the ability to survive only one drive failure. It is better than not having any RAID at all. Mathematically, you are not likely to have two drives fail at once unless you bought the drives together and they come from a bad batch.

As with anything, it is comfort vs. space. In your configuration I would have a good level of comfort, though I would still back up the more critical items.

Personally, I have 3 drives in my Z1, which makes the odds slightly more in my favor, but I lose a drive of storage.
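
To put a rough number on "not likely", here is a back-of-the-envelope sketch in Python; the 5% annual failure rate and 24-hour resilver window are assumed values for illustration, not measurements from this thread:

```python
# Rough estimate of a second drive failing while a RAID-Z1 resilvers.
# The failure rate and resilver time below are assumptions -- adjust to taste.

def second_failure_prob(surviving_drives, annual_failure_rate=0.05,
                        resilver_hours=24):
    """Chance that at least one surviving drive also dies during the resilver."""
    per_drive = annual_failure_rate * resilver_hours / (365 * 24)
    return 1 - (1 - per_drive) ** surviving_drives

# 4-drive RAID-Z1 after one failure: 3 drives left with no redundancy.
print(f"{second_failure_prob(3):.4%}")   # about 0.04% under these assumptions
```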
 

b1ghen

Contributor
Joined
Oct 19, 2011
Messages
113
My recommendation is to try it first if you already have the hardware, and compare RAIDZ1 vs. RAIDZ2 results with 4 drives. People have different results, and what is fast enough for someone might be too slow for someone else. I did some tinkering with my drives (I have 6 of the same drives you have) and was happy with the results either way I tried them, but I went for the "recommended" 6-drive RAIDZ2 setup because I like to keep my data safe, not because of the performance.
 

Gnome

Explorer
Joined
Aug 18, 2011
Messages
87
The problem isn't the number of disks per se; it has to do with how the data is written to the disks. Without going into a long discussion about why (it gets very technical), let me just give you the optimal disk-count formula:

RAID-Z1 = 2^n + 1 disks, e.g. 3, 5, 9
RAID-Z2 = 2^n + 2 disks, e.g. 4, 6, 10
RAID-Z3 = 2^n + 3 disks, e.g. 5, 7, 11

Going beyond 9, 10, or 11 disks (for the respective RAID-Zx) is, IIRC, not recommended, considering the increased possibility of a multi-disk failure, which would result in an unrecoverable volume.

You can use other disk counts, but those are the optimal ones if you want the best possible performance (speed-wise).
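
A minimal sketch of that rule in Python (the function name is mine, purely for illustration): a count is "optimal" when the number of data disks, total minus parity, is a power of two.

```python
# Sketch of the 2^n + parity rule quoted above: a RAID-Zp vdev is "optimal"
# when the data disks (total minus parity) form a power of two.

def is_optimal(total_disks, parity):
    data_disks = total_disks - parity
    # power of two with at least 2 data disks (n >= 1 in 2^n + parity)
    return data_disks >= 2 and (data_disks & (data_disks - 1)) == 0

for parity in (1, 2, 3):
    counts = [n for n in range(3, 12) if is_optimal(n, parity)]
    print(f"RAID-Z{parity}: {counts}")
# RAID-Z1: [3, 5, 9]
# RAID-Z2: [4, 6, 10]
# RAID-Z3: [5, 7, 11]
```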
 

Trianian

Explorer
Joined
Feb 10, 2012
Messages
60
Thanks for all the replies.

So I suppose I should either buy a 5th drive, or just use 3 of the drives and keep the 4th as a cold spare. From what I've read, there is no way to expand a RAID-Z vdev once it's created.
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
Why are 4 or 6 disks not optimal? Could you explain, please?

My understanding is that you want the data drives in a vdev to be a power of 2. For example, with RAIDZ, you want either 4 or 8 data drives, and 1 parity drive, giving you totals of either 5 or 9 drives in a vdev. If you want to use RAIDZ2, your totals would become 6 or 10.

If you want more details, I would suggest you start with the ZFS admin guide: http://docs.oracle.com/cd/E19253-01/819-5461/index.html
 

HAL 9000

Dabbler
Joined
Jan 27, 2012
Messages
42
My understanding is that you want the data drives in a vdev to be a power of 2. For example, with RAIDZ, you want either 4 or 8 data drives, and 1 parity drive, giving you totals of either 5 or 9 drives in a vdev. If you want to use RAIDZ2, your totals would become 6 or 10.

OK, but why exactly a power of 2?
And there is no such thing as a 'parity drive' in ZFS - parity data can be written to any disk.
 

louisk

Patron
Joined
Aug 10, 2011
Messages
441
Because that's how ZFS was designed.

You are correct, there is no actual parity drive; there is one or two drives' worth of parity spread across the vdev. It's just easier to say 'the parity disk' when talking about it.
 

Gnome

Explorer
Joined
Aug 18, 2011
Messages
87
OK, but why exactly a power of 2?
And there is no such thing as a 'parity drive' in ZFS - parity data can be written to any disk.

It comes back to how the actual data is written across the disks. I don't know the exact ZFS nomenclature, but I'll try to explain (in RAID terms):

Just like RAID5/RAID6, RAID-Z writes data in "stripes". Say, for example, that your physical disk sector size is 4096 bytes and your filesystem block size is 128k. That 128k must be split among the members, and if the per-member share isn't a multiple of 4096 you will have degraded performance, because a partial sector write is much slower: the disk needs to read the old sector, update it, and then write it back, instead of just writing the new sector.

So say you have a 5-disk RAID-Z; the stripe is written across 4 members:
e.g. 128k / 4 = 32k / 4096 = 8 sectors

But let's say you have a 6-disk RAID-Z; the stripe is written across 5 members:
e.g. 128k / 5 = 25.6k / 4096 = 6.4 sectors
Notice how the second option, with the less optimal disk configuration, will have a partial sector update per disk for every block written.

This is just to illustrate the problem; I haven't had a chance to look at the exact block size ZFS uses (IIRC it is indeed 128k, but please don't quote me on that).
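
To make the arithmetic above concrete, here is a small sketch (the 128 KiB recordsize and single-disk parity are assumptions matching the example numbers):

```python
# How a 128 KiB record splits across the data disks of a RAID-Z1 vdev
# with 4 KiB sectors, for a few vdev widths.

RECORD = 128 * 1024   # assumed default ZFS recordsize, in bytes
SECTOR = 4096         # 4K-sector drive

for total_disks in (4, 5, 6):
    data_disks = total_disks - 1              # RAID-Z1: one disk's worth of parity
    per_disk = RECORD / data_disks            # bytes each data disk receives
    sectors = per_disk / SECTOR
    note = "aligned" if sectors.is_integer() else "partial sector on each disk"
    print(f"{total_disks} disks: {per_disk / 1024:.1f} KiB/disk = {sectors:.2f} sectors ({note})")
# 4 disks: 42.7 KiB/disk = 10.67 sectors (partial sector on each disk)
# 5 disks: 32.0 KiB/disk = 8.00 sectors (aligned)
# 6 disks: 25.6 KiB/disk = 6.40 sectors (partial sector on each disk)
```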
 

HAL 9000

Dabbler
Joined
Jan 27, 2012
Messages
42
But let's say you have a 6-disk RAID-Z; the stripe is written across 5 members:
e.g. 128k / 5 = 25.6k / 4096 = 6.4 sectors
Notice how the second option, with the less optimal disk configuration, will have a partial sector update per disk for every block written.

This would mean that a filesystem block occupies less space than a multiple of disk sectors. There are two possibilities here: either the unused space in the sector is wasted, or that space is reused for the next filesystem block.
Adding data to a partially used sector would of course result in a read-modify-write cycle, which would break the fundamental ZFS rule that all RAID-Z writes are full-stripe writes.
Also, the ZFS record size (max 128k) can be variable when data compression is turned on, and a compressed record will never take up a multiple of disk sectors.
And one more thing that puzzles me: ZFS writes data to disk in transaction groups which are held in memory (the ARC) before being flushed to disk, so the actual amount of data written to disk is much more than one 128k record.
What am I missing here?
 

Gnome

Explorer
Joined
Aug 18, 2011
Messages
87
This would mean that a filesystem block occupies less space than a multiple of disk sectors. There are two possibilities here: either the unused space in the sector is wasted, or that space is reused for the next filesystem block.
Adding data to a partially used sector would of course result in a read-modify-write cycle, which would break the fundamental ZFS rule that all RAID-Z writes are full-stripe writes.
No it will not break the ZFS rule because it is still a full-stripe write, just not a full sector write. There is a difference.

There isn't going to be much space wasted; instead, for every stripe there will be a sector that is shared.

Also, the ZFS record size (max 128k) can be variable when data compression is turned on, and a compressed record will never take up a multiple of disk sectors.
And one more thing that puzzles me: ZFS writes data to disk in transaction groups which are held in memory (the ARC) before being flushed to disk, so the actual amount of data written to disk is much more than one 128k record.
What am I missing here?
I have no idea how the compression works (I haven't had time to look at it); you'll have to do that research yourself. All I know is that if you have the correct number of disks, your writes are always aligned.

You can start here: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

You'll notice they mention the rule 2^n + x, where x is the RAID-z level, on that page.
 

HAL 9000

Dabbler
Joined
Jan 27, 2012
Messages
42
No it will not break the ZFS rule because it is still a full-stripe write, just not a full sector write. There is a difference.
There isn't going to be much space wasted; instead, for every stripe there will be a sector that is shared.

So if there are no read-modify-write cycles and only a little space is wasted, what is the problem with a 4- or 6-disk RAIDZ configuration?

You can start here: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
You'll notice they mention the rule 2^n + x, where x is the RAID-z level, on that page.

Hmmm...
What is very interesting in this document is:

Start a triple-parity RAIDZ (raidz3) configuration at 9 disks (6+3)

128K / 6 = 21.33333K / 4096 = 5.33333 sectors

It looks like there is no "2^n + x" rule, and in the case of RAIDZ-3 the "best practice" is to do exactly the opposite of what you suggested for RAIDZ-1.
Maybe there is some "golden rule" for choosing the number of disks for RAIDZ, but clearly it is not "a power of 2 data disks"...
 

Gnome

Explorer
Joined
Aug 18, 2011
Messages
87
So if there are no read-modify-write cycles and only a little space is wasted, what is the problem with a 4- or 6-disk RAIDZ configuration?
There is a read-modify-write; that is what I'm trying to say. Even if it did "waste space", the hard drive would always read-modify-write on a partial sector write, regardless of what that sector contains. There is no way for the OS to tell the hard drive, on a partial sector write, to discard the rest of the data in the sector (even if it is 0s).
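
As a conceptual sketch of that cost (this is not ZFS code, just counting what the drive has to do when a per-disk chunk doesn't end on a 4096-byte boundary):

```python
# Count the sector operations for writing one per-disk chunk: a chunk that
# ends mid-sector forces the drive to read the old sector before rewriting it.

SECTOR = 4096

def sector_ops(chunk_bytes):
    full, tail = divmod(chunk_bytes, SECTOR)
    writes = full + (1 if tail else 0)
    reads = 1 if tail else 0              # the read-modify-write on the tail sector
    return reads, writes

print(sector_ops(32 * 1024))   # (0, 8) -- 5-disk RAID-Z1 chunk, fully aligned
print(sector_ops(26214))       # (1, 7) -- ~25.6 KiB chunk, one read-modify-write
```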


Hmmm...
What is very interesting in this document is:

128K / 6 = 21.33333K / 4096 = 5.33333 sectors

It looks like there is no "2^n + x" rule, and in the case of RAIDZ-3 the "best practice" is to do exactly the opposite of what you suggested for RAIDZ-1.
Maybe there is some "golden rule" for choosing the number of disks for RAIDZ, but clearly it is not "a power of 2 data disks"...
They say start at that, not use it. There is a difference between it being possible and it being recommended.
 

HAL 9000

Dabbler
Joined
Jan 27, 2012
Messages
42
There is a read-modify-write; that is what I'm trying to say.

So this is not true, then (http://hub.opensolaris.org/bin/view/Community+Group+zfs/whatis)?

ZFS introduces a new data replication model called RAID-Z. It is similar to RAID-5 but uses variable stripe width to eliminate the RAID-5 write hole (stripe corruption due to loss of power between data and parity updates). All RAID-Z writes are full-stripe writes. There's no read-modify-write tax, no write hole, and — the best part — no need for NVRAM in hardware.


They say start at that, not use it. There is a difference between it being possible and it being recommended.

So where is the "2^n + x" rule in the "ZFS Best Practices Guide"? I can't see it.
 

Gnome

Explorer
Joined
Aug 18, 2011
Messages
87

You are confusing full-stripe writes and full-sector writes. A stripe is a virtual unit; a sector is a physical unit on the disk.

You can have full-stripe writes without having full-sector writes.

Each stripe is a full write, that is true, but that says nothing about the underlying sectors on the disk itself.


So where is the "2^n + x" rule in the "ZFS Best Practices Guide"? I can't see it.
You are correct, that is the wrong guide. I'll search for it again when I'm home from work.
 

peterh

Patron
Joined
Oct 19, 2011
Messages
315
On the other hand, using 4 drives in RAID-Z1 works well.
There is no show-stopper here; you can have anywhere from 3 to a very large number of disks in RAID-Z1, but more than 6-8 in one vdev is bad for performance, and especially bad if a rebuild is ever needed.

Go for 4!
 

HAL 9000

Dabbler
Joined
Jan 27, 2012
Messages
42
On the other hand, using 4 drives in RAID-Z1 works well.

Actually, I have 4 disks and pool performance is very good (150MB/s write and 250MB/s read using slow "green" disks).
I've benchmarked 1, 2 (mirror), 3, and 4 disks, and the 4-disk RAIDZ-1 is the fastest.
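
For anyone wanting a rough number of their own, a crude sequential-write sketch (this is not how I benchmarked above; the path is a placeholder, and ZFS caching can easily skew the result):

```python
# Crude sequential-write throughput test. Writes 4 GiB of random data so
# compression can't inflate the number, then fsyncs before timing stops.

import os
import time

PATH = "/mnt/tank/benchfile"     # hypothetical dataset path -- change it
BLOCK = 1024 * 1024              # 1 MiB per write
COUNT = 4096                     # 4 GiB total

buf = os.urandom(BLOCK)
start = time.time()
with open(PATH, "wb") as f:
    for _ in range(COUNT):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())
elapsed = time.time() - start
print(f"~{BLOCK * COUNT / elapsed / 1e6:.0f} MB/s sequential write")
os.remove(PATH)
```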
 

Trianian

Explorer
Joined
Feb 10, 2012
Messages
60
So the 4 drives I have for my RAIDZ-1 will work great? Fantastic!

Thanks for all the input.
 