RaidZ3 has much smaller available disk space than expected

Status
Not open for further replies.

MarcusXP

Dabbler
Joined
Nov 13, 2011
Messages
12
I've recently set up a ZFS box using 23 x 4TB Hitachi drives in RaidZ3.
According to this calculator:
http://www.servethehome.com/raid-calculator/
the expected size of the pool should be about 72.8TB (taking into consideration the 3 drives lost to parity and the conversion between 1000 and 1024 - TB/TiB).
However, I am only getting 65.7TB...
This is the listing showing the available space (removed some unrelated information):

# zfs list
NAME                            USED   AVAIL  REFER  MOUNTPOINT
Pool-Storage1                   65.7T  2.85G  460K   /Pool-Storage1
Pool-Storage1/Volume-Storage1   65.7T  65.7T  230K   -
rpool                           16.8G  2.78G  47K    /rpool

This is my pool with 23 x 4TB drives:
Pool-Storage1 5000 14237568027059623914 vdevs: 1
vdev 1: raidz3 12 92.02 TB 12843423673337650988
c10t5000CCA22EC008ADd0 11897191478853965569 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec008ad:a
id1,sd@n5000cca22ec008ad/a
PL1310LAG029NA
c10t5000CCA22EC00A7Cd0 4700561813297779659 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec00a7c:a
id1,sd@n5000cca22ec00a7c/a
PL1310LAG02TLA
c10t5000CCA22EC00F4Fd0 18078600867002041605 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec00f4f:a
id1,sd@n5000cca22ec00f4f/a
PL1310LAG042EA
c10t5000CCA22EC01A8Dd0 17529955265994712780 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec01a8d:a
id1,sd@n5000cca22ec01a8d/a
PL1310LAG0728A
c10t5000CCA22EC0210Ed0 521233862270374400 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec0210e:a
id1,sd@n5000cca22ec0210e/a
PL1310LAG08TZA
c10t5000CCA22EC021D9d0 16504663500950716529 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec021d9:a
id1,sd@n5000cca22ec021d9/a
PL1310LAG090JA
c10t5000CCA22EC02208d0 17286534736261518309 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec02208:a
id1,sd@n5000cca22ec02208/a
PL1310LAG0921A
c10t5000CCA22EC0221Bd0 11471414523034938154 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec0221b:a
id1,sd@n5000cca22ec0221b/a
PL1310LAG092NA
c10t5000CCA22EC0222Bd0 10137454426417349138 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec0222b:a
id1,sd@n5000cca22ec0222b/a
PL1310LAG0935A
c10t5000CCA22EC02234d0 9661598762332469726 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec02234:a
id1,sd@n5000cca22ec02234/a
PL1310LAG093GA
c10t5000CCA22EC02287d0 2707340434424007771 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec02287:a
id1,sd@n5000cca22ec02287/a
PL1310LAG0964A
c10t5000CCA22EC022ACd0 1409056393389450611 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec022ac:a
id1,sd@n5000cca22ec022ac/a
PL1310LAG097AA
c10t5000CCA22EC022E7d0 13341020487462083677 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec022e7:a
id1,sd@n5000cca22ec022e7/a
PL1310LAG0997A
c10t5000CCA22EC02D19d0 7239933575572325830 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec02d19:a
id1,sd@n5000cca22ec02d19/a
PL1310LAG0D0EA
c10t5000CCA22EC032E8d0 11341704217867462188 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec032e8:a
id1,sd@n5000cca22ec032e8/a
PL1310LAG0EKDA
c10t5000CCA22EC032F8d0 14003683354051041944 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec032f8:a
id1,sd@n5000cca22ec032f8/a
PL1310LAG0EKXA
c10t5000CCA22EC0330Bd0 15602634426552806117 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec0330b:a
id1,sd@n5000cca22ec0330b/a
PL1310LAG0ELJA
c10t5000CCA22EC0340Bd0 1306860508072491531 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec0340b:a
id1,sd@n5000cca22ec0340b/a
PL1310LAG0EVTA
c10t5000CCA22EC04C15d0 16571289578325581979 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec04c15:a
id1,sd@n5000cca22ec04c15/a
PL1310LAG0N89A
c10t5000CCA22EC04C26d0 14950040610680530891 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec04c26:a
id1,sd@n5000cca22ec04c26/a
PL1310LAG0N8VA
c10t5000CCA22EC04C35d0 6692930116809901728 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec04c35:a
id1,sd@n5000cca22ec04c35/a
PL1310LAG0N9AA
c10t5000CCA22EC04C48d0 10360558934820987983 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec04c48:a
id1,sd@n5000cca22ec04c48/a
PL1310LAG0N9YA
c10t5000CCA22EC04C87d0 5707996551731273821 Hitachi HDS5C404
/scsi_vhci/disk@g5000cca22ec04c87:a
id1,sd@n5000cca22ec04c87/a
PL1310LAG0NBZA

Can anyone more experienced with ZFS tell me what is going on?
Where did the ~7TB of space go? Is it reserved for snapshots or something else? Can I recover it?

thanks a lot,
-Marcus
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
A few suggestions... if you need performance over raw storage, you are better off making multiple smaller vdevs of roughly 4 data drives plus parity drive(s). It's not a hard and fast rule; try a few different configs and benchmark them to get the balance you need. Unlike hardware RAID, ZFS is happy to use multiple RAID vdevs in a pool; it stripes across them like RAID 0.

Also, you are creating a ton of swap space. The OS needs some, but you might not want the default 2G on all 23 drives; that's 46G lost to swap. Maybe set it to 1G for something more reasonable.

As for the missing space, I don't have a good answer. Could you post the output of "zpool iostat -v"? It gives a nice breakdown of disk space, with the bonus of being useful for performance monitoring if you add another parameter for the number of seconds to collect data before displaying it again.
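
For example (using your pool name from above; the trailing 5 is just a sample refresh interval in seconds):

# zpool iostat -v Pool-Storage1
# zpool iostat -v Pool-Storage1 5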
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Swap would account for some of it, although even at 2 gigs per drive that's only 46 gigs of total swap.

I agree, that much swap is not needed. You don't say how much RAM is in the machine, but you may be able to disable swap completely.

ZFS also reserves 1/64 of the pool size for copy-on-write overhead. But even taking into account the 1000 -> 1024 conversion and the 1/64 reservation, the resulting ZFS sizes do seem a little less than expected. Based on my previous experience, a 'quick and dirty' number that works for me is 85% of the TB capacity of the disks after subtracting parity. So in your case I would have estimated 20 * 4 * 0.85 = 68TB, and ZFS is reporting 65.7. Close.
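
Working that estimate through with rough numbers (the 1/64 figure is the copy-on-write reservation mentioned above):

20 data drives x 4TB = 80TB raw = ~72.8TiB after the 1000 -> 1024 conversion
72.8TiB - 1/64 reservation (~1.1TiB) = ~71.6TiB
quick-and-dirty rule: 20 x 4 x 0.85 = 68TB, versus the 65.7T that zfs list reports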

Also, I would say a 23 drive z3 setup is way too many drives in a single vdev. I hope you have good backups. Resilvering is going to be really slow, as all 23 drives are going to be 'touched' during resilver. 11 disks is as big as I'd go for vdevs. What about 2x vdevs of 11 disks in z3? And one 'warm' spare. This would give the capacity of 16 disks, but would be far more resilient. If you were able to use 24 disks, I'd go with 3x vdevs of 8 drives in z2. This would be the capacity of 18, and would likely be the fastest of the options. And resilver times would only involve accessing 8 disks.
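
Roughly something like this (disk1 through disk23 below are placeholders, not your real c10t...d0 device names):

# zpool create Pool-Storage1 \
    raidz3 disk1 disk2 disk3 disk4 disk5 disk6 disk7 disk8 disk9 disk10 disk11 \
    raidz3 disk12 disk13 disk14 disk15 disk16 disk17 disk18 disk19 disk20 disk21 disk22 \
    spare disk23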

Also, are you running FreeNAS? Some googling on your pool status output is giving me Solaris indicators. If it is FreeNAS, how are the drives connected? What are the specs of the server?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Also, I would say a 23 drive z3 setup is way too many drives in a single vdev. I hope you have good backups. Resilvering is going to be really slow, as all 23 drives are going to be 'touched' during resilver. 11 disks is as big as I'd go for vdevs. What about 2x vdevs of 11 disks in z3? And one 'warm' spare. This would give the capacity of 16 disks, but would be far more resilient. If you were able to use 24 disks, I'd go with 3x vdevs of 8 drives in z2. This would be the capacity of 18, and would likely be the fastest of the options. And resilver times would only involve accessing 8 disks.

I agree with this 100%. I have the widest zpool and vdev I've seen in these forums: an 18 drive RAIDZ3. Quite a few of the more senior guys disagree with having a vdev even as wide as mine. Generally, going above 8+3 (3 parity drives in RAIDZ3) is considered borderline dangerous. If I had the chance to do it over again I'd change it, but right now I can't, so there's no point in arguing about it.

For you, I wouldn't go any wider than 11 drives in a RAIDZ3.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Technically it's true that if the number of 'usable' drives (total minus raidz parity) divides evenly into 128, it's 'optimal'.

However, unless the workload is extremely heavy, with lots of random access, I wouldn't worry too much about it. If you're going to be hosting an extremely busy SQL database on ZFS or something, I'd worry about optimally sized vdevs. In such a case, you probably wouldn't even use vdevs as large as the ones we're talking about here.

I'd say for the vast majority of workloads, an 'optimal' number of disks per vdev is not critical.
 

MarcusXP

Dabbler
Joined
Nov 13, 2011
Messages
12
First of all, thank you to everyone who replied, and sorry for my late response!
I am running OpenIndiana + napp-it GUI. However, some things I have to do from the command line (over SSH).

It made sense to try posting my question on this forum, since there seem to be many people here who know a lot about ZFS, and my problem is not specific to OpenIndiana in particular.

After doing some research I came to the conclusion that the lost space comes from the ashift parameter (the block size used when formatting the hard drive).
My drives are 4K-sector (legacy is 512 bytes), and it seems that when using 4K blocks, ZFS consumes quite a bit more space for its metadata and whatnot:
http://freebsd.1045724.n5.nabble.co...ift-12-vs-ashift-9-raidz2-pool-td5590057.html

My problem now is that I cannot (I did not find a way to) format these drives with a 512b block size so I can verify the assumption... and this problem is OpenIndiana specific, so I am not sure anyone here can help :(

- - - Updated - - -

I found this guide, which explains how to force the 512b block size; however, the guide assumes the "sd" driver is used for the hard drives:
http://wiki.illumos.org/display/illumos/ZFS+and+Advanced+Format+disks

In my case, the format command shows that my disks are using the scsi_vhci driver:

root@openindiana:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c5t0d0 <ATA-MZ-5EA1000-0D3-7D3Q cyl 2607 alt 2 hd 255 sec 63>
/pci@0,0/pci1043,8362@1f,2/disk@0,0
1. c10t5000CCA22EC00A7Cd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec00a7c
2. c10t5000CCA22EC00F4Fd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec00f4f
3. c10t5000CCA22EC01A8Dd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec01a8d
4. c10t5000CCA22EC02D19d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec02d19
5. c10t5000CCA22EC04C15d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec04c15
6. c10t5000CCA22EC04C26d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec04c26
7. c10t5000CCA22EC04C35d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec04c35
8. c10t5000CCA22EC04C48d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec04c48
9. c10t5000CCA22EC04C87d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec04c87
10. c10t5000CCA22EC008ADd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec008ad
11. c10t5000CCA22EC021D9d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec021d9
12. c10t5000CCA22EC022ACd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec022ac
13. c10t5000CCA22EC022E7d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec022e7
14. c10t5000CCA22EC032E8d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec032e8
15. c10t5000CCA22EC032F8d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec032f8
16. c10t5000CCA22EC0210Ed0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec0210e
17. c10t5000CCA22EC0221Bd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec0221b
18. c10t5000CCA22EC0222Bd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec0222b
19. c10t5000CCA22EC0330Bd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec0330b
20. c10t5000CCA22EC0340Bd0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec0340b
21. c10t5000CCA22EC02208d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec02208
22. c10t5000CCA22EC02234d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec02234
23. c10t5000CCA22EC02287d0 <ATA-Hitachi HDS5C404-A250-3.64TB>
/scsi_vhci/disk@g5000cca22ec02287


So it made sense (to me, at least) to try modifying the file /kernel/drv/scsi_vhci.conf.

I added the following at the end of the file:
device-type-scsi-options-list =
"Hitachi HDS5C404", "physical-block-size:512";

Then I rebooted the system and created a new pool, but it still has the same ashift=12, not ashift=9 as I wanted :(
root@openindiana:~# zdb Pool-Storage1 | grep ashift
ashift: 12
ashift: 12

So what I put in scsi_vhci.conf probably didn't work... any suggestions?

- - - Updated - - -

I have one more question. A user on [H] suggested that when the pool is not using the optimal number of drives (e.g. RAIDZ3 with 8+3 drives), there will be wasted disk space in addition to the performance loss. Is that correct?

In my case (23 drives in RAIDZ3), he is saying the following:
I assumed your disks are 4k sector.

If that is the case, zfs will use ashift=12 when it makes the pool, meaning the smallest write it does will be 4k.

When you stripe the 128k block of data over 20 disks, that will be 6.4k per disk, so it needs two 4k blocks to store it, meaning a 128k block will use 160k of disk space.
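
Spelling that arithmetic out (this assumes the default 128k recordsize and ashift=12, as he did):

128k record / 20 data disks = 6.4k per disk
6.4k does not fit in one 4k sector, so each disk allocates 2 x 4k = 8k
20 disks x 8k = 160k allocated for 128k of user data (before parity)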

So I don't know what to believe: will there be wasted disk space or not when a non-optimal number of disks is used?
Any insight on this would be really appreciated :)
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I have one more question.. a user on [H] suggested that when the pool is not using the optimal number of drives (e.g RAIDZ3 8+3 drives), there will be wasted disk space, in addition to the performance loss.. is that correct?

In my case (23 drives in RAIDZ3), he is saying the following:


So I don't know what to believe, will it be wasted disk space or not, when non-optimal number of disks are used?
Any insight on this would be really appreciated :)

I can't say 100% about the wasted space (it sounds correct though), but I can say 100% that performance will suffer greatly.
 

MarcusXP

Dabbler
Joined
Nov 13, 2011
Messages
12
So it seems that there are at least two things that contribute to wasted disk space on ZFS:
- disk sector size (4k vs 512b)
- the number of data disks in a vdev (if it is not a power of two, more disk space is wasted).
This should be documented somewhere... but I hadn't heard of it until I actually set up a ZFS box and ran into all these issues... duh!

I'm going to try to set up a pool with 4 vdevs of 6 drives each in RAIDZ2 (4 data + 2 parity in each vdev). This should work pretty well, I guess.
The problem is that my system (not sure exactly which component) only sees 23 disks through the integrated SAS expander in the Supermicro case... even though 24 disks are plugged in :(
I'm investigating further to see how I can get this solved... trying other RAID controllers (HBAs) and different motherboards... hopefully I will get it sorted soon(ish).
 

MarcusXP

Dabbler
Joined
Nov 13, 2011
Messages
12
After many hours of testing, the problem went away when I changed the motherboard.
I am now using a Tyan dual LGA1366 motherboard (I was using an Asus dual LGA1366 motherboard before), and I can see all 24 drives (plus the one drive for the OS).
The weird part is that I even tried another controller (an Areca 1680IX-24 switched to JBOD mode, kind of like IT mode for LSI), and on the Areca only 23 drives would show up as well, not 24.
And it seemed to be random; sometimes a different drive was missing, not the same drive every time.
It looked like some kind of resource conflict, but I didn't have the patience to troubleshoot it further. Plus, I like the Tyan board better.
I am using OpenIndiana, and I checked the drives both in napp-it and over SSH (running the "format" utility); it would only list 23 drives + the OS drive, not the expected 24 drives + the OS drive.

- - - Updated - - -

I ended up with the following config:

Pool1: 1 vdev of 19 drives (16 + 3) RaidZ3 -- about 54TB usable space
Pool2: 1 vdev of 5 drives (3 + 2) RaidZ2 -- about 10.2TB usable space
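
(As a sanity check against the 85% rule of thumb mentioned earlier in the thread:
Pool1: 16 data drives x 4TB x 0.85 = ~54.4TB, which matches the ~54TB above
Pool2: 3 data drives x 4TB x 0.85 = ~10.2TB, which matches as well.)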

Pool2 is not optimal; optimal would have been 4 + 1 drives in RaidZ, but I wanted better redundancy, so I sacrificed some disk space and performance for it.

I created one vdev in each pool so I could keep the 5-drive vdev, which has weaker redundancy, separate.
So I now have 2 volumes (one on each pool), assigned to 1 FibreChannel port... working pretty nicely.

I get about 360-370MB/sec read and 120-160MB/sec write speed through FibreChannel on each volume, tested with Windows and Linux clients (virtual machines that have the FC card physically assigned to them in ESXi).

I did try the following config:
4 vdevs of 6 drives each (4 + 2) RaidZ2 in one pool, and the performance results were 360-370MB/sec read and 250-260MB/sec write.
With my current config I lose some write speed compared to that, but gain over 10TB more disk space... while still having very good redundancy.
I am pretty happy so far... I will set up a 2nd similar storage server in the next week or two, I hope.
 

Firebug24k

Cadet
Joined
Aug 6, 2013
Messages
4
I'm about ready to set up a similar configuration as you - 16+3 raidZ3 with 4TB drives. After you've used your setup for 4 months, any regrets?
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
I'm about ready to set up a similar configuration as you - 16+3 raidZ3 with 4TB drives.
A single 19 drive vdev with 4 TB drives, no less? No thanks. Use a single pool with two vdevs of your choice, raidz2 or higher.
 

MarcusXP

Dabbler
Joined
Nov 13, 2011
Messages
12
I'm about ready to set up a similar configuration as you - 16+3 raidZ3 with 4TB drives. After you've used your setup for 4 months, any regrets?

I haven't checked this thread in a long time... so here's an update:
I ended up ditching ZFS altogether and using two RAID-6 arrays of 12 x 4TB drives on an Areca 1882i.
Working just fine so far; it's not as 'safe' as ZFS (no scrubbing), but I have scheduled regular RAID health checks and that has been good enough for me so far.
ZFS was just too much hassle for me, plus I had some issues configuring the FibreChannel card (so I went with a 10Gb NIC instead of FC).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
So you went with UFS I take it?
 