request for comments on 8-disk layout

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
I have read a bit about ZFS layouts and want to ask for validation. At the moment I have a non-ZFS 5x18TB raid6 pool (so 54TB capacity; 5 disks is a hard limit). I currently use only around 10TB (VM backups, large media/video files and also small media files, some not-too-important personal backups and some temporary files); considering my use, I expect this to reach at most 30TB or so. There is usually only one concurrent user of the pool, and most of the use is large files. I do not plan to have an active dataset on this pool; I am using local storage for VMs, processing media files, etc., and I can increase the capacity of local storage if more is needed.

I am planning to upgrade this to TrueNAS soon. I can have 8 disks (8 is a hard limit again, and I already have 8 identical disks, HC550), but I have the possibility of adding more RAM (planning 16-32GB at the moment) and a few SATA and/or NVMe SSDs or Optane. Besides wanting to run an open platform, the main reason I am upgrading is that I cannot saturate a 10G network link with my current pool. I guess that is not very surprising, since 3 data drives give at most 750MB/s; I can sometimes reach around 500MB/s but not always, probably due to the limits of the (embedded) storage unit I am currently using. So I am considering striped mirrors or raidz2. I do not need much redundancy; 1-disk redundancy is OK for me (I have offsite backups for the more important things), and I mention raidz2 only because of the large capacity of the disks, to avoid problems during rebuild/resilver. I understand ZFS performance decreases as the pool fills up, so I plan to use 50% of the available capacity as my maximum.

I use SMB and NFS. No iSCSI other than just experimenting with it. My read and write patterns are not very asymmetrical (nothing like 10:1), so I prefer similar speeds for both.

My questions (I hope I have calculated this right):

(assuming the streaming read/write speed of a single drive is 250MB/s)

striped mirrors:
- 4x 2-way mirrors give 72TB; 50% of that is 36TB, which is acceptable but not ideal. I don't think I need the extra IOPS.
- the streaming read speed of this (2000MB/s) is more than 10G, but the write speed (1000MB/s) is a little less than 10G.

raidz2:
- 8-wide raidz2 gives 108TB, and half of that is 54TB, which is quite ideal. I understand IOPS will be 1/4 of the striped mirrors or less, but I guess that will not matter for me.
- both the read and write speeds are in the 10G range (1500MB/s). (A rough arithmetic check of these numbers follows below.)
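For reference, the back-of-the-envelope arithmetic behind these numbers (raw capacity per layout, 250MB/s per drive, no ZFS or protocol overhead accounted for):

# usable capacity in TB (18TB drives)
echo $(( 4 * 18 ))         # 4x 2-way mirrors -> 72
echo $(( (8 - 2) * 18 ))   # 8-wide raidz2    -> 108
# rough streaming throughput in MB/s (250MB/s per drive)
echo $(( 8 * 250 ))        # mirror read (all 8 drives)  -> 2000
echo $(( 4 * 250 ))        # mirror write (4 vdevs)      -> 1000
echo $(( 6 * 250 ))        # raidz2 read/write (6 data)  -> 1500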

Looking at these, raidz2 looks better. Two issues I don't have a clear picture of:

- I read about the optimum widths for the various raidz levels, and I think 6 is optimum for raidz2. If I use 6 disks, raidz2 becomes much less attractive compared to striped mirrors: everything about it is worse except the better redundancy (compared to 4x 2-way mirrors), which I don't think I need. Or should I not consider this at all and keep it 8-wide raidz2?

- I do not need it to be re-available very quickly after a failure, but I also do not want it to be unavailable for a day or more. I guess that makes striped mirrors attractive? Or would a resilver on raidz2 not be that big a problem?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Two things.

An 8-wide RAID-Z2 should be fine. The old, original suggested widths for RAID-Zx applied if the ZFS datasets were uncompressed. Either leave compression turned on or change it to another suitable algorithm (LZ4 is a good default).
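If you want to check or change it from the shell, it is along these lines ("tank" here is just a placeholder pool/dataset name):

zfs get compression tank
zfs set compression=lz4 tank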

ZFS does the re-sync (aka resilver) of replacement disks live. The NAS might be a bit slower, but should still be able to serve up files and accept files.

I don't know about the speed numbers... so I have not commented on that part.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I would go for at least 32 GB of RAM, probably even 64 GB.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Striped mirrors are good for IOPS and for the flexibility to add or remove vdevs. If high IOPS are not required and flexibility is impossible anyway because all slots are filled, go for 8-wide raidz2.
By the way, the 50% limit is guidance for block storage (iSCSI, zvols). For regular file storage, up to 70-80% should be fine (though maybe not fine if files are frequently overwritten).
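You can keep an eye on occupancy with the usual commands ("tank" is a placeholder pool name):

zpool list tank          # the CAP column shows pool-wide usage
zfs list -o space tank   # per-dataset space accounting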

32 GB RAM is not enough for an L2ARC, and may not be enough to make full use of a 10 GbE link. Only sync writes (NFS, but not SMB) would benefit from a SLOG.
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Two things.

An 8-wide RAID-Z2 should be fine. The old, original suggested widths for RAID-Zx applied if the ZFS datasets were uncompressed. Either leave compression turned on or change it to another suitable algorithm (LZ4 is a good default).

ZFS does the re-sync (aka resilver) of replacement disks live. The NAS might be a bit slower, but should still be able to serve up files and accept files.

I don't know about the speed numbers... so I have not commented on that part.

Yes, I will enable compression, so this eliminates the optimum-width concern I had.

Live re-sync sounds OK. If possible, it would be nice to know approximately how long it would take to re-sync a mostly idle 8x18TB raidz2 (let's say when it is 1/4 full).
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Striped mirrors are good for IOPS and for the flexibility to add or remove vdevs. If high IOPS are not required and flexibility is impossible anyway because all slots are filled, go for 8-wide raidz2.
By the way, the 50% limit is guidance for block storage (iSCSI, zvols). For regular file storage, up to 70-80% should be fine (though maybe not fine if files are frequently overwritten).

32 GB RAM is not enough for an L2ARC, and may not be enough to make full use of a 10 GbE link. Only sync writes (NFS, but not SMB) would benefit from a SLOG.

OK, I think I am going in the raidz2 direction. I didn't know about that guidance; good to know it can be more than 50%.

I don't think I need (explicit) sync writes, so probably no SLOG.

How do you correlate network speed to RAM? What is the calculation there?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I don't think I need (explicit) sync writes, so probably no SLOG.
You don't need an SLOG for doing sync writes. An SLOG, if it is sufficiently faster than the "regular" drive, will speed up sync writes. But they are possible without one.
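For completeness, adding (and later removing) a dedicated log vdev is a one-liner; the device name below is just a placeholder:

zpool add tank log nvme0n1     # attach a fast SSD as SLOG
zpool remove tank nvme0n1      # log vdevs can be removed again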
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Please read the following resources:

Is it possible to make all requests from SMB async on the TrueNAS side?
Yes.

Live re-sync sounds OK.
It's a scheduled task meant for backup and restoration from such backups: you can't do Google-Drive-style instant sync with it. Just making sure you understand this, since we've had a few such cases recently.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Is it possible to make all requests from SMB async on the TrueNAS side?
Yes, you can set sync=disabled on the relevant datasets.
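For example (the dataset name is a placeholder):

zfs set sync=disabled tank/media
zfs get sync tank/media          # valid values: standard | always | disabled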

How do you correlate network speed to RAM? What is the calculation there?
RAM is used as read and write cache, and the only sure way to saturate a high-speed link is to feed it from (or to) RAM.
Writes are cached as "transaction groups" ('txg', typically 5 s) and then flushed to disk as a sequential operation encompassing everything that was received during those 5 seconds. 5 seconds at 10 Gb/s takes up more space than 5 s at 1 Gb/s.
Because of this nice write-accelerating mechanism, BIGFILE.MP4, which took several transaction groups to receive and write, has been sliced across several groups and peppered with other blocks that were written in the same groups. So serving back BIGFILE, which looks like a single, simple sequential read operation, actually involves random reads across different txgs. And if ZFS does not have the corresponding metadata in RAM (because it has been evicted to make way for a large write cache, for instance) and first needs to read the metadata from disk to find where these txgs are, then what looks like a simple large sequential read actually involves a lot of IOPS and random access, and will certainly NOT proceed at the theoretical read throughput of the pool.

Welcome to the wonderful world of "ZFS does not behave the way you think"!
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Please read the following resources:

Is it possible to make all requests from SMB async on the TrueNAS side?
Yes.

Live re-sync sounds OK.
It's a scheduled task meant for backup and restoration from such backups: you can't do Google-Drive-style instant sync with it. Just making sure you understand this, since we've had a few such cases recently.

I think I have checked these and other resources as well, but not specifically for re-sync; I will do that.
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
You don't need an SLOG for doing sync writes. An SLOG, if it is sufficiently faster than the "regular" drive, will speed up sync writes. But they are possible without one.

Yes, what I mean is that I will not explicitly ask for sync writes. I didn't know if sync could be disabled in all protocols (and in all clients, etc.), but it seems it can be disabled on the TrueNAS side, so that works perfectly for me.
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Yes, you can set sync=disabled on the relevant datasets.


RAM is used as read and write cache, and the only sure way to saturate a high-speed link is to feed it from (or to) RAM.
Writes are cached as "transaction groups" ('txg', typically 5 s) and then flushed to disk as a sequential operation encompassing everything that was received during those 5 seconds. 5 seconds at 10 Gb/s takes up more space than 5 s at 1 Gb/s.
Because of this nice write-accelerating mechanism, BIGFILE.MP4, which took several transaction groups to receive and write, has been sliced across several groups and peppered with other blocks that were written in the same groups. So serving back BIGFILE, which looks like a single, simple sequential read operation, actually involves random reads across different txgs. And if ZFS does not have the corresponding metadata in RAM (because it has been evicted to make way for a large write cache, for instance) and first needs to read the metadata from disk to find where these txgs are, then what looks like a simple large sequential read actually involves a lot of IOPS and random access, and will certainly NOT proceed at the theoretical read throughput of the pool.

Welcome to the wonderful world of "ZFS does not behave the way you think"!

OK, I read about the 5-second transaction groups. 10Gb/s is at most 1.25GB/s, so that is at most 6.25GB per txg. I was thinking this is well below, let's say, 32GB, so it might be OK. I should say it is rare that multiple things happen concurrently, so when there is a big file write, it is just that. But I understand what you mean; it is not that straightforward to calculate. It seems this deserves a test, since saturating the link is my main purpose. Since I can't directly move the files (I have to reuse the disks), I will back up and restore, so I can do the test with different setups/layouts as well.
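For the test itself I am thinking of something simple with fio, assuming it is available on TrueNAS; the path and size below are placeholders, and the size should be well above RAM so I am not just measuring the ARC:

fio --name=seqwrite --directory=/mnt/tank/bench --rw=write --bs=1M --size=64G --ioengine=posixaio
fio --name=seqread --directory=/mnt/tank/bench --rw=read --bs=1M --size=64G --ioengine=posixaio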
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
OK, I read about the 5-second transaction groups. 10Gb/s is at most 1.25GB/s, so that is at most 6.25GB per txg.
Make that 12.5 GB for the two txgs which ZFS caches, and you've taken a sizeable chunk out of 32 GB.
Related to the issue of holding "enough" metadata in RAM, there's a rule of thumb of 1 GB per 1 TB of pool storage (which admittedly relaxes for large RAM sizes, but 32 GB is not yet "large" for ZFS), and with an 8*18 TB pool, 32 GB falls well short of that. You probably need 64 GB RAM or more to reliably saturate a 10 GbE link.
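To see how that RAM is actually being used (data vs. metadata, hit rates), the reporting tools that ship with OpenZFS are enough; names may differ slightly between CORE and SCALE:

arc_summary | head -40    # ARC size, hit rates, data vs. metadata
arcstat 5                 # live ARC statistics every 5 seconds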
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Make that 12.5 GB for the two txgs which ZFS caches, and you've taken a sizeable chunk out of 32 GB.
Related to the issue of holding "enough" metadata in RAM, there's a rule of thumb of 1 GB per 1 TB of pool storage (which admittedly relaxes for large RAM sizes, but 32 GB is not yet "large" for ZFS), and with an 8*18 TB pool, 32 GB falls well short of that. You probably need 64 GB RAM or more to reliably saturate a 10 GbE link.

Thanks, I will try.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...

Live re-sync sounds OK. If possible, it would be nice to know approximately how long it would take to re-sync a mostly idle 8x18TB raidz2 (let's say when it is 1/4 full).
Once you start the disk-replacement re-sync / re-silver, a zpool status command will give an estimate of when it will complete. Very early in the re-sync / re-silver the times will be way off, so you generally want to wait 5, 10 or even 30 minutes. By then, things may have stabilized as far as the estimated completion time goes (but it is still an ESTIMATE!).
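A replacement looks roughly like this; "tank" and the device names are placeholders:

zpool replace tank da3 da8    # swap the failed disk for the new one
zpool status tank             # the 'scan:' line shows progress and the estimate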
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
I was thinking about this some more, and then I saw some related posts.

- Almost never will more than 50GB be written in a short time frame, and 50GB is about 40 seconds at 10G. If I have enough RAM, would increasing the txg timeout from 5 seconds to, let's say, 45 seconds guarantee that the link is saturated independent of the speed of the pool? I guess I am talking about ~256GB RAM.

- Instead of writing directly to the main pool consisting of HDDs, I could write to a pool consisting only of NVMe SSDs and then move the data to the main pool in the background. I just don't know how easy this would be operationally.

Are these setups common for particular use cases ?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If I have enough RAM, would increasing the txg timeout from 5 seconds to, let's say, 45 seconds guarantee that the link is saturated independent of the speed of the pool?

No, it will not. Messing with the txg timeout is not recommended, since it is an integral part of the ZFS write throttle to make sure that your ZFS pool does not stall. Some of us actually lower it to 1s in order to encourage more consistent behaviour on workloads that require it.
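For reference only, this is where the tunable lives (Linux module parameter on SCALE, sysctl on CORE); look, rather than touch, unless you know exactly why:

cat /sys/module/zfs/parameters/zfs_txg_timeout   # TrueNAS SCALE
sysctl vfs.zfs.txg.timeout                       # TrueNAS CORE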
 