request for comments on 8-disk layout

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
I have read a bit about ZFS layouts and want to ask for validation. At the moment I have a non-ZFS 5x18TB raid6 pool (so 54TB capacity; 5 disks is a hard limit). I currently use only around 10TB (VM backups, large media/video files and also small media files, some not-too-important personal backups and some temporary files); considering my use, I expect this to reach at most 30TB or so. There is usually only one concurrent user of the pool, and most of the use is large files. I do not plan to have an active dataset on this pool; I am using local storage for VMs, processing media files, etc., and I can increase the capacity of local storage if more is needed.

I am planning to upgrade this to TrueNAS soon. I can have 8 disks (8 is a hard limit again, and I already have 8 identical disks, HC550), but I have the possibility of adding more RAM (planning 16-32GB at the moment) and a few SATA and/or NVMe SSDs or Optane. Besides wanting to run an open platform, the main reason I am upgrading is that I cannot saturate a 10G network link with my current pool. I guess that is not very surprising, since 3 data drives give at most 750MB/s; I can sometimes reach around 500MB/s but not always, probably due to the limits of the (embedded) storage unit I am currently using. So I am considering striped mirrors or raidz2. I do not need much redundancy; 1-disk redundancy is OK for me (I have offsite backups for the more important things), and I mention raidz2 only because of the large capacity of the disks, to avoid problems during rebuild/resilver. I understand ZFS performance decreases as the pool fills up, so I plan to use 50% of the available capacity as my maximum.

I use SMB and NFS. No iSCSI other than just experimenting with it. My read and write patterns are not very asymmetrical (nothing like 10:1), so I prefer similar speeds for both.

My questions (I hope I have calculated this right):

(assuming the streaming read/write speed of a single drive is 250MB/s)

striped mirrors:
- 4x 2-way mirrors give 72TB; 50% of that is 36TB, which is acceptable but not ideal. I don't think I need the extra IOPS.
- the streaming read speed of this (2000MB/s) is more than 10G, but the write speed (1000MB/s) is a little less than 10G.

raidz2:
- 8-wide raidz2 gives 108TB, and half of that is 54TB, which is quite ideal. I understand IOPS will be 1/4 of the striped mirrors or less, but I guess that will not matter for me.
- both the read and write speeds are in the 10G range (1500MB/s). (A rough arithmetic check of these numbers follows below.)
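For reference, the back-of-the-envelope arithmetic behind these numbers (raw capacity per layout, 250MB/s per drive, no ZFS or protocol overhead accounted for):

# usable capacity in TB (18TB drives)
echo $(( 4 * 18 ))         # 4x 2-way mirrors -> 72
echo $(( (8 - 2) * 18 ))   # 8-wide raidz2    -> 108
# rough streaming throughput in MB/s (250MB/s per drive)
echo $(( 8 * 250 ))        # mirror read (all 8 drives)  -> 2000
echo $(( 4 * 250 ))        # mirror write (4 vdevs)      -> 1000
echo $(( 6 * 250 ))        # raidz2 read/write (6 data)  -> 1500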

Looking at these, raidz2 looks better. Two issues I don't have a clear picture of:

- I read about the optimum widths for the various raidz levels, and I think 6 is optimum for raidz2. If I use 6 disks, raidz2 becomes much less attractive compared to striped mirrors: everything about it is worse except the better redundancy (compared to 4x 2-way mirrors), which I don't think I need. Or should I not consider this at all and keep it 8-wide raidz2?

- I do not need it to be re-available very quickly after a failure, but I also do not want it to be unavailable for a day or more. I guess that makes striped mirrors attractive? Or would a resilver on raidz2 not be that big a problem?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Two things.

An 8-wide RAID-Z2 should be fine. The old, original suggested widths for RAID-Zx applied if the ZFS datasets were uncompressed. Either leave compression turned on or change it to another suitable algorithm (LZ4 is a good default).
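If you want to check or change it from the shell, it is along these lines ("tank" here is just a placeholder pool/dataset name):

zfs get compression tank
zfs set compression=lz4 tank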

ZFS does the re-sync (aka resilver) of replacement disks live. The NAS might be a bit slower, but should still be able to serve up files and accept files.

I don't know about the speed numbers... so I have not commented on that part.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I would go for at least 32 GB of RAM, probably even 64 GB.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Striped mirrors are good for IOPS and for the flexibility to add or remove vdevs. If high IOPS are not required and flexibility is impossible anyway because all slots are filled, go for 8-wide raidz2.
By the way, the 50% limit is guidance for block storage (iSCSI, zvols). For regular file storage, up to 70-80% should be fine (though maybe not fine if files are frequently overwritten).
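You can keep an eye on occupancy with the usual commands ("tank" is a placeholder pool name):

zpool list tank          # the CAP column shows pool-wide usage
zfs list -o space tank   # per-dataset space accounting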

32 GB RAM is not enough for an L2ARC, and may not be enough to make full use of a 10 GbE link. Only sync writes (NFS, but not SMB) would benefit from a SLOG.
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Two things.

An 8-wide RAID-Z2 should be fine. The old, original suggested widths for RAID-Zx applied if the ZFS datasets were uncompressed. Either leave compression turned on or change it to another suitable algorithm (LZ4 is a good default).

ZFS does the re-sync (aka resilver) of replacement disks live. The NAS might be a bit slower, but should still be able to serve up files and accept files.

I don't know about the speed numbers... so I have not commented on that part.

Yes, I will enable compression, so this eliminates the optimum-width concern I had.

Live re-sync sounds OK. If possible, it would be nice to know approximately how long it would take to re-sync a mostly idle 8x18TB raidz2 (let's say when it is 1/4 full).
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Striped mirrors are good for IOPS and for the flexibility to add or remove vdevs. If high IOPS are not required and flexibility is impossible anyway because all slots are filled, go for 8-wide raidz2.
By the way, the 50% limit is guidance for block storage (iSCSI, zvols). For regular file storage, up to 70-80% should be fine (though maybe not fine if files are frequently overwritten).

32 GB RAM is not enough for an L2ARC, and may not be enough to make full use of a 10 GbE link. Only sync writes (NFS, but not SMB) would benefit from a SLOG.

OK, I think I am going in the raidz2 direction. I didn't know about that guidance; good to know it can be more than 50%.

I don't think I need (explicit) sync writes, so probably no SLOG.

How do you correlate network speed to RAM? What is the calculation there?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I don't think I need (explicit) sync writes, so probably no SLOG.
You don't need an SLOG for doing sync writes. An SLOG, if it is sufficiently faster than the "regular" drive, will speed up sync writes. But they are possible without one.
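For completeness, adding (and later removing) a dedicated log vdev is a one-liner; the device name below is just a placeholder:

zpool add tank log nvme0n1     # attach a fast SSD as SLOG
zpool remove tank nvme0n1      # log vdevs can be removed again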
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Please read the following resources:

Is it possible to make all requests from SMB async on the TrueNAS side?
Yes.

Live re-sync sounds OK.
It's a scheduled task meant for backup and restoration from such backups: you can't do Google-Drive-style instant sync with it. Just making sure you understand this, since we've had a few such cases recently.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Is it possible to make all requests from SMB async on the TrueNAS side?
Yes, you can set sync=disabled on the relevant datasets.
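For example (the dataset name is a placeholder):

zfs set sync=disabled tank/media
zfs get sync tank/media          # valid values: standard | always | disabled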

How do you correlate network speed to RAM? What is the calculation there?
RAM is used as read and write cache, and the only sure way to saturate a high-speed link is to feed it from (or to) RAM.
Writes are cached as "transaction groups" ('txg', typically 5 s) and then flushed to disk as a sequential operation encompassing everything that was received during those 5 seconds. 5 seconds at 10 Gb/s takes up more space than 5 s at 1 Gb/s.
Because of this nice write-accelerating mechanism, BIGFILE.MP4, which took several transaction groups to receive and write, has been sliced across several groups and peppered with other blocks that were written in the same groups. So serving back BIGFILE, which looks like a single, simple sequential read operation, actually involves random reads across different txgs. And if ZFS does not have the corresponding metadata in RAM (because it has been evicted to make way for a large write cache, for instance) and first needs to read the metadata from disk to find where these txgs are, then what looks like a simple large sequential read actually involves a lot of IOPS and random access, and will certainly NOT proceed at the theoretical read throughput of the pool.

Welcome to the wonderful world of "ZFS does not behave the way you think"!
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Please read the following resources:

Is it possible to make all requests from SMB async on the TrueNAS side?
Yes.

Live re-sync sounds OK.
It's a scheduled task meant for backup and restoration from such backups: you can't do Google-Drive-style instant sync with it. Just making sure you understand this, since we've had a few such cases recently.

I think I have checked these and other resources as well, but not specifically for re-sync; I will do that.
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
You don't need an SLOG for doing sync writes. An SLOG, if it is sufficiently faster than the "regular" drive, will speed up sync writes. But they are possible without one.

Yes, what I mean is that I will not explicitly ask for sync writes. I didn't know if sync could be disabled in all protocols (and in all clients, etc.), but it seems it can be disabled on the TrueNAS side, so that works perfectly for me.
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Yes, you can set sync=disabled on the relevant datasets.


RAM is used as read and write cache, and the only sure way to saturate a high-speed link is to feed it from (or to) RAM.
Writes are cached as "transaction groups" ('txg', typically 5 s) and then flushed to disk as a sequential operation encompassing everything that was received during those 5 seconds. 5 seconds at 10 Gb/s takes up more space than 5 s at 1 Gb/s.
Because of this nice write-accelerating mechanism, BIGFILE.MP4, which took several transaction groups to receive and write, has been sliced across several groups and peppered with other blocks that were written in the same groups. So serving back BIGFILE, which looks like a single, simple sequential read operation, actually involves random reads across different txgs. And if ZFS does not have the corresponding metadata in RAM (because it has been evicted to make way for a large write cache, for instance) and first needs to read the metadata from disk to find where these txgs are, then what looks like a simple large sequential read actually involves a lot of IOPS and random access, and will certainly NOT proceed at the theoretical read throughput of the pool.

Welcome to the wonderful world of "ZFS does not behave the way you think"!

OK, I read about the 5-second transaction groups. 10Gb/s is at most 1.25GB/s, so that is at most 6.25GB per txg. I was thinking this is well below, let's say, 32GB, so it might be OK. I should say it is rare that multiple things happen concurrently, so when there is a big file write, it is just that. But I understand what you mean; it is not that straightforward to calculate. It seems this deserves a test, since saturating the link is my main purpose. Since I can't directly move the files (I have to reuse the disks), I will back up and restore, so I can do the test with different setups/layouts as well.
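For the test itself I am thinking of something simple with fio, assuming it is available on TrueNAS; the path and size below are placeholders, and the size should be well above RAM so I am not just measuring the ARC:

fio --name=seqwrite --directory=/mnt/tank/bench --rw=write --bs=1M --size=64G --ioengine=posixaio
fio --name=seqread --directory=/mnt/tank/bench --rw=read --bs=1M --size=64G --ioengine=posixaio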
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
OK, I read about the 5-second transaction groups. 10Gb/s is at most 1.25GB/s, so that is at most 6.25GB per txg.
Make that 12.5 GB for the two txgs which ZFS caches, and you've taken a sizeable chunk out of 32 GB.
Related to the issue of holding "enough" metadata in RAM, there's a rule of thumb of 1 GB per 1 TB of pool storage (which admittedly relaxes for large RAM sizes, but 32 GB is not yet "large" for ZFS), and with an 8*18 TB pool, 32 GB falls well short of that. You probably need 64 GB RAM or more to reliably saturate a 10 GbE link.
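To see how that RAM is actually being used (data vs. metadata, hit rates), the reporting tools that ship with OpenZFS are enough; names may differ slightly between CORE and SCALE:

arc_summary | head -40    # ARC size, hit rates, data vs. metadata
arcstat 5                 # live ARC statistics every 5 seconds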
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
Make that 12.5 GB for the two txgs which ZFS caches, and you've taken a sizeable chunk out of 32 GB.
Related to the issue of holding "enough" metadata in RAM, there's a rule of thumb of 1 GB per 1 TB of pool storage (which admittedly relaxes for large RAM sizes, but 32 GB is not yet "large" for ZFS), and with an 8*18 TB pool, 32 GB falls well short of that. You probably need 64 GB RAM or more to reliably saturate a 10 GbE link.

Thanks, I will try.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...

Live re-sync sounds OK. If possible, it would be nice to know approximately how long it would take to re-sync a mostly idle 8x18TB raidz2 (let's say when it is 1/4 full).
Once you start the disk-replacement re-sync / re-silver, a zpool status command will give an estimate of when it will complete. Very early in the re-sync / re-silver the times will be way off, so you generally want to wait 5, 10 or even 30 minutes. By then, things may have stabilized as far as the estimated completion time goes (but it is still an ESTIMATE!).
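A replacement looks roughly like this; "tank" and the device names are placeholders:

zpool replace tank da3 da8    # swap the failed disk for the new one
zpool status tank             # the 'scan:' line shows progress and the estimate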
 

metebalci

Dabbler
Joined
Jan 18, 2023
Messages
28
I was thinking about this some more, and then I saw some related posts.

- Almost never will more than 50GB be written in a short time frame, and 50GB is about 40 seconds at 10G. If I have enough RAM, would increasing the txg timeout from 5 seconds to, let's say, 45 seconds guarantee that the link is saturated independent of the speed of the pool? I guess I am talking about ~256GB RAM.

- Instead of writing directly to the main pool consisting of HDDs, I could write to a pool consisting only of NVMe SSDs and then move the data to the main pool in the background. I just don't know how easy this would be operationally.

Are these setups common for particular use cases ?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If I have enough RAM, would increasing the txg timeout from 5 seconds to, let's say, 45 seconds guarantee that the link is saturated independent of the speed of the pool?

No, it will not. Messing with the txg timeout is not recommended, since it is an integral part of the ZFS write throttle to make sure that your ZFS pool does not stall. Some of us actually lower it to 1s in order to encourage more consistent behaviour on workloads that require it.
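For reference only, this is where the tunable lives (Linux module parameter on SCALE, sysctl on CORE); look, rather than touch, unless you know exactly why:

cat /sys/module/zfs/parameters/zfs_txg_timeout   # TrueNAS SCALE
sysctl vfs.zfs.txg.timeout                       # TrueNAS CORE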
 