Fusion Pool layout

Joined
Jan 8, 2017
Messages
27
Dear All,

I would like to add SSDs to an existing NAS to benefit from a fusion pool for future writes. My current HDDs are four drives in RAIDZ1.

The doc article says:

"The redundancy of this device must match the redundancy of the other normal devices in the pool."

and

"When adding disks to the pool, select the number of hard disks desired to be in the VDEV and ensure the pool is set to Mirrored."

I am somewhat confused here. My ideal would be to add three NVMe SSDs in RAIDZ1. Would this be possible and advisable? The first statement ("match ... redundancy") sounds like it needs to be four NVMe SSDs in RAIDZ1, like the HDDs. The second statement ("ensure ... mirrored") sounds like it needs to be two mirrored NVMe SSDs.

Regards,

Michael
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The important takeaway is that the loss of the metadata vdev means the loss of your pool. A bonus that applies to your situation: because you have a top-level vdev that is a RAIDZ, you won't be able to remove a special vdev once it's attached.

"Match the redundancy" broadly speaking means "be able to lose the same number of drives without ill effect."

In your case, a RAIDZ1 can survive the loss of a single drive without any data being lost. That aligns with a regular 2-way mirror. A RAIDZ2 can survive two drives lost in a vdev, and it could be argued that is important enough to consider a 3-way mirror for metadata.

Metadata itself is small in size, and layouts like RAIDZn do poorly from an I/O and space utilization perspective with small records. That's why mirrors are generally recommended.
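
If you do end up adding one, the CLI form is a single zpool add with the special keyword - something along these lines, where the pool and device names are placeholders (check yours with zpool status first):

Code:
# add a 2-way mirrored special (metadata) vdev to an existing pool
# "poolname", nvd0 and nvd1 are placeholders - substitute your own pool and NVMe device names
zpool add poolname special mirror nvd0 nvd1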

In your situation - go with mirrors. What are the NVMe SSDs you are choosing?
 
Joined
Jan 8, 2017
Messages
27
Thank you very much. My aim is to store metadata and small files. I am considering:

GIGABYTE AORUS NVMe Gen4 SSD 2TB GP-ASM2NE6200TTTD

XPG SPECTRIX NVMe AS40G-4TT-C 4TB

Does any of that make sense to you?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The Gigabyte SSD has a fairly generous endurance rating (3600 TBW). I can't find one for the ADATA, and in my opinion it loses points right away for having RGB lights. A metadata vdev will end up taking a fair amount of writes, as every transaction will write to it, so make sure to set up SMART monitoring and be ready if it looks like you're burning through your NAND cells at an alarming rate.
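
For the wear monitoring, a periodic look at the NVMe health page is usually enough - for example (device name is a placeholder):

Code:
# NVMe health summary - keep an eye on "Percentage Used" and "Data Units Written"
smartctl -a /dev/nvme0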

As far as the actual space required, run the following command from a shell (use tmux in case it's long-running and you get disconnected) to walk through your pool and get an idea of the space needed.

Code:
# run inside tmux so a dropped SSH session doesn't kill a long-running walk
tmux
# walk the pool and summarize block statistics by type
zdb -LbbbA -U /data/zfs/zpool.cache poolname


You could redirect the output to a file with > /mnt/some/path/filename.txt for parsing later.
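
For example (the path is just an illustration):

Code:
zdb -LbbbA -U /data/zfs/zpool.cache poolname > /mnt/some/path/filename.txt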

Sum up the ASIZE column for everything other than L0 zvol object and L0 ZFS directory and that should give you the result (there's a handy total near the bottom). In your case, you have a RAIDZ1 pool, so your ASIZE may be artificially larger than your LSIZE, but this will give you a bit of headroom.
 
Joined
Jan 8, 2017
Messages
27
Thank you very much for the detailed response!

I did generate the file - it is enclosed below. I am not certain if I interpret it correctly. If I strictly add all ASIZE lines not starting with L0 (for example not L0 object array, not L0 bpobj and the like) and also leave out the lines which seem to just sum up previous lines, I get 256 GB.

Does that mean that a 512 GB SSD special vdev would be 50 % filled? Does that include metadata AND small files, in the sense of a fusion pool, or just metadata?

If it is small files and metadata, then a pair of mirrored 2 TB GIGABYTE AORUS SSDs should be good to go both from a standpoint of capacity and TBW, I think. That would probably be better than three such SSDs in RAIDZ1, and also better than two or three of the probably slightly inferior 4 TB XPG SPECTRIX.
 

Attachments

  • zpool.txt
    8.9 KB

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The numbers for this are just for metadata.

I've got about 284G when I add up the LSIZE for lines that aren't one of these three:

L0 ZFS plain file
L0 ZFS directory
L0 zvol object

Your ASIZE will be inflated for a lot of the smaller records due to the use of RAIDZ; in a mirror configuration you'll be a lot closer to the PSIZE total of the same lines, which is significantly lower at roughly 60G, since you're getting solid wins from compression and no longer losing out to parity.
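
If you want to reproduce that sum yourself, a rough filter over the saved zdb output (path is whatever you redirected to earlier) drops those three data-block lines and leaves the metadata rows to add up:

Code:
# strip the raw data blocks; the remaining block-statistics rows are the metadata portion
grep -Ev 'L0 (ZFS plain file|ZFS directory|zvol object)' /mnt/some/path/filename.txt | less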

Your small files, though, need to be pulled from the block size histogram at the bottom, and the number will depend on where you set your special_small_blocks threshold. Thankfully there's a "cumulative total" column.

Code:
Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:  3.38M  1.69G  1.69G   828K   414M   414M      0      0      0
     1K:  9.88M  12.3G  14.0G   766K   915M  1.30G      0      0      0
     2K:  89.2M   253G   267G   776K  1.87G  3.17G      0      0      0
     4K:  93.3M   464G   731G   378K  1.88G  5.05G      3    12K    12K
     8K:  34.6M   376G  1.08T   161K  1.74G  6.79G   126M  1008G  1008G
    16K:   117M  2.09T  3.17T   289M  4.51T  4.52T   187M  3.65T  4.64T
    32K:  43.5M  1.95T  5.13T  1.59M  52.3G  4.57T  49.6M  2.00T  6.64T
    64K:  53.3M  4.55T  9.67T   108K  9.44G  4.58T  57.7M  5.09T  11.7T
   128K:  95.9M  12.0T  21.7T   247M  30.9T  35.5T   120M  19.8T  31.5T
   256K:      0      0  21.7T      0      0  35.5T    121  44.7M  31.5T
   512K:      0      0  21.7T      0      0  35.5T      0      0  31.5T
     1M:      0      0  21.7T      0      0  35.5T      0      0  31.5T
     2M:      0      0  21.7T      0      0  35.5T      0      0  31.5T
     4M:      0      0  21.7T      0      0  35.5T      0      0  31.5T
     8M:      0      0  21.7T      0      0  35.5T      0      0  31.5T
    16M:      0      0  21.7T      0      0  35.5T      0      0  31.5T


I think something's a bit amiss here in the zdb output because of the size increase from LSIZE to PSIZE. I'll need to set up a RAIDZ setup and try stuffing some small blocks on it to see if I have the same results.

Looking at PSIZE again (because you'll be using mirrors): if you set special_small_blocks to grab anything 8K or smaller, that's already 1.08T of space required. Bump that to 16K and it's immediately 3.17T. That's a pretty hefty chunk of space.
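
For reference, special_small_blocks is a per-dataset property rather than pool-wide, so you can enable it selectively (dataset name below is just an example):

Code:
# send blocks of 8K or smaller from this dataset to the special vdev
zfs set special_small_blocks=8K poolname/dataset
# verify the setting
zfs get special_small_blocks poolname/dataset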
 
Joined
Jan 8, 2017
Messages
27
Thanks again!

The actual pool setup is: CPU E5-2620 v4, 128 GB RAM, 4 x 18 TB WD181KRYZ in RAIDZ1, Samsung SSD 970 PRO 1TB as Cache, Intel Optane SSD DC P4801X (SSDPE21K100GA01) 100 GB as Log. The use case is mostly backup (often large files such as VMs) and user file sharing (SMB + NFS), some live access to Nextcloud files via NFS, and some performance-critical Microsoft SQL in an iSCSI share.

To me, the viable options sound like:
1) add mirror of 2 TB SSDs and stick to metadata
2a/b) add mirror of 4 TB SSDs or three 2 TB SSDs as RAIDZ1 and allow metadata plus 8K small files
3) add three 4 TB SSDs as RAIDZ1 and allow metadata plus 16K small files

Would you consider any of these three/four options smart and worthwhile?

An alternative approach (doing both at the same time would not be possible with this hardware due to PCIe lane restrictions) might be to add SSDs just for a separate pool for the iSCSI share with the Microsoft SQL files. That would shift the benefits away from user file sharing and Nextcloud files (usually larger than 16K, so they would only benefit from the metadata) towards Microsoft SQL.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
1) add mirror of 2 TB SSDs and stick to metadata
2a/b) add mirror of 4 TB SSDs or three 2 TB SSDs as RAIDZ1 and allow metadata plus 8K small files
3) add three 4 TB SSDs as RAIDZ1 and allow metadata plus 16K small files

Would you consider any of these three/four options smart and worthwhile?

Of these three, I like either 1) or 2a) - I'd shy away from RAIDZ for metadata personally just based on the poorer performance for random I/O, but if you're already willing to use HDDs in RAIDZ1 for iSCSI/block storage, that might be something you're able to endure. Again though - a top-level RAIDZ vdev existing means device removal isn't possible for you.

Your "Option 4" of using a pair of SSDs for iSCSI is obviously more beneficial for the VMFS/SQL workload. You could even trial this with a pair of SSDs that could also fulfill "Option 1/2a" duties if needed - create the pool, migrate (or copy) your data there, and test it out. See if removing this I/O helps your other pool; if not, move the data back, destroy the pool, and add the SSDs as a mirrored special vdev for metadata.
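
Roughly, that trial could look like this from the CLI (pool and device names are placeholders):

Code:
# create a mirrored SSD pool to trial the iSCSI/SQL workload
zpool create ssdpool mirror nvd0 nvd1
# ...migrate or copy the SQL zvol/share there and test...
# if it doesn't pay off: move the data back, then repurpose the SSDs
zpool destroy ssdpool
zpool add poolname special mirror nvd0 nvd1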
 