ZIL/L2ARC necessity + RAID choice

brams

Cadet
Joined
Sep 8, 2022
Messages
3
Hi everyone,

I am new to TrueNAS. I have a question about whether I need a ZIL/SLOG and L2ARC cache in my pool, and how to set up my vdevs properly.
I have read the TrueNAS Scale guide and the Cyberjock noob guide. With all this reading, my mind is getting blurry…

Use case:
I have a server that I want to convert into an iSCSI and NFS share server.

Software:
TrueNAS-SCALE-22.02.2.1

Hardware:
Motherboard: X10DRC-LN4+ SuperMicro
RAM quantity: 4 x 16GB RDIMM
Hard drives: 14 x 3TB HDD drives, 2 x 960GB SSD drives, 1x 1TB NVMe drive

My initial (noob) plan is the following layout:
  • 1TB NVMe drive -> L2ARC
  • 2 x 960 GB SSD drives -> ZFS LOG (Mirror)
  • 13 HDD drives -> Data (RAIDZ-3)
  • 1 HDD -> Spare
According to the Cyberjock guide, I see limitations in the layout above: my cache drives are oversized, and 13 drives in a single vdev may be too many.

Question:
About cache: is the rule of thumb to have as much SSD as possible and no cache vdev in the pool, or is having a cache vdev a must?

Also, I cannot make up my mind about the data vdev layout. As I have 14 x 3TB drives, there are many possibilities. Is it better to have a single vdev or multiple vdevs? In a single pool or in two pools?
Should I dedicate one vdev to NFS shares and the other to the iSCSI share, or is that a bad idea?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hello. Apologies in advance for the questions to an already blurry mind. :)

Your motherboard has a built-in SAS3108 "RAID" controller - it can be run in JBOD mode using the mrsas driver with apparent success, but it has significantly fewer hours behind it than the mps or mpr drivers used by the "true HBAs". Make sure that you have your controller in JBOD mode at the least, or consider obtaining a "true HBA" on the SAS2308 or SAS3008 chipset.
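If you want a quick sanity check that the disks are being presented individually to the OS, something like the following should do it (a rough sketch - it assumes the usual SCALE tooling is on hand, and /dev/sda is only an example device name):

lsblk -d -o NAME,MODEL,SERIAL,SIZE
smartctl -i /dev/sda

If the real drive models and serial numbers show up here, rather than a generic LSI/MegaRAID virtual device, the controller is at least passing the drives through individually.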

I have a server that I want to convert into an iSCSI and NFS share server.

Question 1:
Can you describe the client systems you're going to connect to this, and what they're going to run?
For example: "I'm going to run two VMware ESXi hosts connected over 2x1Gbps iSCSI, and a Linux desktop connected to NFS for general file sharing."

iSCSI requires a good deal of resources; 64GB of RAM is certainly a good start, but your board has 24 DIMM slots. Often the best method for improving performance is "more RAM" - but how much you need will depend on your workload from Question 1 above, and on "how fast you want to go." (See footnote [1])
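If you're curious how much of your current RAM ZFS is already using for ARC versus its target ceiling, a quick check (the kstat path below is present once the ZFS module is loaded; the values are in bytes):

grep -wE '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

Comparing "size" against "c_max" gives a feel for whether the ARC is already pinned at its maximum.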

Question 2:
Do you have information on the vendor and model of your SSDs (both the 1TB NVMe and the 960GB SAS/SATA ones)? This will help determine whether they are suitable candidates for SLOG.

About cache: is the rule of thumb to have as much SSD as possible and no cache vdev in the pool, or is having a cache vdev a must?

L2ARC "cache" is not necessary for every workload, but when the workload benefits it can help greatly. A similar rule applies to SLOG, but with the note that it can only accelerate the speed of synchronous writes. You may or may not have these, but the workload suggestion of iSCSI and NFS makes me think that you will. (See footnote [2])

Also, I cannot make up my mind about the data vdev layout. As I have 14 x 3TB drives, there are many possibilities. Is it better to have a single vdev or multiple vdevs? In a single pool or in two pools?
Should I dedicate one vdev to NFS shares and the other to the iSCSI share, or is that a bad idea?

Slight terminology correction - a "vdev" is the collection of disks that provides redundancy. Several "vdevs" are then collected together into a "pool" - so what you may mean here is "should you create multiple pools with different vdev configurations" - the answer is "possibly, depending on the workload."

Random access workloads (many users, small files, a high level of responsiveness required) heavily favor the use of mirrors. Single-user sequential workloads (a backup target, one user copying large files, sometimes several users playing video) can get better copy throughput with RAIDZ. With 14 drives, you might look at a solution of 8 drives in 2-way mirrors (4 vdevs of 2) and 6 drives in a single RAIDZ2 vdev - each pool will have roughly 12TB of usable space, but the first pool would have roughly 4x the "random" performance of the second (in a theoretical sense).
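Purely to illustrate the vdev structure (the pool and device names below are placeholders - you'd normally build these through the SCALE web UI, and /dev/disk/by-id paths are preferable if ever doing it by hand), the two pools would look roughly like this:

# 4 x 2-way mirror vdevs - the "random"/block-storage pool
zpool create fastpool mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh

# one 6-wide RAIDZ2 vdev - the sequential/bulk pool
zpool create bulkpool raidz2 sdi sdj sdk sdl sdm sdn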

Hope this helps, feel free to ask more.

Resources as footnotes:

[1]:

[2]:
 

brams

Cadet
Joined
Sep 8, 2022
Messages
3
Your motherboard has a built-in SAS3108 "RAID" controller - it can be run in JBOD mode using the mrsas driver with apparent success, but it has significantly fewer hours behind it than the mps or mpr drivers used by the "true HBAs". Make sure that you have your controller in JBOD mode at the least, or consider obtaining a "true HBA" on the SAS2308 or SAS3008 chipset.
Thank you very much for your clear and concise explanations!

A quick thing I forgot to mention: TrueNAS SCALE is installed on, and boots from, a 16GB SATA DOM (SSD-DM016-SMCMVN1).
Before installing it, I made sure the drives were in JBOD mode (see picture enclosed). The TrueNAS guide mentions that using JBOD can, for instance, mask disk serial numbers and S.M.A.R.T. health information. The first observations on my system show that serial numbers are visible and S.M.A.R.T. tests can be performed. So everything seems to work fine although I couldn't find the drivers you were talking about in my system.

root@truenas[~]# modinfo mrsas
modinfo: ERROR: Module mrsas not found.

root@truenas[~]# modinfo mps
modinfo: ERROR: Module mps not found.

root@truenas[~]# modinfo mpr
modinfo: ERROR: Module mpr not found.

Question 1:
Can you describe the client systems you're going to connect to this, and what they're going to run?
For example: "I'm going to run two VMware ESXi hosts connected over 2x1Gbps iSCSI, and a Linux desktop connected to NFS for general file sharing."
The aim for me is to run a mix of containers and VMs that would access the storage through an SFP+ connection on the server side. In the long term, the server could be accessed by around 80 to 100 containers/VMs.

Question 2:
Do you have information on the vendor and model of your SSDs (both the 1TB NVMe and the 960GB SAS/SATA ones)? This will help determine whether they are suitable candidates for SLOG.
The 960GB SSD drives are SAMSUNG_MZ7KM960HMJP-00005 units. The 1TB NVMe drive is a KINGSTON SNVS1000G connected using a PCIe adapter.

Two questions pop up after reading your answer.

-> Do you think it is beneficial to use SSDs that big for caching, or should they be used for data? When used for data, is it good practice to have mirrored SSDs and mirrored HDDs of different sizes in the same pool? I read that a pool adapts to the slowest vdev, but my first understanding was that the vdevs in a pool were independent.

-> Is using a hot spare good practice in a pool (especially for mirrors), or is RAIDZ enough?

Random access workloads (many users, small files, a high level of responsiveness required) heavily favor the use of mirrors. Single-user sequential workloads (a backup target, one user copying large files, sometimes several users playing video) can get better copy throughput with RAIDZ. With 14 drives, you might look at a solution of 8 drives in 2-way mirrors (4 vdevs of 2) and 6 drives in a single RAIDZ2 vdev - each pool will have roughly 12TB of usable space, but the first pool would have roughly 4x the "random" performance of the second (in a theoretical sense).
I think I will use two pools and dedicate each one to a different purpose. One could benefit from a SLOG as it will handle iSCSI, and the other would be for sequential workloads. Does that make sense?

Thank you for your precious help!
 

Attachments

  • jbod.PNG (256.3 KB)

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Your motherboard has a built-in SAS3108 "RAID" controller - it can be run in JBOD mode using the mrsas driver with apparent success,

I don't think this has been established to any significant degree of certainty. It's expected to be (merely) okay during normal operations, but it isn't normal operations that's a problem. It's stuff like bus timeouts and device hotswapping and SMART reporting and all the other finer points that are problematic. It is designed as a RAID card driver, and this can be problematic for at least some of the reasons outlined in

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So everything seems to work fine although I couldn’t find the drivers you were talking about in my system.

It would help if I had read the "SCALE" part - try probing for mpt2sas or mpt3sas to see if either of them are bound to PCI devices.
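On SCALE, modinfo only tells you that a module exists on disk - to see which driver is actually bound to the controller, a quick check like this should do it:

lspci -nnk | grep -iE -A3 'sas|raid'   # the "Kernel driver in use:" line names the bound driver
lsmod | grep -iE 'mpt3sas|mpt2sas|megaraid'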

The aim for me is to run a mix of containers and VMs that would access the storage through an SFP+ connection on the server side. In the long term, the server could be accessed by around 80 to 100 containers/VMs.

Definitely separate the "random" and "sequential" workloads here, as you mentioned at the end of your post. NFS does generate synchronous writes, but if it's larger blocks, then it won't be as impactful as it would be for smaller random records.

The 960GB SSD drives are SAMSUNG_MZ7KM960HMJP-00005 units. The 1TB NVMe drive is a KINGSTON SNVS1000G connected using a PCIe adapter.

The Samsungs are coming back as SM863a, which are suitable for SLOG use due to their power-loss protection and high endurance (3.6 PBW, I believe), and they seem like they'll perform around the level of the Intel DC S3520 units. The Kingston is an NV1, for which I found an article about a "parts lottery" [1] - they've been switching between Phison and SMI controllers, and between TLC and QLC NAND - so it might be suitable for L2ARC if it's TLC NAND, but the endurance is still low at 240TBW.

-> Do you think it is beneficial to use SSDs that big for caching, or should they be used for data? When used for data, is it good practice to have mirrored SSDs and mirrored HDDs of different sizes in the same pool? I read that a pool adapts to the slowest vdev, but my first understanding was that the vdevs in a pool were independent.

Don't mix SSDs and HDDs as data vdevs in the same pool - it will only result in your SSDs "slowing down to HDD speed" as you read about. They will be able to deliver faster reads and writes when addressed individually, but they'll still have the metaphorical handbrake on. They're definitely large enough to be used as an individual pool, if you can fit a workload onto them, but then you may need a different SLOG for your iSCSI pool.

-> Is using a hot spare good practice in a pool (especially for mirrors), or is RAIDZ enough?

With many RAIDZ setups, it's often better to just bump up the parity level (eg: RAIDZ2 instead of RAIDZ1+hotspare) - but for mirrors, hot spares start to make sense when you have a delayed ability to physically replace a drive (eg: a hosted solution, greater than four hours to be onsite to physically swap devices) and you have a large number of vdevs. Assuming you have a 16-bay chassis, I would buy another 3TB drive, perform a burn-in test on it, and then have it as a "cold spare" ready to be swapped in.
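For the burn-in itself, a common approach is a full destructive write/verify pass followed by a long SMART self-test (this wipes anything on the disk, and /dev/sdX below is a placeholder - triple-check the device name before running it):

badblocks -b 4096 -ws /dev/sdX   # destructive write/verify pass over the whole disk
smartctl -t long /dev/sdX        # then start a long SMART self-test
smartctl -a /dev/sdX             # and review the results once it finishes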

I think I will use two pools and dedicate each one to a different purpose. One could benefit from a SLOG as it will handle iSCSI, and the other would be for sequential workloads. Does that make sense?

Yes, this makes sense here. If you have a workload that's small enough, using the 960GB Samsungs as a third "fast pool" of SSDs could be another tier of performance above the mirrors as well. You would then need to look at different devices for SLOG though. Your board does have a lot of PCIe slots and appears to support bifurcation (check your BIOS under Chipset -> Northbridge -> IIO1 Configuration -> IOU2 or see page 4-10 in the manual on footnote [2]) so you could use an inexpensive PCIe adapter to put multiple NVMe M.2 cards into a single slot and use a faster device for SLOG.
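For reference, attaching a mirrored SLOG to an existing pool is a one-liner (a sketch only - the pool name and device paths are placeholders, and on SCALE you'd normally do this from the web UI):

zpool add fastpool log mirror /dev/disk/by-id/nvme-DEVICE1 /dev/disk/by-id/nvme-DEVICE2
# log vdevs can also be detached again later with "zpool remove" if you swap in faster devices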

[1] https://www.techpowerup.com/290339/...ardware-spec-lottery-tlc-or-qlc-smi-or-phison

[2] https://www.supermicro.com/manuals/motherboard/C600/MNL-1560.pdf
 

brams

Cadet
Joined
Sep 8, 2022
Messages
3
Thank you for your helpful guidance. Things are much clearer now.
It would help if I had read the "SCALE" part - try probing for mpt2sas or mpt3sas to see if either of them are bound to PCI devices.
Indeed, probing for mpt3sas finds the module, as shown below.

# modinfo mpt3sas
filename: /lib/modules/5.10.120+truenas/kernel/drivers/scsi/mpt3sas/mpt3sas.ko
alias: mpt2sas
version: 35.100.00.00
license: GPL
description: LSI MPT Fusion SAS 3.0 Device Driver
author: Avago Technologies <MPT-FusionLinux.pdl@avagotech.com>
srcversion: <...>
alias: pci:v00001000d000000E7sv*sd*bc*sc*i*
alias: pci:v00001000d000000E4sv*sd*bc*sc*i*
alias: pci:v00001000d000000E6sv*sd*bc*sc*i*
alias: pci:v00001000d000000E5sv*sd*bc*sc*i*
alias: pci:v00001000d000000B2sv*sd*bc*sc*i*
alias: pci:v00001000d000000E3sv*sd*bc*sc*i*
alias: pci:v00001000d000000E0sv*sd*bc*sc*i*
alias: pci:v00001000d000000E2sv*sd*bc*sc*i*
alias: pci:v00001000d000000E1sv*sd*bc*sc*i*
alias: pci:v00001000d000000D1sv*sd*bc*sc*i*
alias: pci:v00001000d000000ACsv*sd*bc*sc*i*
alias: pci:v00001000d000000ABsv*sd*bc*sc*i*
alias: pci:v00001000d000000AAsv*sd*bc*sc*i*
alias: pci:v00001000d000000AFsv*sd*bc*sc*i*
alias: pci:v00001000d000000AEsv*sd*bc*sc*i*
alias: pci:v00001000d000000ADsv*sd*bc*sc*i*
alias: pci:v00001000d000000C3sv*sd*bc*sc*i*
alias: pci:v00001000d000000C2sv*sd*bc*sc*i*
alias: pci:v00001000d000000C1sv*sd*bc*sc*i*
alias: pci:v00001000d000000C0sv*sd*bc*sc*i*
alias: pci:v00001000d000000C8sv*sd*bc*sc*i*
alias: pci:v00001000d000000C7sv*sd*bc*sc*i*
alias: pci:v00001000d000000C6sv*sd*bc*sc*i*
alias: pci:v00001000d000000C5sv*sd*bc*sc*i*
alias: pci:v00001000d000000C4sv*sd*bc*sc*i*
alias: pci:v00001000d000000C9sv*sd*bc*sc*i*
alias: pci:v00001000d00000095sv*sd*bc*sc*i*
alias: pci:v00001000d00000094sv*sd*bc*sc*i*
alias: pci:v00001000d00000091sv*sd*bc*sc*i*
alias: pci:v00001000d00000090sv*sd*bc*sc*i*
alias: pci:v00001000d00000097sv*sd*bc*sc*i*
alias: pci:v00001000d00000096sv*sd*bc*sc*i*
alias: pci:v00001000d0000007Esv*sd*bc*sc*i*
alias: pci:v00001000d000002B1sv*sd*bc*sc*i*
alias: pci:v00001000d000002B0sv*sd*bc*sc*i*
alias: pci:v00001000d0000006Esv*sd*bc*sc*i*
alias: pci:v00001000d00000087sv*sd*bc*sc*i*
alias: pci:v00001000d00000086sv*sd*bc*sc*i*
alias: pci:v00001000d00000085sv*sd*bc*sc*i*
alias: pci:v00001000d00000084sv*sd*bc*sc*i*
alias: pci:v00001000d00000083sv*sd*bc*sc*i*
alias: pci:v00001000d00000082sv*sd*bc*sc*i*
alias: pci:v00001000d00000081sv*sd*bc*sc*i*
alias: pci:v00001000d00000080sv*sd*bc*sc*i*
alias: pci:v00001000d00000065sv*sd*bc*sc*i*
alias: pci:v00001000d00000064sv*sd*bc*sc*i*
alias: pci:v00001000d00000077sv*sd*bc*sc*i*
alias: pci:v00001000d00000076sv*sd*bc*sc*i*
alias: pci:v00001000d00000074sv*sd*bc*sc*i*
alias: pci:v00001000d00000072sv*sd*bc*sc*i*
alias: pci:v00001000d00000070sv*sd*bc*sc*i*
depends: scsi_mod,scsi_transport_sas,raid_class
retpoline: Y
intree: Y
name: mpt3sas
vermagic: 5.10.120+truenas SMP mod_unload modversions
parm: logging_level: bits for enabling additional logging info (default=0)
parm: max_sectors:max sectors, range 64 to 32767 default=32767 (ushort)
parm: missing_delay: device missing delay , io missing delay (array of int)
parm: max_lun: max lun, default=16895 (ullong)
parm: hbas_to_enumerate: 0 - enumerates both SAS 2.0 & SAS 3.0 generation HBAs
1 - enumerates only SAS 2.0 generation HBAs
2 - enumerates only SAS 3.0 generation HBAs (default=0) (ushort)
parm: diag_buffer_enable: post diag buffers (TRACE=1/SNAPSHOT=2/EXTENDED=4/default=0) (int)
parm: disable_discovery: disable discovery (int)
parm: prot_mask: host protection capabilities mask, def=7 (int)
parm: enable_sdev_max_qd:Enable sdev max qd as can_queue, def=disabled(0) (bool)
parm: max_queue_depth: max controller queue depth (int)
parm: max_sgl_entries: max sg entries (int)
parm: msix_disable: disable msix routed interrupts (default=0) (int)
parm: smp_affinity_enable:SMP affinity feature enable/disable Default: enable(1) (int)
parm: max_msix_vectors: max msix vectors (int)
parm: irqpoll_weight:irq poll weight (default= one fourth of HBA queue depth) (int)
parm: mpt3sas_fwfault_debug: enable detection of firmware fault and halt firmware - (default=0)
parm: perf_mode:Performance mode (only for Aero/Sea Generation), options:
0 - balanced: high iops mode is enabled &
interrupt coalescing is enabled only on high iops queues,
1 - iops: high iops mode is disabled &
interrupt coalescing is enabled on all queues,
2 - latency: high iops mode is disabled &
interrupt coalescing is enabled on all queues with timeout value 0xA,
default - default perf_mode is 'balanced' (int)

The Samsungs are coming back as SM863a, which are suitable for SLOG use due to their power-loss protection and high endurance (3.6 PBW, I believe), and they seem like they'll perform around the level of the Intel DC S3520 units. The Kingston is an NV1, for which I found an article about a "parts lottery" [1] - they've been switching between Phison and SMI controllers, and between TLC and QLC NAND - so it might be suitable for L2ARC if it's TLC NAND, but the endurance is still low at 240TBW.
I just checked the NV1 and I lost the Kingston lottery. The SSD I have has an SM2263XT controller, so from my understanding it is QLC NAND. I guess I will not use it for L2ARC, given that and the low TBW you mentioned. Thanks for pointing this out!

I will work on my server now. Have a great day!
 
Last edited by a moderator: