ZIL/L2ARC necessity + RAID choice

brams

Cadet
Joined
Sep 8, 2022
Messages
3
Hi everyone,

I am new to TrueNAS. I have a question about whether I need a ZIL/SLOG and L2ARC cache in my pool, and how to set up my vdevs properly.
I have read the TrueNAS Scale guide and the Cyberjock noob guide. With all this reading, my mind is getting blurry…

Use case:
I have a server that I want to convert into an iSCSI and NFS share server.

Software:
TrueNAS-SCALE-22.02.2.1

Hardware:
Motherboard: X10DRC-LN4+ SuperMicro
RAM quantity: 4 x 16GB RDIMM
Hard drives: 14 x 3TB HDD drives, 2 x 960GB SSD drives, 1x 1TB NVMe drive

My initial (noob) plan is the following layout:
  • 1TB NVMe drive -> L2ARC
  • 2 x 960 GB SSD drives -> ZFS LOG (Mirror)
  • 13 HDD drives -> Data (RAIDZ-3)
  • 1 HDD -> Spare
According to the Cyberjock guide, I see limitations in the layout above: my cache drives are oversized, and 13 drives in a single vdev may be too many.

Question:
About cache: is the rule of thumb to have as much SSD as possible and no cache vdev in the pool, or is having a cache vdev a must?

Also, I cannot make up my mind about the data vdev layout. As I have 14 x 3TB drives, there are many possibilities. Is it better to have a single vdev or multiple vdevs? In a single pool or in two pools?
Should I dedicate one vdev to NFS shares and the other to the iSCSI share, or is that a bad idea?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hello. Apologies in advance for the questions to an already blurry mind. :)

Your motherboard has a built-in SAS3108 "RAID" controller - it can be run in JBOD mode using the mrsas driver with apparent success, but it has significantly fewer hours behind it than the mps or mpr drivers used by the "true HBAs". Make sure that you have your controller in JBOD mode at the least, or consider obtaining a "true HBA" on the SAS2308 or SAS3008 chipset.
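If you want a quick sanity check that the disks are being presented individually to the OS, something like the following should do it (a rough sketch - it assumes the usual SCALE tooling is on hand, and /dev/sda is only an example device name):

lsblk -d -o NAME,MODEL,SERIAL,SIZE
smartctl -i /dev/sda

If the real drive models and serial numbers show up here, rather than a generic LSI/MegaRAID virtual device, the controller is at least passing the drives through individually.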

I have a server that I want to convert into an iSCSI and NFS share server.

Question 1:
Can you describe the client systems you're going to connect to this, and what they're going to run?
For example: "I'm going to run two VMware ESXi hosts connected over 2x1Gbps iSCSI, and a Linux desktop connected to NFS for general file sharing."

iSCSI requires a good deal of resources; 64GB of RAM is certainly a good start, but your board has 24 DIMM slots. Often the best method for improving performance is "more RAM" - but how much you need will depend on your workload from Question 1 above, and on "how fast you want to go." (See footnote [1])
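If you're curious how much of your current RAM ZFS is already using for ARC versus its target ceiling, a quick check (the kstat path below is present once the ZFS module is loaded; the values are in bytes):

grep -wE '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

Comparing "size" against "c_max" gives a feel for whether the ARC is already pinned at its maximum.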

Question 2:
Do you have information on the vendor and model of your SSDs (both the 1TB NVMe and the 960GB SAS/SATA ones)? This will help determine whether they are suitable candidates for SLOG.

About cache: is the rule of thumb to have as much SSD as possible and no cache vdev in the pool, or is having a cache vdev a must?

L2ARC "cache" is not necessary for every workload, but when the workload benefits it can help greatly. A similar rule applies to SLOG, but with the note that it can only accelerate the speed of synchronous writes. You may or may not have these, but the workload suggestion of iSCSI and NFS makes me think that you will. (See footnote [2])

Also, I cannot make up my mind about the data vdev layout. As I have 14 x 3TB drives, there are many possibilities. Is it better to have a single vdev or multiple vdevs? In a single pool or in two pools?
Should I dedicate one vdev to NFS shares and the other to the iSCSI share, or is that a bad idea?

Slight terminology correction - a "vdev" is the collection of disks that provides redundancy. Several "vdevs" are then collected together into a "pool" - so what you may mean here is "should you create multiple pools with different vdev configurations" - the answer is "possibly, depending on the workload."

Random access workloads (many users, small files, a high level of responsiveness required) heavily favor the use of mirrors. Single-user sequential workloads (a backup target, one user copying large files, sometimes several users playing video) can get better copy throughput with RAIDZ. With 14 drives, you might look at a solution of 8 drives in 2-way mirrors (4 vdevs of 2) and 6 drives in a single RAIDZ2 vdev - each pool will have roughly 12TB of usable space, but the first pool would have roughly 4x the "random" performance of the second (in a theoretical sense).
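Purely to illustrate the vdev structure (the pool and device names below are placeholders - you'd normally build these through the SCALE web UI, and /dev/disk/by-id paths are preferable if ever doing it by hand), the two pools would look roughly like this:

# 4 x 2-way mirror vdevs - the "random"/block-storage pool
zpool create fastpool mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh

# one 6-wide RAIDZ2 vdev - the sequential/bulk pool
zpool create bulkpool raidz2 sdi sdj sdk sdl sdm sdn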

Hope this helps, feel free to ask more.

Resources as footnotes:

[1]:

[2]:
 

brams

Cadet
Joined
Sep 8, 2022
Messages
3
Your motherboard has a built-in SAS3108 "RAID" controller - it can be run in JBOD mode using the mrsas driver with apparent success, but it has significantly fewer hours behind it than the mps or mpr drivers used by the "true HBAs". Make sure that you have your controller in JBOD mode at the least, or consider obtaining a "true HBA" on the SAS2308 or SAS3008 chipset.
Thank you very much for your clear and concise explanations!

A quick thing I forgot to mention: TrueNAS SCALE is installed on, and boots from, a 16GB SATA DOM (SSD-DM016-SMCMVN1).
Before installing it, I made sure the drives were in JBOD mode (see picture enclosed). The TrueNAS guide mentions that using JBOD can, for instance, mask disk serial numbers and S.M.A.R.T. health information. The first observations on my system show that serial numbers are visible and S.M.A.R.T. tests can be performed. So everything seems to work fine although I couldn't find the drivers you were talking about in my system.

root@truenas[~]# modinfo mrsas
modinfo: ERROR: Module mrsas not found.

root@truenas[~]# modinfo mps
modinfo: ERROR: Module mps not found.

root@truenas[~]# modinfo mpr
modinfo: ERROR: Module mpr not found.

Question 1:
Can you describe the client systems you're going to connect to this, and what they're going to run?
For example: "I'm going to run two VMware ESXi hosts connected over 2x1Gbps iSCSI, and a Linux desktop connected to NFS for general file sharing."
The aim for me is to run a mix of containers and VMs that would access the storage through an SFP+ connection on the server side. In the long term, the server could be accessed by around 80 to 100 containers/VMs.

Question 2:
Do you have information on the vendor and model of your SSDs (both the 1TB NVMe and the 960GB SAS/SATA ones)? This will help determine whether they are suitable candidates for SLOG.
The 960GB SSD drives are SAMSUNG_MZ7KM960HMJP-00005 units. The 1TB NVMe drive is a KINGSTON SNVS1000G connected using a PCIe adapter.

Two questions pop up after reading your answer.

-> Do you think it is beneficial to use SSDs that big for caching, or should they be used for data? When used for data, is it good practice to have mirrored SSDs and mirrored HDDs of different sizes in the same pool? I read that a pool adapts to the slowest vdev, but my first understanding was that the vdevs in a pool were independent.

-> Is using a hot spare good practice in a pool (especially for mirrors), or is RAIDZ enough?

Random access workloads (many users, small files, a high level of responsiveness required) heavily favor the use of mirrors. Single-user sequential workloads (a backup target, one user copying large files, sometimes several users playing video) can get better copy throughput with RAIDZ. With 14 drives, you might look at a solution of 8 drives in 2-way mirrors (4 vdevs of 2) and 6 drives in a single RAIDZ2 vdev - each pool will have roughly 12TB of usable space, but the first pool would have roughly 4x the "random" performance of the second (in a theoretical sense).
I think I will use two pools and dedicate each one to a different purpose. One could benefit from a SLOG as it will handle iSCSI, and the other would be for sequential workloads. Does that make sense?

Thank you for your precious help!
 

Attachments

  • jbod.PNG (256.3 KB)

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Your motherboard has a built-in SAS3108 "RAID" controller - it can be run in JBOD mode using the mrsas driver with apparent success,

I don't think this has been established to any significant degree of certainty. It's expected to be (merely) okay during normal operations, but it isn't normal operations that's a problem. It's stuff like bus timeouts and device hotswapping and SMART reporting and all the other finer points that are problematic. It is designed as a RAID card driver, and this can be problematic for at least some of the reasons outlined in

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So everything seems to work fine although I couldn’t find the drivers you were talking about in my system.

It would help if I had read the "SCALE" part - try probing for mpt2sas or mpt3sas to see if either of them are bound to PCI devices.
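On SCALE, modinfo only tells you that a module exists on disk - to see which driver is actually bound to the controller, a quick check like this should do it:

lspci -nnk | grep -iE -A3 'sas|raid'   # the "Kernel driver in use:" line names the bound driver
lsmod | grep -iE 'mpt3sas|mpt2sas|megaraid'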

The aim for me is to run a mix of containers and VMs that would access the storage through an SFP+ connection on the server side. In the long term, the server could be accessed by around 80 to 100 containers/VMs.

Definitely separate the "random" and "sequential" workloads here, as you mentioned at the end of your post. NFS does generate synchronous writes, but if it's larger blocks, then it won't be as impactful as it would be for smaller random records.

The 960GB SSD drives are SAMSUNG_MZ7KM960HMJP-00005 units. The 1TB NVMe drive is a KINGSTON SNVS1000G connected using a PCIe adapter.

The Samsungs are coming back as SM863a, which are suitable for SLOG use due to their power-loss protection and high endurance (3.6 PBW, I believe), and they seem like they'll perform around the level of the Intel DC S3520 units. The Kingston is an NV1, for which I found an article about a "parts lottery" [1] - they've been switching between Phison and SMI controllers, and between TLC and QLC NAND - so it might be suitable for L2ARC if it's TLC NAND, but the endurance is still low at 240TBW.

-> Do you think it is beneficial to use SSDs that big for caching, or should they be used for data? When used for data, is it good practice to have mirrored SSDs and mirrored HDDs of different sizes in the same pool? I read that a pool adapts to the slowest vdev, but my first understanding was that the vdevs in a pool were independent.

Don't mix SSDs and HDDs as data vdevs in the same pool - it will only result in your SSDs "slowing down to HDD speed" as you read about. They will be able to deliver faster reads and writes when addressed individually, but they'll still have the metaphorical handbrake on. They're definitely large enough to be used as an individual pool, if you can fit a workload onto them, but then you may need a different SLOG for your iSCSI pool.

-> Is using a hot spare good practice in a pool (especially for mirrors), or is RAIDZ enough?

With many RAIDZ setups, it's often better to just bump up the parity level (eg: RAIDZ2 instead of RAIDZ1+hotspare) - but for mirrors, hot spares start to make sense when you have a delayed ability to physically replace a drive (eg: a hosted solution, greater than four hours to be onsite to physically swap devices) and you have a large number of vdevs. Assuming you have a 16-bay chassis, I would buy another 3TB drive, perform a burn-in test on it, and then have it as a "cold spare" ready to be swapped in.
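For the burn-in itself, a common approach is a full destructive write/verify pass followed by a long SMART self-test (this wipes anything on the disk, and /dev/sdX below is a placeholder - triple-check the device name before running it):

badblocks -b 4096 -ws /dev/sdX   # destructive write/verify pass over the whole disk
smartctl -t long /dev/sdX        # then start a long SMART self-test
smartctl -a /dev/sdX             # and review the results once it finishes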

I think I will use two pools and dedicate each one to a different purpose. One could benefit from a SLOG as it will handle iSCSI, and the other would be for sequential workloads. Does that make sense?

Yes, this makes sense here. If you have a workload that's small enough, using the 960GB Samsungs as a third "fast pool" of SSDs could be another tier of performance above the mirrors as well. You would then need to look at different devices for SLOG though. Your board does have a lot of PCIe slots and appears to support bifurcation (check your BIOS under Chipset -> Northbridge -> IIO1 Configuration -> IOU2 or see page 4-10 in the manual on footnote [2]) so you could use an inexpensive PCIe adapter to put multiple NVMe M.2 cards into a single slot and use a faster device for SLOG.
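For reference, attaching a mirrored SLOG to an existing pool is a one-liner (a sketch only - the pool name and device paths are placeholders, and on SCALE you'd normally do this from the web UI):

zpool add fastpool log mirror /dev/disk/by-id/nvme-DEVICE1 /dev/disk/by-id/nvme-DEVICE2
# log vdevs can also be detached again later with "zpool remove" if you swap in faster devices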

[1] https://www.techpowerup.com/290339/...ardware-spec-lottery-tlc-or-qlc-smi-or-phison

[2] https://www.supermicro.com/manuals/motherboard/C600/MNL-1560.pdf
 

brams

Cadet
Joined
Sep 8, 2022
Messages
3
Thank you for your helpful guidance. Things are much clearer now.
It would help if I had read the "SCALE" part - try probing for mpt2sas or mpt3sas to see if either of them are bound to PCI devices.
Indeed, probing for mpt3sas finds the module, as shown below.

# modinfo mpt3sas
filename: /lib/modules/5.10.120+truenas/kernel/drivers/scsi/mpt3sas/mpt3sas.ko
alias: mpt2sas
version: 35.100.00.00
license: GPL
description: LSI MPT Fusion SAS 3.0 Device Driver
author: Avago Technologies <MPT-FusionLinux.pdl@avagotech.com>
srcversion: <...>
alias: pci:v00001000d000000E7sv*sd*bc*sc*i*
alias: pci:v00001000d000000E4sv*sd*bc*sc*i*
alias: pci:v00001000d000000E6sv*sd*bc*sc*i*
alias: pci:v00001000d000000E5sv*sd*bc*sc*i*
alias: pci:v00001000d000000B2sv*sd*bc*sc*i*
alias: pci:v00001000d000000E3sv*sd*bc*sc*i*
alias: pci:v00001000d000000E0sv*sd*bc*sc*i*
alias: pci:v00001000d000000E2sv*sd*bc*sc*i*
alias: pci:v00001000d000000E1sv*sd*bc*sc*i*
alias: pci:v00001000d000000D1sv*sd*bc*sc*i*
alias: pci:v00001000d000000ACsv*sd*bc*sc*i*
alias: pci:v00001000d000000ABsv*sd*bc*sc*i*
alias: pci:v00001000d000000AAsv*sd*bc*sc*i*
alias: pci:v00001000d000000AFsv*sd*bc*sc*i*
alias: pci:v00001000d000000AEsv*sd*bc*sc*i*
alias: pci:v00001000d000000ADsv*sd*bc*sc*i*
alias: pci:v00001000d000000C3sv*sd*bc*sc*i*
alias: pci:v00001000d000000C2sv*sd*bc*sc*i*
alias: pci:v00001000d000000C1sv*sd*bc*sc*i*
alias: pci:v00001000d000000C0sv*sd*bc*sc*i*
alias: pci:v00001000d000000C8sv*sd*bc*sc*i*
alias: pci:v00001000d000000C7sv*sd*bc*sc*i*
alias: pci:v00001000d000000C6sv*sd*bc*sc*i*
alias: pci:v00001000d000000C5sv*sd*bc*sc*i*
alias: pci:v00001000d000000C4sv*sd*bc*sc*i*
alias: pci:v00001000d000000C9sv*sd*bc*sc*i*
alias: pci:v00001000d00000095sv*sd*bc*sc*i*
alias: pci:v00001000d00000094sv*sd*bc*sc*i*
alias: pci:v00001000d00000091sv*sd*bc*sc*i*
alias: pci:v00001000d00000090sv*sd*bc*sc*i*
alias: pci:v00001000d00000097sv*sd*bc*sc*i*
alias: pci:v00001000d00000096sv*sd*bc*sc*i*
alias: pci:v00001000d0000007Esv*sd*bc*sc*i*
alias: pci:v00001000d000002B1sv*sd*bc*sc*i*
alias: pci:v00001000d000002B0sv*sd*bc*sc*i*
alias: pci:v00001000d0000006Esv*sd*bc*sc*i*
alias: pci:v00001000d00000087sv*sd*bc*sc*i*
alias: pci:v00001000d00000086sv*sd*bc*sc*i*
alias: pci:v00001000d00000085sv*sd*bc*sc*i*
alias: pci:v00001000d00000084sv*sd*bc*sc*i*
alias: pci:v00001000d00000083sv*sd*bc*sc*i*
alias: pci:v00001000d00000082sv*sd*bc*sc*i*
alias: pci:v00001000d00000081sv*sd*bc*sc*i*
alias: pci:v00001000d00000080sv*sd*bc*sc*i*
alias: pci:v00001000d00000065sv*sd*bc*sc*i*
alias: pci:v00001000d00000064sv*sd*bc*sc*i*
alias: pci:v00001000d00000077sv*sd*bc*sc*i*
alias: pci:v00001000d00000076sv*sd*bc*sc*i*
alias: pci:v00001000d00000074sv*sd*bc*sc*i*
alias: pci:v00001000d00000072sv*sd*bc*sc*i*
alias: pci:v00001000d00000070sv*sd*bc*sc*i*
depends: scsi_mod,scsi_transport_sas,raid_class
retpoline: Y
intree: Y
name: mpt3sas
vermagic: 5.10.120+truenas SMP mod_unload modversions
parm: logging_level: bits for enabling additional logging info (default=0)
parm: max_sectors:max sectors, range 64 to 32767 default=32767 (ushort)
parm: missing_delay: device missing delay , io missing delay (array of int)
parm: max_lun: max lun, default=16895 (ullong)
parm: hbas_to_enumerate: 0 - enumerates both SAS 2.0 & SAS 3.0 generation HBAs
1 - enumerates only SAS 2.0 generation HBAs
2 - enumerates only SAS 3.0 generation HBAs (default=0) (ushort)
parm: diag_buffer_enable: post diag buffers (TRACE=1/SNAPSHOT=2/EXTENDED=4/default=0) (int)
parm: disable_discovery: disable discovery (int)
parm: prot_mask: host protection capabilities mask, def=7 (int)
parm: enable_sdev_max_qd:Enable sdev max qd as can_queue, def=disabled(0) (bool)
parm: max_queue_depth: max controller queue depth (int)
parm: max_sgl_entries: max sg entries (int)
parm: msix_disable: disable msix routed interrupts (default=0) (int)
parm: smp_affinity_enable:SMP affinity feature enable/disable Default: enable(1) (int)
parm: max_msix_vectors: max msix vectors (int)
parm: irqpoll_weight:irq poll weight (default= one fourth of HBA queue depth) (int)
parm: mpt3sas_fwfault_debug: enable detection of firmware fault and halt firmware - (default=0)
parm: perf_mode:Performance mode (only for Aero/Sea Generation), options:
0 - balanced: high iops mode is enabled &
interrupt coalescing is enabled only on high iops queues,
1 - iops: high iops mode is disabled &
interrupt coalescing is enabled on all queues,
2 - latency: high iops mode is disabled &
interrupt coalescing is enabled on all queues with timeout value 0xA,
default - default perf_mode is 'balanced' (int)

The Samsungs are coming back as SM863a, which are suitable for SLOG use due to their power-loss protection and high endurance (3.6 PBW, I believe), and they seem like they'll perform around the level of the Intel DC S3520 units. The Kingston is an NV1, for which I found an article about a "parts lottery" [1] - they've been switching between Phison and SMI controllers, and between TLC and QLC NAND - so it might be suitable for L2ARC if it's TLC NAND, but the endurance is still low at 240TBW.
I just checked the NV1 and I lost the Kingston lottery. The SSD I have has an SM2263XT controller, so from my understanding it is QLC NAND. I guess I will not use it for L2ARC, given that and the low TBW you mentioned. Thanks for pointing this out!

I will work on my server now. Have a great day!
 
Last edited by a moderator: