NVMe drive upgrades to add a mirrored SLOG and an L2ARC

Status
Not open for further replies.

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
Hi All,

We are looking for some advice on some upgrades to our existing company file server. Here is our current spec:
(attached screenshot of system hardware specs: Screen Shot 2018-10-30 at 14.51.36.png)

Our 12x 4TB WD Red Pro drives are set up as one pool made of two six-drive raidz2 vdevs:
(attached screenshot of the pool layout: Screen Shot 2018-10-30 at 15.01.57.png)

That gives us about 28TiB of usable space, of which we are currently using about 56%.

The server is connected to our network switches via a 2x10Gb LACP lagg. We have 20-30 users connected over 1Gb Ethernet working on large design files over AFP file sharing. We also have a few other servers connecting to the file server over 10Gb links, reading and writing files using SMB and NFS.

This setup has worked well for us for the past 3 years, but there are a few issues we want to address this Christmas, if Santa will provide the funds. Before we can request the funds, though, we need to spec out the hardware.

The first issue is that our sync writes are atrocious over NFS, so we want to add a dedicated SLOG. Having perused the forum to see what others are suggesting, we are thinking of getting a couple of Intel Optane 900p NVMe PCIe drives. Yes, we are aware they don't have enhanced power-loss protection, but our server is on a UPS, so we can live with that. We plan on mirroring them for paranoia reasons, and they are fairly cheap drives anyway (£200 each). Yes, we know that 240GB is overkill for a SLOG in our case, but we are failing to see what else might be suitable. Has anyone got any better recommendations for our SLOG? We could go the route of an M.2 adapter card such as the Supermicro AOC-SLG3-2M2 with two smaller M.2 SSDs; any thoughts?
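For reference, our understanding is that attaching the pair as a mirrored log vdev is a one-liner once the drives are visible; roughly something like this (the device names and the pool name "tank" are placeholders for whatever we end up with):

    # check the NVMe drives are visible (they enumerate as nvd* on FreeBSD)
    nvmecontrol devlist

    # add both Optanes to the pool as a mirrored SLOG, then verify
    zpool add tank log mirror /dev/nvd0 /dev/nvd1
    zpool status tank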

The other issue we are having is that when a lot of people are working off the server, the latency of directory browsing seems to tank. Throughput stays high, but it feels like the drives get so busy reading large chunks of data that small reads suffer. So while throughput is high, the perceived speed of the filesystem is low because of the directory-browsing latency. We feel that a high-speed, low-latency L2ARC cache would improve things here; correct us if we are wrong and we should be tuning elsewhere. We are not sure what device, speed, size or technology should be used for an L2ARC here, so we are really looking for recommendations.
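Likewise, if an L2ARC does turn out to be the answer, our understanding is that it would just be a cache vdev added to the pool, along these lines (again, device and pool names are placeholders):

    # a cache (L2ARC) device needs no redundancy; if it dies, reads simply fall back to the pool
    zpool add tank cache /dev/nvd2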

We are currently on FreeNAS 9.3, but in the process of upgrading to the latest version once we have got the last of the mission-critical VMs/jails off the system, so we can be comfortable taking the file server offline outside of company hours to do the upgrades. We are assuming the latest version of FreeNAS has no issues with NVMe drives, since we are running on bare metal. We are also confident that our hardware supports NVMe, as the Supermicro AOC-SLG3-2M2 card is listed as compatible with our motherboard (also from Supermicro); however, we have a slight doubt that these newer drives might use a newer NVMe version that is not compatible.

Thanks for reading and we look forward to your suggestions.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
Please provide hardware specs or it will be difficult to make suggestions.

General Notes:
  • Optane "theoretically" doesn't need PLP, as it writes directly to the media with no volatile cache.
  • Do you have "sync=always" on? If not, a SLOG may not help much, or at all.
  • L2ARC should only be added once you have maxed out ARC; if you CAN add RAM, do that first.
  • L2ARC uses RAM to index its contents, so unless you have over 64GB it's not likely to help and may actually make things worse; for your use case, 128-256GB of RAM/ARC would make sense. (A quick way to check your current sync setting and ARC size is sketched below.)
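Quick ways to check where you stand from the FreeNAS shell (the dataset name is just an example):

    # is sync left at standard, or forced to always/disabled, on the dataset in question?
    zfs get sync tank/dataset

    # current ARC size versus the configured ceiling
    sysctl kstat.zfs.misc.arcstats.size
    sysctl vfs.zfs.arc_max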
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Depending on what you use NFS for, you could just set sync=disabled and get better performance than any SLOG. That being said, the 900p makes an excellent SLOG. But if you ever want even higher write endurance, consider the P3700 or P4800X.

Regarding the L2ARC, it is complicated. What's your current ARC hit rate like? There is some tweaking you can do, like prefetch etc., to make directories load faster. Also consider adding more vdevs; 20-30 users sounds like a lot for 2 vdevs.
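You can read the hit rate off the Reporting graphs in the GUI, or pull the raw counters from the shell, something like:

    # hit ratio = hits / (hits + misses)
    sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

    # or the summarised view, if arc_summary.py is present on your build
    arc_summary.py | less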
 

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
@Jessep We did provide specs, see first line/image of our post.

With Optane we are aware of that, and since we are on a UPS we are not worried about power-loss corruption. We are asking if there is a better solution for us than the 900p drives.

Sync is set to the default on the datasets. Our sync writes are slow; our async writes are fine. From our understanding, setting sync=always forces all writes to be sync. How does setting sync=always make a SLOG device worse off?

Currently, we have 128GB of RAM in the system. We cannot add more without also adding another CPU, so that is out of the question.
 

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
@Ender117 NFS is mounted by ESXi, which forces all writes to be sync, so it's slow there. We know we can disable sync but would rather keep it on for data protection. We also mount NFS in some Ubuntu VMs with a mix of sync and async workloads. The P3700 and P4800X drives are going to be too expensive for Santa, and we couldn't put that much money down on a device where only 1% of its storage capacity is used. Looking at the write endurance of the 900p drives, they have more than adequate endurance for our workload. We are just wondering if there is anything smaller/more cost-effective that could fit the bill as a SLOG device for us.
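For anyone checking our endurance maths, the back-of-the-envelope looks roughly like this (the PBW figure is the commonly quoted rating for the 280GB 900p, so treat it as approximate, and the daily write volume is a deliberately generous guess at our sync traffic):

    rated endurance     ≈ 5 PBW (about 5,000 TB written over the drive's life)
    assumed SLOG writes ≈ 1 TB/day
    expected life       ≈ 5,000 TB ÷ 1 TB/day ≈ 5,000 days, i.e. well over a decade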

Adding another vdev at this point won't fly. We would have to add another 6 HDDs, an external chassis to fit them, and an HBA card with an external port to connect up the new chassis. It's a good idea if it can help, but the actual implementation is harder.

As for the ARC data, we are not 100% sure what we are looking for. If you have some resources we can look at to brush up on how the ARC and L2ARC caches work, we are happy to take a look.

Here are our ARC graphs if they help anyone understand our usage:

(attached screenshot of ARC reporting graphs: Screen Shot 2018-10-30 at 16.29.01.png)
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
@Ender117 NFS is mounted by ESXi, which forces all writes to be sync, so it's slow there. We know we can disable sync but would rather keep it on for data protection. We also mount NFS in some Ubuntu VMs with a mix of sync and async workloads. The P3700 and P4800X drives are going to be too expensive for Santa, and we couldn't put that much money down on a device where only 1% of its storage capacity is used. Looking at the write endurance of the 900p drives, they have more than adequate endurance for our workload. We are just wondering if there is anything smaller/more cost-effective that could fit the bill as a SLOG device for us.

Adding another vdev at this point won't fly. We would have to add another 6 HDDs, an external chassis to fit them, and an HBA card with an external port to connect up the new chassis. It's a good idea if it can help, but the actual implementation is harder.

As for the ARC data, we are not 100% sure what we are looking for. If you have some resources we can look at to brush up on how the ARC and L2ARC caches work, we are happy to take a look.

Here are our ARC graphs if they help anyone understand our usage:

(quoted attachment: ARC reporting graphs screenshot)
OK, based on your OP I thought this was being used as a general filer. But yeah, backing a hypervisor is a prime use case for a SLOG. If you can buy used, P3700s can be had at ~200-250 each. Slightly slower than the 900p but with much higher endurance. I understand that the 900p is enough for you; just pointing out an option.

Your ARC hit rate looks high, yet the ARC size is not maxed out. By default most of the RAM (~120G in your case) can be used for ARC, yet yours only uses ~80G. I am afraid that simply throwing more hardware at it may not help here (read-wise).

Please provide more details on your hardware (especially the chassis) and usage pattern (both as backing for the hypervisor hosts and as a filer), in IOPS. I am thinking you might be best off switching to mirrors instead of raidz, at least for the VM part.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
@Jessep We did provide specs, see first line/image of our post.

With Optane we are aware of that, and since we are on a UPS we are not worried about power-loss corruption. We are asking if there is a better solution for us than the 900p drives.

Sync is set to the default on the datasets. Our sync writes are slow; our async writes are fine. From our understanding, setting sync=always forces all writes to be sync. How does setting sync=always make a SLOG device worse off?

Currently, we have 128GB of RAM in the system. We cannot add more without also adding another CPU, so that is out of the question.

Apologies, I tend not to click on pictures.

Your specs list 4X32GB LR DIMMs so you could add an additional 4X32GB DIMMs without requiring a second CPU.

What are your actual use cases? (EDIT: you responded while I was writing this)

"sync=always" doesn't make SLOG worse, if you aren't using that setting or a protocol that does the same thing a SLOG isn't going to IMPROVE performance as async writes will always be faster.

To quote cyberjock:
Your pool/dataset sync setting overrides ALL writes if you choose disabled or always. It makes every single write a sync write or every single write an async write.

NFS "supports" the sync write flag. So any write that is specified as a sync write will be a sync write if you have it set to standard. So what writes are sync writes normally? That depends on what you are copying and what program you are using to do the copying. Some programs support and do sync writes, some don't. Your benchmark for the always/standard/disabled is only valid for the exact client software you used to copy the data and only in the method you used. Other than that, it goes out the window.

Some other protocols have no sync write flag (iSCSI for one). So standard and disabled should be the same for those protocols. sync=always will of course make your non-sync protocol behave kind of like a sync protocol, with all of the performance-killing properties associated with it.

ESXi is an application that sync flags EVERY SINGLE WRITE. That completely kills zpool performance, but has the advantage of protecting your data at pretty much every performance cost.

You can make every protocol appear to have sync writes for everything by setting sync=always. That will also kill zpool performance in a similar fashion. That's why I'm saying sync=always with iSCSI is no different than NFS with sync=standard.

So yes your write performance will improve using a SLOG with your listed workload.

RaidZ2 is a poor choice for ESXi or VM workloads (low IOPS).
 

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
Your specs list 4X32GB LR DIMMs so you could add an additional 4X32GB DIMMs without requiring a second CPU.

We may be wrong, but our understanding is that the motherboard has 8 DIMM slots, 4 per CPU. For the other 4 DIMMs to work, a second CPU needs to be present. Based on that, we cannot add more RAM without either getting another CPU or swapping out the 32GB DIMMs for higher-capacity ones.

RaidZ2 is a poor choice for ESXi or VM workloads (low IOPS).

Yep, we know it is not ideal. However, we are not running VMs off this zpool. We are mounting NFS inside VMs to read and write some working data from the pool. That data usage is low-IOPS, much like the file-sharing use. As for ESXi, we are mounting NFS just to write backups to. Our ESXi host has its own drives and uses the NFS share only to back up periodically; no VMs are run directly over NFS. Because ESXi mounts NFS with sync enabled, our backups are dog slow due to the lack of a SLOG device. Likewise, in the VMs that talk to the zpool over NFS, any sync writes are slow for the same reason.

Your ARC hit rate looks high, yet the ARC size is not maxed out. By default most of the RAM (~120G in your case) can be used for ARC, yet yours only uses ~80G. I am afraid that simply throwing more hardware at it may not help here (read-wise).

We are currently in a quieter period and slowness hasn't been much of an issue for the past few weeks, but we have seen the ARC size cap out at 100G for weeks on end in the past. If an L2ARC device won't help, is there anything we can tune to help with the existing setup?

Please provide more details on your hardware (especially the chassis) and usage pattern (both as backing for the hypervisor hosts and as a filer), in IOPS. I am thinking you might be best off switching to mirrors instead of raidz, at least for the VM part.

Our chassis is a Supermicro 4U SC846BE16-R920W/R1280W. We are sure you are going to ask about the other 12 bays: 6 of them are used by a set of high-capacity drives as a warm archive, another couple are used for spare HDDs, and another two are used by an SSD mirror that some jails run off of.

As for the VM stuff, we have explained our usage in a bit more detail above.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
Ahhh, X10DRL-i not X10DRi, my mistake.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Yep, we know it is not ideal. However, we are not running VMs off this zpool. We are mounting NFS inside VMs to read and write some working data from the pool. That data usage is low-IOPS, much like the file-sharing use. As for ESXi, we are mounting NFS just to write backups to. Our ESXi host has its own drives and uses the NFS share only to back up periodically; no VMs are run directly over NFS. Because ESXi mounts NFS with sync enabled, our backups are dog slow due to the lack of a SLOG device. Likewise, in the VMs that talk to the zpool over NFS, any sync writes are slow for the same reason.

I find it hard to make a recommendation because there always seems to be an essential piece of info missing. :) However, the general consensus is: if you cannot afford to lose the last ~10 seconds of data (with default settings) in the event of a power loss/crash/etc., get a SLOG; otherwise just set sync=disabled. Most people don't consider backups to be one of those cases; they just redo the backup from the original. However, it's your money/Santa after all, so if you want a SLOG, the 900p is a good choice.


We are currently in a quieter period and slowness hasn't been much of an issue for the past few weeks, but we have seen the ARC size cap out at 100G for weeks on end in the past. If an L2ARC device won't help, is there anything we can tune to help with the existing setup?

Try googling ZFS prefetch tuning. It may speed up reads by loading data into RAM ahead of time; OTOH it can also cause memory pressure and hurt performance. However, please consider switching to striped mirrors instead of raidz, especially since you are now only using 1% of the space. There is only so much caching can do.
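For reference, the main knob is a sysctl/loader tunable; check it before changing anything, and since directory browsing is the pain point the ARC metadata counters are worth a glance too (treat this as a pointer, not a recommendation):

    # 0 = prefetch enabled (the default when you have plenty of RAM), 1 = disabled
    sysctl vfs.zfs.prefetch_disable

    # how much of the ARC may hold / currently holds metadata (directory info lives here)
    sysctl vfs.zfs.arc_meta_limit vfs.zfs.arc_meta_used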
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Looking for a little clarification before I make some suggestions; see bolded areas of the pull-quotes below.

The server is connected to our network switches via a 2x10Gb LACP lagg. We have 20-30 users connected over 1Gb Ethernet working on large design files over AFP file sharing. We also have a few other servers connecting to the file server over 10Gb links, reading and writing files using SMB and NFS.

Is the bolded "file server" here the FreeNAS machine, or do you have a VM on your ESXi server using the FreeNAS NFS export to back a VMDK?

Yep, we know it is not ideal. However, we are not running VMs off this zpool. We are mounting NFS inside VMs to read and write some working data from the pool. That data usage is low-IOPS, much like the file-sharing use.

...

Likewise, in the VMs that talk to the zpool over NFS, any sync writes are slow for the same reason.

Are the guest VMs doing the NFS mounts, or is the hypervisor (ESXi) mounting the NFS export?

If your guest OS explicitly requires/requests sync writes, then sure, you need an SLOG - but just make sure you aren't unnecessarily demanding them for data that doesn't need it.

As for ESXi, we are mounting NFS just to write backups to. Our ESXi host has its own drives and uses the NFS share only to back up periodically; no VMs are run directly over NFS. Because ESXi mounts NFS with sync enabled, our backups are dog slow due to the lack of a SLOG device.

What backup software is being used, or are you manually doing snapshots/clones? You can always set sync=disabled on the backup dataset in order not to compromise integrity where it's needed elsewhere.
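If it is its own dataset, that's a one-liner (the dataset name below is just an example):

    # relax sync only on the backup target; every other dataset keeps its own setting
    zfs set sync=disabled tank/esxi-backups

    # confirm what each dataset ended up with
    zfs get -r sync tank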
 
Joined
Dec 29, 2014
Messages
1,135
Just a note on the Optane 900P. I have one in each of my FreeNAS systems, and I have been very happy with them. I have mine partitioned, since FreeNAS won't use all of the 280GB for an SLOG. I split them up and use part of them as an L2ARC. I haven't done any testing to see if the L2ARC improves my read performance, but it sounds cool. Using the Optane as an SLOG gave me a 4-8X boost in NFS write performance from ESXi over 10G NFS.
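For what it's worth, the split was done with gpart from the CLI before adding the partitions to the pool; something along these lines (the sizes, labels and pool name here are placeholders rather than my exact layout):

    # small slice for SLOG, a bigger one for L2ARC, on the same 900P
    gpart create -s gpt nvd0
    gpart add -t freebsd-zfs -l slog0 -s 16g nvd0
    gpart add -t freebsd-zfs -l l2arc0 -s 200g nvd0

    # add the partitions to the pool by GPT label
    zpool add tank log gpt/slog0
    zpool add tank cache gpt/l2arc0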
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Using the Optane as an SLOG gave me a 4-8X boost in NFS write performance from ESXi over 10G NFS.
Can you specify the 4X-8X boost? If it's compared to no SLOG, 8X seems a bit small.
 
Joined
Dec 29, 2014
Messages
1,135
Can you specify the 4X-8X boost? If it's compared to no SLOG, 8X seems a bit small.

I would have to go back and look for the detailed post I did on this. I'll try to do that later. The short version is that I was getting 500-700K write performance from my ESXi hosts to FreeNAS via NFS over a 10G network. I wasn't happy with that, and after some research and discussion, an SLOG seemed like the thing to boost NFS sync write performance. I settled on the Optane 900P because of the speed and the fact that I didn't want to burn any drive slots. Now I can get fairly sustained periods of 4G write speeds, and that meets my needs. There are others in the forum with a much better understanding of FreeNAS internals who can explain why FreeNAS will only use a certain amount of SLOG space. I am sure part of that is because of my config (details in the sig). The 7.2K SATA drives and RAIDZ2 vdevs aren't the best layout for performance. I know mirrors would give me better performance, but I wasn't willing to give up the space. Also, my environment is mostly a lab, so my needs aren't too demanding and I didn't feel like maintaining multiple pools. Once I got to where I was happy with the performance, I stopped tinkering. Even before the SLOG and L2ARC, I could consistently get 8G read performance. I am very happy with that.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Interesting

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
I would have to go back and look for the detailed post I did on this. I'll try to do that later. The short version is that I was getting 500-700K write performance from my ESXi hosts to FreeNAS via NFS over a 10G network. I wasn't happy with that, and after some research and discussion, an SLOG seemed like the thing to boost NFS sync write performance. I settled on the Optane 900P because of the speed and the fact that I didn't want to burn any drive slots. Now I can get fairly sustained periods of 4G write speeds, and that meets my needs. There are others in the forum with a much better understanding of FreeNAS internals who can explain why FreeNAS will only use a certain amount of SLOG space. I am sure part of that is because of my config (details in the sig). The 7.2K SATA drives and RAIDZ2 vdevs aren't the best layout for performance. I know mirrors would give me better performance, but I wasn't willing to give up the space. Also, my environment is mostly a lab, so my needs aren't too demanding and I didn't feel like maintaining multiple pools. Once I got to where I was happy with the performance, I stopped tinkering. Even before the SLOG and L2ARC, I could consistently get 8G read performance. I am very happy with that.

If you meant 700 kbit/s vs 4 Gbit/s, that's a ~5000x increase; if 700 kByte/s vs 4 Gbit/s, that's ~600x. Either is a lot larger than 4-8x?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
https://forums.freenas.org/index.php?threads/testing-the-benefits-of-slog-using-a-ram-disk.56561/

900Ps are a good choice. I'd go with a cheap Samsung EVO drive for L2ARC; SATA or NVMe is your choice. SATA will have a max burst of ~550MB/s.

The next upgrade would probably be to mirrors if you need more IOPS performance. Or add another 24-bay external chassis and start adding more 6-drive raidz2 vdevs.

At >56% utilization it's at least time to start thinking about how you would increase capacity.

May be easiest to just swap into a 24/36 bay chassis.

Doubling the vdevs will double your IOPS.

Your current fantastic ARC cache hit rates are probably due to a long-since-fixed bug which was causing repetitive reads, resulting in misleading ARC stats. Unfortunately, your hit ratio will probably drop significantly when you upgrade FreeNAS and the bug is fixed.
 

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
Looking for a little clarification before I make some suggestions; see bolded areas of the pull-quotes below.

Is the bolded "file server" here the FreeNAS machine, or do you have a VM on your ESXi server using the FreeNAS NFS export to back a VMDK?

Are the guest VMs doing the NFS mounts, or is the hypervisor (ESXi) mounting the NFS export?

If your guest OS explicitly requires/requests sync writes, then sure, you need an SLOG - but just make sure you aren't unnecessarily demanding them for data that doesn't need it.

What backup software is being used, or are you manually doing snapshots/clones? You can always set sync=disabled on the backup dataset in order not to compromise integrity where it's needed elsewhere.

Yes, when we refer to the file server we mean FreeNAS. All VMs on the ESXi server are running their disks (VMDK files) on physical disks local to that server, not over any network or remote share.

Both ESXi and the guest OSes are mounting FreeNAS NFS shares.
ESXi mounts them only to write backups of the guest OS disks to FreeNAS via a periodic cron task running on ESXi, and to have access to a number of ISOs stored on FreeNAS. We are using ghettoVCB to do the backups.

A number of guest OSes mount FreeNAS NFS shares to give them access to the shared office file server. Others mount NFS shares to gain access to larger disk capacities to store things like backups, logs and some transactional data. Some of those writes are sync writes, and we feel they should stay that way.

All of the NFS mounts do very little in terms of IOPS, and the throughput of our zpool is more than adequate for our needs. It's just sync writes that suffer due to our raidz2 disk setup, but because those disks are primarily used as office file server storage, striped raidz2 vdevs give us the best bang for our buck/storage.

We are pretty sure that a SLOG device is the way to go; it's just a question of whether the 900p is the right choice or there is a better alternative. To echo what Elliot Dierksen said, we think the 900p is ideal because of the speed of the PCIe interface and because it doesn't burn any HDD slots, and we have plenty of PCIe slots to spare. The only thing that saddens us is that a lot of the storage will go to waste, as we know a SLOG device doesn't need to be very spacious. We have seen reports that you cannot over-/under-provision these drives to help with write endurance, but we would suspect that if you gave the full 240GB to the SLOG the drive would do decent wear-levelling anyway, essentially achieving the same thing as over-/under-provisioning?

At >56% utilization it's at least time to start thinking about how you would increase capacity.

May be easiest to just swap into a 24/36 bay chassis.

As things get quieter around the Xmas season, that's when housekeeping starts to happen. We will start to archive off old client data to warm and cold archives. We aim to keep our disk utilization around 30% after the housekeeping. While we do like to hoard data and never delete anything just in case, we are pretty good at keeping our working data to a minimum to keep things running nippy.
 
Joined
Dec 29, 2014
Messages
1,135
If you meant 700kBit/s vs 4Gbit/s that's 5000x increase, if 700kByte/s vs 4Gbit/s that's 600x. Either is larger than 4-8x?

Edit: I am losing my freaking mind! Sorry, it has been a rough week or so (roughly 3 all-nighters in 10 days).

My "X" may be off, but the transfer difference was 500-700Mb up to 4Gb. Yes, "M" instead of "K". Argh!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, when we refer to the file server we mean FreeNAS. All VMs on the ESXi server are running their disks (VMDK files) on physical disks local to that server, not over any network or remote share.

Understood.

Both ESXi and the guest OSes are mounting FreeNAS NFS shares.
ESXi mounts them only to write backups of the guest OS disks to FreeNAS via a periodic cron task running on ESXi, and to have access to a number of ISOs stored on FreeNAS. We are using ghettoVCB to do the backups.

If the NFS export being used as a backup target is a separate dataset (and if it's not, why not?), you could specify sync=disabled at that level to give you maximum throughput there. Ditto for the ISO export; why stress your disks unnecessarily (especially if you're using RAIDZ2)?

A number of guest OSes mount FreeNAS NFS shares to give them access to the shared office file server. Others mount NFS shares to gain access to larger disk capacities to store things like backups, logs and some transactional data. Some of those writes are sync writes, and we feel they should stay that way.

All of the NFS mounts do very little in terms of IOPS, and the throughput of our zpool is more than adequate for our needs. It's just sync writes that suffer due to our raidz2 disk setup, but because those disks are primarily used as office file server storage, striped raidz2 vdevs give us the best bang for our buck/storage.

Absolutely fair. Just wanted to make sure you weren't being unnecessarily demanding on the data. OLTP/real-time data and things like that; yes, definitely sync writes, and SLOG will be required.

We are pretty sure that a SLOG device is the way to go; it's just a question of whether the 900p is the right choice or there is a better alternative. To echo what Elliot Dierksen said, we think the 900p is ideal because of the speed of the PCIe interface and because it doesn't burn any HDD slots, and we have plenty of PCIe slots to spare. The only thing that saddens us is that a lot of the storage will go to waste, as we know a SLOG device doesn't need to be very spacious. We have seen reports that you cannot over-/under-provision these drives to help with write endurance, but we would suspect that if you gave the full 240GB to the SLOG the drive would do decent wear-levelling anyway, essentially achieving the same thing as over-/under-provisioning?

Since you indicated above that the P4800X is out of your price range, the only other similarly-performing alternative would be a P3700; you might be able to find one on clearance for similar pricing to the Optane 900p. Regarding wear-levelling, Optane doesn't wear-level the same way traditional NAND does, so Intel disabled the hardware overprovisioning. Annoying, but you can get similar results by doing a secure erase and manually creating a smaller (16GB?) partition, then adding that partition as the SLOG.
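A rough sketch of that last step, assuming the pool is called tank and the Optanes show up as nvd0/nvd1 (do the secure erase first with Intel's own tooling; that part is left out here):

    # small partition on each drive, then mirror just those partitions as the SLOG
    gpart create -s gpt nvd0 && gpart add -t freebsd-zfs -l slog0 -s 16g nvd0
    gpart create -s gpt nvd1 && gpart add -t freebsd-zfs -l slog1 -s 16g nvd1
    zpool add tank log mirror gpt/slog0 gpt/slog1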
 