NVMe drive upgrades to add a mirrored SLOG and an L2ARC

Status
Not open for further replies.

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
Hi All,

We are looking for some advice on some upgrades to our existing company file server. Here is our current spec:
(attached screenshot of system hardware specs: Screen Shot 2018-10-30 at 14.51.36.png)

Our 12x 4TB WD Red Pro drives are set up as one pool made of two six-drive raidz2 vdevs:
(attached screenshot of the pool layout: Screen Shot 2018-10-30 at 15.01.57.png)

That gives us about 28TiB of usable space, of which we are currently using about 56%.

The server is connected to our network switches via a 2x10Gb LACP lagg. We have 20-30 users connected over 1Gb Ethernet working on large design files over AFP file sharing. We also have a few other servers connecting to the file server over 10Gb links, reading and writing files using SMB and NFS.

This setup has worked well for us for the past 3 years, but there are a few issues we want to address this Christmas, if Santa will provide the funds. Before we can request the funds, though, we need to spec out the hardware.

The first issue is that our sync writes are atrocious over NFS, so we want to add a dedicated SLOG. Having perused the forum to see what others are suggesting, we are thinking of getting a couple of Intel Optane 900p NVMe PCIe drives. Yes, we are aware they don't have enhanced power-loss protection, but our server is on a UPS, so we can live with that. We plan on mirroring them for paranoia reasons, and they are fairly cheap drives anyway (£200 each). Yes, we know that 240GB is overkill for a SLOG in our case, but we are failing to see what else might be suitable. Has anyone got any better recommendations for our SLOG? We could go the route of an M.2 adapter card such as the Supermicro AOC-SLG3-2M2 with two smaller M.2 SSDs; any thoughts?
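For reference, our understanding is that attaching the pair as a mirrored log vdev is a one-liner once the drives are visible; roughly something like this (the device names and the pool name "tank" are placeholders for whatever we end up with):

    # check the NVMe drives are visible (they enumerate as nvd* on FreeBSD)
    nvmecontrol devlist

    # add both Optanes to the pool as a mirrored SLOG, then verify
    zpool add tank log mirror /dev/nvd0 /dev/nvd1
    zpool status tank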

The other issue we are having is that when a lot of people are working off the server, the latency of directory browsing seems to tank. Throughput stays high, but it feels like the drives get so busy reading large chunks of data that small reads suffer. So while throughput is high, the perceived speed of the filesystem is low because of the directory-browsing latency. We feel that a high-speed, low-latency L2ARC cache would improve things here; correct us if we are wrong and we should be tuning elsewhere. We are not sure what device, speed, size or technology should be used for an L2ARC here, so we are really looking for recommendations.
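Likewise, if an L2ARC does turn out to be the answer, our understanding is that it would just be a cache vdev added to the pool, along these lines (again, device and pool names are placeholders):

    # a cache (L2ARC) device needs no redundancy; if it dies, reads simply fall back to the pool
    zpool add tank cache /dev/nvd2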

We are currently on FreeNAS 9.3, but in the process of upgrading to the latest version once we have got the last of the mission-critical VMs/jails off the system, so we can be comfortable taking the file server offline outside of company hours to do the upgrades. We are assuming the latest version of FreeNAS has no issues with NVMe drives, since we are running on bare metal. We are also confident that our hardware supports NVMe, as the Supermicro AOC-SLG3-2M2 card is listed as compatible with our motherboard (also from Supermicro); however, we have a slight doubt that these newer drives might use a newer NVMe version that is not compatible.

Thanks for reading and we look forward to your suggestions.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
Please provide hardware specs or it will be difficult to make suggestions.

General Notes:
  • Optane "theoretically" doesn't need PLP, as it writes directly to the media with no volatile cache.
  • Do you have "sync=always" on? If not, a SLOG may not help much, or at all.
  • L2ARC should only be added once you have maxed out ARC; if you CAN add RAM, do that first.
  • L2ARC uses RAM to index its contents, so unless you have over 64GB it's not likely to help and may actually make things worse; for your use case, 128-256GB of RAM/ARC would make sense. (A quick way to check your current sync setting and ARC size is sketched below.)
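Quick ways to check where you stand from the FreeNAS shell (the dataset name is just an example):

    # is sync left at standard, or forced to always/disabled, on the dataset in question?
    zfs get sync tank/dataset

    # current ARC size versus the configured ceiling
    sysctl kstat.zfs.misc.arcstats.size
    sysctl vfs.zfs.arc_max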
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Depending on what you use NFS for, you could just set sync=disabled and get better performance than any SLOG. That being said, the 900p makes an excellent SLOG. But if you ever want even higher write endurance, consider the P3700 or P4800X.

Regarding the L2ARC, it is complicated. What's your current ARC hit rate like? There is some tweaking you can do, like prefetch etc., to make directories load faster. Also consider adding more vdevs; 20-30 users sounds like a lot for 2 vdevs.
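You can read the hit rate off the Reporting graphs in the GUI, or pull the raw counters from the shell, something like:

    # hit ratio = hits / (hits + misses)
    sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

    # or the summarised view, if arc_summary.py is present on your build
    arc_summary.py | less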
 

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
@Jessep We did provide specs, see first line/image of our post.

With Optane we are aware of that, and since we are on a UPS we are not worried about power-loss corruption. We are asking if there is a better solution for us than the 900p drives.

Sync is set to the default on the datasets. Our sync writes are slow; our async writes are fine. From our understanding, setting sync=always forces all writes to be sync. How does setting sync=always make a SLOG device worse off?

Currently, we have 128GB of RAM in the system. We cannot add more without also adding another CPU, so that is out of the question.
 

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
@Ender117 NFS is mounted by ESXi, which forces all writes to be sync, so it's slow there. We know we can disable sync but would rather keep it on for data protection. We also mount NFS in some Ubuntu VMs with a mix of sync and async workloads. The P3700 and P4800X drives are going to be too expensive for Santa, and we couldn't put that much money down on a device where only 1% of its storage capacity is used. Looking at the write endurance of the 900p drives, they have more than adequate endurance for our workload. We are just wondering if there is anything smaller/more cost-effective that could fit the bill as a SLOG device for us.
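For anyone checking our endurance maths, the back-of-the-envelope looks roughly like this (the PBW figure is the commonly quoted rating for the 280GB 900p, so treat it as approximate, and the daily write volume is a deliberately generous guess at our sync traffic):

    rated endurance     ≈ 5 PBW (about 5,000 TB written over the drive's life)
    assumed SLOG writes ≈ 1 TB/day
    expected life       ≈ 5,000 TB ÷ 1 TB/day ≈ 5,000 days, i.e. well over a decade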

Adding another vdev at this point won't fly. We would have to add another 6 HDDs, an external chassis to fit them, and an HBA card with an external port to connect up the new chassis. It's a good idea if it can help, but the actual implementation is harder.

As for the ARC data, we are not 100% sure what we are looking for. If you have some resources we can look at to brush up on how the ARC and L2ARC caches work, we are happy to take a look.

Here are our ARC graphs if they help anyone understand our usage:

(attached screenshot of ARC reporting graphs: Screen Shot 2018-10-30 at 16.29.01.png)
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
@Ender117 NFS is mounted by ESXi, which forces all writes to be sync, so it's slow there. We know we can disable sync but would rather keep it on for data protection. We also mount NFS in some Ubuntu VMs with a mix of sync and async workloads. The P3700 and P4800X drives are going to be too expensive for Santa, and we couldn't put that much money down on a device where only 1% of its storage capacity is used. Looking at the write endurance of the 900p drives, they have more than adequate endurance for our workload. We are just wondering if there is anything smaller/more cost-effective that could fit the bill as a SLOG device for us.

Adding another vdev at this point won't fly. We would have to add another 6 HDDs, an external chassis to fit them, and an HBA card with an external port to connect up the new chassis. It's a good idea if it can help, but the actual implementation is harder.

As for the ARC data, we are not 100% sure what we are looking for. If you have some resources we can look at to brush up on how the ARC and L2ARC caches work, we are happy to take a look.

Here are our ARC graphs if they help anyone understand our usage:

(quoted attachment: ARC reporting graphs screenshot)
OK, based on your OP I thought this was being used as a general filer. But yeah, backing a hypervisor is a prime use case for a SLOG. If you can buy used, P3700s can be had at ~200-250 each. Slightly slower than the 900p but with much higher endurance. I understand that the 900p is enough for you; just pointing out an option.

Your ARC hit rate looks high, yet the ARC size is not maxed out. By default most of the RAM (~120G in your case) can be used for ARC, yet yours only uses ~80G. I am afraid that simply throwing more hardware at it may not help here (read-wise).

Please provide more details on your hardware (especially the chassis) and usage pattern (both as backing for the hypervisor hosts and as a filer), in IOPS. I am thinking you might be best off switching to mirrors instead of raidz, at least for the VM part.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
@Jessep We did provide specs, see first line/image of our post.

With Optane we are aware of that, and since we are on a UPS we are not worried about power-loss corruption. We are asking if there is a better solution for us than the 900p drives.

Sync is set to the default on the datasets. Our sync writes are slow; our async writes are fine. From our understanding, setting sync=always forces all writes to be sync. How does setting sync=always make a SLOG device worse off?

Currently, we have 128GB of RAM in the system. We cannot add more without also adding another CPU, so that is out of the question.

Apologies, I tend not to click on pictures.

Your specs list 4X32GB LR DIMMs so you could add an additional 4X32GB DIMMs without requiring a second CPU.

What are your actual use cases? (EDIT: you responded while I was writing this)

"sync=always" doesn't make SLOG worse, if you aren't using that setting or a protocol that does the same thing a SLOG isn't going to IMPROVE performance as async writes will always be faster.

To quote cyberjock:
Your pool/dataset sync setting overrides ALL writes if you choose disabled or always. It makes every single write a sync write or every single write an async write.

NFS "supports" the sync write flag. So any write that is specified as a sync write will be a sync write if you have it set to standard. So what writes are sync writes normally? That depends on what you are copying and what program you are using to do the copying. Some programs support and do sync writes, some don't. Your benchmark for the always/standard/disabled is only valid for the exact client software you used to copy the data and only in the method you used. Other than that, it goes out the window.

Some other protocols have no sync write flag (iSCSI for one). So standard and disabled should be the same for those protocols. sync=always will of course make your non-sync protocol behave kind of like a sync protocol, with all of the performance-killing properties associated with it.

ESXi is an application that sync flags EVERY SINGLE WRITE. That completely kills zpool performance, but has the advantage of protecting your data at pretty much every performance cost.

You can make every protocol appear to have sync writes for everything by setting sync=always. That will also kill zpool performance in a similar fashion. That's why I'm saying sync=always with iSCSI is no different than NFS with sync=standard.

So yes your write performance will improve using a SLOG with your listed workload.

RaidZ2 is a poor choice for ESXi or VM workloads (low IOPS).
 

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
Your specs list 4X32GB LR DIMMs so you could add an additional 4X32GB DIMMs without requiring a second CPU.

We may be wrong, but our understanding is that the motherboard has 8 DIMM slots, 4 per CPU. For the other 4 DIMMs to work, a second CPU needs to be present. Based on that, we cannot add more RAM without either getting another CPU or swapping out the 32GB DIMMs for higher-capacity ones.

RaidZ2 is a poor choice for ESXi or VM workloads (low IOPS).

Yep, we know it is not ideal. However, we are not running VMs off this zpool. We are mounting NFS inside VMs to read and write some working data from the pool. That data usage is low-IOPS, much like the file-sharing use. As for ESXi, we are mounting NFS just to write backups to. Our ESXi host has its own drives and uses the NFS share only to back up periodically; no VMs are run directly over NFS. Because ESXi mounts NFS with sync enabled, our backups are dog slow due to the lack of a SLOG device. Likewise, in the VMs that talk to the zpool over NFS, any sync writes are slow for the same reason.

Your ARC hit rate looks high, yet the ARC size is not maxed out. By default most of the RAM (~120G in your case) can be used for ARC, yet yours only uses ~80G. I am afraid that simply throwing more hardware at it may not help here (read-wise).

We are currently in a quieter period and slowness hasn't been much of an issue for the past few weeks, but we have seen the ARC size cap out at 100G for weeks on end in the past. If an L2ARC device won't help, is there anything we can tune to help with the existing setup?

Please provide more details on your hardware (especially the chassis) and usage pattern (both as backing for the hypervisor hosts and as a filer), in IOPS. I am thinking you might be best off switching to mirrors instead of raidz, at least for the VM part.

Our chassis is a Supermicro 4U SC846BE16-R920W/R1280W. We are sure you are going to ask about the other 12 bays: 6 of them are used by a set of high-capacity drives as a warm archive, another couple are used for spare HDDs, and another two are used by an SSD mirror that some jails run off of.

As for the VM stuff, we have explained our usage in a bit more detail above.
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
Ahhh, X10DRL-i not X10DRi, my mistake.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Yep, we know it is not ideal. However, we are not running VMs off this zpool. We are mounting NFS inside VMs to read and write some working data from the pool. That data usage is low-IOPS, much like the file-sharing use. As for ESXi, we are mounting NFS just to write backups to. Our ESXi host has its own drives and uses the NFS share only to back up periodically; no VMs are run directly over NFS. Because ESXi mounts NFS with sync enabled, our backups are dog slow due to the lack of a SLOG device. Likewise, in the VMs that talk to the zpool over NFS, any sync writes are slow for the same reason.

I find it hard to make a recommendation because there always seems to be an essential piece of info missing. :) However, the general consensus is: if you cannot afford to lose the last ~10 seconds of data (with default settings) in the event of a power loss/crash/etc., get a SLOG; otherwise just set sync=disabled. Most people don't consider backups to be one of those cases; they just redo the backup from the original. However, it's your money/Santa after all, so if you want a SLOG, the 900p is a good choice.


We are currently in a quieter period and slowness hasn't been much of an issue for the past few weeks, but we have seen the ARC size cap out at 100G for weeks on end in the past. If an L2ARC device won't help, is there anything we can tune to help with the existing setup?

Try googling ZFS prefetch tuning. It may speed up reads by loading data into RAM ahead of time; OTOH it can also cause memory pressure and hurt performance. However, please consider switching to striped mirrors instead of raidz, especially since you are now only using 1% of the space. There is only so much caching can do.
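For reference, the main knob is a sysctl/loader tunable; check it before changing anything, and since directory browsing is the pain point the ARC metadata counters are worth a glance too (treat this as a pointer, not a recommendation):

    # 0 = prefetch enabled (the default when you have plenty of RAM), 1 = disabled
    sysctl vfs.zfs.prefetch_disable

    # how much of the ARC may hold / currently holds metadata (directory info lives here)
    sysctl vfs.zfs.arc_meta_limit vfs.zfs.arc_meta_used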
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Looking for a little clarification before I make some suggestions; see bolded areas of the pull-quotes below.

The server is connected to our network switches via a 2x10Gb LACP lagg. We have 20-30 users connected over 1Gb Ethernet working on large design files over AFP file sharing. We also have a few other servers connecting to the file server over 10Gb links, reading and writing files using SMB and NFS.

Is the bolded "file server" here the FreeNAS machine, or do you have a VM on your ESXi server using the FreeNAS NFS export to back a VMDK?

Yep, we know it is not ideal. However, we are not running VMs off this zpool. We are mounting NFS inside VMs to read and write some working data from the pool. That data usage is low-IOPS, much like the file-sharing use.

...

Likewise, in the VMs that talk to the zpool over NFS, any sync writes are slow for the same reason.

Are the guest VMs doing the NFS mounts, or is the hypervisor (ESXi) mounting the NFS export?

If your guest OS explicitly requires/requests sync writes, then sure, you need an SLOG - but just make sure you aren't unnecessarily demanding them for data that doesn't need it.

As for ESXi, we are mounting NFS just to write backups to. Our ESXi host has its own drives and uses the NFS share only to back up periodically; no VMs are run directly over NFS. Because ESXi mounts NFS with sync enabled, our backups are dog slow due to the lack of a SLOG device.

What backup software is being used, or are you manually doing snapshots/clones? You can always set sync=disabled on the backup dataset in order not to compromise integrity where it's needed elsewhere.
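If it is its own dataset, that's a one-liner (the dataset name below is just an example):

    # relax sync only on the backup target; every other dataset keeps its own setting
    zfs set sync=disabled tank/esxi-backups

    # confirm what each dataset ended up with
    zfs get -r sync tank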
 
Joined
Dec 29, 2014
Messages
1,135
Just a note on the Optane 900P. I have one in each of my FreeNAS systems, and I have been very happy with them. I have mine partitioned, since FreeNAS won't use all of the 280GB for an SLOG. I split them up and use part of them as an L2ARC. I haven't done any testing to see if the L2ARC improves my read performance, but it sounds cool. Using the Optane as an SLOG gave me a 4-8X boost in NFS write performance from ESXi over 10G NFS.
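For what it's worth, the split was done with gpart from the CLI before adding the partitions to the pool; something along these lines (the sizes, labels and pool name here are placeholders rather than my exact layout):

    # small slice for SLOG, a bigger one for L2ARC, on the same 900P
    gpart create -s gpt nvd0
    gpart add -t freebsd-zfs -l slog0 -s 16g nvd0
    gpart add -t freebsd-zfs -l l2arc0 -s 200g nvd0

    # add the partitions to the pool by GPT label
    zpool add tank log gpt/slog0
    zpool add tank cache gpt/l2arc0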
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Using the Optane as an SLOG gave me a 4-8X boost in NFS write performance from ESXi over 10G NFS.
Can you specify the 4X-8X boost? If it's compared to no SLOG, 8X seems a bit small.
 
Joined
Dec 29, 2014
Messages
1,135
Can you specify the 4X-8X boost? If it's compared to no SLOG, 8X seems a bit small.

I would have to go back and look for the detailed post I did on this. I'll try to do that later. The short version is that I was getting 500-700K write performance from my ESXi hosts to FreeNAS via NFS over a 10G network. I wasn't happy with that, and after some research and discussion, an SLOG seemed like the thing to boost NFS sync write performance. I settled on the Optane 900P because of the speed and the fact that I didn't want to burn any drive slots. Now I can get fairly sustained periods of 4G write speeds, and that meets my needs. There are others in the forum with a much better understanding of FreeNAS internals who can explain why FreeNAS will only use a certain amount of SLOG space. I am sure part of that is because of my config (details in the sig). The 7.2K SATA drives and RAIDZ2 vdevs aren't the best layout for performance. I know mirrors would give me better performance, but I wasn't willing to give up the space. Also, my environment is mostly a lab, so my needs aren't too demanding and I didn't feel like maintaining multiple pools. Once I got to where I was happy with the performance, I stopped tinkering. Even before the SLOG and L2ARC, I could consistently get 8G read performance. I am very happy with that.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Interesting

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
I would have to go back and look for the detailed post I did on this. I'll try to do that later. The short version is that I was getting 500-700K write performance from my ESXi hosts to FreeNAS via NFS over a 10G network. I wasn't happy with that, and after some research and discussion, an SLOG seemed like the thing to boost NFS sync write performance. I settled on the Optane 900P because of the speed and the fact that I didn't want to burn any drive slots. Now I can get fairly sustained periods of 4G write speeds, and that meets my needs. There are others in the forum with a much better understanding of FreeNAS internals who can explain why FreeNAS will only use a certain amount of SLOG space. I am sure part of that is because of my config (details in the sig). The 7.2K SATA drives and RAIDZ2 vdevs aren't the best layout for performance. I know mirrors would give me better performance, but I wasn't willing to give up the space. Also, my environment is mostly a lab, so my needs aren't too demanding and I didn't feel like maintaining multiple pools. Once I got to where I was happy with the performance, I stopped tinkering. Even before the SLOG and L2ARC, I could consistently get 8G read performance. I am very happy with that.

If you meant 700 kbit/s vs 4 Gbit/s, that's a ~5000x increase; if 700 kByte/s vs 4 Gbit/s, that's ~600x. Either is a lot larger than 4-8x?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
https://forums.freenas.org/index.php?threads/testing-the-benefits-of-slog-using-a-ram-disk.56561/

900Ps are a good choice. I'd go with a cheap Samsung EVO drive for L2ARC; SATA or NVMe is your choice. SATA will have a max burst of ~550MB/s.

The next upgrade would probably be to mirrors if you need more IOPS performance. Or add another 24-bay external chassis and start adding more 6-drive raidz2 vdevs.

At >56% utilization it's at least time to start thinking about how you would increase capacity.

May be easiest to just swap into a 24/36 bay chassis.

Doubling the vdevs will double your IOPS.

Your current fantastic ARC cache hit rates are probably due to a long-since-fixed bug which was causing repetitive reads, resulting in misleading ARC stats. Unfortunately, your hit ratio will probably drop significantly when you upgrade FreeNAS and the bug is fixed.
 

scott2500uk

Dabbler
Joined
Nov 17, 2014
Messages
37
Looking for a little clarification before I make some suggestions; see bolded areas of the pull-quotes below.

Is the bolded "file server" here the FreeNAS machine, or do you have a VM on your ESXi server using the FreeNAS NFS export to back a VMDK?

Are the guest VMs doing the NFS mounts, or is the hypervisor (ESXi) mounting the NFS export?

If your guest OS explicitly requires/requests sync writes, then sure, you need an SLOG - but just make sure you aren't unnecessarily demanding them for data that doesn't need it.

What backup software is being used, or are you manually doing snapshots/clones? You can always set sync=disabled on the backup dataset in order not to compromise integrity where it's needed elsewhere.

Yes, when we refer to the file server we mean FreeNAS. All VMs on the ESXi server are running their disks (VMDK files) on physical disks local to that server, not over any network or remote share.

Both ESXi and the guest OSes are mounting FreeNAS NFS shares.
ESXi mounts them only to write backups of the guest OS disks to FreeNAS via a periodic cron task running on ESXi, and to have access to a number of ISOs stored on FreeNAS. We are using ghettoVCB to do the backups.

A number of guest OSes mount FreeNAS NFS shares to give them access to the shared office file server. Others mount NFS shares to gain access to larger disk capacities to store things like backups, logs and some transactional data. Some of those writes are sync writes, and we feel they should stay that way.

All of the NFS mounts do very little in terms of IOPS, and the throughput of our zpool is more than adequate for our needs. It's just sync writes that suffer due to our raidz2 disk setup, but because those disks are primarily used as office file server storage, striped raidz2 vdevs give us the best bang for our buck/storage.

We are pretty sure that a SLOG device is the way to go; it's just a question of whether the 900p is the right choice or there is a better alternative. To echo what Elliot Dierksen said, we think the 900p is ideal because of the speed of the PCIe interface and because it doesn't burn any HDD slots, and we have plenty of PCIe slots to spare. The only thing that saddens us is that a lot of the storage will go to waste, as we know a SLOG device doesn't need to be very spacious. We have seen reports that you cannot over-/under-provision these drives to help with write endurance, but we would suspect that if you gave the full 240GB to the SLOG the drive would do decent wear-levelling anyway, essentially achieving the same thing as over-/under-provisioning?

At >56% utilization it's at least time to start thinking about how you would increase capacity.

May be easiest to just swap into a 24/36 bay chassis.

As things get quieter around the Xmas season, that's when housekeeping starts to happen. We will start to archive off old client data to warm and cold archives. We aim to keep our disk utilization around 30% after the housekeeping. While we do like to hoard data and never delete anything just in case, we are pretty good at keeping our working data to a minimum to keep things running nippy.
 
Joined
Dec 29, 2014
Messages
1,135
If you meant 700kBit/s vs 4Gbit/s that's 5000x increase, if 700kByte/s vs 4Gbit/s that's 600x. Either is larger than 4-8x?

Edit: I am losing my freaking mind! Sorry, it has been a rough week or so (roughly 3 all-nighters in 10 days).

My "X" may be off, but the transfer difference was 500-700Mb up to 4Gb. Yes, "M" instead of "K". Argh!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, when we refer to the file server we mean FreeNAS. All VMs on the ESXi server are running their disks (VMDK files) on physical disks local to that server, not over any network or remote share.

Understood.

Both ESXi and the guest OSes are mounting FreeNAS NFS shares.
ESXi mounts them only to write backups of the guest OS disks to FreeNAS via a periodic cron task running on ESXi, and to have access to a number of ISOs stored on FreeNAS. We are using ghettoVCB to do the backups.

If the NFS export being used as a backup target is a separate dataset (and if it's not, why not?), you could specify sync=disabled at that level to give you maximum throughput there. Ditto for the ISO export; why stress your disks unnecessarily (especially if you're using RAIDZ2)?

A number of guest OSes mount FreeNAS NFS shares to give them access to the shared office file server. Others mount NFS shares to gain access to larger disk capacities to store things like backups, logs and some transactional data. Some of those writes are sync writes, and we feel they should stay that way.

All of the NFS mounts do very little in terms of IOPS, and the throughput of our zpool is more than adequate for our needs. It's just sync writes that suffer due to our raidz2 disk setup, but because those disks are primarily used as office file server storage, striped raidz2 vdevs give us the best bang for our buck/storage.

Absolutely fair. Just wanted to make sure you weren't being unnecessarily demanding on the data. OLTP/real-time data and things like that; yes, definitely sync writes, and SLOG will be required.

We are pretty sure that a SLOG device is the way to go; it's just a question of whether the 900p is the right choice or there is a better alternative. To echo what Elliot Dierksen said, we think the 900p is ideal because of the speed of the PCIe interface and because it doesn't burn any HDD slots, and we have plenty of PCIe slots to spare. The only thing that saddens us is that a lot of the storage will go to waste, as we know a SLOG device doesn't need to be very spacious. We have seen reports that you cannot over-/under-provision these drives to help with write endurance, but we would suspect that if you gave the full 240GB to the SLOG the drive would do decent wear-levelling anyway, essentially achieving the same thing as over-/under-provisioning?

Since you indicated above that the P4800X is out of your price range, the only other similarly-performing alternative would be a P3700; you might be able to find one on clearance for similar pricing to the Optane 900p. Regarding wear-levelling, Optane doesn't wear-level the same way traditional NAND does, so Intel disabled the hardware overprovisioning. Annoying, but you can get similar results by doing a secure erase and manually creating a smaller (16GB?) partition, then adding that partition as the SLOG.
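A rough sketch of that last step, assuming the pool is called tank and the Optanes show up as nvd0/nvd1 (do the secure erase first with Intel's own tooling; that part is left out here):

    # small partition on each drive, then mirror just those partitions as the SLOG
    gpart create -s gpt nvd0 && gpart add -t freebsd-zfs -l slog0 -s 16g nvd0
    gpart create -s gpt nvd1 && gpart add -t freebsd-zfs -l slog1 -s 16g nvd1
    zpool add tank log mirror gpt/slog0 gpt/slog1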
 