Hi,
@jgreco, thanks for your response.
So would you recommend that I tune down the transaction group size and / or timeout then? What's a good place to start for that? Would doing this increase fragmentation?
Look for posts discussing vfs.zfs.txg.timeout
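Something like this is the usual starting point (values here are only illustrative, and on FreeNAS you'd normally set them via the Tunables screen rather than at the shell):

# shorten the txg interval so each group stays smaller (default is 5 seconds)
sysctl vfs.zfs.txg.timeout=2
# cap how much dirty data a txg can accumulate; this one is a boot-time tunable,
# so it belongs in loader.conf / the Tunables screen, not a runtime sysctl
# vfs.zfs.dirty_data_max="2147483648"

Start conservative and watch gstat while the VMs are busy before touching anything else.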
Fragmentation might increase, but fragmentation is always such a problem for block storage that you should already be designing your systems with the expectation that you'll hit steady-state fragmentation. Have lots of pool free space and lots of L2ARC, basically.
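If you want to see where a pool currently sits, the fragmentation and capacity columns from zpool list are the quick read (pool name is just an example):

zpool list -o name,size,allocated,free,fragmentation,capacity tank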
Maybe I need to revisit the decision, but I purposely chose to use SATA SSDs. I wanted to be able to replace an SSD if necessary without taking the server out of production. There aren't any motherboards or chassis in my almost-no-money price range that can do hot-swap PCIe cards.
Yeah, there's that. Tradeoffs. Of course, if your chassis has extra SATA/SAS bays, you can still add a replacement SSD without downtime.
Also, I don't think the older motherboards we're using support gen 3 PCIe, which I suspect would be required?
I'm not aware of any such limitations. I believe we've used NVMe on PCIe-less-than-3 in the past but I don't know for sure. Hardware compatibility is always good to verify.
We originally tried Samsung consumer-level "Pro" SATA SSDs before settling on the S3710s. What we found is that while on paper they claim really high IOPS there's apparently a difference between IOPS and sustained IOPS. The Samsung Pro SSDs couldn't even keep up with our moderate needs and we ended up having to set sync=disabled on our VM datasets until we could replace them.
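For what it's worth, that was just the ordinary dataset property flip (dataset name here is made up), reverted once the S3710s went in:

zfs set sync=disabled tank/vm
# ...and later, back to honoring sync writes:
zfs set sync=standard tank/vm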
Sustained I/O numbers on *any* SSD seem to be incredibly overoptimistic. The Samsung 850 PROs are still basically consumer-grade drives and don't compare to the S3710, which is a solid data-center-class drive. I usually find that there's a little more reality if I drop an order of magnitude off the rating.
S3710s seem to be getting harder to find cheap, so I will probably try some Samsung 863a SATA SSDs instead, as those are supposed to be quite nice as well.
I'm impressed you ever found S3710's cheap.
True. But my ARCs currently have a hit rate between 93% and 99%, so I'm pretty sure I could saturate at least one 10GbE link even with my lowly SATA SSDs, especially since I'm striping them for the L2ARC! (That is, I could if the ixgbe driver in 10.3 weren't so horribly pathetic.) Going with PCIe or even more SSDs just doesn't seem to be required for our workload, though things may change in the future.
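In case it's useful to anyone, the hit rate is just derived from the ARC counters, and the "striping" is nothing fancier than giving the pool more than one cache device (device and pool names below are examples):

# hit rate ~= hits / (hits + misses)
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
# or, if the zfs-stats port is installed, it prints the same summary:
zfs-stats -A
# two SATA SSDs as L2ARC; ZFS spreads fills and reads across them
zpool add tank cache ada4 ada5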
Just throwing things out there. I'm not really thrilled about the evolution of the SSD marketplace, and I guess for tasks like this the future is probably leaning NVMe. I personally really like the 2.5" SATA/SAS form factor because it is so versatile; I can use the drives with RAID controllers, etc. But if you were actually going out to buy something, NVMe has at least some substantial benefits over SATA, at similar cost.
I forgot to mention that for our newer servers we are over-provisioning these SSDs by 25%. We secure erase the drives before partitioning them to use only 300GB out of 400GB. It's not necessary to pull the SSDs out and use Intel's SSD Toolbox to secure erase; it can be done right from FreeNAS using camcontrol.
Yes, but I don't recommend it for beginners. If you know it can be done and how to do it, then that's great.
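For anyone following along, the general shape of it is something like this; the device name is made up, the drive usually has to be unfrozen before it will accept the erase command, and of course the erase destroys everything on it:

# confirm the drive's security state isn't frozen
camcontrol security ada6
# set a throwaway password and issue the ATA secure erase with it
camcontrol security ada6 -U user -s erasepw -e erasepw
# then partition only 300G of the 400G drive, leaving the rest unallocated
gpart create -s gpt ada6
gpart add -t freebsd-zfs -a 1m -s 300g ada6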
These enterprise-level drives already have a good amount of over-provisioning built in, but it's reasonably well established that over-provisioning by another 20% or so can gain you a bit of performance down the road, after the drives have been used for a while. What's not well established is that over-provisioning by massive amounts gains you even more than over-provisioning by 25%. I suspect the law of diminishing returns kicks in and any gains from massive over-provisioning vs. normal over-provisioning are negligible. So my opinion is: buy a bigger SSD for SLOG, because bigger SSDs are faster and last longer. Over-provision it normally, but don't completely waste the space; you bought it, so you might as well use it!
Well, that *sounds* nice but is meaningless in practice. The only meaningful data ever stored on a SLOG device are the last two transaction groups. These are the things that will be recovered from SLOG upon pool import. You can choose to make the SLOG an exabyte large, if you want, but that'll all be wasted space except for the amount of storage required to store the last two transaction groups.
If you look at how an SSD operates under the sheets, what you really want is for the SSD to have a massive pile of already-erased blocks.
https://arstechnica.com/information...volution-how-solid-state-disks-really-work/3/
What slows an SSD down is when it has to go and do page shuffling to create blocks that can be erased. You already recognize that having free space on the disk leads to faster performance. If you have 300GB of that 400GB SSD filled with dirty sectors, that means you have 100GB of the advertised capacity (and probably ~150GB of actual underlying flash) that may be erased and ready-to-go. So if your clients start pounding writes, you can soak up ~100GB before you're shuffling pages.
My point is that if you instead only use 30GB of that 400GB SSD, you can soak up ~370GB before you're shuffling pages.
It's totally possible your SSD is fast enough to erase blocks at a rate sufficient to make sure that this doesn't turn into a significant impact. That's great if so. However, because there's no VALUE in consuming more than ~30GB, it seems to me that the smart move is to limit it to that size, let the controller have a huge pool of erased blocks, and then be assured of optimal behaviour. A larger SSD mostly gets you greater endurance, and reducing page shuffling is also a good thing for the longevity of an SSD.
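To put a rough number on that ~30GB, by the way: the SLOG only ever has to hold the sync writes for the transaction groups that haven't committed yet, so with the default timeout and a best-case 10GbE front end the back-of-the-envelope math looks like

~1.2GB/sec (10GbE, best case) x 5 sec txg timeout = ~6GB per txg
2 outstanding txgs x ~6GB = ~12GB of SLOG that can ever hold live data

and a 16-30GB slice is already generous headroom on top of that.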
Do none of you think that the issue could simply be that I don't have enough ARC used for metadata, as I theorize below? (The rough check I've been looking at is sketched at the end of this post.)
Or that I'm simply asking too much from NFS, and iSCSI would give me more consistency?
Or that I simply need to tune the transaction group settings?
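The metadata check I mentioned above is just this, assuming I'm reading the counters right:

sysctl kstat.zfs.misc.arcstats.arc_meta_used kstat.zfs.misc.arcstats.arc_meta_limit
sysctl kstat.zfs.misc.arcstats.demand_metadata_hits kstat.zfs.misc.arcstats.demand_metadata_misses
# vfs.zfs.arc_meta_limit is the tunable if the cap itself needs raising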
Thanks,
Carl
Anything's possible. There are large numbers of moving parts in these systems, and it is entirely possible for there to be half a dozen with knives poised to stab you in the back, even if five of the six aren't your current pain point. Try to avoid discounting the advice you're getting. I haven't seen anything that's bad or wrong so far. Even if these suggestions do not fix your immediate issue, they're good knowledge.