Some insights into SLOG/ZIL with ZFS on FreeNAS

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,510
That seems like a very bad idea.

I actually wondered if that was possible a long time ago. Then I had better things to do with my hours. I'd actually bet that it is possible if you try hard enough, but the question is "how hard is that."

No, it's not a typo. The SAS2208 is on the IBM MegaRAID M5110 card that came with the server, set to JBOD mode. FreeNAS defaults to the mrsas driver (rather than the older mfi) and passes everything through to camcontrol. The drives show up as da12 (SanDisk SSD Plus boot drive), da13 and da14 (the S3700s). smartctl works properly on all drives I've plugged in so far (three types of SSDs). Are there any known issues with this configuration?
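For reference, this is roughly how I checked which driver claimed the controller and that SMART passthrough works (a sketch from memory; the device names are just what they happen to be on my box):
Code:
# Which driver claimed the SAS2208 - mrsas or the older mfi?
dmesg | grep -E 'mrsas|mfi'
# List the disks CAM sees behind the controller
camcontrol devlist
# Confirm SMART passthrough on one of the S3700s
smartctl -i /dev/da13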

Well the basic issue is that there are about zero hours of mileage.

One of the reasons we promote crossflashing to IT mode is that there's probably in excess of a billion driver-hours on the LSI HBA IT stuff. ZFS pushes I/O systems hard, and it isn't really good enough to have 99.8% or 99.9% or 99.999% "correctness". We know the LSI HBAs in IT mode, with the proper firmware/driver, work correctly under adverse conditions, under normal conditions, etc.

The PC-enthusiast mindset of overclocking and tolerating the occasional BSOD isn't particularly well suited to server builds. We assume that you're building FreeNAS to provide a safe haven for your valuable data. Using untested and unproven hardware is fundamentally risky, and putting thousands of hours of runtime on it just to pass basic testing is something most enthusiasts are not willing to do.

The MFI driver is known (firsthand, by me) to be good - not great, but good - but more than a little quirky. I suppose the MRSAS driver could be better; I'd actually love for it to be, because so many systems come with LSI RAID controllers.

It's an option, although I'm a bit worried that the IBM server will complain if I put a different card in the storage slot (like Dell servers do). And the M5110 is free :)
 

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94

I replaced the M5110 with an M1215 and cross-flashed it to 9300-8i IT firmware. The mpr driver gets loaded:
Code:
root@pm2[~]# dmesg | grep mpr0
mpr0: <Avago Technologies (LSI) SAS3008> port 0x2000-0x20ff mem 0xa9b40000-0xa9b4ffff,0xa9b00000-0xa9b3ffff irq 42 at device 0.0 numa-domain 0 on pci4
mpr0: Firmware: 16.00.01.00, Driver: 18.03.00.00-fbsd
...
da12 at mpr0 bus 0 scbus1 target 4 lun 0
da14 at mpr0 bus 0 scbus1 target 6 lun 0
da13 at mpr0 bus 0 scbus1 target 5 lun 0

The drives were detected (da12 as the boot mirror, da13/da14 as a mirrored SLOG).
SMART works as well.
However, after I removed the log from the pool (and rebooted a few times), I still couldn't run the SLOG benchmark.
Edit: ahh, found the reason for "operation not permitted". The drives were still being used for swap. The latency of the SAS3008 is pretty bad. Will post the result in the SLOG testing thread.
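In case anyone else trips over the same "operation not permitted": this is roughly how I confirmed swap was the culprit and freed the drives for testing (a sketch; the swap partition names are assumptions about my layout, so check the swapinfo output for yours):
Code:
# Show which devices are currently backing swap
swapinfo
# Release swap from the old SLOG drives (p1 being their swap partitions here)
swapoff /dev/da13p1 /dev/da14p1
# The sync-write latency test should now run
diskinfo -Sw /dev/da13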

Is mpr the right driver, and is it also "low mileage"?
 

SamM

Dabbler
Joined
May 29, 2017
Messages
39
On this note, if you purchase the planned Intel DC S3700 drives I would strongly suggest that you use the Intel SSD Data Center Tool (isdct) - https://downloadcenter.intel.com/product/87278/Intel-SSD-Data-Center-Tool - to both change the logical sector size to 4K and to massively increase the overprovisioning on the drives - you'll really only need 8GB available on each, most likely. This will increase both performance and endurance.
I'm glad I saw this. I was wondering how to "overprovision" SSDs. So is taking a 200GB (or whatever size greater than 11GB) Intel SSD and 'overprovisioning' it so that it appears to be an 11GB SSD a good idea? The thinking being that if ZIL transaction groups are capped at 4GB, that leaves sufficient space for two groups, plus the 2GB default swap reservation, plus 1GB of 'wiggle room'. If I do this to a 200GB drive, then the remaining 189GB becomes redundant blocks for the specified 11GB, multiplying the write endurance of that 11GB roughly 17-fold, yes?

Yep, a single SLOG device could be added and removed. Two drives added at the same time were partitioned (back when the swap size was the default 2GiB) and added as a mirror. I could offline the individual drives, but not remove the log vdev.
Just before leaving work today I removed the log vdev again (using zpool remove), but couldn't run the SLOG test diskinfo -Sw ("operation not permitted"). After crudely re-plugging one of the drives, it worked on that drive.
I suspect I just ran into the same thing last week: a FreeNAS crash (late Oct 2019) due to faulty NVMe drive(s) has left me currently unable to remove my mirrored SLOG device, as described in https://www.ixsystems.com/community...-booting-and-cant-be-removed-from-pool.79966/.
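For reference, the whole add/remove cycle from the CLI looks roughly like this (a sketch; the pool name "tank" and the vdev label "mirror-1" are placeholders, so take the real label from zpool status):
Code:
# Add the two SSDs as a mirrored SLOG
zpool add tank log mirror da13 da14
# Note the log vdev's label (e.g. mirror-1) in the status output
zpool status tank
# Remove the whole mirrored log vdev by that label
zpool remove tank mirror-1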
 

HoneyBadger

Mushroom! Mushroom!
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
4,612
I'm glad I saw this. I was wondering how to "overprovision" SSDs. So is taking a 200GB (or whatever size greater than 11GB) Intel SSD and 'overprovisioning' it so that it appears to be an 11GB SSD a good idea? The thinking being that if ZIL transaction groups are capped at 4GB, that leaves sufficient space for two groups, plus the 2GB default swap reservation, plus 1GB of 'wiggle room'. If I do this to a 200GB drive, then the remaining 189GB becomes redundant blocks for the specified 11GB, multiplying the write endurance of that 11GB roughly 17-fold, yes?

Endurance will be helped out by the fact that the writes can be spread out across the entirety of the drive, so you won't have "hot spots" on the NAND, but the overall TBW endurance rating will still remain roughly the same. Edit: modern controllers are apparently very good at benefiting from this spread, to the point where TBW estimates double (or more!) when drives are heavily overprovisioned.

What it will more immediately do is improve the consistency of the write speeds: there will be at most 11GB (in your example) of NAND pages considered to be "holding valid data", and everything else can be treated as free space. So after the 11GB is used, the controller will still have 189GB of "free space" left that it can write to at full speed, while furiously performing garbage collection/background TRIM on the first 11GB to make it ready again later.

Regarding the how-to, Intel has a nice little whitepaper on this - for their drives specifically you can use the isdct tool, but third-party ones like hdparm perform the same function. [PDF warning]

https://www.intel.com/content/dam/w...nd-based-ssds-better-endurance-whitepaper.pdf
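To make that concrete, a rough sketch of both routes (the drive index, device node, and the ~11GB target are assumptions pulled from the example above - treat these as starting points, not tested recipes, and secure-erase first so the freed space returns to the spare area):
Code:
# Intel's tool: cap the usable capacity at ~6% of a 200GB drive (~12GB)
isdct set -intelssd 0 MaximumLBA=6%
# hdparm (from a Linux environment): hide everything past the first N sectors
# behind a host protected area; 23068672 sectors x 512 bytes = 11GiB
hdparm -Np23068672 --yes-i-know-what-i-am-doing /dev/sdX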
 