Some insights into SLOG/ZIL with ZFS on FreeNAS

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That seems like a very bad idea.

I actually wondered if that was possible a long time ago. Then I had better things to do with my hours. I'd actually bet that it is possible if you try hard enough, but the question is "how hard is that."

No, it's not a typo. The SAS2208 is on the IBM MegaRAID M5110 card that came with the server. It is set to JBOD mode. FreeNAS defaults to using the mrsas driver (rather than the older mfi) and passes all info to camcontrol. The drives show up as da12 (SanDisk SSD Plus boot drive), da13 and da14 (the S3700s). smartctl works properly on all drives I plugged in so far (three types of SSDs). Are there any known issues with this configuration?
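For reference, a quick way to confirm this kind of setup from the shell looks something like the following (device names are the ones from this post and will differ elsewhere):
Code:
root@pm2[~]# sysctl dev.mrsas                  # non-empty output means the mrsas driver attached to the controller
root@pm2[~]# camcontrol devlist                # the SSDs should show up as plain da* devices via CAM
root@pm2[~]# smartctl -i /dev/da13             # SMART identity should come back directly, no -d megaraid needed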

Well the basic issue is that there are about zero hours of mileage.

One of the reasons we promote crossflashing to IT mode is because there's probably in excess of a billion driver-hours on the LSI HBA IT stuff. ZFS pushes I/O systems hard and it isn't really good enough to have 99.8% or 99.9% or 99.999% "correctness". We know the LSI HBAs in IT mode with the proper firmware/driver combination work correctly under adverse conditions, under normal conditions, etc.

The PC-enthusiast mindset of overclocking and shrugging off the occasional BSOD isn't particularly well suited to server builds. We assume that you're building FreeNAS to provide a safe haven for your valuable data. Using untested and unproven hardware is fundamentally risky, and putting thousands of hours of runtime on it just to pass basic testing is something most enthusiasts are not willing to do.

The MFI driver is known (firsthand, by me) to be good - not great, but good - but more than a little quirky. I suppose the MRSAS driver could be better. I'd actually love for it to be better. It'd be great if that were the case because so many systems come with LSI RAID controllers.

It's an option, although I'm a bit worried that the IBM server will complain if I put a different card in the storage slot (like Dell servers do). And the M5110 is free :)
 

2nd-in-charge

Explorer
Joined
Jan 10, 2017
Messages
94

I replaced the M5110 with an M1215 and cross-flashed it to 9300-8i IT firmware. The mpr driver gets invoked:
Code:
root@pm2[~]# dmesg | grep mpr0
mpr0: <Avago Technologies (LSI) SAS3008> port 0x2000-0x20ff mem 0xa9b40000-0xa9b4ffff,0xa9b00000-0xa9b3ffff irq 42 at device 0.0 numa-domain 0 on pci4
mpr0: Firmware: 16.00.01.00, Driver: 18.03.00.00-fbsd
...
da12 at mpr0 bus 0 scbus1 target 4 lun 0
da14 at mpr0 bus 0 scbus1 target 6 lun 0
da13 at mpr0 bus 0 scbus1 target 5 lun 0

The drives were detected (da12 as the boot mirror, da13/da14 as a mirrored slog).
SMART works as well.
However, after I removed the log from the pool (and rebooted a few times), I still cannot run the slog benchmark.
Edit: ahh, found the reason for "operation not permitted". The drives were still being used for swap. The latency of the SAS3008 is pretty bad. Will post the result in the slog testing thread.
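For anyone else who hits the same "operation not permitted", this is roughly what I mean (the swap mirror name is just an example - check swapinfo for the real one):
Code:
root@pm2[~]# swapinfo                          # shows which devices currently back swap
root@pm2[~]# swapoff /dev/mirror/swap0.eli     # example name - release the swap mirror sitting on the SSDs
root@pm2[~]# diskinfo -wS /dev/da13            # destructive sync-write latency test; only on a drive with no data you care about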

Is mpr the right driver, and is it also "low mileage"?
 
Last edited:

SamM

Dabbler
Joined
May 29, 2017
Messages
39
On this note, if you purchase the planned Intel DC S3700 drives I would strongly suggest that you use the Intel SSD Data Center Tool (isdct) - https://downloadcenter.intel.com/product/87278/Intel-SSD-Data-Center-Tool - to both change the logical sector size to 4K and to massively increase the overprovisioning on the drives - you'll really only need 8GB available on each, most likely. This will increase both performance and endurance.
I'm glad I saw this. I was wondering how to "overprovision" SSDs. So is taking a 200GB (or whatever size greater than 11GB) Intel SSD and 'overprovisioning' it so that it appears to be an 11GB SSD a good idea? The thinking being that if ZIL transaction groups are capped at 4GB, that leaves sufficient space for 2 groups, plus the 2GB default swap reservation, plus 1GB of 'wiggle room'. And if I do this to a 200GB drive, the remaining 189GB becomes redundant blocks for the specified 11GB, multiplying the write endurance of that 11GB roughly 17-fold, yes?

Yep, a single SLOG device could be added and removed. Two drives added at the same time were partitioned (when the Swap Size was the default 2GiB) and added as a mirror. I could offline the individual drives, but not remove the log vdev.
Just before leaving work today I removed the log vdev again (using zpool remove), but couldn't run the slog test diskinfo -Sw ("operation not permitted"). I crudely re-plugged one of the drives, and then it worked on that drive.
I suspect that I just ran into this last week: a FreeNAS crash (late Oct 2019) caused by faulty NVMe drive(s) has left me currently unable to remove my mirrored SLOG device, as described in https://www.ixsystems.com/community...-booting-and-cant-be-removed-from-pool.79966/.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm glad I saw this. I was wondering how to "overprovision" SSDs. So is taking a 200GB (or whatever size greater than 11GB) Intel SSD and 'overprovisioning' it so that it appears to be an 11GB SSD a good idea? The thinking being that if ZIL transaction groups are capped at 4GB, that leaves sufficient space for 2 groups, plus the 2GB default swap reservation, plus 1GB of 'wiggle room'. And if I do this to a 200GB drive, the remaining 189GB becomes redundant blocks for the specified 11GB, multiplying the write endurance of that 11GB roughly 17-fold, yes?

Endurance will be helped out by the fact that the writes can be spread out across the entirety of the drive, so you won't have "hot spots" on the NAND, but the overall TBW endurance rating will still remain roughly the same. Edit: and modern controllers are apparently very good at benefiting from this spread, to the point where TBW estimates double (or more!) when drives are heavily overprovisioned.

What it will more noticeably (Edit: immediately) do is improve the consistency of the write speeds, since at most 11GB (in your example) of NAND pages will ever be considered as "holding valid data" and everything else can be treated as free space. So after the 11GB is used, the controller still has 189GB of "free space" left that it can write to at full speed while furiously performing garbage collection/background TRIM operations on the first 11GB to make it ready again later.

Regarding the how-to, Intel has a nice little whitepaper on this - for their drives specifically you can use the isdct tool, but third-party tools like hdparm perform the same function. [PDF warning]

https://www.intel.com/content/dam/w...nd-based-ssds-better-endurance-whitepaper.pdf
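As a rough sketch of what that looks like in practice - not an exact recipe - the visible capacity can be shrunk with isdct on Intel drives, or with hdparm's Host Protected Area feature on generic SATA SSDs under Linux. The drive index, device name and sector count below are made-up examples for an 11GB target:
Code:
# Intel DC drives (drive index 0 is an example; list yours with "isdct show -intelssd")
isdct set -intelssd 0 MaximumLBA=23068672                    # ~11GiB of 512-byte LBAs; a percentage is also accepted

# Generic SATA SSD under Linux - same idea via the Host Protected Area
hdparm -Np23068672 --yes-i-know-what-i-am-doing /dev/sdX     # permanent ('p') setting; usually needs a power cycle to take effect
It's generally best to do this on a blank or freshly secure-erased drive so the controller actually treats the hidden area as free space.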
 
Last edited:

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I have plenty of spares for my own use - just in case
 

Christian K

Dabbler
Joined
Sep 22, 2021
Messages
17
Sync writes involve immediately writing the update to the ZIL (whether in-pool or SLOG), and then proceeding as though the write was async, inserting the write into the transaction group along with all the other writes.
Thank you for that great article. I am on a journey to learn why NFS write speed to my SSD pool of two mirror vdevs is slower than I anticipated. I do not use a SLOG device as of now.

I understand that storing the intent log on a separate SLOG device moves those transactions away from the main pool, and therefore speeds up sync writes from a client's perspective. If I understand correctly, the ZIL holds the data to be written in sync mode.
However, in the case of an in-pool ZIL, the sync write transactions block (or at least compete with) other pool activity anyway. Since the data must be written from the ZIL to the pool afterwards, this takes something like twice the toll on pool transactions. I do not understand why an in-pool ZIL is used, instead of directly committing the "sync" writes to the pool and possibly stalling "async" writes. This would reduce load on the pool by avoiding "double writes".
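For what it's worth, a quick way to confirm whether sync writes are what's slowing the NFS share down is to compare throughput with sync temporarily disabled (the dataset name below is a placeholder, and sync=disabled is only safe as a short test, never for data you care about):
Code:
root@freenas[~]# zfs get sync,logbias tank/nfs      # "standard" means NFS commit requests are honoured as sync writes
root@freenas[~]# zfs set sync=disabled tank/nfs     # temporary test only - if throughput jumps, ZIL latency is the bottleneck
root@freenas[~]# zfs set sync=standard tank/nfs     # restore the safe default afterwards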
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I do not understand why an in-pool ZIL is used, instead of directly committing the "sync" writes to the pool and possibly stalling "async" writes. This would reduce load on the pool by avoiding "double writes".

Because ZFS needs to commit the entire current transaction group that is developing in the write cache. What you suggest would be horribly poor performance-wise. For stuff like overwrites, a copy-on-write system like ZFS isn't merely laying the data down on top of whatever data might have been there before. Small pool writes, especially to RAIDZ vdevs, are a real mess. This would tend to have the effect of stalling out pool traffic on HDD pools.
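One way to watch that batching happen (pool name is a placeholder): sync writes land on the ZIL/SLOG immediately, while the data vdevs take the accumulated transaction group in bursts every few seconds:
Code:
root@freenas[~]# sysctl vfs.zfs.txg.timeout         # default 5 - seconds between forced transaction group commits
root@freenas[~]# zpool iostat -v tank 1             # per-vdev view: the log device sees steady sync traffic, data vdevs see bursts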
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Keep in mind that sync writes are blocking everything the client is doing in their thread of execution (that's why they're sync writes). Every microsecond of added delay before the write can be acknowledged is time 100% wasted on the client.
 

JeffNAS

Dabbler
Joined
Oct 18, 2023
Messages
19
I'm trying to find proof that istgt even has sync writes in any form but all I'm seeing is comments about asynchronous writes. I'm trying to find something that explicitly mentions that sync writes do not exist for istgt. Google don't fail me now!

Edit: Just to clarify, I'm doing this because if you read up on how to make iSCSI do sync writes, what you actually do is make the dataset or pool use exclusively sync writes by setting sync=always. That's not an iSCSI setting, which tends to support all the other information saying that iSCSI just doesn't have a "sync write" in any form... or at least not in the version that's in FreeNAS.
What do you mean by istgt? iSCSI Target???
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
What do you mean by istgt? iSCSI Target???
istgt is the legacy iSCSI target daemon that was used in FreeNAS - if you'll note the date on the post you quoted, it's over a decade old at this point. :wink:

The logic stands though - in order to enforce sync writes for remote iSCSI initiators, you should set sync=always on the zvol (or on the parent dataset of the zvols, if you're creating them programmatically).
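In practice that's a one-liner per zvol or parent dataset (names below are placeholders):
Code:
root@truenas[~]# zfs set sync=always tank/iscsi      # set on the parent dataset so newly created zvols inherit it
root@truenas[~]# zfs get -r sync tank/iscsi          # verify the zvols report sync=always (inherited)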
 
Last edited: