Some insights into SLOG/ZIL with ZFS on FreeNAS

jgreco · Apr 23, 2019

Ericloewe said:
That seems like a very bad idea.

I actually wondered if that was possible a long time ago. Then I had better things to do with my hours. I'd actually bet that it is possible if you try hard enough, but the question is "how hard is that."

2nd-in-charge said:
No, it's not a typo. The SAS2208 is on the IBM MegaRaid M5110 card that came with the server. It is set to JBOD mode. FreeNAS defaults to using mrsas driver (rather than the older mfi), passes all info to camcontrol. The drives show up as da12 (Sandisk SSD Plus boot drive), da13 and da14 (the S3700s). smartctl works properly on all drives I plugged in so far (thee types of SSDs). Are there any known issues with this configuration?

Well the basic issue is that there are about zero hours of mileage.

One of the reasons we promote crossflashing to IT mode is because there's probably in excess of a billion driver-hours on the LSI HBA IT stuff. ZFS pushes I/O systems hard and it isn't really good enough to have 99.8% or 99.9% or 99.999% "correctness". We know the LSI HBA's in IT with the proper firmware/driver work correctly under adverse conditions, under normal conditions, etc.

The PC enthusiast who is used to overclocking and getting the occasional BSOD, that mindset isn't particularly good with respect to server builds. We assume that you're building FreeNAS to provide a safe haven for your valuable data. Using untested and unproven hardware is fundamentally risky, and putting thousands of hours of runtime on it just to pass basic testing is something most enthusiasts are not willing to do.

The MFI driver is known (firsthand, by me) to be good - not great, but good - but more than a little quirky. I suppose the MRSAS driver could be better. I'd actually love for it to be better. It'd be great if that were the case because so many systems come with LSI RAID controllers.

It's an option, although I'm a bit worried that IBM server will complain if I put a different card in the storage slot (like Dell servers do). And M5110 is free :)

2nd-in-charge · May 2, 2019

HoneyBadger said:
Also, you might want to switch them to 4Kn sectors vs. 512b for improved results - better yet, benchmark before and after you do that switch!

Have done what you suggested. Doesn't seem to make much difference.

2nd-in-charge · Jun 10, 2019

I replaced the M5110 with M1215, cross-flashed it to 9300-8i IT firmware. MPR driver gets invoked:

Code:

root@pm2[~]# dmesg | grep mpr0
mpr0: <Avago Technologies (LSI) SAS3008> port 0x2000-0x20ff mem 0xa9b40000-0xa9b4ffff,0xa9b00000-0xa9b3ffff irq 42 at device 0.0 numa-domain 0 on pci4
mpr0: Firmware: 16.00.01.00, Driver: 18.03.00.00-fbsd
...
da12 at mpr0 bus 0 scbus1 target 4 lun 0
da14 at mpr0 bus 0 scbus1 target 6 lun 0
da13 at mpr0 bus 0 scbus1 target 5 lun 0

The drives were detected (da12 as boot mirror, da13/da14 as a mirrored slog).
Smart works as well.
~~However after I removed the log from the pool (and rebooted a few times), I still cannot run slog benchmark~~
Edit: ahh, found the reason for "operation not permitted". The drives were still used for swap. The latency of SAS3008 is pretty bad. Will post the result in the slog testing thread.

Is mpr the right diver and is it also "low mileage"?

SamM · Nov 1, 2019

HoneyBadger said:
On this note, if you purchase the planned Intel DC S3700 drives I would strongly suggest that you use the Intel SSD Data Center Toolbox (isdct) - https://downloadcenter.intel.com/product/87278/Intel-SSD-Data-Center-Tool - to both change the logical sector size to 4K and to massively increase the overprovisioning on the drives - you'll really only need 8GB available on each, most likely. This will increase both performance and endurance.

I'm glad I saw this. I was wondering how to "overprovision" SSD's. So is taking a 200Gb (or whatever size greater than 11Gb) Intel SSD, and 'overprovisioning' it so that it appears to be an 11Gb SSD a good idea? The thinking being that if ZIL Transaction Groups are capped at 4Gb, then that leaves sufficient space for 2 groups, plus 2Gb default swap reservation, plus 1Gb of 'wiggle room'? If I do this to a 200Gb drive, then the remaining 189Gb becomes redundant blocks for the specified 11Gb, multiplying the write endurance of that 11Gb roughly 17 fold yes?

2nd-in-charge said:
Yep, single SLOG device could be added and removed. Two drives added at the same time were partitioned (when the Swap Size was default 2GiB), and added as a mirror. I could offline the individual drives, but not remove the log vdev.
Just before leaving work today I removed the log vdev again (using zpool remove), but couldn't run the slog test diskinfo -Sw (operation not permitted). Crudely re-plugged one of the drives, then it worked on that drive.

I suspect that I just ran into this last week during a FreeNAS crash (late Oct 2019) due to faulty NVMe drive(s), to currently being unable to remove my mirrored SLOG device, as described in https://www.ixsystems.com/community...-booting-and-cant-be-removed-from-pool.79966/.

HoneyBadger · Nov 4, 2019

SamM said:
I'm glad I saw this. I was wondering how to "overprovision" SSD's. So is taking a 200Gb (or whatever size greater than 11Gb) Intel SSD, and 'overprovisioning' it so that it appears to be an 11Gb SSD a good idea? The thinking being that if ZIL Transaction Groups are capped at 4Gb, then that leaves sufficient space for 2 groups, plus 2Gb default swap reservation, plus 1Gb of 'wiggle room'? If I do this to a 200Gb drive, then the remaining 189Gb becomes redundant blocks for the specified 11Gb, multiplying the write endurance of that 11Gb roughly 17 fold yes?

Endurance will be helped out by the fact that the writes can be spread out across the entirety of the drive, so you won't have "hot spots" on the NAND, ~~but the overall TBW endurance rating will still remain roughly the same~~ Edit: and modern controllers are apparently very good at benefiting from this spread, to the point where TBW estimates are doubling (or more!) when drives are heavily overprovisioned.

What it will more ~~noticeably~~ Edit: immediately do is improve the consistency of the write speeds, since there will be at most the 11G (in your example) of NAND pages that would be considered as "holding valid data" and everything else could be treated as free space; so after the 11G is used, the controller will still have 189G of "free space" left that it can write to at full speed while furiously performing garbage collection/background TRIM operations on the first 11G to make it ready again later.

Regarding the how-to, Intel has a nice little whitepaper on this - for their drives specifically you can use the isdct tool, but third-party ones like hdparm perform the same function. [PDF warning]

https://www.intel.com/content/dam/w...nd-based-ssds-better-endurance-whitepaper.pdf

JeffNAS · Nov 21, 2023

Here is an interesting YouTube discussion by level1techs on using Intel optane for caching and as a meta data special device.

Intel Optane is DEAD. Long Live Intel Optane! Buy them cheap right now...

Intel Optane may be dead, but it's still useful. Wendell shows us the huge difference it makes when used as part of your ZFS setup!Cheap Optane here: https:/...

youtu.be

HoneyBadger · Nov 21, 2023

Wendell is a great resource for this, and many other things ZFS.

And yes, some of us may have been hoarding Optane drives for a bit now.

NugentS · Nov 21, 2023

I have plenty of spares for my own use - just in case

Christian K · Dec 10, 2023

jgreco said:
Sync writes involve immediately writing the update to the ZIL (whether in-pool or SLOG), and then proceeding as though the write was async, inserting the write into the transaction group along with all the other writes.

Thank you for that great article. I am on my journey learning why NFS write speed to my two mirror-mode vdevs SSD pool is slower than my anticipation. I do not use a SLOG device as of now.

I understand that storing the intent log on a separate SLOG device moves transactions away from the main pool, and therefore speeding up sync writes from a client's perspective. If I understand correctly, the ZIL holds the data to be written in sync mode.
However, for the case of an in-pool ZIL, the sync write transactions block (or at least compete with) other pool activity anyways. Since the data must be written from the ZIL to the pool afterwards, this takes something like twice the toll on pool transactions. I do not understand why an in-pool ZIL is being used, instead of directly commiting the "sync" writes to the pool, and possibly stalling "async" writes. This would reduce load on the pool by avoiding "double writes".

jgreco · Dec 10, 2023

Christian K said:
I do not understand why an in-pool ZIL is being used, instead of directly commiting the "sync" writes to the pool, and possibly stalling "async" writes. This would reduce load on the pool by avoiding "double writes".

Because ZFS needs to commit the entire current transaction group that is developing in the write cache. What you suggest would be horribly poor performance-wise. For stuff like overwrites, a copy-on-write system like ZFS isn't merely laying down the data on top of whatever data might have been there before. Small pool writes especially to RAIDZ vdevs are a real mess. This would tend to have the effect of stalling out pool traffic on HDD pools.

Ericloewe · Dec 10, 2023

Keep in mind that sync writes are blocking everything the client is doing in their thread of execution (that's why they're sync writes). Every microsecond of added delay before the write can be acknowledged is time 100% wasted on the client.

JeffNAS · Dec 11, 2023

cyberjock said:
I'm trying to find proof that istgt even has sync writes in any form but all I'm seeing is comments about asynchronous writes. I'm trying to find something that explicitly mentions that sync writes do not exist for istgt. Google don't fail me now!

Edit: Just to clarify I'm doing this because if you read up on how you make iSCSI have sync writes you actually make the dataset or pool have exclusively sync writes by setting sync=always. That's not an iSCSI setting, which tends to support all the other information that iSCSI just doesn't have a "sync write" in any form... or at least not in the version that's in FreeNAS.

What do you mean by istgt? iSCSI Target???

HoneyBadger · Dec 12, 2023

JeffNAS said:
What do you mean by istgt? iSCSI Target???

istgt is the legacy iSCSI target daemon that was used in FreeNAS - if you'll note the date on the post you quoted, it's over a decade old at this point.

The logic stands though - in order to enforce sync writes for remote iSCSI initiators, you should set sync=always on the zvol (or parent dataset of the zvols, if you're creating them programmatically)

Important Announcement for the TrueNAS Community.

Some insights into SLOG/ZIL with ZFS on FreeNAS

jgreco

Resident Grinch

2nd-in-charge

Explorer

2nd-in-charge

Explorer

SamM

Dabbler

HoneyBadger

actually does care

JeffNAS

Dabbler

Intel Optane is DEAD. Long Live Intel Optane! Buy them cheap right now...

HoneyBadger

actually does care

NugentS

MVP

Christian K

Dabbler

jgreco

Resident Grinch

Ericloewe

Server Wrangler

JeffNAS

Dabbler

HoneyBadger

actually does care

Similar threads

Important Announcement for the TrueNAS Community.

Some insights into SLOG/ZIL with ZFS on FreeNAS

Resident Grinch

Explorer

Explorer

Dabbler

actually does care

Dabbler

actually does care

MVP

Dabbler

Resident Grinch

Server Wrangler

Dabbler

actually does care

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Some insights into SLOG/ZIL with ZFS on FreeNAS"

Similar threads