Anyone know what the effect of sysctl vfs.zfs.zil.maxblocksize is?

plumpsada

Cadet
Joined
Feb 9, 2022
Messages
6
The version I'm using is TrueNAS CORE 12.0-U7.
I'm trying to improve SLOG performance.
I tried changing the value of vfs.zfs.zil.maxblocksize. The default value is 131072 (128K); I changed it to 1048576 (1M).
But after this change, any sync write results in a kernel panic.
It's nearly 100% reproducible.
Steps to reproduce:
1. Set the sysctl variable vfs.zfs.zil.maxblocksize to 1048576 in the GUI under System -> Tunables.
2. Reboot the system.
3. Try a sync write on a dataset with sync enabled (e.g. dd with oflag=sync), as in the example below.
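For reference, a minimal way to trigger it, assuming a pool and dataset I'll call tank/synctest (names are placeholders), is to force sync on the dataset and then write to it:

zfs set sync=always tank/synctest
dd if=/dev/zero of=/mnt/tank/synctest/testfile bs=1M count=1024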


Am I doing something wrong or should I report a bug?

I've seen a few different kernel backtraces:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 18; apic id = 12
instruction pointer = 0x20:0xffffffff80ad1168
stack pointer = 0x28:0xfffffe0198b53f80
frame pointer = 0x28:0xfffffe0198b54020
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2692 (reader#4)
trap number = 9
panic: general protection fault
cpuid = 18
time = 1644398971
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0198b53c90
vpanic() at vpanic+0x17b/frame 0xfffffe0198b53ce0
panic() at panic+0x43/frame 0xfffffe0198b53d40
trap_fatal() at trap_fatal+0x391/frame 0xfffffe0198b53da0
trap() at trap+0x67/frame 0xfffffe0198b53eb0
calltrap() at calltrap+0x8/frame 0xfffffe0198b53eb0
--- trap 0x9, rip = 0xffffffff80ad1168, rsp = 0xfffffe0198b53f80, rbp = 0xfffffe0198b54020 ---
_sx_xlock_hard() at _sx_xlock_hard+0x168/frame 0xfffffe0198b54020
rrw_enter_read_impl() at rrw_enter_read_impl+0x27/frame 0xfffffe0198b54050
zfs_root() at zfs_root+0x32/frame 0xfffffe0198b54090
lookup() at lookup+0x914/frame 0xfffffe0198b54130
namei() at namei+0x414/frame 0xfffffe0198b54200
kern_statfs() at kern_statfs+0x50/frame 0xfffffe0198b54310
sys_statfs() at sys_statfs+0x39/frame 0xfffffe0198b54340
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe0198b54470
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0198b54470
--- syscall (555, FreeBSD ELF64, sys_statfs), rip = 0x8004e71ea, rsp = 0x7fffde9f2ec8, rbp = 0x7fffde9f30d0 ---
KDB: enter: panic


Fatal trap 12: page fault while in kernel mode
cpuid = 19; apic id = 13
fault virtual address = 0xfffffe01737ff000
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80fddb30
stack pointer = 0x28:0xfffffe017717abb0
frame pointer = 0x28:0xfffffe017717abb0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 3228 (dd)
trap number = 12
panic: page fault
cpuid = 19
time = 1644401012
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe017717a870
vpanic() at vpanic+0x17b/frame 0xfffffe017717a8c0
panic() at panic+0x43/frame 0xfffffe017717a920
trap_fatal() at trap_fatal+0x391/frame 0xfffffe017717a980
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe017717a9d0
trap() at trap+0x286/frame 0xfffffe017717aae0
calltrap() at calltrap+0x8/frame 0xfffffe017717aae0
--- trap 0xc, rip = 0xffffffff80fddb30, rsp = 0xfffffe017717abb0, rbp = 0xfffffe017717abb0 ---
memmove_erms() at memmove_erms+0x140/frame 0xfffffe017717abb0
zil_commit_impl() at zil_commit_impl+0x9e7/frame 0xfffffe017717ad10
zfs_write() at zfs_write+0x995/frame 0xfffffe017717aed0
VOP_WRITE_APV() at VOP_WRITE_APV+0xeb/frame 0xfffffe017717afe0
vn_write() at vn_write+0x261/frame 0xfffffe017717b060
vn_io_fault_doio() at vn_io_fault_doio+0x43/frame 0xfffffe017717b0c0
vn_io_fault1() at vn_io_fault1+0x15c/frame 0xfffffe017717b210
vn_io_fault() at vn_io_fault+0x185/frame 0xfffffe017717b280
dofilewrite() at dofilewrite+0xb0/frame 0xfffffe017717b2d0
sys_write() at sys_write+0xc0/frame 0xfffffe017717b340
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe017717b470
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe017717b470
--- syscall (4, FreeBSD ELF64, sys_write), rip = 0x800424bca, rsp = 0x7fffffffea48, rbp = 0x7fffffffea80 ---
KDB: enter: panic



panic: vm_fault: fault on nofault entry, addr: 0xfffffe04ba8c0000
cpuid = 29
time = 1644288869
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0190ff8160
vpanic() at vpanic+0x17b/frame 0xfffffe0190ff81b0
panic() at panic+0x43/frame 0xfffffe0190ff8210
vm_fault() at vm_fault+0x24f2/frame 0xfffffe0190ff8360
vm_fault_trap() at vm_fault_trap+0x60/frame 0xfffffe0190ff83a0
trap_pfault() at trap_pfault+0x19c/frame 0xfffffe0190ff83f0
trap() at trap+0x286/frame 0xfffffe0190ff8500
calltrap() at calltrap+0x8/frame 0xfffffe0190ff8500
--- trap 0xc, rip = 0xffffffff80fddec0, rsp = 0xfffffe0190ff85d0, rbp = 0xfffffe0190ff85d0 ---
memcpy_erms() at memcpy_erms+0x130/frame 0xfffffe0190ff85d0
dmu_read_impl() at dmu_read_impl+0x118/frame 0xfffffe0190ff8640
dmu_read() at dmu_read+0x45/frame 0xfffffe0190ff8680
zfs_get_data() at zfs_get_data+0x167/frame 0xfffffe0190ff8710
zil_commit_impl() at zil_commit_impl+0xb1e/frame 0xfffffe0190ff8870
zfs_write() at zfs_write+0x995/frame 0xfffffe0190ff8a30
VOP_WRITE_APV() at VOP_WRITE_APV+0xeb/frame 0xfffffe0190ff8b40
nfsvno_write() at nfsvno_write+0x184/frame 0xfffffe0190ff8bf0
nfsrvd_write() at nfsrvd_write+0x532/frame 0xfffffe0190ff8f40
nfsrvd_dorpc() at nfsrvd_dorpc+0x122e/frame 0xfffffe0190ff9110
nfssvc_program() at nfssvc_program+0x5d3/frame 0xfffffe0190ff92e0
svc_run_internal() at svc_run_internal+0x98e/frame 0xfffffe0190ff9420
svc_thread_start() at svc_thread_start+0xb/frame 0xfffffe0190ff9430
fork_exit() at fork_exit+0x7e/frame 0xfffffe0190ff9470
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0190ff9470
--- trap 0xc, rip = 0x8002e1b2a, rsp = 0x7fffffffe578, rbp = 0x7fffffffe810 ---
KDB: enter: panic
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Am I doing something wrong or should I report a bug?
I wouldn't report a bug if I were messing with settings, changing them from their defaults, and then experiencing problems as a direct result.

What exactly do you expect to get better by putting a different number in that setting, if the blocksize for that SLOG device was set to match 128K when it was created?

I suppose you need to:
- remove the SLOG
- set that setting and reboot, if that's needed
- add a new SLOG (which will presumably then be set up with the correct blocksize... maybe you'll even find something baked into the middleware that sets it to 128K, so you may need to do it at the CLI and work out how to ensure you're adding a SLOG with the right blocksize; see the rough sketch at the end of this post)

That's all a guess, so you'll need to be prepared for it not to work, but I can see the logic, maybe you can too.
I don't know a lot about it but this command may help you to see the current ashift of the slog...
zdb -l /dev/da3p2 | egrep 'ashift| name'
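If you end up doing that at the CLI, a rough sketch might look something like this (pool name and device are placeholders; check zpool status for your own):

zpool remove tank da3p2
sysctl vfs.zfs.zil.maxblocksize=1048576
zpool add tank log da3p2

That's assuming a single-device log vdev; a mirrored log would be removed by its mirror-N name instead.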
 

plumpsada

Cadet
Joined
Feb 9, 2022
Messages
6
I wouldn't report a bug if I were messing with settings, changing them from their defaults, and then experiencing problems as a direct result.

What exactly do you expect to get better by putting a different number in that setting, if the blocksize for that SLOG device was set to match 128K when it was created?

I suppose you need to:
- remove the SLOG
- set that setting and reboot, if that's needed
- add a new SLOG (which will presumably then be set up with the correct blocksize... maybe you'll even find something baked into the middleware that sets it to 128K, so you may need to do it at the CLI and work out how to ensure you're adding a SLOG with the right blocksize; see the rough sketch at the end of this post)

That's all a guess, so you'll need to be prepared for it not to work, but I can see the logic, maybe you can too.
I just didn't expect a kernel panic from changing a setting. If changing this value with a SLOG attached is dangerous, then maybe ZFS should reject the change and print a warning. Also, even without a SLOG the ZIL is still in use, so I don't expect that removing the SLOG will change anything.
Many of these settings are important but not very well documented.
 

plumpsada

Cadet
Joined
Feb 9, 2022
Messages
6
I don't know a lot about it but this command may help you to see the current ashift of the slog...
zdb -l /dev/da3p2 | egrep 'ashift| name'
I don't think ashift has anything to do with my problem. An ashift mismatch should only result in degraded performance, not a kernel panic. This looks more like a buffer overflow somewhere in the ZFS code.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Seems like the name is already pretty good documentation...

It's the size of the largest block that ZFS will try to write to the ZIL (which is also your SLOG if you have one).

If your SLOG is in a pool where the dataset blocksize is larger than that value, I guess that means blocks over the limit get broken into smaller ones.

If you added your SLOG while that value was at its default, perhaps it is used to set the recordsize of the SLOG device, which would explain why, when you change it with that SLOG still in place, the system might try to write blocks larger than that device's recordsize to it, hence the panic. Ashift may give a clue about that, but maybe not.

I don't know if changing the value before adding the SLOG to the pool, as I proposed, will mean the recordsize of the SLOG device is matched appropriately, but I'm guessing that might be the case.

I would also question the value of doing it at all, as a lot of consumers of block storage use 32K blocks anyway, well under the current limit.
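Incidentally, if you want the system's own one-line description of the tunable alongside its current value, FreeBSD's sysctl can show both:

sysctl -d vfs.zfs.zil.maxblocksize
sysctl vfs.zfs.zil.maxblocksize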
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If changing this value with slog attached is dangerous, then maybe zfs should reject the change and give some warnings
Maybe you've misunderstood the tool you're working with... if you're editing tunables, it's expected you know what you're doing, so nobody will argue with you and what you set will happen... if you're wrong, then bad things can happen, there's no nanny running after you here... ZFS is a cold, hard mistress.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399

There should not ever be a need to change vfs.zfs.zil.maxblocksize. In your case, you've increased the memory needed to manage ZIL blocks x8 for no good reason, and this causes heap and stack collisions and subsequent kernel panics. What the heck are you trying to accomplish?
 

plumpsada

Cadet
Joined
Feb 9, 2022
Messages
6

There should not ever be a need to change vfs.zfs.zil.maxblocksize. In your case, you've increased the memory needed to manage ZIL blocks x8 for no good reason, and this causes heap and stack collisions and subsequent kernel panics. What the heck are you trying to accomplish?
I've read the code. From what I understand, the ZIL only writes one block at a time and waits for the underlying zio to complete before issuing the next one. With my mirrored SLOG of two Optane 900Ps, a single-stream sync write achieves less than 1 GB/s. The raw speed of a 900P is more than 2 GB/s, and with sync=disabled the pool write speed easily goes over 2 GB/s. So I thought increasing this value might improve ZIL write speed.
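For what it's worth, the comparison above is easy to watch live while the test runs (pool name is a placeholder):

zpool iostat -v tank 1
gstat -p

The first shows per-vdev throughput, including the log vdev; the second shows the raw throughput of the underlying disks.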
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
From what I understand, the ZIL only writes one block at a time and waits for the underlying zio to complete before issuing the next one.

This is correct.

So I thought increasing this value might improve ZIL write speed

Why would you think that? Twiddling this value isn't going to magically make blocks larger or reduce latency.

Block sizes are going to be whatever they are. The thing we can do battle with is latency. I discuss this in the SLOG post:


See the section about "Laaaaaaaaaaaaaaaatency".

What it really comes down to is that you can reduce latency by:

1) Using faster SLOG

2) Using faster networking

3) Optimizing client block sizes larger where practical (not really latency reduction, but rather reduced #txns)

4) Using better protocol (userland iSCSI becoming kernel iSCSI being the poster child)

5) Using faster CPU

6) Using faster memory
 

plumpsada

Cadet
Joined
Feb 9, 2022
Messages
6
Why would you think that? Twiddling this value isn't going to magically make blocks larger or reduce latency.
The calculation is simple. I tested with a dataset whose recordsize is 1M. If I can't improve the latency, then increasing the size of each write should improve the throughput.
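As a rough illustration (the latency figure here is made up just for the arithmetic): if one ZIL block write takes about 125 µs end to end, then 128 KiB per block caps a single stream at roughly 128 KiB / 125 µs ≈ 1 GB/s, while 1 MiB blocks at the same per-block latency would allow about 8 GB/s.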
Block sizes are going to be whatever they are. The thing we can do battle with is latency. I discuss this in the SLOG post:


See the section about "Laaaaaaaaaaaaaaaatency".

What it really comes down to is that you can reduce latency by:

1) Using faster SLOG
I'm limited to the available hardware. I also tried a ramdisk, without much improvement, so I've almost given up on this.
2) Using faster networking
I ran the test on the local machine, so it's not network related.
3) Optimizing client block sizes larger where practical (not really latency reduction, but rather reduced #txns)
The recordsize of the dataset I tested was set to 1M, and I tested with dd from bs=1M up to bs=32M (a sketch of that sweep is at the end of this post).
5) Using faster CPU

6) Using faster memory
As I said, I'm limited by the hardware, and disk I/O shouldn't be bound by a modern CPU or memory. ZIL compression has been mentioned, but as far as I know it isn't implemented yet.
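For reference, the sweep was along these lines, assuming a dataset with sync forced on as earlier (path and counts are just placeholders):

for bs in 1m 2m 4m 8m 16m 32m; do
  dd if=/dev/zero of=/mnt/tank/synctest/test.$bs bs=$bs count=512
done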
 

plumpsada

Cadet
Joined
Feb 9, 2022
Messages
6
Actually, I can live with the current performance by moving some less important workloads to a dataset with sync=disabled. The purpose of this post is to find out whether I should report a bug.
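For anyone doing the same, that's just a per-dataset property (dataset name is a placeholder):

zfs set sync=disabled tank/scratch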
 
