Random write performance within a VM


David E

Contributor
Joined
Nov 1, 2013
Messages
119
Quick question: when sync=disabled, the QD1 random writes listed on the bottom line of CrystalDiskMark are being acked and returned as soon as they hit the RAM ZIL, correct? So really this is a measurement of stack latency between the two servers. Are there any ways to improve this?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
"as soon as they hit the RAM ZIL" ... that doesn't begin to make any sense.

https://forums.freenas.org/index.php?threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

Write performance with a SLOG is always a lot less than it is without. Here's some idea of what's happening under the hood when you request a sync write:

A sync write starts at the client, and has to make a very complicated round trip, in lockstep, for EACH WRITE REQUEST. The "sync" part of "sync write" means that the client is requesting that the current data block be confirmed as written to disk before the write() system call returns to the client. Without sync writes, a client is free to just stack up a bunch of write requests, which can then be sent over a slowish channel and arrive when they can. Look at the layers:

Client initiates a write syscall
Client filesystem processes request
Filesystem hands this off to the network stack as NFS or iSCSI
Network stack hands this packet off to network silicon
Silicon transmits to switch
Switch transmits to NAS network silicon
NAS network silicon throws an interrupt
NAS network stack processes packet
Kernel identifies this as a NFS or iSCSI request and passes to appropriate kernel thread
Kernel thread passes request off to ZFS
ZFS sees "sync request", sees an available SLOG device
ZFS pushes the request to the SAS device driver
Device driver pushes to LSI SAS silicon
LSI SAS chipset serializes the request and passes it over the SAS topology
SAS or SATA SSD deserializes the request
SSD controller processes the request and queues for commit to flash
SSD controller confirms request
SSD serializes the response and passes it back over the SAS topology
LSI SAS chipset receives the response and throws an interrupt
SAS device driver gets the acknowledgment and passes it up to ZFS
ZFS passes acknowledgement back to kernel NFS/iSCSI thread
NFS/iSCSI thread generates an acknowledgement packet and passes it to the network silicon
NAS network silicon transmits to switch
Switch transmits to client network silicon
Client network silicon throws an interrupt
Client network stack receives acknowledgement packet and hands it off to filesystem
Filesystem says "yay, finally, what took you so long" and releases the syscall, allowing the client program to move on.

That's what happens for EACH sync write request.
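
If you want a feel for what that lockstep costs from the client's point of view, here's a rough sketch (Python; the mount path is a placeholder, not anything from this setup) that times plain buffered writes against O_SYNC writes to the same file. The exact numbers don't matter, only the gap between the two:

Code:
import os, time

# Contrast buffered writes with O_SYNC writes, where each write() blocks
# until the server acknowledges stable storage -- the round trip above.
PATH = "/mnt/nas/latency_test.bin"   # hypothetical NFS/iSCSI-backed path
BLOCK = os.urandom(4096)             # one 4 KiB block, like a QD1 4K write
COUNT = 1000

def timed_writes(extra_flags, label):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | extra_flags, 0o644)
    start = time.monotonic()
    for _ in range(COUNT):
        os.write(fd, BLOCK)          # with O_SYNC, returns only after the ack
    os.close(fd)
    elapsed = time.monotonic() - start
    print(f"{label}: {COUNT / elapsed:.0f} writes/s, "
          f"{elapsed / COUNT * 1e6:.0f} us per write")

timed_writes(0, "buffered")          # client stacks up writes, acks come later
timed_writes(os.O_SYNC, "O_SYNC")    # full round trip for EACH write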

The trick to increasing write performance is to optimize as many of these steps as possible. One of the better optimizations possible is to get rid of SAS from the equation and go straight to NVMe, which eliminates some handling in the middle. For your particular case, with that 2603 CPU, another optimization is to get a better CPU that allows the NAS to process things faster. It isn't clear to me HOW much faster that would make it, just that it's a thing that affects performance.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
"as soon as they hit the RAM ZIL" ... that doesn't begin to make any sense.

https://forums.freenas.org/index.php?threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

Write performance with a SLOG is always a lot less than it is without. Here's some idea of what's happening under the hood when you request a sync write:

A sync write starts at the client, and has to make a very complicated round trip, in lockstep, for EACH WRITE REQUEST. The "sync" part of "sync write" means that the client is requesting that the current data block be confirmed as written to disk before the write() system call returns to the client. Without sync writes, a client is free to just stack up a bunch of write requests, which can then be sent over a slowish channel and arrive when they can. Look at the layers:

Client initiates a write syscall
Client filesystem processes request
Filesystem hands this off to the network stack as NFS or iSCSI
Network stack hands this packet off to network silicon
Silicon transmits to switch
Switch transmits to NAS network silicon
NAS network silicon throws an interrupt
NAS network stack processes packet
Kernel identifies this as a NFS or iSCSI request and passes to appropriate kernel thread
Kernel thread passes request off to ZFS
ZFS sees "sync request", sees an available SLOG device
ZFS pushes the request to the SAS device driver
Device driver pushes to LSI SAS silicon
LSI SAS chipset serializes the request and passes it over the SAS topology
SAS or SATA SSD deserializes the request
SSD controller processes the request and queues for commit to flash
SSD controller confirms request
SSD serializes the response and passes it back over the SAS topology
LSI SAS chipset receives the response and throws an interrupt
SAS device driver gets the acknowledgment and passes it up to ZFS
ZFS passes acknowledgement back to kernel NFS/iSCSI thread
NFS/iSCSI thread generates an acknowledgement packet and passes it to the network silicon
NAS network silicon transmits to switch
Switch transmits to client network silicon
Client network silicon throws an interrupt
Client network stack receives acknowledgement packet and hands it off to filesystem
Filesystem says "yay, finally, what took you so long" and releases the syscall, allowing the client program to move on.

That's what happens for EACH sync write request.

The trick to increasing write performance is to optimize as many of these steps as possible. One of the better optimizations possible is to get rid of SAS from the equation and go straight to NVMe, which eliminates some handling in the middle. For your particular case, with that 2603 CPU, another optimization is to get a better CPU that allows the NAS to process things faster. It isn't clear to me HOW much faster that would make it, just that it's a thing that affects performance.

I appreciate the detailed reply, and I am familiar with this flow, but I believe you misread my question: I asked specifically about optimizing the call flow when sync=disabled on the dataset. I also realize my terminology was wrong; ZIL should have been "transaction group," or whatever data structure in RAM the data lands in before the call returns. In particular, it seems that the high-teens/low-20s numbers from QD1 random reads/writes are peak numbers, and once sync is turned on and a SLOG is introduced, writes will only get slower. So I'm curious whether there is any way to reduce latency through the path, via NIC tweaks or anything else that can help.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
In general, if you look at what is going on with a NAS, the way that you get throughput with it is to pipeline I/O at the remote. QD=1 is exactly the OPPOSITE of that, i.e. not pipelining I/O. So you're basically causing traversal of the vast majority of those steps for each write. Whether you're writing to a SLOG device or merely putting it into the transaction group pending a write, you're not doing the thing that causes throughput to happen, which is to get multiple balls in the air simultaneously.
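
As a rough illustration of what "multiple balls in the air" buys you, here's a sketch (Python; the path and sizes are made up, and a thread pool just stands in for whatever the client/initiator would do to keep requests outstanding):

Code:
import os, time, random
from concurrent.futures import ThreadPoolExecutor

PATH = "/mnt/nas/qd_test.bin"       # hypothetical test file on the share
BLOCK = os.urandom(4096)
COUNT = 2000
SLOTS = 256 * 1024                  # spread random 4 KiB writes over ~1 GiB

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)

def one_write(_):
    # a single random 4 KiB sync write; it pays the whole round trip alone
    os.pwrite(fd, BLOCK, random.randrange(SLOTS) * 4096)

def run(depth, label):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=depth) as pool:
        list(pool.map(one_write, range(COUNT)))
    print(f"{label}: {COUNT / (time.monotonic() - start):.0f} IOPS")

run(1, "QD=1")     # one request in flight: bounded by per-request latency
run(32, "QD=32")   # 32 in flight: the round trips overlap instead of stacking
os.close(fd)

The QD=1 run is latency-bound no matter how fast the pool is; the QD=32 run overlaps those round trips, which is where throughput comes from.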
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
In general, if you look at what is going on with a NAS, the way that you get throughput with it is to pipeline I/O at the remote. QD=1 is exactly the OPPOSITE of that, i.e. not pipelining I/O. So you're basically causing traversal of the vast majority of those steps for each write. Whether you're writing to a SLOG device or merely putting it into the transaction group pending a write, you're not doing the thing that causes throughput to happen, which is to get multiple balls in the air simultaneously.

Totally understand; this is basically the worst-case operation, which is why I'm trying to understand what, if anything, can be done to improve it.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Totally understand; this is basically the worst-case operation, which is why I'm trying to understand what, if anything, can be done to improve it.

Change to QD=32 or find other ways to increase parallelism. QD=1 strikes me as a particularly meaningless test.
 