Performance of sync writes

Status
Not open for further replies.

Ender117

Patron
Edit your VM settings, look under CPU Options/Advanced (not sure what they're calling it these days) for the "CPU Scheduling Affinity" field.

Quick Googling turned up this document for 5.1; I believe it's pretty similar in newer versions

https://pubs.vmware.com/vsphere-51/index.jsp?topic=/com.vmware.vsphere.resmgmt.doc/GUID-F40F901D-C1A7-43E2-90AF-E6F98C960E4B.html
OK, I think the FreeNAS system is not correctly detecting the NUMA nodes? If I understand the man page correctly, there should be at least 2 domains?
Code:
root@freenas:/tmp # sysctl -a | grep ndomain
vm.ndomains: 1
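For anyone wanting to check the same thing, these are the sysctls I'd look at (nothing authoritative here; also note that, as far as I know, ESXi only exposes a virtual NUMA topology to guests with 9 or more vCPUs unless numa.vcpu.min is lowered in the VM's advanced settings, so a smaller VM will report a single domain even on a dual-socket host):
Code:
# what the FreeBSD guest sees of the (virtual) NUMA topology
sysctl vm.ndomains          # number of memory domains detected
sysctl hw.ncpu              # number of vCPUs presented to the guest
sysctl -a | grep -i numa    # any other NUMA-related tunables/counters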
 

Ender117

Patron
Edit your VM settings, look under CPU Options/Advanced (not sure what they're calling it these days) for the "CPU Scheduling Affinity" field.

Quick Googling turned up this document for 5.1; I believe it's pretty similar in newer versions

https://pubs.vmware.com/vsphere-51/index.jsp?topic=/com.vmware.vsphere.resmgmt.doc/GUID-F40F901D-C1A7-43E2-90AF-E6F98C960E4B.html
I tried that; diskinfo -wS of the passthrough P3700 varies from 98 to 135 depending on the core count or affinity, but the iozone results are always ~4.5K no matter how I play with the CPU settings. Somehow virtualization adds another layer of overhead.
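In case anyone wants to reproduce the comparison, this is roughly what I've been running (nvd0 is just a placeholder for the passthrough NVMe device, and iozone needs to run from a directory on the pool under test):
Code:
# raw sync-write latency of the passthrough P3700 (writes to the raw device)
diskinfo -wS /dev/nvd0

# sync-write benchmark through ZFS, pinned to one vCPU to rule out scheduler hops
cpuset -l 0 iozone -a -s 512M -O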
 

Ender117

Patron
Upon some research, it seems that FreeBSD defaults to round-robin for NUMA allocation. I tried ZFS on Ubuntu, which should do first-touch allocation, and the speed got even worse, ~4.5K.
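For the Ubuntu test, the placement policy can at least be inspected and forced per process with numactl (standard Linux tooling, nothing ZFS-specific; the benchmark command is a placeholder):
Code:
numactl --hardware                         # show NUMA nodes, their CPUs and memory
numactl --localalloc <benchmark cmd>       # force local (first-touch style) allocation
numactl --interleave=all <benchmark cmd>   # force round-robin across nodes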

Unless someone else has a better idea, I guess I'll just give up. For now I have a pool with 4K QD1 write IOPS of 20K+ and a capacity of 8T (to avoid fragmentation, I plan to keep the pool less than 50% full). Not bad, I would say. If I build another FreeNAS box in the future, I'll make sure to get a single-CPU system.

Update: I should also point out that I found (by watching zpool iostat -v) that for every block you write, ZFS writes another 4K block (presumably checksum/metadata) alongside it into the SLOG. So for a user thread writing 4K blocks at ~80MB/s (~20K IOPS), the SLOG is writing 8K blocks at ~160MB/s, still ~20K IOPS. This holds until 128K (IIRC), where ZFS starts to split the data into two 64+4=68K blocks. This might help explain part of the overhead I saw.
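Rough sanity check of those numbers (assuming a flat 4K of overhead per record, which is just my reading of the iostat output):
Code:
# 20,000 IOPS * 4 KiB user data               -> ~80 MB/s at the application
# 20,000 IOPS * (4 KiB data + 4 KiB overhead) -> ~160 MB/s hitting the SLOG
echo "$((20000 * 4 * 1024 / 1000000)) MB/s user, $((20000 * 8 * 1024 / 1000000)) MB/s SLOG"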
 

Stux

MVP
Multiple threads are sending writes to ZFS, and multiple threads will commit a transaction group to a pool; but as far as writing into the transaction groups, the timing needs to be incredibly tightly controlled for integrity reasons; basically, when you get a "write" command, there's no way (quantum computing aside) to know whether or not the very next command you receive is "read back that address I just wrote to."

I’m not sure if this is correct. The data is in RAM... is it the ARC? Is it the txg? There are issues of atomicity. Can you read the data before it’s written? Asking to read the data before it’s sync written is an upper-layer error, and thus it doesn’t matter whether the soon-to-be-overwritten contents or the new contents are returned
 

Ender117

Patron
I’m not sure if this is correct. The data is in RAM... is it the ARC? Is it the txg? There are issues of atomicity. Can you read the data before it’s written? Asking to read the data before it’s sync written is an upper-layer error, and thus it doesn’t matter whether the soon-to-be-overwritten contents or the new contents are returned

Hi Stux, while you are here, would you mind run iozone -a -s 512M -O on your P3700 backed pool? It would help me decide whether there are something I did wrong or this is the natural overhead of ZFS
 

Stux

MVP
Haven’t upgraded to 10.1 yet so can’t. It’s on my todo list
 

Ender117

Patron
Haven’t upgraded to 10.1 yet so can’t. It’s on my todo list
I see. Let me know when you do that. ;)
And I think you mean update to 11.1?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
I’m not sure if this is correct. The data is in RAM... is it the ARC? Is it the txg? There are issues of atomicity. Can you read the data before it’s written? Asking to read the data before it’s sync written is an upper-layer error, and thus it doesn’t matter whether the soon-to-be-overwritten contents or the new contents are returned

We're definitely getting off topic now. ;)

The data is always in RAM in a txg - if it's an async write, ZFS will then respond "complete" to the waiting process and not write it anywhere else, but if it's a sync write, ZFS will send the write to the in-pool ZIL or SLOG and wait for it to be confirmed as committed to stable storage before responding "complete" to the waiting process.
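If you want to watch the two paths side by side, a quick way (dataset and pool names here are just examples) is to flip the sync property and watch the log device:
Code:
zfs set sync=disabled tank/test   # async path: acknowledged once the write is in the txg in RAM
zfs set sync=always tank/test     # sync path: every write also goes to the ZIL/SLOG first
zpool iostat -v tank 1            # the log vdev only shows traffic under the sync path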

At this point, ZFS knows what the "current state" of the pool is - even though it hasn't been written to disk. It knows that block #123 is about to receive value "ABC", and even though that block may not be committed for up to five seconds, any subsequent reads from #123 will return the value "ABC" - because those reads are occurring after the write, and ZFS knows that the "current state" is what's being requested; the only way to get "prior state" would be to pull from a snapshot, or have the txg fail and be completely rolled back.
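To put that concretely (a trivial illustration with a made-up path, not a benchmark):
Code:
# the read must return the new contents even though the txg holding this write
# may not be committed to the pool for several more seconds
echo "ABC" > /mnt/tank/testfile    # lands in the open txg (and the ZIL/SLOG if it's a sync write)
cat /mnt/tank/testfile             # returns "ABC" immediately, served from RAM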

Asking to read the data before it’s sync written is an upper-layer error, and thus it doesn’t matter whether the soon-to-be-overwritten contents or the new contents are returned

Not if the ZFS txg fill process was multithreaded. The upstream application would send the write first, then the read a microsecond later, and if the two operations were split on the ZFS side of things, the write might take longer - it's doing more work than the read, after all. And the upstream application doesn't know (and shouldn't need to know or care) about the storage system's behavior - "write data and immediately read back" should give exactly the data back that you just wrote. Race conditions and storage don't mix.
 