Fusion Pool (Special vDEV) and ZILs

hammermaster

Cadet
Joined
Jan 29, 2023
Messages
9
Hypothetical Pool Geometry
4 x HDDs in RAID-Z2 (data)
3 x NVMEs in mirror (metadata)
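For reference, that layout would be built with something like this (pool and device names are placeholders):
zpool create tank raidz2 sda sdb sdc sdd special mirror nvme0n1 nvme1n1 nvme2n1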

After a metadata vDEV has been created does it have its own ZIL? i.e. are there now two ZILs in the pool?

I ask this because I was thinking about adding 3 x SAS3 SSDs I have lying around to create a SLOG.

I cannot find clear ZFS documentation stating how the ZIL(s) will function in this scenario.

All writes to this pool will be SYNC writes.
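For clarity, I'd be forcing that at the dataset level with something like this (pool/dataset name is a placeholder):
zfs set sync=always tank/data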

So, am I correct to assume that the slower SLOG WILL NOT act as a ZIL for the faster NVMEs which contain metadata and some other small block data? The special vDEV has its own ZIL?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
After a metadata vDEV has been created does it have its own ZIL? i.e. are there now two ZILs in the pool?
No.

If you're worried about sync write performance, why are you using RAIDZ2? https://www.truenas.com/community/threads/the-path-to-success-for-block-storage.81165/

So, am I correct to assume that the slower SLOG WILL NOT act as a ZIL for the faster NVMEs which contain metadata and some other small block data? The special vDEV has its own ZIL?
No.

A SLOG, if present, takes over the ZIL function for the pool, meaning none of the pool's integral VDEVs will keep their own ZIL... that's kind of the whole premise of SLOG.

The intent of having one is to smooth out the performance degradation of sync writes, not to accelerate anything.
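For what it's worth, a SLOG is just a log VDEV added to the pool, something like this (pool and device names are placeholders):
zpool add tank log mirror sdx sdy
zpool status will then list it under a separate "logs" section rather than among the data or special VDEVs.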
 

hammermaster

Cadet
Joined
Jan 29, 2023
Messages
9
From what you're saying, then, the whole idea of adding a fast (NVME) metadata vDEV to a pool of spinning rust (HDDs) would never make sense in any situation. If one were to omit the SLOG entirely, that would mean the ZIL is on the HDDs. With that being said, any data (even metadata) would be written to the ZIL first (which is on the HDDs) and then copied to the final destination. So metadata would be written to the HDDs first, then to the NVMEs second. Data would be written to the HDDs first, then to another location on the same HDDs.

Something doesn't add up about your responses.

See the 2nd graphic in the following link:

 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Something doesn't add up about your responses.

See the 2nd graphic in the following link:
The second to last frame in that graphic explaining SLOG shows exactly what I said... with SLOG, there's no ZIL on pool VDEVs.

From what you're saying, then, the whole idea of adding a fast (NVME) metadata vDEV to a pool of spinning rust (HDDs) would never make sense in any situation.
I don't think you've thought through all the scenarios.

If one were to omit the SLOG entirely, that would mean the ZIL is on the HDDs
True. But it may also be on the NVMEs, depending on block size... it's not a decision you can make for ZFS to always select the NVMEs without setting the Metadata (Special) small block size to an unreasonably large number, which effectively ruins the concept of a hybrid pool at that point.
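For reference, the threshold in question is the special_small_blocks dataset property, set with something like this (dataset name is a placeholder):
zfs set special_small_blocks=64K tank/data
Push that value up to (or past) the dataset's recordsize and essentially every block qualifies for the special VDEV, which is what I mean by ruining the hybrid concept.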

With that being said, any data (even metadata) would be written to the ZIL first (which is on the HDDs) and then copied to the final destination. So metadata would be written to the HDDs first, then to the NVMEs second. Data would be written to the HDDs first, then to another location on the same HDDs.
Seems to me like you're not considering that the on-pool ZIL follows the same allocation rules as the data, so it will go to the respective VDEV.

If I try to re-interpret what you were saying in your OP, maybe that's what you meant?

Just to clarify a little more... SLOG must always be faster than all VDEVs in your pool or it won't really help.
 

hammermaster

Cadet
Joined
Jan 29, 2023
Messages
9
Ok, we're getting somewhere now.

Let's just consider the dataset configuration scenario where there is NO SLOG, ALL writes are SYNC writes, and Metadata (SPECIAL) Small Block Size is 64KB.

Now, two scenarios are possible depending on the Record Size of the Dataset:

a) Incoming record is 128KB
1. data undergoes TXG aggregation in ARC (RAM) and TXGs are periodically written (committed as TXGs) to the zpool final data area on the HDDs (spinning rust)
2. while TXG aggregation in RAM is occurring, data is simultaneously being written to the ZIL on the HDDs (spinning rust)
3. once the ZIL gets notified that the TXGs are written to their final destination, the ZIL can flush (delete) the data

b) Incoming record is 32KB
1. data undergoes TXG aggregation in ARC (RAM) and TXGs are periodically written (committed as TXGs) to the zpool final data area on the SSDs (NVME)
2. while TXG aggregation in RAM is occurring, data is simultaneously being written to the ZIL on the SSDs (NVME)
3. once the ZIL gets notified that the TXGs are written to their final destination, the ZIL can flush (delete) the data

Do we have agreement on these scenarios? There are essentially two locations where the ZIL can reside depending on the Record Size.
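For concreteness, the dataset settings I have in mind for these two cases would be something like this (pool/dataset names are placeholders):
zfs set recordsize=128K tank/data
zfs set special_small_blocks=64K tank/data
zfs set sync=always tank/data
so a full 128KB record is above the special_small_blocks cutoff and lands on the HDDs, while a 32KB record falls at or below it and lands on the NVMEs.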
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Do we have agreement on these scenarios? There are essentially two locations where the ZIL can reside depending on the Record Size.
Sounds reasonable...

In a pool with no special VDEV and multiple data VDEVs, the ZIL may go wherever ZFS decides is optimal, but where the pool has a special VDEV, the same rules of assignment are followed for the ZIL too.

I might have a slight disagreement with the "simultaneously" part of your point 2s. It's only after the TXGs are organized that they are written anywhere other than memory.

And thinking about it a little harder, I don't like the wording of point 3s either... it's not really a notification process. ZFS is aware as it's controlling all of the parts (RAM, ZIL, POOL).
 

hammermaster

Cadet
Joined
Jan 29, 2023
Messages
9
Ok, thanks for your reply. It's clear the flow of data in ZFS is complicated. But now let's consider the scenario where a SLOG is added. Make it a "slow" SLOG as well, like a SATA SSD.

1) Are there still two ZIL "areas" depending on Record Size? i.e. would "large" blocks hit the SATA SSD ZIL(SLOG) and "small" blocks hit the NVME SSD ZIL?

or

2) Is there now only one ZIL "area" on the SATA SSD (SLOG) regardless of Record Size? i.e. even "small" blocks hit the SATA SSD ZIL(SLOG) before being confirmed they've been written to the NVME SSD?

For #2 directly above, I wonder what implications that would have for the internal logic of ZFS if data is completely written to its final destination (data vDEV) before being completely written to the SLOG. In other words, what happens if the data vDEV is "faster" than the SLOG vDEV?
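For what it's worth, in this hypothetical the SSDs would just be added as a log VDEV, something like this (pool and device names are placeholders), and as far as I know a log VDEV can be removed again later with zpool remove if it turns out to be a bad idea:
zpool add tank log mirror sde sdf sdg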
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
1) Are there still two ZIL "areas" depending on Record Size? i.e. would "large" blocks hit the SATA SSD ZIL(SLOG) and "small" blocks hit the NVME SSD ZIL?
No. SLOG takes over all ZIL in the pool.

2) Is there now only one ZIL "area" on the SATA SSD (SLOG) regardless of Record Size? i.e. even "small" blocks hit the SATA SSD ZIL(SLOG) before being confirmed they've been written to the NVME SSD?
Yes.

For #2 directly above, I wonder what implications that would have for the internal logic of ZFS if data is completely written to its final destination (data vDEV) before being completely written to the SLOG. In other words, what happens if the data vDEV is "faster" than the SLOG vDEV?
Can't happen. The committed write won't be acknowledged back until the SLOG has written it, and ZFS will back off pool IO until the SLOG catches up.

Like I said already, SLOG is no good unless it's faster than ALL pool VDEVs.

I would direct you back to the linked post for some more thinking about how you might speed up your pool for sync writes.
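If you do test it, watching per-VDEV activity with something like zpool iostat -v tank 5 while the sync workload runs should make it pretty obvious whether the log device (or the RAIDZ2 VDEV) is the bottleneck.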
 

hammermaster

Cadet
Joined
Jan 29, 2023
Messages
9
Thanks for clarifying those scenarios. Documentation on special vDEV scenarios is really lacking, both from TrueNAS and ZFS in general. All I wanted to do was discuss a few of them in the forum so they're written down somewhere. It's important to know where you can cut corners and where you can't, and the implications of such.

The linked post you keep referring me to mentions nothing of special vDEVs. So, please take your own advice, read the post yourself, and don't jump to conclusions about what exactly I'm trying to achieve here. I would expect more from a moderator.

Cheers!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The linked post you keep referring me to mentions nothing of special vDEVs. So, please take your own advice, read the post yourself, and don't jump to conclusions about what exactly I'm trying to achieve here. I would expect more from a moderator.
I don't find that amusing.

The linked article talks about what it takes to design a pool for sync write performance, which is the only reason anyone would be considering SLOG.

What's most important to understand about it is that you need to use mirrors, not RAIDZ VDEVs, if you want to have any hope of keeping up with the required IOPS... remembering that RAIDZ gives you the IOPS of a single member disk of each VDEV.

If you feel that your use of the metadata/special VDEV somehow gets you around that problem, I suspect you'll find that it doesn't, but please share your findings either way for the benefit of others (which is the reason I keep mentioning the block storage link).
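As a rough sketch, a pool laid out for sync writes would use striped mirrors for data plus your special VDEV, something like this (pool and device names are placeholders):
zpool create tank mirror sda sdb mirror sdc sdd special mirror nvme0n1 nvme1n1 nvme2n1
rather than a single RAIDZ2 VDEV.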
 