Redundant SSDs for both SLOG and Metadata VDEVs

bullerwins

Dabbler
Joined
Mar 15, 2022
Messages
43
Hi!

I've built a TrueNAS SCALE system with 4 HDDs, and I have 2 cheap 128GB SSDs that I plan to use for a SLOG vdev, to be able to turn sync on and speed up writes.
I wonder if it's possible to "partition" the SSDs to also use them as a metadata vdev and make use of the leftover space; since I have a 10Gbit network, the SLOG wouldn't use more than about 5GB of space anyway.

My use case for the system is both backup and archival of files, so a metadata vdev would be good for indexing files and making searches faster. But I also want to ingest video and photo files and edit from there; that's why I had the SLOG idea in the first place, to speed up those writes in a "safer" way, using sync=on.
I just don't want to spend more on additional drives, nor do I have the physical space to add more SSDs/NVMe drives to have both SLOG and metadata vdevs.

So do I have to choose, or can I have both on the same set of SSDs?

System specs:
Intel i7 920
24GB DDR3 RAM
10Gbit Asus NIC
4x 14TB HDDs (3x WD, 1x Seagate) (RAIDZ1)
2x 128GB Intenso SSDs

Thanks!
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
1.) I'd use ECC RAM (and therefore ditch the i7-920).
2.) Consumer SSDs won't work (very long). Get used server grade SSDs with high write endurance/several drive writes per day. Never forget: As soon as your special vdev is gone, your pool will be toast.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
that I plan to use for a SLOG vdev, to be able to turn sync on and speed up writes.
Sync always slows writes. It slows them less with a SLOG than without one, but sync writes are always slower than async.
I wonder if it's possible to "partition" the SSDs to also use them as a metadata vdev
Possible, but it'd all be manual work at the CLI and thus not recommended.
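Purely to illustrate what that manual work would look like (again, not recommended, not something the GUI will track for you, and all pool/device names below are placeholders), it'd be roughly:
Code:
# partition each SSD: a small slice for SLOG, the rest for the special vdev
sgdisk -n1:0:+16G -n2:0:0 /dev/sdX
sgdisk -n1:0:+16G -n2:0:0 /dev/sdY
# add both vdevs, mirrored, using the partitions
zpool add poolname log mirror /dev/sdX1 /dev/sdY1
zpool add poolname special mirror /dev/sdX2 /dev/sdY2
(Using /dev/disk/by-id/ paths would be safer than sdX/sdY, but you get the idea.)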
 

bullerwins

Dabbler
Joined
Mar 15, 2022
Messages
43
1.) I'd use ECC RAM (and therefore ditch the i7-920).
2.) Consumer SSDs won't work (very long). Get used server grade SSDs with high write endurance/several drive writes per day. Never forget: As soon as your special vdev is gone, your pool will be toast.
1) ECC is something I'm really considering, but this is a system I put together from spare parts to learn on. It should be stable, but ECC is definitely in my upgrade path, yes.
2) Related to the first point: it's a system to learn on, so enterprise gear is in the upgrade path too. The SSDs are 15€ 128GB drives that I got to test the SLOG and metadata capabilities. Once I get the grasp of it, I'll consider server-grade SSDs.
Regarding the "never forget": you mean that if I use a couple of SSDs for SLOG or for a metadata vdev and they break, the pool no longer works? Doesn't it just keep working without that functionality?

But you didn't really answer my question of whether it's possible to use the same pair of SSDs for both SLOG and metadata.

Sync always slows writes. It slows them less with a SLOG than without one, but sync writes are always slower than async.

Possible, but it'd all be manual work at the CLI and thus not recommended.
I kind of knew it, but I was thinking in terms of being capped by the SSDs' write speed; I expected sync writes to land somewhere near the speed of the SSDs. I feel I'd sleep better with the "kinda-fast-but-more-secure" speed of sync with SLOG SSDs than the "as-fast-as-possible" speed of async.

I get you: CLI work, and it might even break on TrueNAS/ZFS version upgrades.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I kind of knew it, but I was thinking in terms of being capped by the SSDs' write speed.
Then no, you didn't know it. Even if you put SLOG on a ramdisk (which would be silly, but interesting for testing purposes), sync writes will be slower than async. And if you aren't using enterprise SSDs with power loss protection, the "more secure" you think you're getting is illusory. Put the system on a good UPS, set up monitoring, and call it a day.
You mean that if I use a couple of SSDs for SLOG or for a metadata vdev and they break, the pool no longer works? Doesn't it just keep working without that functionality?
SLOG is not essential to the pool--if it dies, ZFS goes back to storing the ZIL on the pool disks itself. Slow, but no harm done. But a metadata vdev is like any other vdev, an essential part of your pool. If it dies, your pool dies.
 

bullerwins

Dabbler
Joined
Mar 15, 2022
Messages
43
Then no, you didn't know it. Even if you put SLOG on a ramdisk (which would be silly, but interesting for testing purposes), sync writes will be slower than async. And if you aren't using enterprise SSDs with power loss protection, the "more secure" you think you're getting is illusory. Put the system on a good UPS, set up monitoring, and call it a day.

SLOG is not essential to the pool--if it dies, ZFS goes back to storing the ZIL on the pool disks itself. Slow, but no harm done. But a metadata vdev is like any other vdev, an essential part of your pool. If it dies, your pool dies.
Thanks for all the input!

I already have it on a UPS, so async would be fine to use then?

SLOG not essential, got it. Metadata vdev important, got it.

Would it make more sense in my case, then, to use the 2x SSDs as a metadata vdev (in mirror) and use async, since I have a UPS?

If 1 SSD dies, I replace it and I'm good, right?

If 1 dies, can I "gracefully" remove the good one while I replace the faulty one, to avoid having the system run on only 1 disk of the metadata vdev while I replace it?

Edit: I found that you cannot delete a metadata vdev from the pool; it's like any other disk in the data vdev, only that it stores metadata instead of data.
Edit2: I found that if I add a metadata vdev it will only add the metadata for new data, not for the files that already live in the pool.
 
Last edited:

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Would it make more sense in my case, then, to use the 2x SSDs as a metadata vdev (in mirror) and use async, since I have a UPS?
I'd think so.
If 1 dies, can I "gracefully" remove the good one while I replace the faulty one, to avoid having the system run on only 1 disk of the metadata vdev while I replace it?
The two SSDs would comprise one (mirrored) metadata vdev. As with any other mirrored vdev, you can replace devices in that vdev while it's still operational (if degraded).
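If you ever do the replacement from the shell rather than the GUI, it's roughly this (pool and disk identifiers below are placeholders):
Code:
zpool replace poolname <failed-disk-id> <new-disk-id>
zpool status -v poolname   # watch the resilver progress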
Edit: I found that you cannot delete a metadata vdev from the pool; it's like any other disk in the data vdev, only that it stores metadata instead of data.
Correct.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
[...]

If 1 SSD dies, I replace it and I'm good, right?

If 1 dies, can I "gracefully" remove the good one while I replace the faulty one, to avoid having the system run on only 1 disk of the metadata vdev while I replace it?

[...]

Please keep in mind: there are more ways for a file server to go south than "just" an SSD dying. You are (wrongly, IMHO) assuming that everything else will be fine and unaffected.

Ceterum censeo: I'd _never_ use anything for [meta]data (not even for playing, because it's no fun at all) that is

a) comparably slow,
b) insecure / not trustworthy (not even product info about TBW/DWPD exists for those Intenso SSDs, AFAICS), or
c) lacking essential software support (secure erase etc.).

lo barato cuesta caro. :wink:
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I would go triple mirror for metadata vdev, at the very least.

bullerwins

Dabbler
Joined
Mar 15, 2022
Messages
43
I would go triple mirror for metadata vdev, at the very least.
I have RAIDZ1 on the data; do you think I should invest first in a triple mirror for my metadata vdev rather than a hot spare for the data vdev?

Edit: for my setup another 128GB SSD would be cheap, to be honest. Another 14TB HDD, not so much.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
My use case for the system is both backup and archival of files, so a metadata vdev would be good for indexing files and making searches faster.
This can be done in a couple ways, one of which is to increase the arc_meta_min tunable to force metadata to stay in RAM which is discussed quite a bit in this thread:

System Settings -> Advanced -> Init/Shutdown Scripts

Type: Command
Command: echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_meta_min
When: Pre Init

The other would be to use an L2ARC, and set it to cache metadata only in order to limit the amount of writes going to it.
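As a rough sketch of that second option (pool name and device path below are placeholders):
Code:
zpool add poolname cache /dev/disk/by-id/ata-YOUR_SSD_SERIAL
zfs set secondarycache=metadata poolname
Setting secondarycache on the pool's root dataset lets every child dataset inherit it; you can also set it per dataset.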

But I also want to ingest video and photo files and edit from there; that's why I had the SLOG idea in the first place, to speed up those writes in a "safer" way, using sync=on.

SLOG is only necessary for writes that "can't be replayed" - being the target of a copy job usually doesn't fall into that category, because you can either re-run the copy job, or hit the "save" button again/redirect it to a temporary local location so you don't lose your work. (Maybe not as easily with a multi-GB raw video file.) I did see that you've since adjusted to the idea of using async/standard with the UPS being present.
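For reference, sync is a per-dataset property; a quick sketch (dataset name is a placeholder):
Code:
zfs set sync=standard poolname/media    # honour whatever the client requests (default)
zfs get sync poolname/media
"standard" leaves it up to the application, while "disabled" forces everything async and "always" forces everything sync.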

Edit: I found that you cannot delete a metadata vdev from the pool; it's like any other disk in the data vdev, only that it stores metadata instead of data.

This is a limitation on vdev removal - if there are any top-level RAIDZ vdevs in a pool (your config has a mirror for metadata, RAIDZ for data) then you can't remove any vdevs other than cache or log. If you're using mirrors across the board, you can remove metadata vdevs.

Edit2: I found that if I add a metadata vdev it will only add the metadata for new data, not for the files that already live in the pool.

This is correct, copying files between datasets locally should do this (as there will be new records written) so it doesn't require copying them to separate physical disks.
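A minimal sketch of that local rewrite, assuming hypothetical dataset names - verify the new copy before deleting anything:
Code:
zfs create poolname/archive-rewritten
rsync -a /mnt/poolname/archive/ /mnt/poolname/archive-rewritten/
# check the new copy, then rename/remove the old dataset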
 

bullerwins

Dabbler
Joined
Mar 15, 2022
Messages
43
This can be done in a couple ways, one of which is to increase the arc_meta_min tunable to force metadata to stay in RAM which is discussed quite a bit in this thread:



The other would be to use an L2ARC, and set it to cache metadata only in order to limit the amount of writes going to it.



SLOG is only necessary for writes that "can't be replayed" - being the target of a copy job usually doesn't fall into that category, because you can either re-run the copy job, or hit the "save" button again/redirect it to a temporary local location so you don't lose your work. (Maybe not as easily with a multi-GB raw video file.) I did see that you've since adjusted to the idea of using async/standard with the UPS being present.



This is a limitation on vdev removal - if there are any top-level RAIDZ vdevs in a pool (your config has a mirror for metadata, RAIDZ for data) then you can't remove any vdevs other than cache or log. If you're using mirrors across the board, you can remove metadata vdevs.



This is correct, copying files between datasets locally should do this (as there will be new records written) so it doesn't require copying them to separate physical disks.
Thank you so much for the response, quite complete.

I'm not sure I want to use the L1 ARC (RAM) to store the metadata, as I don't have too much of it.
But using L2ARC for this seems like a good idea. I didn't know it could be done (I'll have to research this), and I think it would be reversible, since it's "duplicate" data that already lives in the pool, just cached on the couple of SSDs I'd set up as L2ARC. In that case a single SSD would also be fine, as there's no need for redundancy, am I correct?

I didn't quite understand SLOG and thought it was REALLY necessary to avoid data loss if I wanted to use async, but from what I've read and your response, it doesn't seem so critical in my use case, having a UPS. As you say: copying files, making backups/Time Machine, saving projects... if the copy fails I can just restart it, or save the file again. As long as the copy/save/backup finishes without any errors, the copy went through just fine, right?

Edit:
This is a limitation on vdev removal - if there are any top-level RAIDZ vdevs in a pool (your config has a mirror for metadata, RAIDZ for data) then you can't remove any vdevs other than cache or log. If you're using mirrors across the board, you can remove metadata vdevs.
So if I set up the 4x 14TB data disks as mirror vdevs and a 2x SSD metadata vdev, I can then remove the metadata vdev no problem? How does that work? Isn't the metadata content only on the SSDs?
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thank you so much for the response, quite complete.

I'm not sure I want to use the L1 ARC (RAM) to store the metadata, as I don't have too much of it.
But using L2ARC for this seems like a good idea. I didn't know it could be done (I'll have to research this), and I think it would be reversible, since it's "duplicate" data that already lives in the pool, just cached on the couple of SSDs I'd set up as L2ARC. In that case a single SSD would also be fine, as there's no need for redundancy, am I correct?

Metadata really is very small, query the sysctl kstat.zfs.misc.arcstats.metadata_size to see how little is actually required. Setting the tunable mentioned doesn't necessarily say that 4GB is immediately lost to "metadata only" but rather that it won't push metadata out of RAM unless it's consuming more than that.
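On SCALE (Linux) the same counter is also exposed under /proc; assuming the stock OpenZFS layout, something like this should show it:
Code:
grep -w metadata_size /proc/spl/kstat/zfs/arcstats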

L2ARC doesn't need to be redundant though - it will only ever be a "faster duplicate copy" of what's on the main data vdevs.

I didn't quite understand SLOG and thought it was REALLY necessary to avoid data loss if I wanted to use async, but from what I've read and your response, it doesn't seem so critical in my use case, having a UPS. As you say: copying files, making backups/Time Machine, saving projects... if the copy fails I can just restart it, or save the file again. As long as the copy/save/backup finishes without any errors, the copy went through just fine, right?

As long as the copy succeeds, you're usually okay; however, if you just finish a copy, and the power goes out within the next few seconds, then it might be in jeopardy - but you also have a UPS to mitigate this risk. That doesn't prevent a situation where there's a system crash due to kernel panic or hardware fault (a dead PSU in the server, HBA fails, or some component lets out the Magic Blue Smoke that all electronics run on)

Edit:
So if I set up the 4x 14TB data disks as mirror vdevs and a 2x SSD metadata vdev, I can then remove the metadata vdev no problem? How does that work? Isn't the metadata content only on the SSDs?

Yes. Metadata lives on the special vdev during normal operation, but once the removal request is committed to the pool, the existing records will start being moved from the special vdev back to the main ones. This results in a little bit of RAM overhead for it to store the virtual device redirection tables - you can estimate it by using the command zpool remove -n poolname vdevname - replacing poolname and vdevname with your own pool and the special vdev. This doesn't actually remove the vdev.
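For example (dry run only; "poolname" and "mirror-1" stand in for your pool and the special vdev name shown by zpool status):
Code:
zpool remove -n poolname mirror-1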
 

bullerwins

Dabbler
Joined
Mar 15, 2022
Messages
43
Metadata really is very small, query the sysctl kstat.zfs.misc.arcstats.metadata_size to see how little is actually required. Setting the tunable mentioned doesn't necessarily say that 4GB is immediately lost to "metadata only" but rather that it won't push metadata out of RAM unless it's consuming more than that.
Metadata seems to be 6.2GiB at the moment, for 12TB of stored data, not too much. But still a fair chunk of 24GB of RAM? Maybe, I don't know. My RAM/ARC ratio is good at the moment
Code:
root@truenas[~]# arc_summary | grep "Metadata cache size (current)"
Metadata cache size (current): 77.1 % 6.2 GiB



L2ARC doesn't need to be redundant though - it will only ever be a "faster duplicate copy" of what's on the main data vdevs.
Cool, thanks

As long as the copy succeeds, you're usually okay; however, if you just finish a copy, and the power goes out within the next few seconds, then it might be in jeopardy - but you also have a UPS to mitigate this risk. That doesn't prevent a situation where there's a system crash due to kernel panic or hardware fault (a dead PSU in the server, HBA fails, or some component lets out the Magic Blue Smoke that all electronics run on)
Good to know. My "threat level" is not that high as to account for kernel panics or other hardware fault, other than loss of power. I can't account for everything I think. I don't have a redundant PSU, as I have consumer hardware at the moment. And I don't know how to account for a kernel panic.

Yes. Metadata lives on the special vdev during normal operation, but once the removal request is committed to the pool, the existing records will start being moved from the special vdev back to the main ones. This results in a little bit of RAM overhead for it to store the virtual device redirection tables - you can estimate it by using the command zpool remove -n poolname vdevname - replacing poolname and vdevname with your own pool and the special vdev. This doesn't actually remove the vdev.
Is there any documentation I can read about this? It seems odd that with mirror data vdevs + a mirror metadata vdev I can remove the metadata vdev and the metadata goes back to the data vdevs, but with RAIDZ data + mirror metadata it doesn't work.


To use the L2ARC for metadata only, I would only need to run this after adding the L2ARC SSD to the pool?
Code:
zfs set secondarycache=metadata misradpool/misraddata

Can an L2ARC vdev be added live, the way a SLOG can be added and removed at will? Or does the pool need to be redone?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I have RAIDZ1 on the data; do you think I should invest first in a triple mirror for my metadata vdev rather than a hot spare for the data vdev?

Edit: for my setup another 128GB SSD would be cheap, to be honest. Another 14TB HDD, not so much.
Imho you should consider RAIDZ2 (so yeah, scrapping the pool) your first priority.
If this isn't possible, you want that hotspare more than a special vdev.
If you already have a metadata vdev though, you need at least a 2-way mirror asap.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Metadata seems to be 6.2GiB at the moment, for 12TB of stored data, not too much. But still a fair chunk of 24GB of RAM? Maybe, I don't know. My RAM/ARC ratio is good at the moment

arc_summary can tell you a bit more about what kind of hits/misses you're seeing, but 6.2G of metadata for 12TB isn't unreasonable at all; that's about 0.05% of the data size. What kind of data are you storing? Large (multi-GB) media files might benefit from being placed onto a dataset with recordsize=1M to push down the metadata amount more.
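If you do go that route, it's a per-dataset property and only affects newly written records (dataset name below is a placeholder):
Code:
zfs set recordsize=1M poolname/media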

It really depends on if you're seeing a lot of demand_data misses vs demand_metadata misses - if you're happy with the general performance now in terms of the balance between "directory browsing" (metadata) and "reading files" (data) then I wouldn't worry too much about turning the screws in the background. The easy solution for improving performance is the ZFS default answer of "add more RAM" because it can virtually always use more, and will fill it with whatever's hitting the combination of MFU/MRU.

Good to know. My "threat level" is not that high as to account for kernel panics or other hardware fault, other than loss of power. I can't account for everything I think. I don't have a redundant PSU, as I have consumer hardware at the moment. And I don't know how to account for a kernel panic.

It's all degrees of willingness to prepare for it, which can range all the way from "no sync writes, no UPS, que sera sera" for someone's repository of "Linux ISOs" all the way to "sync=always, mirrored hotswap SLOG, redundant UPSes" depending on your risk tolerance. Planning to survive kernel panics is somewhere in the middle of the spectrum (sync=always, have an SLOG to restore the performance, have a UPS for additional safety)

Is there any documentation I can read about this? It seems odd that with mirror data vdevs + a mirror metadata vdev I can remove the metadata vdev and the metadata goes back to the data vdevs, but with RAIDZ data + mirror metadata it doesn't work.


I'm not sure exactly which part of the code prevents removal when RAIDZ vdevs are in the pool; I suspect it's the lack of checksumming on reads from the pending-removal vdev.

To use the L2ARC for metadata only, I would only need to run this after adding the L2ARC SSD to the pool?
Code:
zfs set secondarycache=metadata misradpool/misraddata

Can an L2ARC vdev be added live, the way a SLOG can be added and removed at will? Or does the pool need to be redone?

That's the command to use it, and yes an L2ARC (cache) vdev can be added and removed at will.
 

bullerwins

Dabbler
Joined
Mar 15, 2022
Messages
43
Thanks a lot, all of your answers were so helpful. I even had a chuckle

"no sync writes, no UPS, que sera sera"
also known as YOLO


I store all kinds of crap, from old backups of backups (backup/new backup/desktop/backup new new... kind of stuff) to movies, videos, VMs, config files, pictures, many documents... I have dedup enabled just to make my life simpler so I don't have to check for duplicated backups inside folders.
I checked, and at the moment the recordsize is 128K.

Currently dedup is saving me 10% (I think)
Code:
root@truenas[~]# zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bigboi     50.9T  13.4T  37.5T        -         -    15%    26%  1.10x    ONLINE  /mnt
boot-pool   206G  2.63G   203G        -         -     0%     1%  1.00x    ONLINE  -
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I have dedup enabled

Bold for attention: you should probably turn that off right now.

Deduplication is very resource intensive and will result in instability and bad performance quickly, especially with only 24GB of RAM. When it starts to become order-of-magnitude reductions (eg: 10x) then it's easily justified - 10% isn't enough to be worthwhile, I have to imagine. Is it on for every dataset, or just the ones you know/suspect to carry a lot of duplicate data?
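If/when you do turn it off, it's a dataset (or pool-root) property; note that switching it off only stops new writes from being deduplicated, and existing DDT entries stay until that data is rewritten or deleted (names below are placeholders):
Code:
zfs set dedup=off poolname
zfs get -r dedup poolname   # confirm nothing is still set to "on"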

Can you run zpool status -D and post the output of the line that looks similar to

Code:
dedup: DDT entries 34188575, size 829B on disk, 267B in core


Multiply the number of entries by the bytes in core, and that will tell you how much RAM is being used by the deduplication tables.
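As a worked example using the sample line above (substitute your own entry count and in-core size):
Code:
34,188,575 entries x 267 B/entry = 9,128,349,525 B ≈ 8.5 GiB of RAM for the dedup tables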
 

bullerwins

Dabbler
Joined
Mar 15, 2022
Messages
43
Bold for attention: you should probably turn that off right now.

Deduplication is very resource intensive and will result in instability and bad performance quickly, especially with only 24GB of RAM. When it starts to become order-of-magnitude reductions (eg: 10x) then it's easily justified - 10% isn't enough to be worthwhile, I have to imagine. Is it on for every dataset, or just the ones you know/suspect to carry a lot of duplicate data?

Can you run zpool status -D and post the output of the line that looks similar to

Code:
dedup: DDT entries 34188575, size 829B on disk, 267B in core


Multiply the number of entries by the bytes in core, and that will tell you how much RAM is being used by the deduplication tables.
I just got the feeling it wasn't that bad, going by Craft Computing's video.
It's on every dataset
Code:
dedup: DDT entries 86714013, size 930B on disk, 206B in core
 
Last edited:

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Code:
dedup: DDT entries 86714013, size 930B on disk, 206B in core
That's almost 18GB (86,714,013 entries x 206 B) against your 24GB of RAM, and non-ECC at that, oof.
Is your system using swap space?
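A quick way to check from the SCALE shell:
Code:
free -h
swapon --show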
 
Last edited: