Question about metadata SSD

Jeremy Guo

Dabbler
Joined
Jul 28, 2023
Messages
37
Dear all,

When I first set up the vdev, I put in two 960GB SSDs as a mirror to store the metadata. The redundancy mismatch warning has been there since the vdev was built. I searched on the internet; it seems it's a bug and doesn't impact anything, so I left it alone.

Now I want to replace the metadata SSDs with a bigger one. Is there any way to do that? Can I simply replace one with a larger drive, and replace the other after the rebuild? If yes, which one is the right one to start with, given the mismatch warning?

Looking forward to your guidance.

thanks
Jeremy
 

Attachments

  • 222.PNG (21.3 KB)
  • 111.PNG (31.8 KB)

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It is likely the redundancy mismatch warning is because you have a RAID-Z2 vDev for your main data, with 2 disks of redundancy, but your Special / Metadata vDev has only 1 disk of redundancy (2-way Mirror). It is probably prudent to make the Special / Metadata vDev a 3-way Mirror.

Remember, the complete loss of your Special / Metadata vDev is also the loss of your entire pool. That is NOT the case for SLOG or Cache / L2ARC vDevs.

During normal operation, loss of the SLOG has no effect on reliability. Only on a graceless reboot (aka crash) does the SLOG come into play and may need to be read. Even then, if the SLOG ends up bad, it's not a total loss of the pool, just the last few transactions.


In the case of replacing your existing Special / Metadata vDev disks, yes, replace them one at a time. Once both are fully replaced / re-silvered, you would see the size increase. I think the command zpool list -v would show you the amount used in each vDev. Though SLOG & Cache / L2ARC might not show usage there, as SLOG is not really a permanently used vDev, and the L2ARC usage might be found elsewhere.
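If it helps, the one-at-a-time replacement could look something like the sketch below on the command line. This is only an illustration: the pool name "data" and the sdX device names are placeholders you'd substitute from your own zpool status output.

```shell
# Hypothetical sketch of the one-at-a-time special-vdev disk replacement.
# Pool and device names are placeholders; take the real ones from `zpool status`.
zpool set autoexpand=on data    # let the mirror grow once both disks are bigger
zpool replace data sdx sdy      # swap the first small SSD for a large one
zpool status data               # wait here until the resilver completes
zpool replace data sdv sdw      # then swap the second one
zpool list -v data              # per-vdev sizes and usage after expansion
```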
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
To clarify for you...

The mismatch is with the redundancy level.

For RAIDZ2, you can lose 2 drives and survive, so for metadata, you should follow the same rule, meaning a 3-way mirror.

If those are SSDs and you're happy to accept the risk of no redundancy for the pool with one SSD lost, then you can continue to ignore that warning.
 

Jeremy Guo

Dabbler
Joined
Jul 28, 2023
Messages
37
@Arwen @sretalla thanks to both of you for your advice.

I misunderstood the redundancy mismatch; I used to think the metadata on the two mirror disks was mismatched. Now I understand.

Checking the output of zpool list, it seems I don't need to replace the metadata drive; it's only 33% used. It's 200GB; is that a configured maximum size? I saw 200GB of metadata on another NAS too.

So I will replace one of the cache drives with a big SSD, then remove the other one and add it to the metadata mirror-2.

I didn't see any option for this in the UI.

Should the zpool commands be as below? (assuming the pool is named "data")

Remove sdc from the cache (cache devices are removed, not detached):
zpool remove data sdc

Add sdc to the metadata mirror-2 (attach it to one of the existing mirror disks):
zpool attach data <existing-metadata-disk> sdc

thanks
 

Attachments

  • Capture.PNG (155 KB)
  • 111.PNG (144.6 KB)

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
What version of SCALE are you running? There were some recent patches made to correct some issues where the removal options were incorrectly being masked in the UI through the "Topology" menu.

Also, your main data vdev is extremely wide (is that a 52-wide Z2?) and as such may be causing some performance or space efficiency issues.
 

Jeremy Guo

Dabbler
Joined
Jul 28, 2023
Messages
37
@HoneyBadger thanks for your reply.

My version is TrueNAS-SCALE-22.12.3.1

Yes, as you can see from the screenshot, I am using 53 disks as one vdev on a 60-bay server. I am new to TrueNAS, so I built one big vdev; now I have lots of data on it and can't rebuild it. I am using it as a backup repository, so it is acceptable.

I did encounter some NFS performance issues, but I'm not sure whether they are related to I/O performance.

When I run a backup job, it works well at first, and can even run the backup at 1-1.5GB/s with CPU load lower than 10%. But after over 10TB written, nfsd becomes extremely busy and the transfer speed degrades to less than 5MB/s.

It's a bit annoying that I have to restart the NFS service after nfsd becomes extremely busy, and that interrupts the backup job.

Once I restart nfsd, everything is back to normal; quite strange. sretalla suspects it could be an I/O performance issue because of the large vdev.

Thanks
 

Jeremy Guo

Dabbler
Joined
Jul 28, 2023
Messages
37
Yesterday, nfsd got high CPU usage after writing 13TB; today, nfsd became busy after copying 12TB.

I checked zpool iostat while nfsd was busy; there was little I/O going on in the vdev.

Could it be that the cache is full?

I noticed that when writing data to the "data" pool, it also writes to the cache, judging by the disk I/O in the reporting.

Really weird.
 

Attachments

  • 222.PNG (129.6 KB)
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
No, it is NOT really weird. This write performance issue is the hallmark of a too wide RAID-Zx vDev. I missed that your RAID-Z2 vDev was 53 disks wide.

The general maximum for a RAID-Zx vDev is 10 to 12 disks. People have gone higher, like to 15 to 20 disks. But, 53? That is a recipe for disaster.


One item that can be a killer is disk failure. While ZFS will attempt to do the right thing and replace a failed disk properly, it may end up taking days to weeks for a simple disk replacement. This is because in most cases the replacement disk's blocks have to be calculated from 51 other disks. (The other parity disk is not needed here.) That means up to 51 disk reads for each replacement disk write.
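As a back-of-the-envelope sketch of that read amplification (the width and parity numbers are from this thread; the one-read-per-surviving-column model is a simplification):

```shell
#!/bin/sh
# Rough read-amplification model for rebuilding one disk in a RAID-Zx vdev.
# Each rebuilt block is reconstructed from the surviving data columns plus
# one parity column, i.e. roughly (width - parity) reads per write.
width=53     # disks in this thread's vdev
parity=2     # RAID-Z2
reads_per_write=$(( width - parity ))
echo "reads per rebuilt block: ${reads_per_write}"
```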

The performance loss can be staggering. There are tunables to reduce the disk replacement overhead, so normal operation can be, well, "normal". BUT, this can cause the disk replacement to take even longer, like MONTHS!


Scrubs could be painful, though less so because it is all reads. But even then, it could take many days or weeks to complete a scrub. (And I hope you have them enabled, at least monthly.)


Sometimes I wish this type of configuration was tested, so we had hard facts to pass along. But, straying outside of known working configurations for RAID-Zx widths is much less tested and documented.


In any case, we've passed along what we could. Good luck.
 
Joined
Jul 3, 2015
Messages
926
I missed that your RAID-Z2 vDev was 53 disks wide
Wow! @Arwen is right, that's wayyyy too big. Resilvers will take like forever, performance will suck, and for general data resilience it's bad.

I'd probably do a 13-wide Z2 or Z3 with 4 vdevs, and that would leave you one spare, hot or cold, your choice.
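Just to sanity-check the arithmetic of that layout against the current 53-drive pool:

```shell
#!/bin/sh
# Suggested layout: 4 vdevs of 13-wide RAID-Z2/Z3, plus one spare.
vdevs=4
width=13
spares=1
total=$(( vdevs * width + spares ))
echo "bays used: ${total}"   # same 53 drives the current single vdev occupies
```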
 

Jeremy Guo

Dabbler
Joined
Jul 28, 2023
Messages
37
If both the pool and nfsd were busy, I could suspect that nfsd performance is impacted by the big pool's performance.

But the problem is the pool is quite idle, with little I/O, while nfsd is busy.

This is quite strange.
 

Jeremy Guo

Dabbler
Joined
Jul 28, 2023
Messages
37
@Arwen @sretalla

just for research sharing, no offence.

I had an HDD go missing from the vdev last week and went through a resilver process. My data is 50% of the total vdev size.

It took 5 days, less than 6, to complete the resilver.

And for the scrub task, normally it took 24-30 hours to complete.

I used the CIFS protocol for data backup, and there was no issue with TrueNAS even after writing more than 20TB of data. But the problem happened with NFS.

thanks
Jeremy
 
Joined
Jul 3, 2015
Messages
926
I met a HDD missing from the Vdev last week, and experienced a resilver process.
Do you mean a drive failed and dropped from the system and you replaced it with a new one?

My data is 50% of the total Vdev size
So your pool is 50% full?

If a 53-disk Z2 works for you then brill. The general feeling around fault tolerance is that the more drives you have, the more likely you are to lose drives, and therefore the more fault tolerance you need to add. That's not to mention that if you are storing hundreds of TBs, you most likely want to keep it safe even if it's *JUST A BACKUP*.

I must say I am surprised the resilver only took 5 days. Resilvers on my systems take a similar amount of time using 15-disk Z3 vdevs with 18TB SAS drives at approx 50% capacity. Personally, my primary concern with this config wouldn't be resilver times or performance, just the general lack of fault tolerance.

I assume you have built this in a work environment, as it seems quite large for a bedroom? If so, these are often the systems that get forgotten about when you leave the company, until the next poor sod discovers it's two drives down with a handful of others throwing errors, and comes to the forum asking for help. I always build my systems assuming that one day I may not be around to look after them, but hopefully they have enough resilience to keep going for a good while after if needed.
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
It took 5 days, less than 6 days to complete the resilver process.

And for the scrub task, normally it took 24-30 hours to complete.
...
That looks great in the short term. Glad you have a reasonable experience.


Just note that one or a few successfully replaced failed disks does not guarantee it will be painless next time. Most ZFS installations are similar enough that the common configurations end up extremely well tested; likely 100,000+ petabytes are stored in the more common ZFS configurations.

That said, who knows, you could be fine for the normal life of the NAS.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I am using it as backup repository, so it is accceptable.
If it is "a backup repository" why does it feature a mirrored SLOG AND a stripe of two L2ARC drives???
SLOG is only for sync writes, which you shouldn't have, and with a single 53-wide vdev performance is definitely not sought.
L2ARC is not useful on a backup NAS, especially a backup NAS which already has a metadata vdev, and I suspect that 2*894 GiB is way oversized for your RAM (with SCALE, you'd need over 400 GB RAM for a 2 TB L2ARC to be viable).
There's exactly nothing in this pool which follows good design practices. Let's hope that the primary NAS is safer than this one. (If there's no "primary NAS" and this 53-wide monster is your sole backup, the whole setup is a disaster waiting to happen.)
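For what it's worth, the RAM figure can be sketched with the common rule of thumb (an assumption on my part, not an exact OpenZFS limit) that L2ARC should stay within roughly 5x the RAM available to ARC, since every L2ARC record keeps a header in RAM:

```shell
#!/bin/sh
# Back-of-the-envelope L2ARC sizing under an assumed 5:1 L2ARC:RAM ceiling.
l2arc_gb=2048   # ~2 TB of L2ARC (2 x 894 GiB, rounded up)
ratio=5         # assumed maximum L2ARC:RAM ratio
min_ram_gb=$(( l2arc_gb / ratio ))
echo "RAM needed: ~${min_ram_gb} GB"
```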
 

Jeremy Guo

Dabbler
Joined
Jul 28, 2023
Messages
37
Do you mean a drive failed and dropped from the system and you replaced it with a new one?


So your pool is 50% full?

It could have been a communication issue with the HDD; ZFS considered it missing and triggered a resilver with a spare HDD.

And yes, it is around 50% full.

And thank you for your advice; I am going to add more spares to the system.

I checked the iostat during the resilver with zpool iostat -v 2; each of the 52 HDDs contributed around 16MB of reads, and the resilvered HDD committed around 16MB of writes.

[attached screenshot: 1692670916252.png]
 

Jeremy Guo

Dabbler
Joined
Jul 28, 2023
Messages
37
If it is "a backup repository" why does it feature a mirrored SLOG AND a stripe of two L2ARC drives???
Hi, Etorix

Yes, you are right. I am new to TrueNAS and did a bad setup, but I've gone too far to rebuild it now, with over 300TB on it.

I have removed the cache and SLOG from the system, and I am adding three more HDDs as spares.

thanks
Jeremy
 