Resilver didn't start up again after reboot

TidalWave

Explorer · Joined Mar 6, 2019 · Messages: 51
Hey Guys,

I read online that it was safe to reboot during a resilver, so we did, but after the reboot the resilver showed as complete, even though it had three days remaining beforehand. I'm fairly confident the new drive didn't have everything copied over to it yet, and that some sort of bug prevented ZFS from recognizing the need to continue the resilver after the reboot. (Magically the resilver was "finished" when, before the reboot, it had three days left.)

So we ran a scrub, which finished, and all the drives look fine. However, I'm still concerned that I have a drive with only half of its proper RAIDZ2 parity data written to it. If a resilver somehow stops halfway through, will the scrub I ran create the missing parity data the way a resilver normally does?


Are there any commands I can use to check the status of the disk and make sure the parity is okay? It's RAIDZ2, so we have the ability to lose two disks and still be okay.
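For reference, here is a sketch of the commands I know of for checking this (the pool name tank is assumed; it matches the history I post further down):
Code:
# Overall pool health, per-disk state, and any scrub/resilver progress on the 'scan:' line
zpool status -v tank

# One-line health summary of the pool
zpool get health tank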
 

joeschmuck

Old Man · Moderator · Joined May 28, 2011 · Messages: 10,994
I am curious to hear an expert answer on this. My opinion is that when you run a scrub, it reads ALL the data on ALL drives to ensure it is valid.

Scrub times are estimates. When you start a scrub the system may say it will take 3 days; after running for a few hours it may say 2 days because it processed the data quicker than estimated, and a few hours after that it may say only 2 hours because the system was flying through the data. I would expect the algorithm to estimate how many large and small files you have and your average read speed, and as the system crunches through all that data it figures out it has processed more than originally calculated. Maybe instead of 50,000 small files, you actually have 10,000 small files. Small files of course take the longest relative to the total quantity of data processed; large files are processed faster, and all of that is due to the latency of moving the heads around.

I'm not sure if you just looked at the estimate when you started the resilver and then six hours later decided to reboot. If you checked the resilver status just before rebooting and it said 3 days remaining, then I'd be curious what really happened.
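If you want to capture the progress right before a reboot, something like this works (a sketch, assuming your pool is named tank):
Code:
# The 'scan:' line reports scrub/resilver progress, speed, and estimated time remaining
zpool status tank | grep -A 2 scan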

That is my opinion, but don't take it as fact for what actually happened.
 

TidalWave

Explorer · Joined Mar 6, 2019 · Messages: 51
One bit of information that may change things: this vdev has two failed hard drives. We were in the process of resilvering one of them, and both failed drives had been offlined.

We have a giant server, 1 PB, and it always takes days and days to copy everything onto these 16 TB drives. There is no way the resilver finished; our vdev is 8 to 10 drives deep.

Unless the resilver somehow finished 1000% faster than normal during the boot-up process. There is no monitor hooked up to these servers, so no one was watching the scrolling text during the reboot.

The server was shut down in the middle of the resilver, and when it booted back up, the resilver didn't start again. The drive appears online and healthy, but given the time involved there is no way all the data was copied to the drive.

So we ran a scrub, which took 5 days (normal for us), found 0 errors, and says everything is fine. I'm just concerned that something is gonna get corrupted months down the line, or when I start to rebuild this second failed disk. So I'm curious: if you somehow stop a resilver and then run a scrub, does the scrub copy the remaining data to the new hard drive from where the resilver paused?

In the past, on servers with traditional RAID cards, I've seen this type of behavior before: a drive fails to rebuild for some reason, maybe because of a reboot, or maybe because a second drive fails during the rebuild and halts the first drive's parity copy. Those systems don't have a scrub, so I didn't do anything and just sort of hoped the data was fine. I replaced the failed second drive, and the array was all screwed up because the first failed drive had never completed its rebuild.

I know TrueNAS is a different beast altogether, and I'm hoping it handles this type of situation better, but I want to understand scrub vs. resilver: theoretically, can I use a scrub to rebuild onto a new hard drive, or do I have to use a resilver?
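To be clear about the two operations I'm comparing, here is a sketch (pool name tank assumed; the GUID and gptid are placeholders):
Code:
# Scrub: reads and verifies data already on the pool's disks, repairing from parity as needed
zpool scrub tank

# Resilver: kicked off by replacing a device; rebuilds that device's contents from parity
zpool replace tank <old-disk-guid> /dev/gptid/<new-gptid>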

Additionally, what is the best practice when you have two failed disks at once in a RAIDZ2 of 8-10 drives? Should I rebuild them one at a time, or "replace" them both at the same time?
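By "at the same time" I mean issuing both replaces back to back, roughly like this sketch with placeholder IDs:
Code:
# Replace the first failed disk (GUID as shown by zpool status)
zpool replace tank <guid-of-first-failed-disk> /dev/gptid/<new-gptid-1>

# Either wait for that resilver to finish, or issue the second replace too;
# OpenZFS can resilver multiple replacement disks in a single pass
zpool replace tank <guid-of-second-failed-disk> /dev/gptid/<new-gptid-2>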

EDIT:

I did pull these logs:
Code:
zpool history tank | grep replace
2023-09-16.16:44:51 zpool replace tank 2656726977078078380 /dev/gptid/ef7161e6-54ea-11ee-90c5-ac1f6bc275cd
2023-12-01.09:26:21 zpool replace tank 2517086439596449221 /dev/gptid/a6388457-906e-11ee-9199-ac1f6bc275cd
2023-12-04.20:26:37 zpool replace tank 16894485428322374071 /dev/gptid/71801193-9326-11ee-a649-ac1f6bc275cd
2023-12-15.16:29:14 zpool replace tank 10091730335195583156 /dev/gptid/0c172e6b-9baa-11ee-82c8-ac1f6bc275cd

The drive in question is the one replaced on the 15th, and we rebooted on the 20th. So maybe you are right; maybe we didn't pay attention to the % done on the resilver before rebooting...
 

Jailer

Not strong, but bad · Joined Sep 12, 2014 · Messages: 4,977
If you are really that concerned about it, why not offline the drive, wipe it, and then resilver it back into the pool, this time without rebooting?
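Roughly like this from the CLI (a sketch; daX and the gptid are placeholders, and the wipe is destructive, so double check the device name):
Code:
# Take the suspect disk out of the pool
zpool offline tank /dev/gptid/<suspect-gptid>

# Blow away the start of the disk so it looks new (destroys the partition table!)
dd if=/dev/zero of=/dev/daX bs=1048576 count=100

# Resilver it back in; the TrueNAS GUI replace handles repartitioning for you
zpool replace tank <old-disk-guid> /dev/daX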
 

joeschmuck

Old Man · Moderator · Joined May 28, 2011 · Messages: 10,994
I just find it difficult to believe that a scrub would not find any discrepancies if the data was not all there. It is too bad that the history output does not record whether the resilver completed.

From Oracle's ZFS documentation:
The simplest way to check data integrity is to initiate an explicit scrubbing of all data within the pool. This operation traverses all the data in the pool once and verifies that all blocks can be read. Scrubbing proceeds as fast as the devices allow, though the priority of any I/O remains below that of normal operations. This operation might negatively impact performance, though the pool's data should remain usable and nearly as responsive while the scrubbing occurs. To initiate an explicit scrub, use the zpool scrub command.
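In practice that is just (pool name assumed):
Code:
zpool scrub tank     # kick off the scrub
zpool status tank    # progress shows up on the 'scan:' line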

However, run
Code:
zpool history -i tank
and this will provide more detailed information, such as whether the scrub completed properly. It might even state that the resilver was done. If it does, please post the message; I would just like to see what it says.
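If the output is long, you can filter it down, for example:
Code:
zpool history -i tank | grep -iE 'scrub|resilver'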

Cheers,
-Joe
 