Sudden "POOL degraded" followed by "Healthy" upon reboot. No drives changed.

ghamauricio · Apr 20, 2021

Hello guys,

So I have two TrueNAS Core servers running in a sort of poor-man's HA, and one of them started reporting "Degraded pool".
I went to the Storage > Disks page and one of the disks wasn't listing its serial.

For some reason I decided to reboot the server, and when it came back, it managed to resilver the POOL in less than 20 minutes (it's a three-disk raidz1 pool with almost 5TB of data, mind you) and now says the pool is healthy.

How do I even begin to find out what happened? [edit] I just noticed the forum replaces the "shrug" emoji with a bonker (male symbol), so I'll use the ASCII-art version: ¯\_(ツ)_/¯

sretalla · Apr 21, 2021

ghamauricio said:
For some reason I decided to reboot the server, and when it came back, it managed to resilver the POOL in less than 20 minutes (it's a three-disk raidz1 pool with almost 5TB of data, mind you) and now says the pool is healthy.

The equivalent of a zpool clear happens on reboot, so the errors aren't showing because you haven't raised them yet. A scrub would possibly uncover them again.
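As a rough sketch of how to check this after a reboot, the Python below parses `zpool status` text for the pool state and for any device that is not ONLINE or has non-zero READ/WRITE/CKSUM counters. The pool name `tank` and the helper names are assumptions for illustration, not taken from this system.

```python
import subprocess

def parse_zpool_status(text):
    """Return (pool_state, names_with_problems) from `zpool status` text."""
    state = None
    problems = []
    in_config = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "state:":
            state = " ".join(fields[1:])
        elif fields[0] == "NAME" and fields[-1] == "CKSUM":
            # Header of the device table: NAME STATE READ WRITE CKSUM
            in_config = True
        elif fields[0] == "errors:":
            in_config = False
        elif in_config and len(fields) >= 5:
            name, dev_state, counters = fields[0], fields[1], fields[2:5]
            if dev_state != "ONLINE" or any(c != "0" for c in counters):
                problems.append(name)
    return state, problems

def zpool_status(pool):
    """Run `zpool status <pool>` and return its text output."""
    result = subprocess.run(["zpool", "status", pool],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

On the NAS itself, something like `parse_zpool_status(zpool_status("tank"))` before and after a `zpool scrub tank` would show whether the scrub brings any counters back.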

ghamauricio said:
How do I even begin to find out what happened?

You should look at the SMART data for all disks (smartctl -a /dev/ada1) and assess if it's a problem with one (or more) of them.

Also look at dmesg
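To compare the disks side by side, a small sketch like the one below can pull a few attributes out of `smartctl -A` output. The watched attribute names are an assumption (they vary by vendor and model), and the helper names are made up for illustration.

```python
import subprocess

# Attribute names as printed by smartctl; which ones a drive reports
# depends on the vendor, so treat this set as an assumption.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Command_Timeout",
    "G-Sense_Error_Rate",
    "UDMA_CRC_Error_Count",
}

def parse_smart_attributes(text, watched=WATCHED):
    """Map watched attribute names to their raw value from `smartctl -A` text."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in watched:
            found[fields[1]] = " ".join(fields[9:])
    return found

def smart_attributes(device):
    """Run `smartctl -A <device>` and return the watched attributes."""
    # smartctl sets non-zero exit bits for warnings, so check=True is not used.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_smart_attributes(result.stdout)
```

For example, `smart_attributes("/dev/ada1")` run as root on each disk gives a quick view of which drive stands out.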

ghamauricio · Apr 21, 2021

Hello.

I don't use no scrub.
Scrub is a tech that get no love from me.
J/k, I had to say that.

So I ran a scrub in the morning and it's about 95% complete, but I'm going out for work and I won't be back for a while. Until now, no other errors have come up on the TrueNAS Core web UI.

The SMART reports are as follows:

btw, huginn was the first NAS. The muninn reports are from the "second NAS", which was bought much later and has far fewer working hours.
btw2: all SATA drives included here, including the SATADOM which the system boots off of.

sretalla · Apr 21, 2021

So:

muninn-ada3 shows some recovered blocks and hasn't run a test for a very long time.
muninn-ada2 looks OK (and is running tests)
muninn-ada1 shows a couple of command timeouts (could indicate drive issues or connectivity issues) and the G-Sense error count is a few thousand... may indicate the drive is moving/vibrating too much.

huginn-ada3 shows almost the same as for muninn.
huginn-ada2 seems to be quite old and shows even more g-sense errors than muninn-ada2
huginn-ada1 hasn't run a SMART test for over 1000 hours, but looks OK based on the last run.
huginn-ada0 is probably OK... it is using the Seagate logging complication, so it's hard to read without the command being adapted, but I think the drive looks OK.

As a general note, you don't seem to have run long tests recently, so please consider doing that.

Muninn dmesg shows CAM is waiting for something during boot... maybe an indication of issues with the boot media? (is that ada3?)

I also see you are running both on AMD systems... have you done the tuning of the sleep states and other optimizations to ensure things run well for that platform?

ghamauricio · Apr 22, 2021

Thank you for giving some attention to the subject, sir.

I've scheduled the long test for tonight, and tomorrow night I'll run the commands again and post here.

BTW: I had no SMART tests scheduled by default, I just set up a weekly short test. It seems you suggest running it daily. How about the long one? Any others also?
Also daily snapshots, weekly scrubs... And I guess that's it.
And that scrub never raised any error again.

jgreco · Apr 23, 2021

A SMART test may take a long time to run. Scheduling them too frequently won't allow them time to complete, as the drive does it in the background in between your workload requests. I've been doing twice (thrice?) weekly longs with quadhour shorts for many years and it seems to work pretty well.
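One way to judge how much room a test needs is the "recommended polling time" that ATA drives report in `smartctl -c` output. The sketch below extracts it; the function names are hypothetical, and the figure is only the drive's own estimate, so a busy drive can take longer.

```python
import re

# Matches the two-line entries in `smartctl -c` output, e.g.
#   Extended self-test routine
#   recommended polling time:        ( 255) minutes.
_POLLING = re.compile(
    r"(Short|Extended) self-test routine\s+"
    r"recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)

def selftest_minutes(capabilities_text):
    """Return e.g. {'short': 1, 'extended': 255} from `smartctl -c` text."""
    return {kind.lower(): int(minutes)
            for kind, minutes in _POLLING.findall(capabilities_text)}

def fits_in_gap(test_minutes, gap_hours):
    """True if the drive's estimate fits in the gap before the next test."""
    return test_minutes <= gap_hours * 60
```

The tests themselves are started with `smartctl -t short /dev/ada1` or `smartctl -t long /dev/ada1`, and the results show up in `smartctl -a` afterwards.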

sretalla · Apr 23, 2021

jgreco said:
I've been doing twice (thrice?) weekly longs with quadhour shorts for many years and it seems to work pretty well

That feels a little overkill-y, but the important part is regularity and with enough time in between to allow test completion.

I have seen a lot of folks (and I can be counted in there) who run daily short and weekly long tests.

jgreco · Apr 23, 2021

It would be a little overkill-y if the servers were more busy. Otherwise, I take some comfort knowing that things are being tested frequently.

ghamauricio · Apr 24, 2021

Thanks jgreco. I think I'll stick to the daily short, weekly long schedule. Seems enough for the use case. It's basically a container for important stuff: it holds all sorts of files I might want to share between my different computers (me and the wife) at home, including documents and, most used of all, RAW photos (which only get read when using Lightroom).

Since there are two of them, mirroring each other (right now I'm using ResilioSync, but it's damn slow, so if anyone has a better solution than two rsyncs which can't propagate deletions, I'm listening), plus a cold backup of them on an external hard drive, I'm not really worried about losing data.

So I set up the long test for 2 hours after the short one. I hope that was enough time for them both to finish. Here are the files again.

(BTW: what could be interesting from dmesg?
