
Sudden "POOL degraded" followed by "Healthy" upon reboot. No drives changed.

ghamauricio

Junior Member
Joined
Mar 20, 2019
Messages
21
Hello guys,

So I have two TrueNAS Core servers running in a sort of poor-man's HA, and one of them reported a "Degraded pool".
I went to the Storage > Disks page and one of the disks wasn't listing its serial number.

For some reason I decided to reboot the server, and when it came back, it managed to resilver the pool in less than 20 minutes (it's a three-disk raidz1 pool with almost 5TB of data, mind you) and now says the pool is healthy.

How do I even begin to find out what happened? [edit] Just noticed the forum swaps the "shrug" emoji for a male symbol, so I'll use the ASCII-art version: ¯\_(ツ)_/¯
 

sretalla

Wizened Sage
Joined
Jan 1, 2016
Messages
4,918
For some reason I decided to reboot the server, and when it came back, it managed to resilver the pool in less than 20 minutes (it's a three-disk raidz1 pool with almost 5TB of data, mind you) and now says the pool is healthy.
The equivalent of a zpool clear happens on reboot, so the errors aren't showing simply because nothing has triggered them again yet. A scrub would possibly uncover them again.

How do I even begin to find out what happened?
You should look at the SMART data for every disk (for example, smartctl -a /dev/ada1, repeated for each device) and assess whether the problem lies with one (or more) of them.
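If reading the full smartctl output for every disk gets tedious, something like the following can pull out just the attributes that usually matter. This is only a rough sketch: the list of watched attributes is my own pick, the sample text is made up for illustration (it is not taken from any report in this thread), and it assumes the usual ATA attribute table that smartctl -A prints. Some vendors pack several counters into one raw value, so a non-zero number is a prompt to look closer, not a verdict.

```python
import re
import subprocess

# Attributes worth a second look when non-zero (a personal pick, not an official list).
WATCHED = {
    "Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable",
    "Command_Timeout", "G-Sense_Error_Rate", "UDMA_CRC_Error_Count",
}

# One row of the ATA attribute table: ID, name, flag, value, worst, thresh,
# type, updated, when_failed, then the leading integer of the raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def read_smart(dev):
    """Return the text of `smartctl -A <dev>` (needs root and smartmontools)."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is not
    # necessarily a failure to run; that is why check=True is not used here.
    return subprocess.run(["smartctl", "-A", dev],
                          capture_output=True, text=True).stdout

def nonzero_watched(smart_text):
    """Map each watched attribute with a non-zero raw value to that value."""
    found = {}
    for line in smart_text.splitlines():
        m = ROW.match(line)
        if m and m.group(2) in WATCHED and int(m.group(3)) > 0:
            found[m.group(2)] = int(m.group(3))
    return found

# Made-up sample in the usual layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       3412
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34 (Min/Max 20/45)
"""

if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

On a real box you would call nonzero_watched(read_smart("/dev/ada1")) once per disk and compare the results between drives.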

Also look at dmesg
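
As for what to look for in dmesg: on FreeBSD, disk trouble usually shows up as CAM messages prefixed with the device, such as (ada1:ahcich1:0:0:0): followed by a status line. A rough sketch for grouping those by device is below. The keyword list is my own guess at what is worth flagging, and the sample lines are illustrative, not copied from anyone's log.

```python
import re
from collections import defaultdict

# CAM messages are prefixed like "(ada1:ahcich1:0:0:0): <message>".
CAM_LINE = re.compile(r"^\((?P<dev>[a-z]+\d+):[^)]*\):\s*(?P<msg>.*)$")

# Phrases that tend to accompany flaky drives, cables, or ports (assumed list).
KEYWORDS = ("cam status", "timeout", "retrying command", "lost device", "detached")

def suspicious_by_device(dmesg_text):
    """Group CAM messages containing any keyword by device name."""
    hits = defaultdict(list)
    for line in dmesg_text.splitlines():
        m = CAM_LINE.match(line.strip())
        if m and any(k in m.group("msg").lower() for k in KEYWORDS):
            hits[m.group("dev")].append(m.group("msg"))
    return dict(hits)

# Illustrative sample only.
SAMPLE_DMESG = """\
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
"""

if __name__ == "__main__":
    for dev, msgs in suspicious_by_device(SAMPLE_DMESG).items():
        print(dev, msgs)
```

If one device keeps turning up here while its SMART data looks clean, the cable or port is as likely a suspect as the drive.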
 

ghamauricio

Junior Member
Joined
Mar 20, 2019
Messages
21
Hello.

I don't use no scrub.
Scrub is a tech that get no love from me.
J/k, I had to say that.

So I started a scrub in the morning and it's about 95% complete, but I'm heading out for work and won't be back for a while. So far, no other errors have come up in the TrueNAS Core web UI.

The SMART reports are as follows:

btw, huginn was the first NAS. The muninn reports are from the "second NAS", which was bought much later and has far fewer power-on hours.
btw2: all SATA drives are included here, including the SATADOM the system boots from.
 

Attachments

sretalla

Wizened Sage
Joined
Jan 1, 2016
Messages
4,918
So:

muninn-ada3 shows some recovered blocks and hasn't run a test for a very long time.
muninn-ada2 looks OK (and is running tests)
muninn-ada1 shows a couple of command timeouts (could indicate drive issues or connectivity issues) and the G-Sense error count is a few thousand... may indicate the drive is moving/vibrating too much.

huginn-ada3 shows almost the same as muninn-ada3.
huginn-ada2 seems to be quite old and shows even more G-Sense errors than muninn-ada2
huginn-ada1 hasn't run a SMART test for over 1000 hours, but looks OK based on the last run.
huginn-ada0 is probably OK... it is using the Seagate logging complication, so it's hard to read without the command being adapted, but I think the drive looks OK.


As a general note, you don't seem to have run long tests recently, so please consider doing that.

Muninn dmesg shows CAM is waiting for something during boot... maybe an indication of issues with the boot media? (is that ada3?)

I also see you are running both on AMD systems... have you done the tuning of the sleep states and other optimizations to ensure things run well for that platform?
 

ghamauricio

Junior Member
Joined
Mar 20, 2019
Messages
21
Thank you for giving some attention to the subject, sir.

I've scheduled the long test for tonight, and tomorrow night I'll run the commands again and post here.

BTW: I had no SMART tests scheduled by default; I had only set up a weekly short test. It seems you suggest running it daily. How about the long one? Any others?
Also daily snapshots, weekly scrubs... And I guess that's it.
And that scrub never raised any error again.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
14,159
A SMART test may take a long time to run. Scheduling them too frequently won't allow them time to complete, as the drive does it in the background in between your workload requests. I've been doing twice (thrice?) weekly longs with quadhour shorts for many years and it seems to work pretty well.
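
To put a number on "a long time": ATA drives report a recommended polling time for each self-test, which smartctl -c prints. Below is a rough sketch that reads those values and checks whether a planned gap between tests leaves enough room. The 1.5x slack factor is purely an assumption to allow for the test yielding to normal workload, and the sample text is illustrative, not from a drive in this thread.

```python
import re

# Matches the two-line entries in `smartctl -c` output, e.g.
#   Extended self-test routine
#   recommended polling time:        ( 255) minutes.
POLL = re.compile(
    r"(Short|Extended) self-test routine\s+"
    r"recommended polling time:\s+\(\s*(\d+)\)\s+minutes"
)

def polling_minutes(capabilities_text):
    """Return e.g. {'short': 2, 'extended': 255} from `smartctl -c` text."""
    return {kind.lower(): int(mins)
            for kind, mins in POLL.findall(capabilities_text)}

def gap_is_enough(gap_minutes, test_minutes, slack=1.5):
    """True if the gap covers the drive's estimate times a slack factor.

    The slack factor is an assumption: a busy drive can take longer than
    its own estimate because the test runs between workload requests.
    """
    return gap_minutes >= test_minutes * slack

# Illustrative sample only.
SAMPLE_CAPS = """\
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
"""

if __name__ == "__main__":
    times = polling_minutes(SAMPLE_CAPS)
    print(times)
    print(gap_is_enough(24 * 60, times["extended"]))
```

With numbers like those in the sample, a daily gap comfortably covers a long test, while a two-hour gap would not.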
 

sretalla

Wizened Sage
Joined
Jan 1, 2016
Messages
4,918
I've been doing twice (thrice?) weekly longs with quadhour shorts for many years and it seems to work pretty well
That feels a little overkill-y, but the important part is regularity and with enough time in between to allow test completion.

I have seen a lot of folks (and I can be counted in there) who run daily short and weekly long tests.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
14,159
It would be a little overkill-y if the servers were more busy. Otherwise, I take some comfort knowing that things are being tested frequently.
 

ghamauricio

Junior Member
Joined
Mar 20, 2019
Messages
21
Thanks jgreco. I think I'll stick to the daily short, weekly long schedule. That seems enough for the use case - it's basically a container for important stuff: it holds all sorts of files I might want to share between the different computers at home (mine and my wife's), including documents and, most used of all, RAW photos (which only get read when using Lightroom).

Since there are two of them, mirroring each other (right now I'm using ResilioSync, but it's damn slow - so if anyone has a better solution than two rsyncs which can't propagate deletions, I'm listening), plus a cold backup of them on an external hard drive, I'm not really worried about losing data.

So I set up the long test to start 2 hours after the short one. I hope that was enough time for both to finish. Here are the files again.

(BTW: what could be interesting from dmesg?)
 

Attachments
