Evaluating a failing disk?

the_jest · Sep 5, 2022

I'm running TrueNAS Core 12.08; my main storage pool has three two-disk mirrors.

I recently got an alert that "One or more devices has experienced an unrecoverable error. An attempt was made to correct the error." Going to the pool status page in the UI, or running zpool status -v, showed that one of the disks had a single error in the "READ" column. Meanwhile the pool showed as "degraded".
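For anyone who wants to check those counters from a script rather than by eye, here is a rough sketch that scans the text of zpool status for rows with a non-zero READ, WRITE, or CKSUM value. The sample text, the pool name "tank", and the device layout are made up for illustration and are not the actual output from this system; the function name is hypothetical as well.

```python
# Minimal sketch: flag devices whose READ/WRITE/CKSUM counters are non-zero.
# SAMPLE is an invented excerpt shaped like `zpool status` output; on a real
# system you would feed in the text produced by `zpool status -v` instead.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            ada1    ONLINE       0     0     0
            ada2    DEGRADED     1     0     0

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter.

    Counters are kept as strings because zpool abbreviates large values
    (for example "1.2K"), which would not parse as plain integers.
    """
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True          # header row: device rows follow
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False         # a blank line ends the device table
            continue
        if len(fields) < 5:
            continue                 # rows without counters (e.g. spares)
        counts = tuple(fields[2:5])
        if any(c != "0" for c in counts):
            flagged[fields[0]] = counts
    return flagged


print(devices_with_errors(SAMPLE))
```

Note that these counters are reset by zpool clear and, as happened here, can read zero again after the disk is re-attached, so a clean table does not by itself prove the disk is healthy.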

This was especially frustrating because this was not only a very new disk (and a dedicated NAS disk (IronWolf), rather than a shucked external drive), but in this case I'd bothered to follow all the recommendations and did a full burn-in series, which took over a week. But I digress.

While I was trying to figure out what to do, I pulled out the disk (it's in a hot-swap bay), confirmed that it was the one I thought (I had mis-labeled the bays), and reinserted it. Now, the pool shows as "RESILVER", "Status: FINISHED", "Errors: 0", and there are no errors in the table in the UI or in zpool status.

Now what? Is there an error hidden somewhere? Is the disk failing?

I see that the disks are running hotter than I expected; I'll fool around with the fans when I get a chance. But meanwhile, how do I evaluate the health of this disk?

Redcoat · Sep 5, 2022

Do you have short and long SMART tests scheduled?

I would run a long SMART test on the disk in question. Post the full results here in code tags for comment.

ChrisRJ · Sep 5, 2022

the_jest said:
how do I evaluate the health of this disk?

Seagate provides tools for this.

If not done already, please create a backup immediately.

Modern disks have a number of safety measures in place. If an error surfaces up to the FS/OS level, a number of things have gone wrong. Unless proven otherwise (e.g. the hot-swap caused the issue) I would operate under the assumption that there is indeed a serious problem. Better safe than sorry.

As to burn-in, I did a 3-month burn-in with my Seagate Exos X16 16 TB drives, and still had 3 drives fail after 6-9 months.

Also, it might help to know more about your system.

the_jest · Sep 6, 2022

I ran another backup yesterday after the problem arose; that took a long time to finish, because of an unrelated problem with my LAN. During the backup, I got another alert "Device: /dev/ada2, ATA error count increased from 5 to 8", but things still showed as OK rather than "degraded" or anything.

I run a short SMART test weekly, and I ran one after the backup finished; it showed as "passed". I'm running a long one now, and will post that when it's complete.
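While waiting on the long test, the SMART attribute table (the output of smartctl -A for the disk, /dev/ada2 in this case) is also worth a look. Below is a rough sketch that pulls out a few attributes commonly treated as warning signs when their raw value is above zero. The choice of attributes is an assumption rather than an official Seagate threshold, the sample values are invented for illustration and are not from this disk, and the function name is hypothetical.

```python
# Minimal sketch: report selected SMART attributes with a non-zero raw value.
# The watched IDs are a common rule of thumb (an assumption, not a vendor rule).
WATCHED_IDS = {
    5,    # Reallocated_Sector_Ct
    197,  # Current_Pending_Sector
    198,  # Offline_Uncorrectable
    199,  # UDMA_CRC_Error_Count (often points at cabling or the backplane)
}

# Invented excerpt shaped like `smartctl -A` output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   044    Pre-fail  Always       -       104857600
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def nonzero_watched_attributes(smart_text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smart_text.splitlines():
        # The table has 10 columns; the last (RAW_VALUE) may contain spaces.
        parts = line.split(None, 9)
        if len(parts) < 10 or not parts[0].isdigit():
            continue                      # header or non-table line
        if int(parts[0]) not in WATCHED_IDS:
            continue
        raw = parts[9].split()[0]         # first token of the raw value
        if raw.isdigit() and int(raw) > 0:
            found[parts[1]] = int(raw)
    return found


print(nonzero_watched_attributes(SAMPLE))
```

Raw_Read_Error_Rate is deliberately left out of the watched set, since on Seagate drives its raw value is typically a large number even on healthy disks.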
