Six drives failing?

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
I had multiple alerts come through this morning. Pool shows healthy and all drives show online.

New alerts:
* Device: /dev/da0 [SAT], 808 Currently unreadable (pending) sectors.

Current alerts:
* Device: /dev/da11 [SAT], 11 Currently unreadable (pending) sectors.
* Device: /dev/da4 [SAT], 11 Currently unreadable (pending) sectors.
* Device: /dev/da1 [SAT], 40 Currently unreadable (pending) sectors.
* Device: /dev/da1 [SAT], 5 Offline uncorrectable sectors.
* Device: /dev/da0 [SAT], 808 Currently unreadable (pending) sectors.


Code:
root@freenas[~]# zpool status
  pool: PoolA
 state: ONLINE
  scan: resilvered 72K in 00:00:01 with 0 errors on Thu Nov 17 10:58:18 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        PoolA                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/3a330c53-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3b3b98da-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a249977-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3b523fac-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a3526ff-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a29344e-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3b4b7388-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3b418073-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a1bb94f-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a3a8bbc-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/39394f2d-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0

[screenshot attachments]
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Run some SMART long tests on the alarming drives. Those alerts look like SMART error counters incrementing, and are worrisome. You very well could have several drives not long for this world. Do you run SMART tests regularly?
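For reference, the long tests can be started and checked from the shell with smartctl (part of smartmontools, which ships with TrueNAS). A sketch, using the device names from the alerts above; repeat for each alerting drive (da0, da1, da4, da11):

```shell
# Start a long (extended) self-test; smartctl prints an estimated duration
smartctl -t long /dev/da0

# After the test completes, review the self-test log...
smartctl -l selftest /dev/da0

# ...and the key error counters from the alerts
smartctl -A /dev/da0 | grep -E 'Current_Pending_Sector|Offline_Uncorrectable|Reallocated_Sector'
```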
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You don't list the make, model & size of the disks, nor how they are connected, (to which ports on computer...). Please do so.

(Samuel already got the SMART test request out...)
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Six drives all failing at the same time is extremely unlikely. Have you at least been running SMART tests regularly? Have you tried reseating the connectors? Have you tried checking/replacing the cables? Basically, I'm asking if you've done some very basic troubleshooting on your own.
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
You don't list the make, model & size of the disks, nor how they are connected, (to which ports on computer...). Please do so.

(Samuel already got the SMART test request out...)
[screenshot: disk list]

Brands are HGST, Seagate and Dell
The drives are in Rosewill 3 x 5.25-Inch to 4 x 3.5-Inch Hot-swap bays connected to a Dell H310 and IBM ServerRaid 16.

Run some SMART long tests on the alarming drives. Those alerts look like SMART error counters incrementing, and are worrisome. You very well could have several drives not long for this world. Do you run SMART tests regularly?
I started long SMART tests on them when I saw the alerts come in. I had weekly SMART tests scheduled, but it looks like that had stopped working. I set up a new scheduled job to see if it kicks off.

Six drives all failing at the same time is extremely unlikely. Have you at least been running SMART tests regularly? Have you tried reseating the connectors? Have you tried checking/replacing the cables? Basically, I'm asking if you've done some very basic troubleshooting on your own.
Indeed. I checked all the cables and connectors; nothing seems to have vibrated loose. Most, if not all, of these connectors have locks on them, but never bet on those holding.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
I suppose the controllers/adapters (Dell/IBM) are flashed to IT mode ... correct?
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89

awasb

Patron
Joined
Jan 11, 2021
Messages
415
Are there any errors in kernel log/syslog?

Code:
dmesg | less


Code:
less /var/log/messages
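Since those logs can be long, filtering for the da* devices and CAM errors can surface disk I/O problems faster. A generic sketch; adjust the patterns to your device names:

```shell
# Kernel messages mentioning the da* devices or CAM transport errors
dmesg | grep -iE '\(da[0-9]+:|CAM status'

# Same filter against the persistent log
grep -iE 'da[0-9]+' /var/log/messages | less
```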
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Indeed. I checked all the cables and connectors; nothing seems to have vibrated loose. Most, if not all, of these connectors have locks on them, but never bet on those holding.
What about the result of your SMART tests?
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89

Attachments

  • log.txt (26 KB)
  • log2.txt (16.3 KB)

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Were these new drives or refurbished on installation? If refurbished, you may have just reached the end of life of these platters, and it's time to replace these drives.
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
They are used; I had them lying around in my office. It's very possible they need replacing, but it's odd that they would all alert at once.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
They are used; I had them lying around in my office. It's very possible they need replacing, but it's odd that they would all alert at once.

Not really. Coupled vibrations through the enclosure stress drives in a server quite differently from single drives in a desktop. ZFS is also a much more strenuous file system than NTFS or ext4 in terms of stressing the sectors on drives in a pool. This is why burning in drives before putting them into production in a ZFS application is a best practice.
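One common burn-in approach looks roughly like the following, if badblocks is available on your system. This is a sketch, not a prescription, and the -w pass is destructive: only run it on a drive with no data you care about.

```shell
# 1. Baseline: short SMART test, then note the error counters
smartctl -t short /dev/da0
smartctl -A /dev/da0

# 2. Destructive write/read pass over every sector (WIPES THE DRIVE).
#    Drives larger than ~2 TB need a bigger block size, e.g. -b 4096.
badblocks -b 4096 -ws /dev/da0

# 3. Long SMART test, then compare Reallocated/Pending counters to the baseline
smartctl -t long /dev/da0
smartctl -A /dev/da0
```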
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
Odd, in my experience. I've never had mass drive failures outside of a firmware issue or a SAS card/cable going bad.

I have some replacement drives coming in. I'll see if they show the same behavior.
I'm also going to replace the power supply. The one in there has worked fine, but it's getting pretty old, and I had more concerns about it failing than I did about the hard drives.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
As far as one could see (with the information given) nothing seems seriously wrong with the base system.

How critical is the stored data?

If not already done on a regular schedule and if the data is more or less "just private" (i.e. no job/money/home lost if something gets b0rked), I'd backup to any functioning external storage drive large enough via zfs replication, clear the pool from errors (zpool clear $POOLNAME), scrub the pool, get the smart-tests running and keep an eye on it.

If the data is critical and you rely on it, I'd get another box (with sufficient drives) to copy things over (again via zfs replication) as soon as possible and ditch the drives for new ones.
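In shell terms, that backup-then-verify path might look like this. A sketch only: `backup` is a hypothetical pool name for the external target, and `-F` on receive will overwrite anything already there.

```shell
# Snapshot the pool recursively and replicate it to the backup pool
zfs snapshot -r PoolA@pre-failure
zfs send -R PoolA@pre-failure | zfs recv -F backup/PoolA

# Then clear the error counters and verify the pool with a scrub
zpool clear PoolA
zpool scrub PoolA
zpool status PoolA   # watch scrub progress and any new errors
```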
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
As far as one could see (with the information given) nothing seems seriously wrong with the base system.

How critical is the stored data?

If not already done on a regular schedule and if the data is more or less "just private" (i.e. no job/money/home lost if something gets b0rked), I'd backup to any functioning external storage drive large enough via zfs replication, clear the pool from errors (zpool clear $POOLNAME), scrub the pool, get the smart-tests running and keep an eye on it.

If the data is critical and you rely on it, I'd get another box (with sufficient drives) to copy things over (again via zfs replication) as soon as possible and ditch the drives for new ones.

Nothing that I couldn't get back. It's mostly storage for my Emby server. The data is backed up on drives outside of this system.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
The data is backed up on drives outside of this system.
This is great practice, I wish more people would do it.
And, that's a big vdev.
Anyway, please do share the SMART test data (the long tests should have finished by now).

Question: are all six of the failing drives from the same pool?
Also, is the configuration in your signature up to date? I see some (probably) 2 TB drives in your last screenshot.
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
This is great practice, I wish more people would do it.
And, that's a big vdev.
Anyway, please do share the SMART test data (the long tests should have finished by now).

Question: are all six of the failing drives from the same pool?
Also, is the configuration in your signature up to date? I see some (probably) 2 TB drives in your last screenshot.
Here is my current updated build. It's mostly the same.
TrueNAS-13.0-U3
MOBO: ASRock Rack MB X470D4U
CPU: Ryzen 5 2600
Dell H310
IBM ServerRaid 16 port
4x Rosewill 3 x 5.25-Inch to 4 x 3.5-Inch Hot-swap bays
PoolA: 12x 4TB (raid z2)
PoolB: 2x 2TB (mirror)
OS HDD: Crucial MX500 M.2
RAM: 32GB DDR4 ECC
CASE: Antec 1200 V2
PSU: Antec 750
UPS: APC Smart-UPS 1000 VA

Is it a big vdev? I didn't think 12 disks was very big.

Actually, you make a good point. One of the 2 TB drives (da4) was on the alert list, and that is in a different pool.

The SMART tests for da0 and da1 failed.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
And da10? It showed smartd errors in /var/log/messages, too.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
@ITGuy1024 you don't often see 12 spinners in a vdev on this forum, mostly because the use cases don't require it: those who have that many disks usually started with 6 in a vdev and then doubled their capacity, ending up with 2 vdevs of 6 disks each.
Anyway, please post the SMART data.
 