Six drives failing?

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
I had multiple alerts come through this morning. Pool shows healthy and all drives show online.

New alerts:
* Device: /dev/da0 [SAT], 808 Currently unreadable (pending) sectors.

Current alerts:
* Device: /dev/da11 [SAT], 11 Currently unreadable (pending) sectors.
* Device: /dev/da4 [SAT], 11 Currently unreadable (pending) sectors.
* Device: /dev/da1 [SAT], 40 Currently unreadable (pending) sectors.
* Device: /dev/da1 [SAT], 5 Offline uncorrectable sectors.
* Device: /dev/da0 [SAT], 808 Currently unreadable (pending) sectors.


Code:
root@freenas[~]# zpool status
  pool: PoolA
 state: ONLINE
  scan: resilvered 72K in 00:00:01 with 0 errors on Thu Nov 17 10:58:18 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        PoolA                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/3a330c53-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3b3b98da-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a249977-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3b523fac-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a3526ff-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a29344e-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3b4b7388-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3b418073-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a1bb94f-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/3a3a8bbc-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0
            gptid/39394f2d-6108-11ed-9396-a8a15908b7b4  ONLINE       0     0     0

[screenshot attachments]
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Run some SMART long tests on the alarming drives. Those alerts look like SMART error counters incrementing, and are worrisome. You very well could have several drives not long for this world. Do you run SMART tests regularly?
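For reference, the long tests can be started and checked from the shell with smartctl (part of smartmontools, which ships with TrueNAS). A sketch, using the device names from the alerts above; repeat for each alerting drive (da0, da1, da4, da11):

```shell
# Start a long (extended) self-test; smartctl prints an estimated duration
smartctl -t long /dev/da0

# After the test completes, review the self-test log...
smartctl -l selftest /dev/da0

# ...and the key error counters from the alerts
smartctl -A /dev/da0 | grep -E 'Current_Pending_Sector|Offline_Uncorrectable|Reallocated_Sector'
```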
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You don't list the make, model & size of the disks, nor how they are connected, (to which ports on computer...). Please do so.

(Samuel already got the SMART test request out...)
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Six drives all failing at the same time is extremely unlikely. Have you at least been running SMART tests regularly? Have you tried reseating the connectors? Have you tried checking/replacing the cables? Basically, I'm asking if you've done some very basic troubleshooting on your own.
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
You don't list the make, model & size of the disks, nor how they are connected, (to which ports on computer...). Please do so.

(Samuel already got the SMART test request out...)
[screenshot: disk list]

Brands are HGST, Seagate and Dell
The drives are in Rosewill 3 x 5.25-Inch to 4 x 3.5-Inch Hot-swap bays connected to a Dell H310 and IBM ServerRaid 16.

Run some SMART long tests on the alarming drives. Those alerts look like SMART error counters incrementing, and are worrisome. You very well could have several drives not long for this world. Do you run SMART tests regularly?
I started long SMART tests on them when I saw the alerts come in. I had weekly SMART tests scheduled, but it looks like that had stopped working. I set up a new scheduled job to see if it kicks off.

Six drives all failing at the same time is extremely unlikely. Have you at least been running SMART tests regularly? Have you tried reseating the connectors? Have you tried checking/replacing the cables? Basically, I'm asking if you've done some very basic troubleshooting on your own.
Indeed. I checked all the cables and connectors; nothing seems to have vibrated loose. Most, if not all, of these connectors have locks on them, but never bet on those holding.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
I suppose the controllers/adapters (Dell/IBM) are flashed to IT mode ... correct?
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89

awasb

Patron
Joined
Jan 11, 2021
Messages
415
Are there any errors in kernel log/syslog?

Code:
dmesg | less


Code:
less /var/log/messages
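Since those logs can be long, filtering for the da* devices and CAM errors can surface disk I/O problems faster. A generic sketch; adjust the patterns to your device names:

```shell
# Kernel messages mentioning the da* devices or CAM transport errors
dmesg | grep -iE '\(da[0-9]+:|CAM status'

# Same filter against the persistent log
grep -iE 'da[0-9]+' /var/log/messages | less
```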
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Indeed. I checked all the cables and connectors; nothing seems to have vibrated loose. Most, if not all, of these connectors have locks on them, but never bet on those holding.
What about the result of your SMART tests?
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89

Attachments

  • log.txt (26 KB)
  • log2.txt (16.3 KB)

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Were these new drives or refurbished on installation? If refurbished, you may have just reached the end of life of these platters, and it's time to replace these drives.
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
They are used; I had them lying around in my office. It's very possible they need replacing, but it's odd that they would all alert at once.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
They are used; I had them lying around in my office. It's very possible they need replacing, but it's odd that they would all alert at once.

Not really. Coupled vibrations through the enclosure stress drives in a server quite differently from single drives in a desktop. ZFS is also a much more strenuous file system than NTFS or ext4 in terms of stressing the sectors on drives in a pool. This is why burning in drives before putting them into production in a ZFS application is a best practice.
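One common burn-in approach looks roughly like the following, if badblocks is available on your system. This is a sketch, not a prescription, and the -w pass is destructive: only run it on a drive with no data you care about.

```shell
# 1. Baseline: short SMART test, then note the error counters
smartctl -t short /dev/da0
smartctl -A /dev/da0

# 2. Destructive write/read pass over every sector (WIPES THE DRIVE).
#    Drives larger than ~2 TB need a bigger block size, e.g. -b 4096.
badblocks -b 4096 -ws /dev/da0

# 3. Long SMART test, then compare Reallocated/Pending counters to the baseline
smartctl -t long /dev/da0
smartctl -A /dev/da0
```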
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
Odd, in my experience. I've never had mass drive failures outside of a firmware issue or a SAS card/cable going bad.

I have some replacement drives coming in. I'll see if they show the same behavior.
I'm also going to replace the power supply. The one in there has worked fine, but it's getting pretty old, and I had more concerns about it failing than I did about the hard drives.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
As far as one could see (with the information given) nothing seems seriously wrong with the base system.

How critical is the stored data?

If not already done on a regular schedule and if the data is more or less "just private" (i.e. no job/money/home lost if something gets b0rked), I'd backup to any functioning external storage drive large enough via zfs replication, clear the pool from errors (zpool clear $POOLNAME), scrub the pool, get the smart-tests running and keep an eye on it.

If the data is critical and you rely on it, I'd get another box (with sufficient drives) to copy things over (again via zfs replication) as soon as possible and ditch the drives for new ones.
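In shell terms, that backup-then-verify path might look like this. A sketch only: `backup` is a hypothetical pool name for the external target, and `-F` on receive will overwrite anything already there.

```shell
# Snapshot the pool recursively and replicate it to the backup pool
zfs snapshot -r PoolA@pre-failure
zfs send -R PoolA@pre-failure | zfs recv -F backup/PoolA

# Then clear the error counters and verify the pool with a scrub
zpool clear PoolA
zpool scrub PoolA
zpool status PoolA   # watch scrub progress and any new errors
```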
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
As far as one could see (with the information given) nothing seems seriously wrong with the base system.

How critical is the stored data?

If not already done on a regular schedule and if the data is more or less "just private" (i.e. no job/money/home lost if something gets b0rked), I'd backup to any functioning external storage drive large enough via zfs replication, clear the pool from errors (zpool clear $POOLNAME), scrub the pool, get the smart-tests running and keep an eye on it.

If the data is critical and you rely on it, I'd get another box (with sufficient drives) to copy things over (again via zfs replication) as soon as possible and ditch the drives for new ones.

Nothing that I couldn't get back. It's mostly storage for my Emby server. The data is backed up on drives outside of this system.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
The data is backed up on drives outside of this system.
This is great practice, I wish more people would do it.
And, that's a big vdev.
Anyway, please do share the SMART test data (the long tests should have finished by now).

Question: are all six of the failing drives from the same pool?
Also, is the configuration in your signature up to date? I see some (probably) 2 TB drives in your last screenshot.
 

ITGuy1024

Explorer
Joined
Dec 13, 2014
Messages
89
This is great practice, I wish more people would do it.
And, that's a big vdev.
Anyway, please do share the SMART test data (the long tests should have finished by now).

Question: are all six of the failing drives from the same pool?
Also, is the configuration in your signature up to date? I see some (probably) 2 TB drives in your last screenshot.
Here is my current updated build. It's mostly the same.
TrueNAS-13.0-U3
MOBO: ASRock Rack MB X470D4U
CPU: Ryzen 5 2600
Dell H310
IBM ServerRaid 16 port
4x Rosewill 3 x 5.25-Inch to 4 x 3.5-Inch Hot-swap bays
PoolA: 12x 4TB (raid z2)
PoolB: 2x 2TB (mirror)
OS HDD: Crucial MX500 M.2
RAM: 32GB DDR4 ECC
CASE: Antec 1200 V2
PSU: Antec 750
UPS: APC Smart-UPS 1000 VA

Is it a big vdev? I didn't think 12 disks was very big.

Actually, you make a good point. One of the 2 TB drives (da4) was on the alert list, and that is in a different pool.

The SMART tests for da0 and da1 failed.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
And da10? It showed smartd errors in /var/log/messages, too.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
@ITGuy1024 you don't often see 12 spinners in a vdev on this forum, mostly because the use cases don't require it: those who have that many disks usually started with 6 in a vdev and then doubled their capacity, ending up with 2 vdevs of 6 disks each.
Anyway, please post the SMART data.
 