Drives keep failing?

ben_dover

Cadet
Joined: Jul 22, 2021
Messages: 1
Hi All,

Running the latest version of TrueNAS on a Dell R720xd: 96 GB RAM, twelve 6 TB SAS drives, one 1 TB NVMe for cache, an E5-2690 CPU, and a flashed H310 Mini HBA.

I keep having drives "fail" or become degraded. I'll run zpool status and see lots of read/write errors on a particular drive. If I clear the errors from just that drive, it looks fine for a while, then the errors show back up on the same drive (this happens with any drive). If I then replace the drive, every other drive throws errors during the resilver, and when the resilver finishes, another drive "fails". If I leave it like that, the whole pool starts to throw read/write errors.

Any ideas? I'm at a loss; there's no way the drives are actually failing at that rate (one every day). I have an identical NAS, and the drives I ordered for both units all came from the same batch.
(The drives were used, by the way... I know it's not recommended, but new ones are 5x the price.)
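For reference, what I've been doing each cycle boils down to roughly this (the gptid labels here are just placeholders, not the real ones):

Code:
 zpool status Tank2                                            # read/write errors piling up on one drive
 zpool clear Tank2 gptid/<failing-drive>                       # looks clean for a bit, then the same drive errors again
 zpool replace Tank2 gptid/<failing-drive> gptid/<new-drive>   # during the resilver, the other drives throw errors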

Could it be the H310 Mini that's failing? That's the only thing I can think of that would cause so many drives to "fail".

What should I do? I can't keep ordering more drives; it's getting very expensive.

Thanks!

Below is the error from this morning, after replacing a drive yesterday and waiting for the resilver to finish:

Code:
  pool: Tank2
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 10.4G in 00:43:48 with 0 errors on Wed Jul 21 21:13:01 2021
config:

    NAME                                            STATE     READ WRITE CKSUM
    Tank2                                           DEGRADED     0     0     0
      raidz3-0                                      DEGRADED     0     0     0
        gptid/e9380b23-e630-11eb-8b55-ecf4bbc0e684  ONLINE     337 27.9K     0
        gptid/ea1e756a-e630-11eb-8b55-ecf4bbc0e684  ONLINE      50 3.40K     0
        gptid/eac53908-e630-11eb-8b55-ecf4bbc0e684  ONLINE      12 39.2K     0
        gptid/eab4836c-e630-11eb-8b55-ecf4bbc0e684  ONLINE     460 17.8K     0
        gptid/ed614cef-e960-11eb-b636-ecf4bbc0e684  ONLINE     105 31.1K     0
        gptid/ebf41b72-e630-11eb-8b55-ecf4bbc0e684  ONLINE      25 6.14K     0
        gptid/eb760390-e630-11eb-8b55-ecf4bbc0e684  ONLINE      80 44.6K     0
        gptid/ebbab4d7-e630-11eb-8b55-ecf4bbc0e684  ONLINE     391 28.5K     0
        gptid/ec7d62a3-e630-11eb-8b55-ecf4bbc0e684  DEGRADED    38 36.8K   265  too many errors
        gptid/ddf08d31-ea29-11eb-b636-ecf4bbc0e684  ONLINE       0     0    43
        gptid/eca02a58-e630-11eb-8b55-ecf4bbc0e684  ONLINE      21 9.13K     0
        gptid/aa75b32a-e665-11eb-8b55-ecf4bbc0e684  ONLINE      98 48.4K     0
    cache
      gptid/e9369611-e630-11eb-8b55-ecf4bbc0e684    ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:59 with 0 errors on Fri Jul 16 03:47:00 2021
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      da12p2    ONLINE       0     0     0



This is after running zpool clear on the pool:

Code:
  pool: Tank2
 state: ONLINE
  scan: resilvered 10.4G in 00:43:48 with 0 errors on Wed Jul 21 21:13:01 2021
config:

    NAME                                            STATE     READ WRITE CKSUM
    Tank2                                           ONLINE       0     0     0
      raidz3-0                                      ONLINE       0     0     0
        gptid/e9380b23-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/ea1e756a-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/eac53908-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/eab4836c-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/ed614cef-e960-11eb-b636-ecf4bbc0e684  ONLINE       0     0     0
        gptid/ebf41b72-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/eb760390-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/ebbab4d7-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/ec7d62a3-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/ddf08d31-ea29-11eb-b636-ecf4bbc0e684  ONLINE       0     0     0
        gptid/eca02a58-e630-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
        gptid/aa75b32a-e665-11eb-8b55-ecf4bbc0e684  ONLINE       0     0     0
    cache
      gptid/e9369611-e630-11eb-8b55-ecf4bbc0e684    ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:59 with 0 errors on Fri Jul 16 03:47:00 2021
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      da12p2    ONLINE       0     0     0

errors: No known data errors


In about 20 minutes, another drive will start to show errors...
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined: Apr 24, 2020
Messages: 5,399
It's either your HBA or backplane failing, and causing the checksum errors. I recently experienced a similar issue with one port on my backplane that went bad, causing read checksum errors on whichever drive happened to be in that slot. (I diagnosed it as the backplane by moving disks around and seeing where the error repeated.) After I swapped my motherboard and drives into another case with a working backplane, the checksum errors disappeared completely.
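If you want to try the same kind of isolation, something like this from the TrueNAS shell should map the gptid labels in your zpool status output to physical da devices and slots, so you can tell whether the errors follow the disk or stay with the slot (sas2ircu details vary with the HBA firmware):

Code:
 glabel status | grep gptid    # map each gptid/... label to its daX device
 camcontrol devlist            # list the daX devices the HBA sees
 sas2ircu list                 # find the controller index (usually 0)
 sas2ircu 0 display            # enclosure/slot and serial number for each attached drive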
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined: Feb 6, 2014
Messages: 5,112
It's either your HBA or backplane failing, and causing the checksum errors.
Or the SAS cabling leading to it. Reseat and/or replace the cabling first (it's cheaper than the backplane).

Regarding the H310 Mini: is it flashed with the LSI IT firmware? Flashing the Mini is a different process from the "full-size" cards, so make sure the instructions you follow specifically reference the "Mini/Mono" version.
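From the TrueNAS shell, something along these lines should confirm what the card is actually running (assuming the usual LSI SAS2 tools are present; the exact output wording varies between firmware releases):

Code:
 sas2flash -listall    # one line per LSI SAS2 controller, with its firmware version
 sas2flash -list       # full details for controller 0, including IT vs. IR firmware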
 