Harddisks keep failing

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
Hi,
I'm running FreeNAS on this Hardware
Asrock Taichi X470 Extreme
AMD Ryzen 3700X
64 GB RAM
OS on SATA SSD 100G
RaidZ2 on 6 x 4 TB HDD
with FreeNAS-11.3-U3.2

One drive in the array was failing, showing checksum errors in zpool status. Since those HDDs were quite old, I decided to upgrade to newer drives with more capacity step by step: 4 TB drives out, 8 TB drives in. Once all drives are replaced I will get more net storage. That was my thinking.
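(As far as I understand, the extra capacity only becomes available once the last disk has been swapped and the pool is allowed to expand, either via the autoexpand property or by expanding each device manually. Roughly, with my pool name:)
Code:
zpool set autoexpand=on fileserver            # let the pool grow automatically
# or, after the last replacement, expand each device by hand:
zpool online -e fileserver gptid/<device-gptid>.eli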

Steps:
1.
Replaced one drive with a Toshiba NAS N300 8 TB drive. They are loud and draw a lot of power, but they are quite fast for spinning drives.
After a while this drive also started to show checksum errors (CKSUM). I changed the SATA cable because I thought that might be the culprit, then did a zpool clear and let it run again (the exact commands are below the steps).
The errors came back, so I replaced the drive through the manufacturer's warranty.

2.
Another 4 TB drive started to show errors, so I swapped it for an 8 TB drive, too.

3.
So now there were 2 x 8 TB and 4 x 4 TB drives left. I decided to replace all 4 x 4 TB drives with new 8 TB ones, all of them the same Toshiba model.
I bought them and started replacing one disk at a time. When the rebuild was finished, the first of the 4 drives also started to show CKSUM errors. I ended up with 3 of those 4 brand-new drives showing the same error.
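(For completeness, "zpool clear and go" in step 1 just means resetting the error counters and verifying the pool again, roughly:)
Code:
zpool clear fileserver        # reset the READ/WRITE/CKSUM counters
zpool scrub fileserver        # re-read and verify everything
zpool status -v fileserver    # watch whether the errors come back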

What do you think of that? Those CKSUM errors seem to follow me around, and I find it hard to believe that all these drives suffer the same or a very similar failure. Isn't that odd?
Could my mainboard somehow be damaging those drives?

This is what /var/log/messages throws when the drives fail:
Code:
Aug  1 22:57:33 filer smartd[37912]: Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!
Aug  1 22:57:33 filer smartd[37912]: Device: /dev/ada1, Failed SMART usage Attribute: 7 Seek_Error_Rate.
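(To dig a bit deeper into SMART on that disk, something like this can be run; I'm omitting the output here:)
Code:
smartctl -a /dev/ada1          # full SMART info, attribute table and self-test log
smartctl -t long /dev/ada1     # kick off a long self-test in the background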


This is what zpool status looks like:
Code:
root@filer[~]# zpool status -v
  pool: fileserver
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sun Aug  2 20:21:17 2020
    11,8T scanned at 1,02G/s, 6,86T issued at 605M/s, 12,8T total
    24K repaired, 53,61% done, 0 days 02:51:12 to go
config:

    NAME                                                STATE     READ WRITE CKSUM
    fileserver                                          ONLINE       0     0    10
      raidz2-0                                          ONLINE       0     0    10
        gptid/910a4a4e-3338-11ea-85c3-7085c28d3862.eli  ONLINE       0     0     0
        gptid/5e48c24b-d43e-11ea-90ba-7085c28d3862.eli  ONLINE       0     0     6
        gptid/e7fff2c9-e611-11e9-8f46-7085c28d3862.eli  ONLINE       0     0     0
        gptid/e9d533ae-e611-11e9-8f46-7085c28d3862.eli  ONLINE       0     0     0
        gptid/0bf9cd82-d332-11ea-af81-7085c28d3862.eli  ONLINE       0     0    11
        gptid/c4618d66-d276-11ea-b25a-7085c28d3862.eli  ONLINE       0     0     1

 

cndk

Cadet
Joined
Jun 1, 2020
Messages
7
Hi,
I don't know how you came to choose the Toshiba NAS N300 8 TB drives, but I can tell you they are terrible HDDs, unreliable for NAS storage.
I'm running WD Black 4 TB drives and one WD Red for testing purposes. Just to be sure, you might want to check the data cables and the PSU, but I'm fairly certain that a failed Seek_Error_Rate attribute points to a mechanical fault in the drive.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Checksum problems without read or write errors can indicate cabling issues, so you were right to check that. Make sure there's nothing wrong with the ports you're connecting the cables to... I have seen SATA ports on the motherboard die before.

Seek errors are a good indication of a dying drive, so the log output matches that.
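It's also worth grepping the logs for low-level ATA/CAM errors on those ports, something like this (adjust the pattern to your device names):
Code:
grep -iE 'ahcich|ada' /var/log/messages | tail -n 50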

Have you checked if your new drives are SMR? Those drives will not cope with the load that ZFS puts on them during resilver and either the resilver will fail and/or the drive will not perform correctly. (https://www.ixsystems.com/community/resources/list-of-known-smr-drives.141/)
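You can read the exact model string off each disk and compare it against that list, for example:
Code:
smartctl -i /dev/ada1    # prints the Device Model / Model Family
camcontrol devlist       # FreeBSD: lists all attached drives with their model strings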
 

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
I replaced ALL cables with new ones. I'm still getting CKSUM errors on plenty of drives.
I'm starting to believe the mainboard is broken (SATA ports).
I shut down my server because I'm afraid I will lose all my data. The most important data is backed up, of course, but there's plenty of stuff I don't want to lose that I can't afford to back up.
The new drives are Toshiba N300 8 TB and I'm pretty sure they're not SMR.
My next step will be to swap the drives into another PC and see whether they fail there, too.
If so, I guess the disks really are broken; if not, it must be the mainboard.
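(The plan is basically to run a long self-test on each disk in the other machine and look at the results, e.g.:)
Code:
smartctl -t long /dev/ada1      # start the long self-test
smartctl -l selftest /dev/ada1  # check the result once it has finished
smartctl -A /dev/ada1           # look at Seek_Error_Rate and friends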
 

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
Update:
I tested all the new drives on a different computer.
SMART also fails on all 4 disks. I started to rebuild the array by replacing one Toshiba with a WD Red 8 TB (CMR).
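(The replacement itself was started from the FreeNAS GUI; the raw ZFS equivalent would be roughly the command below, with placeholder gptids. On an encrypted pool like mine the GUI also takes care of partitioning and the GELI layer, so treat this purely as a sketch:)
Code:
zpool replace fileserver gptid/<old-gptid>.eli gptid/<new-gptid>.eli
zpool status -v fileserver    # follow the resilver progress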
While this is running, one of the old 4 TB disks has also started to fail:
Code:
root@filer[~]# zpool status -v
  pool: fileserver
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Aug  4 14:21:16 2020
    11,8T scanned at 546M/s, 10,7T issued at 497M/s, 12,8T total
    1,79T resilvered, 83,91% done, 0 days 01:12:23 to go
config:

    NAME                                                  STATE     READ WRITE CKSUM
    fileserver                                            DEGRADED     0     0     5
      raidz2-0                                            DEGRADED     0     0    10
        gptid/910a4a4e-3338-11ea-85c3-7085c28d3862.eli    ONLINE       0     0     0
        gptid/5e48c24b-d43e-11ea-90ba-7085c28d3862.eli    ONLINE       0     0     0
        gptid/e7fff2c9-e611-11e9-8f46-7085c28d3862.eli    DEGRADED     0     0   145  too many errors
        gptid/e9d533ae-e611-11e9-8f46-7085c28d3862.eli    ONLINE       0     0     0
        gptid/0bf9cd82-d332-11ea-af81-7085c28d3862.eli    ONLINE       0     0     0
        replacing-5                                       UNAVAIL      0     0     0
          12944826883504952763                            UNAVAIL      0     0     0  was /dev/gptid/c4618d66-d276-11ea-b25a-7085c28d3862.eli
          gptid/f5b5e122-d64c-11ea-a1da-7085c28d3862.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

Wish me luck, I'll need it. :-/
 

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
So the resilvering finished, but with 5 errors.
What shall I do now?
Code:
root@filer[/var/log]# zpool status -v
  pool: fileserver
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 2,13T in 0 days 09:44:41 with 5 errors on Wed Aug  5 00:05:57 2020
config:

    NAME                                                  STATE     READ WRITE CKSUM
    fileserver                                            DEGRADED     0     0     5
      raidz2-0                                            DEGRADED     0     0    10
        gptid/910a4a4e-3338-11ea-85c3-7085c28d3862.eli    ONLINE       0     0     0
        gptid/5e48c24b-d43e-11ea-90ba-7085c28d3862.eli    ONLINE       0     0     1
        gptid/e7fff2c9-e611-11e9-8f46-7085c28d3862.eli    DEGRADED     0     0   145  too many errors
        gptid/e9d533ae-e611-11e9-8f46-7085c28d3862.eli    ONLINE       0     0     0
        gptid/0bf9cd82-d332-11ea-af81-7085c28d3862.eli    ONLINE       0     0     0
        replacing-5                                       UNAVAIL      0     0     0
          12944826883504952763                            UNAVAIL      0     0     0  was /dev/gptid/c4618d66-d276-11ea-b25a-7085c28d3862.eli
          gptid/f5b5e122-d64c-11ea-a1da-7085c28d3862.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
 

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
I consider my array to be broken. Currently I'm copying all my files to external drives in order to make a new array with non-broken drives.
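(In case it helps anyone: one way to do such a copy is to snapshot the pool and replicate it with zfs send/receive to a pool on the external disks; the pool and snapshot names below are only examples. If send chokes on the datasets with permanent errors, a plain file-level copy of the important data is the fallback.)
Code:
zfs snapshot -r fileserver@migrate
zfs send -R fileserver@migrate | zfs receive -F externalpool/fileserver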
 

cndk

Cadet
Joined
Jun 1, 2020
Messages
7
If possible, RMA those Toshibas and get some Seagate IronWolf Pro or Exos drives instead.
 