Harddisks keep failing

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
Hi,
I'm running FreeNAS on this Hardware
Asrock Taichi X470 Extreme
AMD Ryzen 3700X
64 GB RAM
OS on SATA SSD 100G
RaidZ2 on 6 x 4 TB HDD
with FreeNAS-11.3-U3.2

One drive in the array was failing, showing checksum errors in zpool status. Since those HDDs were quite old, I decided to upgrade to newer drives with more capacity step by step: 4 TB drives out, 8 TB drives in. Once all drives are replaced I will get more net storage. That was my thinking.
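(As far as I understand, the extra capacity only becomes available once the last disk has been swapped and the pool is allowed to expand, either via the autoexpand property or by expanding each device manually. Roughly, with my pool name:)
Code:
zpool set autoexpand=on fileserver            # let the pool grow automatically
# or, after the last replacement, expand each device by hand:
zpool online -e fileserver gptid/<device-gptid>.eli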

Steps:
1.
Replaced one drive with a Toshiba NAS N300 8 TB drive. They are loud and draw a lot of power, but they are quite fast for spinning drives.
After a while this drive also started to show checksum errors (CKSUM). I changed the SATA cable because I thought that might be the culprit, then did a zpool clear and let it run again (the exact commands are below the steps).
The errors came back, so I replaced the drive through the manufacturer's warranty.

2.
Another 4 TB drive started to show errors, so I swapped it for an 8 TB drive, too.

3.
So now there were 2 x 8 TB and 4 x 4 TB drives left. I decided to replace all 4 x 4 TB drives with new 8 TB ones, all of them the same Toshiba model.
I bought them and started replacing one disk at a time. When the rebuild was finished, the first of the 4 drives also started to show CKSUM errors. I ended up with 3 of those 4 brand-new drives showing the same error.
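(For completeness, "zpool clear and go" in step 1 just means resetting the error counters and verifying the pool again, roughly:)
Code:
zpool clear fileserver        # reset the READ/WRITE/CKSUM counters
zpool scrub fileserver        # re-read and verify everything
zpool status -v fileserver    # watch whether the errors come back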

What do you think of that? Those CKSUM errors seem to follow me around, and I find it hard to believe that all these drives suffer the same or a very similar failure. Isn't that odd?
Could my mainboard somehow be damaging those drives?

This is what /var/log/messages throws when the drives fail:
Code:
Aug  1 22:57:33 filer smartd[37912]: Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!
Aug  1 22:57:33 filer smartd[37912]: Device: /dev/ada1, Failed SMART usage Attribute: 7 Seek_Error_Rate.
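(To dig a bit deeper into SMART on that disk, something like this can be run; I'm omitting the output here:)
Code:
smartctl -a /dev/ada1          # full SMART info, attribute table and self-test log
smartctl -t long /dev/ada1     # kick off a long self-test in the background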


This is what zpool status looks like:
Code:
root@filer[~]# zpool status -v
  pool: fileserver
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sun Aug  2 20:21:17 2020
    11,8T scanned at 1,02G/s, 6,86T issued at 605M/s, 12,8T total
    24K repaired, 53,61% done, 0 days 02:51:12 to go
config:

    NAME                                                STATE     READ WRITE CKSUM
    fileserver                                          ONLINE       0     0    10
      raidz2-0                                          ONLINE       0     0    10
        gptid/910a4a4e-3338-11ea-85c3-7085c28d3862.eli  ONLINE       0     0     0
        gptid/5e48c24b-d43e-11ea-90ba-7085c28d3862.eli  ONLINE       0     0     6
        gptid/e7fff2c9-e611-11e9-8f46-7085c28d3862.eli  ONLINE       0     0     0
        gptid/e9d533ae-e611-11e9-8f46-7085c28d3862.eli  ONLINE       0     0     0
        gptid/0bf9cd82-d332-11ea-af81-7085c28d3862.eli  ONLINE       0     0    11
        gptid/c4618d66-d276-11ea-b25a-7085c28d3862.eli  ONLINE       0     0     1

 

cndk

Cadet
Joined
Jun 1, 2020
Messages
7
Hi,
I don't know how you came to choose the Toshiba NAS N300 8 TB drives, but I can tell you they are terrible HDDs, unreliable for NAS storage.
I'm running WD Black 4 TB drives and one WD Red for testing purposes. Just to be sure, you might want to check the data cables and the PSU, but I'm fairly certain that a failed Seek_Error_Rate attribute points to a mechanical fault in the drive.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Checksum problems without read or write errors can indicate cabling issues, so you were right to check that. Make sure there's nothing wrong with the ports you're connecting the cables to... I have seen SATA ports on the motherboard die before.

Seek errors are a good indication of a dying drive, so the log output matches that.
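It's also worth grepping the logs for low-level ATA/CAM errors on those ports, something like this (adjust the pattern to your device names):
Code:
grep -iE 'ahcich|ada' /var/log/messages | tail -n 50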

Have you checked if your new drives are SMR? Those drives will not cope with the load that ZFS puts on them during resilver and either the resilver will fail and/or the drive will not perform correctly. (https://www.ixsystems.com/community/resources/list-of-known-smr-drives.141/)
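You can read the exact model string off each disk and compare it against that list, for example:
Code:
smartctl -i /dev/ada1    # prints the Device Model / Model Family
camcontrol devlist       # FreeBSD: lists all attached drives with their model strings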
 

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
I replaced ALL cables with new ones. I'm still getting CKSUM errors on plenty of drives.
I'm starting to believe the mainboard is broken (SATA ports).
I shut down my server because I'm afraid I will lose all my data. The most important data is backed up, of course, but there's plenty of stuff I don't want to lose that I can't afford to back up.
The new drives are Toshiba N300 8 TB and I'm pretty sure they're not SMR.
My next step will be to swap the drives into another PC and see whether they fail there, too.
If so, I guess the disks really are broken; if not, it must be the mainboard.
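(The plan is basically to run a long self-test on each disk in the other machine and look at the results, e.g.:)
Code:
smartctl -t long /dev/ada1      # start the long self-test
smartctl -l selftest /dev/ada1  # check the result once it has finished
smartctl -A /dev/ada1           # look at Seek_Error_Rate and friends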
 

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
Update:
I tested all the new drives on a different computer.
SMART also fails on all 4 disks. I started to rebuild the array by replacing one Toshiba with a WD Red 8 TB (CMR).
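(The replacement itself was started from the FreeNAS GUI; the raw ZFS equivalent would be roughly the command below, with placeholder gptids. On an encrypted pool like mine the GUI also takes care of partitioning and the GELI layer, so treat this purely as a sketch:)
Code:
zpool replace fileserver gptid/<old-gptid>.eli gptid/<new-gptid>.eli
zpool status -v fileserver    # follow the resilver progress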
While this is running, one of the old 4 TB disks has also started to fail:
Code:
root@filer[~]# zpool status -v
  pool: fileserver
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Aug  4 14:21:16 2020
    11,8T scanned at 546M/s, 10,7T issued at 497M/s, 12,8T total
    1,79T resilvered, 83,91% done, 0 days 01:12:23 to go
config:

    NAME                                                  STATE     READ WRITE CKSUM
    fileserver                                            DEGRADED     0     0     5
      raidz2-0                                            DEGRADED     0     0    10
        gptid/910a4a4e-3338-11ea-85c3-7085c28d3862.eli    ONLINE       0     0     0
        gptid/5e48c24b-d43e-11ea-90ba-7085c28d3862.eli    ONLINE       0     0     0
        gptid/e7fff2c9-e611-11e9-8f46-7085c28d3862.eli    DEGRADED     0     0   145  too many errors
        gptid/e9d533ae-e611-11e9-8f46-7085c28d3862.eli    ONLINE       0     0     0
        gptid/0bf9cd82-d332-11ea-af81-7085c28d3862.eli    ONLINE       0     0     0
        replacing-5                                       UNAVAIL      0     0     0
          12944826883504952763                            UNAVAIL      0     0     0  was /dev/gptid/c4618d66-d276-11ea-b25a-7085c28d3862.eli
          gptid/f5b5e122-d64c-11ea-a1da-7085c28d3862.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

Wish me luck, I'll need it. :-/
 

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
So the resilvering finished, but with 5 errors.
What shall I do now?
Code:
root@filer[/var/log]# zpool status -v
  pool: fileserver
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 2,13T in 0 days 09:44:41 with 5 errors on Wed Aug  5 00:05:57 2020
config:

    NAME                                                  STATE     READ WRITE CKSUM
    fileserver                                            DEGRADED     0     0     5
      raidz2-0                                            DEGRADED     0     0    10
        gptid/910a4a4e-3338-11ea-85c3-7085c28d3862.eli    ONLINE       0     0     0
        gptid/5e48c24b-d43e-11ea-90ba-7085c28d3862.eli    ONLINE       0     0     1
        gptid/e7fff2c9-e611-11e9-8f46-7085c28d3862.eli    DEGRADED     0     0   145  too many errors
        gptid/e9d533ae-e611-11e9-8f46-7085c28d3862.eli    ONLINE       0     0     0
        gptid/0bf9cd82-d332-11ea-af81-7085c28d3862.eli    ONLINE       0     0     0
        replacing-5                                       UNAVAIL      0     0     0
          12944826883504952763                            UNAVAIL      0     0     0  was /dev/gptid/c4618d66-d276-11ea-b25a-7085c28d3862.eli
          gptid/f5b5e122-d64c-11ea-a1da-7085c28d3862.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
 

Junicast

Patron
Joined
Mar 6, 2015
Messages
206
I consider my array to be broken. Currently I'm copying all my files to external drives in order to make a new array with non-broken drives.
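(In case it helps anyone: one way to do such a copy is to snapshot the pool and replicate it with zfs send/receive to a pool on the external disks; the pool and snapshot names below are only examples. If send chokes on the datasets with permanent errors, a plain file-level copy of the important data is the fallback.)
Code:
zfs snapshot -r fileserver@migrate
zfs send -R fileserver@migrate | zfs receive -F externalpool/fileserver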
 

cndk

Cadet
Joined
Jun 1, 2020
Messages
7
If possible, RMA those Toshibas and get some Seagate IronWolf Pro or Exos drives instead.
 