8 drives faulted at the same time?

T4ke · May 15, 2023

Hi there,
I'm on TrueNAS Scale 22.12.2, running on ESXi 7.0 U3.
For about two weeks I have been getting read, write and especially checksum errors on the pool all the time, and their number keeps growing.
At the beginning there were only 20-50 checksum errors; now there are 200k and more within a very short time.
I have already changed the controller twice, checked and replaced all cables, re-seated all drives, changed drive bays and even migrated the system via vMotion to another ESXi host, because I did not want to exclude a damaged backplane either. Unfortunately, the errors are also present on the second ESXi host.
I really have no clue what could cause this many errors on practically new drives; they were purchased in January this year. The controllers also seem to be fine.
The drives are 8 x ST12000NM002G, configured in RAID-Z2. The SMART data doesn't show anything suspicious, disks seem to be fine.

Specs ESXi host 01:
VMware ESXi, 7.0.3, 21686933
Supermicro X12SPL-F
Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz
192GB DDR4 ECC RAM

Specs ESXi host 02:
VMware ESXi, 7.0.3, 21686933
Supermicro X11SRA-RF
Intel(R) Xeon(R) W-2150B CPU @ 3.00GHz
64GB DDR4 ECC RAM

HBAs tested:
LSI 9300-16i (original HBA)
2 HPE Smart Array H240 (backup HBAs)
All flashed / configured in IT mode on latest firmware.
The HBAs are in PCI passthrough mode to the TrueNAS VM.
I also still have an old Adaptec 71605 lying around but haven't tested it yet.

'zpool status tank01 -v' currently shows the following:

Code:

  pool: tank01
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 49.7G in 00:12:14 with 2045 errors on Mon May 15 15:18:36 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        tank01                                    DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            c751ab6b-6fe5-49e2-a637-c373611c4c77  DEGRADED     0     0 5.89K  too many errors
            9d5db49f-15b6-44e2-8d50-3768cd32304b  DEGRADED     0     0 5.90K  too many errors
            8fa742b0-bd8f-4423-a1d7-36e3a6d0781e  DEGRADED     0     0 5.90K  too many errors
            b490113d-cc59-4ccd-ab8e-8ca953f9820f  DEGRADED     0     0 5.75K  too many errors
            eed08a51-f1f3-4aa9-895a-9fef0aad0ecf  DEGRADED     0     0 5.58K  too many errors
            6a58ae25-6998-4ea1-af30-83e0f1546f66  DEGRADED     0     0 5.41K  too many errors
            7e31423d-2a9f-412c-a1e2-954a85708974  DEGRADED     0     0 5.72K  too many errors
            62e33e68-1443-406f-becc-60e486da5889  DEGRADED     0     0 5.75K  too many errors

errors: Permanent errors have been detected in the following files:

        tank01/Backup:<0x0>
        tank01/Backup:<0x1903>
        tank01/Backup:<0x1905>
        tank01/Backup:<0x1719>
        tank01/Backup:<0x1830>
        tank01/Backup:<0x1940>
        tank01/Backup:<0x1943>
        tank01/Backup:<0x1945>
        /mnt/tank01/Backup/wordpress_backups/file1.zip
        /mnt/tank01/Backup/wordpress_backups/file2.zip
        /mnt/tank01/Backup/wordpress_backups/file3.zip
        /mnt/tank01/Backup/wordpress_backups/file4.zip
        tank01/Backup:<0x1755>
        tank01/Backup:<0x195a>
        tank01/Backup:<0x18a5>
        tank01/Backup:<0x19ad>
        tank01/Backup:<0x17ae>
        tank01/Backup:<0x17af>
        tank01/Backup:<0x17b0>
        /mnt/tank01/Backup/wordpress_backups/file5.zip

I just can't imagine 8 practically new hard drives going belly up at the same time.
Does anyone have a hint that could get me going in the right direction?
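
One way to read the table above is to check how evenly the checksum errors are spread across the raidz2 members. The following is a minimal Python sketch, not part of the original post. The helper names `cksum_by_device` and `spread` are hypothetical, the sample rows are copied from the output above, and the 1024-based suffix multiplier is an assumption (it cancels out in the ratio anyway).

```python
import re

# Matches vdev rows such as:
#   "<partuuid>  DEGRADED     0     0 5.89K  too many errors"
ROW = re.compile(
    r"^\s*(?P<name>[0-9a-f]{8}-[0-9a-f-]{27})\s+\S+\s+\S+\s+\S+\s+"
    r"(?P<cksum>[\d.]+)(?P<unit>[KMGT]?)",
    re.MULTILINE,
)

# Assumption: abbreviated counters are 1024-based. The printed values are
# rounded, so the expanded numbers are approximations either way.
UNIT = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def cksum_by_device(status_text):
    """Return {device name: approximate CKSUM count} for every matched row."""
    return {
        m["name"]: float(m["cksum"]) * UNIT[m["unit"]]
        for m in ROW.finditer(status_text)
    }


def spread(counts):
    """Return the max/min ratio; a value close to 1.0 means an even spread."""
    values = list(counts.values())
    lowest = min(values)
    return float("inf") if lowest == 0 else max(values) / lowest


SAMPLE = """
          raidz2-0                                DEGRADED     0     0     0
            c751ab6b-6fe5-49e2-a637-c373611c4c77  DEGRADED     0     0 5.89K  too many errors
            9d5db49f-15b6-44e2-8d50-3768cd32304b  DEGRADED     0     0 5.90K  too many errors
            8fa742b0-bd8f-4423-a1d7-36e3a6d0781e  DEGRADED     0     0 5.90K  too many errors
            b490113d-cc59-4ccd-ab8e-8ca953f9820f  DEGRADED     0     0 5.75K  too many errors
            eed08a51-f1f3-4aa9-895a-9fef0aad0ecf  DEGRADED     0     0 5.58K  too many errors
            6a58ae25-6998-4ea1-af30-83e0f1546f66  DEGRADED     0     0 5.41K  too many errors
            7e31423d-2a9f-412c-a1e2-954a85708974  DEGRADED     0     0 5.72K  too many errors
            62e33e68-1443-406f-becc-60e486da5889  DEGRADED     0     0 5.75K  too many errors
"""

counts = cksum_by_device(SAMPLE)
print(f"{len(counts)} devices, max/min CKSUM ratio: {spread(counts):.2f}")
```

A ratio close to 1 across all eight members is at least consistent with something the drives share rather than eight independent drive failures, although it does not identify what that shared component is.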

sretalla · May 15, 2023

T4ke said:
I also still have an old Adaptec 71605 lying around but haven't tested it yet.

Don't do that, it won't be helpful in the long-term (even if it does work initially).

T4ke said:
even migrated the system via vMotion to another ESXi host,

I don't think you can do that with a PCI passthrough in place... what's happening to allow that?

T4ke · May 15, 2023

sretalla said:
Don't do that, it won't be helpful in the long-term (even if it does work initially).

I don't think you can do that with a PCI passthrough in place... what's happening to allow that?

My bad, you are right. I wasn't 'vMotion'-ing, I shut down the VM, removed the PCI passthrough and cold migrated the VM to the other host. There I re-added the HBA in passthrough. Thanks for correcting me.

HoneyBadger · May 15, 2023

I recall there was an issue with IronWolf drives and dropouts several years ago that was related to write caches and command queuing. Drives would experience command timeouts under load. Disabling write cache and NCQ prevented these failures, but at a significant cost to performance.

Seagate IronWolf 10TB (ST10000VN0004) vs LSI IT firmware controllers

I've been looking for the right place to post this and I believe this is it. To be brief, I've been working on a YouTube series where I'm building a 100TB ZFS based server. Not all Enterprise grade hardware (it's for at home) and not using FreeBSD/FreeNAS (I have some different requirements)...

www.truenas.com

According to the users there, it was ultimately resolved by a firmware update being released from Seagate.

Can you check to see if there's a firmware update available for your drive(s)? Seagate requires a serial number to search.

Exos X16 | Seagate Canada

T4ke · May 15, 2023

Hey mate, thanks for the advice. Unfortunately there are no firmware updates available for my drives.
Right now I'm scrubbing the pool on the second ESXi host to see where it goes. The drives are attached to one of the HPE H240s. Checksum errors are still popping up, but at least there are no more read or write errors.
Any advice is welcome.

Code:

  pool: tank01
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon May 15 16:10:00 2023
        21.6T scanned at 1.42G/s, 20.5T issued at 1.35G/s, 30.6T total
        16K repaired, 66.87% done, 02:08:36 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        tank01                                    DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            c751ab6b-6fe5-49e2-a637-c373611c4c77  DEGRADED     0     0 8.07K  too many errors
            9d5db49f-15b6-44e2-8d50-3768cd32304b  DEGRADED     0     0 8.08K  too many errors
            8fa742b0-bd8f-4423-a1d7-36e3a6d0781e  DEGRADED     0     0 8.08K  too many errors
            b490113d-cc59-4ccd-ab8e-8ca953f9820f  DEGRADED     0     0 7.90K  too many errors
            eed08a51-f1f3-4aa9-895a-9fef0aad0ecf  DEGRADED     0     0 7.44K  too many errors
            6a58ae25-6998-4ea1-af30-83e0f1546f66  DEGRADED     0     0 7.10K  too many errors
            7e31423d-2a9f-412c-a1e2-954a85708974  DEGRADED     0     0 7.72K  too many errors
            62e33e68-1443-406f-becc-60e486da5889  DEGRADED     0     0 7.92K  too many errors
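
For comparison with the earlier output, the per-device change in the CKSUM column can be worked out directly. This is a rough Python sketch with the values copied by hand from the two outputs in this thread (in the printed "K" units, keyed by the first block of each partition UUID). It assumes the counters were not cleared between the two outputs, which the thread does not state; the function name `growth` is hypothetical.

```python
# CKSUM values in "K" as printed by zpool status, copied from the thread.
before = {
    "c751ab6b": 5.89, "9d5db49f": 5.90, "8fa742b0": 5.90, "b490113d": 5.75,
    "eed08a51": 5.58, "6a58ae25": 5.41, "7e31423d": 5.72, "62e33e68": 5.75,
}
after = {
    "c751ab6b": 8.07, "9d5db49f": 8.08, "8fa742b0": 8.08, "b490113d": 7.90,
    "eed08a51": 7.44, "6a58ae25": 7.10, "7e31423d": 7.72, "62e33e68": 7.92,
}


def growth(old, new):
    """Return the per-device increase, rounded to the precision of the input."""
    return {dev: round(new[dev] - old[dev], 2) for dev in old}


for dev, delta in growth(before, after).items():
    print(f"{dev}: +{delta:.2f}K")
```

Under that assumption, every member gains errors during the scrub and none stays flat, so the errors keep arriving on all eight drives rather than on a subset.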

blanchet · May 17, 2023

Try booting your virtualized TrueNAS VM with UEFI firmware instead of BIOS:

https://blogs.vmware.com/apps/2018/09/using-gpus-with-virtual-machines-on-vsphere-part-2-vmdirectpath-i-o.html

WI_Hedgehog · May 17, 2023

T4ke said:
The SMART data doesn't show anything suspicious, disks seem to be fine.
I just can't imagine 8 practically new hard drives going belly up at the same time.
Does anyone have a hint that could get me going in the right direction?

@T4ke: Did you run smartctl --xall on the bare-metal machines?
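
To collect that output for all eight drives in one go, something along these lines could be used. It is only a sketch: the device path `/dev/sda` and the helper names `smartctl_argv` and `read_smart` are placeholders, the `--json` option assumes smartmontools 7.0 or newer, and smartctl normally needs root privileges.

```python
import json
import shutil
import subprocess


def smartctl_argv(device):
    """Build the smartctl command line for one device."""
    # --xall (short form: -x) prints all SMART and non-SMART device information.
    # --json asks for machine-readable output (smartmontools 7.0+).
    return ["smartctl", "--xall", "--json", device]


def read_smart(device):
    """Run smartctl for one device and return the parsed JSON, or None."""
    if shutil.which("smartctl") is None:
        return None  # smartmontools is not installed on this machine
    # smartctl's exit status is a bit mask, so a non-zero code is not treated
    # as a failure here; the JSON body is still parsed when present.
    proc = subprocess.run(smartctl_argv(device), capture_output=True, text=True)
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    # Placeholder device list; replace with the actual drives.
    for dev in ["/dev/sda"]:
        report = read_smart(dev)
        print(dev, "no data" if report is None else "report collected")
```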
