HGST SAS Pool Unhealthy

xeroiv

Cadet
Joined
May 18, 2018
Messages
9
I have a pool of mirrored vdevs consisting of HGST SAS drives, and the pool status keeps coming back unhealthy. I have scrubbed the pool a few times, but each scrub turns up new errors after I have removed the offending files from the previous one. Since the drives are SAS and not SATA, I don't get a ton of useful information from smartctl. I did run an extended test on all the drives, but none of them reported any elements in the defect list. I am lost on what steps to take next. Here is the output from my latest scrub.

Code:
root@truenas[~]# zpool status -v               
  pool: FreeNAS-ZFS
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1.31G in 04:48:42 with 94 errors on Mon Dec  5 19:07:35 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        FreeNAS-ZFS                               ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            b9ffac7d-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.76K
            bae98b8f-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.74K
          mirror-1                                ONLINE       0     0     0
            ba5b159c-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.85K
            bab4f9d5-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.73K
          mirror-2                                ONLINE       0     0     0
            ba74c31d-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.87K
            bade0ac1-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.75K
          mirror-3                                ONLINE       0     0     0
            b91a957c-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.78K
            bac6e69a-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.82K
          mirror-4                                ONLINE       0     0     0
            b998d99a-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.74K
            baf2fb74-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 1.86K
          mirror-5                                ONLINE       0     0     0
            b6f44c9c-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0    2K
            b725f5c7-3815-11ea-9474-616a02a5cdb8  ONLINE       0     0 2.00K
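
For reference, this is roughly what I ran against each drive; sdX is just a placeholder for the actual device:

Code:
# Kick off the extended (long) self-test on a SAS drive
smartctl -t long /dev/sdX

# Afterwards, check the defect list, self-test results, and error counter log pages
smartctl -x /dev/sdX | grep -iE 'defect|self-test|error counter|non-medium'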
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Please follow the forum rules, listed in the red link at the top of this page. Specifically, your hardware devices and configuration, followed by the software version, would be helpful.
 

xeroiv

Cadet
Joined
May 18, 2018
Messages
9
I am running TrueNAS SCALE as a KVM guest in Proxmox 7.2. The underlying hardware is an HP DL360 G8. I have 12 drives connected to an LSI 9211, which has been passed through to the guest. Here are the specs from the TrueNAS GUI:

OS Version: TrueNAS-SCALE-22.02.4
Product: Standard PC (i440FX + PIIX, 1996)
Model: Common KVM processor
Memory: 16 GiB

I recently imported this pool into the TrueNAS SCALE VM and am trying to rule out an incompatibility with SCALE versus a coincidental hardware issue, since the pool was previously running without problems on a FreeNAS 11.3 VM.
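
In case it helps rule out the passthrough itself, something along these lines should confirm the HBA is visible inside the guest and which driver has bound to it (the 9211-8i is an SAS2008-based card, so it should show up under the mpt2sas/mpt3sas driver):

Code:
# Confirm the passed-through HBA is visible in the guest and which kernel driver claimed it
lspci -nnk | grep -iA3 sas

# Check the kernel log for the driver picking up the card
dmesg | grep -i mpt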
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
CKSUM (checksum) errors often result from an unreliable connection rather than media failure.

For SATA devices I would first point at the cable - but given that these are SAS devices, and they are all experiencing roughly the same number of CKSUM errors, I assume they are on a backplane - so my thoughts are with the connection from the HBA to the backplane. Can you check and reseat the cables running between the HBA and the backplane? Look for any signs of corrosion or damage to the pins that might cause intermittent connectivity.
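
If you want hard numbers on link quality before and after reseating, the SAS phy error counters are worth a look; something like this (device name is just an example) will print them:

Code:
# SAS phy event counters - rising Invalid DWORD / Loss of DWORD sync counts
# point at a cabling or backplane problem rather than the drive media
smartctl -l sasphy /dev/sdX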

Other potential sources of this problem include an overheating HBA or expander chip - do you have sufficient airflow in your chassis, especially around the PCIe slots?

As mentioned by @Arwen, please provide a detailed hardware configuration, paying particular attention to the motherboard, HBA, and method of drive connection. HGST SAS devices should definitely not be SMR, at least.
 

xeroiv

Cadet
Joined
May 18, 2018
Messages
9
So I checked the temperatures in HPE iLO 4 and nothing was anywhere close to what I would call hot. The drives were around 26°C, and the hottest part of the chassis was near the PCIe devices, as I would expect, peaking at 36°C. That said, the internals were extremely dusty, so I took the air compressor to it and about choked on the amount of dust that came out. I also took the opportunity to reseat the cables going from the HBA to the backplane. After putting it all back together and clearing the errors on the pool, I re-ran the scrub, and so far no checksum errors have come back. I will update here if the errors return after some reads/writes to the pool.
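
For anyone following along, clearing the error counters and starting the new scrub is just the usual pair of commands:

Code:
# Reset the error counters on the pool, then start a fresh scrub
zpool clear FreeNAS-ZFS
zpool scrub FreeNAS-ZFS

# Watch progress and any new CKSUM errors
zpool status -v FreeNAS-ZFS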

On the topic of the hardware config, the server is an almost entirely stock HPE DL380e G8 except for the LSI 9211-8i HBA that I have installed. The 12 drives are connected to the server's integrated backplane.
 