snackmasterx
Cadet | Joined: May 12, 2018 | Messages: 3
Hello,
I am getting intermittent failures on one of my zpools, and need some help narrowing down the cause of the failure.
Currently running TrueNAS-SCALE-22.12.2 - zpool HDD was created on TrueNAS-SCALE-22.12.1
Motherboard: Supermicro X10SRH-CLN4F - https://www.supermicro.com/en/products/motherboard/X10SRH-CLN4F / https://www.supermicro.com/en/products/system/2U/5028/SSG-5028R-E1CR12L.cfm
CPU: Intel Xeon CPU E5-2620 v3
RAM: 256GB
Hard drives:
Code:
1x KDM-SA.71-016GMJ - SATADOM boot drive
8x WDC WD4002FFWX-68TZ4N0 in RAID Z2 - Healthy zpool, connected via onboard chassis hot-swap sleds
12x TOSHIBA MG07ACA14TE in RAID Z2 - Unhealthy zpool, connected via NetApp DS4246 shelf
Hard disk controller:
Code:
root@atlas[~]# sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
        Adapter Selected is a Avago SAS: SAS3008(C0)
        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 5003048-0-1cb4-a300
        NVDATA Version (Default)       : 0e.01.00.07
        NVDATA Version (Persistent)    : 0e.01.00.07
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 16.00.10.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9300-8i
        BIOS Version                   : 08.37.00.00
        UEFI BSD Version               : 18.00.00.00
        FCODE Version                  : N/A
        Board Name                     : SAS9300-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A
        Finished Processing Commands Successfully.
        Exiting SAS3Flash.
root@atlas[~]#
Network cards: Supermicro AOC-STGN-I2S
In the UI, I noticed some drives reported as REMOVED under 'Manage Devices' for zpool HDD. I ran a short SMART test against those drives, and they all passed. Whenever I run 'zpool clear HDD', drives that previously reported as REMOVED will sometimes show as ONLINE, and ones previously marked as ONLINE may randomly show as REMOVED. In short, it's not always the same drives reporting REMOVED.
Code:
root@atlas[~]# zpool status HDD
  pool: HDD
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: resilvered 456K in 00:00:00 with 0 errors on Wed Apr 12 13:38:37 2023
config:
        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       UNAVAIL      0     0     0  insufficient replicas
          raidz2-0                                UNAVAIL      0     0     0  insufficient replicas
            e8a54291-d2ea-47c5-8bd9-a2b30dc07e9f  ONLINE       0     0     0
            fdfee4b0-ce4f-4b05-9ce7-c9fa1396faf9  ONLINE       0     0     0
            9bb2a58a-8da5-4baf-9e70-aad12ffc8f65  ONLINE       0     0     0
            dfac7d2f-85ea-43a5-b4cc-7cc777eccab6  ONLINE       0     0     0
            913befe9-d758-4b65-b4e8-f5f580a64ba2  REMOVED      0     0     0
            58b9a0a2-de7b-4ce1-87b5-e03b6ecd1897  ONLINE       0     0     0
            70e42a65-22bc-47f9-bd3a-a7c9580af1d6  ONLINE       0     0     0
            68300e12-e21f-4c33-a029-5747a0873a3c  ONLINE       0     0     0
            65c88c1e-056e-4e43-a30d-e0b4d2814d65  ONLINE       0     0     0
            439165af-c339-4da9-9ce6-7f62ff306607  REMOVED      0     0     0
            f282d399-faac-4204-b27a-35e016657eed  ONLINE       0     0     0
            9ad64a82-c7b5-42df-b5b8-b2933625c8b7  REMOVED      0     0     0
errors: List of errors unavailable: pool I/O is currently suspended
errors: 4 data errors, use '-v' for a list
root@atlas[~]#
The 'zpool status -v' failed because IO to the pool is suspended, so I had to chain my commands:
Code:
root@atlas[~]# zpool clear HDD && zpool status -v HDD
  pool: HDD
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 412K in 00:00:00 with 0 errors on Wed Apr 12 13:07:03 2023
config:
        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            e8a54291-d2ea-47c5-8bd9-a2b30dc07e9f  ONLINE       0     0     0
            fdfee4b0-ce4f-4b05-9ce7-c9fa1396faf9  ONLINE       0     0     0
            9bb2a58a-8da5-4baf-9e70-aad12ffc8f65  ONLINE       0     0     0
            dfac7d2f-85ea-43a5-b4cc-7cc777eccab6  ONLINE       0     0     0
            913befe9-d758-4b65-b4e8-f5f580a64ba2  ONLINE       0     0     0
            58b9a0a2-de7b-4ce1-87b5-e03b6ecd1897  ONLINE       0     0     0
            70e42a65-22bc-47f9-bd3a-a7c9580af1d6  ONLINE       0     0     0
            68300e12-e21f-4c33-a029-5747a0873a3c  ONLINE       0     0     0
            65c88c1e-056e-4e43-a30d-e0b4d2814d65  ONLINE       0     0     0
            439165af-c339-4da9-9ce6-7f62ff306607  ONLINE       0     0     0
            f282d399-faac-4204-b27a-35e016657eed  ONLINE       0     0     0
            9ad64a82-c7b5-42df-b5b8-b2933625c8b7  ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
        <metadata>:<0x0>
        <metadata>:<0x3d>
root@atlas[~]#
I don't believe the failure is due to an issue with the drives themselves. I've experienced this same failure previously; at the time I just blasted and rebuilt the pool, because it held no data.
I'll periodically see this appear on the console as well:
Code:
2023 Apr 12 13:29:18 atlas Device: /dev/sdaf [SAT], 168 Currently unreadable (pending) sectors
2023 Apr 12 13:29:18 atlas Device: /dev/sdaf [SAT], 21 Offline uncorrectable sectors
I ran a SMART test against this drive, with no failures:
Code:
root@atlas[~]# smartctl -a /dev/sdaf
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA14TE
Serial Number:    Z810A01LF94G
LU WWN Device Id: 5 000039 918c80421
Firmware Version: 0101
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Apr 12 14:00:29 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
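For reference, the short self-tests were started along these lines (the exact device names varied per drive, so treat this as the general shape rather than a verbatim transcript):
Code:
# kick off a short SMART self-test on the drive
smartctl -t short /dev/sdaf
# once it finishes (a couple of minutes), check the self-test log
smartctl -l selftest /dev/sdaf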
I checked dmesg for anything interesting and saw the following:
Code:
[ 967.806118] INFO: task agents:13957 blocked for more than 241 seconds.
[ 967.806777]       Tainted: P           OE     5.15.79+truenas #1
[ 967.807456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.808182] task:agents          state:D stack:    0 pid:13957 ppid:     1 flags:0x00000000
[ 967.808886] Call Trace:
[ 967.809573]  <TASK>
[ 967.810282]  __schedule+0x2f0/0x950
[ 967.810980]  schedule+0x5b/0xd0
[ 967.811676]  io_schedule+0x42/0x70
[ 967.812379]  cv_wait_common+0xaa/0x130 [spl]
[ 967.813091]  ? finish_wait+0x90/0x90
[ 967.813804]  txg_wait_synced_impl+0x92/0x110 [zfs]
[ 967.814686]  txg_wait_synced+0xc/0x40 [zfs]
[ 967.815766]  spa_vdev_state_exit+0x8a/0x170 [zfs]
[ 967.816852]  zfs_ioc_vdev_set_state+0xe2/0x1b0 [zfs]
[ 967.817966]  zfsdev_ioctl_common+0x698/0x750 [zfs]
[ 967.819076]  ? __kmalloc_node+0x3d6/0x480
[ 967.820013]  ? _copy_from_user+0x28/0x60
[ 967.820759]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 967.821676]  __x64_sys_ioctl+0x8b/0xc0
[ 967.822432]  do_syscall_64+0x3b/0xc0
[ 967.823169]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 967.823885] RIP: 0033:0x7fb2ad6386b7
[ 967.824581] RSP: 002b:00007fb2ac884308 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 967.825288] RAX: ffffffffffffffda RBX: 00007fb2ac884320 RCX: 00007fb2ad6386b7
[ 967.825998] RDX: 00007fb2ac884320 RSI: 0000000000005a0d RDI: 000000000000000d
[ 967.826715] RBP: 00007fb2ac887d10 R08: 000000000006ebf4 R09: 0000000000000000
[ 967.827424] R10: 00007fb2a0014ab0 R11: 0000000000000246 R12: 00007fb2a00215c0
[ 967.828141] R13: 00007fb2ac8878d0 R14: 0000558f339b60d0 R15: 00007fb2a0026e80
[ 967.828864]  </TASK>
Unfortunately, I don't quite understand the output I'm seeing here.
The drives in use for zpool HDD are currently shown as unassigned, despite the zpool still existing. The HDD zpool still shows under Storage, and if I select the 'Storage > Disks' view, I can see the disks associated with the HDD zpool there.
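If it helps, I can also pull a disk-to-partition mapping from the shell; something along these lines (a generic lsblk call, nothing TrueNAS-specific, and I believe the PARTUUID column is what lines up with the ids in the zpool status output above):
Code:
# show each disk's size, serial, and partition UUIDs
lsblk -o NAME,SIZE,SERIAL,PARTUUID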
Some possibly important notes:
The data on zpool HDD is not important, data recovery is not required.
The HDD zpool is connected via an external NetApp DS4246 shelf, but it uses the same onboard HBA as my healthy zpool, which has never shown errors.
I've already tried swapping the cable to the DS4246, with seemingly no change in behavior.
My other, healthy zpool was created back in the FreeNAS days; I have yet to update the ZFS version on that pool.
I've tried capturing the zpool version for my healthy pool, but 'zpool get version' does not report a version (see the snippet below).
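For completeness, this is all I ran for that version check; as I understand it (and I may be wrong here), pools using feature flags simply report '-' for the version property. The pool name below is just a placeholder for my healthy pool's actual name:
Code:
# 'version' comes back as '-' on feature-flag pools
zpool get version <healthy-pool-name>
# lists the legacy versions and feature flags this ZFS build supports
zpool upgrade -v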
Any help with this is greatly appreciated.