snackmasterx (Cadet) - Joined: May 12, 2018 - Messages: 3
Hello,
I am getting intermittent failures on one of my zpools and need some help narrowing down the cause.
Currently running TrueNAS-SCALE-22.12.2 - zpool HDD was created on TrueNAS-SCALE-22.12.1
Motherboard: Supermicro X10SRH-CLN4F - https://www.supermicro.com/en/products/motherboard/X10SRH-CLN4F / https://www.supermicro.com/en/products/system/2U/5028/SSG-5028R-E1CR12L.cfm
CPU: Intel Xeon CPU E5-2620 v3
RAM: 256GB
Hard drives:
Code:
1x KDM-SA.71-016GMJ - SATADOM boot drive
8x WDC WD4002FFWX-68TZ4N0 in RAID Z2 - Healthy zpool, connected via onboard chassis hot-swap sleds
12x TOSHIBA MG07ACA14TE in RAID Z2 - Unhealthy zpool, connected via NetApp DS4246 shelf
Hard disk controller:
Code:
root@atlas[~]# sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 5003048-0-1cb4-a300
        NVDATA Version (Default)       : 0e.01.00.07
        NVDATA Version (Persistent)    : 0e.01.00.07
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 16.00.10.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9300-8i
        BIOS Version                   : 08.37.00.00
        UEFI BSD Version               : 18.00.00.00
        FCODE Version                  : N/A
        Board Name                     : SAS9300-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.
root@atlas[~]#
Network cards - Supermicro AOC-STGN-I2S
In the UI, I noticed some drives reported as REMOVED under 'Manage Devices' for zpool HDD. I ran a short SMART test against those drives, and they all passed. Whenever I run 'zpool clear HDD', drives that previously reported as REMOVED will sometimes come back ONLINE, and drives previously marked ONLINE may then show as REMOVED. In short, it's not always the same drives reporting REMOVED.
Code:
root@atlas[~]# zpool status HDD
  pool: HDD
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: resilvered 456K in 00:00:00 with 0 errors on Wed Apr 12 13:38:37 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       UNAVAIL      0     0     0  insufficient replicas
          raidz2-0                                UNAVAIL      0     0     0  insufficient replicas
            e8a54291-d2ea-47c5-8bd9-a2b30dc07e9f  ONLINE       0     0     0
            fdfee4b0-ce4f-4b05-9ce7-c9fa1396faf9  ONLINE       0     0     0
            9bb2a58a-8da5-4baf-9e70-aad12ffc8f65  ONLINE       0     0     0
            dfac7d2f-85ea-43a5-b4cc-7cc777eccab6  ONLINE       0     0     0
            913befe9-d758-4b65-b4e8-f5f580a64ba2  REMOVED      0     0     0
            58b9a0a2-de7b-4ce1-87b5-e03b6ecd1897  ONLINE       0     0     0
            70e42a65-22bc-47f9-bd3a-a7c9580af1d6  ONLINE       0     0     0
            68300e12-e21f-4c33-a029-5747a0873a3c  ONLINE       0     0     0
            65c88c1e-056e-4e43-a30d-e0b4d2814d65  ONLINE       0     0     0
            439165af-c339-4da9-9ce6-7f62ff306607  REMOVED      0     0     0
            f282d399-faac-4204-b27a-35e016657eed  ONLINE       0     0     0
            9ad64a82-c7b5-42df-b5b8-b2933625c8b7  REMOVED      0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
errors: 4 data errors, use '-v' for a list
root@atlas[~]#
Running 'zpool status -v' on its own failed because I/O to the pool was suspended, so I had to chain my commands:
Code:
root@atlas[~]# zpool clear HDD && zpool status -v HDD
  pool: HDD
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 412K in 00:00:00 with 0 errors on Wed Apr 12 13:07:03 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            e8a54291-d2ea-47c5-8bd9-a2b30dc07e9f  ONLINE       0     0     0
            fdfee4b0-ce4f-4b05-9ce7-c9fa1396faf9  ONLINE       0     0     0
            9bb2a58a-8da5-4baf-9e70-aad12ffc8f65  ONLINE       0     0     0
            dfac7d2f-85ea-43a5-b4cc-7cc777eccab6  ONLINE       0     0     0
            913befe9-d758-4b65-b4e8-f5f580a64ba2  ONLINE       0     0     0
            58b9a0a2-de7b-4ce1-87b5-e03b6ecd1897  ONLINE       0     0     0
            70e42a65-22bc-47f9-bd3a-a7c9580af1d6  ONLINE       0     0     0
            68300e12-e21f-4c33-a029-5747a0873a3c  ONLINE       0     0     0
            65c88c1e-056e-4e43-a30d-e0b4d2814d65  ONLINE       0     0     0
            439165af-c339-4da9-9ce6-7f62ff306607  ONLINE       0     0     0
            f282d399-faac-4204-b27a-35e016657eed  ONLINE       0     0     0
            9ad64a82-c7b5-42df-b5b8-b2933625c8b7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x3d>
root@atlas[~]#
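In case it helps with correlating events, below is a minimal sketch of how the flapping could be logged over time and matched against dmesg/smartd timestamps; the log path and interval are just placeholders, not something I'm actually running.
Code:
# minimal sketch: snapshot the pool state once a minute so the REMOVED
# flapping can be matched against dmesg/smartd timestamps later
while true; do
    date '+%Y-%m-%d %H:%M:%S' >> /root/hdd-flap.log
    zpool status HDD | grep -E 'REMOVED|FAULTED|UNAVAIL' >> /root/hdd-flap.log
    sleep 60
done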
I don't believe the failure is due to an issue with the drives themselves. I've experienced this same failure previously; at the time I just blasted away and rebuilt the pool, since it held no data yet.
I'll periodically see this appear on the console as well:
Code:
2023 Apr 12 13:29:18 atlas Device: /dev/sdaf [SAT], 168 Currently unreadable (pending) sectors
2023 Apr 12 13:29:18 atlas Device: /dev/sdaf [SAT], 21 Offline uncorrectable sectors
I ran a SMART test against this drive; no failures:
Code:
root@atlas[~]# smartctl -a /dev/sdaf
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA14TE
Serial Number:    Z810A01LF94G
LU WWN Device Id: 5 000039 918c80421
Firmware Version: 0101
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Apr 12 14:00:29 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
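The paste above stops at the overall health assessment. For reference, the pending/uncorrectable counters smartd is flagging (attributes 197 and 198) and the self-test history can be pulled with the standard smartctl options below; output omitted here.
Code:
# SMART attribute table - 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable)
# are the counters smartd is logging about
smartctl -A /dev/sdaf

# self-test history, including the short test that passed
smartctl -l selftest /dev/sdaf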
I checked dmesg for anything interesting and saw the following:
Code:
[ 967.806118] INFO: task agents:13957 blocked for more than 241 seconds.
[ 967.806777]       Tainted: P OE 5.15.79+truenas #1
[ 967.807456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.808182] task:agents state:D stack: 0 pid:13957 ppid: 1 flags:0x00000000
[ 967.808886] Call Trace:
[ 967.809573]  <TASK>
[ 967.810282]  __schedule+0x2f0/0x950
[ 967.810980]  schedule+0x5b/0xd0
[ 967.811676]  io_schedule+0x42/0x70
[ 967.812379]  cv_wait_common+0xaa/0x130 [spl]
[ 967.813091]  ? finish_wait+0x90/0x90
[ 967.813804]  txg_wait_synced_impl+0x92/0x110 [zfs]
[ 967.814686]  txg_wait_synced+0xc/0x40 [zfs]
[ 967.815766]  spa_vdev_state_exit+0x8a/0x170 [zfs]
[ 967.816852]  zfs_ioc_vdev_set_state+0xe2/0x1b0 [zfs]
[ 967.817966]  zfsdev_ioctl_common+0x698/0x750 [zfs]
[ 967.819076]  ? __kmalloc_node+0x3d6/0x480
[ 967.820013]  ? _copy_from_user+0x28/0x60
[ 967.820759]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 967.821676]  __x64_sys_ioctl+0x8b/0xc0
[ 967.822432]  do_syscall_64+0x3b/0xc0
[ 967.823169]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 967.823885] RIP: 0033:0x7fb2ad6386b7
[ 967.824581] RSP: 002b:00007fb2ac884308 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 967.825288] RAX: ffffffffffffffda RBX: 00007fb2ac884320 RCX: 00007fb2ad6386b7
[ 967.825998] RDX: 00007fb2ac884320 RSI: 0000000000005a0d RDI: 000000000000000d
[ 967.826715] RBP: 00007fb2ac887d10 R08: 000000000006ebf4 R09: 0000000000000000
[ 967.827424] R10: 00007fb2a0014ab0 R11: 0000000000000246 R12: 00007fb2a00215c0
[ 967.828141] R13: 00007fb2ac8878d0 R14: 0000558f339b60d0 R15: 00007fb2a0026e80
[ 967.828864]  </TASK>
Unfortunately, I don't quite understand the output I'm seeing here.
The drives in use for zpool HDD are currently shown as unassigned, despite the pool still existing. Zpool HDD still shows under Storage, and in the 'Storage > Disks' view I can see the disks associated with the HDD zpool.
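For reference, the vdev names in the zpool status output are partition UUIDs; they can be mapped back to /dev/sdX names and serial numbers with standard tools, which makes it easier to tell which physical disk a REMOVED entry corresponds to. A quick sketch:
Code:
# map the partition UUIDs shown by 'zpool status' back to device names
ls -l /dev/disk/by-partuuid/

# or list serials alongside, to match disks against shelf slots
lsblk -o NAME,SIZE,SERIAL,PARTUUID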
Some possibly important notes:
The data on zpool HDD is not important; data recovery is not required.
The HDD zpool is connected via an external NetApp DS4246 shelf, but it uses the same onboard HBA as my healthy zpool, which has never shown errors.
I've already tried swapping the cable to the DS4246, with seemingly no change in behavior.
My other, healthy zpool was created back in the FreeNAS days; I have yet to update the ZFS version on that pool.
I've tried capturing the zpool version for my healthy pool, but the 'zpool get version' command does not report a version.
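On that last point, my understanding is that pools created with feature flags report '-' for the legacy version property, and upgrade state is tracked per feature instead. The commands below are standard OpenZFS; the pool name is a placeholder.
Code:
# legacy version property; feature-flag pools are expected to show '-'
zpool get version <poolname>

# list pools that have supported feature flags not yet enabled
zpool upgrade

# per-feature enabled/active/disabled state for one pool
zpool get all <poolname> | grep feature@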
Any help with this is greatly appreciated.