snackmasterx (Cadet) - Joined: May 12, 2018 - Messages: 3
Hello,
I am getting intermittent failures on one of my zpools and need some help narrowing down the cause.
Currently running TrueNAS-SCALE-22.12.2 - zpool HDD was created on TrueNAS-SCALE-22.12.1
Motherboard: Supermicro X10SRH-CLN4F - https://www.supermicro.com/en/products/motherboard/X10SRH-CLN4F / https://www.supermicro.com/en/products/system/2U/5028/SSG-5028R-E1CR12L.cfm
CPU: Intel Xeon CPU E5-2620 v3
RAM: 256GB
Hard drives:
Code:
1x KDM-SA.71-016GMJ - SATADOM boot drive
8x WDC WD4002FFWX-68TZ4N0 in RAID Z2 - Healthy zpool, connected via onboard chassis hot-swap sleds
12x TOSHIBA MG07ACA14TE in RAID Z2 - Unhealthy zpool, connected via NetApp DS4246 shelf
Hard disk controller:
Code:
root@atlas[~]# sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 5003048-0-1cb4-a300
        NVDATA Version (Default)       : 0e.01.00.07
        NVDATA Version (Persistent)    : 0e.01.00.07
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 16.00.10.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9300-8i
        BIOS Version                   : 08.37.00.00
        UEFI BSD Version               : 18.00.00.00
        FCODE Version                  : N/A
        Board Name                     : SAS9300-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.
root@atlas[~]#
Network cards - Supermicro AOC-STGN-I2S
In the UI, I noticed some drives reported as REMOVED under 'Manage Devices' for zpool HDD. I ran a short SMART test against those drives, and they all passed. Whenever I run 'zpool clear HDD', drives that previously reported as REMOVED will sometimes come back ONLINE, and drives previously marked ONLINE may then show as REMOVED. In short, it's not always the same drives reporting REMOVED.
Code:
root@atlas[~]# zpool status HDD
  pool: HDD
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: resilvered 456K in 00:00:00 with 0 errors on Wed Apr 12 13:38:37 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       UNAVAIL      0     0     0  insufficient replicas
          raidz2-0                                UNAVAIL      0     0     0  insufficient replicas
            e8a54291-d2ea-47c5-8bd9-a2b30dc07e9f  ONLINE       0     0     0
            fdfee4b0-ce4f-4b05-9ce7-c9fa1396faf9  ONLINE       0     0     0
            9bb2a58a-8da5-4baf-9e70-aad12ffc8f65  ONLINE       0     0     0
            dfac7d2f-85ea-43a5-b4cc-7cc777eccab6  ONLINE       0     0     0
            913befe9-d758-4b65-b4e8-f5f580a64ba2  REMOVED      0     0     0
            58b9a0a2-de7b-4ce1-87b5-e03b6ecd1897  ONLINE       0     0     0
            70e42a65-22bc-47f9-bd3a-a7c9580af1d6  ONLINE       0     0     0
            68300e12-e21f-4c33-a029-5747a0873a3c  ONLINE       0     0     0
            65c88c1e-056e-4e43-a30d-e0b4d2814d65  ONLINE       0     0     0
            439165af-c339-4da9-9ce6-7f62ff306607  REMOVED      0     0     0
            f282d399-faac-4204-b27a-35e016657eed  ONLINE       0     0     0
            9ad64a82-c7b5-42df-b5b8-b2933625c8b7  REMOVED      0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
errors: 4 data errors, use '-v' for a list
root@atlas[~]#
Running 'zpool status -v' on its own failed because I/O to the pool was suspended, so I had to chain my commands:
Code:
root@atlas[~]# zpool clear HDD && zpool status -v HDD
  pool: HDD
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 412K in 00:00:00 with 0 errors on Wed Apr 12 13:07:03 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            e8a54291-d2ea-47c5-8bd9-a2b30dc07e9f  ONLINE       0     0     0
            fdfee4b0-ce4f-4b05-9ce7-c9fa1396faf9  ONLINE       0     0     0
            9bb2a58a-8da5-4baf-9e70-aad12ffc8f65  ONLINE       0     0     0
            dfac7d2f-85ea-43a5-b4cc-7cc777eccab6  ONLINE       0     0     0
            913befe9-d758-4b65-b4e8-f5f580a64ba2  ONLINE       0     0     0
            58b9a0a2-de7b-4ce1-87b5-e03b6ecd1897  ONLINE       0     0     0
            70e42a65-22bc-47f9-bd3a-a7c9580af1d6  ONLINE       0     0     0
            68300e12-e21f-4c33-a029-5747a0873a3c  ONLINE       0     0     0
            65c88c1e-056e-4e43-a30d-e0b4d2814d65  ONLINE       0     0     0
            439165af-c339-4da9-9ce6-7f62ff306607  ONLINE       0     0     0
            f282d399-faac-4204-b27a-35e016657eed  ONLINE       0     0     0
            9ad64a82-c7b5-42df-b5b8-b2933625c8b7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x3d>
root@atlas[~]#
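In case it helps with correlating events, below is a minimal sketch of how the flapping could be logged over time and matched against dmesg/smartd timestamps; the log path and interval are just placeholders, not something I'm actually running.
Code:
# minimal sketch: snapshot the pool state once a minute so the REMOVED
# flapping can be matched against dmesg/smartd timestamps later
while true; do
    date '+%Y-%m-%d %H:%M:%S' >> /root/hdd-flap.log
    zpool status HDD | grep -E 'REMOVED|FAULTED|UNAVAIL' >> /root/hdd-flap.log
    sleep 60
done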
I don't believe the failure is due to an issue with the drives themselves. I've experienced this same failure previously; at the time I just blasted away and rebuilt the pool, since it held no data yet.
I'll periodically see this appear on the console as well:
Code:
2023 Apr 12 13:29:18 atlas Device: /dev/sdaf [SAT], 168 Currently unreadable (pending) sectors
2023 Apr 12 13:29:18 atlas Device: /dev/sdaf [SAT], 21 Offline uncorrectable sectors
I ran a SMART test against this drive; no failures:
Code:
root@atlas[~]# smartctl -a /dev/sdaf
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA14TE
Serial Number:    Z810A01LF94G
LU WWN Device Id: 5 000039 918c80421
Firmware Version: 0101
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Apr 12 14:00:29 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
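The paste above stops at the overall health assessment. For reference, the pending/uncorrectable counters smartd is flagging (attributes 197 and 198) and the self-test history can be pulled with the standard smartctl options below; output omitted here.
Code:
# SMART attribute table - 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable)
# are the counters smartd is logging about
smartctl -A /dev/sdaf

# self-test history, including the short test that passed
smartctl -l selftest /dev/sdaf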
I checked dmesg for anything interesting and saw the following:
Code:
[ 967.806118] INFO: task agents:13957 blocked for more than 241 seconds.
[ 967.806777]       Tainted: P OE 5.15.79+truenas #1
[ 967.807456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.808182] task:agents state:D stack: 0 pid:13957 ppid: 1 flags:0x00000000
[ 967.808886] Call Trace:
[ 967.809573]  <TASK>
[ 967.810282]  __schedule+0x2f0/0x950
[ 967.810980]  schedule+0x5b/0xd0
[ 967.811676]  io_schedule+0x42/0x70
[ 967.812379]  cv_wait_common+0xaa/0x130 [spl]
[ 967.813091]  ? finish_wait+0x90/0x90
[ 967.813804]  txg_wait_synced_impl+0x92/0x110 [zfs]
[ 967.814686]  txg_wait_synced+0xc/0x40 [zfs]
[ 967.815766]  spa_vdev_state_exit+0x8a/0x170 [zfs]
[ 967.816852]  zfs_ioc_vdev_set_state+0xe2/0x1b0 [zfs]
[ 967.817966]  zfsdev_ioctl_common+0x698/0x750 [zfs]
[ 967.819076]  ? __kmalloc_node+0x3d6/0x480
[ 967.820013]  ? _copy_from_user+0x28/0x60
[ 967.820759]  zfsdev_ioctl+0x53/0xe0 [zfs]
[ 967.821676]  __x64_sys_ioctl+0x8b/0xc0
[ 967.822432]  do_syscall_64+0x3b/0xc0
[ 967.823169]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 967.823885] RIP: 0033:0x7fb2ad6386b7
[ 967.824581] RSP: 002b:00007fb2ac884308 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 967.825288] RAX: ffffffffffffffda RBX: 00007fb2ac884320 RCX: 00007fb2ad6386b7
[ 967.825998] RDX: 00007fb2ac884320 RSI: 0000000000005a0d RDI: 000000000000000d
[ 967.826715] RBP: 00007fb2ac887d10 R08: 000000000006ebf4 R09: 0000000000000000
[ 967.827424] R10: 00007fb2a0014ab0 R11: 0000000000000246 R12: 00007fb2a00215c0
[ 967.828141] R13: 00007fb2ac8878d0 R14: 0000558f339b60d0 R15: 00007fb2a0026e80
[ 967.828864]  </TASK>
Unfortunately, I don't quite understand the output I'm seeing here.
The drives in use for zpool HDD are currently shown as unassigned, despite the pool still existing. Zpool HDD still shows under Storage, and in the 'Storage > Disks' view I can see the disks associated with the HDD zpool.
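For reference, the vdev names in the zpool status output are partition UUIDs; they can be mapped back to /dev/sdX names and serial numbers with standard tools, which makes it easier to tell which physical disk a REMOVED entry corresponds to. A quick sketch:
Code:
# map the partition UUIDs shown by 'zpool status' back to device names
ls -l /dev/disk/by-partuuid/

# or list serials alongside, to match disks against shelf slots
lsblk -o NAME,SIZE,SERIAL,PARTUUID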
Some possibly important notes:
The data on zpool HDD is not important; data recovery is not required.
The HDD zpool is connected via an external NetApp DS4246 shelf, but it uses the same onboard HBA as my healthy zpool, which has never shown errors.
I've already tried swapping the cable to the DS4246, with seemingly no change in behavior.
My other, healthy zpool was created back in the FreeNAS days; I have yet to update the ZFS version on that pool.
I've tried capturing the zpool version for my healthy pool, but the 'zpool get version' command does not report a version.
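On that last point, my understanding is that pools created with feature flags report '-' for the legacy version property, and upgrade state is tracked per feature instead. The commands below are standard OpenZFS; the pool name is a placeholder.
Code:
# legacy version property; feature-flag pools are expected to show '-'
zpool get version <poolname>

# list pools that have supported feature flags not yet enabled
zpool upgrade

# per-feature enabled/active/disabled state for one pool
zpool get all <poolname> | grep feature@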
Any help with this is greatly appreciated.