all of my mirrored pools' swap partitions degraded at the same time (1 HD each)

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
I run 6 mirrored pools on my TrueNAS SCALE Bluefin 22.12 install:
  • 3x through a Broadcom 9405W-16i
    • 2x18TB Seagate X18 ST18000NM004J SAS
    • 2x18TB Seagate X18 ST18000NM004J SAS
    • 2x18TB Seagate X18 ST18000NM000J SATA <-- new pool
  • 2x through the mobo SATA controller
    • 2x 256GB Samsung 870 Evo <-- boot-pool
  • 1x through the NVMe PCIe controller on the mobo
    • 2x 2TB Samsung 970 Evo Plus <-- system & app pool
Yesterday I installed the two new Seagate Exos X18 ST18000NM000J drives (flashed with the latest SN04 firmware before installation), created a new encrypted pool and a new dataset, and started copying around 6TB of data from another pool (also attached to the Broadcom).

After 3-4 hours, at 3:00 AM, I received a SCALE notification together with an email saying one of the new drives had experienced too many write errors (I believe it was 72) and was taken offline, so the pool was now degraded.
I checked dmesg and this was the output for that event:
[13145.247058] sd 0:0:1:0: attempting task abort!scmd(0x00000000143350ad), outstanding for 31936 ms & timeout 30000 ms
[13145.247952] sd 0:0:1:0: [sdf] tag#213 CDB: Write(16) 8a 00 00 00 00 01 9c d3 4f 60 00 00 08 00 00 00
[13145.249217] scsi target0:0:1: handle(0x0022), sas_address(0x300605b00fe37e21), phy(1)
[13145.250063] scsi target0:0:1: enclosure logical id(0x500605b00fe37e20), slot(8)
[13145.250920] scsi target0:0:1: enclosure level(0x0000), connector name( C2 )
[13145.279663] sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000143350ad)
[13145.279688] mpt3sas_cm0: log_info(0x31130000): originator(PL), code(0x13), sub_code(0x0000)
[13145.281103] sd 0:0:1:0: [sdf] tag#213 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[13145.283409] sd 0:0:1:0: [sdf] tag#213 CDB: Write(16) 8a 00 00 00 00 01 9c d3 4f 60 00 00 08 00 00 00
[13145.284861] blk_update_request: I/O error, dev sdf, sector 6926061408 op 0x1:(WRITE) flags 0x700 phys_seg 70 prio class 0
[13145.286329] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5ae37f3c-2783-47d6-87ef-4f49769037cd error=5 type=2 offset=3543995891712 size=1048576 flags=40080c80
[13145.287862] sd 0:0:1:0: attempting task abort!scmd(0x0000000008b8b645), outstanding for 31920 ms & timeout 30000 ms
[13145.289387] sd 0:0:1:0: [sdf] tag#211 CDB: Write(16) 8a 00 00 00 00 01 9c d3 77 10 00 00 07 d0 00 00
[13145.290828] scsi target0:0:1: handle(0x0022), sas_address(0x300605b00fe37e21), phy(1)
[13145.291828] scsi target0:0:1: enclosure logical id(0x500605b00fe37e20), slot(8)
[13145.292732] scsi target0:0:1: enclosure level(0x0000), connector name( C2 )
[13145.293643] sd 0:0:1:0: No reference found at driver, assuming scmd(0x0000000008b8b645) might have completed
[13145.294569] sd 0:0:1:0: task abort: SUCCESS scmd(0x0000000008b8b645)
[13145.295500] sd 0:0:1:0: [sdf] tag#211 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[13145.296430] sd 0:0:1:0: [sdf] tag#211 CDB: Write(16) 8a 00 00 00 00 01 9c d3 77 10 00 00 07 d0 00 00
[13145.297355] blk_update_request: I/O error, dev sdf, sector 6926071568 op 0x1:(WRITE) flags 0x4700 phys_seg 73 prio class 0
[13145.298299] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5ae37f3c-2783-47d6-87ef-4f49769037cd error=5 type=2 offset=3544001093632 size=1024000 flags=40080c80
[13145.299293] sd 0:0:1:0: attempting task abort!scmd(0x00000000bd4f552d), outstanding for 31932 ms & timeout 30000 ms
[13145.300279] sd 0:0:1:0: [sdf] tag#212 CDB: Write(16) 8a 00 00 00 00 01 9c d3 7e e0 00 00 08 00 00 00
[13145.301253] scsi target0:0:1: handle(0x0022), sas_address(0x300605b00fe37e21), phy(1)
[13145.302242] scsi target0:0:1: enclosure logical id(0x500605b00fe37e20), slot(8)
[13145.303242] scsi target0:0:1: enclosure level(0x0000), connector name( C2 )
[13145.304241] sd 0:0:1:0: No reference found at driver, assuming scmd(0x00000000bd4f552d) might have completed
[13145.305259] sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000bd4f552d)
[13145.306265] sd 0:0:1:0: [sdf] tag#212 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[13145.307295] sd 0:0:1:0: [sdf] tag#212 CDB: Write(16) 8a 00 00 00 00 01 9c d3 7e e0 00 00 08 00 00 00
[13145.308327] blk_update_request: I/O error, dev sdf, sector 6926073568 op 0x1:(WRITE) flags 0x4700 phys_seg 103 prio class 0
[13145.309368] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5ae37f3c-2783-47d6-87ef-4f49769037cd error=5 type=2 offset=3544002117632 size=1048576 flags=40080c80
[13145.310444] sd 0:0:1:0: attempting task abort!scmd(0x00000000b61440d8), outstanding for 31940 ms & timeout 30000 ms
[13145.311510] sd 0:0:1:0: [sdf] tag#209 CDB: Write(16) 8a 00 00 00 00 01 9c d3 6f 10 00 00 08 00 00 00
[13145.312581] scsi target0:0:1: handle(0x0022), sas_address(0x300605b00fe37e21), phy(1)
[13145.313691] scsi target0:0:1: enclosure logical id(0x500605b00fe37e20), slot(8)
[13145.314773] scsi target0:0:1: enclosure level(0x0000), connector name( C2 )
[13145.315854] sd 0:0:1:0: No reference found at driver, assuming scmd(0x00000000b61440d8) might have completed
[13145.316951] sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000b61440d8)
[13145.318053] sd 0:0:1:0: [sdf] tag#209 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[13145.319167] sd 0:0:1:0: [sdf] tag#209 CDB: Write(16) 8a 00 00 00 00 01 9c d3 6f 10 00 00 08 00 00 00
[13145.320285] blk_update_request: I/O error, dev sdf, sector 6926069520 op 0x1:(WRITE) flags 0x4700 phys_seg 79 prio class 0
[13145.321417] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5ae37f3c-2783-47d6-87ef-4f49769037cd error=5 type=2 offset=3544000045056 size=1048576 flags=40080c80
[13145.322784] sd 0:0:1:0: attempting task abort!scmd(0x0000000057516cfd), outstanding for 31952 ms & timeout 30000 ms
[13145.324079] sd 0:0:1:0: [sdf] tag#207 CDB: Write(16) 8a 00 00 00 00 01 9c d3 5f 40 00 00 08 00 00 00
[13145.325317] scsi target0:0:1: handle(0x0022), sas_address(0x300605b00fe37e21), phy(1)
[13145.326560] scsi target0:0:1: enclosure logical id(0x500605b00fe37e20), slot(8)
[13145.327815] scsi target0:0:1: enclosure level(0x0000), connector name( C2 )
[13145.329083] sd 0:0:1:0: No reference found at driver, assuming scmd(0x0000000057516cfd) might have completed
[13145.330338] sd 0:0:1:0: task abort: SUCCESS scmd(0x0000000057516cfd)
[13145.331608] sd 0:0:1:0: [sdf] tag#207 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[13145.332894] sd 0:0:1:0: [sdf] tag#207 CDB: Write(16) 8a 00 00 00 00 01 9c d3 5f 40 00 00 08 00 00 00
[13145.334159] blk_update_request: I/O error, dev sdf, sector 6926065472 op 0x1:(WRITE) flags 0x4700 phys_seg 70 prio class 0
[13145.335459] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5ae37f3c-2783-47d6-87ef-4f49769037cd error=5 type=2 offset=3543997972480 size=1048576 flags=40080c80
[13145.336793] sd 0:0:1:0: attempting task abort!scmd(0x000000009f37fabf), outstanding for 31968 ms & timeout 30000 ms
[13145.338060] sd 0:0:1:0: [sdf] tag#208 CDB: Write(16) 8a 00 00 00 00 01 9c d3 67 40 00 00 07 d0 00 00
[13145.339332] scsi target0:0:1: handle(0x0022), sas_address(0x300605b00fe37e21), phy(1)
[13145.340598] scsi target0:0:1: enclosure logical id(0x500605b00fe37e20), slot(8)
[13145.341864] scsi target0:0:1: enclosure level(0x0000), connector name( C2 )
[13145.343132] sd 0:0:1:0: No reference found at driver, assuming scmd(0x000000009f37fabf) might have completed
[13145.344417] sd 0:0:1:0: task abort: SUCCESS scmd(0x000000009f37fabf)
[13145.345705] sd 0:0:1:0: [sdf] tag#208 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[13145.346998] sd 0:0:1:0: [sdf] tag#208 CDB: Write(16) 8a 00 00 00 00 01 9c d3 67 40 00 00 07 d0 00 00
[13145.348299] blk_update_request: I/O error, dev sdf, sector 6926067520 op 0x1:(WRITE) flags 0x4700 phys_seg 65 prio class 0
[13145.349612] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5ae37f3c-2783-47d6-87ef-4f49769037cd error=5 type=2 offset=3543999021056 size=1024000 flags=40080c80
[13145.350959] sd 0:0:1:0: attempting task abort!scmd(0x00000000768d6a02), outstanding for 32032 ms & timeout 30000 ms
[13145.352313] sd 0:0:1:0: [sdf] tag#206 CDB: Write(16) 8a 00 00 00 00 01 9c d3 57 60 00 00 07 e0 00 00
[13145.353659] scsi target0:0:1: handle(0x0022), sas_address(0x300605b00fe37e21), phy(1)
[13145.355015] scsi target0:0:1: enclosure logical id(0x500605b00fe37e20), slot(8)
[13145.356373] scsi target0:0:1: enclosure level(0x0000), connector name( C2 )
[13145.357748] sd 0:0:1:0: No reference found at driver, assuming scmd(0x00000000768d6a02) might have completed
[13145.359121] sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000768d6a02)
[13145.360507] sd 0:0:1:0: [sdf] tag#206 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=32s
[13145.361905] sd 0:0:1:0: [sdf] tag#206 CDB: Write(16) 8a 00 00 00 00 01 9c d3 57 60 00 00 07 e0 00 00
[13145.363309] blk_update_request: I/O error, dev sdf, sector 6926063456 op 0x1:(WRITE) flags 0x4700 phys_seg 63 prio class 0
[13145.364720] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5ae37f3c-2783-47d6-87ef-4f49769037cd error=5 type=2 offset=3543996940288 size=1032192 flags=40080c80
[13145.366161] sd 0:0:1:0: attempting task abort!scmd(0x000000003eb005f8), outstanding for 32056 ms & timeout 30000 ms
[13145.367602] sd 0:0:1:0: [sdf] tag#205 CDB: Write(16) 8a 00 00 00 00 01 9c d3 47 90 00 00 07 d0 00 00
[13145.369045] scsi target0:0:1: handle(0x0022), sas_address(0x300605b00fe37e21), phy(1)
[13145.370498] scsi target0:0:1: enclosure logical id(0x500605b00fe37e20), slot(8)
[13145.371953] scsi target0:0:1: enclosure level(0x0000), connector name( C2 )
[13145.373420] sd 0:0:1:0: No reference found at driver, assuming scmd(0x000000003eb005f8) might have completed
[13145.374884] sd 0:0:1:0: task abort: SUCCESS scmd(0x000000003eb005f8)
[13145.376354] sd 0:0:1:0: [sdf] tag#205 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=32s
[13145.377832] sd 0:0:1:0: [sdf] tag#205 CDB: Write(16) 8a 00 00 00 00 01 9c d3 47 90 00 00 07 d0 00 00
[13145.379316] blk_update_request: I/O error, dev sdf, sector 6926059408 op 0x1:(WRITE) flags 0x700 phys_seg 76 prio class 0
[13145.380816] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5ae37f3c-2783-47d6-87ef-4f49769037cd error=5 type=2 offset=3543994867712 size=1024000 flags=40080c80
[13146.114654] sd 0:0:1:0: Power-on or device reset occurred
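
For reference, the zio lines above identify the failing disk by partition UUID rather than by device name, so this is roughly how I confirmed they all pointed at sdf (the partuuid is the one from my log; substitute your own):

ls -l /dev/disk/by-partuuid/ | grep 5ae37f3c-2783-47d6-87ef-4f49769037cd
lsblk -o NAME,PARTUUID,SIZE /dev/sdf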

Thirty-six minutes later, at 3:36 AM, I received 6 emails, one for each pool, telling me "A Fail event had been detected on md device /dev/md12x." This time I got no notifications inside SCALE, but after checking dmesg for this event I see:
[15306.052673] md/raid1:md122: Disk failure on sdj1, disabling device.
md/raid1:md122: Operation continuing on 1 devices.
[15306.419884] md122: detected capacity change from 4190080 to 0
[15306.421398] md: md122 stopped.
[15308.804717] md/raid1:md123: Disk failure on sdh1, disabling device.
md/raid1:md123: Operation continuing on 1 devices.
[15308.995956] md123: detected capacity change from 4190080 to 0
[15308.997482] md: md123 stopped.
[15309.901251] md/raid1:md124: Disk failure on sdi1, disabling device.
md/raid1:md124: Operation continuing on 1 devices.
[15310.107277] md124: detected capacity change from 4190080 to 0
[15310.108818] md: md124 stopped.
[15310.628166] md/raid1:md126: Disk failure on sdb4, disabling device.
md/raid1:md126: Operation continuing on 1 devices.
[15310.955525] md126: detected capacity change from 33519616 to 0
[15310.956959] md: md126 stopped.
[15311.164649] md/raid1:md127: Disk failure on nvme0n1p1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
[15311.344211] md127: detected capacity change from 4190080 to 0
[15311.345664] md: md127 stopped.
[15311.787724] md/raid1:md125: Disk failure on sdd1, disabling device.
md/raid1:md125: Operation continuing on 1 devices.
[15311.972964] md125: detected capacity change from 4190080 to 0
[15311.974416] md: md125 stopped.
[15316.471064] md: md126 stopped.
[15316.490812] md: md124 stopped.
[15316.510425] md: md123 stopped.
[15316.634605] md/raid1:md127: not clean -- starting background reconstruction
[15316.636111] md/raid1:md127: active with 2 out of 2 mirrors
[15316.637676] md127: detected capacity change from 0 to 4190080
[15316.647024] md: resync of RAID array md127
[15316.986642] md/raid1:md126: not clean -- starting background reconstruction
[15316.988101] md/raid1:md126: active with 2 out of 2 mirrors
[15316.989590] md126: detected capacity change from 0 to 4190080
[15317.008469] md: resync of RAID array md126
[15317.884426] md/raid1:md125: not clean -- starting background reconstruction
[15317.885910] md/raid1:md125: active with 2 out of 2 mirrors
[15317.887366] md125: detected capacity change from 0 to 4190080
[15318.053212] md: resync of RAID array md125
[15318.344159] md/raid1:md124: not clean -- starting background reconstruction
[15318.345630] md/raid1:md124: active with 2 out of 2 mirrors
[15318.347084] md124: detected capacity change from 0 to 4190080
[15318.357314] md: resync of RAID array md124
[15319.282522] md/raid1:md123: not clean -- starting background reconstruction
[15319.283987] md/raid1:md123: active with 2 out of 2 mirrors
[15319.285473] md123: detected capacity change from 0 to 4190080
[15319.288991] md: resync of RAID array md123
[15319.575761] Adding 2095036k swap on /dev/mapper/md127. Priority:-2 extents:1 across:2095036k FS
[15320.844446] Adding 2095036k swap on /dev/mapper/md126. Priority:-3 extents:1 across:2095036k FS
[15321.866974] Adding 2095036k swap on /dev/mapper/md125. Priority:-4 extents:1 across:2095036k FS
[15323.368532] Adding 2095036k swap on /dev/mapper/md124. Priority:-5 extents:1 across:2095036k FS
[15323.461809] Adding 2095036k swap on /dev/mapper/md123. Priority:-6 extents:1 across:2095036k SSFS
[15327.498384] md: md127: resync done.
[15329.690523] md: md124: resync done.
[15329.898522] md: md123: resync done.
[15330.833792] md: md126: resync done.
[15336.982469] md/raid1:md123: Disk failure on nvme0n1p1, disabling device.
md/raid1:md123: Operation continuing on 1 devices.
[15337.159918] md123: detected capacity change from 4190080 to 0
[15337.161167] md: md123 stopped.
[15337.684205] md/raid1:md124: Disk failure on sdd1, disabling device.
md/raid1:md124: Operation continuing on 1 devices.
[15338.235562] md124: detected capacity change from 4190080 to 0
[15338.236860] md: md124 stopped.
[15339.518137] md/raid1:md125: Disk failure on sdj1, disabling device.
md/raid1:md125: Operation continuing on 1 devices.
[15339.518823] md: md125: resync interrupted.
[15339.580922] md: resync of RAID array md125
[15339.582335] md: md125: resync done.
[15339.702637] md125: detected capacity change from 4190080 to 0
[15339.703817] md: md125 stopped.
[15340.162468] md/raid1:md126: Disk failure on sdh1, disabling device.
md/raid1:md126: Operation continuing on 1 devices.
[15340.333880] md126: detected capacity change from 4190080 to 0
[15340.335029] md: md126 stopped.
[15340.838505] md/raid1:md127: Disk failure on sdf1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
[15341.002913] md127: detected capacity change from 4190080 to 0
[15341.004041] md: md127 stopped.
[15347.515596] md/raid1:md127: not clean -- starting background reconstruction
[15347.516762] md/raid1:md127: active with 2 out of 2 mirrors
[15347.518042] md127: detected capacity change from 0 to 4190080
[15347.545119] md: resync of RAID array md127
[15347.829981] md/raid1:md126: not clean -- starting background reconstruction
[15347.831034] md/raid1:md126: active with 2 out of 2 mirrors
[15347.832085] md126: detected capacity change from 0 to 4190080
[15347.833305] md: resync of RAID array md126
[15348.762812] md/raid1:md125: not clean -- starting background reconstruction
[15348.763824] md/raid1:md125: active with 2 out of 2 mirrors
[15348.764907] md125: detected capacity change from 0 to 4190080
[15348.766123] md: resync of RAID array md125
[15349.745678] md/raid1:md124: not clean -- starting background reconstruction
[15349.746671] md/raid1:md124: active with 2 out of 2 mirrors
[15349.747692] md124: detected capacity change from 0 to 4190080
[15349.755828] md: resync of RAID array md124
[15351.189289] md/raid1:md123: not clean -- starting background reconstruction
[15351.190267] md/raid1:md123: active with 2 out of 2 mirrors
[15351.191242] md123: detected capacity change from 0 to 33519616
[15351.193047] md: resync of RAID array md123
[15351.902026] Adding 2095036k swap on /dev/mapper/md127. Priority:-2 extents:1 across:2095036k FS
[15352.553954] Adding 2095036k swap on /dev/mapper/md126. Priority:-3 extents:1 across:2095036k FS
[15354.081531] Adding 2095036k swap on /dev/mapper/md125. Priority:-4 extents:1 across:2095036k FS
[15359.073898] Adding 2095036k swap on /dev/mapper/md124. Priority:-5 extents:1 across:2095036k FS
[15359.180002] Adding 16759804k swap on /dev/mapper/md123. Priority:-6 extents:1 across:16759804k SSFS
[15360.887870] md: md124: resync done.
[15366.238511] md: md125: resync done.
[15366.519767] md: md127: resync done.
[15367.425958] md: md126: resync done.
[15434.410112] md: md123: resync done.
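
In case it helps anyone reading along: SCALE builds these swap devices as md (Linux software RAID) mirrors on top of small partitions carved from the data disks, so their state can be checked independently of ZFS with the usual md tooling:

cat /proc/mdstat           # one-line summary of every md array and its members
mdadm --detail /dev/md127  # per-array view, shows which partition was kicked out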

One drive in each pool issued a Fail event, across all 3 controllers. What is weird here is that zpool status didn't show any problem with the other pools; it only showed the newly installed pool as degraded. TrueNAS SCALE didn't raise any notification either (aside from the emails).
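
(Worth knowing when cross-checking something like this: zpool status -x prints only the pools ZFS itself considers unhealthy, and zpool status -v <pool> adds per-device and per-file error detail, so the two commands below make it quick to see whether ZFS agrees with what md is reporting.)

zpool status -x
zpool status -v exos-x18-c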

This happened while the initial copy was still going on smoothly, even though the destination pool was now down to one working drive.

After almost 9 hours, at 12:12 PM, I received another email telling me that the other disk in the new pool had now experienced write errors and was degraded too, but it wasn't taken offline since files were still copying onto it.
Here is the dmesg output for this event, identical to the first one in the chain:
[46280.116429] sd 0:0:5:0: attempting task abort!scmd(0x0000000081c33f81), outstanding for 31612 ms & timeout 30000 ms
[46280.117518] sd 0:0:5:0: [sdj] tag#1446 CDB: Write(16) 8a 00 00 00 00 00 04 47 1d 20 00 00 00 20 00 00
[46280.118726] scsi target0:0:5: handle(0x0025), sas_address(0x300605b00fe37e2f), phy(15)
[46280.119868] scsi target0:0:5: enclosure logical id(0x500605b00fe37e20), slot(3)
[46280.121003] scsi target0:0:5: enclosure level(0x0000), connector name( C0 )
[46280.150026] sd 0:0:5:0: task abort: SUCCESS scmd(0x0000000081c33f81)
[46280.151910] sd 0:0:5:0: [sdj] tag#1446 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[46280.153834] sd 0:0:5:0: [sdj] tag#1446 CDB: Write(16) 8a 00 00 00 00 00 04 47 1d 20 00 00 00 20 00 00
[46280.155740] blk_update_request: I/O error, dev sdj, sector 71769376 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
[46280.157689] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5c736f36-b9e2-4c47-8c2f-6eea4732f2cd error=5 type=2 offset=34598371328 size=16384 flags=180880
[46280.159756] sd 0:0:5:0: attempting task abort!scmd(0x0000000080c04a03), outstanding for 31792 ms & timeout 30000 ms
[46280.161812] sd 0:0:5:0: [sdj] tag#1445 CDB: Write(16) 8a 00 00 00 00 00 02 d5 1b d0 00 00 00 20 00 00
[46280.163793] scsi target0:0:5: handle(0x0025), sas_address(0x300605b00fe37e2f), phy(15)
[46280.165821] scsi target0:0:5: enclosure logical id(0x500605b00fe37e20), slot(3)
[46280.167831] scsi target0:0:5: enclosure level(0x0000), connector name( C0 )
[46280.169864] sd 0:0:5:0: No reference found at driver, assuming scmd(0x0000000080c04a03) might have completed
[46280.172028] sd 0:0:5:0: task abort: SUCCESS scmd(0x0000000080c04a03)
[46280.174085] sd 0:0:5:0: [sdj] tag#1445 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[46280.176158] sd 0:0:5:0: [sdj] tag#1445 CDB: Write(16) 8a 00 00 00 00 00 02 d5 1b d0 00 00 00 20 00 00
[46280.178246] blk_update_request: I/O error, dev sdj, sector 47520720 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
[46280.180358] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5c736f36-b9e2-4c47-8c2f-6eea4732f2cd error=5 type=2 offset=22183059456 size=16384 flags=180880
[46280.182571] sd 0:0:5:0: attempting task abort!scmd(0x00000000bd69602c), outstanding for 31676 ms & timeout 30000 ms
[46280.184397] sd 0:0:5:0: [sdj] tag#1442 CDB: Write(16) 8a 00 00 00 00 06 10 92 08 d8 00 00 08 00 00 00
[46280.186206] scsi target0:0:5: handle(0x0025), sas_address(0x300605b00fe37e2f), phy(15)
[46280.187916] scsi target0:0:5: enclosure logical id(0x500605b00fe37e20), slot(3)
[46280.189114] scsi target0:0:5: enclosure level(0x0000), connector name( C0 )
[46280.190147] sd 0:0:5:0: No reference found at driver, assuming scmd(0x00000000bd69602c) might have completed
[46280.191190] sd 0:0:5:0: task abort: SUCCESS scmd(0x00000000bd69602c)
[46280.192340] sd 0:0:5:0: [sdj] tag#1442 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[46280.193390] sd 0:0:5:0: [sdj] tag#1442 CDB: Write(16) 8a 00 00 00 00 06 10 92 08 d8 00 00 08 00 00 00
[46280.194436] blk_update_request: I/O error, dev sdj, sector 26047809752 op 0x1:(WRITE) flags 0x700 phys_seg 83 prio class 0
[46280.195510] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5c736f36-b9e2-4c47-8c2f-6eea4732f2cd error=5 type=2 offset=13334331043840 size=1048576 flags=40080c80
[46280.196627] sd 0:0:5:0: attempting task abort!scmd(0x0000000005af84fe), outstanding for 31692 ms & timeout 30000 ms
[46280.197728] sd 0:0:5:0: [sdj] tag#1444 CDB: Write(16) 8a 00 00 00 00 06 10 92 16 68 00 00 09 d8 00 00
[46280.198812] scsi target0:0:5: handle(0x0025), sas_address(0x300605b00fe37e2f), phy(15)
[46280.199903] scsi target0:0:5: enclosure logical id(0x500605b00fe37e20), slot(3)
[46280.201013] scsi target0:0:5: enclosure level(0x0000), connector name( C0 )
[46280.202236] sd 0:0:5:0: No reference found at driver, assuming scmd(0x0000000005af84fe) might have completed
[46280.203356] sd 0:0:5:0: task abort: SUCCESS scmd(0x0000000005af84fe)
[46280.204479] sd 0:0:5:0: [sdj] tag#1444 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[46280.205733] sd 0:0:5:0: [sdj] tag#1444 CDB: Write(16) 8a 00 00 00 00 06 10 92 16 68 00 00 09 d8 00 00
[46280.207337] blk_update_request: I/O error, dev sdj, sector 26047813224 op 0x1:(WRITE) flags 0x700 phys_seg 106 prio class 0
[46280.208493] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5c736f36-b9e2-4c47-8c2f-6eea4732f2cd error=5 type=2 offset=13334333063168 size=1048576 flags=40080c80
[46280.209688] sd 0:0:5:0: attempting task abort!scmd(0x00000000e209cf3f), outstanding for 31704 ms & timeout 30000 ms
[46280.210889] sd 0:0:5:0: [sdj] tag#1443 CDB: Write(16) 8a 00 00 00 00 06 10 92 10 d8 00 00 05 90 00 00
[46280.212118] scsi target0:0:5: handle(0x0025), sas_address(0x300605b00fe37e2f), phy(15)
[46280.213365] scsi target0:0:5: enclosure logical id(0x500605b00fe37e20), slot(3)
[46280.214574] scsi target0:0:5: enclosure level(0x0000), connector name( C0 )
[46280.215790] sd 0:0:5:0: No reference found at driver, assuming scmd(0x00000000e209cf3f) might have completed
[46280.217093] sd 0:0:5:0: task abort: SUCCESS scmd(0x00000000e209cf3f)
[46280.218372] sd 0:0:5:0: [sdj] tag#1443 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[46280.219566] sd 0:0:5:0: [sdj] tag#1443 CDB: Write(16) 8a 00 00 00 00 06 10 92 10 d8 00 00 05 90 00 00
[46280.220825] blk_update_request: I/O error, dev sdj, sector 26047811800 op 0x1:(WRITE) flags 0x4700 phys_seg 128 prio class 0
[46280.222134] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5c736f36-b9e2-4c47-8c2f-6eea4732f2cd error=5 type=2 offset=13334332092416 size=970752 flags=40080c80
[46280.223435] sd 0:0:5:0: attempting task abort!scmd(0x00000000c519bc86), outstanding for 31716 ms & timeout 30000 ms
[46280.224750] sd 0:0:5:0: [sdj] tag#1441 CDB: Write(16) 8a 00 00 00 00 06 10 92 01 08 00 00 07 d0 00 00
[46280.226054] scsi target0:0:5: handle(0x0025), sas_address(0x300605b00fe37e2f), phy(15)
[46280.227300] scsi target0:0:5: enclosure logical id(0x500605b00fe37e20), slot(3)
[46280.228583] scsi target0:0:5: enclosure level(0x0000), connector name( C0 )
[46280.229825] sd 0:0:5:0: No reference found at driver, assuming scmd(0x00000000c519bc86) might have completed
[46280.231139] sd 0:0:5:0: task abort: SUCCESS scmd(0x00000000c519bc86)
[46280.232443] sd 0:0:5:0: [sdj] tag#1441 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[46280.233760] sd 0:0:5:0: [sdj] tag#1441 CDB: Write(16) 8a 00 00 00 00 06 10 92 01 08 00 00 07 d0 00 00
[46280.235017] blk_update_request: I/O error, dev sdj, sector 26047807752 op 0x1:(WRITE) flags 0x4700 phys_seg 88 prio class 0
[46280.236298] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5c736f36-b9e2-4c47-8c2f-6eea4732f2cd error=5 type=2 offset=13334330019840 size=1024000 flags=40080c80
[46280.237655] sd 0:0:5:0: attempting task abort!scmd(0x00000000171572b2), outstanding for 31732 ms & timeout 30000 ms
[46280.239006] sd 0:0:5:0: [sdj] tag#1439 CDB: Write(16) 8a 00 00 00 00 06 10 91 f1 88 00 00 07 c0 00 00
[46280.240307] scsi target0:0:5: handle(0x0025), sas_address(0x300605b00fe37e2f), phy(15)
[46280.241683] scsi target0:0:5: enclosure logical id(0x500605b00fe37e20), slot(3)
[46280.243026] scsi target0:0:5: enclosure level(0x0000), connector name( C0 )
[46280.244342] sd 0:0:5:0: No reference found at driver, assuming scmd(0x00000000171572b2) might have completed
[46280.245712] sd 0:0:5:0: task abort: SUCCESS scmd(0x00000000171572b2)
[46280.247030] sd 0:0:5:0: [sdj] tag#1439 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[46280.248789] sd 0:0:5:0: [sdj] tag#1439 CDB: Write(16) 8a 00 00 00 00 06 10 91 f1 88 00 00 07 c0 00 00
[46280.250353] blk_update_request: I/O error, dev sdj, sector 26047803784 op 0x1:(WRITE) flags 0x4700 phys_seg 88 prio class 0
[46280.251703] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5c736f36-b9e2-4c47-8c2f-6eea4732f2cd error=5 type=2 offset=13334327988224 size=1015808 flags=40080c80
[46280.253220] sd 0:0:5:0: attempting task abort!scmd(0x00000000ff934634), outstanding for 31748 ms & timeout 30000 ms
[46280.254606] sd 0:0:5:0: [sdj] tag#1440 CDB: Write(16) 8a 00 00 00 00 06 10 91 f9 48 00 00 07 c0 00 00
[46280.255993] scsi target0:0:5: handle(0x0025), sas_address(0x300605b00fe37e2f), phy(15)
[46280.257427] scsi target0:0:5: enclosure logical id(0x500605b00fe37e20), slot(3)
[46280.258821] scsi target0:0:5: enclosure level(0x0000), connector name( C0 )
[46280.260247] sd 0:0:5:0: No reference found at driver, assuming scmd(0x00000000ff934634) might have completed
[46280.261724] sd 0:0:5:0: task abort: SUCCESS scmd(0x00000000ff934634)
[46280.263168] sd 0:0:5:0: [sdj] tag#1440 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[46280.264643] sd 0:0:5:0: [sdj] tag#1440 CDB: Write(16) 8a 00 00 00 00 06 10 91 f9 48 00 00 07 c0 00 00
[46280.266093] blk_update_request: I/O error, dev sdj, sector 26047805768 op 0x1:(WRITE) flags 0x4700 phys_seg 86 prio class 0
[46280.267547] zio pool=exos-x18-c vdev=/dev/disk/by-partuuid/5c736f36-b9e2-4c47-8c2f-6eea4732f2cd error=5 type=2 offset=13334329004032 size=1015808 flags=40080c80
[truncated for the max character limit, but it's identical to the first error, this time for sdj]

After 3 more hours, the files finished copying and the new dataset was still standing, with one disk in a degraded state.
The other pools were active and healthy like nothing had happened. I ran smartctl -x /dev/sdX for each drive and got zero errors, including on the two new drives: the SMART tests all passed with flying colors. (I can't attach the smartctl -x outputs for 10 drives here; if needed I'll post them in subsequent replies.)
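
For completeness, this is more or less the loop I used for the spinning drives (device letters are specific to my box; adjust the glob, and the NVMe drives need their own /dev/nvmeXn1 paths):

for d in /dev/sd[a-j]; do
    echo "=== $d ==="
    smartctl -x "$d"
done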

This is quite bizarre and very scary, considering all of my pools flipped at the same time, if what dmesg says is true.
Also, this cannot be pinned on the HBA, or on any single controller, since the failures spanned every controller in my rig.
Moreover, the two new drives were sourced from different vendors to avoid getting the same batch, and I find it highly suspicious that both failed in the same way immediately after installation, while the SMART tests tell me the drives are fine. I also put the two drives on two different ports on the HBA, using separate cables.
Plus, I copied another 700GB onto the new (now degraded) pool just in case, and didn't hit any new errors.

My hunch is that the drives and controllers are not at fault here, but other than that I have no clue. I'm also not sure if I should open a ticket about this.
If there are any other tests I can perform on the new drives, let me know. Thanks.

P.S.: after the new pool got degraded and one of the drives was taken offline, I started a SMART long test on that drive, which is still running (almost at 24 hrs). I intend to resilver the pool once the test is over.
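
For anyone following along, I started and am monitoring the long test with plain smartctl, roughly like this (nothing SCALE-specific; sdf is the offlined drive from the log above):

smartctl -t long /dev/sdf                              # start the extended self-test
smartctl -a /dev/sdf | grep -A2 'Self-test execution'  # check progress and remaining %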
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Sounds quite weird. My initial thought is: how are the drives connected to power?
 

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
Sounds quite weird. My initial thought is: how are the drives connected to power?
SATA power connectors straight to a Seasonic TX-650 Titanium PSU.
Also, the rig is connected to the wall through an APC SMT-1000 UPS, so it's very unlikely this is a power issue.

P.S.: except of course the 2x NVMe drives, which sit in M.2 slots directly on the motherboard.
 

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
The SMART long test on one of the new drives has finished and guess what: the drive is perfect.
Below is the output of smartctl -x for both drives.
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: ST18000NM000J-2TV103
Serial Number: XXXXXXX
LU WWN Device Id: 5 000c50 0e59a9a8a
Firmware Version: SN04
User Capacity: 18,000,207,937,536 bytes [18.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Dec 23 04:25:58 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Disabled
ATA Security is: Disabled, NOT FROZEN [SEC1]
Write SCT (Get) Feature Control Command failed: scsi error unsupported field in scsi command
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 567) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: (1571) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 084 066 044 - 226810912
3 Spin_Up_Time PO---- 096 096 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 4
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 070 060 045 - 10271867
9 Power_On_Hours -O--CK 100 100 000 - 189
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 3
18 Unknown_Attribute PO-R-- 100 100 050 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 063 053 000 - 37 (Min/Max 32/37)
192 Power-Off_Retract_Count -O--CK 100 100 000 - 2
193 Load_Cycle_Count -O--CK 100 100 000 - 283
194 Temperature_Celsius -O---K 037 047 000 - 37 (0 22 0 0 0)
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
200 Multi_Zone_Error_Rate PO---K 100 100 001 - 0
240 Head_Flying_Hours ------ 100 100 000 - 36 (242 214 0)
241 Total_LBAs_Written ------ 100 253 000 - 2177952505
242 Total_LBAs_Read ------ 100 253 000 - 4714428
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 5 Ext. Comprehensive SMART error log
0x04 GPL R/O 256 Device Statistics log
0x04 SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x08 GPL R/O 2 Power Conditions log
0x09 SL R/W 1 Selective self-test log
0x0a GPL R/W 8 Device Statistics Notification
0x0c GPL R/O 2048 Pending Defects log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x13 GPL R/O 1 SATA NCQ Send and Receive log
0x21 GPL R/O 1 Write stream error log
0x22 GPL R/O 1 Read stream error log
0x24 GPL R/O 768 Current Device Internal Status Data log
0x2f GPL - 1 Set Sector Configuration
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa1 GPL,SL VS 160 Device vendor specific log
0xa2 GPL VS 16320 Device vendor specific log
0xa4 GPL,SL VS 160 Device vendor specific log
0xa6 GPL VS 192 Device vendor specific log
0xa8-0xa9 GPL,SL VS 136 Device vendor specific log
0xab GPL VS 1 Device vendor specific log
0xad GPL VS 16 Device vendor specific log
0xb1 GPL,SL VS 160 Device vendor specific log
0xb6 GPL VS 1920 Device vendor specific log
0xbe-0xbf GPL VS 65535 Device vendor specific log
0xc1 GPL,SL VS 8 Device vendor specific log
0xc3 GPL,SL VS 24 Device vendor specific log
0xc6 GPL VS 5184 Device vendor specific log
0xc7 GPL,SL VS 8 Device vendor specific log
0xc9 GPL,SL VS 8 Device vendor specific log
0xca GPL,SL VS 16 Device vendor specific log
0xcd GPL,SL VS 1 Device vendor specific log
0xce GPL VS 1 Device vendor specific log
0xcf GPL VS 512 Device vendor specific log
0xd1 GPL VS 656 Device vendor specific log
0xd2 GPL VS 10256 Device vendor specific log
0xd4 GPL VS 2048 Device vendor specific log
0xda GPL,SL VS 1 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 188 -
# 2 Short offline Completed without error 00% 164 -
# 3 Short offline Completed without error 00% 2 -
# 4 Short offline Completed without error 00% 2 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
Device State: Active (0)
Current Temperature: 37 Celsius
Power Cycle Min/Max Temperature: 32/37 Celsius
Lifetime Min/Max Temperature: 22/47 Celsius
Under/Over Temperature Limit Count: 0/0
SMART Status: 0xc24f (PASSED)
Vendor specific:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version: 2
Temperature Sampling Period: 4 minutes
Temperature Logging Interval: 59 minutes
Min/Max recommended Temperature: 10/40 Celsius
Min/Max Temperature Limit: 5/60 Celsius
Temperature History Size (Index): 128 (65)

Index Estimated Time Temperature Celsius
66 2022-12-17 23:17 32 *************
... ..( 2 skipped). .. *************
69 2022-12-18 02:14 32 *************
70 2022-12-18 03:13 30 ***********
71 2022-12-18 04:12 31 ************
72 2022-12-18 05:11 32 *************
73 2022-12-18 06:10 31 ************
74 2022-12-18 07:09 32 *************
... ..( 11 skipped). .. *************
86 2022-12-18 18:57 32 *************
87 2022-12-18 19:56 31 ************
88 2022-12-18 20:55 31 ************
89 2022-12-18 21:54 31 ************
90 2022-12-18 22:53 32 *************
91 2022-12-18 23:52 31 ************
92 2022-12-19 00:51 31 ************
93 2022-12-19 01:50 32 *************
94 2022-12-19 02:49 31 ************
95 2022-12-19 03:48 30 ***********
96 2022-12-19 04:47 31 ************
... ..( 5 skipped). .. ************
102 2022-12-19 10:41 31 ************
103 2022-12-19 11:40 30 ***********
104 2022-12-19 12:39 30 ***********
105 2022-12-19 13:38 30 ***********
106 2022-12-19 14:37 31 ************
107 2022-12-19 15:36 31 ************
108 2022-12-19 16:35 32 *************
109 2022-12-19 17:34 31 ************
110 2022-12-19 18:33 31 ************
111 2022-12-19 19:32 32 *************
112 2022-12-19 20:31 32 *************
113 2022-12-19 21:30 31 ************
... ..( 2 skipped). .. ************
116 2022-12-20 00:27 31 ************
117 2022-12-20 01:26 32 *************
118 2022-12-20 02:25 31 ************
... ..( 5 skipped). .. ************
124 2022-12-20 08:19 31 ************
125 2022-12-20 09:18 32 *************
126 2022-12-20 10:17 32 *************
127 2022-12-20 11:16 31 ************
... ..( 3 skipped). .. ************
3 2022-12-20 15:12 31 ************
4 2022-12-20 16:11 30 ***********
5 2022-12-20 17:10 30 ***********
6 2022-12-20 18:09 31 ************
7 2022-12-20 19:08 30 ***********
8 2022-12-20 20:07 31 ************
9 2022-12-20 21:06 31 ************
10 2022-12-20 22:05 31 ************
11 2022-12-20 23:04 30 ***********
12 2022-12-21 00:03 31 ************
13 2022-12-21 01:02 30 ***********
14 2022-12-21 02:01 30 ***********
15 2022-12-21 03:00 31 ************
16 2022-12-21 03:59 31 ************
17 2022-12-21 04:58 30 ***********
18 2022-12-21 05:57 31 ************
19 2022-12-21 06:56 31 ************
20 2022-12-21 07:55 31 ************
21 2022-12-21 08:54 32 *************
22 2022-12-21 09:53 31 ************
... ..( 7 skipped). .. ************
30 2022-12-21 17:45 31 ************
31 2022-12-21 18:44 30 ***********
32 2022-12-21 19:43 30 ***********
33 2022-12-21 20:42 30 ***********
34 2022-12-21 21:41 ? -
35 2022-12-21 22:40 22 ***
36 2022-12-21 23:39 33 **************
37 2022-12-22 00:38 34 ***************
38 2022-12-22 01:37 37 ******************
39 2022-12-22 02:36 34 ***************
40 2022-12-22 03:35 39 ********************
41 2022-12-22 04:34 40 *********************
42 2022-12-22 05:33 40 *********************
43 2022-12-22 06:32 39 ********************
... ..( 2 skipped). .. ********************
46 2022-12-22 09:29 39 ********************
47 2022-12-22 10:28 41 **********************
48 2022-12-22 11:27 39 ********************
49 2022-12-22 12:26 41 **********************
50 2022-12-22 13:25 40 *********************
51 2022-12-22 14:24 42 ***********************
52 2022-12-22 15:23 43 ************************
53 2022-12-22 16:22 42 ***********************
54 2022-12-22 17:21 41 **********************
55 2022-12-22 18:20 45 **************************
56 2022-12-22 19:19 43 ************************
57 2022-12-22 20:18 44 *************************
58 2022-12-22 21:17 44 *************************
59 2022-12-22 22:16 43 ************************
60 2022-12-22 23:15 42 ***********************
61 2022-12-23 00:14 42 ***********************
62 2022-12-23 01:13 45 **************************
63 2022-12-23 02:12 43 ************************
64 2022-12-23 03:11 ? -
65 2022-12-23 04:10 32 *************

SCT Error Recovery Control:
Read: 100 (10.0 seconds)
Write: 100 (10.0 seconds)

Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 3 --- Lifetime Power-On Resets
0x01 0x010 4 189 --- Power-on Hours
0x01 0x018 6 2177956465 --- Logical Sectors Written
0x01 0x020 6 2658924 --- Number of Write Commands
0x01 0x028 6 4714428 --- Logical Sectors Read
0x01 0x030 6 9861 --- Number of Read Commands
0x01 0x038 6 - --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 189 --- Spindle Motor Power-on Hours
0x03 0x010 4 36 --- Head Flying Hours
0x03 0x018 4 283 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 2 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x04 0x018 4 0 -D- Physical Element Status Changed
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 37 --- Current Temperature
0x05 0x010 1 38 --- Average Short Term Temperature
0x05 0x018 1 - --- Average Long Term Temperature
0x05 0x020 1 45 --- Highest Temperature
0x05 0x028 1 22 --- Lowest Temperature
0x05 0x030 1 38 --- Highest Average Short Term Temperature
0x05 0x038 1 30 --- Lowest Average Short Term Temperature
0x05 0x040 1 - --- Highest Average Long Term Temperature
0x05 0x048 1 - --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 60 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 5 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 10 --- Number of Hardware Resets
0x06 0x010 4 4 --- Number of ASR Events
0x06 0x018 4 0 --- Number of Interface CRC Errors
0xff ===== = = === == Vendor Specific Statistics (rev 1) ==
0xff 0x008 7 0 --- Vendor Specific
0xff 0x010 7 0 --- Vendor Specific
0xff 0x018 7 0 --- Vendor Specific
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
0x0001 2 0 Command failed due to ICRC error
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
 

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
And this is the second one.
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: ST18000NM000J-2TV103
Serial Number: XXXXXX
LU WWN Device Id: 5 000c50 0e5bbf310
Firmware Version: SN04
User Capacity: 18,000,207,937,536 bytes [18.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Dec 23 04:27:08 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Disabled
ATA Security is: Disabled, NOT FROZEN [SEC1]
Write SCT (Get) Feature Control Command failed: scsi error unsupported field in scsi command
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: (1584) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 082 066 044 - 143758600
3 Spin_Up_Time PO---- 096 096 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 4
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 067 060 045 - 4742283
9 Power_On_Hours -O--CK 100 100 000 - 28
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 3
18 Unknown_Attribute PO-R-- 100 100 050 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 062 059 000 - 38 (Min/Max 32/38)
192 Power-Off_Retract_Count -O--CK 100 100 000 - 2
193 Load_Cycle_Count -O--CK 100 100 000 - 163
194 Temperature_Celsius -O---K 038 041 000 - 38 (0 18 0 0 0)
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
200 Multi_Zone_Error_Rate PO---K 100 100 001 - 0
240 Head_Flying_Hours ------ 100 100 000 - 22 (59 224 0)
241 Total_LBAs_Written ------ 100 253 000 - 15156921030
242 Total_LBAs_Read ------ 100 253 000 - 381810594
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 5 Ext. Comprehensive SMART error log
0x04 GPL R/O 256 Device Statistics log
0x04 SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x08 GPL R/O 2 Power Conditions log
0x09 SL R/W 1 Selective self-test log
0x0a GPL R/W 8 Device Statistics Notification
0x0c GPL R/O 2048 Pending Defects log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x13 GPL R/O 1 SATA NCQ Send and Receive log
0x21 GPL R/O 1 Write stream error log
0x22 GPL R/O 1 Read stream error log
0x24 GPL R/O 768 Current Device Internal Status Data log
0x2f GPL - 1 Set Sector Configuration
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa1 GPL,SL VS 160 Device vendor specific log
0xa2 GPL VS 16320 Device vendor specific log
0xa4 GPL,SL VS 160 Device vendor specific log
0xa6 GPL VS 192 Device vendor specific log
0xa8-0xa9 GPL,SL VS 136 Device vendor specific log
0xab GPL VS 1 Device vendor specific log
0xad GPL VS 16 Device vendor specific log
0xb1 GPL,SL VS 160 Device vendor specific log
0xb6 GPL VS 1920 Device vendor specific log
0xbe-0xbf GPL VS 65535 Device vendor specific log
0xc1 GPL,SL VS 8 Device vendor specific log
0xc3 GPL,SL VS 24 Device vendor specific log
0xc6 GPL VS 5184 Device vendor specific log
0xc7 GPL,SL VS 8 Device vendor specific log
0xc9 GPL,SL VS 8 Device vendor specific log
0xca GPL,SL VS 16 Device vendor specific log
0xcd GPL,SL VS 1 Device vendor specific log
0xce GPL VS 1 Device vendor specific log
0xcf GPL VS 512 Device vendor specific log
0xd1 GPL VS 656 Device vendor specific log
0xd2 GPL VS 10256 Device vendor specific log
0xd4 GPL VS 2048 Device vendor specific log
0xda GPL,SL VS 1 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3 -
# 2 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
Device State: Active (0)
Current Temperature: 37 Celsius
Power Cycle Min/Max Temperature: 32/37 Celsius
Lifetime Min/Max Temperature: 18/41 Celsius
Under/Over Temperature Limit Count: 0/0
SMART Status: 0xc24f (PASSED)
Vendor specific:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version: 2
Temperature Sampling Period: 4 minutes
Temperature Logging Interval: 59 minutes
Min/Max recommended Temperature: 10/40 Celsius
Min/Max Temperature Limit: 5/60 Celsius
Temperature History Size (Index): 128 (36)

Index Estimated Time Temperature Celsius
37 2022-12-17 23:17 ? -
... ..( 89 skipped). .. -
127 2022-12-21 15:47 ? -
0 2022-12-21 16:46 26 *******
1 2022-12-21 17:45 ? -
2 2022-12-21 18:44 18 -
3 2022-12-21 19:43 ? -
4 2022-12-21 20:42 23 ****
5 2022-12-21 21:41 ? -
6 2022-12-21 22:40 23 ****
7 2022-12-21 23:39 33 **************
8 2022-12-22 00:38 33 **************
9 2022-12-22 01:37 36 *****************
10 2022-12-22 02:36 36 *****************
11 2022-12-22 03:35 37 ******************
... ..( 5 skipped). .. ******************
17 2022-12-22 09:29 37 ******************
18 2022-12-22 10:28 39 ********************
19 2022-12-22 11:27 37 ******************
20 2022-12-22 12:26 39 ********************
21 2022-12-22 13:25 38 *******************
22 2022-12-22 14:24 35 ****************
23 2022-12-22 15:23 35 ****************
24 2022-12-22 16:22 39 ********************
25 2022-12-22 17:21 39 ********************
26 2022-12-22 18:20 38 *******************
27 2022-12-22 19:19 36 *****************
28 2022-12-22 20:18 37 ******************
29 2022-12-22 21:17 36 *****************
30 2022-12-22 22:16 37 ******************
31 2022-12-22 23:15 34 ***************
32 2022-12-23 00:14 35 ****************
33 2022-12-23 01:13 36 *****************
34 2022-12-23 02:12 35 ****************
35 2022-12-23 03:11 ? -
36 2022-12-23 04:10 32 *************

SCT Error Recovery Control:
Read: 100 (10.0 seconds)
Write: 100 (10.0 seconds)

Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 3 --- Lifetime Power-On Resets
0x01 0x010 4 28 --- Power-on Hours
0x01 0x018 6 15156919030 --- Logical Sectors Written
0x01 0x020 6 10958630 --- Number of Write Commands
0x01 0x028 6 381834698 --- Logical Sectors Read
0x01 0x030 6 1008725 --- Number of Read Commands
0x01 0x038 6 - --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 28 --- Spindle Motor Power-on Hours
0x03 0x010 4 21 --- Head Flying Hours
0x03 0x018 4 163 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 2 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x04 0x018 4 0 -D- Physical Element Status Changed
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 38 --- Current Temperature
0x05 0x010 1 - --- Average Short Term Temperature
0x05 0x018 1 - --- Average Long Term Temperature
0x05 0x020 1 40 --- Highest Temperature
0x05 0x028 1 18 --- Lowest Temperature
0x05 0x030 1 - --- Highest Average Short Term Temperature
0x05 0x038 1 - --- Lowest Average Short Term Temperature
0x05 0x040 1 - --- Highest Average Long Term Temperature
0x05 0x048 1 - --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 60 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 5 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 10 --- Number of Hardware Resets
0x06 0x010 4 3 --- Number of ASR Events
0x06 0x018 4 0 --- Number of Interface CRC Errors
0xff ===== = = === == Vendor Specific Statistics (rev 1) ==
0xff 0x008 7 0 --- Vendor Specific
0xff 0x010 7 0 --- Vendor Specific
0xff 0x018 7 0 --- Vendor Specific
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
0x0001 2 0 Command failed due to ICRC error
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS

(I am not going to bother running a SMART long test on the other drive.)

After the test finished I went looking for firmware and driver updates, and found that my HBA was on some incredibly old firmware.
I was stuck at version 5.00.00.00, while the current firmware for this specific board is 24.00.00.00. Consider that there are 12 releases just to go from 23.00.00.00 to 24.00.00.00, so I was 60-70 versions behind at least :eek:

So I shut down the rig, unplugged the HBA, updated both the firmware and the BIOS through storcli in an EFI shell, and plugged it back in.
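
For the record, the flash itself boiled down to something like the following in the EFI shell. The firmware file name is the SAS+SATA profile image Broadcom ships for this card; the exact storcli syntax is from memory, so double-check it against the 9405W documentation before copying:

storcli.efi /c0 show                                              # confirm the controller enumerates
storcli.efi /c0 download file=HBA_9405W-16i_SAS_SATA_Profile.bin  # flash the firmware
storcli.efi /c0 download bios file=<bios_image>                   # flash the BIOS (placeholder file name)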

Now that I've upgraded such a major component inside my rig, I've kind of reset my expectations, so I am now resilvering the new pool to see if I still get these weird errors.
After the resilver, I could replicate the same situation (destroy the pool, recreate it and start a 6TB transfer) to see if it happens again.

Last thing worth noting: this card gets extremely hot. It was very hot to the touch when I pulled it out of the server, and I can see how such high temps could lead to some weird errors, but that's just guesswork.
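
If anyone wants to keep an eye on that, storcli can report the controller temperature on cards that expose the sensor; I believe the 9405W does, but treat that as an assumption rather than something I've verified:

storcli64 /c0 show temperature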
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
"A Fail event had been detected on md device /dev/md12x"
Doesn't the md12x indicate a RAID controller / using RAID?
You are using a tri-mode controller - which is (for NAS storage) kinda new and not so well tested.
Also if its overheating, then you want to do something about that quite quickly. If HBA's overheat then that can and will cause errors and eventually kill the card.
 

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
"A Fail event had been detected on md device /dev/md12x"
Doesn't the md12x indicate a RAID controller / using RAID?
You are using a tri-mode controller - which is (for NAS storage) kinda new and not so well tested.
Also if its overheating, then you want to do something about that quite quickly. If HBA's overheat then that can and will cause errors and eventually kill the card.
About the TriMode: yeah, I am kind of taking a bullet for the community with it.
One thing though: once I studied all the necessary firmware stuff, I realized Broadcom gives you two firmware images:
  • HBA_9405W-16i_Mixed_Profile.bin
  • HBA_9405W-16i_SAS_SATA_Profile.bin
Mixed Profile is the full-fledged TriMode, while the SAS_SATA profile should be the IT mode for this card.
I obviously flashed the card with the SAS+SATA-only profile firmware.

Also, running dmesg | grep mpt3sas gives the following output:
[ 1.478861] mpt3sas version 39.100.00.00 loaded
[ 1.479324] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (131651784 kB)
[ 1.532290] mpt3sas_cm0: MSI-X vectors supported: 128
[ 1.533204] mpt3sas_cm0: 0 20 20
[ 1.534076] mpt3sas_cm0: High IOPs queues : disabled
[ 1.534383] mpt3sas0-msix0: PCI-MSI-X enabled: IRQ 179
[ 1.534687] mpt3sas0-msix1: PCI-MSI-X enabled: IRQ 180
[ 1.534987] mpt3sas0-msix2: PCI-MSI-X enabled: IRQ 181
[ 1.535284] mpt3sas0-msix3: PCI-MSI-X enabled: IRQ 182
[ 1.535576] mpt3sas0-msix4: PCI-MSI-X enabled: IRQ 183
[ 1.535868] mpt3sas0-msix5: PCI-MSI-X enabled: IRQ 184
[ 1.536151] mpt3sas0-msix6: PCI-MSI-X enabled: IRQ 185
[ 1.536423] mpt3sas0-msix7: PCI-MSI-X enabled: IRQ 186
[ 1.536694] mpt3sas0-msix8: PCI-MSI-X enabled: IRQ 187
[ 1.536957] mpt3sas0-msix9: PCI-MSI-X enabled: IRQ 188
[ 1.537287] mpt3sas0-msix10: PCI-MSI-X enabled: IRQ 189
[ 1.537546] mpt3sas0-msix11: PCI-MSI-X enabled: IRQ 190
[ 1.537800] mpt3sas0-msix12: PCI-MSI-X enabled: IRQ 191
[ 1.538054] mpt3sas0-msix13: PCI-MSI-X enabled: IRQ 192
[ 1.538298] mpt3sas0-msix14: PCI-MSI-X enabled: IRQ 193
[ 1.538534] mpt3sas0-msix15: PCI-MSI-X enabled: IRQ 194
[ 1.538766] mpt3sas0-msix16: PCI-MSI-X enabled: IRQ 195
[ 1.538998] mpt3sas0-msix17: PCI-MSI-X enabled: IRQ 196
[ 1.539217] mpt3sas0-msix18: PCI-MSI-X enabled: IRQ 197
[ 1.539431] mpt3sas0-msix19: PCI-MSI-X enabled: IRQ 198
[ 1.539640] mpt3sas_cm0: iomem(0x0000006000100000), mapped(0x000000006f9054dc), size(1048576)
[ 1.539859] mpt3sas_cm0: ioport(0x0000000000004000), size(256)
[ 1.595941] mpt3sas_cm0: sending message unit reset !!
[ 1.597771] mpt3sas_cm0: message unit reset: SUCCESS
[ 1.624300] mpt3sas_cm0: scatter gather: sge_in_main_msg(1), sge_per_chain(7), sge_per_io(128), chains_per_io(19)
[ 1.626184] mpt3sas_cm0: request pool(0x00000000c5de6466) - dma(0x101800000): depth(6756), frame_size(128), pool_size(844 kB)
[ 1.743948] mpt3sas_cm0: sense pool(0x00000000cfbaf0ae) - dma(0x101900000): depth(6551), element_size(96), pool_size (614 kB)
[ 1.744183] mpt3sas_cm0: sense pool(0x00000000cfbaf0ae)- dma(0x101900000): depth(6551),element_size(96), pool_size(4 kB)
[ 1.744743] mpt3sas_cm0: reply pool(0x000000005923b377) - dma(0x101a00000): depth(6820), frame_size(128), pool_size(852 kB)
[ 1.745197] mpt3sas_cm0: config page(0x00000000fdec4d74) - dma(0x1334b6000): size(512)
[ 1.745435] mpt3sas_cm0: Allocated physical memory: size(31938 kB)
[ 1.745674] mpt3sas_cm0: Current Controller Queue Depth(6548),Max Controller Queue Depth(6656)
[ 1.745924] mpt3sas_cm0: Scatter Gather Elements per IO(128)
[ 1.880478] mpt3sas_cm0: _base_display_fwpkg_version: complete
[ 1.880785] mpt3sas_cm0: FW Package Ver(24.00.00.00)
[ 1.881688] mpt3sas_cm0: SAS3616: FWVersion(24.00.00.00), ChipRevision(0x02), BiosVersion(09.09.00.00)
[ 1.882040] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Diag Trace Buffer,Task Set Full,NCQ)
[ 1.882827] mpt3sas 0000:01:00.0: Max SCSIIO MPT commands: 6548 shared with nr_hw_queues = 20
[ 1.893400] mpt3sas_cm0: registering trace buffer support
[ 1.898289] mpt3sas_cm0: Trace buffer memory 2048 KB allocated
[ 1.898619] mpt3sas_cm0: sending port enable !!
[ 1.899399] mpt3sas_cm0: hba_port entry: 000000006767e113, port: 0 is added to hba_port list
[ 1.899850] mpt3sas_cm0: hba_port entry: 00000000ba342ab9, port: 2 is added to hba_port list
[ 1.900289] mpt3sas_cm0: hba_port entry: 00000000a54f32ef, port: 1 is added to hba_port list
[ 1.901060] mpt3sas_cm0: hba_port entry: 00000000267131b2, port: 8 is added to hba_port list
[ 1.901586] mpt3sas_cm0: hba_port entry: 00000000c6472a4e, port: 10 is added to hba_port list
[ 1.902319] mpt3sas_cm0: hba_port entry: 000000004b941e8d, port: 9 is added to hba_port list
[ 1.902757] mpt3sas_cm0: vphy entry: 00000000d95670ed, port id: 0, phy:16 is added to port's vphys_list
[ 1.903456] mpt3sas_cm0: host_add: handle(0x0001), sas_addr(0x500605b00fe37e20), phys(21)
[ 1.904193] mpt3sas_cm0: handle(0x17) sas_address(0x5000c500ca8199a9) port_type(0x1)
[ 1.911482] mpt3sas_cm0: handle(0x1c) sas_address(0x300605b00fe37e21) port_type(0x1)
[ 1.919195] mpt3sas_cm0: handle(0x19) sas_address(0x5000c500d85c582d) port_type(0x1)
[ 1.928365] mpt3sas_cm0: handle(0x18) sas_address(0x5000c500cabf7dc5) port_type(0x1)
[ 1.937164] mpt3sas_cm0: handle(0x1a) sas_address(0x5000c500d88548ad) port_type(0x1)
[ 1.946236] mpt3sas_cm0: handle(0x1b) sas_address(0x300605b00fe37e2f) port_type(0x1)
[ 1.954966] mpt3sas_cm0: handle(0x11) sas_address(0x300705b00fe37e20) port_type(0x0)
[ 1.976288] mpt3sas_cm0: port enable: SUCCESS

Worth noting are the lines
Code:
[    1.880785] mpt3sas_cm0: FW Package Ver(24.00.00.00)
[    1.881688] mpt3sas_cm0: SAS3616: FWVersion(24.00.00.00), ChipRevision(0x02), BiosVersion(09.09.00.00)
[    1.882040] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Diag Trace Buffer,Task Set Full,NCQ)


where you can see the card is indeed in IT mode --> Protocol=(Initiator,Target).

As for the high temps, I will see if I can install a fan directly blowing onto the card heatsink.
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
"A Fail event had been detected on md device /dev/md12x"
Doesn't the md12x indicate a RAID controller / using RAID?
...
Linux uses /dev/mdXXX for MD-RAID software RAID; it has nothing to do with the controller or HBA.

In TrueNAS SCALE, swap space is mirrored with MD-RAID, so I would assume any and all use of MD-RAID devices on TrueNAS SCALE is for swap. This is shown below (though I don't know why one is MUCH larger than the others):
Code:
[15351.902026] Adding 2095036k swap on /dev/mapper/md127.  Priority:-2 extents:1 across:2095036k FS
[15352.553954] Adding 2095036k swap on /dev/mapper/md126.  Priority:-3 extents:1 across:2095036k FS
[15354.081531] Adding 2095036k swap on /dev/mapper/md125.  Priority:-4 extents:1 across:2095036k FS
[15359.073898] Adding 2095036k swap on /dev/mapper/md124.  Priority:-5 extents:1 across:2095036k FS
[15359.180002] Adding 16759804k swap on /dev/mapper/md123.  Priority:-6 extents:1 across:16759804k SSFS
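If you want to confirm what is backing those md devices, the usual Linux tools work fine on SCALE (nothing SCALE-specific here):
Code:
cat /proc/mdstat                    # lists each md array and its member partitions
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT  # shows how the md devices map onto the disks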
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
This is shown below (though I don't know why one is MUCH larger than the others):
SCALE will offer to create a 16GB swap partition on the boot device if it detects one larger than a certain size (32GB I believe)
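You can check what the installer actually created by listing the boot device's partitions (the device name here is only an example):
Code:
lsblk -o NAME,SIZE,TYPE /dev/sda  # look for a ~16GB partition alongside the boot-pool partitions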

Last thing worth noting, this card gets extremely hot. It was very hot to the touch when I plugged it out of the server and I can see how such high temps can lead to some weird error, but it's just guesswork.
If the heatsink is hot enough that you're concerned about being burned by it, that can most definitely lead to errors and corrupted data. I have to assume that LSI still has not added a temperature probe or thermal throttling to even their newest cards, sadly.

This has historically been a challenge when these HBAs (designed for rackmount or server chassis) are placed into a tower-style or "gaming" focused chassis. Server components are often built assuming the presence of loud, powerful fans with lots of static pressure and linear feet per minute of airflow; those same factors are often minimized for office workstations or machines intended to sit closer to anyone with a working pair of ears.
 

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
Linux uses /dev/mdXXX for MD-RAID software RAID; it has nothing to do with the controller or HBA.

In TrueNAS SCALE, swap space is mirrored with MD-RAID, so I would assume any and all use of MD-RAID devices on TrueNAS SCALE is for swap. This is shown below (though I don't know why one is MUCH larger than the others):
Code:
[15351.902026] Adding 2095036k swap on /dev/mapper/md127.  Priority:-2 extents:1 across:2095036k FS
[15352.553954] Adding 2095036k swap on /dev/mapper/md126.  Priority:-3 extents:1 across:2095036k FS
[15354.081531] Adding 2095036k swap on /dev/mapper/md125.  Priority:-4 extents:1 across:2095036k FS
[15359.073898] Adding 2095036k swap on /dev/mapper/md124.  Priority:-5 extents:1 across:2095036k FS
[15359.180002] Adding 16759804k swap on /dev/mapper/md123.  Priority:-6 extents:1 across:16759804k SSFS
Since I am no expert in this field, let me get this straight: the mdXXX devices I see are just the swap partitions for my pools?
Also, FWIW, in my case md123 should be the boot-pool swap partition, which is indeed 16GB in size.
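(If anyone wants to double-check the same mapping on their own box, mdadm should show which partitions back a given md device; md123 is just my case:)
Code:
mdadm --detail /dev/md123  # lists the member partitions, and thus the disks, of the array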
 

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
SCALE will offer to create a 16GB swap partition on the boot device if it detects one larger than a certain size (32GB I believe)


If the heatsink is hot enough that you're concerned about being burned by it, that can most definitely lead to errors and corrupted data. I have to assume that LSI still has not added a temperature probe or thermal throttling to even their newest cards, sadly.

This has historically been a challenge when these HBAs (designed for rackmount or server chassis) are placed into a tower-style or "gaming" focused chassis. Server components are often built assuming the presence of loud, powerful fans with lots of static pressure and linear feet per minute of airflow; those same factors are often minimized for office workstations or machines intended to sit closer to anyone with a working pair of ears.
Thanks for the feedback. I tried to create some airflow with 10x Noctua NF-A14 fans, all 140x140 mm, but yes, static pressure is a totally different matter compared to a rack chassis.
And since I opened up my case yesterday, I noticed that my HBA specifically is not getting much air, because it's sitting right below the Noctua NH-D15 and the exhaust fan. I will look into some solution for positioning an additional fan in front of the HBA.

Now that I have upgraded such a major component inside my rig, I have kind of reset my expectations, so I am now resilvering the new pool to see if I still get these weird errors.
as an update on this, resilvering finished and guess what:

Code:
ZFS has finished a resilver:

eid: 45
class: resilver_finish
host: truenas
time: 2022-12-23 16:29:06+0100
pool: exos-x18-c
state: ONLINE
scan: resilvered 6.95T in 12:21:54 with 0 errors on Fri Dec 23 16:29:06 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        exos-x18-c                                ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            5c736f36-b9e2-4c47-8c2f-6eea4732f2cd  ONLINE       0     0     0
            52615843-7ef5-4055-ba07-9e3a105168c3  ONLINE       0     0     0

errors: No known data errors


Considering resilvering is comparable to a scrub, I expected to see some errors if the drives were indeed defective, but everything went fast and smooth, and SMART tests keep coming back clean.
At this point, I am tempted to recreate the condition that triggered the initial error, i.e. destroy the pool and let a new 6TB data transfer run overnight.
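In the meantime, a manual scrub is a cheaper way to re-verify the pool than another 6TB transfer (the pool name is from my setup):
Code:
zpool scrub exos-x18-c      # read back and verify all allocated data
zpool status -v exos-x18-c  # watch progress and per-device error counters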
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Since I am no expert in this field, let me get this straight: the mdXXX devices I see are just the swap partitions for my pools?
Also, FWIW, in my case md123 should be the boot-pool swap partition, which is indeed 16GB in size.
Swap is like pseudo-memory. If you use too much memory and the ZFS ARC can't be reduced any further, the OS starts swapping memory out to swap space. The same MD-RAID device naming convention is used for all swap spaces, whether on boot devices or data pool devices. ZFS will never live on /dev/mdXXX devices (unless you manually make that happen, and that is not a good idea).
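You can also see whether swap is actually being touched at all:
Code:
swapon --show  # active swap devices and how much of each is in use
free -h        # overall memory vs. swap usage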

@HoneyBadger's comment about the 16GB swap space on the boot device makes more sense. It's a new feature that does not (or did not) exist on Core. Glad it was added; some of us have largish & reliable boot devices.
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
SCALE will offer to create a 16GB swap partition on the boot device if it detects one larger than a certain size
Is it recommended to create that swap partition? I never do, since the system should rely on the large amount of memory available.

[screenshot of system memory usage]


Some of us have largish & reliable boot devices.
Same here. I'm wondering if I made a bad move by not adding that partition on the clean Bluefin install, just to follow best practices...
 
Last edited:

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Is it recommended to create that swap partition? I never do, since the system should rely on the large amount of memory available.

[screenshot of system memory usage]


Same here. I'm wondering if I made a bad move by not adding that partition on the clean Bluefin install, just to follow best practices...

I have not seen cases where it's necessary when you have so much RAM and plenty of free/ARC space.
 

skylus

Cadet
Joined
Dec 18, 2015
Messages
1
as an update on this, resilvering finished and guess what:

Code:
ZFS has finished a resilver:
...
errors: No known data errors


Considering resilvering is comparable to a scrub, I expected to see some errors if the drives were indeed defective, but everything went fast and smooth, and SMART tests keep coming back clean.
At this point, I am tempted to recreate the condition that triggered the initial error, i.e. destroy the pool and let a new 6TB data transfer run overnight.
I read your thread with much interest, because what you describe is a rather unlikely event. However, when I read about the emails you received, it instantly rang a bell. I have received similar emails from my TrueNAS box and was initially quite confused, to say the least. In my case, the emails are always sent right when the main pool (6x 16TB in RAIDZ2) is exported (to reduce noise). Since I have this export scheduled at night via cron, the emails just show up in my inbox in the morning.

I am currently in the tedious process of removing the 2GB swap partitions from the drives comprising this pool, by taking one drive offline, wiping and replacing it, and then sitting through a 24-hour resilver, following this hint by sretalla.
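Per disk, the cycle is roughly the sketch below (names are placeholders, and on TrueNAS it is generally safer to drive the offline/wipe/replace steps from the GUI than with raw zpool commands):
Code:
zpool offline tank <old-member-id>               # take the member with the swap partition offline
# wipe the disk (Storage -> Disks -> Wipe in the GUI), then start the replacement
zpool replace tank <old-member-id> <wiped-disk>
zpool status tank                                # wait out the ~24 hour resilver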

To keep TrueNAS from creating swap partitions on new pool drives, I used this midclt command provided by sretalla as the GUI does not expose this option in TrueNAS SCALE.
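For reference, it boils down to something like this; check sretalla's post for the exact invocation, since this is from memory:
Code:
midclt call system.advanced.update '{"swapondrive": 0}'  # stop creating swap partitions on new pool members
midclt call system.advanced.config | jq '.swapondrive'   # verify; should now print 0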
 

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
I read your thread with much interest, because what you describe is a rather unlikely event. However, when I read about the emails you received, it instantly rang a bell. I have received similar emails from my TrueNAS box and was initially quite confused, to say the least. In my case, the emails are always sent right when the main pool (6x 16TB in RAIDZ2) is exported (to reduce noise). Since I have this export scheduled at night via cron, the emails just show up in my inbox in the morning.

I am currently in the tedious process of removing the 2GB swap partitions from the drives comprising this pool, by taking one drive offline, wiping and replacing it, and then sitting through a 24-hour resilver, following this hint by sretalla.

To keep TrueNAS from creating swap partitions on new pool drives, I used this midclt command provided by sretalla as the GUI does not expose this option in TrueNAS SCALE.
Thanks for the feedback. After @Arwen's explanation of swap partition naming and your reply, it's clear that in my case something happened that caused all the swap partitions to degrade at once. Since I only use mirrors, every swap array degraded with just one member of each mirror pair going down. And since these were only swap partitions, this also explains why zpool status didn't show anything wrong with the pools at that moment.

I still can't explain why this happened, but the slew of write errors on the new disks suspiciously appeared 30 minutes before the swap partitions degraded. As I said, I realized I had incredibly ancient firmware on my HBA (5 vs 24), and reading the release notes of the subsequent firmware updates, I realized the coexistence of SAS, SATA and NVMe on that board was vastly improved over time. I mention this because this was the first SATA pool I put on that HBA, the other two being all SAS drives, and mixing SAS and SATA drives on such outdated firmware might have created coexistence issues.

After I flashed my HBA firmware and BIOS to the latest version, I haven't had a single issue with the new drives or with SCALE in general; it went back to just work™. Actually, I feel overall stability and transfer speeds have both improved, so maybe this error was in fact useful in signaling that something wasn't behaving properly. By the way, I performed a resilver of the new pool, then a full scrub a day later, plus SMART short tests each day, and there were no errors whatsoever. Everything is working as expected, with no issues.

At the same time, I really have no definitive answer. I saved a whole TrueNAS SCALE debug archive right after it happened, so maybe the explanation is hidden inside those logs, but someone more knowledgeable than me would need to take a look at them. I do hope it's tied to the HBA firmware, so that I can forget about it and move on :cool:



@skylus I understand what you are doing: you are removing all the swap partitions in order to avoid those alarming emails when routinely exporting your pools. I don't have this requirement, nor did this happen after exporting my pools, so for now I see no reason to disable swap, but thanks for the links, they might come in handy.


P.S.: since the problem involved swap partitions specifically, I changed the title of the thread to reflect that.
 
Last edited:

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
I see there is a thread with users complaining about receiving the same weird messages after upgrading to Bluefin:
Fail event had been detected on md device /dev/md127
The author describes exactly what happened to me: emails with Fail events while zpool status nevertheless comes back clean.

Well, I would be relieved if this turned out to be a nothingburger.
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
I see no reason in disabling swaps
There should not be any swap partitions on ZFS drives; the only swap needed is the 16GB partition created on the OS disk, like @HoneyBadger mentioned. In 2020, iX Systems disabled the swap option in the UI and set "swapondrive": 0 in TrueNAS Core. Somehow this did not make it into SCALE, @morganL?
Code:
# midclt call system.advanced.config | jq '.swapondrive'
2

The "swapondrive": 0 setting is needed because if a disk with swap partition fails, it will force the swap to be used. Also, spares will always be 2GB too small to do a disk replacement. I don't understand why this changed in Scale, this was a method tested in battle, to have the swapondrive set to 0.

Now I'm forced to wipe one disk at a time and resilver, like @skylus. It should take me 2 weeks, living on the edge with a degraded RaidZ2 array... Really not happy.
 
Last edited:

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
There should not be any swap partitions on ZFS drives; the only swap needed is the 16GB partition created on the OS disk, like @HoneyBadger mentioned. In 2020, iX Systems disabled the swap option in the UI and set "swapondrive": 0 in TrueNAS Core. Somehow this did not make it into SCALE, @morganL?
Code:
# midclt call system.advanced.config | jq '.swapondrive'
2

The "swapondrive": 0 setting is needed because if a disk with swap partition fails, it will force the swap to be used. Also, spares will always be 2GB too small to do a disk replacement. I don't understand why this changed in Scale, this was a method tested in battle, to have the swapondrive set to 0.

Now I'm forced to wipe one disk at a time and resilver, like @skylus. It should take me 2 weeks, living on the edge with a degraded RaidZ2 array... Really not happy.
OK, I did not know there shouldn't be swap partitions on pools. In my case, I definitely have 2GB swap partitions on every pool and a 16GB swap on my boot-pool (chosen during install). Also, I have always had them, since I haven't destroyed/recreated my pools.
Is this considered a "bug", or is it by design in SCALE (letting swap partitions live on pools might not necessarily be a wrong setting)?
Aside from this hiccup, I never had a problem even with 2GB swap partitions on my pools, so why is it recommended to remove them?

more confused than before :oops:
 