Pool keeps degrading

buswedg · Sep 21, 2022

Still no change.

Code:

  pool: storage-pool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 19:18:38 with 0 errors on Mon Sep 19 19:18:39 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        storage-pool                              DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            1766c859-9231-4127-ad35-882937845b76  ONLINE       0     0     0
            89b557d9-cc55-463c-9124-3765cff9aaac  ONLINE       0     0     0
            4a07f2eb-442c-430d-9db6-0e5fd80302a9  ONLINE       0     0     0
            c6287e21-49e0-4ffe-86a1-78a33bbbec70  ONLINE       0     0     0
            1c817e68-35de-4b8c-be59-4e98fc1fc9fe  ONLINE       0     0     0
            28dba822-3f63-4e2c-82ad-8e4467bc81a1  FAULTED     26     0     0  too many errors
        cache
          f5456732-9153-44fb-977b-fe5f1021b959    ONLINE       0     0     0

Code:

[261874.964229] ata4.00: exception Emask 0x0 SAct 0x1b81000 SErr 0x0 action 0x0
[261874.972085] ata4.00: irq_stat 0x40000008
[261874.976533] ata4.00: failed command: READ FPDMA QUEUED
[261874.982197] ata4.00: cmd 60/40:98:d8:94:cb/06:00:f5:05:00/40 tag 19 ncq dma 819200 in
                         res 43/40:40:d8:94:cb/00:06:f5:05:00/40 Emask 0x409 (media error) <F>
[261874.999284] ata4.00: status: { DRDY SENSE ERR }
[261875.004335] ata4.00: error: { UNC }
[261875.052509] ata4.00: configured for UDMA/133
[261875.057606] sd 4:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=7s
[261875.067742] sd 4:0:0:0: [sdd] tag#19 Sense Key : Medium Error [current]
[261875.075302] sd 4:0:0:0: [sdd] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
[261875.084929] sd 4:0:0:0: [sdd] tag#19 CDB: Read(16) 88 00 00 00 00 05 f5 cb 94 d8 00 00 06 40 00 00
[261875.094739] blk_update_request: I/O error, dev sdd, sector 25598596312 op 0x0:(READ) flags 0x700 phys_seg 25 prio class 0
[261875.106426] zio pool=storage-pool vdev=/dev/disk/by-partuuid/28dba822-3f63-4e2c-82ad-8e4467bc81a1 error=5 type=1 offset=13104333762560 size=819200 flags=40080c80
[261875.122046] ata4: EH complete

buswedg · Oct 3, 2022

Hopefully my last update for those interested -- But I ended up putting in my friends lenovo 420-8i over a week ago and I haven't seen an error since.

This whole thing still has me scratching my head however, as I saw errors when try two different HBAs prior (both flashed LSI 9208-8i) and also saw errors when I had all the drives plugged direct into the mb sata headers.

Either way, I'm going to give it another week and if I all remains well, I'll buy myself a 420-16i. That way, I should have pretty good confidence that all will remain well, and I'll also be able to do away with the expander board for when I decide to add another pool of drives.

brando56894 · Oct 3, 2022

I've had this issue too as well for many months and could never trace it back to what the issue was.

My server:
Asrock Rack X399D8A-2T
Threadripper 2 2970wx
4x 32 GB ECC Micron 18ASF2G72AZ-2G6D1
LSI 9208-16i HBA flashed to IT mode
12x 8 TB WD RED in RAIDZ2 for my Storage pool
4x 6 TB WD RED in RAIDZ for my Tank pool
2x 5 TB Seagate ST5000LM000-2U8170 mirrored for my Requests pool

It doesn't really seem to matter what bus the drives are connected to and it doesn't seem to be model specific, also I don't remember having any of these issues when I was using Arch Linux with OpenZFS for many months. This leads me to believe it may be something specific to TrueNAS SCALE.

It doesn't happen often, maybe every month or so a drive or three will show up as bad and be kicked out of the pool, a power off and/or scrub usually fixes it. I haven't noticed any issues in a while (but then again I haven't been checking), I was away this weekend for a friend's bachelor party and when I got back, his fiancee said that Plex was down all weekend. I looked into and all three of my HDD based pools were degraded. Storage had 2 drives in the same VDEV drop out, tank had one drive drop out, and one seagate drive dropped out of the requests pool. I had switched to an ATTO 12 Gbps HBA a few months ago when all this started to happen, so maybe that may be the culprit since I don't really remember having this many issues with the LSI card, although it keeps telling me something like "Offset 0x8e bus degraded" that's been showing ever since I got the card, other people have seen it too and don't really know what the issue is, hence the reason I tried the ATTO card. I've had both of the Seagate drives connected to an internal U.2 port and those were giving me a lot of issues as well and would constantly drop out, once I switched to use the standard SATA ports I didn't see any issues from what I remember. All the WD drives have SCT Error Recover Control set to 70, the Seagate drives don't support it.

I just swapped back in the LSI HBA and I'm in the process of scrubbing and resilvering my pools, Seagate already recovered and is healthy. I'll keep an eye on them and see if this keeps happening with the LSI card.

Arwen · Oct 3, 2022

@brando56894 - Please clarify what type of WD RED you are using, CMR or SMR.

Because of Western Digital's bad behavior, we will be on this for years to come.

whodat · Oct 3, 2022

I have recently spent countless troubleshooting hours over many months with much the same issues affecting my various ZFS pool drives. Naturally thinking the issues related to my SATA hardware I methodically and systematically replaced and tested all drives, SATA cables, and also the SATA adaptor, without any luck. I even swapped drives to different SATA adaptors, but the issues persisted. I have lost count of how many times I have re-installed TrueNAS SCALE in order to create a fresh boot-pool.

Eventually I realised that by physically swapping drives to specific SATA and PSU connectors in my system I could reliably reproduce the issue in particular locations i.e. any drive swapped to my 3rd or 5th drive bay would eventually fault... So I replaced my PSU SATA power cable for those bays, which somehow prevented my host from booting at all - the case and PSU fans would spin momentarily but the machine would not POST.

Through process of elimination, I determined that certain PSU SATA power cables connected to certain modular female SATA power ports on my PSU were causing the POST issue. I removed all of these PSU SATA power cables from my system, and also avoided using the modular PSU SATA power ports which were related to the POST issue.

So far (5+ days and counting) my drives and pools have remained healthy, where previously I would have definitely encountered a faulted drive/ degraded pool within 24 hours of uptime.

FYI my PSU is a Corsair HX750i, now 8+ years old. I will be replacing it shortly!

I hope this might help your troubleshooting!

brando56894 · Oct 8, 2022

Arwen said:
@brando56894 - Please clarify what type of WD RED you are using, CMR or SMR.

Because of Western Digital's bad behavior, we will be on this for years to come.

They're all CMR. I finished scrubbing/resilvers all the pools and after two days one of the 6 TB REDs in my smaller RAIDZ1 Tank pool has been marked as faulted again since there are 18 read errors and 13,801 write errors. SMART shows that the disk is healthy, but it does show that one ATA error has occurred though.

Error 1 occurred at disk power-on lifetime: 26122 hours (1088 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 e0 a0 00 40 e0 Error: UNC 224 sectors at LBA = 0x004000a0 = 4194464

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 e0 a0 00 40 e0 08 19d+03:22:41.126 READ DMA
ca 00 10 90 02 40 e0 08 19d+03:22:41.126 WRITE DMA

All the short offline tests show that they've completed without error.

I'm seeing a bunch of IO errors in dmesg from another HDD (same model) in that pool, but TrueNAS only shows that is has two write errors

[Wed Oct 5 07:15:13 2022] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[Wed Oct 5 07:15:13 2022] ata9.00: irq_stat 0x40000001
[Wed Oct 5 07:15:13 2022] ata9.00: failed command: READ DMA
[Wed Oct 5 07:15:13 2022] ata9.00: cmd c8/00:08:e8:cc:00/00:00:00:00:00/e0 tag 21 dma 4096 in
res 51/40:08:e8:cc:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[Wed Oct 5 07:15:13 2022] ata9.00: status: { DRDY ERR }
[Wed Oct 5 07:15:13 2022] ata9.00: error: { UNC }
[Wed Oct 5 07:15:13 2022] ata9.00: configured for UDMA/133
[Wed Oct 5 07:15:13 2022] sd 9:0:0:0: [sdt] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
[Wed Oct 5 07:15:13 2022] sd 9:0:0:0: [sdt] tag#21 Sense Key : Medium Error [current]
[Wed Oct 5 07:15:13 2022] sd 9:0:0:0: [sdt] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
[Wed Oct 5 07:15:13 2022] sd 9:0:0:0: [sdt] tag#21 CDB: Read(16) 88 00 00 00 00 00 00 00 cc e8 00 00 00 08 00 00
[Wed Oct 5 07:15:13 2022] blk_update_request: I/O error, dev sdt, sector 52456 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
[Wed Oct 5 07:15:13 2022] ata9: EH complete
[Wed Oct 5 07:15:21 2022] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[Wed Oct 5 07:15:21 2022] ata9.00: irq_stat 0x40000001
[Wed Oct 5 07:15:21 2022] ata9.00: failed command: READ DMA
[Wed Oct 5 07:15:21 2022] ata9.00: cmd c8/00:08:48:cd:00/00:00:00:00:00/e0 tag 3 dma 4096 in
res 51/40:08:48:cd:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[Wed Oct 5 07:15:21 2022] ata9.00: status: { DRDY ERR }
[Wed Oct 5 07:15:21 2022] ata9.00: error: { UNC }
[Wed Oct 5 07:15:21 2022] ata9.00: configured for UDMA/133
[Wed Oct 5 07:15:21 2022] sd 9:0:0:0: [sdt] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
[Wed Oct 5 07:15:21 2022] sd 9:0:0:0: [sdt] tag#3 Sense Key : Medium Error [current]
[Wed Oct 5 07:15:21 2022] sd 9:0:0:0: [sdt] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
[Wed Oct 5 07:15:21 2022] sd 9:0:0:0: [sdt] tag#3 CDB: Read(16) 88 00 00 00 00 00 00 00 cd 48 00 00 00 08 00 00
[Wed Oct 5 07:15:21 2022] blk_update_request: I/O error, dev sdt, sector 52552 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
[Wed Oct 5 07:15:21 2022] ata9: EH complete
[Wed Oct 5 07:15:32 2022] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[Wed Oct 5 07:15:32 2022] ata9.00: irq_stat 0x40000001
[Wed Oct 5 07:15:32 2022] ata9.00: failed command: READ DMA
[Wed Oct 5 07:15:32 2022] ata9.00: cmd c8/00:08:b8:d2:00/00:00:00:00:00/e0 tag 27 dma 4096 in
res 51/40:08:b8:d2:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[Wed Oct 5 07:15:32 2022] ata9.00: status: { DRDY ERR }
[Wed Oct 5 07:15:32 2022] ata9.00: error: { UNC }
[Wed Oct 5 07:15:32 2022] ata9.00: configured for UDMA/133
[Wed Oct 5 07:15:32 2022] sd 9:0:0:0: [sdt] tag#27 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
[Wed Oct 5 07:15:32 2022] sd 9:0:0:0: [sdt] tag#27 Sense Key : Medium Error [current]
[Wed Oct 5 07:15:32 2022] sd 9:0:0:0: [sdt] tag#27 Add. Sense: Unrecovered read error - auto reallocate failed
[Wed Oct 5 07:15:32 2022] sd 9:0:0:0: [sdt] tag#27 CDB: Read(16) 88 00 00 00 00 00 00 00 d2 b8 00 00 00 08 00 00
[Wed Oct 5 07:15:32 2022] blk_update_request: I/O error, dev sdt, sector 53944 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
[Wed Oct 5 07:15:32 2022] ata9: EH complete
[Wed Oct 5 07:15:47 2022] Adding 2097084k swap on /dev/mapper/md127. Priority:-2 extents:1 across:2097084k FS
[Wed Oct 5 07:15:47 2022] md: resync of RAID array md126
[Wed Oct 5 07:15:48 2022] Adding 2097084k swap on /dev/mapper/md126. Priority:-3 extents:1 across:2097084k FS
[Wed Oct 5 07:15:48 2022] md: resync of RAID array md125
[Wed Oct 5 07:15:48 2022] Adding 2097084k swap on /dev/mapper/md125. Priority:-4 extents:1 across:2097084k FS
[Wed Oct 5 07:15:59 2022] md: md127: resync done.
[Wed Oct 5 07:16:01 2022] md: md125: resync done.
[Wed Oct 5 07:16:56 2022] md: md126: resync done.

SMART also shows a bunch of ATA errors for that drive as well

Error 151 occurred at disk power-on lifetime: 41110 hours (1712 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 b8 d2 00 e0 Error: UNC 8 sectors at LBA = 0x0000d2b8 = 53944

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b8 d2 00 e0 08 1d+19:20:37.960 READ DMA
c8 00 08 b0 d2 00 e0 08 1d+19:20:37.960 READ DMA
c8 00 08 a8 d2 00 e0 08 1d+19:20:37.949 READ DMA
c8 00 08 a0 d2 00 e0 08 1d+19:20:37.929 READ DMA
c8 00 08 98 d2 00 e0 08 1d+19:20:37.929 READ DMA

Error 150 occurred at disk power-on lifetime: 41110 hours (1712 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 48 cd 00 e0 Error: UNC 8 sectors at LBA = 0x0000cd48 = 52552

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 48 cd 00 e0 08 1d+19:20:27.080 READ DMA
c8 00 08 40 cd 00 e0 08 1d+19:20:27.071 READ DMA
c8 00 08 38 cd 00 e0 08 1d+19:20:25.811 READ DMA
c8 00 08 30 cd 00 e0 08 1d+19:20:24.802 READ DMA
c8 00 08 28 cd 00 e0 08 1d+19:20:23.709 READ DMA

Error 149 occurred at disk power-on lifetime: 41110 hours (1712 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 e8 cc 00 e0 Error: UNC 8 sectors at LBA = 0x0000cce8 = 52456

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 e8 cc 00 e0 08 1d+19:20:18.942 READ DMA
c8 00 08 e0 cc 00 e0 08 1d+19:20:18.934 READ DMA
c8 00 08 d8 cc 00 e0 08 1d+19:20:17.676 READ DMA
c8 00 08 d0 cc 00 e0 08 1d+19:20:17.676 READ DMA
ca 00 08 d0 cc 00 e0 08 1d+19:20:17.658 WRITE DMA

Error 148 occurred at disk power-on lifetime: 41110 hours (1712 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 d0 cc 00 e0 Error: UNC 8 sectors at LBA = 0x0000ccd0 = 52432

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 d0 cc 00 e0 08 1d+19:20:13.326 READ DMA
c8 00 08 c8 cc 00 e0 08 1d+19:20:13.326 READ DMA
ca 00 08 c8 cc 00 e0 08 1d+19:20:13.318 WRITE DMA
ef 10 02 00 00 00 a0 08 1d+19:20:13.233 SET FEATURES [Enable SATA feature]

Error 147 occurred at disk power-on lifetime: 41110 hours (1712 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 c8 cc 00 e0 Error: UNC 8 sectors at LBA = 0x0000ccc8 = 52424

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 c8 cc 00 e0 08 1d+19:20:09.004 READ DMA
c8 00 08 c0 cc 00 e0 08 1d+19:20:09.004 READ DMA
c8 00 08 b8 cc 00 e0 08 1d+19:20:09.004 READ DMA
c8 00 08 b0 cc 00 e0 08 1d+19:20:08.983 READ DMA
c8 00 08 a8 cc 00 e0 08 1d+19:20:08.973 READ DMA

It also shows that one short test (out of 21) has failed in the past

# 9 Short offline Completed: read failure 10% 33935 2002160664

Arwen · Oct 8, 2022

I don't have any suggestions if SMART shows little to nothing... but I am no expert in disk diagnoses.

Etorix · Oct 9, 2022

whodat said:
Through process of elimination, I determined that certain PSU SATA power cables connected to certain modular female SATA power ports on my PSU were causing the POST issue. I removed all of these PSU SATA power cables from my system, and also avoided using the modular PSU SATA power ports which were related to the POST issue.

So far (5+ days and counting) my drives and pools have remained healthy, where previously I would have definitely encountered a faulted drive/ degraded pool within 24 hours of uptime.

Great troubleshooting! Thanks for the report.

buswedg · Oct 9, 2022

Looks very similar to my problem. And as I mentioned -- switching to a new, more modern, and legitimate HBA solved my issue. I ordered one from Lenovo for around $400 last week to replace the one I've been loaned.

buswedg · Oct 10, 2022

Don't believe it. I just got the same error again. So, the problem must be truenas itself -- likely some kind of incompatibility with the type of drives I'm using.

Code:

[612578.064648] sd 0:0:1:0: [sdd] tag#6716 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=6s
[612578.064653] sd 0:0:1:0: [sdd] tag#6716 Sense Key : Aborted Command [current]
[612578.064656] sd 0:0:1:0: [sdd] tag#6716 Add. Sense: Command timeout during processing
[612578.064660] sd 0:0:1:0: [sdd] tag#6716 CDB: Read(16) 88 00 00 00 00 06 e5 70 9e a0 00 00 00 40 00 00
[612578.064664] blk_update_request: I/O error, dev sdd, sector 29619166880 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0

Code:

        NAME                                      STATE     READ WRITE CKSUM
        storage-pool                              DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            1766c859-9231-4127-ad35-882937845b76  ONLINE       0     0     0
            89b557d9-cc55-463c-9124-3765cff9aaac  ONLINE       0     0     0
            4a07f2eb-442c-430d-9db6-0e5fd80302a9  ONLINE       0     0     0
            c6287e21-49e0-4ffe-86a1-78a33bbbec70  ONLINE       0     0     0
            1c817e68-35de-4b8c-be59-4e98fc1fc9fe  ONLINE       0     0     0
            28dba822-3f63-4e2c-82ad-8e4467bc81a1  FAULTED     12     2     0  too many errors
        cache
          f5456732-9153-44fb-977b-fe5f1021b959    ONLINE       0     0     0

buswedg · Nov 18, 2022

I finally figured this out. And after almost a month of testing now, I'm very confident.

In short -- I needed to turn off Native Command Queuing for all the WD gold 16tb drives by setting the queue depth to 1. They don't retain this setting but I have a script to set the depth on boot. I've been running this for some time now, with many periods of heavy read/writes. Have haven't had any errors.

I think overall, these gold drives seem to be acting like SMR drives in almost every way, even though they are CMR. I say that, as I got my hands on a set of WD Red Pro 18tb drives, and have been testing back to back against the golds. Same server, on the same HBA. And the red's are fine. The golds even keep disabling SCT/TLER on cold restart, whilst the red's do not.

If anyone can help me understand why this would be, it would be much appreciated.

Arwen · Nov 19, 2022

For the WD Gold's native command queuing, you may have found a firmware bug in the interaction between the drive, controller and ZFS. I'd try one drive with a queue depth of 2 and see if the problem returns. It's possible that the maximum depth was the problem. Or close to it. If 2 works, slowly move that drive's queue depth up a bit to find the threshold of problem. Then back off and apply to all drives.

For the WD Gold's disabling SCT/TLER, I've seen irregular behavior in disk drives retaining parameters, even when the say they do. Simply add or update your queue depth script to include setting SCT/TLER.

brymck · Mar 11, 2023

I was having a similar problem a few months back that caused me to roll back from SCALE.
Tried upgrading from CORE to SCALE and it is currently installed in a dual-boot scenario in GRUB (not sure if this is intended).
Running in an r720xd so enterprise grade hardware all around.
I had 6x6TB IBM branded st6000nm0034 drives (2017 manufacture) that worked just fine for years in CORE and SMART data is clean, as soon as I upgraded to SCALE the pool started degrading with multiple random disk failures. Console was showing a bunch of error messages about protection errors when doing blk_update_requests, buffer i/o errors.
Thought it might be an issue with firmware on the drives so I recently bought an old IBM server and just flashed all the firmware for the 6TB drives on it (IBM branded drives couldn't be flashed without an IBM x server, thankfully was only $50 or so - made in late 2000s and did what I needed it to!)
I took all the drives out one by one and replaced/resilvered on my main storage NAS with 8TB Toshiba SAS drives (newer, 2021 manufacture date). Still running those on CORE. I'm copying the full data pool to another 2x10tb stripe pool that I'm going to export as a physical backup, then try to boot into SCALE to see if I'm still getting errors with the new drives, in which case I would suspect the controller (PERC R710p flashed into IT mode).
I flashed all the 6TB drives last night to latest firmware, put them in a spare r720 I have with the same controller/same controller firmware, am installing SCALE on that fresh at this moment and am getting the same blk_update_request buffer i/o errors during the install process so I'm not optimistic.

update: i installed SCALE successfully but get the error in dmesg when clicking on the disks page. I wrote zeroes to the drive that I was getting the error for using dd for a few minutes and reload the page and no longer get it. I'm currently wiping all drives with zeroes using nwipe and will create a new pool on that system, copy some data to it, and use to see if i can reproduce.

update 2: after backing up the primary pool to a secondary, disconnecting the secondary to protect data, and rebooting into scale, i get no errors from the disks on the r720xd which points strongly to the disks themselves being the culprit. i saw another case with the st6000nm0034 drives not behaving well with pool creation/destruction in SCALE so maybe the drives themselves are to blame. Drives are 97% wiped with zeroes on my test bed, going to boot them into SCALE after that and see if I get any errors.

update 3: drives will not create into a pool via the GUI, they are reporting data integrity formatting error as described here (https://www.truenas.com/community/threads/troubleshooting-disk-format-warnings-in-bluefin.106051/), formatting the drives individually now with the instructions in that thread to see if this allows creation of a pool after.

update 4: after reformatting the drives to remove data integrity, no errors with the drives in the pool. looks like this is something that was maybe compatible (or not checked) with CORE but now has become an issue in SCALE.

tony_m · May 24, 2023

buswedg said:
I finally figured this out. And after almost a month of testing now, I'm very confident.

In short -- I needed to turn off Native Command Queuing for all the WD gold 16tb drives by setting the queue depth to 1. They don't retain this setting but I have a script to set the depth on boot. I've been running this for some time now, with many periods of heavy read/writes. Have haven't had any errors.

I think overall, these gold drives seem to be acting like SMR drives in almost every way, even though they are CMR. I say that, as I got my hands on a set of WD Red Pro 18tb drives, and have been testing back to back against the golds. Same server, on the same HBA. And the red's are fine. The golds even keep disabling SCT/TLER on cold restart, whilst the red's do not.

If anyone can help me understand why this would be, it would be much appreciated.

A search brought me to this thread. I've been having (I think) the same issue as you were. In my system I have a bunch of 10TB Reds and 16TB Golds. The Golds have been giving me constant Read and Write errors that, so far, have cleared with no data corruption with a zpool clear. Long smartctl tests on the Reds always complete. Long tests on the Golds usually fail (host reset). They are all on the same backplane/cable/hba and I have randomly switched the drives to different bays. The Golds are the only ones that give me errors.

Did changing the queue depth to 1 work for you? Did you go higher?

I'm assuming your script writes to /sys/block/sd*/device/queue_depth but if there is another or better way please let me know.

Much thanks in advance,
Tony

tanaka_desu · Jun 30, 2023

I was having exactly this issue for two months: WD Ultrastars (which are just WD Golds with another label, if people on the web are correct) randomly dropping out of pool with one of three errors: timeout, I/O error or device reset (which also produced SMART error in logs at sector 00000000).

(However, I'm not running TrueNAS — just regular Ubuntu 22.04, so sorry for barging on this forum. I'll hide more details about my build under spoiler)

Linux 5.15.0-75-generic #82-Ubuntu SMP Tue Jun 6 23:10:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ zfs --version
zfs-2.1.5-1ubuntu6~22.04.1
zfs-kmod-2.1.5-1ubuntu6~22.04.1

AMD Ryzen 5 PRO 5650G with Radeon Graphics
ASUS X570-E Gaming mobo
Corsair HX850 PSU
128GB ECC RAM — KSM32ED8/32HC * 4
Samsung 980 Pro 1TB * 2 — rpool mirror
LSI SAS 9305-16i in IT mode with latest firmware

HDDs are WD Red Pro and WD Ultrastar, only Ultrastars dropping out of pools. Everything is SATA

Pools are:
2x8TB — WD Red Pro — mirror
6x16TB — WD Ultrastar DC HC550 — raid-z2

System passed 7 days memtest (I don't remember how many passes is was, but at least 30 — well above necessary)

Every single HDD was burned in with this script (4 passes) https://github.com/Spearfoot/disk-burnin-and-testing and then passed long SMART test. No issues

NAS is connected to APC UPS

Over two months, I've tried:

resetting every cable possible
blowing dust and cleaning contacts with isopropyl
replacing cables from HBA to HDD -- twice!
setting TLER (SCT ERC) at boot (which Ultrastars don't have by default — and according to their Datasheet is 60 seconds by default, so it's a huge pitfall!)
disabling NCQ in linux kernel boot params
limiting speed to SATA 2
adding small usb noctua fan under HBA — I did check temperature with thermal camera and the hottest it got was 43°C under 100% load of 8 disks. But I added it anyway.

With no effect. Under high I/O ZFS would kick HDDs randomly from pool from just a single read or write error.

The guaranteed recipe for triggering issue for me is: select every torrent in qbittorrent and press "force recheck".
Then start copying something huge from raidz2 to other pool.
Within 30-60 minutes zfs will kick at least one HDD out of pool.

If you "zpool clear" it, it'll resume normal work.

Yesterday, when I was just getting ready to blow money on another HBA I found this thread and tried disabling NCQ via "queue_depth" param.

Code:

#!/bin/bash
for i in /dev/sd? ; do
    echo 1 > /sys/block/${i/\/dev\/}/device/queue_depth
    echo -n $i " disabled NCQ "
    smartctl -i $i | egrep "(Device Model|Product:)"
done

Suddenly, disks stopped falling out of the pool! System is running 28 hours in stress mode, copying stuff over and over again and I don't see any errors, SMART errors, timeouts and zpool errors. I'll keep stress testing it and monitoring it.

That is, performance did drop by at least 25% from setting queue_depth to 1 and I still don't know how it'll affect scrub duration (It was 48 hours at 31 TB of data before, I'm waiting now for newer results).

So... Did anyone try mailing Western Digital or ZFS mailing list on this issue? What's so special about WD Gold/WD Ultrastar that they are performing weirdly with NCQ enabled + ZFS? If it's indeed a bug within WD Gold/WD Ultrastar family, how's it was not discovered yet — just like WD Red SMR disaster 3 years ago? I doubt no one uses enterprise WD disks with ZFS, so...?

diogen · Jul 1, 2023

tanaka_desu said:
...I still don't know how it'll affect scrub duration (It was 48 hours at 31 TB of data before, I'm waiting now for newer results).

That is quite a bit!

Mine just finished (running every first day of the month) scrubbing: 31TB in 13 hours.
Latest bluefin, no apps, Seagate EXOS 16TB drives, six of them under RAIDZ2...

tanaka_desu · Jul 1, 2023

diogen said:
That is quite a bit!

Mine just finished (running every first day of the month) scrubbing: 31TB in 13 hours.
Latest bluefin, no apps, Seagate EXOS 16TB drives, six of them under RAIDZ2...

I was incorrect in my previous post — I triple checked grafana/prometheus, and previous scrub was 18 hours.
Current one is already noticeably longer — 36 hours and it's only 67% done. I was really scared in the first ~15 hours, because scrub was not progressing at all, showing speeds of 4 M/s, but then it quickly gained speed and went from 0.5% to 50% in like few hours. I'm not sure what this means and how it relates to queue_depth changes.

It's still not finished with this one, so I'll post final data. Plus, current scrub found some checksum errors and corrected some files, so I'll run it again immediately after — I'm not surprised about some corrupted data after disks were kicked out of the pool like 30 times in a week, so I'll be patient.

tanaka_desu · Jul 4, 2023

So, update on scrub.

First scrub took 1 days 18:37:06.
Second scrub took 1 days 06:47:50. Second scrub did not have any checksum errors nor any other errors.

Pre-patch scrubs took 18 hours (11 jun) and 26 hours (15 may).

So, comparing pre-patch scrubs and after-patch scrubs.

Average load:
Was: 3.5~4.5 with a peak at 6
Now: 4.0~5.0 with a peak at 7

Disk IOPS (reads):
Was: 400~600 iops
Now: 180~300 iops

Instantaneous queue size:
Was: 1~10
Now: 1 (constant, well, queue_depth in effect)

Time spent during IOPS (baseline), seeding torrents at 400 mbits:
Was: 20-30%
Now: 20-40%

Time spent during IOPS (scrub):
Was: 60-70%
Now: 80-99%, staying above 95% for long periods of time

Read speed:
Was: ~100-70 MB/s (uniform, slowing down towards the end of scrub)
Now: All over the place, between 100 MB/s and 4 MB/s for long periods of time

IO Utilization:
Was: staying below 80% all the time
Now: all over the place, hitting 95% util for long periods of time

Another thing: copying large amounts of data I now see Average Load spike up to 20 seemingly from starved IOPS.

Obviously this is in no way an academic research, it's just your old regular node_exporter and prometheus.

---

Currently I'm not sure what to do. Probably will try mailing Western Digital or someone else.

Funny thing, I found mentions about flawed NCQ in WD Golds in ZFS github from 2020:

scrub impacts user I/O latency too much · Issue #10535 · openzfs/zfs

Type Version/Name Distribution Name Debian Distribution Version Unstable/Sid Linux Kernel 5.7.0-1 Architecture amd64 ZFS Version 0.8.4-1 SPL Version 0.8.4-1 When a scrub is ongoing, normal user I/O...

github.com

Scrub heavily impacting application IO performance · Issue #10253 · openzfs/zfs

System information Type Version/Name Distribution Name CentOS Distribution Version 8.1 Linux Kernel 4.18.0-147.5.1.el8_1.x86_64 Architecture x86_64 ZFS Version 0.8.3-1 SPL Version 0.8.3-1 Describe ...

github.com

I'm not sure if these are all related but still something to note.

Important Announcement for the TrueNAS Community.

Pool keeps degrading

Explorer

Explorer

Wizard

MVP

Dabbler

Wizard

MVP

Wizard

Explorer

Explorer

Explorer

MVP

Cadet

Cadet

Cadet

Explorer

Cadet

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Pool keeps degrading"

Similar threads