Hi all. I bought a refurbished R730XD specifically for a home lab. I did some research beforehand and deliberately chose the HBA330 Mini non-RAID controller; I hope that was the right call.
Unfortunately, any time I put any sort of load on the storage subsystem, I start getting errors. I ran zpool clear <foo> on my pools this morning, and as of right now I see:
Code:
root@truenas[~]# zpool status
  pool: boot-pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:23 with 0 errors on Fri Jul 1 03:45:26 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        boot-pool                 DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            16189685538754587719  UNAVAIL      0     0     0  was /dev/sdj3
            sdm3                  ONLINE       0     0     0

errors: No known data errors

  pool: main
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 12.5G in 00:01:35 with 0 errors on Wed Jul 6 13:42:51 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        main                                      DEGRADED     0     0     0
          raidz2-0                                ONLINE       0     0     0
            f011b123-078c-4ab7-ad8c-49e14643e1ab  ONLINE       0     0     0
            15eec459-9576-4426-b347-052a5e581844  ONLINE       0     0     0
            47826a44-9122-42ae-a1fd-4ad54562f9e5  ONLINE       0     0     0
            9db35305-152f-46ba-8faa-6d8fc57ed258  ONLINE       0     0     0
          raidz2-1                                DEGRADED     0     0     0
            b2b96949-b93c-4392-9b5d-88ca21bbfbb9  ONLINE       0     0     0
            3f8ff389-9704-4c77-9e67-15a829f9a3d9  ONLINE       0     0     0
            b5d3c619-a38a-4433-8767-591c835a1de3  FAULTED      0    58     0  too many errors
            7f348036-7db4-4d41-8878-07eaaa2d0b14  FAULTED     22    53     0  too many errors

errors: No known data errors

  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 618G in 00:32:57 with 0 errors on Wed Jul 6 14:14:05 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        test                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            9387ffd2-b04f-41fd-ba77-691ffaf4b840  ONLINE       0     0     0
            89b4f62e-1307-48c2-a96c-5917c6d0a217  ONLINE       0     0     0
            0987934d-14f5-428e-9ed3-ce039a52683d  ONLINE       0     0     0
            90cee02a-36a7-43fc-95f9-9e6a94c5393e  ONLINE       0     6     0

errors: No known data errors
Ignore the boot pool; that's broken on purpose so I can quickly switch operating systems. (I get the same errors in Proxmox, BTW.)
The support vendor refuses to swap out any hardware because iDRAC isn't showing any errors on the controller. I've reseated cables, though probably not all of them; it's a tight squeeze. I'm willing to fully disassemble the thing, but it's going to be uncomfortable, like working in the back of a Volkswagen.
I see messages like this frequently in dmesg:
Code:
[1108669.935060] blk_update_request: I/O error, dev sdg, sector 6992830584 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[1108669.949537] blk_update_request: I/O error, dev sdg, sector 6992829776 op 0x1:(WRITE) flags 0x700 phys_seg 52 prio class 0
[1108669.962046] zio pool=main vdev=/dev/disk/by-partuuid/b5d3c619-a38a-4433-8767-591c835a1de3 error=5 type=2 offset=3578181709824 size=8192 flags=180880
[1108669.975725] zio pool=main vdev=/dev/disk/by-partuuid/b5d3c619-a38a-4433-8767-591c835a1de3 error=5 type=2 offset=3578181296128 size=413696 flags=40080c80
[1108670.369217] sd 0:0:6:0: Power-on or device reset occurred
But it's not always a WRITE, and it's not always /dev/sdg.
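To show the errors aren't tied to one device, here's roughly how I've been tallying them per drive. This is a self-contained sketch fed a captured sample (the sample lines and counts here are made up for illustration); on the live box I pipe dmesg in instead of the here-doc:

```shell
# Count blk_update_request I/O errors per device.
# Real host: dmesg | grep -o 'dev sd[a-z]*' | sort | uniq -c | sort -rn
errors=$(grep -o 'dev sd[a-z]*' <<'EOF' | sort | uniq -c | sort -rn
[1108669.935060] blk_update_request: I/O error, dev sdg, sector 6992830584 op 0x1:(WRITE)
[1108669.949537] blk_update_request: I/O error, dev sdg, sector 6992829776 op 0x1:(WRITE)
[1108700.123456] blk_update_request: I/O error, dev sdb, sector 123456 op 0x0:(READ)
EOF
)
echo "$errors"   # one "count device" line per drive, worst offender first
```

If the counts are spread across most of the drives rather than piled on one, that points away from any single disk.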
I currently have it fully populated with eight 12 TB Seagate drives and four 12 TB WD Reds that I pulled from my old NAS. I get the random errors on all (or at least most) of the drives. I've tried spares and still get errors. While I'm not discounting the possibility, I doubt every drive I have is bad.
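One thing I've been watching is SMART attribute 199 (UDMA_CRC_Error_Count) on the affected drives, since CRC errors climbing across many disks usually implicates cabling/backplane rather than the platters. A minimal sketch that parses a captured smartctl -A line so it's self-contained; the sample line and its count of 37 are made up, and on the real box you'd loop `for d in /dev/sd?; do smartctl -A "$d"; done`:

```shell
# Pull the raw value of SMART attribute 199 from smartctl -A style output.
# The sample line below is a stand-in; pipe real smartctl output in instead.
crc=$(printf '%s\n' \
  '199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 37' \
  | awk '$1 == 199 { print $NF }')
echo "CRC error count: $crc"
```

A value that keeps rising under load, on multiple drives at once, would support a cabling/expander/backplane problem rather than bad disks.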
As far as I can tell, all the firmware on the system is as up to date as possible; I can't find anything newer than 16.17.01.00 for the HBA330 Mini.
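For completeness, here's how I've been sanity-checking that the OS sees the firmware I think is installed. The installed value is hard-coded below as an assumption so the snippet stands alone; on the real host it would come from the controller (e.g. the mpt3sas sysfs attribute `/sys/class/scsi_host/host*/version_fw`):

```shell
# Compare the reported controller firmware against the newest build I could
# find for the HBA330 Mini. Both values below are from my own box/searching.
installed="16.17.01.00"   # assumption: hard-coded stand-in for the sysfs value
latest="16.17.01.00"      # newest release I could locate
if [ "$installed" = "$latest" ]; then
  result="firmware current: $installed"
else
  result="update available: $installed -> $latest"
fi
echo "$result"
```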
Does anyone have anything I should try or look at? I'm not exactly a noob, but I haven't touched enterprise hardware in quite some time.