SOLVED Dell R730XD w/ HBA330 mini getting many disk errors, drive resets

sakodak

Cadet
Joined
Jun 15, 2022
Messages
3
Hi all. I bought a refurbished R730XD specifically for a home lab. I did a bit of research beforehand and specifically chose the HBA330 Mini non-RAID controller. I hope that was the right thing to do.

Unfortunately, any time I put any sort of load on the storage subsystem I start getting errors. I did a zpool clear <foo> on my pools this morning and as of right now I see:

Code:
root@truenas[~]# zpool status
  pool: boot-pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:23 with 0 errors on Fri Jul  1 03:45:26 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        boot-pool                 DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            16189685538754587719  UNAVAIL      0     0     0  was /dev/sdj3
            sdm3                  ONLINE       0     0     0

errors: No known data errors

  pool: main
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 12.5G in 00:01:35 with 0 errors on Wed Jul  6 13:42:51 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        main                                      DEGRADED     0     0     0
          raidz2-0                                ONLINE       0     0     0
            f011b123-078c-4ab7-ad8c-49e14643e1ab  ONLINE       0     0     0
            15eec459-9576-4426-b347-052a5e581844  ONLINE       0     0     0
            47826a44-9122-42ae-a1fd-4ad54562f9e5  ONLINE       0     0     0
            9db35305-152f-46ba-8faa-6d8fc57ed258  ONLINE       0     0     0
          raidz2-1                                DEGRADED     0     0     0
            b2b96949-b93c-4392-9b5d-88ca21bbfbb9  ONLINE       0     0     0
            3f8ff389-9704-4c77-9e67-15a829f9a3d9  ONLINE       0     0     0
            b5d3c619-a38a-4433-8767-591c835a1de3  FAULTED      0    58     0  too many errors
            7f348036-7db4-4d41-8878-07eaaa2d0b14  FAULTED     22    53     0  too many errors

errors: No known data errors

  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 618G in 00:32:57 with 0 errors on Wed Jul  6 14:14:05 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        test                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            9387ffd2-b04f-41fd-ba77-691ffaf4b840  ONLINE       0     0     0
            89b4f62e-1307-48c2-a96c-5917c6d0a217  ONLINE       0     0     0
            0987934d-14f5-428e-9ed3-ce039a52683d  ONLINE       0     0     0
            90cee02a-36a7-43fc-95f9-9e6a94c5393e  ONLINE       0     6     0

errors: No known data errors


Ignore the boot pool; that's broken on purpose so I could quickly switch operating systems. (I get the same errors in Proxmox, BTW.)
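For anyone who wants to pull the non-zero error counters out of output like the above without reading every row, here is a minimal Python sketch. It assumes the plain `zpool status` table layout shown above (NAME, STATE, READ, WRITE, CKSUM columns with whole-number counters); the function name `devices_with_errors` and the trimmed `SAMPLE` text are just illustrative, not part of any ZFS tooling.

```python
import re

# One device/vdev row of the "config:" table:
#   NAME  STATE  READ  WRITE  CKSUM  [optional note]
# Assumption: counters are plain integers. Abbreviated counts such as
# "1.2K" would not match this pattern and would be skipped.
ROW = re.compile(
    r"^\s+(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+"
    r"(?P<read>\d+)\s+(?P<write>\d+)\s+(?P<cksum>\d+)"
)
POOL = re.compile(r"^\s*pool:\s+(\S+)")


def devices_with_errors(status_text):
    """Return (pool, device, state, read, write, cksum) for rows with any non-zero counter."""
    pool = None
    hits = []
    for line in status_text.splitlines():
        m = POOL.match(line)
        if m:
            pool = m.group(1)  # remember which pool the following rows belong to
            continue
        m = ROW.match(line)
        if m:
            r, w, c = int(m["read"]), int(m["write"]), int(m["cksum"])
            if r or w or c:
                hits.append((pool, m["name"], m["state"], r, w, c))
    return hits


# Trimmed copy of the output above, used only as example input.
SAMPLE = """
  pool: main
 state: DEGRADED
config:

        NAME                                      STATE     READ WRITE CKSUM
        main                                      DEGRADED     0     0     0
          raidz2-1                                DEGRADED     0     0     0
            b5d3c619-a38a-4433-8767-591c835a1de3  FAULTED      0    58     0  too many errors
            7f348036-7db4-4d41-8878-07eaaa2d0b14  FAULTED     22    53     0  too many errors
  pool: test
config:

        NAME                                      STATE     READ WRITE CKSUM
            90cee02a-36a7-43fc-95f9-9e6a94c5393e  ONLINE       0     6     0
"""

if __name__ == "__main__":
    for hit in devices_with_errors(SAMPLE):
        print(hit)
```

In practice the input would be the full text of `zpool status` rather than the trimmed sample.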

The support vendor refuses to swap out any hardware because iDRAC isn't showing any errors on the controller. I've reseated cables (but probably not all of them; it's a tight squeeze). I'm willing to fully disassemble it, but it's going to be uncomfortable, like the back of a Volkswagen.

I see messages like this frequently in dmesg:

Code:
[1108669.935060] blk_update_request: I/O error, dev sdg, sector 6992830584 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[1108669.949537] blk_update_request: I/O error, dev sdg, sector 6992829776 op 0x1:(WRITE) flags 0x700 phys_seg 52 prio class 0
[1108669.962046] zio pool=main vdev=/dev/disk/by-partuuid/b5d3c619-a38a-4433-8767-591c835a1de3 error=5 type=2 offset=3578181709824 size=8192 flags=180880
[1108669.975725] zio pool=main vdev=/dev/disk/by-partuuid/b5d3c619-a38a-4433-8767-591c835a1de3 error=5 type=2 offset=3578181296128 size=413696 flags=40080c80
[1108670.369217] sd 0:0:6:0: Power-on or device reset occurred


But it's not always WRITE and it's not always /dev/sdg.
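To see how the errors and resets are spread across drives, log lines like the ones above can be tallied per device. This is a minimal Python sketch that assumes the kernel message formats shown above; the `tally` function and the regex names are illustrative, and the sample is a trimmed copy of the lines from this post.

```python
import re
from collections import Counter

# Assumed formats, taken from the dmesg lines quoted above.
IO_ERR = re.compile(
    r"I/O error, dev (?P<dev>\w+), sector \d+ op 0x[0-9a-fA-F]+:\((?P<op>\w+)\)"
)
RESET = re.compile(
    r"sd (?P<addr>\d+:\d+:\d+:\d+): Power-on or device reset occurred"
)


def tally(dmesg_text):
    """Count I/O errors per (device, operation) and resets per SCSI address."""
    io_errors = Counter()
    resets = Counter()
    for line in dmesg_text.splitlines():
        m = IO_ERR.search(line)
        if m:
            io_errors[(m["dev"], m["op"])] += 1
            continue
        m = RESET.search(line)
        if m:
            resets[m["addr"]] += 1
    return io_errors, resets


# Trimmed copy of the log lines above, used only as example input.
SAMPLE = """\
[1108669.935060] blk_update_request: I/O error, dev sdg, sector 6992830584 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[1108669.949537] blk_update_request: I/O error, dev sdg, sector 6992829776 op 0x1:(WRITE) flags 0x700 phys_seg 52 prio class 0
[1108670.369217] sd 0:0:6:0: Power-on or device reset occurred
"""

if __name__ == "__main__":
    io_errors, resets = tally(SAMPLE)
    for (dev, op), count in sorted(io_errors.items()):
        print(f"{dev} {op}: {count} I/O error(s)")
    for addr, count in sorted(resets.items()):
        print(f"sd {addr}: {count} reset(s)")
```

Fed the full dmesg output, a tally that touches many different devices would point toward something shared (controller, cabling, backplane) rather than a single bad disk.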

I currently have it fully populated with eight 12TB Seagate drives and four 12TB WD Reds that I pulled from my old NAS. I get the random errors on all (or at least most) of the drives. I have tried spares and I still get errors. While I'm not discounting the possibility, I doubt all the drives I have are bad.

As far as I can tell all the firmware on the system is as up to date as possible. I can't find anything newer than 16.17.01.00 for the HBA330 Mini.

Anyone have anything I should try or look at? I'm not exactly a noob, but I've not touched (enterprise) hardware for quite some time.
 

FlyingHacker

Dabbler
Joined
Jun 27, 2022
Messages
39
I am a newbie at this, but here is some info:

I use an HBA330 that I bought pre-flashed from ArtOfTheServer (YouTube guy in the link below). It shows this version in iDRAC:
Dell HBA330 Mini    ARTofSERVER    16.17.00.05



Hope this is useful to you. I have not seen errors like the ones you report with HGST SAS drives.
 

sakodak

Cadet
Joined
Jun 15, 2022
Messages
3
This is an HBA330, not an H330. Basically, when you flash an H330 you're turning it into an HBA330. But I really appreciate you trying to help, thank you.
 

FlyingHacker

Dabbler
Joined
Jun 27, 2022
Messages
39
Ah, oops! Good luck. Will be following this thread anyway.

My R730XD came with a PERC H730P, but I replaced it with the above card.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Forum search is your friend.

 

sakodak

Cadet
Joined
Jun 15, 2022
Messages
3
After swapping out the backplane (which made things worse) and cables, this ended up being a bad HBA330. The vendor replaced it and everything seems to be fine so far. Granted, it's been less than an hour since I replaced the HBA, so I might be jumping the gun, but previously I'd get ZFS errors and device resets any time I did even a moderate amount of I/O on the disks. I've been running some tasks for a while and I still have no errors or device resets at all.

Samuel: I did, in fact, search the forums before I posted. That thread is for a different HBA and a different server than the ones I have, so I deduced the relevance would be minimal. I had already tried cleaning the pads on the HBA with IPA; I probably should have mentioned that.
 