SOLVED: Dell R730XD with HBA330 Mini getting many disk errors and drive resets

sakodak · Jul 6, 2022

Hi all. I bought a refurbished R730XD for a home lab. I did a bit of research beforehand and deliberately chose the HBA330 Mini non-RAID controller. I hope that was the right call.

Unfortunately, whenever I put any sort of load on the storage subsystem, I start getting errors. I ran zpool clear <pool> on my pools this morning, and right now I see:

Code:

root@truenas[~]# zpool status
  pool: boot-pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:23 with 0 errors on Fri Jul  1 03:45:26 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        boot-pool                 DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            16189685538754587719  UNAVAIL      0     0     0  was /dev/sdj3
            sdm3                  ONLINE       0     0     0

errors: No known data errors

  pool: main
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 12.5G in 00:01:35 with 0 errors on Wed Jul  6 13:42:51 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        main                                      DEGRADED     0     0     0
          raidz2-0                                ONLINE       0     0     0
            f011b123-078c-4ab7-ad8c-49e14643e1ab  ONLINE       0     0     0
            15eec459-9576-4426-b347-052a5e581844  ONLINE       0     0     0
            47826a44-9122-42ae-a1fd-4ad54562f9e5  ONLINE       0     0     0
            9db35305-152f-46ba-8faa-6d8fc57ed258  ONLINE       0     0     0
          raidz2-1                                DEGRADED     0     0     0
            b2b96949-b93c-4392-9b5d-88ca21bbfbb9  ONLINE       0     0     0
            3f8ff389-9704-4c77-9e67-15a829f9a3d9  ONLINE       0     0     0
            b5d3c619-a38a-4433-8767-591c835a1de3  FAULTED      0    58     0  too many errors
            7f348036-7db4-4d41-8878-07eaaa2d0b14  FAULTED     22    53     0  too many errors

errors: No known data errors

  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 618G in 00:32:57 with 0 errors on Wed Jul  6 14:14:05 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        test                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            9387ffd2-b04f-41fd-ba77-691ffaf4b840  ONLINE       0     0     0
            89b4f62e-1307-48c2-a96c-5917c6d0a217  ONLINE       0     0     0
            0987934d-14f5-428e-9ed3-ce039a52683d  ONLINE       0     0     0
            90cee02a-36a7-43fc-95f9-9e6a94c5393e  ONLINE       0     6     0

errors: No known data errors

Ignore the boot pool; I broke that mirror on purpose so I could quickly switch operating systems. (I get the same errors under Proxmox, by the way.)
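To pull out only the devices with non-zero counters from output like the above, a small parser is enough. This is a minimal sketch, not a TrueNAS tool. It assumes the NAME / STATE / READ / WRITE / CKSUM column layout shown here and plain integer counters (zpool abbreviates large counts, e.g. "1.2K", which this sketch does not handle), and the function name devices_with_errors is made up for illustration.

```python
import re

# A config row: name, state, then READ/WRITE/CKSUM counters, optionally followed
# by a free-text note such as "too many errors" or "was /dev/sdj3".
# Assumes plain integer counters, as in the zpool status output above.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+.*)?$")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    hits = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, "pool:", "scan:", wrapped status text, etc.
        name, state = m.group(1), m.group(2)
        read, write, cksum = (int(m.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            hits.append((name, state, read, write, cksum))
    return hits
```

On the server you could feed it subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout. Against the output above it would report the two FAULTED members of raidz2-1 in main and the member of test with 6 write errors.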

The support vendor refuses to swap any hardware because iDRAC isn't reporting any errors on the controller. I've reseated cables, though probably not all of them; it's a tight squeeze. I'm willing to fully disassemble the server, but it's going to be uncomfortable, like the back of a Volkswagen.

I frequently see messages like this in dmesg:

Code:

[1108669.935060] blk_update_request: I/O error, dev sdg, sector 6992830584 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[1108669.949537] blk_update_request: I/O error, dev sdg, sector 6992829776 op 0x1:(WRITE) flags 0x700 phys_seg 52 prio class 0
[1108669.962046] zio pool=main vdev=/dev/disk/by-partuuid/b5d3c619-a38a-4433-8767-591c835a1de3 error=5 type=2 offset=3578181709824 size=8192 flags=180880
[1108669.975725] zio pool=main vdev=/dev/disk/by-partuuid/b5d3c619-a38a-4433-8767-591c835a1de3 error=5 type=2 offset=3578181296128 size=413696 flags=40080c80
[1108670.369217] sd 0:0:6:0: Power-on or device reset occurred

It's not always a WRITE, though, and it's not always /dev/sdg.
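The zio lines can be decoded and cross-checked against the blk_update_request lines to confirm which sdX a partuuid belongs to. A minimal sketch under these assumptions: error= is a Linux errno (5 is EIO), type= follows the OpenZFS zio_type enum (only READ = 1 and WRITE = 2 are mapped here), and offset= is a byte offset within the ZFS partition. decode_zio and implied_partition_start are hypothetical helper names.

```python
import errno
import re

SECTOR_BYTES = 512  # the block layer reports sector numbers in 512-byte units

# Small subset of the OpenZFS zio_type enum, enough for the lines above.
ZIO_TYPES = {1: "READ", 2: "WRITE"}

ZIO_LINE = re.compile(
    r"zio pool=(\S+) vdev=(\S+) error=(\d+) type=(\d+) offset=(\d+) size=(\d+)")


def decode_zio(line):
    """Turn a 'zio pool=... vdev=...' kernel line into a readable dict, or None."""
    m = ZIO_LINE.search(line)
    if m is None:
        return None
    err, ztype, offset = int(m.group(3)), int(m.group(4)), int(m.group(5))
    return {
        "pool": m.group(1),
        "vdev": m.group(2),
        "error": errno.errorcode.get(err, str(err)),  # 5 -> "EIO"
        "type": ZIO_TYPES.get(ztype, str(ztype)),      # 2 -> "WRITE"
        "offset_sectors": offset // SECTOR_BYTES,
        "size": int(m.group(6)),
    }


def implied_partition_start(blk_sector, zio_offset_sectors):
    """If a blk_update_request sector and a zio offset describe the same I/O,
    their difference is where the ZFS partition starts on the whole disk."""
    return blk_sector - zio_offset_sectors
```

For both pairs in the excerpt above (sector 6992830584 with offset 3578181709824, and sector 6992829776 with offset 3578181296128) the difference comes out to 4194432 sectors, i.e. 2 GiB + 64 KiB. Two independent pairs agreeing on the same constant is consistent with sdg being the disk behind b5d3c619, plausibly with its data partition sitting after a 2 GiB swap partition. The direct way to confirm the mapping is ls -l /dev/disk/by-partuuid/.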

The chassis is fully populated with eight 12TB Seagate drives and four 12TB WD Reds that I pulled from my old NAS. I get random errors on all (or at least most) of the drives. I've tried swapping in spares and still get errors. While I'm not ruling it out, I doubt all of my drives are bad.
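When errors turn up on nearly every drive, spares included, a component the drives share (HBA, backplane, cables) becomes the more likely suspect, and tallying the kernel log by device and SCSI address makes that spread visible. A minimal sketch: the regular expressions only target the two message formats quoted above, and summarize is a hypothetical name.

```python
import re
from collections import Counter

# "blk_update_request: I/O error, dev sdg, sector N op 0x1:(WRITE) ..."
IO_ERR = re.compile(r"I/O error, dev (\w+), sector \d+ op 0x[0-9a-f]+:\((\w+)\)")
# "sd 0:0:6:0: Power-on or device reset occurred" (host:channel:target:lun)
RESET = re.compile(r"sd (\d+):(\d+):(\d+):(\d+): Power-on or device reset occurred")


def summarize(log_text):
    """Count I/O errors per device and per op, and resets per SCSI address."""
    errors_by_dev = Counter()
    errors_by_op = Counter()
    reset_targets = Counter()
    for line in log_text.splitlines():
        m = IO_ERR.search(line)
        if m:
            errors_by_dev[m.group(1)] += 1
            errors_by_op[m.group(2)] += 1
            continue
        m = RESET.search(line)
        if m:
            reset_targets[tuple(int(g) for g in m.groups())] += 1
    hosts = {addr[0] for addr in reset_targets}  # SCSI host numbers seen in resets
    return errors_by_dev, errors_by_op, reset_targets, hosts
```

Against the excerpt above it reports two WRITE errors on sdg and one reset on 0:0:6:0. Over a longer log, errors spread across many devices and targets that all sit behind the same SCSI host (presumably the HBA) point at the shared path rather than at individual disks.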

As far as I can tell, all the firmware on the system is up to date. I can't find anything newer than 16.17.01.00 for the HBA330 Mini.

Does anyone have suggestions for what I should try or look at? I'm not exactly a noob, but I haven't touched enterprise hardware in quite some time.

FlyingHacker · Jul 6, 2022

I am a newbie at this, but here is some info:

I use an HBA330 that I bought pre-flashed from ArtOfTheServer (the YouTube channel linked below). It shows this version in iDRAC:

Dell HBA330 Mini ARTofSERVER

16.17.00.05

How to crossflash Dell H330 to IT mode firmware

In this video, I'm going to show you how to flash a Dell H330 MegaRAID controller with HBA IT mode firmware from the HBA330 controller. This method is brough...

www.youtube.com

Flash/Crossflash DELL H330 RAID Card to HBA330/12Gbps HBA IT Firmware

Success! You can now flash the Dell PERC H330 (All Models) to IT Mode (HBA330). Big Thanks to BLinux for asking if it could be done! (Inspiration!) So I initially had a super long write-up, but that's way too long, you just need the compact steps. I linked the write up at the end of the...

forums.servethehome.com

Hope this is useful to you. I haven't seen errors like the ones you report with my HGST SAS drives.
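The two firmware versions in this thread (16.17.01.00 on the original poster's HBA330 Mini, 16.17.00.05 on this pre-flashed card) are easiest to compare field by field as integers. A small sketch; version_key and newer are illustrative names, and it assumes purely numeric, dot-separated fields.

```python
def version_key(version):
    """Split a dotted firmware version like '16.17.01.00' into integers so that
    comparisons are numeric per field rather than character by character."""
    return tuple(int(part) for part in version.split("."))


def newer(a, b):
    """Return whichever of two dotted versions is higher (a if they are equal)."""
    return a if version_key(a) >= version_key(b) else b
```

newer("16.17.01.00", "16.17.00.05") returns "16.17.01.00", so the card in the original post is already on the later of the two. Comparing raw strings happens to work for these zero-padded versions but breaks on unpadded ones such as "16.9" versus "16.17".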

sakodak · Jul 6, 2022

This is an HBA330, not an H330. Crossflashing an H330 basically turns it into an HBA330. But I really appreciate you trying to help, thank you.

FlyingHacker · Jul 6, 2022

Ah, oops! Good luck. I'll be following this thread anyway.

My R730XD came with a PERC H730P, but I replaced it with the card above.

Samuel Tai · Jul 6, 2022

Forum search is your friend.

SOLVED - Failed perc damaged multiple disks, how to fix?

I had an H710 PERC that failed in my Dell R720xd. I replaced it with a new one, and when I started SCALE, the pools were ruined, with several platter disks showing as faulted or degraded. Since this is a new setup, there was no data in the pools, so I decided to destroy all pools and start over...

www.truenas.com

sakodak · Jul 17, 2022

After swapping out the backplane (which made things worse) and the cables, the culprit turned out to be a bad HBA330. The vendor replaced it, and everything seems fine so far. Granted, it's been less than an hour since I swapped the HBA, so I might be jumping the gun, but previously I'd get ZFS errors and device resets whenever I put even a moderate amount of I/O on the disks. I've been running tasks for a while now and still have no errors or device resets at all.

Samuel: I did, in fact, search the forums before posting. That thread is about a different HBA and a different server, so I figured its relevance would be minimal. I had already tried cleaning the contact pads on the HBA with IPA; I probably should have mentioned that.
