New drives being marked as failed constantly.

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
I'm running a Dell PowerEdge R530 with 196GB of ECC DDR4, 2x Xeon E5-2698 v3 CPUs, a PERC H330 flashed to IT mode, and an LSI SAS 9207-8e connected to a DS4246. With that out of the way, the issue I'm having is that a specific vDev keeps being marked as failed. I've tried reformatting the drives by replacing them one at a time with one another, reformatted in 4Kn (the drives support it), but the same thing happens. SMART data looks fine on the drives and there are no uncorrected errors. There are 4 other vDevs in the dataset with no issues. Any help would be awesome! The drives being used are HGST H7280A520SUN8.0T drives. If there's any more info needed, let me know! I'm also running TrueNAS-SCALE-22.12.4.2, as I need hardware transcoding with an older NVIDIA GPU.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Some manufacturers have poor defaults for some of the following:

TLER - Time-Limited Error Recovery (Seagate and other manufacturers may call it something different). Desktop drives can go to extreme lengths to recover a failing block, sometimes taking more than a minute. During that time the drive may not respond to ANY other request, causing ZFS to consider the drive failed. Most NAS-type drives default to 7 seconds, which is quite reasonable.

Auto head parking - Some drives are aggressive about automatically parking the heads. This can leave the drive not ready when a request arrives, again potentially causing a timeout.

As for how to check for these issues, sometimes it depends on the manufacturer. Sorry.
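If the drives speak ATA/SATA, smartctl and hdparm can usually show these settings. A rough sketch only - /dev/sdX is just a placeholder, not every drive honors these commands, and SAS drives expose this through different mode pages, so it may not apply to your shelf:

Code:
# Show the current SCT Error Recovery Control (TLER/ERC) timeouts, if supported
smartctl -l scterc /dev/sdX

# Set read and write recovery timeouts to 7.0 seconds (values are in tenths of a second)
smartctl -l scterc,70,70 /dev/sdX

# Look at the APM level and Load_Cycle_Count to spot aggressive head parking
smartctl -x /dev/sdX | grep -iE 'APM|Load_Cycle_Count'

# On drives that support APM, raise the level to discourage frequent parking
hdparm -B 254 /dev/sdX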

Anyway, one or more drives in a vDev with one of those problems can cause a vDev to be marked failed.

It can also help if you describe the vDev layout of your ZFS pool - is it RAID-Zx or mirrors? The output of zpool status shows the exact configuration, which you can post here in CODE tags.


Last, using the proper terminology can help. You say vDev, which sounds right. But later you say "4 other vDevs in the dataset", which is not correct. Datasets exist in a pool. So I think you meant "4 other vDevs in the pool".
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
Hey! Thank you for the quick response! Sorry for my terminology. I've been caught up with school and such and haven't had time to dig back into this for a while. Here's the output you were mentioning. Everything is RAID-Z1 as it worked best with what I had in mind.

Code:
root@truenas[~]# zpool status
 

  pool: Yanni Studio
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 804G in 09:16:02 with 188 errors on Thu Apr  4 15:06:25 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        Yanni Studio                              DEGRADED     0     0     0
          raidz1-0                                ONLINE       0     0     0
            7c3f6eec-6b41-41b6-911a-ed873b909fec  ONLINE       0     0     0
            1b37e351-d33e-4fc5-b622-c0228c1b44f1  ONLINE       0     0     0
            0bfbd6c4-d0f5-4490-a857-9ca5bdc2580b  ONLINE       0     0     0
            604da0dc-5b62-4210-bb52-714482f333c5  ONLINE       0     0     0
            92f3db72-24d3-44e2-af8b-17951a72f33b  ONLINE       0     0     0
          raidz1-1                                ONLINE       0     0     0
            f41d8a1d-f052-456d-9343-5280a3192c26  ONLINE       0     0     0
            2363d5b3-89f9-4586-9291-e2781962e79e  ONLINE       0     0     0
            db0f3488-3475-43cf-89c8-3cd27b960e41  ONLINE       0     0     0
            aaf62f30-53c7-4944-8fec-048303c45def  ONLINE       0     0     0
            148ff0cd-c3d9-4534-aafb-d955871f9319  ONLINE       0     0     0
          raidz1-2                                ONLINE       0     0     0
            b0564f03-24cc-4f06-83d2-7f71256d54ce  ONLINE       0     0     0
            ba56aaae-aa21-480f-9816-9fe1685cf8ac  ONLINE       0     0     0
            84812f61-b7b1-4775-aec0-effa72d5b345  ONLINE       0     0     0
            965195a9-ff64-41bf-946b-da4639583537  ONLINE       0     0     0
            4cce06cf-a844-484f-9b2f-4b3d51aeda62  ONLINE       0     0     0
          raidz1-3                                DEGRADED   164     0     0
            2095c23f-1486-4b1e-9bbc-33ffddf8c11c  ONLINE       0     0   916
            a44577db-861a-49ec-9825-02cd1368e5c2  DEGRADED   190     0   916  too many errors
            79e0dbf9-79d2-45a5-8399-4417dc61fe1e  ONLINE       0     0   916
            4dca8d3d-887f-4b5d-b968-72f3d2a3a872  ONLINE       0     0     0
            47aa4bda-41de-4f2d-bada-defaf1b8617b  FAULTED     57     0   916  too many errors
        spares
          e0fcc478-707b-46ef-826f-d2a1e7233829    AVAIL  

errors: 188 data errors, use '-v' for a list
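For reference, I think the next step is something like this to see which files the 188 errors hit, and then to reset the counters after I reseat the drives and cabling on that shelf - please correct me if that's wrong:

Code:
# List the individual files affected by the data errors
zpool status -v

# After fixing the underlying cause, reset the pool's error counters
zpool clear "Yanni Studio"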
 