SOLVED Degraded pool, drives fine, 200 year scrub estimate?

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
As the title states, I'm having a really odd issue. All 6 of the new drives I added are in a degraded or failed state, and I'm not sure what went wrong. They all passed SMART tests, and the SMART data looked healthy to me. Any advice?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What disks are these, where did you get them from, how are they connected... There's not much to go on in your post.
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
Hey! Sorry, this is my first post here. I'm running a PowerEdge R530 with a PERC H310 flashed to IT mode, plus an LSI 9207-8e connected to a NetApp DS4246. This pool has 3 vdevs. One of them started giving errors after I added 6 HGST drives that passed all SMART tests before I put them in. I'm not sure what information you need, but I can add anything else that would help: screenshots, etc. Thank you!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Let's start with zpool status and smartctl -x /dev/sdX, the latter for each of the affected disks.
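For anyone following along with several affected disks, collecting those reports can be scripted. This is only a rough sketch: the function names `smartctl_command` and `collect_reports` are made up for illustration, and the device nodes passed in are whatever your own system uses.

```python
import shutil
import subprocess


def smartctl_command(device):
    """Build the argument list for `smartctl -x` on one device node."""
    return ["smartctl", "-x", device]


def collect_reports(devices):
    """Run `smartctl -x` on each device and return {device: report text}.

    Returns an empty dict if smartctl is not installed on this machine.
    """
    if shutil.which("smartctl") is None:
        return {}
    reports = {}
    for dev in devices:
        # smartctl reports status through non-zero exit codes,
        # so a non-zero return code is not treated as a failure here.
        result = subprocess.run(
            smartctl_command(dev), capture_output=True, text=True
        )
        reports[dev] = result.stdout
    return reports
```

Calling `collect_reports(["/dev/sdab", "/dev/sdy"])` as root would return the two reports keyed by device node, ready to paste into a post.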
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
zpool status for the pool with issues:

Code:
  pool: Studio
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub canceled on Tue Dec  5 13:32:12 2023
config:

    NAME                                        STATE     READ WRITE CKSUM
    Studio                                      DEGRADED     0     0     0
      raidz1-0                                  ONLINE       0     0     0
        bf6894b2-4e90-47fe-be4c-e6be8ed41ec7    ONLINE       0     0     0
        90fce70e-b82b-4c68-a78f-adb830b12e68    ONLINE       0     0     0
        4ad6ec78-2967-4017-b372-927a7dca1e47    ONLINE       0     0     0
        58bd3274-a91b-4a33-a711-b2495b80e462    ONLINE       0     0     0
        426d713c-1171-4557-90bf-96b9da727041    ONLINE       0     0     0
      raidz1-1                                  ONLINE       0     0     0
        f41d8a1d-f052-456d-9343-5280a3192c26    ONLINE       0     0     0
        2363d5b3-89f9-4586-9291-e2781962e79e    ONLINE       0     0     0
        db0f3488-3475-43cf-89c8-3cd27b960e41    ONLINE       0     0     0
        aaf62f30-53c7-4944-8fec-048303c45def    ONLINE       0     0     0
        148ff0cd-c3d9-4534-aafb-d955871f9319    ONLINE       0     0     0
      raidz1-2                                  DEGRADED     0 1.94K     0
        6c8f8d99-40ee-4b14-9dc3-5bef3546b758    DEGRADED     0   963     1  too many errors
        18f69766-3fa0-4458-9225-6b0323730b66    DEGRADED     0 1.07K     1  too many errors
        96000c13-61ec-4278-832c-51bd90527dea    DEGRADED     0   788     0  too many errors
        spare-3                                 UNAVAIL      0    74     0  insufficient replicas
          b57fca1f-0bd9-42a0-aca8-56aa73553b90  FAULTED      0    64     0  too many errors
          77f58081-decd-4d38-b4c1-6139d0a1d5aa  FAULTED      0    75     0  too many errors
        2c7f1196-5415-4ac3-bd19-f6f7aa979442    DEGRADED     0   847     0  too many errors
    spares
      77f58081-decd-4d38-b4c1-6139d0a1d5aa      INUSE     currently in use

errors: No known data errors



Drive sdab
Code:
smartctl -x /dev/sdab
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721008AL4200
Revision:             A3Z4
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca27d0272c0
Serial number:        2SG1ARYF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Dec  5 15:46:33 2023 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     35 C
Drive Trip Temperature:        85 C

Manufactured in week 21 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  89
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1504
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 51113588672167936

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         0         3    3212507      14955.371           0
write:         0        0         0         0    3894982     232944.141           0
verify:        0        0         0         0      11774          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   34125                 - [-   -    -]
# 2  Background long   Completed                   -   34096                 - [-   -    -]
# 3  Background short  Completed                   -   34053                 - [-   -    -]
# 4  Background short  Completed                   -   34005                 - [-   -    -]
# 5  Background short  Completed                   -   33957                 - [-   -    -]
# 6  Background short  Completed                   -   33885                 - [-   -    -]
# 7  Background long   Completed                   -   33866                 - [-   -    -]
# 8  Background short  Completed                   -   33848                 - [-   -    -]

Long (extended) Self-test duration: 60239 seconds [1004.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 34165:02 [2049902 minutes]
    Number of background scans performed: 27,  scan progress: 0.00%
    Number of background medium scans performed: 27

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 2
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca27d0272c1
    attached SAS address = 0x5d09466071559e05
    attached phy identifier = 5
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 2
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca27d0272c2
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0




Drive sdy
Code:
smartctl -x /dev/sdy
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721008AL4200
Revision:             A3Z4
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca27d0649f4
Serial number:        2SG3G6TF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Dec  5 15:46:52 2023 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     42 C
Drive Trip Temperature:        85 C

Manufactured in week 23 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  92
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1509
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 54442452785823744

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      341         0       341    2998578      14958.617           0
write:         0        0         0         0    2576965     272073.797           0
verify:        0        0         0         0       6047          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   34134                 - [-   -    -]
# 2  Background long   Completed                   -   34102                 - [-   -    -]
# 3  Background short  Completed                   -   34062                 - [-   -    -]
# 4  Background short  Completed                   -   34014                 - [-   -    -]
# 5  Background short  Completed                   -   33966                 - [-   -    -]
# 6  Background short  Completed                   -   33894                 - [-   -    -]
# 7  Background long   Completed                   -   33876                 - [-   -    -]
# 8  Background short  Completed                   -   33857                 - [-   -    -]

Long (extended) Self-test duration: 65426 seconds [1090.4 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 34173:51 [2050431 minutes]
    Number of background scans performed: 40,  scan progress: 0.00%
    Number of background medium scans performed: 40

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca27d0649f5
    attached SAS address = 0x500a09800638aeff
    attached phy identifier = 26
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca27d0649f6
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0




The other 4 drives have a readout identical to sdab's.
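As a side note, the error columns in the zpool status output above can be tallied with a short script. This is a minimal sketch, not an official parser: `parse_count` and `error_rows` are hypothetical helper names, the sample text is just the raidz1-2 rows copied from the output above, and treating the `K` suffix as a multiple of 1024 is an assumption about how zpool abbreviates counts.

```python
import re

# The raidz1-2 rows copied from the zpool status output in this post.
SAMPLE = """\
      raidz1-2                                  DEGRADED     0 1.94K     0
        6c8f8d99-40ee-4b14-9dc3-5bef3546b758    DEGRADED     0   963     1  too many errors
        18f69766-3fa0-4458-9225-6b0323730b66    DEGRADED     0 1.07K     1  too many errors
        96000c13-61ec-4278-832c-51bd90527dea    DEGRADED     0   788     0  too many errors
        spare-3                                 UNAVAIL      0    74     0  insufficient replicas
          b57fca1f-0bd9-42a0-aca8-56aa73553b90  FAULTED      0    64     0  too many errors
          77f58081-decd-4d38-b4c1-6139d0a1d5aa  FAULTED      0    75     0  too many errors
        2c7f1196-5415-4ac3-bd19-f6f7aa979442    DEGRADED     0   847     0  too many errors
"""

# Assumption: abbreviated counts use powers of 1024.
SUFFIX = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def parse_count(text):
    """Turn a count such as '963' or '1.94K' into an approximate integer."""
    if text[-1] in SUFFIX:
        return round(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def error_rows(status_text):
    """Return (name, state, read, write, cksum) for each device row."""
    rows = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            name, state, read, write, cksum = match.groups()
            rows.append((name, state, parse_count(read),
                         parse_count(write), parse_count(cksum)))
    return rows


if __name__ == "__main__":
    for name, state, read, write, cksum in error_rows(SAMPLE):
        print(f"{name:40} {state:9} read={read} write={write} cksum={cksum}")
```

Run against the sample, every row in raidz1-2 shows zero read errors and a non-zero write count, which is the pattern that stands out in this output.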
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, the SMART reports on those disks are rather unhelpful and vague, but they do say the disks have been in near-continuous use for the past four years. I also see that one disk is connected at 12 Gb/s and the other only at 6 Gb/s. Since most of the errors reported are write errors and the disks aren't themselves complaining, you should share the details of the HBA(s) and expanders you're using. You were either sold crap drives salvaged after being replaced for having crapped out or have a bizarre SAS connectivity issue.
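The "near-continuous use" reading comes from comparing the accumulated power-on time with the manufacture date. Here is a quick sketch of that arithmetic using the sdab numbers; the helper names are made up, and treating "week 21 of year 2019" as an ISO week starting on its Monday is an assumption, so take the result as approximate.

```python
from datetime import date

HOURS_PER_YEAR = 24 * 365.25


def years_powered_on(hours):
    """Convert accumulated power-on hours to years."""
    return hours / HOURS_PER_YEAR


def duty_fraction(hours, made_year, made_week, report_day):
    """Fraction of the drive's life since manufacture spent powered on.

    Assumption: the manufacture week is treated as an ISO week,
    counted from its Monday.
    """
    made = date.fromisocalendar(made_year, made_week, 1)
    elapsed_hours = (report_day - made).days * 24
    return hours / elapsed_hours


if __name__ == "__main__":
    # sdab: 34165 power-on hours, made in week 21 of 2019,
    # report taken on Dec 5, 2023.
    print(f"{years_powered_on(34165):.2f} years powered on")
    print(f"{duty_fraction(34165, 2019, 21, date(2023, 12, 5)):.0%} of its life")
```

With those inputs the drive works out to a bit under four years of power-on time, most of the time elapsed since it was built.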
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
They're currently being run through two separate HBA cards. The drive connected at 6 Gb/s goes through the external LSI9208, which is connected to a DS4246. The 12 Gb/s drive is on the internal PERC hba310, which is connected to the server's backplane.
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
I also forgot to mention that the drive connected to the 6 Gb/s interface is in a "hot spare" vdev.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
No immediate red flags there, and if the other bays are all fine, I guess you were just sold scrap instead of working disks...
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
The only thing I was thinking is that a block size mismatch could be causing these read and write issues. I just realized these drives have 4096-byte logical blocks, while my other vdevs are made up of drives with 512-byte blocks. That's about all I can think of aside from being sold bad disks.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, TrueNAS defaults to ashift=12 to avoid that being a problem.
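For context, ashift is the base-2 exponent of the smallest block size ZFS will use on a vdev, so ashift=12 means 4096-byte alignment regardless of whether a drive reports 512-byte or 4096-byte logical blocks. A small sketch of the relationship (the function names are only illustrative):

```python
def ashift_for(sector_bytes):
    """Return the ashift value matching a power-of-two sector size."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


def block_bytes(ashift):
    """Return the block size in bytes implied by an ashift value."""
    return 1 << ashift


if __name__ == "__main__":
    for size in (512, 4096):
        print(f"{size}-byte sectors -> ashift={ashift_for(size)}")
    print(f"ashift=12 -> {block_bytes(12)}-byte blocks")
```

A 512-byte drive in an ashift=12 vdev simply receives writes aligned to 8 of its sectors, which is why mixing the two sector sizes is normally harmless.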
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
Ah, I didn't know that. Would it be a bad idea to format the drives one at a time with 512-byte blocks, in case that is somehow causing an issue with my HBA? Or could my pool be configured in a way that would make that an issue?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If the drives accept it, it's worth a shot.
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
According to what I can find online, they support it. I'll give it a shot and let you know if it worked. Should I offline the drive first?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Best to do so, yes.
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
Yeah, no dice. Same issue. I'm currently away from the server, but I'll be able to check in on it tomorrow. I'll try swapping out my LSI HBA, double-check the cables, and move the drives to different bays in the chassis to see if the issue follows them.
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
I ended up ordering new drives and was able to replace them all and resilver everything. I'm going to make a temporary pool with those 6 drives and see what I can do to get them working again.
 

CatchyGeo

Dabbler
Joined
Dec 5, 2023
Messages
14
I replaced the drives, and then was able to use HUGO to do a proper long reformat on the HDDs. I'm using them now without any errors.
No idea what the issue was at first, but I'd say it's resolved.
 

sammael

Explorer
Joined
May 15, 2017
Messages
76
Rookie numbers
[Attached image: 1704988930048.png]
 