New drive failed after a few days

serverboy · Sep 16, 2023

NugentS said:
What firmware is on that HBA?

Is there a way to find this out within TrueNAS, or do I need to press Ctrl+C on boot to go into the HBA BIOS?

OK, so I copied some files from the pool and a ton of those errors I posted above appeared in dmesg for da11. The UDMA CRC error count has now gone up to 51 on that drive I swapped around. The original FAULTED drive values have not been affected. Definitely something up with that slot. I will swap the cables on that row with another row and see if it affects any drive there.

EDIT: Added the images.

joeschmuck · Sep 16, 2023

serverboy said:
The original FAULTED drive values have not been affected. Definitely something up with that slot.

I want to make sure that you are aware, UDMA_CRC_Errors count will never go back to zero. It is recorded forever on the drive electronics. Don't ask anyone why that is, it would be nice to have the ability to reset it once the problem has been fixed but that is not the case. I'm only saying this to ensure that you know this fact and do not get upset later when you cannot return the count back to zero.

As for the HBA firmware, please note that I don't really keep up on this aspect of the forum as I do not use an LSI HBA, but I thought P20 was the current version, not P16. Please do not take any action due to my comment; wait for someone who knows a lot more than I do on this topic to give you advice. I can't say if this would be the problem for the UDMA_CRC_Errors either. I wouldn't think so, but stranger things have happened.
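
Since the counter never resets, what matters is whether it keeps climbing. A minimal sketch of checking that, assuming the attribute table layout shown by `smartctl -A` (ID in the first column, raw value in the tenth); the function names `crc_raw` and `crc_delta` are made up for this example:

```shell
#!/bin/sh
# crc_raw: read `smartctl -A` output on stdin and print the RAW_VALUE
# (10th column) of attribute ID 199, UDMA_CRC_Error_Count.
crc_raw() {
    awk '$1 == 199 { print $10 }'
}

# crc_delta BASELINE CURRENT: print how far the counter has moved.
# The counter cannot be reset, so only a non-zero delta is interesting.
crc_delta() {
    echo $(( $2 - $1 ))
}

# Hypothetical usage on the NAS (the device name is only an example):
#   baseline=$(smartctl -A /dev/da11 | crc_raw)
#   ... run some transfers ...
#   crc_delta "$baseline" "$(smartctl -A /dev/da11 | crc_raw)"
```

A delta of 0 after heavy transfers suggests the link problem is gone, even though the absolute number stays on the drive.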

danb35 · Sep 16, 2023

joeschmuck said:
but I thought P20 was the current version, not P16.

I think P16 is current for the 3008-based HBAs, but I also am uncertain.

Etorix · Sep 16, 2023

P20 is for LSI 2008 (PCIe 2.0); P16 is for LSI 3008 (PCIe 3.0). Firmware is not the issue here.
If the drive which has been moved to this slot now has CRC errors it is indeed a cable issue—except that the "cable" here consists of traces in the backplane.
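
On the earlier question of reading the HBA firmware version without rebooting into the card's BIOS: on FreeBSD-based TrueNAS the mpr/mps driver normally logs the firmware version when the card attaches, so it can usually be pulled from the boot messages. A rough sketch, assuming a log line of the form `mpr0: Firmware: <version>, Driver: <version>` (the exact format is an assumption, and `hba_firmware` is a made-up helper name):

```shell
#!/bin/sh
# hba_firmware: read kernel boot messages on stdin and print
# "<controller> <firmware version>" for each mpr/mps attach line.
# Assumes lines shaped like "mpr0: Firmware: 16.00.01.00, Driver: ...".
hba_firmware() {
    sed -n 's/^\(mp[rs][0-9]*\): Firmware: \([0-9.]*\),.*/\1 \2/p'
}

# Hypothetical usage on the NAS:
#   dmesg | hba_firmware
```

If nothing is printed, the attach message may have scrolled out of the dmesg buffer; the vendor's `sas3flash -list` utility is another commonly used way to read the version on 3008-based cards.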

serverboy · Sep 16, 2023

Thanks for the input, everyone. I really appreciate it.

So today I moved the cable from the 7-11 port on the HBA to port 0-3, and the problem followed it to the drive at the end of the row, DA03, when I kicked off some transfers.
The bottom row of drives is empty at the moment, so I moved the cable from slot 20-23 to the 7-11, booted it back up and kicked off some transfers.
On the console I kept this running:

Code:

tail -n 1000 -f /var/log/messages

No errors appeared at all, so it looks like that cable is dodgy.
This is my first time dealing with SAS cables and HBAs. I would have thought that if the cable was broken, all drives on that row would have been affected.
I am going to kick off a badblocks run on the whole server now and see if anything comes of it.

Do these cables suffer from some sort of bend radius limit? The only thing I can think of is that the slight bend in the cable has damaged it somehow.
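
Rather than watching the log scroll by, the same check can be reduced to a count per device. A small sketch, assuming the kernel messages name the device as a whole word (for example `da11`); `count_device_lines` is a made-up helper name:

```shell
#!/bin/sh
# count_device_lines DEVICE: read log text on stdin and print how many
# lines mention DEVICE as a whole word ("da11" matches, "da110" does not).
count_device_lines() {
    grep -c -w -- "$1"
}

# Hypothetical usage on the NAS:
#   tail -n 1000 /var/log/messages | count_device_lines da11
```

Running it before and after a batch of transfers, once per drive on the row, shows whether the errors follow the cable or stay with one slot.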

serverboy · Sep 16, 2023

Etorix said:
P20 is for LSI 2008 (PCIe 2.0); P16 is for LSI 3008 (PCIe 3.0). Firmware is not the issue here.
If the drive which has been moved to this slot now has CRC errors it is indeed a cable issue—except that the "cable" here consists of traces in the backplane.

What do you mean by traces?

serverboy · Sep 16, 2023

joeschmuck said:
I want to make sure that you are aware, UDMA_CRC_Errors count will never go back to zero. It is recorded forever on the drive electronics. Don't ask anyone why that is, it would be nice to have the ability to reset it once the problem has been fixed but that is not the case. I'm only saying this to ensure that you know this fact and do not get upset later when you cannot return the count back to zero.

As for the HBA firmware, please note that I don't really keep up on this aspect of the forum as I do not use an LSI HBA, but I thought P20 was the current version, not P16. Please do not take any action due to my comment; wait for someone who knows a lot more than I do on this topic to give you advice. I can't say if this would be the problem for the UDMA_CRC_Errors either. I wouldn't think so, but stranger things have happened.

I know, it's a shame you can't wipe them, but it is what it is I guess.

joeschmuck · Sep 16, 2023

serverboy said:
What do you mean by traces?

You can disregard that; you have isolated it to the data cable. The reference was to the backplane itself sometimes being the issue, and given the data available at the time it was a viable possibility.

Yes, sometimes data cables just fail, without anyone touching them. You should replace it of course.

serverboy said:
I know, it's a shame you can't wipe them, but it is what it is I guess.

Yup

serverboy · Sep 16, 2023

joeschmuck said:
You can disregard that; you have isolated it to the data cable. The reference was to the backplane itself sometimes being the issue.

Yes, sometimes data cables just fail, without anyone touching them. You should replace it of course.

Yup

I have ordered 2 more from Amazon, the ones I posted in my previous comments.

This has been a learning curve for me.

joeschmuck · Sep 16, 2023

serverboy said:
This has been a learning curve for me.

Most of us learn through the failures we encounter. Thankfully yours was an easy one.

ChrisRJ · Sep 16, 2023

joeschmuck said:
Most of us learn through the failures we encounter. Thankfully yours was an easy one.

"Good judgment comes from experience. And experience comes from bad judgement." Mark Horstman from Manager-Tools.com

serverboy · Sep 18, 2023

Hi everyone,
After the long SMART test ran, all drives reported back with no errors found. I copied some files off the share to my PC, which is what usually caused the errors to appear in the dmesg logs. Nothing has come up so far.

I am now running a scrub on the pool which is triggering a lot of email alerts.

Pool DATA01 state is ONLINE: One or more devices has experienced an
unrecoverable error. An attempt was made to correct the error. Applications
are unaffected.

The scrub is currently at 10%, but the pool is now showing as unhealthy. However, if I check the pool status, all disks are online.

Code:

root@S-STORE01[~]# zpool status
  pool: DATA01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Mon Sep 18 13:41:33 2023
        65.8T scanned at 22.5G/s, 8.01T issued at 2.74G/s, 65.8T total
        256K repaired, 12.16% done, 06:00:32 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        DATA01                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f01b9ed9-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f04a29f8-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f037c724-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f08b59ed-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f024fe29-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f0a56333-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/efd1f769-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f040d7ed-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f02f2a10-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f094735f-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f08184cc-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f09df0dd-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     1  (repairing)

errors: No known data errors

  pool: NVME01
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        NVME01                                        ONLINE       0     0     0
          gptid/ade1acfb-4504-11ee-a3bf-7cc25504333c  ONLINE       0     0     0
          gptid/ade44197-4504-11ee-a3bf-7cc25504333c  ONLINE       0     0     0
          gptid/ade02594-4504-11ee-a3bf-7cc25504333c  ONLINE       0     0     0
          gptid/ade67d57-4504-11ee-a3bf-7cc25504333c  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:06 with 0 errors on Thu Sep 14 03:45:06 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0

errors: No known data errors
root@S-STORE01[~]#

I believe the drive that it says is repairing is the one that was having the CRC errors.

Am I right in thinking the pool is corrupted, or is the scrub doing its thing and fixing it?

NugentS · Sep 18, 2023

Is da11 a drive you were having issues with? That is reporting a checksum error.
Keep an eye on that. If it increases, you still have a problem. Let the scrub complete and then have a look.
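
One way to keep an eye on it is to filter the status output down to the rows whose counters are not zero. A minimal sketch, assuming the `NAME STATE READ WRITE CKSUM` column layout shown in the `zpool status` output above; `nonzero_errors` is a made-up helper name:

```shell
#!/bin/sh
# nonzero_errors: read `zpool status` output on stdin and print
# "NAME READ WRITE CKSUM" for every pool, vdev or disk row whose
# error counters are not all 0.
nonzero_errors() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
         ($3 != 0 || $4 != 0 || $5 != 0) { print $1, $3, $4, $5 }'
}

# Hypothetical usage on the NAS:
#   zpool status DATA01 | nonzero_errors
```

No output means every counter is zero. On TrueNAS CORE the gptid names can be matched to da devices with `glabel status`.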

serverboy · Sep 18, 2023

NugentS said:
Is da11 a drive you were having issues with? That is reporting a checksum error.
Keep an eye on that. If it increases, you still have a problem. Let the scrub complete and then have a look.

Yes, that was the one.

OK, I will report back once it has finished. It should be done later this evening.

Can those errors be cleared, or does the drive need to be RMA'd?

serverboy · Sep 18, 2023

Hi @NugentS, it looks like the scrub has finished. Drive da11 still shows a checksum error count of 1 and the pool is still in an unhealthy state.

I have run a short SMART test on da11.

Code:

root@S-STORE01[~]# zpool status
  pool: DATA01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 256K in 07:11:56 with 0 errors on Mon Sep 18 20:53:29 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        DATA01                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f01b9ed9-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f04a29f8-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f037c724-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f08b59ed-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f024fe29-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f0a56333-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/efd1f769-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f040d7ed-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f02f2a10-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f094735f-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f08184cc-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     0
            gptid/f09df0dd-5258-11ee-abb2-7cc25504333c  ONLINE       0     0     1

errors: No known data errors

Code:

root@S-STORE01[~]# smartctl -a /dev/da11
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST20000NM007D-3DJ103
Serial Number:    ZVT86SC8
LU WWN Device Id: 5 000c50 0e65d9382
Firmware Version: SN03
User Capacity:    20,000,588,955,648 bytes [20.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep 18 21:15:54 2023 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1732) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       63999888
  3 Spin_Up_Time            0x0003   090   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   045    Pre-fail  Always       -       32477741
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       244
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       10
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       87
190 Airflow_Temperature_Cel 0x0022   060   045   000    Old_age   Always       -       40 (Min/Max 37/55)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       290
194 Temperature_Celsius     0x0022   040   055   000    Old_age   Always       -       40 (0 28 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   197   000    Old_age   Always       -       15479
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       119 (145 117 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       12057994624
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       11981429584

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       244         -
# 2  Extended offline    Completed without error       00%       216         -
# 3  Short offline       Completed without error       00%       189         -
# 4  Extended offline    Interrupted (host reset)      00%       187         -
# 5  Short offline       Completed without error       00%       174         -
# 6  Extended offline    Interrupted (host reset)      00%       174         -
# 7  Short offline       Completed without error       00%       160         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@S-STORE01[~]#

I ran zpool clear DATA01, which removed the checksum error count and restored the pool to a healthy state.
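
Clearing only resets the counters, so a follow-up scrub is the usual way to confirm nothing new turns up. A small sketch for reading the result, assuming the `scan: scrub repaired ... with N errors` line format shown in the output above; `scrub_summary` is a made-up helper name:

```shell
#!/bin/sh
# scrub_summary: read `zpool status` output on stdin and, for each
# finished scrub, print "<repaired> <errors>" taken from the scan line,
# e.g. "scan: scrub repaired 256K in 07:11:56 with 0 errors on ...".
# A scrub that is still in progress produces no output.
scrub_summary() {
    awk '$1 == "scan:" && $2 == "scrub" && $3 == "repaired" {
        for (i = 4; i <= NF; i++)
            if ($i == "with") { print $4, $(i + 1); break }
    }'
}

# Hypothetical usage on the NAS, after the pool has been cleared:
#   zpool scrub DATA01
#   ... wait for the scrub to finish ...
#   zpool status DATA01 | scrub_summary
```

A clean follow-up scrub would be expected to report nothing repaired and 0 errors, with the CKSUM column staying at 0.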

joeschmuck · Sep 19, 2023

I had said before to monitor the UDMA_CRC_Errors, maybe you forgot but they did increase in value, quite a bit actually.

serverboy · Sep 19, 2023

joeschmuck said:
I had said before to monitor the UDMA_CRC_Errors, maybe you forgot but they did increase in value, quite a bit actually.

Hi
They increased as I was testing the drive when moving it around bays and doing some transfers. It hasn't moved since I fixed the cable issue.

I ran another scrub last night, which completed successfully with no checksum errors.

No more messages have appeared in the logs like they did before. Unfortunately those values will stay on my drive now though. I do have 2 cold spares. But I am tempted to just RMA the drive to get a fresh one.

joeschmuck · Sep 19, 2023

serverboy said:
But I am tempted to just RMA the drive to get a fresh one.

I'm not sure UDMA_CRC_Errors is something you can RMA, but it's worth a try if that is the path you would like to go down.

serverboy said:
They increased as I was testing the drive when moving it around bays and doing some transfers. It hasn't moved since I fixed the cable issue.

Sorry for jumping on you about it, I didn't realize the value hadn't increased since the problem was fixed. I looked at the last two SMART reports and saw the values were significantly different and jumped to a conclusion.

serverboy · Sep 19, 2023

joeschmuck said:
I'm not sure UDMA_CRC_Errors is something you can RMA, but it's worth a try if that is the path you would like to go down.

Sorry for jumping on you about it, I didn't realize the value hadn't increased since the problem was fixed. I looked at the last two SMART reports and saw the values were significantly different and jumped to a conclusion.

You are probably right. The drive is under warranty for 5 years, so if it does decide to go bang at some point I will just RMA it then. The pool has been healthy all day with no issues, so that is a positive.

Yeah, no worries bud. I'm just new to this stuff, so I appreciate everyone telling me how to troubleshoot this. The value jumped up quite high from before because every time I kicked off a transfer the messages log filled with a ton of those errors, so it looks like the link was dropping constantly during every attempt to access the drive.

I have two cold spares in the drawer so I should be OK. I'm just waiting to move everything off my Synology to this system.

joeschmuck · Sep 19, 2023

serverboy said:
I'm just waiting to move everything off my Synology to this system.

Well, wait until you are confident the NAS is trouble-free.
