New drive failed after a few days

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
Hi

I recently built a new system and added 12x 20TB Seagate EXOS drives.

My pool is configured as 2 vdevs, each a 6-wide RAIDZ2.

I transferred 40TB of data from my Synology to this pool (my Plex media).

On the pool I have created one dataset, media, and a zvol for Veeam - see below.

1694761949178.png


This morning I woke up to a bunch of emails stating that the pool is degraded.

1694761987620.png


I ran a short SMART test and then captured the output below. I have also kicked off a long test.

Code:
    root@S-STORE01[~]# smartctl -a /dev/da11
    smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Device Model:     ST20000NM007D-3DJ103
    Serial Number:    ZVT86SC8
    LU WWN Device Id: 5 000c50 0e65d9382
    Firmware Version: SN03
    User Capacity:    20,000,588,955,648 bytes [20.0 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-4 (minor revision not indicated)
    SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Fri Sep 15 08:08:23 2023 BST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      ( 249) Self-test routine in progress...
                                            90% of test remaining.
    Total time to complete Offline
    data collection:                (  567) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        (1732) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x70bd) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       73831479
      3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       2
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   067   060   045    Pre-fail  Always       -       4881657
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       161
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       2
     18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       86
    190 Airflow_Temperature_Cel 0x0022   059   045   000    Old_age   Always       -       41 (Min/Max 35/44)
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       278
    194 Temperature_Celsius     0x0022   041   047   000    Old_age   Always       -       41 (0 28 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       14654
    200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
    240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       36 (156 4 0)
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       11889963464
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       148687660

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Self-test routine in progress 90%       161         -
    # 2  Short offline       Completed without error       00%       160         -

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.


I don't understand how this brand new drive has failed in the space of a few days. Can someone advise me on what to do? The SMART test shows no errors logged.

Do I need to replace the drive?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Post smartctl output in code blocks please - it's easier to read.
Let the SMART test complete - it still has 90% remaining.
When the test has completed, try reseating the cable.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
serverboy said: "I don't understand how this brand new drive has failed in the space of a few days."
Sh!t happens… Drives failing within a few days of powering on is rare but not unheard of; that's the point of "burning in" new hardware.
CRC errors may also be a cable issue, hence the suggestion to reseat the cable, or to test with another cable/swap cables if the long SMART test does not report any further issue.
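
As a rough illustration of the "disk or cable?" reasoning above, the sketch below parses attribute rows in the layout that `smartctl -a` printed for this drive and separates media-related counters from the link-level CRC counter. The grouping of attribute IDs and the wording of the verdicts are assumptions for illustration, not a smartmontools feature.

```python
import re

# Attribute IDs commonly watched for this decision. This grouping is an
# assumption based on common practice, not an official classification.
MEDIA_IDS = {5, 187, 197, 198}   # reallocated / uncorrectable / pending sectors
LINK_IDS = {199}                 # UDMA CRC errors: corruption on the data link

# One row of the "Vendor Specific SMART Attributes" table.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"   # ID, name, flag
    r"\s+\d+\s+\d+\s+\d+"                     # value, worst, thresh
    r"\s+\S+\s+\S+\s+\S+"                     # type, updated, when_failed
    r"\s+(\d+)"                               # leading integer of the raw value
)

def parse_attributes(smartctl_text):
    """Return {id: (name, raw_value)} for every attribute row found."""
    attrs = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return attrs

def triage(attrs):
    """Very coarse verdict: media counters outrank link counters."""
    media = sum(raw for i, (_, raw) in attrs.items() if i in MEDIA_IDS)
    link = sum(raw for i, (_, raw) in attrs.items() if i in LINK_IDS)
    if media > 0:
        return "suspect disk"
    if link > 0:
        return "suspect signal path"
    return "no obvious fault"

# Rows copied from the smartctl output posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   045   000    Old_age   Always       -       41 (Min/Max 35/44)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       14654
"""

if __name__ == "__main__":
    print(triage(parse_attributes(SAMPLE)))
```

With the posted values the media counters are all zero and only attribute 199 is non-zero, which is why the replies point at the cable or backplane before the disk.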
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
Hi, thanks for the reply.

In regards to the burn-in, I have copied 100TB of data across these drives during the clone process from my Synology. This went fine.

The disk is connected to a backplane, not individually connected via SATA cables.

Just thinking: if it were a cable, wouldn't that whole row of disks have died or experienced the same issue?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Hmmm - probably. Turn the NAS off, and swap the faulted disk with another disk (from the same vdev) and then switch it back on. Does the disk come back and resilver or is there an issue with the swapped drive?
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
@NugentS appreciate the help on this.

I don't have a spare at the moment. I have ordered 2 more drives which shall arrive tomorrow morning.
I am currently running a long smart test on all drives.
So, from your comment, do I do these steps in this order?
  1. Turn off the server
  2. remove faulted drive
  3. remove a working drive
  4. swap them round.
  5. power server on
Do I not need to go to the pool status, select the ellipsis on the faulty drive and select Replace, or do I only do this when I am actually replacing it with a brand new drive?

Drive ordering doesn't matter in ZFS, right? On my Synology, drives have to go back in the right order if they were removed, otherwise it will destroy the RAID.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
serverboy said: "In regards to the burn-in, I have copied 100TB of data across these drives during the clone process from my Synology. This went fine."
That's not a burn-in in my book. I am paranoid enough to let a drive run for at least a month under load. With my current NAS I actually ran the new drives for 3 months. But that was still not enough. I lost 2 drives after 6 and 8 months respectively.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Whereas I use badblocks to run a full pass over the drives.

Drive ordering doesn't matter in ZFS. Do not "replace" anything - just swap the drives.
1. Does the "faulty" drive come back and resilver?
2. Does the swapped drive have issues in the old faulty drive's slot?

This will help track down the issue.
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
OK, I will try this tonight and report back. I have requested an RMA on the disk from where I purchased it in the meantime.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
There may be nothing wrong with the disk
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
serverboy said: "I have requested an RMA on the disk from where I purchased it in the meantime."
Nothing you've posted indicates that anything is wrong with the disk.
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
danb35 said: "Nothing you've posted indicates that anything is wrong with the disk."
It is still currently running a long SMART test.

I have just attached the output of the messages log, which is filled with errors.

Any idea what these mean?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'm not going to download and sort through a 6+MB attachment to try to figure it out. If there are specific errors in there you're concerned about, post them and we can have a look.

To answer the question that's the subject of this thread, infant mortality happens. It's not super-common, but it still does. Devices are dead on arrival, or die very soon after being put into use. This really shouldn't be surprising. I don't see any indication that it's what's happened to you, but it still shouldn't result in a "I don't understand how this could happen" reaction.

But the bottom line is that all the SMART data you've posted so far is fine as far as I can see, which means the likelihood that this is a disk problem is very low. You didn't do anything to burn in the disk before you put it into production, which is poor planning, and it's running warmer than we generally like to see, but it isn't showing bad or reallocated sectors and it isn't (yet) failing SMART tests. The other "error rate" parameters use the weird Seagate encoding that makes it hard to get usable information from them. The CRC errors reported in the SMART data suggest a problem in the signal path, as has already been suggested. If the long SMART test fails, that would change my answer, but at this point I don't see any reason to suspect the disk.
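
For readers wondering about the "weird Seagate encoding": a commonly cited community interpretation (an assumption here, not something confirmed by vendor documentation in this thread) is that the raw value of attributes 1 and 7 is a 48-bit number whose upper 16 bits count errors and whose lower 32 bits count operations. A minimal sketch of that split, applied to the two raw values posted above:

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (error_count, operation_count).

    Assumes the widely repeated convention: upper 16 bits = errors,
    lower 32 bits = total operations. Treat the result as a hint only.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations

# Raw values taken from the smartctl output in this thread.
for name, raw in [("Raw_Read_Error_Rate", 73831479), ("Seek_Error_Rate", 4881657)]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors} operations={operations}")
```

Both raw values are below 2^32, so under this interpretation the error portion is zero for each, which fits the reading that these large numbers are not in themselves a sign of trouble.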
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
I am not saying it is impossible that it can happen, just shocked that it has happened.

I originally built the pool with a record size of 128K, the default, as I forgot to change it. So I destroyed the pool and recreated it with 1M, as it will be storing Plex data consisting of large media files.

The output from the short SMART test does indicate the disk is fine. However, the messages log from the day I recreated the pool shows the following.

Code:
Sep 13 18:23:23 S-STORE01 mpr0: Controller reported scsi ioc terminated tgt 11 SMID 1249 loginfo 31110d01
Sep 13 18:23:23 S-STORE01 (da11:mpr0:0:11:0): WRITE(10). CDB: 2a 00 07 59 8f 20 00 08 00 00
Sep 13 18:23:23 S-STORE01 (da11:mpr0:0:11:0): CAM status: CCB request completed with an error
Sep 13 18:23:23 S-STORE01 (da11:mpr0:0:11:0): Retrying command, 3 more tries remain
Sep 13 18:23:24 S-STORE01 (da11:mpr0:0:11:0): WRITE(10). CDB: 2a 00 07 59 8f 20 00 08 00 00
Sep 13 18:23:24 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 13 18:23:24 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 13 18:23:24 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Sep 13 18:23:24 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)
Sep 13 18:26:03 S-STORE01 mpr0: Controller reported scsi ioc terminated tgt 11 SMID 298 loginfo 31110d01
Sep 13 18:26:03 S-STORE01 (da11:mpr0:0:11:0): WRITE(10). CDB: 2a 00 0a 00 19 a0 00 08 00 00
Sep 13 18:26:03 S-STORE01 (da11:mpr0:0:11:0): CAM status: CCB request completed with an error
Sep 13 18:26:03 S-STORE01 (da11:mpr0:0:11:0): Retrying command, 3 more tries remain
Sep 13 18:26:04 S-STORE01 (da11:mpr0:0:11:0): WRITE(10). CDB: 2a 00 0a 00 19 a0 00 08 00 00
Sep 13 18:26:04 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 13 18:26:04 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 13 18:26:04 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)


Those messages continued through to the 14th, then the log started to show this:

Code:
Sep 14 19:37:09 S-STORE01 mpr0: Controller reported scsi ioc terminated tgt 11 SMID 401 loginfo 31110d01
Sep 14 19:37:09 S-STORE01 (da11:mpr0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 05 62 b1 b6 b8 00 00 08 00 00 00
Sep 14 19:37:09 S-STORE01 (da11:mpr0:0:11:0): CAM status: CCB request completed with an error
Sep 14 19:37:09 S-STORE01 (da11:mpr0:0:11:0): Retrying command, 3 more tries remain
Sep 14 19:37:10 S-STORE01 (da11:mpr0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 05 62 b1 b6 b8 00 00 08 00 00 00
Sep 14 19:37:10 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 14 19:37:10 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 14 19:37:10 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Sep 14 19:37:10 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)
Sep 14 20:28:01 S-STORE01 (da11:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 04 c4 eb 03 28 00 00 02 00 00 00
Sep 14 20:28:01 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 14 20:28:01 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 14 20:28:01 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,0 (SCSI parity error)
Sep 14 20:28:01 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)
Sep 14 20:28:07 S-STORE01 mpr0: Controller reported scsi ioc terminated tgt 11 SMID 1718 loginfo 31080000
Sep 14 20:28:07 S-STORE01 (da11:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 04 c4 eb 13 28 00 00 02 00 00 00
Sep 14 20:28:07 S-STORE01 (da11:mpr0:0:11:0): CAM status: CCB request completed with an error
Sep 14 20:28:07 S-STORE01 (da11:mpr0:0:11:0): Retrying command, 3 more tries remain
Sep 14 20:28:07 S-STORE01 (da11:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 04 c4 eb 11 28 00 00 02 00 00 00
Sep 14 20:28:07 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 14 20:28:07 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 14 20:28:07 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,0 (SCSI parity error)
Sep 14 20:28:07 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)
Sep 14 20:28:07 S-STORE01 mpr0: Controller reported scsi ioc terminated tgt 11 SMID 1633 loginfo 31080000


This continued into the early hours of the morning, and at around 3:20 AM the drive was marked as FAULTED:

Code:
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 71 ff fe a8 00 00 02 00 00 00
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,0 (SCSI parity error)
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 72 00 0e a8 00 00 02 00 00 00
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,0 (SCSI parity error)
Sep 15 03:19:42 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)
Sep 15 03:26:15 S-STORE01 GEOM_MIRROR: Device mirror/swap0 launched (3/3).
Sep 15 03:26:15 S-STORE01 GEOM_MIRROR: Device mirror/swap1 launched (3/3).
Sep 15 03:26:15 S-STORE01 GEOM_MIRROR: Device mirror/swap2 launched (3/3).
Sep 15 03:26:15 S-STORE01 GEOM_MIRROR: Device mirror/swap3 launched (3/3).
Sep 15 03:26:15 S-STORE01 GEOM_MIRROR: Device mirror/swap4 launched (3/3).
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Device mirror/swap0.eli created.
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Encryption: AES-XTS 128
Sep 15 03:26:15 S-STORE01 GEOM_ELI:     Crypto: accelerated software
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Device mirror/swap1.eli created.
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Encryption: AES-XTS 128
Sep 15 03:26:15 S-STORE01 GEOM_ELI:     Crypto: accelerated software
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Device mirror/swap2.eli created.
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Encryption: AES-XTS 128
Sep 15 03:26:15 S-STORE01 GEOM_ELI:     Crypto: accelerated software
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Device mirror/swap3.eli created.
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Encryption: AES-XTS 128
Sep 15 03:26:15 S-STORE01 GEOM_ELI:     Crypto: accelerated software
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Device mirror/swap4.eli created.
Sep 15 03:26:15 S-STORE01 GEOM_ELI: Encryption: AES-XTS 128
Sep 15 03:26:15 S-STORE01 GEOM_ELI:     Crypto: accelerated software
Sep 15 03:26:15 S-STORE01 (da11:mpr0:0:11:0): READ(6). CDB: 08 01 8f 30 30 00
Sep 15 03:26:15 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 15 03:26:15 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 15 03:26:15 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,0 (SCSI parity error)
Sep 15 03:26:15 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)
Sep 15 04:40:17 S-STORE01 (da11:mpr0:0:11:0): READ(6). CDB: 08 02 ff e8 38 00
Sep 15 04:40:17 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 15 04:40:17 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 15 04:40:17 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,0 (SCSI parity error)
Sep 15 04:40:17 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)
Sep 15 08:19:42 S-STORE01 (da11:mpr0:0:11:0): READ(10). CDB: 28 00 00 00 67 d0 00 01 00 00
Sep 15 08:19:42 S-STORE01 (da11:mpr0:0:11:0): CAM status: SCSI Status Error
Sep 15 08:19:42 S-STORE01 (da11:mpr0:0:11:0): SCSI status: Check Condition
Sep 15 08:19:42 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,0 (SCSI parity error)
Sep 15 08:19:42 S-STORE01 (da11:mpr0:0:11:0): Retrying command (per sense data)


The drive is connected to a backplane with other drives in that row. I will shut the server down and swap it around with another drive to see what happens.
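
To get an overview of a large messages log without reading it line by line, something like the sketch below can tally the SCSI sense results per device and decode the CDB bytes into a starting LBA and block count. The regular expressions are written against the line layout shown in the excerpts above and are an assumption about the rest of the file; the CDB field offsets follow the standard READ/WRITE (6), (10) and (16) layouts.

```python
import re
from collections import Counter

# Patterns matched against the log layout seen in this thread (assumed stable).
SENSE = re.compile(r"\((da\d+):[^)]*\): SCSI sense: (.+?) asc:")
CDB = re.compile(r"\((da\d+):[^)]*\): (READ|WRITE)\((\d+)\)\. CDB: ([0-9a-f ]+)")

def decode_cdb(width, hex_bytes):
    """Return (lba, blocks) for a READ/WRITE CDB of the given width."""
    b = bytes.fromhex(hex_bytes.replace(" ", ""))
    if width == 6:
        lba = ((b[1] & 0x1F) << 16) | (b[2] << 8) | b[3]   # 21-bit LBA
        blocks = b[4] or 256                                # 0 means 256 blocks
    elif width == 10:
        lba = int.from_bytes(b[2:6], "big")
        blocks = int.from_bytes(b[7:9], "big")
    elif width == 16:
        lba = int.from_bytes(b[2:10], "big")
        blocks = int.from_bytes(b[10:14], "big")
    else:
        raise ValueError(f"unsupported CDB width: {width}")
    return lba, blocks

def tally_sense(log_text):
    """Count (device, sense key) pairs, e.g. ('da11', 'ABORTED COMMAND')."""
    counts = Counter()
    for line in log_text.splitlines():
        m = SENSE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

def decode_commands(log_text):
    """Return [(device, op, lba, blocks)] for every logged READ/WRITE."""
    out = []
    for line in log_text.splitlines():
        m = CDB.search(line)
        if m:
            dev, op, width, raw = m.groups()
            out.append((dev, op, *decode_cdb(int(width), raw)))
    return out

# Lines copied from the excerpts posted above.
SAMPLE_LOG = """\
Sep 13 18:23:24 S-STORE01 (da11:mpr0:0:11:0): WRITE(10). CDB: 2a 00 07 59 8f 20 00 08 00 00
Sep 13 18:23:24 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Sep 14 20:28:01 S-STORE01 (da11:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 04 c4 eb 03 28 00 00 02 00 00 00
Sep 14 20:28:01 S-STORE01 (da11:mpr0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,0 (SCSI parity error)
Sep 15 03:26:15 S-STORE01 (da11:mpr0:0:11:0): READ(6). CDB: 08 01 8f 30 30 00
"""

if __name__ == "__main__":
    for key, count in tally_sense(SAMPLE_LOG).items():
        print(key, count)
    for entry in decode_commands(SAMPLE_LOG):
        print(entry)
```

If every tallied entry names the same device, as in the excerpts here where only da11 appears, that narrows the problem to that drive, its slot, or its path rather than the whole backplane row. The failing commands are also spread across very different LBAs, which reads more like a transport problem than one bad region of the platter, though that is an inference rather than proof.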
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
Hi @NugentS
I have done what you said.

I checked all the cables; nothing was loose or looked odd.

I have swapped DA10 (working drive) with DA11 (FAULTED drive).

Booted the server back up. An email alert triggered, similar to what I got last night.

But when I check the zpool status, it now shows this:


1694812126744.png


Should I kick off another SMART test?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
danb35 said:
"Any pool errors clear on reboot, so that isn't much of a surprise.

If the last one didn't finish, yes. And when it finishes, if it succeeds, kick off a scrub."
Thanks @danb35, I will do this.
The test will finish in the early hours of Sunday.
I will report back.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
199 UDMA_CRC_Error_Count 0x003e 200 199 000 Old_age Always - 14654
I'd say you may have a bad data cable. You do not need to do a SMART long test in my opinion; just keep checking this value. If it increments, then you likely have a problem with the data cable. Power off and swap the data cable with an adjacent drive's, then see if the UDMA CRC errors keep occurring on the suspect drive (use the serial number to track this) or if the value starts incrementing on the adjacent drive.

It could be the controller as well, or it might be the hard drive itself but generally it is not.
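
The "watch the counter and track by serial number" approach can be sketched as a comparison of two snapshots of attribute 199, taken before and after the swap. The snapshot dictionaries, the neighbour's serial number and the verdict strings below are hypothetical; only ZVT86SC8 and its count of 14654 come from this thread.

```python
def crc_deltas(before, after):
    """Return {serial: increase} for drives whose CRC count rose."""
    return {
        serial: after[serial] - before[serial]
        for serial in after
        if serial in before and after[serial] > before[serial]
    }

def interpret(deltas, suspect_serial):
    """Coarse reading of where new CRC errors appeared after a swap."""
    if suspect_serial in deltas:
        return "errors follow the drive"
    if deltas:
        return "errors stay with the slot or cable"
    return "no new CRC errors yet"

# Hypothetical snapshots keyed by serial number. "NEIGHBOUR-SERIAL" is a
# placeholder for whichever drive now sits in the suspect slot.
before = {"ZVT86SC8": 14654, "NEIGHBOUR-SERIAL": 0}
after = {"ZVT86SC8": 14654, "NEIGHBOUR-SERIAL": 37}

print(interpret(crc_deltas(before, after), "ZVT86SC8"))
```

Keying on the serial number rather than the da device name matters because device names can change when drives are moved between slots. The counter never resets, so only an increase between snapshots is meaningful.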
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
These drives are in a 24-bay case. Each row has its own backplane with one SFF-8643 Mini SAS HD connector supporting the four drives. I would believe it was the cable if all four played up.

For the time being I have swapped a drive around.

From looking at the messages log, it looks like it started to happen when I destroyed the pool, created a new one and then started to copy all my media across.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
What firmware is on that HBA?
 