HELP! Not sure how to proceed. Scrub shows degraded and faulted drive(s)

sretalla · Aug 31, 2022

Other than getting the HBAs flashed properly (which you certainly should do no matter what), you might want to consider 2 additional things...

Heat in your case, particularly where the HBA(s) sit. An overheating HBA will cook itself to death (maybe what you're seeing is the first signs of that).

Disks... did you check them against the SMR list? https://www.truenas.com/community/resources/list-of-known-smr-drives.141/

All drives in a pool going into a mess during a scrub sounds a bit like SMR disks to me... more so would be a resilver, but you're not quite there yet (maybe we'll see soon).

somewhatdamaged · Aug 31, 2022

11 disk RAIDZ1...balls of steel!! Good luck man, hope you manage to sort it out!

Demonlinx · Aug 31, 2022

Redcoat said:
sas2flash -list will get you the adapter info and show IT or IR mode:


Here's the output of the command once it's run:


LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller Number              : 0
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:03:00:00
        SAS Address                    : 590b11c-0-121d-5102
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : 07.11.10.00
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9211-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
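
For anyone who wants to script this check: the line that answers the IT/IR question is "Firmware Product ID", which ends in (IT) in the output above. Below is a minimal Python sketch, assuming the sas2flash -list text has already been captured into a string. The function name firmware_mode is made up for illustration and is not part of any LSI tool.

```python
import re
from typing import Optional


def firmware_mode(sas2flash_output: str) -> Optional[str]:
    """Return the mode tag (e.g. 'IT' or 'IR') from the 'Firmware Product ID' line.

    Returns None if the line is not present in the text.
    """
    match = re.search(
        r"Firmware Product ID\s*:\s*0x[0-9A-Fa-f]+\s*\((\w+)\)",
        sas2flash_output,
    )
    return match.group(1) if match else None


# Three lines copied from the output posted above.
sample = """\
        Controller                     : SAS2008(B2)
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
"""

print(firmware_mode(sample))
```

On the output posted here this reports IT, so the card is already in IT mode on firmware 20.00.07.00.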

Demonlinx · Aug 31, 2022

Alex_K said:
The DL180 G6 has an expander on the backplane.
If it's the 12x3.5" model, it should work OK with a SAS2008 in IT mode. I have 4 of these servers that have been running solid for years.

You can use
sas2ircu 0 display
to show which drive resides in which bay (0 is the controller number - you'll figure it out)
and
sas2ircu 0 locate <Encl:Bay> ON
to light an LED on a drive caddy of your choice


LSI Corporation SAS2 IR Configuration Utility.
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved.

Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS2008
  BIOS version                            : 7.11.10.00
  Firmware version                        : 20.00.07.00
  Channel description                     : 1 Serial Attached SCSI
  Initiator ID                            : 0
  Maximum physical devices                : 255
  Concurrent commands supported           : 3432
  Slot                                    : 4
  Segment                                 : 0
  Bus                                     : 3
  Device                                  : 0
  Function                                : 0
  RAID Support                            : No
------------------------------------------------------------------------
IR Volume information
------------------------------------------------------------------------
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 0
  SAS Address                             : 5001438-0-131f-7f41
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GL9KTP
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Enclosure services device
  Enclosure #                             : 2
  Slot #                                  : 0
  SAS Address                             : 5001438-0-131f-7f53
  State                                   : Standby (SBY)
  Manufacturer                            : HP
  Model Number                            : DL18xG6BP
  Firmware Revision                       : 2.20
  Serial No                               :
  GUID                                    : N/A
  Protocol                                : SAS
  Device Type                             : Enclosure services device

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 1
  SAS Address                             : 5001438-0-131f-7f42
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GK39NP
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 2
  SAS Address                             : 5001438-0-131f-7f43
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GL7E2P
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 3
  SAS Address                             : 5001438-0-131f-7f44
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GKYJ0P
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 4
  SAS Address                             : 5000c50-0-7f94-66d5
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : HP
  Model Number                            : MB2000FCWDF
  Firmware Revision                       : HPDA
  Serial No                               : S1X0DBB5
  GUID                                    : 5000c5007f9466d7
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 5
  SAS Address                             : 5001438-0-131f-7f46
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : Hitachi HUA72302
  Firmware Revision                       : AA10
  Serial No                               : MK0171YFHNPB2A
  GUID                                    : 5000cca223d77f19
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 6
  SAS Address                             : 5001438-0-131f-7f47
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GJSBLP
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 7
  SAS Address                             : 5000c50-0-4135-7629
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : HP
  Model Number                            : MB2000FBZPN
  Firmware Revision                       : HPD4
  Serial No                               : Z1P1NQTV
  GUID                                    : 5000c5004135762b
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 8
  SAS Address                             : 5000c50-0-413b-ef0d
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : HP
  Model Number                            : MB2000FBZPN
  Firmware Revision                       : HPD4
  Serial No                               : Z1P1QEHC
  GUID                                    : 5000c500413bef0f
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 9
  SAS Address                             : 5001438-0-131f-7f4a
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : Hitachi HUA72302
  Firmware Revision                       : A840
  Serial No                               : YGJ446PA
  GUID                                    : 5000cca224de1051
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 10
  SAS Address                             : 5001438-0-131f-7f4b
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GKXBYP
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 11
  SAS Address                             : 5001438-0-131f-7f4c
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GL7D6P
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD
------------------------------------------------------------------------
Enclosure information
------------------------------------------------------------------------
  Enclosure#                              : 1
  Logical ID                              : 590b11c0:121d5102
  Numslots                                : 8
  StartSlot                               : 0
  Enclosure#                              : 2
  Logical ID                              : 50014380:131f7f00
  Numslots                                : 13
  StartSlot                               : 0
------------------------------------------------------------------------
SAS2IRCU: Command DISPLAY Completed Successfully.
SAS2IRCU: Utility Completed Successfully.
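
To turn that listing into a quick bay-to-serial lookup, something like the following could work. It is only a sketch: it assumes the sas2ircu display text is already in a string and that stanzas are separated by blank lines, as in the paste above. The name slot_map is hypothetical.

```python
import re
from typing import Dict, Tuple


def slot_map(display_output: str) -> Dict[int, Tuple[str, str]]:
    """Map slot number -> (model, serial) for each 'Device is a Hard disk' stanza."""
    disks: Dict[int, Tuple[str, str]] = {}
    # Stanzas are separated by blank lines; skip anything that is not a hard disk
    # (for example the enclosure services device, which also reports a slot).
    for stanza in re.split(r"\n\s*\n", display_output):
        if "Device is a Hard disk" not in stanza:
            continue
        fields = {}
        for line in stanza.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        disks[int(fields["Slot #"])] = (fields["Model Number"], fields["Serial No"])
    return disks


# Shortened stanzas taken from the output above.
sample = """\
Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 0
  Model Number                            : HUS724020ALA640
  Serial No                               : P6GL9KTP

Device is a Enclosure services device
  Enclosure #                             : 2
  Slot #                                  : 0
  Model Number                            : DL18xG6BP
  Serial No                               :

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 4
  Model Number                            : MB2000FCWDF
  Serial No                               : S1X0DBB5
"""

for slot, (model, serial) in sorted(slot_map(sample).items()):
    print(slot, model, serial)
```

One thing to watch when matching against smartctl: sas2ircu shortens the SAS drives' serials (S1X0DBB5 here, S1X0DBB50000K528H7XK in smartctl), so compare by prefix rather than by exact match.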

Alex_K said:
also would be helpful if you posted entire system config,

How do I go about doing this?

Alex_K said:
and output of
smartctl -a /dev/sd*


smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HUS724020ALA640
Serial Number:    P6GL9KTP
Firmware Version: MF6OAA50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 31 06:57:40 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 320) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       78
  3 Spin_Up_Time            0x0007   123   123   024    Pre-fail  Always       -       508 (Average 508)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       26
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       141
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       141
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 25/31)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
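
With a dozen drives it gets tedious to eyeball every attribute table, so here is a rough Python sketch that pulls out only the counters people usually look at first for a failing SATA disk (IDs 5, 197, 198 and 199). That choice of IDs is my assumption about what matters, not something smartctl defines, and nonzero_watched is a made-up name.

```python
# Attribute IDs to watch: an assumed shortlist, not an official one.
WATCHED = {5, 197, 198, 199}


def nonzero_watched(attr_table: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes whose raw value is not 0."""
    hits = {}
    for line in attr_table.splitlines():
        parts = line.split()
        # Data rows have at least 10 columns and start with the numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        if int(parts[0]) in WATCHED and parts[9].isdigit() and int(parts[9]) != 0:
            hits[parts[1]] = int(parts[9])
    return hits


# Rows copied from the P6GL9KTP output above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(sample))
```

For this drive the result is empty: no reallocated, pending or uncorrectable sectors and no CRC errors.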


smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HUS724020ALA640
Serial Number:    P6GK39NP
Firmware Version: MF6OAA50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 31 07:00:44 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 316) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79
  3 Spin_Up_Time            0x0007   128   128   024    Pre-fail  Always       -       486 (Average 485)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   142   142   020    Pre-fail  Offline      -       25
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       76
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       76
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 25/32)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 18 98 87 69 01  Error: UNC at LBA = 0x01698798 = 23693208

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 18 00 98 87 69 40 00   1d+04:24:57.152  READ FPDMA QUEUED
  60 18 00 80 87 69 40 00   1d+04:24:57.151  READ FPDMA QUEUED
  60 18 00 68 87 69 40 00   1d+04:24:57.151  READ FPDMA QUEUED
  60 20 00 48 87 69 40 00   1d+04:24:57.151  READ FPDMA QUEUED
  60 18 00 30 87 69 40 00   1d+04:24:57.150  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       893         -
# 2  Extended offline    Completed without error       00%       681         -
# 3  Short offline       Completed without error       00%       307         -
# 4  Short offline       Completed without error       00%       283         -
# 5  Short offline       Completed without error       00%       259         -
# 6  Short offline       Completed without error       00%       235         -
# 7  Short offline       Completed without error       00%       211         -
# 8  Short offline       Completed without error       00%       187         -
# 9  Short offline       Completed without error       00%       163         -
#10  Extended offline    Completed without error       00%       158         -
#11  Short offline       Completed without error       00%     51904         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
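
In case it helps to read that one logged error: the "UNC at LBA" figure comes straight from the register dump above it. A small Python sketch of the arithmetic, assuming the 28-bit register layout smartctl prints here (SN = bits 0-7, CL = bits 8-15, CH = bits 16-23, low nibble of DH = bits 24-27). The function name is just for illustration.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine the ATA registers into a 28-bit LBA; only the low nibble of DH is address."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register dump from Error 1: ER ST SC SN CL CH DH = 40 51 18 98 87 69 01
lba = lba28_from_registers(sn=0x98, cl=0x87, ch=0x69, dh=0x01)
print(hex(lba), lba)  # 0x1698798 23693208

# ER = 0x40 has bit 6 set, which is the UNC (uncorrectable data) flag.
print(bool(0x40 & (1 << 6)))  # True
```

That matches the 0x01698798 = 23693208 smartctl reports. The error was logged at 347 power-on hours, and the self-tests listed after that point all completed without error.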

Alex_K said:
for every drive that's faulty
mpsutil show all

I don't seem to have mpsutil...

Alex_K said:
too

Demonlinx · Aug 31, 2022

Alex_K said:
smartctl -a /dev/sd*



smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HUS724020ALA640
Serial Number:    P6GKYJ0P
Firmware Version: MF6OAA50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 31 07:01:03 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 309) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79
  3 Spin_Up_Time            0x0007   126   126   024    Pre-fail  Always       -       495 (Average 496)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       26
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       75
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       75
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 25/32)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       893         -
# 2  Extended offline    Completed without error       00%       681         -
# 3  Short offline       Completed without error       00%       307         -
# 4  Short offline       Completed without error       00%       283         -
# 5  Short offline       Completed without error       00%       259         -
# 6  Short offline       Completed without error       00%       235         -
# 7  Short offline       Completed without error       00%       211         -
# 8  Short offline       Completed without error       00%       187         -
# 9  Short offline       Completed without error       00%       163         -
#10  Extended offline    Completed without error       00%       158         -
#11  Short offline       Completed without error       00%     51903         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB2000FCWDF
Revision:             HPDA
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5007f9466d7
Serial number:        S1X0DBB50000K528H7XK
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Aug 31 07:01:24 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     30 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 59218:37
Manufactured in week 05 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  268
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2737
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     715957.397           0
write:         0        0         0         0          0      44636.672           0

Non-medium error count:    31363

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   57066                 - [-   -    -]
# 2  Background long   Completed                   -   56853                 - [-   -    -]
# 3  Background short  Completed                   -   56495                 - [-   -    -]
# 4  Background short  Completed                   -   56486                 - [-   -    -]

Long (extended) Self-test duration: 15300 seconds [255.0 minutes]



smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)

Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===

Model Family:     Hitachi Ultrastar 7K3000

Device Model:     Hitachi HUA723020ALA640

Serial Number:    MK0171YFHNPB2A

LU WWN Device Id: 5 000cca 223d77f19

Firmware Version: MK7OAA10

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Size:      512 bytes logical/physical

Rotation Rate:    7200 rpm

Form Factor:      3.5 inches

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   ATA8-ACS T13/1699-D revision 4

SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)

Local Time is:    Wed Aug 31 07:01:46 2022 PDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED


General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity

                                        was suspended by an interrupting command from host.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

                                        without error or no self-test has ever

                                        been run.

Total time to complete Offline

data collection:                (   28) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 327) minutes.

SCT capabilities:              (0x003d) SCT Status supported.

                                        SCT Error Recovery Control supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0

  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       86

  3 Spin_Up_Time            0x0007   124   124   024    Pre-fail  Always       -       502 (Average 500)

  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       16

  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0

  8 Seek_Time_Performance   0x0005   123   123   020    Pre-fail  Offline      -       31

  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3124

 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       68

193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       68

194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 25/33)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0


SMART Error Log Version: 1

No Errors Logged


SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline       Completed without error       00%       971         -

# 2  Extended offline    Completed without error       00%       760         -

# 3  Short offline       Completed without error       00%       386         -

# 4  Short offline       Completed without error       00%       362         -

# 5  Short offline       Completed without error       00%       338         -

# 6  Short offline       Completed without error       00%       314         -

# 7  Short offline       Completed without error       00%       289         -

# 8  Short offline       Completed without error       00%       265         -

# 9  Short offline       Completed without error       00%       241         -

#10  Extended offline    Completed without error       00%       237         -


SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.



smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)

Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===

Vendor:               HP

Product:              MB2000FBZPN

Revision:             HPD4

Compliance:           SPC-3

User Capacity:        2,000,398,934,016 bytes [2.00 TB]

Logical block size:   512 bytes

Rotation Rate:        7200 rpm

Form Factor:          3.5 inches

Logical Unit id:      0x5000c500413bef0f

Serial number:        Z1P1QEHC0000C234A1YF

Device type:          disk

Transport protocol:   SAS (SPL-3)

Local Time is:        Wed Aug 31 07:02:07 2022 PDT

SMART support is:     Available - device has SMART capability.

SMART support is:     Enabled

Temperature Warning:  Enabled


=== START OF READ SMART DATA SECTION ===

SMART Health Status: OK


Current Drive Temperature:     29 C

Drive Trip Temperature:        65 C


Accumulated power on time, hours:minutes 54140:36

Manufactured in week 10 of year 2012

Specified cycle count over device lifetime:  10000

Accumulated start-stop cycles:  106

Specified load-unload count over device lifetime:  300000

Accumulated load-unload cycles:  2356

Elements in grown defect list: 0


Error counter log:

           Errors Corrected by           Total   Correction     Gigabytes    Total

               ECC          rereads/    errors   algorithm      processed    uncorrected

           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors

read:          0        0         0  1712374307          0    1820277.925           0

write:         0        0         0         0          0     136881.942           0


Non-medium error count:      300


SMART Self-test log

Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]

     Description                              number   (hours)

# 1  Background short  Completed                   -   51970                 - [-   -    -]

# 2  Background long   Completed                   -   51757                 - [-   -    -]

# 3  Background short  Completed                   -   51380                 - [-   -    -]

# 4  Background short  Completed                   -   51356                 - [-   -    -]

# 5  Background short  Completed                   -   51332                 - [-   -    -]

# 6  Background short  Completed                   -   51308                 - [-   -    -]

# 7  Background short  Completed                   -   51201                 - [-   -    -]


Long (extended) Self-test duration: 18000 seconds [300.0 minutes]



smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)

Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===

Model Family:     Hitachi/HGST Ultrastar 7K4000

Device Model:     HUS724020ALA640

Serial Number:    P6GKXBYP

Firmware Version: MF6OAA50

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Size:      512 bytes logical/physical

Rotation Rate:    7200 rpm

Form Factor:      3.5 inches

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   ATA8-ACS T13/1699-D revision 4

SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)

Local Time is:    Wed Aug 31 07:02:29 2022 PDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED


General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity

                                        was suspended by an interrupting command from host.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

                                        without error or no self-test has ever

                                        been run.

Total time to complete Offline

data collection:                (   28) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 329) minutes.

SCT capabilities:              (0x003d) SCT Status supported.

                                        SCT Error Recovery Control supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0

  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       78

  3 Spin_Up_Time            0x0007   128   128   024    Pre-fail  Always       -       489 (Average 486)

  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15

  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0

  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       26

  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045

 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       75

193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       75

194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Min/Max 24/33)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0


SMART Error Log Version: 1

No Errors Logged


SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline       Completed without error       00%       893         -

# 2  Extended offline    Completed without error       00%       681         -

# 3  Short offline       Completed without error       00%       307         -

# 4  Short offline       Completed without error       00%       283         -

# 5  Short offline       Completed without error       00%       259         -

# 6  Short offline       Completed without error       00%       235         -

# 7  Short offline       Completed without error       00%       211         -

# 8  Short offline       Completed without error       00%       187         -

# 9  Short offline       Completed without error       00%       163         -


SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.



smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)

Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===

Model Family:     Hitachi/HGST Ultrastar 7K4000

Device Model:     HUS724020ALA640

Serial Number:    P6GL7D6P

Firmware Version: MF6OAA50

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Size:      512 bytes logical/physical

Rotation Rate:    7200 rpm

Form Factor:      3.5 inches

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   ATA8-ACS T13/1699-D revision 4

SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)

Local Time is:    Wed Aug 31 07:02:45 2022 PDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED


General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity

                                        was suspended by an interrupting command from host.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

                                        without error or no self-test has ever

                                        been run.

Total time to complete Offline

data collection:                (   24) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 317) minutes.

SCT capabilities:              (0x003d) SCT Status supported.

                                        SCT Error Recovery Control supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0

  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79

  3 Spin_Up_Time            0x0007   124   124   024    Pre-fail  Always       -       503 (Average 502)

  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15

  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0

  8 Seek_Time_Performance   0x0005   142   142   020    Pre-fail  Offline      -       25

  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045

 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       75

193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       75

194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 25/33)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0


SMART Error Log Version: 1

No Errors Logged


SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline       Completed without error       00%       893         -

# 2  Extended offline    Completed without error       00%       681         -

# 3  Short offline       Completed without error       00%       307         -

# 4  Short offline       Completed without error       00%       283         -

# 5  Short offline       Completed without error       00%       259         -

# 6  Short offline       Completed without error       00%       235         -

# 7  Short offline       Completed without error       00%       211         -

# 8  Short offline       Completed without error       00%       187         -

# 9  Short offline       Completed without error       00%       163         -

#10  Short offline       Completed without error       00%     51904         -


SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

Demonlinx · Aug 31, 2022

Here's the drive after the scrub completed:

Should I replace the faulted drive? Or still, be more concerned about the entire pool/hardware malfunctioning?

Demonlinx · Aug 31, 2022

It's also worth noting that I did a file move in the CLI:
This happened in /mnt/Backup/, which is the main pool configured for my server.
I then ran:
netcat -l -p 7000 | tar x
with a sender on the other server. I realize now that uid/gid mismatches could have occurred. I feel like doing the move this way was not the proper way. Should I move these files to the "home directory" dataset that I've created and then modify the files so that they are all owned by a user on the system? At this point, would doing that matter?

Was doing the above an issue? If so, why?

Demonlinx · Aug 31, 2022

Various md5sum checks of known good files on the remote backup server against the same files on the machine with issues all seem to match. Is that common when drives are having issues?

Demonlinx · Aug 31, 2022

I'm also not able to offline the bad drive it seems... It just spins and then does nothing when I try to offline the failed drive.

indivision · Aug 31, 2022

Demonlinx said:
Here's the output of the command once it's ran:

Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00

That appears to be the correct mode ("IT") and the latest version (that I could find).

So, you are good there.

Did you look into the temp for the HBA as suggested? Is it crammed in with little space? Or, does that card look well ventilated?

indivision · Aug 31, 2022

Demonlinx said:
Manufacturer : ATA
Model Number : HUS724020ALA640

Manufacturer : HP
Model Number : MB2000FCWDF

Manufacturer : ATA
Model Number : Hitachi HUA72302

Manufacturer : HP
Model Number : MB2000FBZPN

This looks like you are using 4 different types of hard drives? What is the context behind that?

Demonlinx said:
How do I go about doing this?

I don't want to make assumptions. The asker might have something else in mind. But, they may just be asking for more details about all of your hardware. RAM, CPU, MB, Drives, Boot device, etc.

Demonlinx said:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 137 137 054 Pre-fail Offline - 78
3 Spin_Up_Time 0x0007 123 123 024 Pre-fail Always - 508 (Average 508)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 140 140 020 Pre-fail Offline - 26
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 3045
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 15
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 141
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 141
194 Temperature_Celsius 0x0002 206 206 000 Old_age Always - 29 (Min/Max 25/31)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

"Old_age"? Are these old/used drives? If so, what is their specific history?

Demonlinx said:
don't seem to have mpsutil...

I haven't had to use this before. So, am not sure myself.

Demonlinx said:
Should I replace the faulted drive? Or still, be more concerned about the entire pool/hardware malfunctioning?

In my opinion, it depends on the drives. If these are really a collection of old drives, there may actually be that many bad ones... In that case, I think it would be pretty unreliable, especially on Z1, to try to use them. Replacing the faulted drive would just be polishing a turd. And with turd-based polish if it's with another old drive!

Demonlinx said:
I'm also not able to offline the bad drive it seems... It just spins and then does nothing when I try to offline the failed drive.

When it shows as FAULTED it is typically already ready to be replaced (doesn't need to be offlined): https://www.truenas.com/community/threads/drive-faulted-cant-offline-it-to-replace-it.91297/

I'm not certain about your other questions. So, hopefully someone else can chime in to cover those. I don't think that moving the files via CLI should have created an issue. But, not 100% on that.

Alex_K · Aug 31, 2022

It seems suspicious to see, on the same disks,

Code:

9 Power_On_Hours        3045

together with

Code:

#11  Short offline       Completed without error       00%     51904

And no Reallocated_Event_Count or other errors in attributes 196-199,
while the failed drive openly states that it has worked for 54140 hours.

It's possible to reset SMART data on some Hitachi/HGST drives. It's not even hard.

On the other hand, it's normal for drives like these to start failing at 50,000-65,000+ hours. That could be just what is happening.
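
The comparison above (Power_On_Hours against the hour stamps in the self-test log) can be automated over saved smartctl -a output. This is only a minimal sketch, assuming the ATA text layout shown in this thread; the function name check_hours and the SAMPLE text are made up for illustration, and it ignores SAS drives, whose self-test rows look different. Also note that, as far as I know, the self-test hour stamp is a 16-bit field that wraps around, so a stamp *lower* than Power_On_Hours proves nothing; only a stamp *higher* than Power_On_Hours is odd.

```python
import re

# ATA self-test row, e.g.
# "#10  Short offline       Completed without error       00%     51904         -"
SELFTEST_ROW = re.compile(r"#\s*(\d+)\s+.+?\s+(\d+)%\s+(\d+)\b")


def check_hours(report):
    """Return (power_on_hours, max_selftest_hours, suspicious) for one ATA report.

    suspicious is True when a logged self-test claims more lifetime hours
    than attribute 9 (Power_On_Hours) currently reports.
    """
    power_on_hours = None
    selftest_hours = []
    for line in report.splitlines():
        fields = line.split()
        # Attribute row: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
        if len(fields) >= 10 and fields[0] == "9" and fields[1] == "Power_On_Hours":
            raw = re.match(r"\d+", fields[9])
            if raw:
                power_on_hours = int(raw.group())
            continue
        row = SELFTEST_ROW.match(line.strip())
        if row:
            selftest_hours.append(int(row.group(3)))
    max_selftest_hours = max(selftest_hours) if selftest_hours else None
    suspicious = (
        power_on_hours is not None
        and max_selftest_hours is not None
        and max_selftest_hours > power_on_hours
    )
    return power_on_hours, max_selftest_hours, suspicious


# Hypothetical excerpt built from values posted in this thread.
SAMPLE = """\
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045
# 1  Short offline       Completed without error       00%       893         -
#10  Short offline       Completed without error       00%     51904         -
"""

print(check_hours(SAMPLE))  # (3045, 51904, True)
```

A True result does not prove tampering by itself; it only marks the drive as worth a closer look.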

This may help to navigate through the data posted:

Code:

 0  a  boot        SATA  HUS724020ALA640  AA50  P6GL9KTP
 1  b    82  * 1W  SATA  HUS724020ALA640  AA50  P6GK39NP
 2  c    90  +     SATA  HUS724020ALA640  AA50  P6GL7E2P
 3  d   128  *     SATA  HUS724020ALA640  AA50  P6GKYJ0P
 4  e   242  *     SAS   MB2000FCWDF      HPDA  S1X0DBB5
 5  f    80  *     SATA  HUA72302         AA10  MK0171YFHNPB2A
 6  g    44  +     SATA  HUS724020ALA640  AA50  P6GJSBLP
 7  h     6  +     SAS   MB2000FBZPN      HPD4  Z1P1NQTV
 8  i    29  -     SAS   MB2000FBZPN      HPD4  Z1P1QEHC
 9  j    68  +     SATA  HUA72302         A840  YGJ446PA
10  k   117  *     SATA  HUS724020ALA640  AA50  P6GKXBYP
11  l    62  *     SATA  HUS724020ALA640  AA50  P6GL7D6P

You do not have mpsutil because this is SCALE (Linux-based; mpsutil is a FreeBSD tool). No big deal.

Alex_K · Aug 31, 2022

Demonlinx said:
It's also worth noting that I did a file move in the CLI:
This happened in /mnt/Backup/, which is the main pool configured for my server.
I then ran:
netcat -l -p 7000 | tar x
with a sender on the other server. I realize now that uid/gid mismatches could have occurred. I feel like doing the move this way was not the proper way. Should I move these files to the "home directory" dataset that I've created and then modify the files so that they are all owned by a user on the system? At this point, would doing that matter?

Was doing the above an issue? If so, why?

Moving files does not break pools, at least not directly, though moving data does put some strain on the hardware.
To preserve attributes, use the -p switch with tar.
--same-owner helps too.
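
To see why those switches matter: a tar stream already carries the owner, group, and mode of every member, so the open question is only whether the receiving side applies them (with GNU tar that is what -p and --same-owner ask for, and they are the default when extracting as root). Below is a minimal sketch using Python's standard tarfile module to show what is recorded in the archive; the file name, user/group names, and IDs are hypothetical, and it builds the archive in memory rather than over netcat.

```python
import io
import tarfile


def make_archive():
    """Build a small in-memory tar archive with explicit ownership metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"hello\n"
        info = tarfile.TarInfo(name="docs/note.txt")  # hypothetical file
        info.size = len(data)
        info.uid, info.gid = 1001, 1001               # hypothetical numeric IDs
        info.uname, info.gname = "alice", "staff"     # hypothetical names
        info.mode = 0o640
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_ownership(archive_bytes):
    """Return (name, uid, gid, uname, gname, mode) for every archive member."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return [
            (m.name, m.uid, m.gid, m.uname, m.gname, oct(m.mode))
            for m in tar.getmembers()
        ]


print(list_ownership(make_archive()))
# [('docs/note.txt', 1001, 1001, 'alice', 'staff', '0o640')]
```

If the numeric IDs or names in the stream do not exist on the receiving system, the extracted files can end up owned by an unexpected user, which is the uid/gid mismatch mentioned above.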

Now, if the attributes/gid/uid are not how you need them, you can fix them, yep: chown and chmod.
It's harder if you had SMB shares and ACLs there.
You are usually better off using zfs send/recv to move data, using the GUI to replicate snapshots of datasets, or using rsync.

Demonlinx said:
Various md5sum checks of known good files on the remote backup server against the same files on the machine with issues all seem to match. Is that common when drives are having issues?

ZFS is very resilient. It checksums everything. If the scrub completed successfully and your pool did not break, it's highly unlikely that the data has changed.
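
If you want to repeat the md5sum spot checks over a whole directory tree rather than a few files, something like the following could work. It is a rough sketch, not a replacement for a scrub: the function names (md5_of, build_manifest, compare_manifests) are invented here, and it assumes you can run it on both machines and bring the two manifests together yourself (for example by saving one as JSON).

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, '/'-separated) to its MD5."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(reference, candidate):
    """Return (missing, changed): paths absent from candidate, and paths whose MD5 differs."""
    missing = sorted(set(reference) - set(candidate))
    changed = sorted(
        path
        for path, checksum in reference.items()
        if path in candidate and candidate[path] != checksum
    )
    return missing, changed
```

Build one manifest on the backup server and one under /mnt/Backup/, then pass the backup's manifest as reference; two empty lists mean every file in the reference was found with the same checksum.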

indivision said:
I don't want to make assumptions. The asker might have something else in mind. But, they may just be asking for more details about all of your hardware. RAM, CPU, MB, Drives, Boot device, etc.

That's right. And software build/versions.

About the situation as a whole, Demonlinx: I'm more inclined to believe it's faulty drives than the controller. SAS2008-based cards are designed to operate at up to +55 °C ambient temperature, as per https://docs.broadcom.com/doc/12353227
If something else were broken, like the power supply, a fan, or memory, it would have been visible in HP iLO2.

How long was that pool operational before you saw errors? Did you check it beforehand, and did you burn it in before putting data on it?

Bottom line, it looks like you do have backups, at least partial ones. If the backups are incomplete, update them, then plan on recreating the pool with more redundancy. Replacing the faulty drive right now would carry a high chance of breaking one more drive during the resilver and thus losing your pool.

If your workload allows it, after the backups are up to date, turn off the server in question and check the drives one by one, if possible on another PC, with a utility like Victoria HDD.

Download Victoria for Windows - MajorGeeks

Victoria for Windows is a powerful HDD information and diagnostic utility. It can be used for diagnostics, research, speed testing, minor repair of hard drives (HDD), SSD drives, memory cards, and any other drives in the Windows operating system.

www.majorgeeks.com

If the drives lie about their SMART data, Victoria would still show reallocated sectors, at least as drops on the read speed graph. There's no lying around the graph. Use read-only tests.

If disks prove ok, shell check everithing else again - cabling, tempetature, components part by part.

P.S. Mixing SATA and SAS drives aren't good also because they operate on different signaling voltages (800–1,600 mV for SAS versus 400–600 mV for SATA (transmit). While they do work, its preferable to avoid mixing them on same backplane, certainly avoid mixing them in same VDEV.

Demonlinx · Sep 1, 2022

Alex_K said:
It seems suspicious to see, on the same disks,

Code:
9 Power_On_Hours 3045

together with

Code:
#11 Short offline Completed without error 00% 51904

And no Reallocated_Event_Count or other errors in attributes 196-199.
Meanwhile the failed drive openly states that it has worked for 54140 hours.

This may be the cause and we might have received drives which we believed were new but were not...
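The mismatch Alex_K spotted can be checked mechanically on every drive. Below is a rough sketch that reads `smartctl -a` style text and flags a drive whose self-test log records a lifetime-hours value higher than its current Power_On_Hours. The function names are hypothetical, and the parsing assumes the raw value starts with a plain integer and that the lifetime hours follow the "remaining %" column, as in the two lines quoted above. Some drives store the self-test hours in a 16-bit field that wraps around, so a log value lower than the power-on hours is normal; only the opposite direction is treated as suspicious here.

```python
import re

# Self-test log line: "#<num> <description> <status> <remaining>% <lifetime hours>"
SELF_TEST_RE = re.compile(r"^#\s*\d+\s+.*?\d+%\s+(\d+)")


def parse_power_on_hours(smart_text):
    """Return the raw value of attribute 9 (Power_On_Hours), or None if absent."""
    for line in smart_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "9" and fields[1] == "Power_On_Hours":
            leading_digits = re.match(r"\d+", fields[-1])
            return int(leading_digits.group()) if leading_digits else None
    return None


def parse_self_test_hours(smart_text):
    """Return the lifetime-hours value of every self-test log entry found."""
    hours = []
    for line in smart_text.splitlines():
        match = SELF_TEST_RE.match(line.strip())
        if match:
            hours.append(int(match.group(1)))
    return hours


def counter_looks_reset(smart_text):
    """True if any logged self-test ran at more hours than the drive now reports."""
    power_on = parse_power_on_hours(smart_text)
    test_hours = parse_self_test_hours(smart_text)
    if power_on is None or not test_hours:
        return False
    return max(test_hours) > power_on


sample = """9 Power_On_Hours 3045
#11 Short offline Completed without error 00% 51904
"""
print(counter_looks_reset(sample))
```

A flagged drive is not proof of tampering, but together with a "new" listing it is a good reason to ask the seller questions.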

Demonlinx · Sep 1, 2022

Alex_K said:
Moving files does not break pools, at least not directly, though moving data does put some strain on the hardware.
To preserve attributes, use the -p switch with tar.
--same-owner helps too.

I'll keep this in mind for future transfers! Thanks!

Alex_K said:
Now, if the attributes/gid/uid are not how you need them, you can fix them, yep, with chown and chmod.
It's harder if you had SMB shares and ACLs there.
You're usually better off doing zfs send/recv to move data, using the GUI to replicate snapshots of datasets, or using rsync.

And how might I set this up properly? I tried doing this but kept getting issues about keys not being available, etc.

Alex_K said:
ZFS is very resilient. It checksums everything. If the scrub completed successfully and your pool did not break, it's highly unlikely that the data has changed.

That's right. And software build/versions.

About the situation as a whole, Demonlinx, I'm more inclined to believe it's faulty drives than the controller. SAS2008-based cards are designed to operate at up to +55 °C ambient temperature, as per https://docs.broadcom.com/doc/12353227
If something else was broken, like the power supply, a fan, or memory, it would have been visible in HP iLO2.

How long was that pool operational before you saw errors? Did you check it beforehand? Did you burn in the drives before putting data on them?

We've only had this machine for a couple of months. I did ~40TB of I/O on the pool before doing this migration and didn't see any issues.

I did NOT do a burn-in but this is definitely something that I'll be doing in the future!

Alex_K said:
Bottom line, it looks like you do have backups, at least partial ones. If the backups are incomplete, update them, then plan on recreating the pool with more redundancy. Replacing the faulty drive right now would carry a high chance of breaking one more drive during the resilver and thus losing your pool.

If your workload allows it, after the backups are up to date, turn off the server in question and check the drives one by one, if possible on another PC, with a utility like Victoria HDD.

Download Victoria for Windows - MajorGeeks

Victoria for Windows is a powerful HDD information and diagnostic utility. It can be used for diagnostics, research, speed testing, minor repair of hard drives (HDD), SSD drives, memory cards, and any other drives in the Windows operating system.

www.majorgeeks.com

If the drives lie about their SMART data, Victoria would still show reallocated sectors, at least as drops on the read-speed graph. The graph can't be faked. Use read-only tests.

If the disks prove OK, you should check everything else again: cabling, temperature, and the components part by part.

P.S. Mixing SATA and SAS drives isn't good either, because they operate at different signaling voltages (800–1,600 mV for SAS versus 400–600 mV for SATA, transmit). While they do work, it's preferable to avoid mixing them on the same backplane, and certainly avoid mixing them in the same VDEV.

We're beginning to believe the old used hardware that we bought was just reaching the end of its lifecycle. We intend to buy new hardware to avoid this issue in the future. I'll also likely be scrapping the drives and going with "new" WD RED drives.

Demonlinx · Sep 1, 2022

indivision said:
That appears to be the correct mode ("IT") and the latest version (that I could find).

So, you are good there.

Did you look into the temp for the HBA as suggested? Is it crammed in with little space? Or, does that card look well ventilated?

The card is well-ventilated. It is the only card back there and seems to get good airflow from the fans.

Demonlinx · Sep 1, 2022

indivision said:
This looks like you are using 4 different types of hard drives? What is the context behind that?

It was what we had available. I'm beginning to realize that this was probably incorrect to do.

indivision said:
I don't want to make assumptions. The asker might have something else in mind. But, they may just be asking for more details about all of your hardware. RAM, CPU, MB, Drives, Boot device, etc.

I see, I assumed there was some way to output system configuration within TrueNAS

indivision said:
"Old_age"? Are these old/used drives? If so, what is their specific history?

I'm honestly not sure either. I'm beginning to realize that relying on Amazon to sell you "new" drives is highly unreliable. Will probably look at going through an alternative re-seller at this point. Any good recommendations?

indivision said:
In my opinion it depends on the drives. If these are really a collection of old drives, there may actually be that many bad ones... In that case, I think it would be pretty unreliable, especially on Z1, to try to use them. Replacing the faulted drive would just be polishing a turd. And with turd-based polish if it's with another old drive!

I think the plan now is to replace all of the drives and rebuild the pool. I believe I've got all of the necessary data off of the drives that is needed.

indivision said:
When it shows as FAULTED it is typically already ready to be replaced (doesn't need to be offlined): https://www.truenas.com/community/threads/drive-faulted-cant-offline-it-to-replace-it.91297/

This is good to know, thanks!

Alex_K · Sep 1, 2022

An old chassis is OK, but always buy new drives. You can buy 2-3 such sets for the price of one new server, and that's more reliable than a single new one, even with the best server care pack.
This is one such DL180 G6, very much like yours (C5 H2 in signature).

indivision · Sep 1, 2022

Demonlinx said:
I'll also likely be scrapping the drives and going with "new" WD RED drives.

That's a good choice. Just make sure that you get the CMR model. Not the SMR ones.

List of known SMR drives

Hard drives that write data in overlapping, "shingled" tracks, have greater areal density than ones that do not. For cost and capacity reasons, manufacturers are increasingly moving to SMR, Shingled Magnetic Recording. SMR is a form of PMR...

www.truenas.com

Demonlinx said:
I'm honestly not sure either. I'm beginning to realize that relying on Amazon to sell you "new" drives is highly unreliable. Will probably look at going through an alternative re-seller at this point. Any good recommendations?

Unfortunately, I think Amazon is probably the best choice. I've never received old hardware from them before. Not doubting you. I know that kind of thing happens there. But, it must be rare.

I had a terrible experience ordering a batch of Red drives from Newegg one time. They delivered to the wrong house. When I tried to correct it they actually asked me to file a police report to prove that they messed up! Took over a month to resolve. So, I will never use Newegg again.

Demonlinx said:
I think the plan now is to replace all of the drives and rebuild the pool. I believe I've got all of the necessary data off of the drives that is needed.

I recommend going with fewer but larger-capacity drives, in RaidZ2. Maybe 8 or 6 x 4TB depending on your needs. This way you can fit them all on one HBA. Also, with fewer drives, if you ever wanted to upgrade the sizes it's a bit easier (offline and resilver with larger drive replacements one at a time).
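To put rough numbers on those two layouts, here is a small back-of-the-envelope sketch. The helper `raidz_usable_tb` is invented for this post, and it uses the simple "data disks times disk size" approximation, so it ignores ZFS metadata, padding, the TB versus TiB difference, and the free space you would normally leave unused. Treat the results as upper bounds.

```python
def raidz_usable_tb(disk_count, disk_tb, parity):
    """Approximate usable capacity of a single RAIDZ vdev, before ZFS overhead.

    parity is 1, 2 or 3 for RAIDZ1/2/3 and is also the number of disks the
    vdev can lose without losing the pool.
    """
    if parity not in (1, 2, 3):
        raise ValueError("parity must be 1, 2 or 3")
    if disk_count <= parity:
        raise ValueError("need more disks than parity devices")
    return (disk_count - parity) * disk_tb


# The two layouts suggested above, both RAIDZ2 with 4 TB drives.
for disks in (6, 8):
    usable = raidz_usable_tb(disks, 4, parity=2)
    print(f"{disks} x 4TB RAIDZ2: about {usable} TB usable, survives 2 failed disks")
```

Under that approximation, 6 x 4TB gives about 16 TB and 8 x 4TB about 24 TB, with either layout tolerating two simultaneous drive failures.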

Also, order enough to have 2 spares ready. And register all of the drives with WD. If any of them goes out within warranty you just send it in and get a new one. I've had them replace a couple over time.

Demonlinx · Sep 1, 2022

indivision said:
That's a good choice. Just make sure that you get the CMR model. Not the SMR ones.

List of known SMR drives

Hard drives that write data in overlapping, "shingled" tracks, have greater areal density than ones that do not. For cost and capacity reasons, manufacturers are increasingly moving to SMR, Shingled Magnetic Recording. SMR is a form of PMR...

www.truenas.com

I'll look at making sure the drives are CMR.

indivision said:
Unfortunately, I think Amazon is probably the best choice. I've never received old hardware from them before. Not doubting you. I know that kind of thing happens there. But, it must be rare.

I had a terrible experience ordering a batch of Red drives from Newegg one time. They delivered to the wrong house. When I tried to correct it they actually asked me to file a police report to prove that they messed up! Took over a month to resolve. So, I will never use Newegg again.

Yikes! That doesn't sound good at all. Hopefully Amazon will be better with the next shipment of drives we purchase.

indivision said:
I recommend going with fewer but larger-capacity drives, in RaidZ2. Maybe 8 or 6 x 4TB depending on your needs. This way you can fit them all on one HBA. Also, with fewer drives, if you ever wanted to upgrade the sizes it's a bit easier (offline and resilver with larger drive replacements one at a time).

Are there any 12i card variants? When I last looked, there was nothing like that. Is it only recommended to form pools with drives that are on the HBA?

indivision said:
Also, order enough to have 2 spares ready. And register all of the drives with WD. If any of them goes out within warranty you just send it in and get a new one. I've had them replace a couple over time.

Already have many spare drives available.
