HELP! Not sure how to proceed. Scrub shows degraded and faulted drive(s)

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Other than getting the HBAs flashed properly (which you certainly should do no matter what), you might want to consider 2 additional things...

Heat in your case, particularly where the HBA(s) sit. An overheating HBA will cook itself to death (maybe what you're seeing is the first signs of that).

Disks... did you check them against the SMR list? https://www.truenas.com/community/resources/list-of-known-smr-drives.141/

All drives in a pool falling over during a scrub sounds a bit like SMR disks to me... a resilver would point that way even more strongly, but you're not quite there yet (maybe we'll see soon).
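To check against the SMR list you need the exact model string of every disk. A minimal sketch, assuming a Linux shell with smartmontools installed; `model_of` is a hypothetical helper name (not part of any tool), and the `/dev/sd?` glob in the usage comment only matches single-letter device names:

```shell
# Print the model string from `smartctl -i` output read on stdin.
# Handles both the ATA layout ("Device Model:") and the SAS layout ("Product:").
model_of() {
    awk -F': *' '/^(Device Model|Product):/ { print $2; exit }'
}

# Assumed usage (needs root and real disks, so it is left commented out):
#   for d in /dev/sd?; do
#       printf '%s\t' "$d"; smartctl -i "$d" | model_of
#   done
```

Each printed model can then be looked up in the list by hand.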
 

somewhatdamaged

Dabbler
Joined
Sep 5, 2015
Messages
49
11 disk RAIDZ1...balls of steel!! Good luck man, hope you manage to sort it out!
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
sas2flash -list will get you the adapter info and show IT or IR mode:

Here's the output of the command once it's run:
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS2008(B2)

Controller Number              : 0
Controller                     : SAS2008(B2)
PCI Address                    : 00:03:00:00
SAS Address                    : 590b11c-0-121d-5102
NVDATA Version (Default)       : 14.01.00.08
NVDATA Version (Persistent)    : 14.01.00.08
Firmware Product ID            : 0x2213 (IT)
Firmware Version               : 20.00.07.00
NVDATA Vendor                  : LSI
NVDATA Product ID              : SAS9211-8i
BIOS Version                   : 07.11.10.00
UEFI BSD Version               : N/A
FCODE Version                  : N/A
Board Name                     : SAS9211-8i
Board Assembly                 : N/A
Board Tracer Number            : N/A

Finished Processing Commands Successfully.
Exiting SAS2Flash.
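If you only want the IT/IR answer out of that listing, something like this would do. A rough sketch: `fw_mode` is a made-up helper name, and it assumes the `Firmware Product ID` line looks like the one in the output above.

```shell
# Print the firmware mode (e.g. IT or IR) from `sas2flash -list` output on stdin.
# It looks for the parenthesised tag after the hex product ID.
fw_mode() {
    sed -n 's/.*Firmware Product ID *: *0x[0-9A-Fa-f]* *(\([A-Z]*\)).*/\1/p'
}

# Assumed usage (needs the HBA present, so it is left commented out):
#   sas2flash -list | fw_mode
```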
 

Demonlinx

DL180 G6 has an expander on the backplane.
If it's the 12x3.5" model, it should work OK with a SAS2008 in IT mode. I have 4 of these servers that have been running solid for years.

You can use
sas2ircu 0 display
to show which drive resides in which bay (0 is the controller number - you'll figure it out)
and
sas2ircu locate
to light an LED on a drive caddy of your choice
LSI Corporation SAS2 IR Configuration Utility.
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved.

Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS2008
  BIOS version                            : 7.11.10.00
  Firmware version                        : 20.00.07.00
  Channel description                     : 1 Serial Attached SCSI
  Initiator ID                            : 0
  Maximum physical devices                : 255
  Concurrent commands supported           : 3432
  Slot                                    : 4
  Segment                                 : 0
  Bus                                     : 3
  Device                                  : 0
  Function                                : 0
  RAID Support                            : No
------------------------------------------------------------------------
IR Volume information
------------------------------------------------------------------------
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 0
  SAS Address                             : 5001438-0-131f-7f41
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GL9KTP
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Enclosure services device
  Enclosure #                             : 2
  Slot #                                  : 0
  SAS Address                             : 5001438-0-131f-7f53
  State                                   : Standby (SBY)
  Manufacturer                            : HP
  Model Number                            : DL18xG6BP
  Firmware Revision                       : 2.20
  Serial No                               :
  GUID                                    : N/A
  Protocol                                : SAS
  Device Type                             : Enclosure services device

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 1
  SAS Address                             : 5001438-0-131f-7f42
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GK39NP
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 2
  SAS Address                             : 5001438-0-131f-7f43
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GL7E2P
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 3
  SAS Address                             : 5001438-0-131f-7f44
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GKYJ0P
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 4
  SAS Address                             : 5000c50-0-7f94-66d5
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : HP
  Model Number                            : MB2000FCWDF
  Firmware Revision                       : HPDA
  Serial No                               : S1X0DBB5
  GUID                                    : 5000c5007f9466d7
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 5
  SAS Address                             : 5001438-0-131f-7f46
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : Hitachi HUA72302
  Firmware Revision                       : AA10
  Serial No                               : MK0171YFHNPB2A
  GUID                                    : 5000cca223d77f19
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 6
  SAS Address                             : 5001438-0-131f-7f47
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GJSBLP
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 7
  SAS Address                             : 5000c50-0-4135-7629
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : HP
  Model Number                            : MB2000FBZPN
  Firmware Revision                       : HPD4
  Serial No                               : Z1P1NQTV
  GUID                                    : 5000c5004135762b
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 8
  SAS Address                             : 5000c50-0-413b-ef0d
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : HP
  Model Number                            : MB2000FBZPN
  Firmware Revision                       : HPD4
  Serial No                               : Z1P1QEHC
  GUID                                    : 5000c500413bef0f
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 9
  SAS Address                             : 5001438-0-131f-7f4a
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : Hitachi HUA72302
  Firmware Revision                       : A840
  Serial No                               : YGJ446PA
  GUID                                    : 5000cca224de1051
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 10
  SAS Address                             : 5001438-0-131f-7f4b
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GKXBYP
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 11
  SAS Address                             : 5001438-0-131f-7f4c
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA
  Model Number                            : HUS724020ALA640
  Firmware Revision                       : AA50
  Serial No                               : P6GL7D6P
  GUID                                    : N/A
  Protocol                                : SATA
  Drive Type                              : SATA_HDD
------------------------------------------------------------------------
Enclosure information
------------------------------------------------------------------------
  Enclosure#                              : 1
  Logical ID                              : 590b11c0:121d5102
  Numslots                                : 8
  StartSlot                               : 0
  Enclosure#                              : 2
  Logical ID                              : 50014380:131f7f00
  Numslots                                : 13
  StartSlot                               : 0
------------------------------------------------------------------------
SAS2IRCU: Command DISPLAY Completed Successfully.
SAS2IRCU: Utility Completed Successfully.
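That display output is easier to work with once it is boiled down to one line per bay. A minimal sketch, assuming the field names shown above (`Enclosure #`, `Slot #`, `Model Number`, `Serial No`, `Drive Type`) each appear on their own line as the tool normally prints them; `bay_map` is a hypothetical helper name:

```shell
# Summarise `sas2ircu 0 display` output (stdin) as: enclosure:slot, serial, model.
# One tab-separated line is printed per disk. The enclosure services device is
# skipped because it reports "Device Type" rather than "Drive Type".
bay_map() {
    awk -F' *: *' '
        /^ *Enclosure #/  { enc    = $2 }
        /^ *Slot #/       { slot   = $2 }
        /^ *Model Number/ { model  = $2 }
        /^ *Serial No/    { serial = $2 }
        /^ *Drive Type/   { printf "%s:%s\t%s\t%s\n", enc, slot, serial, model }
    '
}

# Assumed usage (needs the HBA present, so it is left commented out):
#   sas2ircu 0 display | bay_map
```

With the serial of a faulted disk matched to its enclosure:slot pair, the locate command should take that pair plus ON or OFF, e.g. `sas2ircu 0 locate 2:5 ON`. That is how I remember the syntax, so check the utility's own help text before relying on it.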
It would also be helpful if you posted your entire system config,
How do I go about doing this?

and output of
smartctl -a /dev/sd*
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HUS724020ALA640
Serial Number:    P6GL9KTP
Firmware Version: MF6OAA50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 31 06:57:40 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 320) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       78
  3 Spin_Up_Time            0x0007   123   123   024    Pre-fail  Always       -       508 (Average 508)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       26
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       141
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       141
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 25/31)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HUS724020ALA640
Serial Number:    P6GK39NP
Firmware Version: MF6OAA50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 31 07:00:44 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 316) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79
  3 Spin_Up_Time            0x0007   128   128   024    Pre-fail  Always       -       486 (Average 485)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   142   142   020    Pre-fail  Offline      -       25
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       76
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       76
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 25/32)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 347 hours (14 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 18 98 87 69 01  Error: UNC at LBA = 0x01698798 = 23693208

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 18 00 98 87 69 40 00   1d+04:24:57.152  READ FPDMA QUEUED
  60 18 00 80 87 69 40 00   1d+04:24:57.151  READ FPDMA QUEUED
  60 18 00 68 87 69 40 00   1d+04:24:57.151  READ FPDMA QUEUED
  60 20 00 48 87 69 40 00   1d+04:24:57.151  READ FPDMA QUEUED
  60 18 00 30 87 69 40 00   1d+04:24:57.150  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       893         -
# 2  Extended offline    Completed without error       00%       681         -
# 3  Short offline       Completed without error       00%       307         -
# 4  Short offline       Completed without error       00%       283         -
# 5  Short offline       Completed without error       00%       259         -
# 6  Short offline       Completed without error       00%       235         -
# 7  Short offline       Completed without error       00%       211         -
# 8  Short offline       Completed without error       00%       187         -
# 9  Short offline       Completed without error       00%       163         -
#10  Extended offline    Completed without error       00%       158         -
#11  Short offline       Completed without error       00%     51904         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

for every drive that's faulty
mpsutil show all
don't seem to have mpsutil...
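Two notes on those commands. As far as I know, mpsutil ships with FreeBSD, and the smartctl banner here shows a Linux kernel, which would explain why it is missing. Also, `smartctl -a /dev/sd*` expands to several device names, while smartctl accepts only one device per run, so a loop is needed. A minimal sketch, assuming ATA-style attribute tables like the ones above; `key_attrs` is a hypothetical helper name, and the SAS disks report differently and would print nothing here:

```shell
# From `smartctl -A` (or -a) ATA output on stdin, print name=raw_value for the
# attributes that most often point at a failing disk or a bad cable:
# 5 (reallocated), 197 (pending), 198 (offline uncorrectable), 199 (CRC errors).
# The "$3 ~ /^0x/" check keeps rows from other tables (e.g. the selective
# self-test log, whose SPAN column can also be 5) from matching.
key_attrs() {
    awk '$1 ~ /^(5|197|198|199)$/ && $3 ~ /^0x/ { print $2 "=" $10 }'
}

# Assumed usage (needs root and real disks, so it is left commented out):
#   for d in /dev/sd?; do
#       echo "== $d"; smartctl -A "$d" | key_attrs
#   done
```

Swapping `-A` for `-a` in the loop and dropping the pipe gives the full per-drive report that was asked for.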
 

Demonlinx

smartctl -a /dev/sd*
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HUS724020ALA640
Serial Number:    P6GKYJ0P
Firmware Version: MF6OAA50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 31 07:01:03 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  24) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 309) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79
  3 Spin_Up_Time            0x0007   126   126   024    Pre-fail  Always       -       495 (Average 496)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       26
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       75
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       75
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 25/32)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       893         -
# 2  Extended offline    Completed without error       00%       681         -
# 3  Short offline       Completed without error       00%       307         -
# 4  Short offline       Completed without error       00%       283         -
# 5  Short offline       Completed without error       00%       259         -
# 6  Short offline       Completed without error       00%       235         -
# 7  Short offline       Completed without error       00%       211         -
# 8  Short offline       Completed without error       00%       187         -
# 9  Short offline       Completed without error       00%       163         -
#10  Extended offline    Completed without error       00%       158         -
#11  Short offline       Completed without error       00%     51903         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB2000FCWDF
Revision:             HPDA
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5007f9466d7
Serial number:        S1X0DBB50000K528H7XK
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Aug 31 07:01:24 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     30 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 59218:37
Manufactured in week 05 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  268
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2737
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     715957.397           0
write:         0        0         0         0          0      44636.672           0

Non-medium error count:    31363

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   57066                 - [-   -    -]
# 2  Background long   Completed                   -   56853                 - [-   -    -]
# 3  Background short  Completed                   -   56495                 - [-   -    -]
# 4  Background short  Completed                   -   56486                 - [-   -    -]

Long (extended) Self-test duration: 15300 seconds [255.0 minutes]
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Ultrastar 7K3000
Device Model:     Hitachi HUA723020ALA640
Serial Number:    MK0171YFHNPB2A
LU WWN Device Id: 5 000cca 223d77f19
Firmware Version: MK7OAA10
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 31 07:01:46 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  28) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 327) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       86
  3 Spin_Up_Time            0x0007   124   124   024    Pre-fail  Always       -       502 (Average 500)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   123   123   020    Pre-fail  Offline      -       31
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3124
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       68
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       68
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 25/33)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       971         -
# 2  Extended offline    Completed without error       00%       760         -
# 3  Short offline       Completed without error       00%       386         -
# 4  Short offline       Completed without error       00%       362         -
# 5  Short offline       Completed without error       00%       338         -
# 6  Short offline       Completed without error       00%       314         -
# 7  Short offline       Completed without error       00%       289         -
# 8  Short offline       Completed without error       00%       265         -
# 9  Short offline       Completed without error       00%       241         -
#10  Extended offline    Completed without error       00%       237         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB2000FBZPN
Revision:             HPD4
Compliance:           SPC-3
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500413bef0f
Serial number:        Z1P1QEHC0000C234A1YF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Aug 31 07:02:07 2022 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 54140:36
Manufactured in week 10 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  106
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2356
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0  1712374307        0    1820277.925           0
write:         0        0         0         0          0     136881.942           0

Non-medium error count:      300

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   51970                 - [-   -    -]
# 2  Background long   Completed                   -   51757                 - [-   -    -]
# 3  Background short  Completed                   -   51380                 - [-   -    -]
# 4  Background short  Completed                   -   51356                 - [-   -    -]
# 5  Background short  Completed                   -   51332                 - [-   -    -]
# 6  Background short  Completed                   -   51308                 - [-   -    -]
# 7  Background short  Completed                   -   51201                 - [-   -    -]

Long (extended) Self-test duration: 18000 seconds [300.0 minutes]
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Ultrastar 7K4000
Device Model:     HUS724020ALA640
Serial Number:    P6GKXBYP
Firmware Version: MF6OAA50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 31 07:02:29 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  28) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 329) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       78
  3 Spin_Up_Time            0x0007   128   128   024    Pre-fail  Always       -       489 (Average 486)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       26
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3045
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       75
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       75
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Min/Max 24/33)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       893         -
# 2  Extended offline    Completed without error       00%       681         -
# 3  Short offline       Completed without error       00%       307         -
# 4  Short offline       Completed without error       00%       283         -
# 5  Short offline       Completed without error       00%       259         -
# 6  Short offline       Completed without error       00%       235         -
# 7  Short offline       Completed without error       00%       211         -
# 8  Short offline       Completed without error       00%       187         -
# 9  Short offline       Completed without error       00%       163         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Hitachi/HGST Ultrastar 7K4000
Device Model: HUS724020ALA640
Serial Number: P6GL7D6P
Firmware Version: MF6OAA50
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Aug 31 07:02:45 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 24) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 317) minutes.
SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 137 137 054 Pre-fail Offline - 79
3 Spin_Up_Time 0x0007 124 124 024 Pre-fail Always - 503 (Average 502)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 142 142 020 Pre-fail Offline - 25
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 3045
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 15
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 75
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 75
194 Temperature_Celsius 0x0002 200 200 000 Old_age Always - 30 (Min/Max 25/33)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 893 -
# 2 Extended offline Completed without error 00% 681 -
# 3 Short offline Completed without error 00% 307 -
# 4 Short offline Completed without error 00% 283 -
# 5 Short offline Completed without error 00% 259 -
# 6 Short offline Completed without error 00% 235 -
# 7 Short offline Completed without error 00% 211 -
# 8 Short offline Completed without error 00% 187 -
# 9 Short offline Completed without error 00% 163 -
#10 Short offline Completed without error 00% 51904 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
Here's the drive after the scrub completed:
1661955024132.png


Should I replace the faulted drive? Or should I still be more concerned about the entire pool/hardware malfunctioning?
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
It's also worth noting that I did a file move in the CLI:
This happened at /mnt/Backup/, which is the main pool configured for my server.
I then ran:
netcat -l -p 7000 | tar x
And a sender on the other server. I realize now that uid/gid mismatches could have occurred. I feel like doing the move this way was not the proper way. Should I move these files to the "home directory" dataset that I've created and then modify the files so that they are all owned by a user on the system? At this point would doing that matter?

Was doing the above an issue? If so why?
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
Various md5sum checks between known good files on the remote backup server and the same files on the machine with issues come out the same. Is it common for this to be the case when drives are having issues?
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
I'm also not able to offline the bad drive it seems... It just spins and then does nothing when I try to offline the failed drive.
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Here's the output of the command once it's ran:

Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00

That appears to be the correct mode ("IT") and the latest version (that I could find).

So, you are good there.

Did you look into the temp for the HBA as suggested? Is it crammed in with little space? Or, does that card look well ventilated?
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Manufacturer : ATA
Model Number : HUS724020ALA640

Manufacturer : HP
Model Number : MB2000FCWDF

Manufacturer : ATA
Model Number : Hitachi HUA72302

Manufacturer : HP
Model Number : MB2000FBZPN

This looks like you are using 4 different types of hard drives? What is the context behind that?

How do I go about doing this?

I don't want to make assumptions. The asker might have something else in mind. But, they may just be asking for more details about all of your hardware. RAM, CPU, MB, Drives, Boot device, etc.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 137 137 054 Pre-fail Offline - 78
3 Spin_Up_Time 0x0007 123 123 024 Pre-fail Always - 508 (Average 508)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 140 140 020 Pre-fail Offline - 26
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 3045
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 15
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 141
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 141
194 Temperature_Celsius 0x0002 206 206 000 Old_age Always - 29 (Min/Max 25/31)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

"Old_age"? Are these old/used drives? If so, what is their specific history?

don't seem to have mpsutil...

I haven't had to use this before. So, am not sure myself.

Should I replace the faulted drive? Or should I still be more concerned about the entire pool/hardware malfunctioning?

In my opinion it depends on the drives. If these are really a collection of old drives, there may actually be that many bad ones... In that case, I think it would be pretty unreliable, especially on Z1, to try to use them. Replacing the faulted drive would just be polishing a turd. And with turd-based polish if it's with another old drive!

I'm also not able to offline the bad drive it seems... It just spins and then does nothing when I try to offline the failed drive.

When it shows as FAULTED it is typically already ready to be replaced (doesn't need to be offlined): https://www.truenas.com/community/threads/drive-faulted-cant-offline-it-to-replace-it.91297/

I'm not certain about your other questions. So, hopefully someone else can chime in to cover those. I don't think that moving the files via CLI should have created an issue. But, not 100% on that.
 

Alex_K

Explorer
Joined
Sep 4, 2016
Messages
64
It seems suspicious to see, on the same disks,
Code:
9 Power_On_Hours        3045

together with
Code:
#11  Short offline       Completed without error       00%     51904        


And no Reallocated_Event_Count or other errors in attributes 196-199,
while the failed drive openly states it has worked for 54140 hours.

It's possible to reset SMART on some Hitachi/HGST drives. It's not even hard.

On the other hand, it's normal for drives like these to start failing at 50,000-65,000+ hours. That could be just what is happening.
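
To check every drive for this kind of mismatch without reading each report by eye, here is a minimal sketch. It assumes each drive's smartctl -a output has been saved to a text file; the function names are made up for illustration. It only compares attribute 9 with the largest LifeTime(hours) value in the self-test log, so a hit is a hint that the counters were reset, not proof.

```shell
#!/usr/bin/env bash
# Sketch: flag a drive whose self-test log records more lifetime hours
# than the Power_On_Hours attribute currently reports.

# Print the raw Power_On_Hours value (attribute 9) from smartctl -a text on stdin.
power_on_hours() {
    awk '$1 == 9 && $2 == "Power_On_Hours" { print $10; exit }'
}

# Print the largest LifeTime(hours) value found in the self-test log on stdin.
max_selftest_hours() {
    awk '
        /^#/ && /%/ {
            # The lifetime column is the field right after the "NN%" column.
            for (i = 1; i < NF; i++)
                if (($i ~ /^[0-9]+%$/) && (($(i + 1) + 0) > max))
                    max = $(i + 1) + 0
        }
        END { print max + 0 }
    '
}

# Usage: check_smart_report FILE   (FILE holds saved smartctl -a output)
# The body runs in a subshell so the helper variables stay private.
check_smart_report() (
    poh=$(power_on_hours < "$1")
    log_max=$(max_selftest_hours < "$1")
    if [ "$log_max" -gt "${poh:-0}" ]; then
        echo "SUSPECT: self-test log shows ${log_max}h but Power_On_Hours is ${poh}h"
    else
        echo "OK: Power_On_Hours ${poh}h, self-test log maximum ${log_max}h"
    fi
)
```

Typical use would be something like saving a report with smartctl -a /dev/sdX > sdX.txt (sdX being a placeholder) and then running check_smart_report sdX.txt for each drive.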

This may help to navigate through the data posted:
Code:
0  a    boot        SATA    HUS724020ALA640    AA50    P6GL9KTP
1  b    82    * 1W    SATA    HUS724020ALA640    AA50    P6GK39NP
2  c    90    +    SATA    HUS724020ALA640    AA50    P6GL7E2P
3  d    128    *    SATA    HUS724020ALA640    AA50    P6GKYJ0P
4  e    242    *    SAS    MB2000FCWDF    HPDA    S1X0DBB5
5  f    80    *    SATA    HUA72302    AA10    MK0171YFHNPB2A
6  g    44    +    SATA    HUS724020ALA640    AA50    P6GJSBLP
7  h    6    +    SAS    MB2000FBZPN    HPD4    Z1P1NQTV
8  i    29    -    SAS    MB2000FBZPN    HPD4    Z1P1QEHC
9  j    68    +    SATA    HUA72302    A840    YGJ446PA
10 k    117    *    SATA    HUS724020ALA640    AA50    P6GKXBYP
11 l    62    *    SATA    HUS724020ALA640    AA50    P6GL7D6P


You do not have mpsutil because this is SCALE (mpsutil is a FreeBSD tool). No big deal.
 

Alex_K

Explorer
Joined
Sep 4, 2016
Messages
64
It's also worth noting that I did a file move in the CLI:
This happened at /mnt/Backup/, which is the main pool configured for my server.
I then ran:
netcat -l -p 7000 | tar x
And a sender on the other server. I realize now that uid/gid mismatches could have occurred. I feel like doing the move this way was not the proper way. Should I move these files to the "home directory" dataset that I've created and then modify the files so that they are all owned by a user on the system? At this point would doing that matter?

Was doing the above an issue? If so why?
Moving files does not break pools, at least not directly, though moving data does put some strain on the hardware.
To preserve attributes, use the -p switch with tar when extracting.
--same-owner helps too.

Now, if the attributes/gid/uid are not how you need them, yes, you can fix them with chown and chmod.
It's harder if you had SMB shares and ACLs there.
You're usually better off doing zfs send/recv to move data, using the GUI to replicate snapshots of datasets, or rsync.
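
For reference, a rough sketch of the tar route with those switches. The function names, host name and paths are placeholders, not anything from this system; port 7000 is the one from the command quoted above. Ownership can only be restored when the extracting side runs as root, so the sketch only passes --same-owner in that case.

```shell
#!/usr/bin/env bash
# Sketch: stream a directory tree as a tar archive and unpack it while
# keeping permission bits (and, as root, the original uid/gid).

# Write an archive of directory $1 to stdout.
pack_tree() {
    tar -C "$1" -cf - .
}

# Read an archive from stdin and unpack it into directory $1.
# -p keeps permission bits; --same-owner keeps uid/gid when run as root.
unpack_tree() {
    mkdir -p "$1"
    if [ "$(id -u)" -eq 0 ]; then
        tar -C "$1" --same-owner -xpf -
    else
        tar -C "$1" -xpf -
    fi
}

# Over the network this would look roughly like (hypothetical host and paths):
#   receiver:  netcat -l -p 7000 | unpack_tree /mnt/Backup/incoming
#   sender:    pack_tree /data/to/move | netcat receiver-host 7000
```

Numeric uid/gid values only mean the same thing on both machines if the same users exist on both, which is where the mismatch worry comes from.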

Various md5sum checks between known good files on the remote backup server and the same files on the machine with issues come out the same. Is it common for this to be the case when drives are having issues?
ZFS is very resilient. It checksums everything. If it scrubbed successfully and your pool did not break, it's highly unlikely that the data has changed.
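
If you want to repeat that md5sum comparison over a whole directory instead of a handful of files, something like this sketch could do it. The function names are made up, and it assumes GNU md5sum and find are available, as they are on SCALE.

```shell
#!/usr/bin/env bash
# Sketch: compare two directory trees by per-file md5 checksum.

# Print "checksum  ./relative/path" for every file under directory $1,
# sorted by path so two listings can be compared line by line.
tree_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# Succeed (exit 0) only when both trees hold the same files with the same checksums.
same_tree() {
    [ "$(tree_manifest "$1")" = "$(tree_manifest "$2")" ]
}
```

With the copies on two different machines, you would run tree_manifest on each side, save the output to a file, and diff the two files.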

I don't want to make assumptions. The asker might have something else in mind. But, they may just be asking for more details about all of your hardware. RAM, CPU, MB, Drives, Boot device, etc.
That's right. And software build/versions.

As for the situation as a whole, Demonlinx, I'm more inclined to believe it's faulty drives than the controller. SAS2008-based cards are designed to operate at up to +55 °C ambient temperature, as per https://docs.broadcom.com/doc/12353227
If something else were broken, like the power supply, a fan, or memory, it would have been visible in HP iLO2.

How long had that pool been operational before you saw errors? Did you check it beforehand? Did you burn it in before putting data on it?

Bottom line, it looks like you do have backups, at least partial ones. If the backups are incomplete, update them, then plan on recreating the pool with more redundancy. Replacing the faulty drive right now would carry a high chance of breaking one more drive during the resilver and thus losing your pool.

If your workload allows it, once the backups are up to date, turn off the server in question and check the drives one by one, if possible on another PC, with a utility like Victoria HDD.
If they lie about their SMART data, Victoria would show reallocated sectors, at least as drops on the read speed graph. There's no lying around the graph. Use read-only tests.

If the disks prove OK, you should check everything else again: cabling, temperature, components part by part.

P.S. Mixing SATA and SAS drives isn't good either, because they operate on different signaling voltages (800–1,600 mV for SAS versus 400–600 mV for SATA, transmit). While they do work, it's preferable to avoid mixing them on the same backplane, and certainly avoid mixing them in the same VDEV.
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
It seems suspicious to see, on the same disks,
Code:
9 Power_On_Hours        3045

together with
Code:
#11  Short offline       Completed without error       00%     51904        


And no Reallocated_Event_Count or other errors in attributes 196-199,
while the failed drive openly states it has worked for 54140 hours.
This may be the cause and we might have received drives which we believed were new but were not...
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
Moving files does not break pools, at least not directly, though moving data does put some strain on the hardware.
To preserve attributes, use the -p switch with tar when extracting.
--same-owner helps too.
I'll keep this in mind for future transfers! Thanks!
Now, if the attributes/gid/uid are not how you need them, yes, you can fix them with chown and chmod.
It's harder if you had SMB shares and ACLs there.
You're usually better off doing zfs send/recv to move data, using the GUI to replicate snapshots of datasets, or rsync.
And how might I set this up properly? I tried doing this but kept getting issues about keys not being available, etc.
ZFS is very resilient. It checksums everything. If it scrubbed successfully and your pool did not break, it's highly unlikely that the data has changed.


That's right. And software build/versions.

As for the situation as a whole, Demonlinx, I'm more inclined to believe it's faulty drives than the controller. SAS2008-based cards are designed to operate at up to +55 °C ambient temperature, as per https://docs.broadcom.com/doc/12353227
If something else were broken, like the power supply, a fan, or memory, it would have been visible in HP iLO2.

How long had that pool been operational before you saw errors? Did you check it beforehand? Did you burn it in before putting data on it?
We've only had this machine for a couple of months. I did ~40TB of I/O on the pool before doing this migration and didn't see any issues.

I did NOT do a burn-in but this is definitely something that I'll be doing in the future!
Bottom line, it looks like you do have backups, at least partial ones. If the backups are incomplete, update them, then plan on recreating the pool with more redundancy. Replacing the faulty drive right now would carry a high chance of breaking one more drive during the resilver and thus losing your pool.

If your workload allows it, once the backups are up to date, turn off the server in question and check the drives one by one, if possible on another PC, with a utility like Victoria HDD.
If they lie about their SMART data, Victoria would show reallocated sectors, at least as drops on the read speed graph. There's no lying around the graph. Use read-only tests.

If the disks prove OK, you should check everything else again: cabling, temperature, components part by part.

P.S. Mixing SATA and SAS drives isn't good either, because they operate on different signaling voltages (800–1,600 mV for SAS versus 400–600 mV for SATA, transmit). While they do work, it's preferable to avoid mixing them on the same backplane, and certainly avoid mixing them in the same VDEV.
We're beginning to believe the old used hardware that we bought was just reaching the end of its lifecycle. We intend on buying new hardware to avoid this issue in the future. I'll also likely be scrapping the drives and going with "new" WD RED drives.
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
That appears to be the correct mode ("IT") and the latest version (that I could find).

So, you are good there.

Did you look into the temp for the HBA as suggested? Is it crammed in with little space? Or, does that card look well ventilated?
The card is well-ventilated. It is the only card back there and seems to get good airflow from the fans.
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
This looks like you are using 4 different types of hard drives? What is the context behind that?
It was what we had available. I'm beginning to realize that this was probably incorrect to do.
I don't want to make assumptions. The asker might have something else in mind. But, they may just be asking for more details about all of your hardware. RAM, CPU, MB, Drives, Boot device, etc.
I see, I assumed there was some way to output system configuration within TrueNAS
"Old_age"? Are these old/used drives? If so, what is their specific history?
I'm honestly not sure either. I'm beginning to realize that relying on Amazon to sell you "new" drives is highly unreliable. Will probably look at going through an alternative re-seller at this point. Any good recommendations?
In my opinion it depends on the drives. If these are really a collection of old drives, there may actually be that many bad ones... In that case, I think it would be pretty unreliable, especially on Z1, to try to use them. Replacing the faulted drive would just be polishing a turd. And with turd-based polish if it's with another old drive!
I think the plan now is to replace all of the drives and rebuild the pool. I believe I've got all of the necessary data off of the drives that is needed.
When it shows as FAULTED it is typically already ready to be replaced (doesn't need to be offlined): https://www.truenas.com/community/threads/drive-faulted-cant-offline-it-to-replace-it.91297/
This is good to know, thanks!
 

Alex_K

Explorer
Joined
Sep 4, 2016
Messages
64
An old chassis is OK, but always use new drives. You can buy 2-3 such sets for the price of one new server, and that's more reliable than one new server, even with the best server care pack.
This is one such DL180 G6, very much like yours (C5 H2 in signature).
 

Attachments

  • 1074.gif

indivision

Guru
Joined
Jan 4, 2013
Messages
806
I'll also likely be scrapping the drives and going with "new" WD RED drives.

That's a good choice. Just make sure that you get the CMR model. Not the SMR ones.


I'm honestly not sure either. I'm beginning to realize that relying on Amazon to sell you "new" drives is highly unreliable. Will probably look at going through an alternative re-seller at this point. Any good recommendations?

Unfortunately, I think Amazon is probably the best choice. I've never received old hardware from them before. Not doubting you. I know that kind of thing happens there. But, it must be rare.

I had a terrible experience ordering a batch of Red drives from Newegg one time. They delivered to the wrong house. When I tried to correct it they actually asked me to file a police report to prove that they messed up! Took over a month to resolve. So, I will never use Newegg again.

I think the plan now is to replace all of the drives and rebuild the pool. I believe I've got all of the necessary data off of the drives that is needed.

I recommend going with fewer but larger capacity drives, in RAIDZ2. Maybe 8 or 6 x 4TB depending on your needs. This way you can fit them all on one HBA. Also, with fewer drives, if you ever wanted to upgrade the sizes it's a bit easier (offline and resilver with larger drive replacements one at a time).

Also, order enough to have 2 spares ready. And register all of the drives with WD. If any of them goes out within warranty you just send it in and get a new one. I've had them replace a couple over time.
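
On the drive-count suggestion, a back-of-the-envelope way to compare layouts is (drives - parity drives) x drive size. The sketch below is only that rough estimate. It ignores ZFS metadata, padding and the TB versus TiB difference, so real usable space will come out lower. The function name is made up, and the 11 x 2TB RAIDZ1 line is the current pool as described earlier in the thread.

```shell
#!/usr/bin/env bash
# Sketch: rough usable capacity of a single RAIDZ vdev in TB.
# Arguments: number of drives, parity level (1, 2 or 3), drive size in TB.
raidz_usable_tb() {
    echo $(( ($1 - $2) * $3 ))
}

raidz_usable_tb 11 1 2   # current pool, 11 x 2TB RAIDZ1: prints 20
raidz_usable_tb 6 2 4    # 6 x 4TB RAIDZ2: prints 16
raidz_usable_tb 8 2 4    # 8 x 4TB RAIDZ2: prints 24
```

So 8 x 4TB in RAIDZ2 would give a bit more raw space than the current pool while tolerating two drive failures instead of one.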
 

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
That's a good choice. Just make sure that you get the CMR model. Not the SMR ones.

I'll look at making sure the drives are CMR.
Unfortunately, I think Amazon is probably the best choice. I've never received old hardware from them before. Not doubting you. I know that kind of thing happens there. But, it must be rare.

I had a terrible experience ordering a batch of Red drives from Newegg one time. They delivered to the wrong house. When I tried to correct it they actually asked me to file a police report to prove that they messed up! Took over a month to resolve. So, I will never use Newegg again.
Yikes! That doesn't sound good at all. Hopefully Amazon will be better with the next shipment of drives we purchase.
I recommend going with fewer but larger capacity drives, in RAIDZ2. Maybe 8 or 6 x 4TB depending on your needs. This way you can fit them all on one HBA. Also, with fewer drives, if you ever wanted to upgrade the sizes it's a bit easier (offline and resilver with larger drive replacements one at a time).
Are there any 12i card variants? I know that when I was looking last there was nothing like that.. Is it only recommended to form pools with drives that are on the HBA?
Also, order enough to have 2 spares ready. And register all of the drives with WD. If any of them goes out within warranty you just send it in and get a new one. I've had them replace a couple over time.
Already have many spare drives available.
 