I am running TrueNAS Scale 23.10.1.1 with a RAIDZ1 pool of 4x 12TB Toshiba MG07ACA12TE drives, on an AMD Ryzen 4650P with 64GB ECC RAM and an LSI 9211 (SAS2008) controller.
A few days ago I got an email notification about I/O errors:
Code:
The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
 impact: Fault tolerance of the pool may be compromised.
    eid: 46
  class: statechange
  state: FAULTED
   host: storage
   time: 2024-02-17 02:19:07+0100
  vpath: /dev/disk/by-partuuid/ad0e21f3-e97f-410b-aa47-202ab692b9dc
  vguid: 0xF881419855BB7E2A
   pool: cobalt (0x50681DE5DDA83CA1)
and
Code:
* Pool cobalt state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
* Disk TOSHIBA_MG07ACA12TE X1P0A0EDF95G is FAULTED
So I ordered a new drive and started the RMA process. In the meantime, TrueNAS successfully completed a resilver. Now the pool has gone from DEGRADED back to ONLINE and everything is green.
I clearly saw SMART errors for read and write earlier, but they are no longer there. smartctl (output below) tells me that everything is fine. I have double-checked the serial number of the drive to make sure I am looking at the correct one.
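In case it helps anyone checking the same thing: the alert only names a partition UUID, so it has to be mapped back to a device node before comparing serial numbers. Below is a minimal sketch of that lookup, assuming the usual udev symlinks under /dev/disk/by-partuuid. The function name `resolve_partuuid` is my own and not part of TrueNAS or any library.

```python
import os

def resolve_partuuid(partuuid, base="/dev/disk/by-partuuid"):
    """Follow the udev symlink for a partition UUID to its device node.

    `base` is assumed to be the standard udev directory; it is a
    parameter only so the lookup can be tried against another path.
    """
    link = os.path.join(base, partuuid)
    if not os.path.islink(link):
        raise FileNotFoundError(f"no such partition UUID link: {link}")
    # realpath resolves the symlink to an absolute device path
    return os.path.realpath(link)
```

Calling it with the UUID from the `vpath` line of the alert should give the partition device, whose parent disk can then be passed to smartctl to compare the serial number.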
How should I proceed from here? Without any logged errors, I doubt that the RMA will be successful. More importantly, I would guess that a drive that has failed once should not be trusted. Am I misreading the results?
Thank you!
Code:
root@storage[~]# smartctl --health /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
root@storage[~]# smartctl --xall /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA12TE
Serial Number:    X1P0A0EDF95G
LU WWN Device Id: 5 000039 b38db3aca
Firmware Version: 0104
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Feb 21 07:38:20 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (1166) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   050    -    0
  2 Throughput_Performance  P-S---   100   100   050    -    0
  3 Spin_Up_Time            POS--K   100   100   001    -    6957
  4 Start_Stop_Count        -O--CK   100   100   000    -    15
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         PO-R--   100   100   050    -    0
  8 Seek_Time_Performance   P-S---   100   100   050    -    0
  9 Power_On_Hours          -O--CK   091   091   000    -    3678
 10 Spin_Retry_Count        PO--CK   100   100   030    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    15
 23 Helium_Condition_Lower  PO---K   100   100   075    -    0
 24 Helium_Condition_Upper  PO---K   100   100   075    -    0
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    50
192 Power-Off_Retract_Count -O--CK   100   100   000    -    11
193 Load_Cycle_Count        -O--CK   100   100   000    -    15
194 Temperature_Celsius     -O---K   100   100   000    -    31 (Min/Max 20/56)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
220 Disk_Shift              -O----   100   100   000    -    2228224
222 Loaded_Hours            -O--CK   091   091   000    -    3678
223 Load_Retry_Count        -O--CK   100   100   000    -    0
224 Load_Friction           -O---K   100   100   000    -    0
226 Load-in_Time            -OS--K   100   100   000    -    594
240 Head_Flying_Hours       P-----   100   100   001    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O    513  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O  53248  Current Device Internal Status Data log
0x25       GPL     R/O  53248  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3678         -
# 2  Short offline       Completed without error       00%      3661         -
# 3  Short offline       Completed without error       00%      3637         -
# 4  Short offline       Completed without error       00%      3614         -
# 5  Short offline       Completed without error       00%      3610         -
# 6  Short offline       Completed without error       00%      3590         -
# 7  Short offline       Completed without error       00%      3566         -
# 8  Short offline       Completed without error       00%      3542         -
# 9  Short offline       Completed without error       00%      3518         -
#10  Short offline       Completed without error       00%      3494         -
#11  Short offline       Completed without error       00%      3470         -
#12  Short offline       Completed without error       00%      3446         -
#13  Short offline       Completed without error       00%      3422         -
#14  Short offline       Completed without error       00%      3398         -
#15  Short offline       Completed without error       00%      3374         -
#16  Short offline       Completed without error       00%      3350         -
#17  Short offline       Completed without error       00%      3326         -
#18  Short offline       Completed without error       00%      3302         -
#19  Short offline       Completed without error       00%      3278         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version:                  3
SCT Version (vendor specific):       1 (0x0001)
Device State:                        Active (0)
Current Temperature:                    31 Celsius
Power Cycle Min/Max Temperature:     21/37 Celsius
Lifetime    Min/Max Temperature:     20/56 Celsius
Specified Max Operating Temperature:    55 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      5/55 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    478 (393)
Index    Estimated Time   Temperature Celsius
 394    2024-02-20 23:41    31  ************
 ...    ..(138 skipped).    ..  ************
  55    2024-02-21 02:00    31  ************
  56    2024-02-21 02:01    32  *************
  57    2024-02-21 02:02    32  *************
  58    2024-02-21 02:03    32  *************
  59    2024-02-21 02:04    33  **************
 ...    ..(  2 skipped).    ..  **************
  62    2024-02-21 02:07    33  **************
  63    2024-02-21 02:08    34  ***************
 ...    ..(  4 skipped).    ..  ***************
  68    2024-02-21 02:13    34  ***************
  69    2024-02-21 02:14    35  ****************
 ...    ..( 17 skipped).    ..  ****************
  87    2024-02-21 02:32    35  ****************
  88    2024-02-21 02:33    36  *****************
 ...    ..( 27 skipped).    ..  *****************
 116    2024-02-21 03:01    36  *****************
 117    2024-02-21 03:02    37  ******************
 ...    ..(  8 skipped).    ..  ******************
 126    2024-02-21 03:11    37  ******************
 127    2024-02-21 03:12    36  *****************
 128    2024-02-21 03:13    36  *****************
 129    2024-02-21 03:14    37  ******************
 130    2024-02-21 03:15    36  *****************
 131    2024-02-21 03:16    36  *****************
 132    2024-02-21 03:17    36  *****************
 133    2024-02-21 03:18    37  ******************
 134    2024-02-21 03:19    36  *****************
 135    2024-02-21 03:20    36  *****************
 136    2024-02-21 03:21    36  *****************
 137    2024-02-21 03:22    37  ******************
 138    2024-02-21 03:23    37  ******************
 139    2024-02-21 03:24    36  *****************
 ...    ..( 16 skipped).    ..  *****************
 156    2024-02-21 03:41    36  *****************
 157    2024-02-21 03:42    37  ******************
 ...    ..( 11 skipped).    ..  ******************
 169    2024-02-21 03:54    37  ******************
 170    2024-02-21 03:55    36  *****************
 171    2024-02-21 03:56    37  ******************
 172    2024-02-21 03:57    37  ******************
 173    2024-02-21 03:58    37  ******************
 174    2024-02-21 03:59    36  *****************
 175    2024-02-21 04:00    37  ******************
 176    2024-02-21 04:01    36  *****************
 ...    ..(  6 skipped).    ..  *****************
 183    2024-02-21 04:08    36  *****************
 184    2024-02-21 04:09    37  ******************
 185    2024-02-21 04:10    36  *****************
 ...    ..(  6 skipped).    ..  *****************
 192    2024-02-21 04:17    36  *****************
 193    2024-02-21 04:18    37  ******************
 ...    ..( 33 skipped).    ..  ******************
 227    2024-02-21 04:52    37  ******************
 228    2024-02-21 04:53    36  *****************
 229    2024-02-21 04:54    37  ******************
 ...    ..( 16 skipped).    ..  ******************
 246    2024-02-21 05:11    37  ******************
 247    2024-02-21 05:12    36  *****************
 248    2024-02-21 05:13    37  ******************
 ...    ..( 11 skipped).    ..  ******************
 260    2024-02-21 05:25    37  ******************
 261    2024-02-21 05:26    36  *****************
 262    2024-02-21 05:27    36  *****************
 263    2024-02-21 05:28    37  ******************
 264    2024-02-21 05:29    36  *****************
 ...    ..(  8 skipped).    ..  *****************
 273    2024-02-21 05:38    36  *****************
 274    2024-02-21 05:39    35  ****************
 ...    ..(  5 skipped).    ..  ****************
 280    2024-02-21 05:45    35  ****************
 281    2024-02-21 05:46    34  ***************
 ...    ..(  4 skipped).    ..  ***************
 286    2024-02-21 05:51    34  ***************
 287    2024-02-21 05:52    33  **************
 ...    ..( 12 skipped).    ..  **************
 300    2024-02-21 06:05    33  **************
 301    2024-02-21 06:06    34  ***************
 302    2024-02-21 06:07    34  ***************
 303    2024-02-21 06:08    34  ***************
 304    2024-02-21 06:09    33  **************
 ...    ..(  7 skipped).    ..  **************
 312    2024-02-21 06:17    33  **************
 313    2024-02-21 06:18    32  *************
 ...    ..( 15 skipped).    ..  *************
 329    2024-02-21 06:34    32  *************
 330    2024-02-21 06:35    31  ************
 ...    ..( 62 skipped).    ..  ************
 393    2024-02-21 07:38    31  ************
SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled
Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 3) ==
0x01  0x008  4              15  ---  Lifetime Power-On Resets
0x01  0x010  4            3678  ---  Power-on Hours
0x01  0x018  6     19096327497  ---  Logical Sectors Written
0x01  0x020  6        96173385  ---  Number of Write Commands
0x01  0x028  6    103634734656  ---  Logical Sectors Read
0x01  0x030  6      1117227079  ---  Number of Read Commands
0x01  0x038  6     13240800000  ---  Date and Time TimeStamp
0x02  =====  =               =  ===  == Free-Fall Statistics (rev 1) ==
0x02  0x010  4              50  ---  Overlimit Shock Events
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4              99  ---  Spindle Motor Power-on Hours
0x03  0x010  4              99  ---  Head Flying Hours
0x03  0x018  4              15  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              11  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               2  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              31  ---  Current Temperature
0x05  0x010  1              31  N--  Average Short Term Temperature
0x05  0x018  1              31  N--  Average Long Term Temperature
0x05  0x020  1              56  ---  Highest Temperature
0x05  0x028  1              20  ---  Lowest Temperature
0x05  0x030  1              52  N--  Highest Average Short Term Temperature
0x05  0x038  1              29  N--  Lowest Average Short Term Temperature
0x05  0x040  1              44  N--  Highest Average Long Term Temperature
0x05  0x048  1              31  N--  Lowest Average Long Term Temperature
0x05  0x050  4              82  ---  Time in Over-Temperature
0x05  0x058  1              55  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               5  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             111  ---  Number of Hardware Resets
0x06  0x010  4              40  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value
Pending Defects log (GP Log 0x0c)
No Defects Logged
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x0002  4            0  R_ERR response for data FIS
0x0003  4            0  R_ERR response for device-to-host data FIS
0x0004  4            0  R_ERR response for host-to-device data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x0006  4            0  R_ERR response for device-to-host non-data FIS
0x0007  4            0  R_ERR response for host-to-device non-data FIS
0x0008  4            0  Device-to-host non-data FIS retries
0x0009  4            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  4            1  Device-to-host register FISes sent due to a COMRESET
0x000b  4            0  CRC errors within host-to-device FIS
0x000d  4            0  Non-CRC errors within host-to-device FIS
0x000f  4            0  R_ERR response for host-to-device data FIS, CRC
0x0010  4            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  4            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  4            0  R_ERR response for host-to-device non-data FIS, non-CRC
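To make sure I am not overlooking a counter in the attribute table above, here is a small sketch that scans pasted smartctl output for non-zero raw values on a few attributes. The set of watched IDs and the function name `nonzero_watched` are my own assumptions, not an official list of failure indicators.

```python
import re

# Attribute IDs I chose to watch (an assumption, not an official list):
# reallocated sectors/events, pending sectors, offline uncorrectable, CRC errors.
WATCHED_IDS = {5, 196, 197, 198, 199}

# Matches rows of the attribute table printed by `smartctl --xall`, e.g.
#   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+([A-Z-]{6})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)

def nonzero_watched(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # not an attribute row (header, log entry, temperature history, ...)
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(8))
        if attr_id in WATCHED_IDS and raw > 0:
            hits[name] = raw
    return hits
```

Run against the output above, this returns an empty dictionary, because attributes 5, 196, 197, 198 and 199 all have a raw value of 0.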
			