SMART error...gone?

tobo

Dabbler
Joined
Feb 6, 2015
Messages
32
I am running TrueNAS Scale 23.10.1.1 with a RAIDz1 setup on 4x 12TB Toshiba MG07ACA12TE, AMD Ryzen 4650P, 64GB ECC, LSI9211 / SAS2008 controller.

A few days ago I got an email notification about I/O errors:

Code:
The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

impact: Fault tolerance of the pool may be compromised.
eid: 46
class: statechange
state: FAULTED
host: storage
time: 2024-02-17 02:19:07+0100
vpath: /dev/disk/by-partuuid/ad0e21f3-e97f-410b-aa47-202ab692b9dc
vguid: 0xF881419855BB7E2A
pool: cobalt (0x50681DE5DDA83CA1)


and

Code:
* Pool cobalt state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
* Disk TOSHIBA_MG07ACA12TE X1P0A0EDF95G is FAULTED


So I ordered a new drive and started the RMA process. In the meantime TrueNAS successfully completed a resilver. But now, looking at the storage, the pool is back from DEGRADED to ONLINE and everything is green.
I clearly saw SMART errors for read and write, but they are not there anymore. smartctl (output below) tells me that everything is fine. I have double-checked the serial number of the drive to make sure I am looking at the correct one.

How should I proceed from here? Without any errors, I doubt that the RMA will be successful. More importantly, I guess a drive that has failed once should not be trusted. Am I doing anything wrong in how I read the results?

Thank you!

Code:
root@storage[~]# smartctl --health /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@storage[~]# smartctl --xall /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA12TE
Serial Number:    X1P0A0EDF95G
LU WWN Device Id: 5 000039 b38db3aca
Firmware Version: 0104
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Feb 21 07:38:20 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (1166) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   050    -    0
  2 Throughput_Performance  P-S---   100   100   050    -    0
  3 Spin_Up_Time            POS--K   100   100   001    -    6957
  4 Start_Stop_Count        -O--CK   100   100   000    -    15
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         PO-R--   100   100   050    -    0
  8 Seek_Time_Performance   P-S---   100   100   050    -    0
  9 Power_On_Hours          -O--CK   091   091   000    -    3678
 10 Spin_Retry_Count        PO--CK   100   100   030    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    15
 23 Helium_Condition_Lower  PO---K   100   100   075    -    0
 24 Helium_Condition_Upper  PO---K   100   100   075    -    0
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    50
192 Power-Off_Retract_Count -O--CK   100   100   000    -    11
193 Load_Cycle_Count        -O--CK   100   100   000    -    15
194 Temperature_Celsius     -O---K   100   100   000    -    31 (Min/Max 20/56)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
220 Disk_Shift              -O----   100   100   000    -    2228224
222 Loaded_Hours            -O--CK   091   091   000    -    3678
223 Load_Retry_Count        -O--CK   100   100   000    -    0
224 Load_Friction           -O---K   100   100   000    -    0
226 Load-in_Time            -OS--K   100   100   000    -    594
240 Head_Flying_Hours       P-----   100   100   001    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O    513  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O  53248  Current Device Internal Status Data log
0x25       GPL     R/O  53248  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3678         -
# 2  Short offline       Completed without error       00%      3661         -
# 3  Short offline       Completed without error       00%      3637         -
# 4  Short offline       Completed without error       00%      3614         -
# 5  Short offline       Completed without error       00%      3610         -
# 6  Short offline       Completed without error       00%      3590         -
# 7  Short offline       Completed without error       00%      3566         -
# 8  Short offline       Completed without error       00%      3542         -
# 9  Short offline       Completed without error       00%      3518         -
#10  Short offline       Completed without error       00%      3494         -
#11  Short offline       Completed without error       00%      3470         -
#12  Short offline       Completed without error       00%      3446         -
#13  Short offline       Completed without error       00%      3422         -
#14  Short offline       Completed without error       00%      3398         -
#15  Short offline       Completed without error       00%      3374         -
#16  Short offline       Completed without error       00%      3350         -
#17  Short offline       Completed without error       00%      3326         -
#18  Short offline       Completed without error       00%      3302         -
#19  Short offline       Completed without error       00%      3278         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       1 (0x0001)
Device State:                        Active (0)
Current Temperature:                    31 Celsius
Power Cycle Min/Max Temperature:     21/37 Celsius
Lifetime    Min/Max Temperature:     20/56 Celsius
Specified Max Operating Temperature:    55 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      5/55 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    478 (393)

Index    Estimated Time   Temperature Celsius
 394    2024-02-20 23:41    31  ************
 ...    ..(138 skipped).    ..  ************
  55    2024-02-21 02:00    31  ************
  56    2024-02-21 02:01    32  *************
  57    2024-02-21 02:02    32  *************
  58    2024-02-21 02:03    32  *************
  59    2024-02-21 02:04    33  **************
 ...    ..(  2 skipped).    ..  **************
  62    2024-02-21 02:07    33  **************
  63    2024-02-21 02:08    34  ***************
 ...    ..(  4 skipped).    ..  ***************
  68    2024-02-21 02:13    34  ***************
  69    2024-02-21 02:14    35  ****************
 ...    ..( 17 skipped).    ..  ****************
  87    2024-02-21 02:32    35  ****************
  88    2024-02-21 02:33    36  *****************
 ...    ..( 27 skipped).    ..  *****************
 116    2024-02-21 03:01    36  *****************
 117    2024-02-21 03:02    37  ******************
 ...    ..(  8 skipped).    ..  ******************
 126    2024-02-21 03:11    37  ******************
 127    2024-02-21 03:12    36  *****************
 128    2024-02-21 03:13    36  *****************
 129    2024-02-21 03:14    37  ******************
 130    2024-02-21 03:15    36  *****************
 131    2024-02-21 03:16    36  *****************
 132    2024-02-21 03:17    36  *****************
 133    2024-02-21 03:18    37  ******************
 134    2024-02-21 03:19    36  *****************
 135    2024-02-21 03:20    36  *****************
 136    2024-02-21 03:21    36  *****************
 137    2024-02-21 03:22    37  ******************
 138    2024-02-21 03:23    37  ******************
 139    2024-02-21 03:24    36  *****************
 ...    ..( 16 skipped).    ..  *****************
 156    2024-02-21 03:41    36  *****************
 157    2024-02-21 03:42    37  ******************
 ...    ..( 11 skipped).    ..  ******************
 169    2024-02-21 03:54    37  ******************
 170    2024-02-21 03:55    36  *****************
 171    2024-02-21 03:56    37  ******************
 172    2024-02-21 03:57    37  ******************
 173    2024-02-21 03:58    37  ******************
 174    2024-02-21 03:59    36  *****************
 175    2024-02-21 04:00    37  ******************
 176    2024-02-21 04:01    36  *****************
 ...    ..(  6 skipped).    ..  *****************
 183    2024-02-21 04:08    36  *****************
 184    2024-02-21 04:09    37  ******************
 185    2024-02-21 04:10    36  *****************
 ...    ..(  6 skipped).    ..  *****************
 192    2024-02-21 04:17    36  *****************
 193    2024-02-21 04:18    37  ******************
 ...    ..( 33 skipped).    ..  ******************
 227    2024-02-21 04:52    37  ******************
 228    2024-02-21 04:53    36  *****************
 229    2024-02-21 04:54    37  ******************
 ...    ..( 16 skipped).    ..  ******************
 246    2024-02-21 05:11    37  ******************
 247    2024-02-21 05:12    36  *****************
 248    2024-02-21 05:13    37  ******************
 ...    ..( 11 skipped).    ..  ******************
 260    2024-02-21 05:25    37  ******************
 261    2024-02-21 05:26    36  *****************
 262    2024-02-21 05:27    36  *****************
 263    2024-02-21 05:28    37  ******************
 264    2024-02-21 05:29    36  *****************
 ...    ..(  8 skipped).    ..  *****************
 273    2024-02-21 05:38    36  *****************
 274    2024-02-21 05:39    35  ****************
 ...    ..(  5 skipped).    ..  ****************
 280    2024-02-21 05:45    35  ****************
 281    2024-02-21 05:46    34  ***************
 ...    ..(  4 skipped).    ..  ***************
 286    2024-02-21 05:51    34  ***************
 287    2024-02-21 05:52    33  **************
 ...    ..( 12 skipped).    ..  **************
 300    2024-02-21 06:05    33  **************
 301    2024-02-21 06:06    34  ***************
 302    2024-02-21 06:07    34  ***************
 303    2024-02-21 06:08    34  ***************
 304    2024-02-21 06:09    33  **************
 ...    ..(  7 skipped).    ..  **************
 312    2024-02-21 06:17    33  **************
 313    2024-02-21 06:18    32  *************
 ...    ..( 15 skipped).    ..  *************
 329    2024-02-21 06:34    32  *************
 330    2024-02-21 06:35    31  ************
 ...    ..( 62 skipped).    ..  ************
 393    2024-02-21 07:38    31  ************

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 3) ==
0x01  0x008  4              15  ---  Lifetime Power-On Resets
0x01  0x010  4            3678  ---  Power-on Hours
0x01  0x018  6     19096327497  ---  Logical Sectors Written
0x01  0x020  6        96173385  ---  Number of Write Commands
0x01  0x028  6    103634734656  ---  Logical Sectors Read
0x01  0x030  6      1117227079  ---  Number of Read Commands
0x01  0x038  6     13240800000  ---  Date and Time TimeStamp
0x02  =====  =               =  ===  == Free-Fall Statistics (rev 1) ==
0x02  0x010  4              50  ---  Overlimit Shock Events
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4              99  ---  Spindle Motor Power-on Hours
0x03  0x010  4              99  ---  Head Flying Hours
0x03  0x018  4              15  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              11  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               2  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              31  ---  Current Temperature
0x05  0x010  1              31  N--  Average Short Term Temperature
0x05  0x018  1              31  N--  Average Long Term Temperature
0x05  0x020  1              56  ---  Highest Temperature
0x05  0x028  1              20  ---  Lowest Temperature
0x05  0x030  1              52  N--  Highest Average Short Term Temperature
0x05  0x038  1              29  N--  Lowest Average Short Term Temperature
0x05  0x040  1              44  N--  Highest Average Long Term Temperature
0x05  0x048  1              31  N--  Lowest Average Long Term Temperature
0x05  0x050  4              82  ---  Time in Over-Temperature
0x05  0x058  1              55  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               5  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             111  ---  Number of Hardware Resets
0x06  0x010  4              40  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x0002  4            0  R_ERR response for data FIS
0x0003  4            0  R_ERR response for device-to-host data FIS
0x0004  4            0  R_ERR response for host-to-device data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x0006  4            0  R_ERR response for device-to-host non-data FIS
0x0007  4            0  R_ERR response for host-to-device non-data FIS
0x0008  4            0  Device-to-host non-data FIS retries
0x0009  4            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  4            1  Device-to-host register FISes sent due to a COMRESET
0x000b  4            0  CRC errors within host-to-device FIS
0x000d  4            0  Non-CRC errors within host-to-device FIS
0x000f  4            0  R_ERR response for host-to-device data FIS, CRC
0x0010  4            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  4            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  4            0  R_ERR response for host-to-device non-data FIS, non-CRC

 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Hello,

can you please add your detailed hardware information to the post? Especially the mainboard, any PCIe cards you may have connected drives to, and the power supply.

And please use CODE instead of ICODE for larger code blocks, this enhances readability.


As interpreting SMART data is really beyond my knowledge, I'm just giving you pointers on what to add to your post; someone else will probably dive into the details.


Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
Did you run a short or a long test on the drive?
 

tobo

Thank you chuck32, I updated the post above.

All the results I see are from short tests. I started a long test manually which will take some more hours to complete.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703

tobo

I guess any or all of those could have resulted in ZFS not getting what it asked for at the right time.
Thank you for pointing these out! I will report back once the SMART results are in.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I suspect some of the data you originally posted was remembered incorrectly. Please humor me, and if I clearly have it wrong, I apologize now.

1) You received a ZFS failure message stating the drive was faulty and was placed offline.
2) You viewed the ZFS pool status and saw some read and write errors.
3) You believed what you were reading was physical hard drive errors and started the RMA process.
4) TrueNAS resilvered the drive and the errors went away.
5) You said to yourself, WTF just happened and examined the actual SMART data and saw no errors.
6) We are here now.

First, you had a ZFS failure, which does not always mean a hard drive failure. All the indications point to the drive not responding, and that is actually supported by the log. I don't know what PhyNRdy actually means other than Physically Not Ready. It could be from the data just not coming back fast enough. A SMART Extended/Long test will tell you if you have a media issue if it fails.

Once the drive resilvered, the ZFS errors went away.

SMART errors typically remain recorded in the SMART data. The "WORST" value would retain the error value at a minimum, but all looks good. There is no Write Error count in SMART. If there was a write issue, it would manifest itself in the Read section because we Erase, Write, Read to write data on a drive.

Another issue I see is that you have had 50 shock events. Damn, that is not good. Was it used as a bowling ball? Just kidding. But if you are bouncing the system around, the drive could momentarily drop (pun not intended) offline and then come back. So it is possible you have caused physical damage. If you need to move the server, power off first. Or just be gentle.

You also exceeded the maximum operating temperature, so your warranty is likely void now. The max temp is 55C and you got to 56C.

Should this happen again, run smartctl -l sataphy /dev/sda and examine the data. Right now you have 1 set of errors. If the value increases then it's time to take a harder look at it. If you cannot identify a physical cause then I think you should examine the power connections first.
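
To make that comparison easier, here is a minimal sketch (not a TrueNAS feature) that pulls a single counter out of the `smartctl -l sataphy` output. The helper name `phy_counter` is made up for illustration, and the device name is assumed to still be /dev/sda. Counter ID 0x0009 is the "Transition from drive PhyRdy to drive PhyNRdy" line in the output from the first post.

```shell
#!/bin/sh
# Hypothetical helper: print the Value column for one SATA Phy event counter.
# Reads `smartctl -l sataphy` output on stdin; $1 is the counter ID (e.g. 0x0009).
phy_counter() {
    # Appending "" forces a plain string comparison of the ID column.
    awk -v id="$1" '($1 "") == (id "") { print $3 }'
}

# Example (assumes the drive is still /dev/sda):
#   smartctl -l sataphy /dev/sda | phy_counter 0x0009
```

Noting the number down after each incident gives you something to compare against the next time it happens.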

I recommend that you run a daily SMART short test and a weekly Long/Extended test. Set these up in the TrueNAS GUI. And you should manually kick off the long test for all your drives, make sure there are no other issues lurking.
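
For the manual run, `smartctl -t long /dev/sda` starts the long test and `smartctl -l selftest /dev/sda` shows the log afterwards (the device name is an assumption, repeat for each drive). As a rough sketch, the made-up helper `selftest_problems` below counts log entries whose status is anything other than "Completed without error", going by the log format shown in the first post.

```shell
#!/bin/sh
# Hypothetical helper: count self-test log entries that did not finish cleanly.
# Reads `smartctl -l selftest` (or `-l xselftest`) output on stdin.
# Log entries start with "#"; an entry that is still in progress is counted too.
selftest_problems() {
    awk '/^#/ && !/Completed without error/ { n++ } END { print n + 0 }'
}

# Example (assumes /dev/sda; the long test takes many hours on a 12 TB drive):
#   smartctl -t long /dev/sda
#   smartctl -l selftest /dev/sda | selftest_problems
```

A result of 0 only says the logged tests finished cleanly. It does not replace reading the full output.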

SCT Error Recovery Control:
Read: Disabled
Write: Disabled
Also, I recommend you enable this. The value is given in units of 100 ms, so 70 is the same as 7 seconds: the drive will try for up to 7 seconds before it reports that it cannot provide the data.
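
For reference, smartctl can both read and set this: `smartctl -l scterc /dev/sda` shows the current setting and `smartctl -l scterc,70,70 /dev/sda` sets the read and write timeouts. On many drives the setting does not survive a power cycle, so it may have to be reapplied after a reboot. The sketch below is only a starting point under those assumptions; `erc_is_disabled` is a made-up helper that simply looks for "Disabled" in the output format quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if SCT ERC read or write is reported as Disabled.
# Reads `smartctl -l scterc` output on stdin.
erc_is_disabled() {
    grep -Eq '^[[:space:]]*(Read|Write):[[:space:]]*Disabled'
}

# Example (assumes /dev/sda; 70,70 = read and write timeouts):
#   if smartctl -l scterc /dev/sda | erc_is_disabled; then
#       smartctl -l scterc,70,70 /dev/sda
#   fi
```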

Good Luck and I hope the SMART testing shows no errors. That drive is not old at all.
 

tobo

Thank you @joeschmuck for all the insight, I appreciate learning new stuff every time I ask a question here!

Your assumptions were spot on: there were no media errors after the long SMART test. But you made me look for the other things (I can post the SMART results if useful). I found the PhyNRdy error on the other drives as well. I am going to replace that loyal old RAID adapter next, because that is what all of them have in common.

50 shock events

This actually leaves me stumped. One of the other drives has over 300 events. All the drives were brand new and are properly fixed in a 19" housing in a 19" rack. In the basement, solidly mounted to a wall. I am living in one of the most boring places on this planet when it comes to earthquakes. No heavy machines or trucks on the street...

I am going to replace the RAID controller next, as the drives don't seem to be the issue. Will report back with any news.

As always, thanks to everyone who took the time looking into my issue!
 

joeschmuck

I would wait to see if the problem happens again before buying another LSI card. Speaking of the LSI card, you do have it flashed to IT mode, correct? If not, then that could be your problem for sure. Do you have the computer on a UPS? If not, you should. A brownout "may" cause this, although I would expect the computer to reboot or just crash.

Also consider the power supply. I'm not telling you to replace it but it is a possibility.

As for the shock events, were these brand new drives? Not refurbished or from a questionable 3rd party? Over 300 shock events is a big value. I'm not saying anything is physically wrong with your drives; if they pass a Long test, they should be good from a drive perspective.

I might know why they are registering like that... Your drives were built to be used in a seismograph and it is picking up earth tremors across the world. It's possible? Maybe not but stranger things happen :tongue:

Good luck!
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Not only does it need to be in IT mode, it also needs to be on the latest firmware, which quite a few used cards do not have. Old firmware causes bugs with the controller, which causes weird issues in TrueNAS and with drives.
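
One possible way to check both points, assuming the `sas2flash` utility for SAS2008-based cards is available on the system (that is an assumption, and so are the field labels the sketch matches on): list the controller and look at the firmware lines. The helper name `hba_firmware_lines` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep only the firmware-related lines of `sas2flash -list` output.
# Assumes labels such as "Firmware Version" and "Firmware Product ID" (the latter
# is expected to end in "(IT)" or "(IR)"); adjust if your tool prints something else.
hba_firmware_lines() {
    grep -E '^[[:space:]]*Firmware (Version|Product ID)'
}

# Example (run as root):
#   sas2flash -list | hba_firmware_lines
```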
 