Intermittent SMART PHY/CRC Errors

NASbox

Guru
Joined
May 8, 2012
Messages
650
I have a CRON job running on my TrueNAS that watches a few of the key SMART
parameters on my boot drives. The count on each of the following parameters:

168|SATA_Phy_Error_Count
218|CRC_Error_Count

incremented by 1 on each of July 13, 14, 15, 19 and 21.

The counts are not super high:
168|SATA_Phy_Error_Count|32
218|CRC_Error_Count|32

but I'm pretty sure some sort of preemptive maintenance is in order.

My boot pool is a mirror of two budget 120GB SSDs running off of
SATA ports on the Motherboard. I have the system database on the
boot pool since I want the system to be functional without the data
pool if I want to troubleshoot the system with the data pool drives
removed.

The drive showing the errors is a KINGSTON Model# SA400S37120G
(Smart Info at the end of this post.)

The other drive is older and is an HP S700 120GB SSD that seems to be
fine.

IIUC this could be a drive problem, a cable problem, a (motherboard) SATA
port problem or a power supply problem.

My question is how to troubleshoot given the intermittent nature of the
problem. Any suggestions would be much appreciated.


DMESG entries pertaining to the fault.
Code:
(ada3:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 d0 20 c3 7c 40 04 00 00 00 00 00
(ada3:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada3:ahcich5:0:0:0): Retrying command, 3 more tries remain

(ada3:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 30 08 cb b3 40 04 00 00 00 00 00
(ada3:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada3:ahcich5:0:0:0): Retrying command, 3 more tries remain

(ada3:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 20 9b ae 40 04 00 00 00 00 00
(ada3:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
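Because each of these messages names both the device (ada3) and the controller channel (ahcich5), tallying them can help show whether the errors follow the drive or stay with the port after a cable or port swap. A minimal Python sketch, assuming the dmesg text has been captured into a string; the function name `tally_cam_errors` and the regular expression are illustrative only and not part of any TrueNAS tooling:

```python
import re
from collections import Counter

# Matches FreeBSD CAM status lines such as:
# (ada3:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
CAM_STATUS = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): CAM status: (.+)")


def tally_cam_errors(dmesg_text):
    """Count CAM status messages per (device, channel, status) triple."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CAM_STATUS.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


# Sample input copied from the dmesg excerpts in this post.
sample = """\
(ada3:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 d0 20 c3 7c 40 04 00 00 00 00 00
(ada3:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada3:ahcich5:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 30 08 cb b3 40 04 00 00 00 00 00
(ada3:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada3:ahcich5:0:0:0): Retrying command, 3 more tries remain
(ada3:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 20 9b ae 40 04 00 00 00 00 00
(ada3:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
"""

for (device, channel, status), n in sorted(tally_cam_errors(sample).items()):
    print(f"{device} on {channel}: {n} x {status}")
```

If the drive is later moved to another port and the same tally shows a different channel name next to ada3 (or a different device on ahcich5), that would point toward the drive or the port respectively.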

SMART Output for drive:




smartctl -x /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37120G
Serial Number:    REDACTED
LU WWN Device Id: 5 0026b7 782ea1dc1
Firmware Version: S3500102
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul 22 03:23:00 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O--CK   100   100   000    -    100
  9 Power_On_Hours          -O--CK   100   100   000    -    21839
 12 Power_Cycle_Count       -O--CK   100   100   000    -    31
148 Unknown_Attribute       ------   100   100   000    -    0
149 Unknown_Attribute       ------   100   100   000    -    0
167 Write_Protect_Mode      ------   100   100   000    -    0
168 SATA_Phy_Error_Count    -O--C-   100   100   000    -    33
169 Bad_Block_Rate          ------   100   100   000    -    0
170 Bad_Blk_Ct_Erl/Lat      ------   100   100   010    -    0/0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 MaxAvgErase_Ct          ------   100   100   000    -    0
181 Program_Fail_Count      -O--CK   100   100   000    -    0
182 Erase_Fail_Count        ------   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
192 Unsafe_Shutdown_Count   -O--C-   100   100   000    -    19
194 Temperature_Celsius     -O---K   044   062   000    -    44 (Min/Max 31/62)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
199 SATA_CRC_Error_Count    -O--CK   100   100   000    -    0
218 CRC_Error_Count         -O--CK   100   100   000    -    33
231 SSD_Life_Left           ------   090   090   000    -    90
233 Flash_Writes_GiB        -O--CK   100   100   000    -    8865
241 Lifetime_Writes_GiB     -O--CK   100   100   000    -    12051
242 Lifetime_Reads_GiB      -O--CK   100   100   000    -    2641
244 Average_Erase_Count     ------   100   100   000    -    202
245 Max_Erase_Count         ------   100   100   000    -    222
246 Total_Erase_Count       ------   100   100   000    -    40787
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xde       GPL     VS       8  Device vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 33 (device log contains only the most recent 4 errors)
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 33 [0] log entry is empty
Error 32 [3] log entry is empty
Error 31 [2] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 00 00 00 00 00 00 00 40 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d1 01 01 00 00 4f 00 c2 01 40 08     00:00:00.000  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  2f 00 00 01 01 00 00 00 00 00 03 40 08     00:00:00.000  READ LOG EXT
  2f 00 00 01 01 00 00 00 00 00 00 40 08     00:00:00.000  READ LOG EXT
  b0 00 d5 01 01 00 00 4f 00 c2 00 40 08     00:00:00.000  SMART READ LOG
  b0 00 da 00 00 00 00 4f 00 c2 00 40 08     00:00:00.000  SMART RETURN STATUS

Error 30 [1] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 00 00 00 00 00 00 00 40 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d1 01 01 00 00 4f 00 c2 01 40 08     00:00:00.000  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  2f 00 00 01 01 00 00 00 00 00 03 40 08     00:00:00.000  READ LOG EXT
  2f 00 00 01 01 00 00 00 00 00 00 40 08     00:00:00.000  READ LOG EXT
  b0 00 d5 01 01 00 00 4f 00 c2 00 40 08     00:00:00.000  SMART READ LOG
  b0 00 da 00 00 00 00 4f 00 c2 00 40 08     00:00:00.000  SMART RETURN STATUS

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%            21598  -
# 2  Extended offline    Completed without error        00%            21591  -
# 3  Extended offline    Completed without error        00%            20707  -
# 4  Extended offline    Completed without error        00%            18218  -
# 5  Extended offline    Completed without error        00%            13958  -
# 6  Extended offline    Completed without error        00%             7400  -
# 7  Extended offline    Completed without error        00%             6975  -
# 8  Extended offline    Completed without error        00%             1348  -
# 9  Extended offline    Completed without error        00%                0  -
#10  Short offline       Completed without error        00%                0  -

Selective Self-tests/Logging not supported

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              31  ---  Lifetime Power-On Resets
0x01  0x010  4           21839  ---  Power-on Hours
0x01  0x018  6      3799441290  ---  Logical Sectors Written
0x01  0x020  6         9010952  ---  Number of Write Commands
0x01  0x028  6      1245637882  ---  Logical Sectors Read
0x01  0x030  6         1333687  ---  Number of Read Commands
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1              22  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            7  Command failed due to ICRC error
0x0002  4            7  R_ERR response for data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x000a  4           18  Device-to-host register FISes sent due to a COMRESET
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
NASbox said:
> The count on each of the following parameters:
>
> 168|SATA_Phy_Error_Count
> 218|CRC_Error_Count
>
> incremented by 1 on each of July 13, 14, 15, 19 and 21.

Are those linked to either the dates of SMART tests or scrubs?

NASbox said:
> The drive showing the errors is a KINGSTON Model# SA400S37120G
> (Smart Info at the end of this post.)
>
> The other drive is older and is an HP S700 120GB SSD that seems to be
> fine.
>
> IIUC this could be a drive problem, a cable problem, a (motherboard) SATA
> port problem or a power supply problem.
>
> My question is how to troubleshoot given the intermittent nature of the
> problem. Any suggestions would be much appreciated.

You also didn't mention the 100 read errors reported by SMART... those are from the drive itself, so they indicate some level of failure unrelated to cabling.

The CRC errors can come from the controller on the drive, the cabling or the SATA controller, so as you say, it's hard to narrow down unless there is something obvious, like a loose connection or a burning smell from the controller chip.

I would generally treat the drive as untrustworthy and consider living with a single boot device (keeping config backups just in case).
 

NASbox

sretalla said:
> Are those linked to either the dates of SMART tests or scrubs?

I don't think so, and I have no way of finding out. I don't run regularly scheduled SMART tests, so it's not likely a SMART test. As for scrubs, I know the system does one every few days. I'm not sure what the default config is set to, but from another report I'm pretty sure the last two issues were not during a scrub. The one on the 21st definitely wasn't.
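One possible cross-check for the SMART test question: the self-test log in the smartctl output records each test's LifeTime(hours), and the same output gives the current Power_On_Hours (21839) and local time (Fri Jul 22 03:23:00 2022). Subtracting the difference gives a rough wall-clock estimate. This is only a sketch in Python; the function name is made up for illustration, and it assumes the drive was powered continuously between the test and the report, so the result is the latest the test could have run.

```python
from datetime import datetime, timedelta


def estimate_event_time(report_time, power_on_hours_now, lifetime_hours_at_event):
    """Approximate when a logged event happened.

    Assumes the drive stayed powered on between the event and the report;
    any downtime in between would push the real date earlier.
    """
    elapsed = timedelta(hours=power_on_hours_now - lifetime_hours_at_event)
    return report_time - elapsed


# Values taken from the smartctl -x output in the first post.
report_time = datetime(2022, 7, 22, 3, 23)   # "Local Time is"
power_on_hours = 21839                        # attribute 9, Power_On_Hours

for test_hours in (21598, 21591):             # self-test log entries #1 and #2
    when = estimate_event_time(report_time, power_on_hours, test_hours)
    print(f"test at {test_hours} h -> no later than about {when:%Y-%m-%d %H:%M}")
```

Under that assumption the two most recent extended tests land around July 11 and 12, before the first increment on July 13. Power_On_Hours only has one-hour resolution, so treat the times as approximate.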

The report comes from a cron job that I run daily. It runs smartctl -a, compares a set of values with the ones from the previous day, and if they don't match, it prints a report showing the old and new values. I wrote the script to alert me to just this type of situation. I am not getting any alerts from TrueNAS, just the report I produce.
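For anyone wanting something similar, here is a minimal Python sketch of that kind of day-over-day comparison. It is not the actual script from this post. It assumes the classic `smartctl -a` attribute table (ten columns, with a hex FLAG such as 0x0032 in the third column and RAW_VALUE last), and the function names and the watched IDs are just illustrative choices. The sample rows are reconstructed from the values quoted in this thread.

```python
def parse_attributes(smartctl_output):
    """Return {(id, name): raw_value} from a smartctl -a attribute table."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: numeric ID, name, hex flag, ..., raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[(int(fields[0]), fields[1])] = " ".join(fields[9:])
    return attrs


def diff_attributes(old, new, watched=(168, 218)):
    """List watched attributes whose raw value changed between two runs."""
    changes = []
    for key in sorted(new):
        attr_id, name = key
        if attr_id in watched and old.get(key) != new[key]:
            changes.append(f"{attr_id}|{name}|{old.get(key)} -> {new[key]}")
    return changes


# Illustrative input: yesterday's and today's rows for the two counters.
yesterday = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       32
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       32
"""
today = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       33
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       33
"""

for change in diff_attributes(parse_attributes(yesterday), parse_attributes(today)):
    print(change)
```

In a real cron job the two strings would come from a saved copy of the previous day's output and a fresh smartctl run; how those are stored is left out here.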
sretalla said:
> You also didn't mention the 100 read errors reported by SMART... those are from the drive itself, so they indicate some level of failure unrelated to cabling.

Sorry, what 100 read errors? What am I missing? Are you confusing "Raw_Read_Error_Rate" with read errors?
sretalla said:
> The CRC errors can be the controller on the drive, the cabling or the SATA controller, so as you say, hard to narrow down unless something obvious like a loose connection or burning smell from the controller chip.
>
> I would generally treat the drive as untrustworthy and consider living with a single boot device (keeping config backups just in case).

I hadn't thought of the controller chip on the drive. I'll keep an eye on it for now, and an eye out for a sale on a replacement drive. SSDs are pretty cheap, about what a good USB drive used to cost back in the day. When I get a moment I will likely open the box, pull all the cables and reseat them, just in case the contacts have oxidized.
 

sretalla


NASbox

sretalla said:
> OK, so it's not a count of read errors... but it's also not OK...
>
> That should be 120 (not 100) until something is wrong.

Thanks for the reply. Great idea, wrong datasheet. Different drives have slightly different interpretations.

I didn't know Kingston published this info. AFAIK Western Digital doesn't, so I didn't even think to look. I did some additional searching, which led me to a smartmontools page:

https://www.smartmontools.org/ticket/801

which led me to the correct datasheet.

https://media.kingston.com/support/downloads/MKP_521_Phison_SMART_attribute.pdf

Here are the descriptions for the drive in question:

001 Read Error Rate
Counts the number of uncorrectable errors that accumulate when controller
reads data from Flash and ECC events occur.

168 SATA PHY Error Count
Counts the number of SATA PHY errors. This value includes all PHY error
counts, e.g. data FIS CRC, code errors, disparity errors, command FIS CRC.
Value clears upon system power-down.

218 CRC Error Count
Counts the number of CRC errors (read/write data FIS CRC errors).

I'm not sure what to think about Read Error Rate - IIUC as the drive wears out, there will be errors, and the drive "handles" them. Since the drive has 90% life left, I would think that there would have been a few errors -- but I may well be wrong, and would welcome someone correcting me if I am.

Other than reseating or changing the cables, or swapping the drive, is there any meaningful troubleshooting to be done?
 