Help troubleshooting a critical error

paulinventome

Explorer
Joined
May 18, 2015
Messages
62
Hi,

So I have seen a couple of these:

CRITICAL

Device: /dev/ada7, 135 Currently unreadable (pending) sectors.


It is referring to a 3x4TB SSD-based RAIDZ. I've seen it on two drives so far. All drives are connected to the motherboard SATA ports. I have lots of other drives in the system, using a combination of a PCI SATA card and an OWC NVMe PCI switch, and all of these seem fine. It's just this 3xSSD RAID.

How can I go about troubleshooting this? I've checked the connections physically and they seem fine. Are there any other places within TrueNAS where I can see more of a log as to what causes this?

These errors happened overnight, so presumably during some kind of housekeeping. I'm not aware of any actual issues reading and writing to the shared drive.

Bit stuck where to start on this one....

Kindest
Paul
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You could take a look at my Hard Drive Troubleshooting Guide (link below) to start.

Pending sector errors by themselves are not a big issue, but if you are not performing routine SMART Long tests, you could be in potential trouble. I highly recommend that you post, in code brackets, the output of smartctl -a /dev/ada7 so we can see all the drive data, diagnose the problem further, and tell you whether you have a real problem or not.

-Joe
 

paulinventome

Thank you Joe.

The output is shown below. I ran Long Smart Tests on all 3 of the SSDs and they all passed.

Is there anything in this lot I can do something about?

Kindest
Paul

Code:
=== START OF INFORMATION SECTION ===
Device Model:     CT4000MX500SSD1
Serial Number:    2211E61C08D1
LU WWN Device Id: 5 00a075 1e61c08d1
Firmware Version: M3CR044
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 13 17:14:57 2022 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2450
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       64
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       4
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       197
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   049   000    Old_age   Always       -       35 (Min/Max 24/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0030   100   100   001    Old_age   Offline      -       0
206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       10013873328
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       80650125
248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       93326706

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 3

ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 ec 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  c8 00 00 00 00 00 00 00      00:00:00.000  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2420         -
# 2  Short offline       Completed without error       00%      2419         -
# 3  Extended offline    Completed without error       00%      1869         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

joeschmuck

Are you sure this is the same drive that had the error? I do not see any pending sector errors; also note there is no ID 135 on this drive. I don't think this is the same drive.

Look at the SMART data for each drive you have (look at the spinning rust drives first, as SSDs generally do not have "pending sector errors"); one of them should have an ID 135 and a RAW VALUE = 3 (or possibly higher by now).

Also, please use code tags when posting data from a command. I fixed yours; it looks much easier to read now.
 

paulinventome

Seems to be correct. The two alerts are for /ada5 and /ada7, which are both 4TB SSDs.

The output for ada5 is below

Both the errors happened on the 9th Oct. Now in the output I assume I need to match 135 and 52 being some kind of ID? Does that match with LU WWN Device Id? I can go through all the drives and match the ID once I know what I am looking for.

The alerts are:

CRITICAL

Device: /dev/ada7, 135 Currently unreadable (pending) sectors.

2022-10-09 00:18:50 (Europe/London)

CRITICAL

Device: /dev/ada5, 52 Currently unreadable (pending) sectors.

2022-10-09 00:18:51 (Europe/London)

and this is /dev/ada5 below

=== START OF INFORMATION SECTION ===
Device Model:     CT4000MX500SSD1
Serial Number:    2211E61C08D1
LU WWN Device Id: 5 00a075 1e61c08d1
Firmware Version: M3CR044
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 14 09:10:16 2022 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2466
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       64
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       4
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       197
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   049   000    Old_age   Always       -       35 (Min/Max 24/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0030   100   100   001    Old_age   Offline      -       0
206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       10013873328
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       80650125
248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       93361314

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 3

ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 ec 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  c8 00 00 00 00 00 00 00      00:00:00.000  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2420         -
# 2  Short offline       Completed without error       00%      2419         -
# 3  Extended offline    Completed without error       00%      1869         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Kindest
Paul
 

joeschmuck

Yes, the number 135 matches the ID number, and it should also state Pending Sector in the title. This other drive appears fine to me as well.

How about posting the output of zpool status and the output of glabel status, and maybe we can see what is going on. It's just very odd that TrueNAS would report a drive error and provide you the wrong drive ident. And ID 52? I've never seen that one before; it sounds way off.
 

paulinventome

So I am struggling to match an ID on the output, where on earth is this ID on the report? I thought it was the first integer on LU WWN Device Id but it cannot be because I started going through all of them and I get the same number there? Could you help put me out of my misery by pointing out where I should be looking on the report?

The Zpool status is below:

root@ome[~]# zpool status
  pool: Blaze
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        Blaze                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/16b2dd52-3e40-11ed-a4fe-a0369f2e7348  ONLINE       0     0     0
            gptid/16bf0a20-3e40-11ed-a4fe-a0369f2e7348  ONLINE       0     0     0
            gptid/16b179e3-3e40-11ed-a4fe-a0369f2e7348  ONLINE       0     0     0
            gptid/16af889d-3e40-11ed-a4fe-a0369f2e7348  ONLINE       0     0     0

errors: No known data errors

  pool: Core
 state: ONLINE
  scan: scrub repaired 0B in 01:00:52 with 0 errors on Sun Oct 9 01:00:52 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Core                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/1ce75e92-0ff6-11ed-99d0-a0369f2e7348  ONLINE       0     0     0
            gptid/1cd5fbf6-0ff6-11ed-99d0-a0369f2e7348  ONLINE       0     0     0
            gptid/1d1f4f7b-0ff6-11ed-99d0-a0369f2e7348  ONLINE       0     0     0

errors: No known data errors

  pool: Vault
 state: ONLINE
  scan: scrub repaired 0B in 10:05:25 with 0 errors on Sun Sep 18 10:05:34 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vault                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/8a4c6ac7-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
            gptid/8a32e3b2-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
            gptid/8a46d78e-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
            gptid/8a367e2e-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
            gptid/8a2cbc01-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
            gptid/8a52dcec-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
            gptid/8a517739-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
            gptid/8a341c9b-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
            gptid/8a47ee8d-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:04 with 0 errors on Sun Oct 16 03:45:04 2022
config:

and the glabel (?) status is

errors: No known data errors
root@ome[~]# glabel status
                                      Name  Status  Components
gptid/16b2dd52-3e40-11ed-a4fe-a0369f2e7348     N/A  nvd0p2
gptid/16bf0a20-3e40-11ed-a4fe-a0369f2e7348     N/A  nvd1p2
gptid/16b179e3-3e40-11ed-a4fe-a0369f2e7348     N/A  nvd2p2
gptid/16af889d-3e40-11ed-a4fe-a0369f2e7348     N/A  nvd3p2
gptid/1c5e440c-e8b0-11ec-948b-a0369f2e7348     N/A  nvd4p1
gptid/8a4c6ac7-f87d-11ec-824d-a0369f2e7348     N/A  ada0p2
gptid/8a32e3b2-f87d-11ec-824d-a0369f2e7348     N/A  ada1p2
gptid/8a46d78e-f87d-11ec-824d-a0369f2e7348     N/A  ada2p2
gptid/8a367e2e-f87d-11ec-824d-a0369f2e7348     N/A  ada3p2
gptid/1ce75e92-0ff6-11ed-99d0-a0369f2e7348     N/A  ada4p2
gptid/1cd5fbf6-0ff6-11ed-99d0-a0369f2e7348     N/A  ada5p2
gptid/8a2cbc01-f87d-11ec-824d-a0369f2e7348     N/A  ada6p2
gptid/1d1f4f7b-0ff6-11ed-99d0-a0369f2e7348     N/A  ada7p2

Kindest
Paul
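
The zpool status and glabel status listings can be cross-referenced to see which pool each alerted device belongs to. The following is a minimal sketch, assuming the output layouts shown in this thread; the helper names (label_to_pool, device_to_label, pool_for_device) are hypothetical, and the sample text is a trimmed copy of the posted listings.

```python
import re

# Trimmed excerpts of the listings posted in this thread.
ZPOOL_STATUS = """\
  pool: Core
 state: ONLINE
config:
        Core                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/1ce75e92-0ff6-11ed-99d0-a0369f2e7348  ONLINE       0     0     0
            gptid/1cd5fbf6-0ff6-11ed-99d0-a0369f2e7348  ONLINE       0     0     0
            gptid/1d1f4f7b-0ff6-11ed-99d0-a0369f2e7348  ONLINE       0     0     0
  pool: Vault
 state: ONLINE
config:
        Vault                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/8a4c6ac7-f87d-11ec-824d-a0369f2e7348  ONLINE       0     0     0
"""

GLABEL_STATUS = """\
gptid/8a4c6ac7-f87d-11ec-824d-a0369f2e7348     N/A  ada0p2
gptid/1ce75e92-0ff6-11ed-99d0-a0369f2e7348     N/A  ada4p2
gptid/1cd5fbf6-0ff6-11ed-99d0-a0369f2e7348     N/A  ada5p2
gptid/1d1f4f7b-0ff6-11ed-99d0-a0369f2e7348     N/A  ada7p2
"""


def label_to_pool(zpool_text):
    """Map each gptid label to the pool whose section it appears in."""
    result = {}
    pool = None
    for line in zpool_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "pool:":
            pool = parts[1]
        elif parts and parts[0].startswith("gptid/") and pool is not None:
            result[parts[0]] = pool
    return result


def device_to_label(glabel_text):
    """Map a device name such as 'ada5' to its gptid label."""
    result = {}
    for line in glabel_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].startswith("gptid/"):
            # Strip the partition suffix: 'ada5p2' -> 'ada5'.
            device = re.sub(r"p\d+$", "", parts[2])
            result[device] = parts[0]
    return result


def pool_for_device(device, zpool_text, glabel_text):
    """Return the pool name for a device, or None if it is not found."""
    label = device_to_label(glabel_text).get(device)
    return label_to_pool(zpool_text).get(label)


if __name__ == "__main__":
    for dev in ("ada5", "ada7"):
        print(dev, "->", pool_for_device(dev, ZPOOL_STATUS, GLABEL_STATUS))
```

With the listings posted here, both ada5 and ada7 resolve to the Core pool.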
 

joeschmuck

Paul,
This is my mistake for confusing you, my apologies.

I wanted to see the other data you posted to ensure you didn't have other problems. I researched your Crucial drives and could not locate the meaning of the Unknown_Attribute fields. Everything does look good right now.

Typically I would expect IDs 196, 197, and 198 to change to indicate problems with the blocks of data, as well as IDs 5 and 180 to denote a hard failure, with the block then mapped out as unusable. So far you have zero ("0") values in the RAW field. I suspect that ID 202 is the Wear Level because the Threshold Value is "1" and the current value is "100". It will count down from 100 to 1 over time as erase cycles are performed on the cells.
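
As a rough illustration of that check, the attribute table from a smartctl -a report could be scanned for non-zero RAW values on IDs 5, 196, 197 and 198. This is only a sketch: the function names are made up for the example, ID 180 is left out because its raw value on these drives (197) looks like a count of spare blocks rather than an error count, and reading the number in the TrueNAS alert as a pending-sector count to compare against ID 197 is an assumption based on the alert wording.

```python
import re

# IDs whose RAW value is expected to move away from 0 when blocks go bad.
# (Assumption: 180 is skipped because its raw value here is not an error count.)
WATCHED_IDS = (5, 196, 197, 198)

# One row of the "Vendor Specific SMART Attributes" table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Alert text as shown by TrueNAS in this thread.
ALERT = re.compile(r"Device: (\S+), (\d+) Currently unreadable \(pending\) sectors")

# A few rows copied from the report posted above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   049   000    Old_age   Always       -       35 (Min/Max 24/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""


def parse_attributes(report):
    """Return {id: (name, raw_value)} for every attribute row in the report."""
    attrs = {}
    for line in report.splitlines():
        match = ATTR_ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


def nonzero_watched(report):
    """Return the watched attributes whose RAW value is not zero."""
    attrs = parse_attributes(report)
    return {i: attrs[i] for i in WATCHED_IDS if i in attrs and attrs[i][1] != 0}


def parse_alert(text):
    """Return (device, count) from an alert line, or None if it does not match."""
    match = ALERT.search(text)
    return (match.group(1), int(match.group(2))) if match else None


if __name__ == "__main__":
    alert = "Device: /dev/ada7, 135 Currently unreadable (pending) sectors."
    print("alert:", parse_alert(alert))
    print("non-zero watched attributes:", nonzero_watched(SAMPLE))
```

For the rows above, the alert reports 135 while ID 197 reads 0, which is the mismatch being discussed in this thread.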

The last thing you could do is look at the output for one of these drives using smartctl -x /dev/ada7, as this may produce more information. While the ID section is typically identical between the -a report and the -x report, maybe Crucial is doing something different. Take a look; if you want to, you can post that output here.

Otherwise I do not see anything wrong.

What version of TrueNAS are you running? I know it's Core, I just can't tell the version number. You might also consider submitting a bug report and posting the alert messages and your smartctl output there. Maybe iXsystems can answer the issue.

Best of luck
 