
pool status unhealthy

pomah

Member
Joined
Jul 30, 2012
Messages
55
Hello

I have the following in my TrueNAS view:
1603531177454.png

When going into "status" for that pool I get this:
1603531223055.png

I see that there is a checksum error. I ran a "resilver", but that did not resolve it.

When looking at the SMART status for each drive I get this:

Ada2:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 591) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 657) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 078 064 044 Pre-fail Always - 63718848
3 Spin_Up_Time 0x0003 094 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 45
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 091 060 045 Pre-fail Always - 1215557898
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 17350 (102 75 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 45
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 076 054 040 Old_age Always - 24 (Min/Max 22/26)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 48
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 95
194 Temperature_Celsius 0x0022 024 046 000 Old_age Always - 24 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 2
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 17349 (153 215 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 111135773402
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 63713614017

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17339 -
# 2 Extended offline Completed without error 00% 17320 -
# 3 Short offline Completed without error 00% 17291 -
# 4 Short offline Completed without error 00% 17267 -
# 5 Short offline Completed without error 00% 17243 -
# 6 Extended offline Completed without error 00% 17221 -
# 7 Extended offline Completed without error 00% 17203 -
# 8 Short offline Completed without error 00% 17171 -
# 9 Short offline Completed without error 00% 17147 -
#10 Short offline Completed without error 00% 17123 -
#11 Short offline Completed without error 00% 17099 -
#12 Short offline Completed without error 00% 17075 -
#13 Short offline Completed without error 00% 17051 -
#14 Extended offline Completed without error 00% 17035 -
#15 Extended offline Completed without error 00% 17020 -
#16 Extended offline Completed without error 00% 16871 -
#17 Extended offline Completed without error 00% 16699 -
#18 Extended offline Completed without error 00% 16532 -
#19 Extended offline Completed without error 00% 16364 -
#20 Extended offline Completed without error 00% 16196 -
#21 Extended offline Completed without error 00% 16030 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Ada1:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 591) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 648) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 081 064 044 Pre-fail Always - 112598792
3 Spin_Up_Time 0x0003 096 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 51
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 091 060 045 Pre-fail Always - 1345892798
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 17363 (240 5 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 49
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 076 056 040 Old_age Always - 24 (Min/Max 22/27)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 55
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 277
194 Temperature_Celsius 0x0022 024 044 000 Old_age Always - 24 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 17348 (23 132 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 111112718130
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 65385359355

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17352 -
# 2 Extended offline Completed without error 00% 17333 -
# 3 Short offline Completed without error 00% 17304 -
# 4 Short offline Completed without error 00% 17280 -
# 5 Short offline Completed without error 00% 17256 -
# 6 Short offline Completed without error 00% 17255 -
# 7 Short offline Completed without error 00% 17255 -
# 8 Short offline Completed without error 00% 17232 -
# 9 Extended offline Completed without error 00% 17216 -
#10 Short offline Completed without error 00% 17184 -
#11 Short offline Completed without error 00% 17160 -
#12 Short offline Completed without error 00% 17136 -
#13 Short offline Completed without error 00% 17112 -
#14 Short offline Completed without error 00% 17088 -
#15 Short offline Completed without error 00% 17064 -
#16 Extended offline Completed without error 00% 17048 -
#17 Extended offline Completed without error 00% 16884 -
#18 Extended offline Completed without error 00% 16712 -
#19 Extended offline Completed without error 00% 16545 -
#20 Extended offline Completed without error 00% 16376 -
#21 Extended offline Completed without error 00% 16208 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

As far as I can see, there are no issues with the drives, so there is no point in sending them in for replacement. But there is an issue with the pool. How can I fix this, or am I misunderstanding something? My knowledge of all this is limited, so most likely I am misunderstanding something.
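
The checksum error mentioned above is reported per device in the CKSUM column of `zpool status`. As a rough sketch, the helper below reads that output on stdin and lists devices whose CKSUM count is not zero. The function name `zpool_cksum_errors` is hypothetical (it is not a ZFS tool), and the sketch assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout.

```shell
# Hypothetical helper: list devices with a nonzero CKSUM count.
# Reads `zpool status` output on stdin.
zpool_cksum_errors() {
    awk '
        # The device table starts after the NAME ... CKSUM header row.
        $1 == "NAME" && $NF == "CKSUM" { in_cfg = 1; next }
        # A blank line ends the device table.
        in_cfg && NF == 0 { in_cfg = 0 }
        # Field 5 is the CKSUM counter; print the device name and the count.
        in_cfg && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

It could be run as `zpool status | zpool_cksum_errors`. No output means every listed device shows a CKSUM count of 0.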
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
9,145
Looks like your drive ada2 has a communication problem (ID 199). Read the hard drive troubleshooting guide in my signature link for more information and troubleshooting tips.
 

pomah

Member
Joined
Jul 30, 2012
Messages
55
Looks like your drive ada2 has a communication problem (ID 199). Read the hard drive troubleshooting guide in my signature link for more information and troubleshooting tips.

Thanks. I was about to check the cables as per your guide, but after upgrading to the latest 12.0 release and rebooting, I have no more issues. I guess I will be happy with that.

One more question, how come the "Storage/Disks/ S.M.A.R.T. Test Results" information page is empty? I would expect it to have the results from the smart test run before?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
9,145
One more question, how come the "Storage/Disks/ S.M.A.R.T. Test Results" information page is empty? I would expect it to have the results from the smart test run before?
I can't answer that; I do not use 12.0. Once I upgrade to it, I'd investigate if there is a problem.

If you feel there is a problem with the software then I would request that you submit a bug report. Also keep in mind that your hardware configuration can also cause problems with the way TrueNAS/FreeNAS operates, not saying that is the issue here but something you should be aware of.

As for the SMART data, you should check it, maybe weekly, to see whether any of those key ID values change (increase). If they do, you probably have a problem.
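
One way to make that periodic check easier is to save just a few attribute values each time and compare them later. The sketch below is only an illustration: the function name `smart_key_attrs` is hypothetical, and the ID list (5, 187, 188, 197, 198, 199) is an assumption, not taken from the guide. It reads `smartctl -A` style output on stdin.

```shell
# Hypothetical helper: print ID, name, and raw value for selected attributes.
# The ID list is an assumption for illustration, not from the guide.
smart_key_attrs() {
    awk -v ids="5 187 188 197 198 199" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        # Attribute rows start with a numeric ID and have at least 10 fields;
        # field 10 is the first token of RAW_VALUE.
        $1 ~ /^[0-9]+$/ && NF >= 10 && ($1 in want) { print $1, $2, $10 }
    '
}
```

For example, `smartctl -A /dev/ada2 | smart_key_attrs > ada2.$(date +%Y-%m-%d).txt` saves a dated snapshot, and `diff` on two snapshots shows which raw values increased.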

Good Luck
 

john60

Member
Joined
Nov 22, 2021
Messages
34
A step of the Hard Drive Troubleshooting Guide says to:
2) Type smartctl -t long /dev/ada0 where “ada0” is the drive identifier. Note how long it will take for the test to complete. You may still use your system however it will slow down the testing.
3) Once the period of time has lapsed for the testing, obtain a SMART Status Result and return to the troubleshooting text.

I entered step 2 in the Shell tab of the web interface. Are the results of smartctl displayed in the webpage, or do I need to do something else to get the results? How accurate is the estimate it gives, since 540 is a long time? Is there any indication in the web interface or elsewhere that this test is running?

Also what does off-line mode mean? I ran this command for both disks reporting a checksum error.
what does off-line mode mean.png
 

winnielinnie

Neophyte Sage
Joined
Oct 22, 2019
Messages
691
One more question, how come the "Storage/Disks/ S.M.A.R.T. Test Results" information page is empty? I would expect it to have the results from the smart test run before?
Did you clear your web browser cache and/or try in an Incognito (or "Private") window?
 

Redcoat

Dedicated Sage
Joined
Feb 18, 2014
Messages
2,255
Also what does off-line mode mean? I ran this command for both disks reporting a checksum error.
The information on SMART tests can be found here: https://www.smartmontools.org/browser/trunk/smartmontools/smartctl.8.in

For "offline" in the context of your question it has: "offline - [ATA] runs SMART Immediate Offline Test. This immediately starts the test described above. This command can be given during normal system operation. The effects of this test are visible only in that it updates the SMART Attribute values, and if errors are found they will appear in the SMART error log, visible with the '-l error' option." I've never used the option described in italics (didn't know it existed) and so I have no idea what additional information it might provide. You used what I have always seen as the "standard" command to run the test.
 

Redcoat

Dedicated Sage
Joined
Feb 18, 2014
Messages
2,255
Are the results of smartctl displayed in the webpage or do I need to do something else to get the results? How accurate is the estimate it gives since 540 is a long time? Any indication in the web interface or otherwise that this test is running?
You need to go in and get them via the Shell. The information on the output is given in the guide you were using, immediately above the section where you read the instructions to start the test. I don't know if the time is accurate and, no, there is no indication it's running.
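
As a minimal sketch of pulling the result out in the Shell: the self-test log printed by `smartctl -l selftest` (and included in `smartctl -a`) numbers the newest test as 1. The helper below, whose name `selftest_latest` is hypothetical, reads that output on stdin and prints only the newest entry.

```shell
# Hypothetical helper: print the most recent SMART self-test log entry.
# Reads `smartctl -l selftest` (or `smartctl -a`) output on stdin.
selftest_latest() {
    # Entry lines look like "# 1  Extended offline  Completed without error ...";
    # the newest test is numbered 1, so stop at the first match.
    awk '$1 == "#" && $2 == "1" { print; exit }'
}
```

It could be used as `smartctl -l selftest /dev/ada0 | selftest_latest`. While a test is still running, the "Self-test execution status" field in the `smartctl -c /dev/ada0` output should report the percentage of the test remaining.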
 

pomah

Member
Joined
Jul 30, 2012
Messages
55
This is an old thread, but for me it was solved. The issue with the SMART reports was a bug that was resolved a few updates back as well.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
9,145
The smartctl -l error /dev/da1 reports no error
Out of curiosity, why did you need to use the guide in the first place? Did you receive some sort of error message? I ask because reading the error log alone might not be good enough. If you had an error message then you should also list all the SMART data using "smartctl -a /dev/da1" and read the guide for the parameters to examine. Feel free to post the output of this command if you desire someone else to interpret it.
 