pool status unhealthy

pomah

Explorer
Joined
Jul 30, 2012
Messages
55
Hello

I have the following in my TrueNAS view:
1603531177454.png

When going into "status" for that pool I get this:
1603531223055.png

I see that there is a checksum error, I have done a "resilver" but that did not resolve it...

When looking at the SMART status for each drive I get this:

Ada2:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 591) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 657) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 078 064 044 Pre-fail Always - 63718848
3 Spin_Up_Time 0x0003 094 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 45
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 091 060 045 Pre-fail Always - 1215557898
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 17350 (102 75 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 45
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 076 054 040 Old_age Always - 24 (Min/Max 22/26)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 48
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 95
194 Temperature_Celsius 0x0022 024 046 000 Old_age Always - 24 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 2
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 17349 (153 215 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 111135773402
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 63713614017

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17339 -
# 2 Extended offline Completed without error 00% 17320 -
# 3 Short offline Completed without error 00% 17291 -
# 4 Short offline Completed without error 00% 17267 -
# 5 Short offline Completed without error 00% 17243 -
# 6 Extended offline Completed without error 00% 17221 -
# 7 Extended offline Completed without error 00% 17203 -
# 8 Short offline Completed without error 00% 17171 -
# 9 Short offline Completed without error 00% 17147 -
#10 Short offline Completed without error 00% 17123 -
#11 Short offline Completed without error 00% 17099 -
#12 Short offline Completed without error 00% 17075 -
#13 Short offline Completed without error 00% 17051 -
#14 Extended offline Completed without error 00% 17035 -
#15 Extended offline Completed without error 00% 17020 -
#16 Extended offline Completed without error 00% 16871 -
#17 Extended offline Completed without error 00% 16699 -
#18 Extended offline Completed without error 00% 16532 -
#19 Extended offline Completed without error 00% 16364 -
#20 Extended offline Completed without error 00% 16196 -
#21 Extended offline Completed without error 00% 16030 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Ada1:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 591) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 648) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 081 064 044 Pre-fail Always - 112598792
3 Spin_Up_Time 0x0003 096 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 51
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 091 060 045 Pre-fail Always - 1345892798
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 17363 (240 5 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 49
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 076 056 040 Old_age Always - 24 (Min/Max 22/27)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 55
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 277
194 Temperature_Celsius 0x0022 024 044 000 Old_age Always - 24 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 17348 (23 132 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 111112718130
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 65385359355

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17352 -
# 2 Extended offline Completed without error 00% 17333 -
# 3 Short offline Completed without error 00% 17304 -
# 4 Short offline Completed without error 00% 17280 -
# 5 Short offline Completed without error 00% 17256 -
# 6 Short offline Completed without error 00% 17255 -
# 7 Short offline Completed without error 00% 17255 -
# 8 Short offline Completed without error 00% 17232 -
# 9 Extended offline Completed without error 00% 17216 -
#10 Short offline Completed without error 00% 17184 -
#11 Short offline Completed without error 00% 17160 -
#12 Short offline Completed without error 00% 17136 -
#13 Short offline Completed without error 00% 17112 -
#14 Short offline Completed without error 00% 17088 -
#15 Short offline Completed without error 00% 17064 -
#16 Extended offline Completed without error 00% 17048 -
#17 Extended offline Completed without error 00% 16884 -
#18 Extended offline Completed without error 00% 16712 -
#19 Extended offline Completed without error 00% 16545 -
#20 Extended offline Completed without error 00% 16376 -
#21 Extended offline Completed without error 00% 16208 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So as far as I can see there are no issues with the drives, so there is no point in sending them in for replacement, but there is an issue with the pool. How can I fix this, or am I misunderstanding something? My knowledge of all this is low, so most likely I am misunderstanding something...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,972
Looks like your drive ada2 has a communication problem (ID 199). Read the hard drive troubleshooting guide in my signature link for more information and troubleshooting tips.
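
For anyone following along: ID 199 is the UDMA_CRC_Error_Count row of the attribute table, and the number to watch is the last column (RAW_VALUE). Here is a minimal sketch for pulling that single value out, assuming the ten-column layout seen in the smartctl output posted above. smart_raw is a made-up helper name, not something smartmontools provides.

```shell
# Hypothetical helper: print the RAW_VALUE column for one SMART attribute ID.
# Reads `smartctl -A` (or `smartctl -a`) output on stdin; $1 is the attribute ID.
smart_raw() {
    # Attribute rows are recognised by the hex FLAG in column 3 (e.g. 0x003e),
    # which keeps self-test log rows and headers from matching by accident.
    awk -v id="$1" '$1 == id && $3 ~ /^0x/ { print $10 }'
}

# On the NAS itself (device name as used in this thread):
#   smartctl -A /dev/ada2 | smart_raw 199
```

A raw value that stays put is old history; one that keeps climbing suggests the link between drive and controller is still dropping data.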
 

pomah

Explorer
Joined
Jul 30, 2012
Messages
55
> Looks like your drive ada2 has a communication problem (ID 199). Read the hard drive troubleshooting guide in my signature link for more information and troubleshooting tips.

Thanks. I was about to check the cables as per your guide, but after upgrading to the latest 12.0 release and rebooting, I have no more issues... I guess I will be happy with that.

One more question, how come the "Storage/Disks/ S.M.A.R.T. Test Results" information page is empty? I would expect it to have the results from the smart test run before?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,972
> One more question, how come the "Storage/Disks/ S.M.A.R.T. Test Results" information page is empty? I would expect it to have the results from the smart test run before?
I can't answer that. I do not use 12.0, but once I do upgrade to it, if there is a problem then I'll have to investigate it.

If you feel there is a problem with the software then I would request that you submit a bug report. Also keep in mind that your hardware configuration can also cause problems with the way TrueNAS/FreeNAS operates, not saying that is the issue here but something you should be aware of.

As for the SMART data, you should check it maybe weekly to see if some of those key ID values change (increase). If they do, then you probably have a problem.
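
One way to do that weekly comparison without eyeballing two tables is to save the smartctl -A output to a file each week and compare raw values. A rough sketch follows, with assumptions flagged: the list of "key" IDs (5, 187, 188, 197, 198, 199) is my own pick rather than the guide's, and smart_increased and the file paths are hypothetical names.

```shell
# Hypothetical helper: report key SMART attributes whose raw value went up.
# $1 = older snapshot file, $2 = newer snapshot file (both saved `smartctl -A` output).
# The ID list is an assumption; override it with SMART_KEYS="5 197 199" if desired.
smart_increased() {
    awk -v keys="${SMART_KEYS:-5 187 188 197 198 199}" '
        BEGIN { n = split(keys, k, " "); for (i = 1; i <= n; i++) want[k[i]] = 1 }
        # Skip anything that is not an attribute row for a wanted ID.
        !($3 ~ /^0x/) || !($1 in want) { next }
        # First file: remember the old raw value for each ID.
        FNR == NR { old[$1] = $10; next }
        # Second file: print rows whose raw value is now larger.
        ($1 in old) && ($10 + 0 > old[$1] + 0) { print $1, $2, old[$1], "->", $10 }
    ' "$1" "$2"
}

# Example use on the NAS (paths are placeholders):
#   smartctl -A /dev/ada2 > /root/ada2.new
#   smart_increased /root/ada2.old /root/ada2.new
#   mv /root/ada2.new /root/ada2.old
```

No output means none of the watched counters moved since the previous snapshot.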

Good Luck
 

john60

Explorer
Joined
Nov 22, 2021
Messages
85
A step of the Hard Drive Troubleshooting Guide says to:
2) Type smartctl -t long /dev/ada0 where “ada0” is the drive identifier. Note how long it will take for the test to complete. You may still use your system; however, it will slow down the testing.
3) Once the period of time has lapsed for the testing, obtain a SMART Status Result and return to the troubleshooting text.

I entered step 2 in the Shell tab of the web interface. Are the results of smartctl displayed in the web page, or do I need to do something else to get the results? How accurate is the estimate it gives, since 540 is a long time? Is there any indication in the web interface or elsewhere that this test is running?

Also what does off-line mode mean? I ran this command for both disks reporting a checksum error.
what does off-line mode mean.png
 
Joined
Oct 22, 2019
Messages
3,589
> One more question, how come the "Storage/Disks/ S.M.A.R.T. Test Results" information page is empty? I would expect it to have the results from the smart test run before?
Did you clear your web browser cache and/or try in an Incognito (or "Private") window?
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,924
> Also what does off-line mode mean? I ran this command for both disks reporting a checksum error.
The information on SMART tests can be found here: https://www.smartmontools.org/browser/trunk/smartmontools/smartctl.8.in

For "offline" in the context of your question it has: "offline - [ATA] runs SMART Immediate Offline Test. This immediately starts the test described above. This command can be given during normal system operation. The effects of this test are visible only in that it updates the SMART Attribute values, and if errors are found they will appear in the SMART error log, visible with the '-l error' option." I've never used the option described in italics (didn't know it existed) and so I have no idea what additional information it might provide. You used what I have always seen as the "standard" command to run the test.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,924
> Are the results of smartctl displayed in the web page, or do I need to do something else to get the results? How accurate is the estimate it gives, since 540 is a long time? Is there any indication in the web interface or elsewhere that this test is running?
You need to go in and get them via Shell. The information on the output is given in the guide you were using, immediately above the section where you read the instructions to start the test. I don't know if the time is accurate and, no, there is no indication it's running.
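
To make "get them via Shell" concrete: smartctl -l selftest /dev/ada0 prints the self-test log, and smartctl -c /dev/ada0 prints the "Self-test execution status" field (the same field that appears in the full outputs earlier in this thread), which is where a drive normally reports a test in progress. Below is a small sketch for picking the newest log entry out of that output; last_selftest is a hypothetical helper name and it assumes the "# 1" row format shown earlier in the thread.

```shell
# Hypothetical helper: print the most recent entry of the SMART self-test log.
# Reads `smartctl -l selftest` (or `smartctl -a`) output on stdin.
last_selftest() {
    # The newest test is the row numbered "# 1"; strip that prefix and stop.
    awk '$1 == "#" && $2 == 1 { sub(/^# *1 +/, ""); print; exit }'
}

# On the NAS, once the estimated time has passed (device name is an example):
#   smartctl -l selftest /dev/ada0 | last_selftest
```

If the newest entry still shows an older LifeTime(hours) value than expected, the long test has most likely not finished yet.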
 

pomah

Explorer
Joined
Jul 30, 2012
Messages
55
This is an old thread, but for me it was solved. The issue with the SMART reports was a bug that was also resolved a few updates back.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,972
> The smartctl -l error /dev/da1 reports no error
Out of curiosity, why did you need to use the guide in the first place? Did you receive some sort of error message? I ask because reading the error log alone might not be good enough. If you had an error message then you should also list all the SMART data using "smartctl -a /dev/da1" and read the guide for the parameters to examine. Feel free to post the output of this command if you desire someone else to interpret it.
 

nikinp

Contributor
Joined
Sep 7, 2014
Messages
116
Dear Truenasers,
Thanks for being such a supportive community.
I am quite inexperienced but have been using FreeNAS for 10+ years.
I had an error message relating to a drive which had been "removed from the pool". I subsequently shut down and restarted. It now shows as online, and an extended SMART test shows:

Remaining: 0
Lifetime: 15124
Error: N/A

However, in pools, the pool is still showing as "Online (unhealthy)" with a red cross.
Should I be concerned?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
Look at the status (either from the cog menu at the right of the pool or at the shell with zpool status -v).

The reboot will have cleared any errors on the pool from before, so running a scrub should prove it's fine or bring the error back.

If you're only looking at the Error state output from the SMART test, you may be missing issues with the disk...

Make sure you look at the output from smartctl -A /dev/daX (change to match your drive name)

If you have read or write errors shown there, it's an indication the drive is bad.
 

nikinp

Contributor
Joined
Sep 7, 2014
Messages
116
Output from zpool status -v (NXXX_ZFS is shown as unhealthy):

root@truenas[~]# zpool status -v
  pool: NXXX_ZFS
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 56.1M in 00:00:20 with 0 errors on Tue Feb 1 15:12:35 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        NXXX_ZFS                                        ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/765186a4-9084-11e4-a17a-9cb654092d28  ONLINE       0     0     0
            gptid/78ac0533-9041-11ea-8ac1-9cb654092d28  ONLINE       0     0     2

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:06 with 0 errors on Tue Feb 1 03:45:06 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors


output from smartctl -A /dev/ada1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 194 177 021 Pre-fail Always - 5258
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 23
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 080 080 000 Old_age Always - 15136
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 23
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 179
194 Temperature_Celsius 0x0022 132 109 000 Old_age Always - 18
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
> gptid/78ac0533-9041-11ea-8ac1-9cb654092d28 ONLINE 0 0 2
The 2 at the end there indicates checksum errors.

That usually points at cabling or in the worst case at the controller.
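
On a pool with many disks it is easy to overlook a single non-zero counter. As a rough sketch (zpool_error_devs is a made-up helper name, and it assumes the NAME/STATE/READ/WRITE/CKSUM table layout that zpool status printed above), something like this lists only the rows where any of the three counters is non-zero:

```shell
# Hypothetical helper: list devices with non-zero READ/WRITE/CKSUM counters.
# Reads `zpool status` output on stdin.
zpool_error_devs() {
    awk '
        # A config table starts at its header row.
        $1 == "NAME" && $NF == "CKSUM" { in_cfg = 1; next }
        # It ends at the next blank line or at the "errors:" summary.
        NF == 0 || $1 == "errors:" { in_cfg = 0 }
        # Inside the table, columns 3-5 are the READ, WRITE and CKSUM counters.
        in_cfg && NF >= 5 && ($3 + $4 + $5 > 0) {
            print $1, "READ=" $3, "WRITE=" $4, "CKSUM=" $5
        }
    '
}

# On the NAS:
#   zpool status -v | zpool_error_devs
```

For the output posted above, only the second gptid line has a non-zero column (CKSUM 2), so that is the member to investigate.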

Check the SMART testing results from that disk to make sure, but check your cabling too.

You could use glabel status | grep gptid/78ac0533-9041-11ea-8ac1-9cb654092d28 to find which disk that is.
 

nikinp

Contributor
Joined
Sep 7, 2014
Messages
116
If it is cabling and I rectify it, should the 2 disappear on re-running, and should that also update the health status in the GUI?
Re SMART testing: I pasted the results of the SMART extended testing in my first message. That output came from Storage/Disks/selected disk/SMART results. Is there something more you were expecting to see?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
> If it is cabling and I rectify it, should the 2 disappear on re-running, and should that also update the health status in the GUI?
You would run a zpool clear after you think you fixed it and the pool will go back to healthy... but that won't really be a test until you run another scrub.
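
Spelled out as commands, that sequence is zpool clear, then zpool scrub, then another look at zpool status -v. A minimal sketch follows; recheck_pool and the RUN switch are my own hypothetical wrapper for illustration, and only the three zpool subcommands come from ZFS itself.

```shell
# Hypothetical wrapper: reset the error counters, start a scrub, show status.
# Set RUN=echo to print the commands instead of executing them (dry run).
recheck_pool() {
    pool=$1
    ${RUN:-} zpool clear "$pool" &&     # reset READ/WRITE/CKSUM counters
    ${RUN:-} zpool scrub "$pool" &&     # start re-reading and verifying all data
    ${RUN:-} zpool status -v "$pool"    # scrub progress and counters so far
}

# Dry run first, then for real (pool name as posted in this thread):
#   RUN=echo recheck_pool NXXX_ZFS
#   recheck_pool NXXX_ZFS
```

The scrub keeps running after the command returns, so the status shown straight away is only a progress report. Run zpool status -v again once the scan line says the scrub has finished, and check whether the CKSUM column stayed at 0.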

The SMART results look OK... I guess I was mixing your thread with another one where the SMART results weren't shown, apologies for that, you did the right thing.

... as long as you're sure ada1 is the right disk to have checked... You could use glabel status | grep gptid/78ac0533-9041-11ea-8ac1-9cb654092d28 to find which disk that is.
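
If the grep output is hard to read, the same lookup can be sketched with awk. This assumes glabel status prints the usual three columns (Name, Status, Components); gptid_to_dev is a hypothetical helper name.

```shell
# Hypothetical helper: map a gptid label to its underlying partition.
# $1 = label exactly as zpool status shows it; reads `glabel status` on stdin.
gptid_to_dev() {
    # The last column of a glabel status row is the component (e.g. a partition).
    awk -v label="$1" '$1 == label { print $NF }'
}

# On the NAS:
#   glabel status | gptid_to_dev gptid/78ac0533-9041-11ea-8ac1-9cb654092d28
```

Whatever partition name comes back, drop the trailing pN part to get the disk name to pass to smartctl -a.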
 