guermantes (Patron; joined Sep 27, 2017; 213 messages)
During a scrub the other night, my RAIDZ2 pool faulted one drive due to read errors. I have recent backups.
This is my first alarm since building my NAS in 2017 with these very disks (so they have been running for a while).
I am currently running a long SMART test on all drives.
Is this a case where zpool clear could be in order, or are 92 read errors a clear sign of a dying drive that should be replaced ASAP?
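For anyone who wants to pull the same per-device counters out of plain zpool status output (for example from a cron job), here is a minimal sketch. The function name zpool_problem_vdevs is hypothetical and not part of TrueNAS or spearfoot's script; it assumes the standard NAME/STATE/READ/WRITE/CKSUM config table and prints every row that is not ONLINE or has a nonzero counter, so the pool and raidz2-0 rows show up next to the faulted disk.

```shell
#!/bin/sh
# Hypothetical helper, not part of TrueNAS or spearfoot's script.
# Reads "zpool status" text on stdin and prints every row of the config
# table that is not ONLINE or has a nonzero READ/WRITE/CKSUM counter.
zpool_problem_vdevs() {
    awk '
        # The config table starts after the NAME/STATE header line.
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        # The "errors:" line ends the config table.
        /^errors:/ { in_cfg = 0 }
        # $3 + 0 turns abbreviated counters such as 1.2K into 1.2,
        # so they still compare as nonzero.
        in_cfg && NF >= 5 && ($2 != "ONLINE" || $3 + 0 > 0 || $4 + 0 > 0 || $5 + 0 > 0) {
            printf "%s %s read=%s write=%s cksum=%s\n", $1, $2, $3, $4, $5
        }
    '
}
```

Usage would be something like: zpool status TANK | zpool_problem_vdevs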
This is the zpool report generated by spearfoot's script:
Code:
########## ZPool status report summary for all pools on server TRUENAS ##########
+--------------+--------+------+------+------+----+----+--------+------+-----+
|Pool Name     |Status  |Read  |Write |Cksum |Used|Frag|Scrub   |Scrub |Last |
|              |        |Errors|Errors|Errors|    |    |Repaired|Errors|Scrub|
|              |        |      |      |      |    |    |Bytes   |      |Age  |
+--------------+--------+------+------+------+----+----+--------+------+-----+
|TANK ?        |DEGRADED|    92|     0|     0| 51%|  8%|   2.78M|     0|    2|
|freenas-boot  |ONLINE  |     0|     0|     0| 17%|   -|      0B|     0|    2|
+--------------+--------+------+------+------+----+----+--------+------+-----+
########## ZPool status report for TANK ##########
  pool: TANK
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 2.78M in 07:16:53 with 0 errors on Mon Jan 1 09:17:20 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        TANK                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/bc101246-d6b5-11e7-9830-ac1f6b2518b2  ONLINE       0     0     0
            gptid/bd0c4046-d6b5-11e7-9830-ac1f6b2518b2  FAULTED     92     0     0  too many errors
            gptid/be124e45-d6b5-11e7-9830-ac1f6b2518b2  ONLINE       0     0     0
            gptid/bf138e9a-d6b5-11e7-9830-ac1f6b2518b2  ONLINE       0     0     0
            gptid/c01a3b3a-d6b5-11e7-9830-ac1f6b2518b2  ONLINE       0     0     0
            gptid/c11c5cdc-d6b5-11e7-9830-ac1f6b2518b2  ONLINE       0     0     0

errors: No known data errors

And here is a full SMART report for the drive in question:
Code:
root@truenas:~ # smartctl -a /dev/ada2
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68N32N0
Serial Number: WD-WCC7K1VUY2X5
LU WWN Device Id: 5 0014ee 2b9d98f66
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Jan 3 21:05:29 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (44640) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 473) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   196   183   051    Pre-fail  Always       -       180
  3 Spin_Up_Time            0x0027   157   154   021    Pre-fail  Always       -       7141
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       104
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48878
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       104
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       185
194 Temperature_Celsius     0x0022   118   101   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       202
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     48665         -
# 2  Extended offline    Completed without error       00%     48581         -
# 3  Short offline       Completed without error       00%     48497         -
# 4  Short offline       Completed without error       00%     48330         -
# 5  Extended offline    Completed without error       00%     48245         -
# 6  Short offline       Completed without error       00%     48162         -
# 7  Short offline       Completed without error       00%     47946         -
# 8  Extended offline    Completed without error       00%     47862         -
# 9  Short offline       Completed without error       00%     47778         -
#10  Short offline       Completed without error       00%     47610         -
#11  Extended offline    Completed without error       00%     47526         -
#12  Short offline       Completed without error       00%     47443         -
#13  Short offline       Completed without error       00%     47202         -
#14  Extended offline    Completed without error       00%     47117         -
#15  Short offline       Completed without error       00%     47034         -
#16  Short offline       Completed without error       00%     46867         -
#17  Extended offline    Completed without error       00%     46784         -
#18  Short offline       Completed without error       00%     46700         -
#19  Short offline       Completed without error       00%     46484         -
#20  Extended offline    Completed without error       00%     46401         -
#21  Short offline       Completed without error       00%     46317         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
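A rough triage sketch for the SMART side, assuming the attribute table layout smartctl prints (numeric ID first, hex FLAG in the third column, raw value in the tenth). The function name smart_watch_attrs is hypothetical, and picking attributes 5, 197, 198 and 199 reflects what is commonly watched on this forum rather than any official pass/fail rule; it also converts Power_On_Hours to years assuming 8,760 hours per year. On the report above, all four counters are 0 and 48878 hours works out to about 5.6 years.

```shell
#!/bin/sh
# Hypothetical helper: reads smartctl attribute output on stdin and prints
# the attributes this sketch treats as worth watching.
smart_watch_attrs() {
    awk '
        # Attribute rows start with a numeric ID, have a hex FLAG in field 3
        # and at least 10 fields; the raw value is field 10. Self-test log
        # rows start with "#" and selective-test rows have only 4 fields.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-fA-F]+$/ && NF >= 10 {
            id = $1 + 0; name = $2; raw = $10
            if (id == 5 || id == 197 || id == 198 || id == 199) {
                # Reallocated, pending, offline-uncorrectable, interface CRC.
                flag = (raw + 0 > 0) ? "CHECK" : "ok"
                printf "%s %s %s\n", name, raw, flag
            } else if (id == 9) {
                # Power_On_Hours, converted assuming 8760 hours per year.
                printf "%s %s (~%.1f years)\n", name, raw, raw / 8760
            }
        }
    '
}
```

Usage would be something like: smartctl -A /dev/ada2 | smart_watch_attrs. A nonzero UDMA_CRC_Error_Count is generally associated with cabling or connection problems rather than the platters, which is one reason to look at it separately from the sector counters.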