SMART read errors, questions about ZFS scrubbing and offlining the disk

cat_b

Cadet
Joined
Jul 2, 2023
Messages
2
Hi everyone!

I'm fairly new to TrueNAS after flashing it onto my old Linux NAS.

I recently encountered some read errors on one disk. I filed for an RMA, but I have some questions about operations to do in the meantime. I did see the TrueNAS disk replacement document at https://www.truenas.com/docs/core/coretutorials/storage/disks/diskreplace/ and had a couple additional questions.

0. Questions

  1. Despite self-tests 1 and 2 having status "Completed: read failure", I see "SMART overall-health self-assessment test result: PASSED" and Raw_Read_Error_Rate is still 0
    1. Does this indicate some bad sectors, but no actual mechanical issues? Can I safely continue using the disk while I await RMA?
  2. Is it safe to run the machine with one disk offline while awaiting a replacement disk? (pool is raidz1 with 4 disks and a single vdev)
    1. Is there anything I need to do to ensure the bad sectors are marked as such?
  3. Should I scrub the pool before or after taking the faulty disk offline? The last scrub was a couple weeks ago, so the ZFS pool status still shows 0 errors.
  4. I was going to order a replacement disk while I wait for the RMA to process. This means after RMA is processed, I'll probably have an extra disk from the RMA. I'm still new to ZFS, and I've seen varied info about this online, but, given my storage pool that's raidz1 with 4 disks on a single vdev:
    1. Can I expand my pool to 5 disks in-place? Or do I need to completely reformat the vdev / pool to do that?
    2. If I can't expand the vdev in-place from 4 to 5 disks in raidz1, is there a number of additional disks that would allow me to do so?
    3. If I can, will I need to manually tell it to rebalance data evenly across the new disks?


1. What happened

  • I got an email notification that self-test error count has incremented:
CRITICAL
Device: /dev/ada3, Self-Test Log error count increased from 0 to 1.
2023-07-01 00:58:59

  • I checked the smart log with `smartctl -a /dev/ada3`
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD60EFRX-68L0BN1
Serial Number: WD-WX72D507JU0F
LU WWN Device Id: 5 0014ee 212c5502c
Firmware Version: 82.00A82
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5700 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Jul 1 12:06:51 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 113) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 4244) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 696) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 228 196 021 Pre-fail Always - 7591
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 285
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 074 074 000 Old_age Always - 19694
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 248
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 81
193 Load_Cycle_Count 0x0032 196 196 000 Old_age Always - 13328
194 Temperature_Celsius 0x0022 122 097 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 10% 19683 3130118592
# 2 Short offline Completed without error 00% 19659 -
# 3 Short offline Completed without error 00% 19635 -
# 4 Short offline Completed without error 00% 19611 -
# 5 Short offline Completed without error 00% 19587 -
# 6 Short offline Completed without error 00% 19563 -
# 7 Short offline Completed without error 00% 19539 -
# 8 Short offline Completed without error 00% 19515 -
# 9 Short offline Completed without error 00% 19491 -
#10 Short offline Completed without error 00% 19467 -
#11 Short offline Completed without error 00% 19443 -
#12 Short offline Completed without error 00% 19419 -
#13 Extended offline Completed without error 00% 19396 -
#14 Short offline Completed without error 00% 19371 -
#15 Short offline Completed without error 00% 19347 -
#16 Short offline Completed without error 00% 19323 -
#17 Short offline Completed without error 00% 19299 -
#18 Short offline Completed without error 00% 19275 -
#19 Short offline Completed without error 00% 19251 -
#20 Short offline Completed without error 00% 19227 -
#21 Short offline Completed without error 00% 19204 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

  • I filed an RMA for the disk (hardware details in section below)
  • I ran a long SMART test on the offending disk with `smartctl -t long /dev/ada3`
  • One day later, after the long test completed, I checked the SMART log again with `smartctl -a /dev/ada3`
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD60EFRX-68L0BN1
Serial Number: WD-WX72D507JU0F
LU WWN Device Id: 5 0014ee 212c5502c
Firmware Version: 82.00A82
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5700 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jul 2 17:03:53 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 113) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 4244) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 696) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 228 196 021 Pre-fail Always - 7591
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 285
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 073 073 000 Old_age Always - 19723
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 248
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 81
193 Load_Cycle_Count 0x0032 196 196 000 Old_age Always - 13328
194 Temperature_Celsius 0x0022 121 097 000 Old_age Always - 31
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 88

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 10% 19711 3130073184
# 2 Short offline Completed: read failure 10% 19683 3130118592
# 3 Short offline Completed without error 00% 19659 -
# 4 Short offline Completed without error 00% 19635 -
# 5 Short offline Completed without error 00% 19611 -
# 6 Short offline Completed without error 00% 19587 -
# 7 Short offline Completed without error 00% 19563 -
# 8 Short offline Completed without error 00% 19539 -
# 9 Short offline Completed without error 00% 19515 -
#10 Short offline Completed without error 00% 19491 -
#11 Short offline Completed without error 00% 19467 -
#12 Short offline Completed without error 00% 19443 -
#13 Short offline Completed without error 00% 19419 -
#14 Extended offline Completed without error 00% 19396 -
#15 Short offline Completed without error 00% 19371 -
#16 Short offline Completed without error 00% 19347 -
#17 Short offline Completed without error 00% 19323 -
#18 Short offline Completed without error 00% 19299 -
#19 Short offline Completed without error 00% 19275 -
#20 Short offline Completed without error 00% 19251 -
#21 Short offline Completed without error 00% 19227 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

  • Note that the LBA of the first error differs between the extended offline test and the short offline test, which seems bad
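
For scale, the two reported LBAs can be compared with a little shell arithmetic. This is only a sketch: it assumes the self-test log counts LBAs in 512-byte logical sectors (the logical sector size shown in the smartctl output above), and the variable names are made up for illustration.

```shell
# LBA_of_first_error values copied from the self-test log above.
lba_short=3130118592      # short test at 19683 power-on hours
lba_extended=3130073184   # extended test at 19711 power-on hours
sector_bytes=512          # assumption: LBAs count 512-byte logical sectors

# Distance between the two failing addresses.
gap_sectors=$((lba_short - lba_extended))
gap_bytes=$((gap_sectors * sector_bytes))
gap_mib=$((gap_bytes / 1048576))

echo "gap: $gap_sectors sectors, $gap_bytes bytes (about $gap_mib MiB)"
```

Under that assumption the two failures are about 22 MiB apart, so they are different sectors, but close to each other relative to the size of a 6 TB disk.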

2. Hardware Info

  • Storage pool contains:
    • 4x WD Red Plus 6 TB
  • Boot pool contains two NVMe SSDs
  • Running TrueNAS-13.0-U4
  • CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
  • 16 GB of RAM, can look up timings or model if needed
  • Can look up motherboard if needed
  • zpool status -v
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:01 with 0 errors on Sat Jul 1 03:45:01 2023
config:

NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvd0p2 ONLINE 0 0 0
nvd1p2 ONLINE 0 0 0

errors: No known data errors

pool: wd_reds_6g_x4
state: ONLINE
scan: scrub repaired 0B in 06:03:46 with 0 errors on Sun Jun 11 06:03:46 2023
config:

NAME STATE READ WRITE CKSUM
wd_reds_6g_x4 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/8b11663a-e8b1-11ed-9657-e0d55e2ccd56 ONLINE 0 0 0
gptid/8b24cb3b-e8b1-11ed-9657-e0d55e2ccd56 ONLINE 0 0 0
gptid/8b1bd536-e8b1-11ed-9657-e0d55e2ccd56 ONLINE 0 0 0
gptid/8b03e3c9-e8b1-11ed-9657-e0d55e2ccd56 ONLINE 0 0 0

errors: No known data errors

Let me know if you need more specific info about anything! Thanks!
 
Joined
Oct 22, 2019
Messages
3,641
Despite self-test number 1 and 2 having status "Completed: read failure"
If the drive cannot complete a selftest without encountering read errors, you should consider it a failing drive that needs to be replaced ASAP. Especially when dealing with important data. (Hope you have backups, regardless!)


Does this indicate some bad sectors, but no actual mechanical issues? Can I safely continue using the disk while I await RMA?
Is it safe to run the machine with one disk offline while awaiting a replacement disk?
ZFS should be able to handle it while you wait. If you offline the disk, you immediately lose redundancy. (Might put you at greater risk for losing the pool if you offline the disk, and then another disk fails during the interim.)


Should I scrub the pool before or after taking the faulty disk offline? The last scrub was a couple weeks ago, so the ZFS pool status still shows 0 errors.
For your own sanity, it makes sense to run a scrub again, since you're now aware of a potentially failing drive.


Can I expand my pool to 5 disks in-place?
Nope. RAIDZ expansion is not (yet) available.


is there a number of additional disks that would allow me to do so?
Nope. For the same reason as above.


Just because the drive is failing its self-tests doesn't mean ZFS will find any errors. And vice versa: a drive can pass its self-tests, yet ZFS will find errors. It depends on multiple factors. Running another full scrub will at least check all records on the pool and attempt to repair any issues it finds. Even if it passes a scrub with 0 errors, you should still replace the drive, since it failed both a short and an extended self-test.
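
For reference, a scrub can be started and checked from the shell roughly as follows. This is a minimal sketch, not a TrueNAS-specific procedure: `scrub_and_report` is a hypothetical helper name, and it only wraps the standard `zpool scrub` and `zpool status -v` commands.

```shell
# Hypothetical helper: start a scrub on the named pool, then show its status.
scrub_and_report() {
    # "zpool scrub" starts the scrub and returns; the scrub keeps running
    # in the background.
    zpool scrub "$1" || return 1
    # "zpool status -v" shows scrub progress, the per-device READ/WRITE/CKSUM
    # counters, and any files with known data errors.
    zpool status -v "$1"
}
```

With the pool from the first post it would be called as `scrub_and_report wd_reds_6g_x4`; rerunning `zpool status -v wd_reds_6g_x4` later shows whether the scrub finished and whether any errors were repaired.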
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Despite self-tests 1 and 2 having status "Completed: read failure", I see "SMART overall-health self-assessment test result: PASSED" and Raw_Read_Error_Rate is still 0
  1. Does this indicate some bad sectors, but no actual mechanical issues? Can I safely continue using the disk while I await RMA?
Yes, you had a read failure during a self-test read media test. The "PASSED" result is basically an electronics check, not a media check.
Generally this test failure does not on its own mean a mechanical failure. If your drive is within warranty, you can absolutely RMA the drive.
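
Because the overall "PASSED" line does not reflect media test results, it helps to look at the self-test log directly. A minimal sketch, using three log lines copied from the first post as sample input (the variable names are illustrative only):

```shell
# Self-test log lines copied from the second smartctl output in the first post.
selftest_log='# 1 Extended offline Completed: read failure 10% 19711 3130073184
# 2 Short offline Completed: read failure 10% 19683 3130118592
# 3 Short offline Completed without error 00% 19659 -'

# Count the entries whose status reports a read failure.
failed=$(printf '%s\n' "$selftest_log" | grep -c 'read failure')

echo "self-tests with read failures: $failed"
```

On the live system the same filter could be applied to the output of `smartctl -l selftest /dev/ada3` instead of the pasted sample.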

Is it safe to run the machine with one disk offline while awaiting a replacement disk? (pool is raidz1 with 4 disks and a single vdev)
  1. Is there anything I need to do to ensure the bad sectors are marked as such?
No, it's not safe for your data. That said, just because a SMART read test failed, it doesn't mean your data has been damaged. Run a scrub to find out whether your data has been impacted (see the next answer). Leave the failing drive in the system until you can install a replacement.
Should I scrub the pool before or after taking the faulty disk offline? The last scrub was a couple weeks ago, so the ZFS pool status still shows 0 errors.
You do not need to do a scrub unless you want to know if any data has become corrupt. Running a scrub does put more strain on the drive and heats it up. Only run a scrub if you must.
  1. I was going to order a replacement disk while I wait for the RMA to process. This means after RMA is processed, I'll probably have an extra disk from the RMA. I'm still new to ZFS, and I've seen varied info about this online, but, given my storage pool that's raidz1 with 4 disks on a single vdev:
    1. Can I expand my pool to 5 disks in-place? Or do I need to completely reformat the vdev / pool to do that?
    2. If I can't expand the vdev in-place from 4 to 5 disks in raidz1, is there a number of additional disks that would allow me to do so?
    3. If I can, will I need to manually tell it to rebalance data evenly across the new disks?
You really need to read about ZFS pools/vdevs if you are going to run this kind of system, but the short answer is that you cannot simply add one drive to a RAIDZ vdev to expand its capacity. There are, however, ways to expand capacity without destroying your current pool/vdev. And nope, you can't tell it to rebalance the data across the drives, not that I'm aware of. Again, you need to read up on ZFS, what it is and what it isn't.

Check out this YouTube video on pools and vdevs. https://www.youtube.com/watch?v=_aACgNm8UCw


And I can see @winnielinnie just posted before me but I typed all this in so I'm posting it anyway.
 
Joined
Oct 22, 2019
Messages
3,641
And I can see @winnielinnie just posted before me but I typed all this in so I'm posting it anyway.
I'm using this new Google Chrome extension that monitors a web site or forum to view real-time user input. I was able to wrap up what I was typing (based on copying your real-time input), and hit "Post Reply" as fast as possible.

Don't believe me? I'll prove it...

Right now you're writing a new topic, asking if anyone has experience with waterproof hard drives that can operate in a fish tank without harming the aquatic life. Oh... okay, looks like you paused. Why'd you stop typing? Wait... now you're... okay... you're backspacing everything you wrote? Ah, I see. You must have had second thoughts about creating such a topic. Yup. Looks like you just clicked "Delete" on your new draft.
 