Alerts for device errors but no errors shows in zpool status???

David Dyer-Bennet · Apr 6, 2021

The pool shows clean

fsfs% zpool status zpdb
  pool: zpdb
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 0 days 06:53:20 with 0 errors on Sun Mar 7 06:53:35 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        zpdb                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/aafd09cf-d071-11e9-8b36-2c4d54526dc1  ONLINE       0     0     0
            gptid/abca6f0d-d071-11e9-8b36-2c4d54526dc1  ONLINE       0     0     0
            gptid/ac92cc59-d071-11e9-8b36-2c4d54526dc1  ONLINE       0     0     0
            gptid/ad5c80fc-d071-11e9-8b36-2c4d54526dc1  ONLINE       0     0     0
            gptid/af090aed-d071-11e9-8b36-2c4d54526dc1  ONLINE       0     0     0
            gptid/b078a38e-d071-11e9-8b36-2c4d54526dc1  ONLINE       0     0     0

errors: No known data errors

But I'm getting alerts like this:

[Attachment: screenshot of the FreeNAS alert, screenshot_2021-04-06-freenas-192-168-1-205-png.46372]

So, how panicked should I be?

winnielinnie · Apr 6, 2021

~~The output you pasted shows nothing wrong, as far as I can tell. Did you forget to attach something else?~~

I see it now! Maybe it didn't upload / attach the first time.

David Dyer-Bennet · Apr 6, 2021

The preview showed the screenshot of the alert, yeah. Let's see if I can edit it....seems to have worked from my end, at least.

winnielinnie · Apr 6, 2021

Okay, I see it now. Looks like the image didn't load / attach properly before your edit.

What is the output of:
smartctl -l error /dev/ada5
and
smartctl -l selftest /dev/ada5

You can also paste the entire output of smartctl -a /dev/ada5 which will provide more information, but be sure to remove/hide the serial number.

It's absolutely possible to have read errors on a physical drive in one of your vdevs, yet the zpool status (and your saved data) is still healthy.

It's a matter of safely replacing a dying drive and resilvering before it's too late. Everyone has their own threshold for what is critical enough to warrant a drive replacement. Personally, I treat any errors on a drive as a ticking time bomb and get a replacement rather than waiting for more errors. That's just me. Others can chime in with different approaches.

David Dyer-Bennet · Apr 6, 2021

Yeah, if those are real sector-degradation errors, I will be replacing the drive.

fsfs% sudo smartctl -a /dev/ada5
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: HGST Ultrastar He6
Device Model: HGST HUS726060ALA640
Serial Number: [redacted]
LU WWN Device Id: 5 000cca 232d711ef
Firmware Version: AHGNT1EN
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Apr 7 01:03:39 2021 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 57) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 998) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   131   131   054    Pre-fail  Offline      -       87
  3 Spin_Up_Time            0x0007   160   160   024    Pre-fail  Always       -       646 (Average 597)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   130   130   020    Pre-fail  Offline      -       12
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       16593
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       752
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       752
194 Temperature_Celsius     0x0002   162   162   000    Old_age   Always       -       37 (Min/Max 22/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 948 hours (39 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 08 35 63 40 Error: ICRC, ABRT at LBA = 0x00633508 = 6501640

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 28 08 35 63 40 08 4d+14:42:08.245 READ FPDMA QUEUED
60 00 a0 08 44 63 40 08 4d+14:42:08.216 READ FPDMA QUEUED
60 00 98 08 43 63 40 08 4d+14:42:08.216 READ FPDMA QUEUED
60 00 90 08 42 63 40 08 4d+14:42:08.216 READ FPDMA QUEUED
60 00 88 08 41 63 40 08 4d+14:42:08.216 READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

That looks like an actual CRC error on a read being reported? But...if so, why is there nothing in the ZFS pool error counts?

Anything else interesting to read out of the smartctl report?

Can't off-hand figure what I'm protecting myself against by hiding the serial number? But I have done so, better safe than sorry.

Looks like it's less than 2 years old (given that it's on full-time). I suppose that means I have to check the warranty. No chance of having the paperwork still, I don't think, though.
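
On the CRC question, the logged error entry can be decoded by hand. The sketch below is a minimal Python illustration, assuming the conventional ATA error-register bit names and the 28-bit LBA register layout; the function names are made up for this example and are not part of smartctl.

```python
# Bit names for the ATA error register (ER). These follow the usual ATA
# convention and are an assumption here, not something read from the drive.
ER_BITS = {
    0x80: "ICRC",  # interface CRC error during the data transfer
    0x40: "UNC",   # uncorrectable data error on the media
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}


def decode_error_register(er):
    """Return the names of the flags set in the ER byte, highest bit first."""
    return [name for bit, name in sorted(ER_BITS.items(), reverse=True) if er & bit]


def decode_lba28(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from "Error 1" in the log above.
flags = decode_error_register(0x84)
lba = decode_lba28(sn=0x08, cl=0x35, ch=0x63, dh=0x40)

# Power_On_Hours now (16593) minus the power-on lifetime at the error (948).
hours_since_error = 16593 - 948

print(flags, lba, hours_since_error)
```

This reproduces smartctl's own summary line (ICRC, ABRT at LBA 6501640). The ICRC flag refers to the transfer between drive and controller rather than to the platter, which may be one reason nothing showed up in the ZFS counters, and the entry dates from 948 power-on hours, long before the current alerts.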

Jailer · Apr 7, 2021

David Dyer-Bennet said:
Anything else interesting to read out of the smartctl report?

Yes, you're not running scheduled SMART tests. Run a long SMART test and then post the output here in code tags:
smartctl -t long /dev/ada5

winnielinnie · Apr 7, 2021

This is something to note as well:

5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 39

If still within warranty, I would have it replaced, personally. Who knows in what succession, or how recently, those 39 sectors had to be reallocated.

All the caveats about offlining, resilvering, and GELI encryption apply if you decide to go ahead with a replacement.

Like @Jailer said, manually run an "extended" (long) SMART self-test on the drive with smartctl. It will take a good while to complete (about 16 to 17 hours). You can monitor its progress by occasionally running smartctl with "-a" or "-l selftest", if the drive supports it. Don't start another test before the extended one finishes, or it will abort the test in progress. The moment the test hits a read error, it will stop early, whether that is at 90% remaining, 50% remaining, 10% remaining, or wherever.
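
The 16-to-17-hour figure matches the drive's own "Extended self-test routine recommended polling time" of 998 minutes in the report above. A quick sketch of that arithmetic, with a hypothetical helper that pulls the number out of the smartctl line:

```python
import re


def polling_minutes(line):
    """Extract the minute count from a 'recommended polling time' line, or None."""
    match = re.search(r"\(\s*(\d+)\)\s*minutes", line)
    return int(match.group(1)) if match else None


# Line copied from the smartctl -a output earlier in the thread.
line = "recommended polling time: ( 998) minutes."
minutes = polling_minutes(line)
hours, rest = divmod(minutes, 60)

print(f"Extended self-test: about {hours} h {rest} min")
```

The drive only reports a recommended polling time, so the real duration can be longer if the disk is busy while the test runs.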

winnielinnie · Apr 7, 2021

David Dyer-Bennet said:
Can't off-hand figure what I'm protecting myself against by hiding the serial number? But I have done so, better safe than sorry.

One example is a matter of preventing a no-gooder from registering your drive before you do.

HoneyBadger · Apr 7, 2021

5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 39
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1

These three stand out to me as a drive giving you fair early warning of a failure brewing. You've had 39 sectors that failed and were reallocated, and one more right now is pending reallocation and was also flagged uncorrectable during an offline scan (the drive could not read it back).

David Dyer-Bennet said:
The pool shows clean

I'd say to run a pool scrub since your last one is over a month old and see if it catches/corrects anything. Since the other 39 were "correctable" but you're now looking at an "uncorrectable" I'd like to see if the scrub reports data repaired.
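
For anyone who wants to check these attributes mechanically, here is a minimal Python sketch. It assumes the ten-column attribute layout that smartctl -a printed above; the watch list of IDs 5, 197 and 198 and the function name are choices made for this illustration.

```python
# Attribute rows copied from the smartctl -a report earlier in the thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       39
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

# Attribute IDs treated as early-warning signs; this watch list is an assumption.
WATCHED_IDS = {5, 197, 198}


def nonzero_watched(report):
    """Map attribute name to raw value for watched attributes with a nonzero raw value."""
    found = {}
    for line in report.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


warnings = nonzero_watched(SAMPLE)
for name, raw in warnings.items():
    print(f"{name}: raw value {raw}")
```

The sketch looks at RAW_VALUE rather than the normalized VALUE column because all three rows still show VALUE 100, well above THRESH, which is why the overall self-assessment in the report still says PASSED.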

David Dyer-Bennet · Apr 7, 2021

The pool scrub runs about monthly, and didn't show anything after earlier error alerts just like this. I took them as some sort of startup issue at first, until I'd seen a couple of cases in other situations. (Actual startups being rather rare of course.)

Dunno where the smart test run went, there used to be a regular automatic one of those, too. But I'll see if I can get that fixed!

I'm not bothering to force a long smart-test, I'm just ordering a replacement drive (and then dealing with warranty if I can; a spare drive in inventory is good practice anyway, I've repurposed some of those as backup drives, having a designated spare is good). (Also...having made the choice to replace, no point in doing anything that's the slightest bit stressful to the array until the replacement arrives! This is a 4+2 array so I'm not down to zero redundancy even if ada5 fails permanently, but still.)

No encryption, self or other, on this array. Seems like I *always* have slight bits of trouble doing a drive replacement, but so far never bad enough to lose anything. Doesn't happen very often, and I only manage this one server (sometimes have managed a second jointly with a friend, over at their place), so I don't get practice on that much!

Oh -- does a modern SED get weird if I don't do anything to activate encryption? Never had one of those, might well end up with one on the replacement.

HoneyBadger · Apr 7, 2021

I would schedule periodic short/long SMART tests on your drives regardless of your plan for the immediate replacement; the more early-warning you get, the better.

For modern SEDs - as long as you never claim ownership and set the password, it'll behave just like a normal non-encrypted device. My HDD based machine has SAS SEDs but doesn't leverage FDE.

David Dyer-Bennet · Apr 7, 2021

Just to confirm what the man pages seem to say -- the smartctl "offline" test is safe to use on a device that's part of an active pool, correct? It may, depending on device, have an impact on performance, but mostly load on the device stretches out the test rather than the other way around?

HoneyBadger · Apr 7, 2021

David Dyer-Bennet said:
Just to confirm what the man pages seem to say -- the smartctl "offline" test is safe to use on a device that's part of an active pool, correct? It may, depending on device, have an impact on performance, but mostly load on the device stretches out the test rather than the other way around?

Correct in all statements. Offline tests are safe to use on pool member devices.

It may have a minor impact on performance, in that if the drive is testing a sector that is physically far away from the "active data" then it will have to seek back to that data, then return to the "testing zone" - and the SMART test only proceeds when the device is idle, so a heavy workload will cause the test to take longer.

David Dyer-Bennet · Apr 7, 2021

My remark about stress should indeed apply to the smart tests also, and I wasn't thinking of it that way -- so I was actually thinking same as you, get the smart tests working now but don't force a new scrub.

New drive is due Monday. 8TB rather than 6, which I know is of no benefit now, but as I replace things along the way, eventually the last 6TB will go and whatever the new smallest drive is will be the limit on array size. While I have to look up the commands, I've done that a number of times before and I feel happy doing it again, it seems to just work. (Well, there is *one* benefit now -- 6TB drives are expensive since they're old, not in production I think.)
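
The "smallest drive sets the limit" point can be put in numbers. This is a rough sketch only: it ignores ZFS metadata, padding and the TB/TiB difference, and the function name is made up for the example.

```python
def raidz_usable_tb(drive_sizes_tb, parity=2):
    """Rough usable capacity: every member counts as the smallest drive."""
    data_drives = len(drive_sizes_tb) - parity
    return data_drives * min(drive_sizes_tb)


today = raidz_usable_tb([6, 6, 6, 6, 6, 6])      # the 4+2 array as it is now
one_swap = raidz_usable_tb([6, 6, 6, 6, 6, 8])   # ada5 replaced with an 8TB drive
all_swapped = raidz_usable_tb([8, 8, 8, 8, 8, 8])  # after the last 6TB drive is gone

print(today, one_swap, all_swapped)
```

Under these simplifying assumptions the capacity stays at 24 TB until every 6TB member has been replaced, at which point it would rise to 32 TB.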

David Dyer-Bennet · Apr 7, 2021

And I don't have heavy workloads of anything real, just rare sysadmin things -- scrubs are the closest, or the very VERY rare case when I copy one array to another. I'm basically the only user of this server, it archives my photos and an archival collection of photos of science fiction fandom for the local club, mostly. And...if my head is in sysadmin space I'm probably not doing anything heavy even as the single user. This server is nearly entirely about protecting the data, not performance.

(Scrubs were running at least 35 days apart on a particular day of the week; I've dropped that to 27 days but left the other requirements. Smart tests seem to be enabled but I'll wait until the time for scheduled tests comes to be sure of that.)
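
To see what changing the threshold from 35 to 27 days does to the calendar, here is a small Python sketch. It assumes the scheduler runs a scrub on the first scheduled weekday that falls at least the threshold number of whole days after the previous scrub, and it assumes the scheduled day is Sunday because the last scrub finished on one; neither detail is confirmed in this thread.

```python
from datetime import date, timedelta


def next_scrub(last_scrub, threshold_days, weekday):
    """First date on the given weekday (Mon=0 .. Sun=6) at least threshold_days after last_scrub."""
    earliest = last_scrub + timedelta(days=threshold_days)
    days_ahead = (weekday - earliest.weekday()) % 7
    return earliest + timedelta(days=days_ahead)


last = date(2021, 3, 7)  # "Sun Mar 7 ... 2021" from the zpool status output
SUNDAY = 6               # assumed schedule day

old_setting = next_scrub(last, 35, SUNDAY)
new_setting = next_scrub(last, 27, SUNDAY)

print(old_setting, new_setting)
```

Under those assumptions the 35-day threshold would put the next scrub on 2021-04-11 and the 27-day threshold on 2021-04-04, a week earlier.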

HoneyBadger · Apr 7, 2021

I'd wager you'll notice absolutely zero impact of a SMART test on your system. I wouldn't schedule it at the same time as a scrub though, because then you could have the two workloads in a duel for HDD head position.

David Dyer-Bennet · Apr 7, 2021

HoneyBadger said:
I'd wager you'll notice absolutely zero impact of a SMART test on your system. I wouldn't schedule it at the same time as a scrub though, because then you could have the two workloads in a duel for HDD head position.

Yep, I've got them reasonably spread out, I think (I don't really believe in them until I've seen them run).

David Dyer-Bennet · Apr 12, 2021

And the story isn't quite over; but the replacement drive arrived yesterday, I have made the replacement after some very kind hand-holding in another thread (I replace drives rarely enough I never quite remember it, or if I do the software has changed since then), and the resilver is almost half done.

Oh, and I remembered to turn on SMART tests for the replacement drive (the manual does remind people of that).
