Unrecoverable error, but disks seem okay?

Davvo · Apr 17, 2023

Dwarf Cavendish said:
Here is the thing though: if I have a mirrored vdev, irreparable data corruption shouldn't happen unless both disks failed on the same bit of data or there was some other hardware problem, right? Since both disks' SMART tests still look good, I think I'll start with memtesting once my pool is repaired (I just kicked off the scrub).

I hope that it's RAM, because that I can easily swap out (I suppose that ECC can only do so much with bad RAM). And in that case I'll replace ada1 with a fresh disk so that they're less likely to both fail around the same time.

With ECC RAM, iirc, either the single flipped bit is corrected or the system panics, so it is unlikely that RAM was the cause.
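
To illustrate the "corrected or panic" behaviour, here is a minimal toy sketch of a single-error-correcting, double-error-detecting (SECDED) code. It is an assumption-laden illustration, not how any particular memory controller works: real ECC DIMMs typically protect 64 data bits with 8 check bits, while this example protects only 4 data bits with an extended Hamming(8,4) code. The names `encode` and `decode` are made up for the example.

```python
# Toy SECDED code: extended Hamming(8,4). Illustrative only.
DATA_POSITIONS = (3, 5, 6, 7)    # codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # Hamming parity bits; position 0 is overall parity


def encode(nibble: int) -> list:
    """Encode a 4-bit value into an 8-bit codeword (list of 0/1)."""
    bits = [0] * 8
    for k, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> k) & 1
    for p in PARITY_POSITIONS:
        # Each parity bit covers the positions whose index has bit p set.
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p and i != p) % 2
    bits[0] = sum(bits[1:]) % 2  # overall parity over positions 1..7
    return bits


def decode(bits: list) -> tuple:
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    word = list(bits)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i        # XOR of set positions is 0 for a valid codeword
    overall = sum(word) % 2
    if overall == 1:
        # Odd number of flips: assume one, located at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = sum(word[pos] << k for k, pos in enumerate(DATA_POSITIONS))
    return status, nibble


codeword = encode(0b1011)
one_flip = list(codeword)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1
print(decode(one_flip))   # single flip is repaired
print(decode(two_flips))  # double flip is only detected
```

The "uncorrectable" branch corresponds to the case where a real system would be expected to halt rather than continue with bad data. Three or more flipped bits can fool a code like this, which is one reason a memtest is still worthwhile.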

You can memtest your RAM if you suspect hardware is at fault there.

You can get errors on one disk of a mirror, and they get fixed when you scrub the pool, as long as the other copy still has good data.
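
As a rough illustration of that idea (not ZFS code), the sketch below models a two-way mirror where each block has a stored checksum. The function names are hypothetical and SHA-256 is only a stand-in for whatever checksum the pool actually uses. A scrub reads every copy, keeps one that matches the checksum, and rewrites the copies that don't. Only when no copy matches is the error unrecoverable.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Stand-in checksum; the real pool may use a different algorithm."""
    return hashlib.sha256(block).hexdigest()


def scrub_block(copies: list, expected: str) -> str:
    """Verify all mirror copies of one block against its stored checksum.

    `copies` is a list of bytearrays (one per disk) and is repaired in place.
    Returns "ok", "repaired", or "unrecoverable".
    """
    good = None
    for copy in copies:
        if checksum(bytes(copy)) == expected:
            good = bytes(copy)
            break
    if good is None:
        return "unrecoverable"   # every copy fails its checksum
    repaired = False
    for copy in copies:
        if checksum(bytes(copy)) != expected:
            copy[:] = good       # overwrite the bad copy with the good one
            repaired = True
    return "repaired" if repaired else "ok"


data = b"important data"
expected = checksum(data)

disk_a, disk_b = bytearray(data), bytearray(data)
disk_a[0] ^= 0x01                               # one disk returns a flipped bit
print(scrub_block([disk_a, disk_b], expected))  # repaired from the other disk

disk_a[0] ^= 0x01
disk_b[3] ^= 0x80                               # now both copies are damaged
print(scrub_block([disk_a, disk_b], expected))  # nothing left to repair from
```

In this model an unrecoverable error needs both copies to be bad, or the data to have been damaged before it was checksummed and written, which is why the hardware upstream of the disks gets suspected when both disks look healthy.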

joeschmuck · Apr 17, 2023

Dwarf Cavendish said:
I hope that it's RAM, because that I can easily swap out (I suppose that ECC can only do so much with bad RAM). And in that case I'll replace ada1 with a fresh disk so that they're less likely to both fail around the same time.

I should respond to this too. Odds of your ECC RAM being bad are very slim, but I like that you want to test it. If the test fails, you found an issue; if the test passes, you feel more comfortable with it. If you had a bad bit, I would expect you to have lost lots of data. The system would hopefully crash on you. Lots of bad things. But it is good to test your RAM, especially when you question it. It would be the easy thing to replace, I agree.

Dwarf Cavendish · Apr 27, 2023

Well, what shall we make of this now...?

I took the system offline for a while and took the disks out. I put it on my desk and ran tests on it while I wasn't at work. It passed a full day of mprime torture testing and 7 full passes of memtesting (plus some more partial ones).
Considering the age of both disks I decided to replace them both.
Resilvering the first disk went without problems. While resilvering the second one I got an email that my system had an unexpected reboot...

Hardware problems in the SATA controller perhaps...? Or problems with this particular release of TrueNAS Core...?

Davvo · Apr 27, 2023

Generally we suggest running memtest continuously for a week.
What is your boot drive? Can you please tell us your complete hardware config?
Oh, and make sure you didn't experience a power loss at home (or wherever the system is located).

If you replaced the drives, then I would also replace the SATA and power cables.

Dwarf Cavendish · Apr 27, 2023

My hardware config is as follows:

HPE MicroServer Gen10 873830-421.
Extra RAM module: Kingston Server Premier KSM26ES8/8HD.
Boot drive: Transcend SSD370 (Premium) 32GB, connected to the one separate SATA port.
Drives: 2x WD Red+ 4TB, loaded in the drive mounting slots. There aren't any cables to replace here.

While I was running long SMART tests for my old disks I also ran a long SMART test for my boot drive. Its current SMART output:

Code:

smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Silicon Motion based SSDs
Device Model:     TS32GSSD370S
Serial Number:    E065290301
Firmware Version: P1225CE
User Capacity:    32,017,047,552 bytes [32.0 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 27 20:54:15 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x71) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0002)    Does not save SMART data before
                    entering power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (   1) minutes.
Conveyance self-test routine
recommended polling time:      (   1) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       51
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       123
160 Uncorrectable_Error_Cnt 0x0000   100   100   000    Old_age   Offline      -       0
161 Valid_Spare_Block_Cnt   0x0000   100   100   000    Old_age   Offline      -       51
163 Initial_Bad_Block_Count 0x0000   100   100   000    Old_age   Offline      -       17
164 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       18038
165 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       102
166 Min_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       1
167 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       17
168 Max_Erase_Count_of_Spec 0x0000   100   100   000    Old_age   Offline      -       3000
169 Remaining_Lifetime_Perc 0x0000   100   100   000    Old_age   Offline      -       100
175 Program_Fail_Count_Chip 0x0000   100   100   000    Old_age   Offline      -       0
176 Erase_Fail_Count_Chip   0x0000   100   100   000    Old_age   Offline      -       0
177 Wear_Leveling_Count     0x0000   100   100   050    Old_age   Offline      -       1
178 Runtime_Invalid_Blk_Cnt 0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Cnt_Total  0x0000   100   100   000    Old_age   Offline      -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       0
192 Power-Off_Retract_Count 0x0000   100   100   000    Old_age   Offline      -       19
194 Temperature_Celsius     0x0000   100   100   000    Old_age   Offline      -       23
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   016    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   100   100   050    Old_age   Offline      -       0
232 Available_Reservd_Space 0x0000   100   100   000    Old_age   Offline      -       100
241 Host_Writes_32MiB       0x0000   100   100   000    Old_age   Offline      -       4270
242 Host_Reads_32MiB        0x0000   100   100   000    Old_age   Offline      -       118713
245 TLC_Writes_32MiB        0x0000   100   100   000    Old_age   Offline      -       18038

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        51         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
    7        0    65535  Read_scanning was completed without error
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
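
For anyone who wants to check output like the above without reading every row, here is a small sketch that parses the attribute table and looks at the counters that usually matter. It assumes the 10-column layout shown above; the helper names and the choice of counters to watch are my own and not part of smartctl. The sample rows are copied from the output above.

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       51
160 Uncorrectable_Error_Cnt 0x0000   100   100   000    Old_age   Offline      -       0
167 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       17
168 Max_Erase_Count_of_Spec 0x0000   100   100   000    Old_age   Offline      -       3000
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   100   100   050    Old_age   Offline      -       0
    1        0        0  Not_testing
"""

# Counters where any non-zero raw value deserves a closer look (my selection).
ERROR_COUNTERS = (
    "Reallocated_Sector_Ct",
    "Uncorrectable_Error_Cnt",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(text: str) -> dict:
    """Map attribute name to integer raw value for each attribute table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # headers and the selective self-test rows do not match this shape.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, skip it
    return attrs


def nonzero_error_counters(attrs: dict) -> dict:
    """Return the watched counters that are present and non-zero."""
    return {name: attrs[name] for name in ERROR_COUNTERS if attrs.get(name, 0) != 0}


attrs = parse_smart_attributes(SAMPLE)
print(nonzero_error_counters(attrs))  # empty dict means nothing to worry about here

# Rough wear estimate: average erase count relative to the rated maximum.
wear_percent = 100 * attrs["Average_Erase_Count"] / attrs["Max_Erase_Count_of_Spec"]
print(f"approx. wear: {wear_percent:.2f} %")  # 17 / 3000, about 0.57 %
```

With the values above none of the watched counters is non-zero and the estimated wear is well under 1 %, which fits the 100 reported for Remaining_Lifetime_Perc.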

Furthermore:

I never updated the BIOS. HPE are really annoying with this, you can't just download updates. But I also never really had any reason to want to.
I don't think we have issues with power outages here. I think I would notice that in other ways as well.
I did notice a message about a core dump together with something with "/dev/ada0p1", "operation not permitted" in the green thingie at the bottom of the screen.

Lastly, some tunables:

amdtemp_load = YES. When I started out with FreeNAS 11.1 I couldn't read temperature sensors. I volunteered to test out changes made for this CPU by the FreeBSD devs and when it got included into FreeNAS I had to load this module to make the CPU temperature sensors work.
hint.acpi_throttle.0.disabled = YES. If I read the docs right this should be enabled by default.
hw.pci.realloc_bars = 1. Without this screen output wouldn't work for the MicroServer, this used to be a known issue with this hardware (not sure if it still is). But I suppose I don't necessarily care about that when working headlessly.

Dwarf Cavendish · May 21, 2023

Well, a few weeks in it looks like everything is stable again. It would seem that disabling hw.pci.realloc_bars has done the trick here.

A minor annoyance is that for some reason I don't get an email for finished scrub tasks anymore. But that I can look into another day.
