SOLVED Zpool degraded checksum error

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
Hi All

Just digging through an alert this morning on one of our pools.

It's not showing a particular device, just checksum errors on the pool itself.

It's currently scrubbing and has identified a file that may be corrupt.

This is an all-SSD RAIDZ2 pool, and we have spare drives.

Any input on the below would be greatly appreciated.

Code:
# zpool status -v
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:07 with 0 errors on Sat Oct  3 03:45:07 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          nvd0p2    ONLINE       0     0     0

errors: No known data errors

  pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sun Sep 20 00:00:10 2020
        18.7T scanned at 16.9M/s, 18.3T issued at 16.6M/s, 22.1T total
        0 repaired, 82.73% done, 2 days 19:00:46 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0   504
          raidz2-0                                      DEGRADED     0     0 1.97K
            gptid/32091c9f-6d99-11ea-9b56-0cc47af37ed4  DEGRADED     0     0     0  too many errors
            gptid/3209b4c5-6d99-11ea-9b56-0cc47af37ed4  DEGRADED     0     0     0  too many errors
            gptid/32474a7a-6d99-11ea-9b56-0cc47af37ed4  DEGRADED     0     0     0  too many errors
            gptid/32850be8-6d99-11ea-9b56-0cc47af37ed4  DEGRADED     0     0     0  too many errors
            gptid/3278e193-6d99-11ea-9b56-0cc47af37ed4  DEGRADED     0     0     0  too many errors
            gptid/325737be-6d99-11ea-9b56-0cc47af37ed4  DEGRADED     0     0     0  too many errors
            gptid/32a1c533-6d99-11ea-9b56-0cc47af37ed4  DEGRADED     0     0     0  too many errors
            gptid/32585743-6d99-11ea-9b56-0cc47af37ed4  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        <0x23522>:<0x1>
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
I'm not knowledgeable enough to help you with this, but I do know that those who are will want your complete list of hardware. That looks like metadata corruption, and if it is, it's not good.
 

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
SuperMicro 24 Bay Server SYS-2029U-E1CRTP (Mother board MBD-X11DPU)
128 GB ECC Memory DDR4-2933
Dual Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
SuperMicro HBA AOC-S3008L-L8E
8 x Samsung 7.68 TB SSD drives (see below)

Code:
# smartctl -a /dev/da0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LH7T6HMLA-00005
Serial Number:    S487NY0MB02105
LU WWN Device Id: 5 002538 e09b40ebb
Firmware Version: HXT7404Q
User Capacity:    7,681,501,126,656 bytes [7.68 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Oct  3 11:30:29 2020 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


I've run a manual short SMART test on all of the drives;

they all pretty much look like the below:

Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4648
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       26
177 Wear_Leveling_Count     0x0013   099   099   005    Pre-fail  Always       -       5
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       23406
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   074   058   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0022   074   058   000    Old_age   Always       -       26 (Min/Max 20/42)
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
202 Exception_Mode_Status   0x0033   100   100   010    Pre-fail  Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       19
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       66208803730
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       155551411733
243 SATA_Downshift_Ct       0x0032   100   100   000    Old_age   Always       -       0
244 Thermal_Throttle_St     0x0032   100   100   000    Old_age   Always       -       0
245 Timed_Workld_Media_Wear 0x0032   100   100   000    Old_age   Always       -       65535
246 Timed_Workld_RdWr_Ratio 0x0032   100   100   000    Old_age   Always       -       65535
247 Timed_Workld_Timer      0x0032   100   100   000    Old_age   Always       -       65535
251 NAND_Writes             0x0032   100   100   000    Old_age   Always       -       92828725568

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4646         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Let me know if any other info is required.
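For reference, a short self-test can be started on every data drive from the shell with something like this (a sketch only; it assumes the eight data drives enumerate as da0 through da7, so adjust the device names to suit):

Code:
# start a short SMART self-test on each data drive (da0..da7 assumed)
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    smartctl -t short /dev/$d
done
# a few minutes later, review the self-test logs
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    smartctl -l selftest /dev/$d
done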

""Cheers
G
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi @velocity08,

The zpool status output you posted shows all of your drives in error. There is some hope that the errors on one drive are different from the errors on the other drives, so your data may still be OK.

But first, do you have backups? If you do not, make one ASAP.

Second, it would be surprising for all of these drives to fail at once. It is more likely something common to all the drives that is failing, such as your HBA.

So first, I would do a backup. Second, I would power down the server, remove all the cables and connectors, and replug everything. Once everything is back, boot the server and check the pool again. It may need to resilver; if it does, just let it finish. Once done, or if no resilver is required, I would run a scrub.
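Roughly, from the shell (a sketch; the pool name tank is taken from the output posted above):

Code:
# check the pool state and any resilver progress
zpool status -v tank
# once the resilver (if any) has completed, start a scrub
zpool scrub tank
# watch the scrub progress
zpool status -v tank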
 

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
Heracles said:
Hi @velocity08,

The zpool status output you posted shows all of your drives in error. There is some hope that the errors on one drive are different from the errors on the other drives, so your data may still be OK.

But first, do you have backups? If you do not, make one ASAP.

Second, it would be surprising for all of these drives to fail at once. It is more likely something common to all the drives that is failing, such as your HBA.

So first, I would do a backup. Second, I would power down the server, remove all the cables and connectors, and replug everything. Once everything is back, boot the server and check the pool again. It may need to resilver; if it does, just let it finish. Once done, or if no resilver is required, I would run a scrub.

Thanks for the tip

It actually just finished a scrub a few hours ago and shows no errors on the drives, only the checksum errors on the root of the pool, not on any individual drive.

It's very possible that the HBA is playing up.

I've just added a hot spare to the pool, which is now resilvering; this should also scrub the data at the same time.
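For reference, adding a spare from the CLI looks roughly like this (the FreeNAS GUI does the equivalent; /dev/da8 here is just a placeholder for whatever the spare enumerates as):

Code:
# attach a hot spare to the pool (device name is a placeholder)
zpool add tank spare /dev/da8
# confirm the spare is listed and watch the resilver
zpool status -v tank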

I'll wait for the resilver to finish, which is currently estimated at 7 days, but I don't think this is accurate.

I think it's a wait-and-see game at the moment.

Thanks for your input.

Hopefully there are a few more members who have encountered this issue.

""Cheers
G
 

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
An interesting piece of information that may explain these checksum errors.

Roughly 20 days ago we had a power supply go bang, very loudly, which knocked out an entire feed to the rack; this is by design, to safeguard and isolate power and help find the cause of the trip.

The system still operated fine on a single PSU; we had a spare (as you should) and replaced it online with no downtime.

This may be the cause of these checksum errors.

The timeline roughly fits.

I'm theorising that a subset of data being written at the time of the power supply issue may have been affected; due to COW, the data writes striped across the entire pool may have been affected.

That would create checksum errors across the whole pool, as opposed to just on a single drive.

Thoughts?

I'm hoping another scrub could fix these errors on the next sweep.
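From what I understand, the checksum counters shown by zpool status are cumulative and persist until they are explicitly reset, so once the underlying cause is sorted a clear is usually part of the clean-up (a sketch, pool name as above):

Code:
# reset the pool's error counters once the underlying cause has been addressed
zpool clear tank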

Anyone else encountered something similar?

""Cheers
G
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi again,

velocity08 said:
I'm theorising that a subset of data being written at the time of the power supply issue may have been affected; due to COW, the data writes striped across the entire pool may have been affected.

This is VERY low probability, almost impossible. Not enough data would have been "moving" at that single moment to create that much of a mess at a logical level. Second, FreeNAS checksums everything it does, all the time; until data is saved and validated, it is not marked as complete.

But the mechanical impact you described is 100% compatible with some bad connections: the HBA loose in its PCIe slot, some cables now loose, or something like that.

In every domain, when you have multiple symptoms or problems happening at once, you are better off finding the common cause first. Here, a mechanical impact damaged some connection. It is up to you to find which one.
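A couple of quick things to check from the shell while hunting for a loose connection (a sketch; device names will vary):

Code:
# look for transport/CAM errors in the kernel log that would point at cabling or the HBA
dmesg | grep -iE "cam status|retrying command"
# per-drive link CRC counters (attribute 199) also climb when a link is flaky
smartctl -A /dev/da0 | grep -i crc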
 

velocity08

Dabbler
Joined
Nov 29, 2019
Messages
33
Hey All

Just thought I would report back with the outcome and the steps we performed to resolve the issue (command sketch after the list):

Added a hot spare to the RAIDZ2 - just in case
Allowed the resilver to finish
Pool checksum errors still existed
Cleared the errors from the pool
Scrubbed the pool again (roughly 3-5 days)
Pool is now showing clean, no errors
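Roughly, the clear and re-scrub steps correspond to the following (a sketch; run after the resilver has finished):

Code:
# clear the logged error counts once the resilver has finished
zpool clear tank
# then run a fresh scrub and re-check the result
zpool scrub tank
zpool status -v tank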

Notes:

The pool was originally at roughly 86% capacity.
We take a lot of snapshots: hourly, retained for 24 hours, plus daily, retained for 7 days.

Not sure of the probability of the error being in the snapshots, which is possible since the checksum errors were showing as pool-wide and not on any specific drive.
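For future reference, the <0x23522>:<0x1> entry in zpool status names a dataset and object by ID because the path could not be resolved (often the case when the affected dataset or snapshot has since been destroyed). Something like the following can help map the ID back to a name - a sketch only, and the exact zdb output format varies between versions:

Code:
# the first number is the dataset ID in hex; convert it to decimal
printf '%d\n' 0x23522        # prints 144674
# list the pool's datasets with their IDs and look for that number
zdb -d tank | grep "ID 144674"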

Taking into account the power supply blow-up, which I have read on this and other forums can be a potential cause of checksum errors, we currently believe the PSU failure may have been the cause of this issue. There is no other evidence to suggest a different issue was at play on a brand-new, enterprise-grade server that was checked multiple times and burnt in prior to production.

Anyhow, only time will tell, and we will keep you posted if there are any changes.

Appreciate the comments and advice from all.

""Cheers
G
 