Unhealthy pool, one checksum error. What should I do?

Steasenburger

Explorer
Joined
Feb 12, 2020
Messages
52
Hi guys,
I was surprised to see that my SSD pool, which consists of a single NVMe SSD, shows up as unhealthy.
I looked up how to find the exact error, and it seems there is a checksum error in one file:
Code:
truenas# zpool status -xv
  pool: ssd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:04 with 1 errors on Tue Sep 14 22:25:14 2021
config:

    NAME                                          STATE     READ WRITE CKSUM
    ssd                                           ONLINE       0     0     0
      gptid/e29055a8-270f-11ea-88d3-40b07609074b  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        /var/db/system/rrd-ae1f6b9a44d94d15a639a40c1e91b328/localhost/df-mnt-hdd-NasStorage/df_complex-free.rrd


I keep almost no data on this pool (that is all on the RAID 5 HDD pool), but the system dataset is stored on it.
I've read somewhere that I should restore the pool from my backups, but unfortunately I don't have one... It's just one single SSD.
I am not sure whether I can just ignore this single checksum error or should do something about it. Delete this file?
Best regards
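
For anyone reading a status report like the one above, here is a minimal sketch that pulls out the two things that matter: which devices have a nonzero CKSUM count, and which files are listed as permanently damaged. The function name `zpool_error_summary` is made up for this example, and it only parses text on stdin, it never touches the pool. It assumes the usual five-column NAME/STATE/READ/WRITE/CKSUM table.

```shell
# Sketch: summarize `zpool status -v` text read from stdin.
# zpool_error_summary is a hypothetical helper name, not a ZFS command.
zpool_error_summary() {
    awk '
        # Header row of the config table: device rows follow.
        $1 == "NAME" && $NF == "CKSUM" { in_table = 1; next }
        # The "errors:" line ends the table; a file list follows only
        # when it reports permanent errors.
        /^errors:/ { in_table = 0; in_files = ($0 ~ /Permanent errors/); next }
        # Device rows have 5 columns; report those with a nonzero CKSUM.
        in_table && NF == 5 && $5 != "0" { print "cksum " $1 " " $5 }
        # Every non-blank line after "Permanent errors" is a file path.
        in_files && NF > 0 {
            line = $0
            sub(/^[ \t]+/, "", line)
            sub(/[ \t]+$/, "", line)
            print "file " line
        }
    '
}
```

Usage would be something like `zpool status -v ssd | zpool_error_summary`, where `ssd` is the pool name from this thread.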
 

Steasenburger

I've also just noticed this message:
Code:
WARNING
The following system core files were found: rrdcached.core. Please create a ticket at https://jira.ixsystems.com/ and attach the relevant core files along with a system debug. Once the core files have been archived and attached to the ticket, they may be removed by running the following command in shell: 'rm /var/db/system/cores/*'.
2021-09-13 00:12:49 (Europe/Berlin)


I am pretty sure that this has something to do with the pool error, because the only file I found in the cores folder is "rrdcached.core".
Should I report it anyway and remove the file afterwards?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Steasenburger said:
I am pretty sure that this has something to do with the pool error, because the only file I found in the cores folder is "rrdcached.core".
Should I report it anyway and remove the file afterwards?

The file with the error is certainly related to the second error.

You should be able to delete that file and it will be automatically regenerated.

Reporting it won't hurt, so why not, if you have the time.

Before/in parallel with that, you will want to look at the smartctl data for the disk in order to understand if it is failing or if the error was just an anomaly.

At the very least, run zpool clear on the pool after you remove the file and watch it closely for a while to see if more errors arrive. Take backups of anything important on that pool, if there is anything.
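
As a rough sketch of that sequence (not an official procedure): remove the damaged file, clear the error counters, scrub, then check status again. The helper names `run` and `clear_after_delete` and the `RUN` switch are made up for this example. By default it only prints the commands, so nothing is deleted until you set `RUN=1`.

```shell
# Sketch of the delete / clear / scrub / check sequence.
# `run`, `clear_after_delete` and RUN are hypothetical names for this example;
# rm, zpool clear, zpool scrub and zpool status -v are the real commands.
run() {
    # Print the command; execute it only when RUN=1.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

clear_after_delete() {
    pool=$1   # pool name, e.g. ssd
    file=$2   # path listed under "Permanent errors"
    run rm -- "$file" &&
    run zpool clear "$pool" &&
    run zpool scrub "$pool" &&
    # The scrub runs in the background, so this first status may still
    # show it in progress; check again once it has finished.
    run zpool status -v "$pool"
}
```

For example, `clear_after_delete ssd /path/to/damaged/file` prints the four commands as a dry run, and `RUN=1 clear_after_delete ssd /path/to/damaged/file` executes them.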
 

Steasenburger

I've already checked the output of smartctl, and it looks good to me:
Code:
truenas# smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SA2000M8250G
Serial Number:                      50026B72825275E2
Firmware Version:                   S5Z42105
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            234,681,708,544 [234 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 2825275e25
Local Time is:                      Tue Sep 14 23:27:19 2021 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    10%
Data Units Read:                    355,151 [181 GB]
Data Units Written:                 22,947,451 [11.7 TB]
Host Read Commands:                 3,012,686
Host Write Commands:                705,186,671
Controller Busy Time:               2,094
Power Cycles:                       45
Power On Hours:                     15,066
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged


The SSD is just 1-2 years old, so it should be fine I guess.

Do I really need to do backups of my pool if there is more or less only the system dataset on it?
I will try to remove the file and report again afterwards.
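
For readers who want to check the same thing on their own drive, here is a minimal sketch that reduces the smartctl output above to the few NVMe health fields relevant to a checksum error. The function name `nvme_health_summary` and the output key names are made up for this example. It only filters text on stdin and assumes the field labels printed by smartctl in the output above.

```shell
# Sketch: filter `smartctl -a` text for an NVMe device (read from stdin)
# down to a few health fields. nvme_health_summary is a hypothetical name.
nvme_health_summary() {
    # Split each line at the colon and the padding spaces after it.
    awk -F': *' '
        /^Critical Warning:/                { print "critical_warning=" $2 }
        /^Available Spare:/                 { print "available_spare=" $2 }
        /^Percentage Used:/                 { print "percentage_used=" $2 }
        /^Unsafe Shutdowns:/                { print "unsafe_shutdowns=" $2 }
        /^Media and Data Integrity Errors:/ { print "media_errors=" $2 }
        /^Error Information Log Entries:/   { print "error_log_entries=" $2 }
    '
}
```

Usage would be `smartctl -a /dev/nvme0 | nvme_health_summary`. For the drive in this post, the drive itself reports no media errors and an empty error log, alongside 20 unsafe shutdowns.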
 

sretalla

Steasenburger said:
Do I really need to do backups of my pool if there is more or less only the system dataset on it?

No, only important things... The system dataset isn't really your important data if you have a config backup elsewhere.

Steasenburger said:
The SSD is just 1-2 years old, so it should be fine I guess.

Indeed, it's looking OK for now.
 

Steasenburger

sretalla said:
No, only important things... system dataset isn't really your important data if you have a config backup elsewhere.

Yes, I do have that, so I think I am fine :)

I just removed the file, ran a scrub, and zpool status now shows no errors.
Thank you very much for your fast help!
 

Jecvay

Cadet
Joined
Jul 20, 2021
Messages
8
I have the same problem, but after I remove the broken file:

errors: Permanent errors have been detected in the following files:
/var/db/system/rrd-30a73be39a88448d941ce234da3c976f/localhost/aggregation-cpu-average/cpu-system.rrd

and run a scrub, the file is regenerated and the error is not fixed.

The checksum count also went from "1" to "2".
 

sretalla

Jecvay said:
I have the same problem, but after I remove the broken file:

errors: Permanent errors have been detected in the following files:
/var/db/system/rrd-30a73be39a88448d941ce234da3c976f/localhost/aggregation-cpu-average/cpu-system.rrd

and run a scrub, the file is regenerated and the error is not fixed.

The checksum count also went from "1" to "2".

What drives are you using?
 

sretalla

Jecvay said:
It's the same as #1: an SSD pool that consists of only one NVMe SSD.

I'm asking about the brand/model.

It can be useful to know which controller is used, as it may be involved in the problem.

For example, Silicon logic have a flash controller known to have issues with TRIM (under FreeBSD).
 

Jecvay

sretalla said:
I'm asking about a brand/model.

It can be interesting to understand the controller used as it may have some involvement in the problem.

Silicon logic have a flash controller known to have issues with TRIM (under FreeBSD) for example.

Thanks. I deleted /var/run/rrdcached.pid, /var/run/rrdcached.sock, and /mnt/pool_ssd/rrd-30a73be39a88448d941ce234da3c976f/*, then restarted TrueNAS and ran a scrub, and that fixed it.

My TrueNAS runs under ESXi, and the SSD's controller is the VMware NVMe controller.

Thank you!
 