Data Corruption in zpool - how to proceed?

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
My TrueNAS Core core-dumped last night because of data corruption in the pool:

Code:
 pool: FREENAS
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 05:40:09 with 7 errors on Thu Mar 23 16:50:56 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    FREENAS                                         ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/1eeb0ced-3a01-11ec-8b31-90b11c00f1fe  ONLINE       0     0     0
        gptid/199b9746-3a01-11ec-8b31-90b11c00f1fe  ONLINE       0     0     1
        gptid/2c3f167c-3a01-11ec-8b31-90b11c00f1fe  ONLINE       0     0     1
        gptid/391d4ed5-3f19-11ec-9c78-90b11c00f1fe  ONLINE       0     0     1
        gptid/3b788e1b-3a01-11ec-8b31-90b11c00f1fe  ONLINE       0     0     1
        gptid/396ccb49-3a01-11ec-8b31-90b11c00f1fe  ONLINE       0     0     1
      raidz2-1                                      ONLINE       0     0     0
        gptid/b9034d57-3a63-11ec-911c-90b11c00f1fe  ONLINE       0     0     0
        gptid/c5acd85d-3a63-11ec-911c-90b11c00f1fe  ONLINE       0     0     0
        gptid/cbe3eca3-3a63-11ec-911c-90b11c00f1fe  ONLINE       0     0     0
        gptid/d7d8b49e-3a63-11ec-911c-90b11c00f1fe  ONLINE       0     0     0
        gptid/e15ca22d-3a63-11ec-911c-90b11c00f1fe  ONLINE       0     0     0
        gptid/e47d04f4-3a63-11ec-911c-90b11c00f1fe  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /var/db/system/rrd-481d11dc8be44177a340c08096234a34/localhost/df-mnt-FREENAS-FreeNAS_Dup/df_complex-used.rrd
        /var/db/system/rrd-481d11dc8be44177a340c08096234a34/localhost/df-mnt-FREENAS-system_backups/df_complex-used.rrd
        /var/db/system/rrd-481d11dc8be44177a340c08096234a34/localhost/df-mnt-FREENAS-FreeNAS_Dup/df_complex-free.rrd
        /var/db/system/rrd-481d11dc8be44177a340c08096234a34/localhost/df-mnt-FREENAS-WinShare/df_complex-free.rrd
        /var/db/system/rrd-481d11dc8be44177a340c08096234a34/localhost/df-mnt-FREENAS-system_backups/df_complex-free.rrd
        /var/db/system/rrd-481d11dc8be44177a340c08096234a34/localhost/df-mnt-FREENAS/df_complex-free.rrd
        /var/db/system/rrd-481d11dc8be44177a340c08096234a34/localhost/df-root/df_complex-used.rrd


The status report for the pool shows all disks online, but the pool is unhealthy. Unfortunately, I don't have a backup of the pool.

Any suggestions on how to proceed?

Thank you....
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Those are (if I recall correctly) statistics-reporting files. I'm not sure what happens if you just delete them. Either way, they are TrueNAS data, not your data.
You have a bunch of checksum errors on a zpool, possibly caused by a cabling issue. Is there a common cable?

Now according to your specs you have:
  1. 8x 4TB HGST Deskstar NAS 3.5" drives
  2. 7x 4TB Seagate 2.5" drives
Your zpool status shows 2 vdevs of 6 drives each. Can we have some more details about the drives: which drives are in which vdev, and how are they connected? Also, specifically, what model are the Seagate 2.5" drives?
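
To see at a glance which vdev the checksum errors sit in, the device table from zpool status can be parsed from saved text. The following is a minimal Python sketch, not a TrueNAS tool: the function name cksum_errors_by_vdev and the trimmed sample are made up for illustration, and it assumes plain integer counters (zpool abbreviates large counts, e.g. 1.2K, which this regex would skip).

```python
import re

# Device rows of `zpool status` look like: NAME STATE READ WRITE CKSUM
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

# Trimmed sample taken from the output in the first post.
SAMPLE = """
    NAME                                            STATE     READ WRITE CKSUM
    FREENAS                                         ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/1eeb0ced-3a01-11ec-8b31-90b11c00f1fe  ONLINE       0     0     0
        gptid/199b9746-3a01-11ec-8b31-90b11c00f1fe  ONLINE       0     0     1
      raidz2-1                                      ONLINE       0     0     0
        gptid/b9034d57-3a63-11ec-911c-90b11c00f1fe  ONLINE       0     0     0
"""


def cksum_errors_by_vdev(status_text):
    """Return {vdev: [(device, cksum_count), ...]} for devices with CKSUM > 0."""
    result = {}
    vdev = None
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # headers, blank lines, "state:" lines, file lists
        name, _state, _read, _write, cksum = match.groups()
        if name.startswith(("raidz", "mirror", "draid")):
            vdev = name  # rows that follow belong to this vdev
            continue
        if vdev is not None and int(cksum) > 0:
            result.setdefault(vdev, []).append((name, int(cksum)))
    return result


for vdev_name, devices in cksum_errors_by_vdev(SAMPLE).items():
    for device, count in devices:
        print(vdev_name, device, count)
```

Run against the full output in the first post, this should list five devices under raidz2-0 and none under raidz2-1, which is why a shared cable or controller path is worth asking about.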
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Since those files are all in the system dataset, the easy solution is to move it off that pool (even briefly to the boot pool if there are no other options) and back again.
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
The machine of interest is a Dell R730 with 2x Seagate 2.5" 147GB SAS drives for the boot pool, and 12 mixed-brand Toshiba and Seagate 2.5" 4TB SATA drives for the main data pool. (The equipment inventory in my signature wasn't up to date.) This current problem persuaded me to look at the S.M.A.R.T. reports for all of the drives, and every drive reported a passed self-test and no recorded errors. I'll set up SMART reporting with a shell script and email so I can keep tabs on things. I don't think there's a hardware problem; I need to concentrate on fixing the corrupted system files. Thanks for the help!
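
For the reporting idea, here is a rough Python sketch of the one check such a script might make on saved `smartctl -H` output. The function name smart_health is hypothetical; the two strings it looks for are the health lines smartctl prints for ATA and for SCSI/SAS drives respectively. It only reads text and does not run smartctl itself.

```python
import re

# Health line printed for ATA/SATA drives.
ATA_HEALTH = re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)")
# Health line printed for SCSI/SAS drives.
SCSI_HEALTH = re.compile(r"SMART Health Status:\s*(\S+)")


def smart_health(output):
    """Return True if the drive reports healthy, False if not, None if no health line is found."""
    match = ATA_HEALTH.search(output)
    if match:
        return match.group(1) == "PASSED"
    match = SCSI_HEALTH.search(output)
    if match:
        return match.group(1) == "OK"
    return None
```

A passing result here only reflects the drive's own self-assessment; ZFS checksum errors can still come from anywhere on the path between the disk and memory.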
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
Since those files are all in the system dataset, the easy solution is to move it off that pool (even briefly to the boot pool if there are no other options) and back again.
If that's all there is to the fix, I'm willing to try it. Can one move the system dataset while the system is running? I assume there is a 'zfs' command syntax to move datasets between pools? I don't have any pools other than the boot-pool, but there's room there for this operation. Can you help with some additional details? Thank you!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I was wondering if the drives were SMR

As for moving the dataset: it's in the GUI, under System / System Dataset. The only caveat is that if you are AD integrated, you need to disconnect from AD first, as you can't move the system dataset while attached to AD. I am assuming you are on TN Core 12.0-U8 and not FreeNAS 11.3-U5, about which I know nothing (your sig says both).

Re SMART: look up https://www.truenas.com/community/resources/multi_report-sh-version-for-core-and-scale.179/ which contains a script that does everything you want (including cleaning the kitchen sink, sorta) and emails a config and a SMART report whenever you want.
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
Thank you. Only one question: "AD integrated"? UPDATE: Ah, Active Directory. Nope.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The machine of interest is a Dell R730 with 2x Seagate 2.5" 147GB SAS drives for the boot pool, and 12 mixed-brand Toshiba and Seagate 2.5" 4TB SATA drives for the main data pool.
Is that with an HBA330?
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
Is that with an HBA330?
A PERC H730 Mini.

But there is definitely something weird. After I moved the system dataset off and back, I see checksum errors on all of the drives in one vdev.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I believe that the H730 Mini is not a recommended card for ZFS.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
@jgreco, you might want to follow this one, could be interesting.

A PERC H730 Mini.

But there is definitely something weird. After I moved the system dataset off and back, I see checksum errors on all of the drives in one vdev.
OK, so the H730 is a clear suspect here. First of all, to get the advice out of the way: you should replace it with an HBA330. Performance will very likely improve and you sidestep nearly all of the question marks about reliability.

Now, I'd appreciate your help in digging deeper: Which version of the firmware are you running? And what driver is being used by the OS? What does camcontrol devlist report?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
could be interesting.

Not really that interesting. For whatever reason, there was a multi-month stretch where I was warning probably one or two users a day about these H730's, but it seems to have trickled off. I'm kinda happy about it but it gets wearying. We should have a Xenforo add-on that suggests keyword-to-help-article posts because I really do tire of searching these out manually by reading as many posts as I can.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
We should have a Xenforo add-on that suggests keyword-to-help-article posts because I really do tire of searching these out manually by reading as many posts as I can.
Now that would be nice, I tried to do that manually early on and gave up quickly with a renewed appreciation for the MMU's job.
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
Here is the output of camcontrol devlist:
Code:
root@freenas2[~]# camcontrol devlist
<SEAGATE ST9146853SS YS0A>         at scbus1 target 0 lun 0 (pass0,da0)
<SEAGATE ST9146853SS YS0D>         at scbus1 target 1 lun 0 (pass1,da1)
<ATA TOSHIBA MQ04ABB4 0U>          at scbus1 target 2 lun 0 (pass2,da2)
<ATA ST4000LM024-2AN1 0001>        at scbus1 target 3 lun 0 (pass3,da3)
<ATA TOSHIBA MQ04ABB4 0U>          at scbus1 target 4 lun 0 (pass4,da4)
<ATA TOSHIBA MQ04ABB4 0U>          at scbus1 target 5 lun 0 (pass5,da5)
<ATA ST4000LM024-2AN1 0001>        at scbus1 target 6 lun 0 (pass6,da6)
<ATA ST4000LM024-2AN1 0001>        at scbus1 target 7 lun 0 (pass7,da7)
<ATA TOSHIBA MQ04ABB4 0U>          at scbus1 target 9 lun 0 (pass8,da8)
<ATA TOSHIBA MQ04ABB4 0U>          at scbus1 target 10 lun 0 (pass9,da9)
<ATA TOSHIBA MQ04ABB4 0U>          at scbus1 target 11 lun 0 (pass10,da10)
<ATA TOSHIBA MQ04ABB4 0U>          at scbus1 target 12 lun 0 (pass11,da11)
<ATA TOSHIBA MQ04ABB4 0U>          at scbus1 target 13 lun 0 (pass12,da12)
<ATA ST4000LM024-2AN1 0001>        at scbus1 target 14 lun 0 (pass13,da13)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus6 target 0 lun 0 (ses0, pass14)
<HL-DT-ST DVD+-RW GTA0N A3C0>      at scbus12 target 0 lun 0 (cd0, pass15)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus13 target 0 lun 0 (ses1, pass16)
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
It seems to me that the .rrd files are corrupted and moving them around isn't fixing them. What happens if I just delete them?

And what is the checksum error? What is being checked? Is this really a hardware problem?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Is this really a hardware problem?

It's really a hardware problem in that the hardware you've selected doesn't work correctly with TrueNAS. If you take the 730 out and trade it with another 730 in a box where it has been installed and used appropriately, for example perhaps by ESXi, the "new" 730 will begin showing problems and the "old" 730 now in the ESXi host will work reliably. It really depends on how you define "hardware problem". An inability for a hardware component to work correctly with software components known not to work with it is more along the lines of a build quality issue.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
And what is the checksum error? What is being checked? Is this really a hardware problem?
ZFS checksums every block on disk (that's half the point of ZFS) and is saying that it's getting checksum errors from blocks retrieved from those disks.
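
As a rough illustration of what "being checked" means, here is a toy Python sketch of checksum-on-read. It is not ZFS code: the BlockStore class and its methods are invented for this example, and it uses SHA-256 from hashlib for simplicity, whereas ZFS defaults to fletcher4 and keeps each block's checksum in the parent block pointer rather than next to the data.

```python
import hashlib


class BlockStore:
    """Toy store that keeps each block's checksum apart from the block itself."""

    def __init__(self):
        self._blocks = {}     # block id -> bytes as they would come back from disk
        self._checksums = {}  # block id -> digest recorded at write time

    def write(self, block_id, data):
        self._blocks[block_id] = bytes(data)
        self._checksums[block_id] = hashlib.sha256(data).digest()

    def read(self, block_id):
        data = self._blocks[block_id]
        # Recompute the digest on every read and compare with the stored one.
        if hashlib.sha256(data).digest() != self._checksums[block_id]:
            raise IOError(f"checksum mismatch on block {block_id}")
        return data

    def corrupt(self, block_id):
        """Flip one bit to imitate damage from a cable, controller, or disk."""
        data = bytearray(self._blocks[block_id])
        data[0] ^= 0x01
        self._blocks[block_id] = bytes(data)
```

The mismatch only says that what came back differs from what was written; it does not say which component on the path changed it. On a redundant vdev ZFS would then try to rebuild the block from parity or another copy, and a file is listed as a permanent error when that also fails.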
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
TOSHIBA MQ04ABB4 0U

Good drives, a little aged and small on the cache side (16MB) but traditional CMR.

ST4000LM024-2AN1

Unfortunately, these are SMR drives. They are not guaranteed to be responsible, and they don't have the known sector IDNF issues that WD Red SMR drives do, but they are potentially a contributing factor if they are timing out commands under load.

PERC H730 Mini.

May be an issue if it is behaving in RAID mode, still not as ideal as HBA330. Do you recall setting it to HBA Mode in the BIOS/EFI configuration?

dmesg | grep LSI should hopefully pull out the line that's showing the driver being loaded. I expect to see mrsas as the result.
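
The model check above can be repeated on saved camcontrol devlist output with a few lines of Python. This is a sketch under assumptions: parse_devlist and SMR_MODELS are names invented here, the SMR list contains only the model identified as SMR in this post, and disks are assumed to appear as da devices as they do in the listing earlier in the thread.

```python
import re

# One devlist row: <inquiry string> at scbusN target N lun N (dev,dev)
LINE = re.compile(
    r"^<(?P<inquiry>[^>]+)>\s+at scbus(?P<bus>\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<devs>[^)]+)\)$"
)

# Assumption: only the model called out as SMR in this thread.
SMR_MODELS = ("ST4000LM024",)


def parse_devlist(text):
    """Return a list of dicts for da disks: device name, inquiry string, SMR flag."""
    disks = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        names = [name.strip() for name in match.group("devs").split(",")]
        disk = next((name for name in names if name.startswith("da")), None)
        if disk is None:
            continue  # enclosures, optical drives, and similar
        inquiry = match.group("inquiry")
        disks.append({
            "disk": disk,
            "inquiry": inquiry,
            "smr": any(model in inquiry for model in SMR_MODELS),
        })
    return disks
```

On the listing posted earlier in the thread, this should flag da3, da6, da7 and da13 as SMR and leave the eight Toshiba drives and the two SAS boot drives unflagged.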
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
The H730's so-called "HBA mode" is known not to be sufficient.
Hmmm. I had an H310 Mini in the server but took it out and replaced it with the H730, though I don't remember why.

And the H730 is clearly in HBA mode as the above camcontrol devlist output shows.
 