Corrupt files; where?

Joined
Oct 22, 2019
Messages
3,641
It looks like this is a snapshot generated from an @update.
That snapshot is not the main issue here, since as you can see...

volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54:/log/mdnsresponder.log
...this is not from a snapshot. It's on the live filesystem.

So now I would be worried about an underlying issue, such as failing drives and/or cables, controllers, ports, connections, etc.

Yes, you can destroy the @update snapshot, but that doesn't address the underlying issue. Plus, the corrupt file is not even found in the snapshot, but rather on the live filesystem itself.
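If you do want to clean it up anyway, you can list any such snapshots first and then destroy by exact name. A rough sketch (the actual snapshot name on your system will differ, so substitute it from the list; the name shown is a placeholder):
Code:
# list snapshots whose names contain "@update"
zfs list -t snapshot -o name,creation | grep '@update'
# then destroy by the exact name reported above (placeholder shown)
zfs destroy volume1@update-XXXXXX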
 
Last edited:

ppmax

Contributor
Joined
May 16, 2012
Messages
111
Ok, I see I made a reading comprehension error and an interpretation error.

I was looking at the dir named syslog-097c772a7a1e40b5a1773cbed6329f54 and the file named mdnsresponder.log and assumed this was captured in a snapshot, like the errors found in previous scrubs. My bad.

Thinking this through, what doesn't make sense is that I believe mdnsresponder is related to Apple's Bonjour discovery protocol. I ran AFP and Bonjour years ago on the system that died...but am not doing so now on the new box. I just did a ps -ax and don't see an mdnsresponder process on the host or in the jail I run (Plex).

It is my hunch that what I'm seeing is the result of a bunch of snapshots that were created via the UI in previous versions, were deleted at some point, and left a bunch of cruft lying around (per your previous post).

Here's output from zfs list | grep 097c772a7a1e40b5a1773cbed6329f54
Code:
volume1/.system/configs-097c772a7a1e40b5a1773cbed6329f54  1.01G   805G  1.01G  legacy
volume1/.system/rrd-097c772a7a1e40b5a1773cbed6329f54       187M   805G   187M  legacy
volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54   72.5M   805G  72.5M  legacy


Strangely, there is no output from zfs mount | grep 097c772a7a1e40b5a1773cbed6329f54

I don't grok the ZFS data model; however, it seems like real corruption on the live filesystem would require a live mount corresponding to this UUID:
097c772a7a1e40b5a1773cbed6329f54
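I'm guessing something like this would show the mount state directly (not sure it's the right incantation):
Code:
zfs get -r mounted volume1/.system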

Thoughts?

(BTW thanks again for your time and interest)
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
Thinking this through, what doesn't make sense is that I believe mdnsresponder is related to Apple's Bonjour discovery protocol.
It doesn't matter what file contains the corrupted records. That you have corrupted records (and newly discovered ones from a subsequent scrub) points to a more important concern. (Loose cables? Bad ports? Faulty hardware? Failing drive(s)?)

If this were my system, I would run extended SMART selftests on all my data drives, and then check for loose connections and/or see if there is an issue with a controller card or motherboard ports.

The extended selftests might come back with "no errors", since they are the drive's own internal tests. A bad data connection will not be revealed by such a test, since no data is read or written across the cable and interface.
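For example, to kick one off per drive and then read the results once it finishes (device names are examples; adjust for your system):
Code:
smartctl -t long /dev/ada0
# repeat for each data drive; a long test can take hours
# afterwards, check the self-test log and attributes:
smartctl -a /dev/ada0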

EDIT: I missed this the first time I perused the thread:
I had a home built box that suffered a cascade of failures and eventually died. I rescued the drives and stuffed them into a used Dell T110 II that I bought. Then a drive died. Hopefully I'm beyond all the stuff that caused this situation.

:oops:
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
Wait a second...

Did you create this pool with an ashift of 9?

Ideally, it should be an ashift of 12.

What is the output of:
zpool get ashift volume1
 

ppmax

Contributor
Joined
May 16, 2012
Messages
111
Hmmm...I'm not seeing ashift as a valid argument for zpool get. FWIW I'm on FreeNAS-11.3-U5.

I ran zpool get all volume1 and see this:
Code:
NAME     PROPERTY                       VALUE                          SOURCE
volume1  size                           3.62T                          -
volume1  capacity                       53%                            -
volume1  altroot                        /mnt                           local
volume1  health                         ONLINE                         -
volume1  guid                           9954950771419226450            default
volume1  version                        -                              default
volume1  bootfs                         -                              default
volume1  delegation                     on                             default
volume1  autoreplace                    off                            default
volume1  cachefile                      /data/zfs/zpool.cache          local
volume1  failmode                       wait                           default
volume1  listsnapshots                  off                            default
volume1  autoexpand                     off                            default
volume1  dedupditto                     0                              default
volume1  dedupratio                     1.00x                          -
volume1  free                           1.69T                          -
volume1  allocated                      1.93T                          -
volume1  readonly                       off                            -
volume1  comment                        -                              default
volume1  expandsize                     -                              -
volume1  freeing                        0                              default
volume1  fragmentation                  10%                            -
volume1  leaked                         0                              default
volume1  bootsize                       -                              default
volume1  checkpoint                     -                              -
volume1  feature@async_destroy          enabled                        local
volume1  feature@empty_bpobj            active                         local
volume1  feature@lz4_compress           active                         local
volume1  feature@multi_vdev_crash_dump  enabled                        local
volume1  feature@spacemap_histogram     active                         local
volume1  feature@enabled_txg            active                         local
volume1  feature@hole_birth             active                         local
volume1  feature@extensible_dataset     enabled                        local
volume1  feature@embedded_data          active                         local
volume1  feature@bookmarks              enabled                        local
volume1  feature@filesystem_limits      enabled                        local
volume1  feature@large_blocks           enabled                        local
volume1  feature@sha512                 enabled                        local
volume1  feature@skein                  enabled                        local
volume1  feature@device_removal         enabled                        local
volume1  feature@obsolete_counts        enabled                        local
volume1  feature@zpool_checkpoint       enabled                        local
volume1  feature@spacemap_v2            active                         local


FWIW I provisioned this pool back in the FreeNAS 8 days I think...2012?
 
Joined
Oct 22, 2019
Messages
3,641
FWIW I provisioned this pool back in the FreeNAS 8 days I think...2012?
Aside from the drive you recently replaced, these drives have been running for 10 years...?
 

ppmax

Contributor
Joined
May 16, 2012
Messages
111
Lol, yes, although I think I've had 2 drive failures in that period. I've upgraded FreeNAS and the pool at each major release until 11.3, which was rock solid for me for years until my homebuilt box started to crap the bed. I've been pretty good about checking disk/pool health over the years.

Re ashift

I ran zdb on my pool's cache file; ashift=9...but I don't ever recall having a choice in the matter
Code:
root@freenas[~]# zdb -U /data/zfs/zpool.cache
volume1:
    version: 5000
    name: 'volume1'
    state: 0
    txg: 60232943
    pool_guid: 9954950771419226450
    hostid: 2697208017
    hostname: 'freenas.ppnet.loc'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 9954950771419226450
        children[0]:
            type: 'raidz'
            id: 0
            guid: 18373258727760369669
            nparity: 2
            metaslab_array: 23
            metaslab_shift: 35
            ashift: 9
            asize: 3992209850368
            is_log: 0
            com.delphix:vdev_zap_top: 32
            children[0]:
                type: 'disk'
                id: 0
                guid: 3677551062957330210
                path: '/dev/gptid/5dd66899-aabe-11e1-90d1-6805ca067062'
                phys_path: 'id1,enc@n3061686369656d30/type@0/slot@1/elmdesc@Slot_00/p2'
                whole_disk: 0
                DTL: 99
                com.delphix:vdev_zap_leaf: 35
            children[1]:
                type: 'disk'
                id: 1
                guid: 16371098446567684798
                path: '/dev/gptid/86e2281d-7a50-11ed-9e5f-d067e5eda5bd'
                phys_path: 'id1,enc@n3061686369656d30/type@0/slot@4/elmdesc@Slot_03/p2'
                DTL: 287
                com.delphix:vdev_zap_leaf: 132
                resilver_txg: 60232934
            children[2]:
                type: 'disk'
                id: 2
                guid: 15516672902972882828
                path: '/dev/gptid/5ea4a2db-aabe-11e1-90d1-6805ca067062'
                phys_path: 'id1,enc@n3061686369656d30/type@0/slot@2/elmdesc@Slot_01/p2'
                whole_disk: 0
                DTL: 92
                com.delphix:vdev_zap_leaf: 38
            children[3]:
                type: 'disk'
                id: 3
                guid: 14684645065347388563
                path: '/dev/gptid/5f14865b-aabe-11e1-90d1-6805ca067062'
                phys_path: 'id1,enc@n3061686369656d30/type@0/slot@3/elmdesc@Slot_02/p2'
                whole_disk: 0
                DTL: 91
                com.delphix:vdev_zap_leaf: 148
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data


Also, I just noticed the disk I swapped in to replace the failing drive is apparently configured with a 512B block size despite being 4096B native. The forums seem to indicate this is no big deal, but I don't subscribe to the belief that "what you read on the internet must be true":
Code:
config:
    NAME                                            STATE     READ WRITE CKSUM
    volume1                                         ONLINE       0     0     1
      raidz2-0                                      ONLINE       0     0     2
        gptid/5dd66899-aabe-11e1-90d1-6805ca067062  ONLINE       0     0     0
        gptid/86e2281d-7a50-11ed-9e5f-d067e5eda5bd  ONLINE       0     0     0  block size: 512B configured, 4096B native
        gptid/5ea4a2db-aabe-11e1-90d1-6805ca067062  ONLINE       0     0     0
        gptid/5f14865b-aabe-11e1-90d1-6805ca067062  ONLINE       0     0     0


Whoops: I'm going to run smartctl -t long on my drives and see what that yields...will post back in a few hours

Thanks!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The warning below:
block size: 512B configured, 4096B native
means your disk is using 4096 bytes per sector / block, but your pool was configured for 512 bytes per sector / block.

This is to be expected, since your pool was created long ago, when 4096 byte sector disks were not common (or didn't exist).

Using a native 4096 byte sector drive in a 512 byte sector pool will cause a slight slowdown: that disk has to perform a read-modify-write cycle for any writes that don't re-write the whole 4096 byte sector.

The opposite case, writing 4096 byte blocks to a 512 byte sector disk, does not have the same problem. Thus our recommendation now is to always use ashift 12 / 4096 byte sectors on new pools. There are likely exceptions to this rule.
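For reference, on newer OpenZFS you can force this at pool creation, and on older FreeBSD / FreeNAS a sysctl served the same purpose. A sketch with placeholder pool and device names:
Code:
# OpenZFS: set ashift explicitly when creating the pool
zpool create -o ashift=12 newpool raidz2 da0 da1 da2 da3
# older FreeBSD / FreeNAS: set the minimum auto-detected ashift beforehand
sysctl vfs.zfs.min_auto_ashift=12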
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Oh, in regard to the latest error: in my opinion, once the affected file is restored or removed and the pool error(s) are cleared, any new scrub should show clean. I believe this so-called newest error is just another leftover from the old failures. My hint is that the error is on the RAID-Z2 and pool lines, not any current disk.

But, I freely admit I can be wrong.
 
Joined
Oct 22, 2019
Messages
3,641
My hint is that the error is on the RAID-Z2 and pool lines, not any current disk.
I'm more worried about simultaneous drive failures since there are at least two drives (from what I understand) that have been running in this system for the last 10 years. :confused:

Surely it's plausible that during a resilver of this RAIDZ2 vdev, one of the other two drives that are 10 years old could begin to fail?

I guess the extended SMART selftests might reveal something.

(There's also a third drive that might be several years old as well. The only "newish" drive is the one with 4K sectors.)
 
Last edited:

ppmax

Contributor
Joined
May 16, 2012
Messages
111
It doesn't matter what file contains the corrupted records. That you have corrupted records (and newly discovered ones from a subsequent scrub) points to a more important concern. (Loose cables? Bad ports? Faulty hardware? Failing drive(s)?)

If this were my system, I would run extended SMART selftests on all my data drives, and then check for loose connections and/or see if there is an issue with a controller card or motherboard ports.

The extended selftests might come back with "no errors", since they are the drive's own internal tests. A bad data connection will not be revealed by such a test, since no data is read or written across the cable and interface.

EDIT: I missed this the first time I perused the thread:


:oops:

Quick update on @winnielinnie's suggestion upthread:
The extended/long SMART tests on all 4 drives completed without error. Everything seems nominal. Prior to replacing one of the drives I was getting warnings in the console for that specific drive...and those have subsided since the replacement/resilver.

@Arwen thank you for your explanation re sector size. I assume any conversion to the larger sector size will require destroying the current pool and creating a new one. If that's the case, I can live with slower reads/writes.

Re:
Oh, in regard to the latest error: in my opinion, once the affected file is restored or removed and the pool error(s) are cleared, any new scrub should show clean. I believe this so-called newest error is just another leftover from the old failures. My hint is that the error is on the RAID-Z2 and pool lines, not any current disk.

But, I freely admit I can be wrong.

This is my suspicion too...the errors are in datasets that don't appear to be mounted.

Re:
I'm more worried about simultaneous drive failures since there are at least two drives (from what I understand) that have been running in this system for the last 10 years. :confused:

Surely it's plausible that during a resilver of this RAIDZ2 vdev, one of the other two drives that are 10 years old could begin to fail?

I guess the extended SMART selftests might reveal something.
Totally plausible; I'm just not seeing evidence of it currently.

Out of curiosity, is there any way to determine the timestamp of when the item "syslog-097c772a7a1e40b5a1773cbed6329f54" was created?

zfs list shows that there are numerous items with UUID 097c772a7a1e40b5a1773cbed6329f54:
Code:
root@freenas[~]# zfs list | grep 097c772a7a1e40b5a1773cbed6329f54
volume1/.system/configs-097c772a7a1e40b5a1773cbed6329f54  1.01G   805G  1.01G  legacy
volume1/.system/rrd-097c772a7a1e40b5a1773cbed6329f54       187M   805G   187M  legacy
volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54   72.5M   805G  72.5M  legacy


...and it is my hunch that those items were created *years* ago. It is also true that scrub errors containing that UUID have popped up more than once.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
@Arwen thank you for your explanation re sector size. I assume any conversion to the larger sector size will require destroying the current pool and creating a new one. If that's the case, I can live with slower reads/writes.
...
Yes, it will work fine, just a tad slower at times depending on how much the native 4096 sector disk has to re-write.

And yes, the way to change a pool's vDev from 512 to 4096 is to destroy the pool and re-create it.

...
Out of curiosity is there any way to determine the timestamp of when the item "syslog-097c772a7a1e40b5a1773cbed6329f54" was created?

zfs list shows that there are numerous items with UUID 097c772a7a1e40b5a1773cbed6329f54:
Code:
root@freenas[~]# zfs list | grep 097c772a7a1e40b5a1773cbed6329f54
volume1/.system/configs-097c772a7a1e40b5a1773cbed6329f54  1.01G   805G  1.01G  legacy
volume1/.system/rrd-097c772a7a1e40b5a1773cbed6329f54       187M   805G   187M  legacy
volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54   72.5M   805G  72.5M  legacy


...and it is my hunch that those items were created *years* ago. It is also true that scrub errors containing that UUID have popped up more than once.
You can try this command and see what it says:
zfs get creation volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54

You can also get a long listing for the file in question using ls -l, which will tell you its last modified time.
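Or, to check all of them in one go, a recursive get should work:
Code:
zfs get -r creation volume1/.system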
 
Joined
Oct 22, 2019
Messages
3,641
Residual datasets from yore?

This might help to figure out which ones are leftovers.

Compare the output from these:
Code:
zfs list -r -o name,mountpoint volume1/.system

zfs mount | grep /.system

mount | grep /var/db/
 

ppmax

Contributor
Joined
May 16, 2012
Messages
111
@Arwen thanks for the tip re zfs get creation:
Code:
root@freenas[~]# zfs get creation volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54
NAME                                                     PROPERTY  VALUE                  SOURCE
volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54  creation  Sun Mar  1 12:16 2015  -


I couldn't get the file attributes via ls -al (alias ll) since this thing isn't mounted and I can't navigate to it.
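(I gather a legacy dataset can be mounted by hand if I really wanted to poke at the file; untested on my end, something like:)
Code:
mkdir -p /mnt/inspect
mount -t zfs volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54 /mnt/inspect
ls -l /mnt/inspect/log
umount /mnt/inspect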

@winnielinnie
Residual datasets from yore?
Yes. These were apparently created in 2015 lol. Thanks for the commands to run; I've been attempting something similar by manually comparing the output of zfs list and zfs mount, and your suggestions make it clearer.

Here's the output of your commands + switches. TLDR the datasets containing the UUID 097c772a7a1e40b5a1773cbed6329f54 aren't mounted...and are therefore safe to destroy?

Code:
root@freenas[~]# zfs list -r -o name,mountpoint volume1/.system
NAME                                                      MOUNTPOINT
volume1/.system                                           legacy
volume1/.system/configs-097c772a7a1e40b5a1773cbed6329f54  legacy
volume1/.system/configs-0aeee2e60241454fb5b4c63b115661e1  legacy
volume1/.system/cores                                     legacy
volume1/.system/perftest                                  legacy
volume1/.system/rrd-097c772a7a1e40b5a1773cbed6329f54      legacy
volume1/.system/rrd-0aeee2e60241454fb5b4c63b115661e1      legacy
volume1/.system/rrd-5aff9b55f6744f32844e671d651f6466      legacy
volume1/.system/samba4                                    legacy
volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54   legacy
volume1/.system/syslog-0aeee2e60241454fb5b4c63b115661e1   legacy
volume1/.system/syslog-5aff9b55f6744f32844e671d651f6466   legacy
volume1/.system/webui                                     legacy
root@freenas[~]# zfs mount | grep /.system
volume1/.system                 /var/db/system
volume1/.system/cores           /var/db/system/cores
volume1/.system/samba4          /var/db/system/samba4
volume1/.system/syslog-0aeee2e60241454fb5b4c63b115661e1  /var/db/system/syslog-0aeee2e60241454fb5b4c63b115661e1
volume1/.system/rrd-0aeee2e60241454fb5b4c63b115661e1  /var/db/system/rrd-0aeee2e60241454fb5b4c63b115661e1
volume1/.system/configs-0aeee2e60241454fb5b4c63b115661e1  /var/db/system/configs-0aeee2e60241454fb5b4c63b115661e1
volume1/.system/webui           /var/db/system/webui
root@freenas[~]# mount | grep /var/db
volume1/.system on /var/db/system (zfs, local, nfsv4acls)
volume1/.system/cores on /var/db/system/cores (zfs, local, nfsv4acls)
volume1/.system/samba4 on /var/db/system/samba4 (zfs, local, nfsv4acls)
volume1/.system/syslog-0aeee2e60241454fb5b4c63b115661e1 on /var/db/system/syslog-0aeee2e60241454fb5b4c63b115661e1 (zfs, local, nfsv4acls)
volume1/.system/rrd-0aeee2e60241454fb5b4c63b115661e1 on /var/db/system/rrd-0aeee2e60241454fb5b4c63b115661e1 (zfs, local, nfsv4acls)
volume1/.system/configs-0aeee2e60241454fb5b4c63b115661e1 on /var/db/system/configs-0aeee2e60241454fb5b4c63b115661e1 (zfs, local, nfsv4acls)
volume1/.system/webui on /var/db/system/webui (zfs, local, nfsv4acls)



Thanks again for the help
 

ppmax

Contributor
Joined
May 16, 2012
Messages
111
Here's a better way to visualize the "orphan" datasets. Items marked with * aren't mounted:

Code:
root@freenas[~]# zfs list -r -o name,mountpoint volume1/.system
  NAME                                                      MOUNTPOINT
  volume1/.system                                           legacy
* volume1/.system/configs-097c772a7a1e40b5a1773cbed6329f54  legacy
  volume1/.system/configs-0aeee2e60241454fb5b4c63b115661e1  legacy
  volume1/.system/cores                                     legacy
* volume1/.system/perftest                                  legacy
* volume1/.system/rrd-097c772a7a1e40b5a1773cbed6329f54      legacy
  volume1/.system/rrd-0aeee2e60241454fb5b4c63b115661e1      legacy
* volume1/.system/rrd-5aff9b55f6744f32844e671d651f6466      legacy
  volume1/.system/samba4                                    legacy
* volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54   legacy
  volume1/.system/syslog-0aeee2e60241454fb5b4c63b115661e1   legacy
* volume1/.system/syslog-5aff9b55f6744f32844e671d651f6466   legacy
  volume1/.system/webui                                     legacy
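A quick way to compute the unmounted set automatically (rough sketch using sorted lists and comm):
Code:
zfs list -H -r -o name volume1/.system | sort > /tmp/all.txt
zfs mount | awk '{print $1}' | grep '\.system' | sort > /tmp/mounted.txt
comm -23 /tmp/all.txt /tmp/mounted.txt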
 
Joined
Oct 22, 2019
Messages
3,641
It appears you can safely destroy the following datasets:

Code:
volume1/.system/configs-097c772a7a1e40b5a1773cbed6329f54

volume1/.system/rrd-097c772a7a1e40b5a1773cbed6329f54

volume1/.system/rrd-5aff9b55f6744f32844e671d651f6466

volume1/.system/syslog-097c772a7a1e40b5a1773cbed6329f54

volume1/.system/syslog-5aff9b55f6744f32844e671d651f6466


Do not use the "force" (-f) parameter, just in case something is in use. You'll want it to fail and give you a "BUSY" error if that's the case. Otherwise, without -f, it should safely destroy them.

Double-check your commands. Don't accidentally try to destroy the ones that end with "...1e1"
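For example, one at a time (no -f), then re-check with zfs list:
Code:
zfs destroy volume1/.system/configs-097c772a7a1e40b5a1773cbed6329f54
zfs list -r volume1/.system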
 

ppmax

Contributor
Joined
May 16, 2012
Messages
111
@Arwen, @winnielinnie...I know you are waiting with bated breath for this news :cool:
Code:
config:

    NAME                                            STATE     READ WRITE CKSUM
    volume1                                         ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/5dd66899-aabe-11e1-90d1-6805ca067062  ONLINE       0     0     0
        gptid/86e2281d-7a50-11ed-9e5f-d067e5eda5bd  ONLINE       0     0     0  block size: 512B configured, 4096B native
        gptid/5ea4a2db-aabe-11e1-90d1-6805ca067062  ONLINE       0     0     0
        gptid/5f14865b-aabe-11e1-90d1-6805ca067062  ONLINE       0     0     0

errors: No known data errors


Sincere thanks to you both for your help and guidance! Lol, every time a drive fails I learn more thanks to this fabulous community and members like you.

THANKS AGAIN
 