In a precarious situation

irixion

Dabbler
Joined
Jan 28, 2018
Messages
10
I currently have a system running TrueNAS 12.

Gigabyte GA-H110N
Intel G4560 @ 3.5 GHz
20GB Corsair LPX DDR4 RAM

2 USB drives that are mirrored (FreeNAS boot)
4x Seagate IronWolf drives, mirrored (drive 1 with 3, 2 with 4)
Solid Gear Mini-ITX 270 Watt PSU


The plan was to replace two of the drives with higher capacity drives. Through the GUI, I offlined drive 1, replaced it with an 8TB and then had it resilver--everything went smoothly, or so I thought. I then tried doing the same thing with drive 3, except for whatever reason, the drive kept spitting out checksum errors while resilvering. I figured I'd just let it complete and then figure out what was going on after. Left it running overnight--had a power loss. Powered the system back on, and the resilvering continued with similar errors.

The console now spits out a tonne of "rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/zfs_arc_v2/<VariousFileNames>.rrd) failed: Integrity check failed"

The problem now is that while I still have access to the Pool, I can't read some of the files. Trying to access with SMB gives me an I/O error. I still have my original 4TB disks (disk 1 and 3), though I'm not sure if there's a way to re-add them to the original pool. I don't have a snapshot of the original 4TB disks, only a snapshot created after the first resilver. (So one 8TB and one 4TB) I did burn in the first drive, but not the second.

Is there any way to salvage my data? I'd just like to get my data off and recreate the pool. The more I try and troubleshoot the worse it seems to get. Any help would be greatly appreciated. TIA
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
the drive kept spitting out checksum errors while resilvering
This points to bad cabling, so probably should have been checked and corrected immediately.

4GB Corsair LPX DD4 RAM
If all you have is 4GB, you're going to have problems, no matter what.

You may find you can get a summary of your situation with zpool status -v

Once we understand what files are corrupt, we may be able to target the activities required to proceed.
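
For example, from the shell (the pool name here is an assumption; substitute your own):

# -v adds a per-file list of everything ZFS currently knows is corrupt
zpool status -v Bay.1_3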
 

irixion

Dabbler
Joined
Jan 28, 2018
Messages
10
This points to bad cabling, so probably should have been checked and corrected immediately.


If all you have is 4GB, you're going to have problems, no matter what.

You may find you can get a summary of your situation with zpool status -v

Once we understand what files are corrupt, we may be able to target the activities required to proceed.
Sorry, I forgot to add my second stick--I have 20GB of RAM. I've since done a zpool clear. I had hundreds of thousands of errors. The idea was to get my original drives in and...somehow...get them back into the pool. One of the disks I detached from the pool. Now I realize I can't reattach a detached disk. I'm just going to shut the NAS down until I figure out what I'm doing. Most of the errors, I believe, were along the lines of what I'm getting in the output log right now:
Jul 29 13:33:41 freenas 1 2022-07-29T13:33:41.227835-04:00 freenas.local collectd 1883 - - rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/df-var-db-system-syslog-1c36823397d74dbc9e7d4fbefe8b455e/df_complex-used.rrd) failed: Integrity check failed
Jul 29 13:33:41 freenas 1 2022-07-29T13:33:41.228271-04:00 freenas.local collectd 1883 - - rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/df-var-db-system-rrd-1c36823397d74dbc9e7d4fbefe8b455e/df_complex-reserved.rrd) failed: Input/output error
Jul 29 13:33:41 freenas 1 2022-07-29T13:33:41.228288-04:00 freenas.local collectd 1883 - - rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/df-var-db-system-configs-1c36823397d74dbc9e7d4fbefe8b455e/df_complex-reserved.rrd) failed: Integrity check failed
Jul 29 13:33:41 freenas 1 2022-07-29T13:33:41.228334-04:00 freenas.local collectd 1883 - - rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/df-var-db-system-samba4/df_complex-used.rrd) failed: Integrity check failed
Jul 29 13:33:41 freenas 1 2022-07-29T13:33:41.228762-04:00 freenas.local collectd 1883 - - rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/df-var-db-system-services/df_complex-free.rrd) failed: Integrity check failed
Jul 29 13:33:41 freenas 1 2022-07-29T13:33:41.228919-04:00 freenas.local collectd 1883 - - rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/df-var-db-system-configs-1c36823397d74dbc9e7d4fbefe8b455e/df_complex-free.rrd) failed: Integrity check failed
Jul 29 13:33:41 freenas 1 2022-07-29T13:33:41.229618-04:00 freenas.local collectd 1883 - - rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/df-var-db-system-services/df_complex-reserved.rrd) failed: Integrity check failed
Jul 29 13:33:41 freenas 1 2022-07-29T13:33:41.229725-04:00 freenas.local collectd 1883 - - rrdcached plugin: stat (/var/db/collectd/rrd/freenas.local/df-var-db-system-configs-1c36823397d74dbc9e7d4fbefe8b455e/df_complex-used.rrd) failed: Integrity check failed
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
When you finally feel brave enough to put it back online (which can be at any time as just booting a machine shouldn't destroy a pool), feel free to share zpool status -v for us to help you work out what's going on.
 

irixion

Dabbler
Joined
Jan 28, 2018
Messages
10
When you finally feel brave enough to put it back online (which can be at any time as just booting a machine shouldn't destroy a pool), feel free to share zpool status -v for us to help you work out what's going on.
Thrown onto Pastebin as it's far too large to post.


I should mention that eventually the web UI stops responding and I need to power the machine down to get it to respond again. I'm also getting errors reading files that aren't on that list. Should I scrub?

Part of me wants to clone one of my original drives onto the degraded drive but I'm sure that's a horrible idea.

Drive 1: Replaced with 8TB drive (Have untouched original 4TB)
Drive 2: Untouched 4TB
Drive 3: Replaced with 8TB drive, though no longer in pool (Have untouched original 4TB)
Drive 4: Untouched 4TB

Drives 1 and 3 are mirrored (or were, I suppose) and 2 and 4 are mirrored.
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I should mention that eventually the web UI stops responding and I need to power the machine down to get it to respond again. I'm also getting errors reading files that aren't on that list. Should I scrub?
OK, so no. At least not yet.

What's damaged is mostly the system dataset (likely to be causing the lockups) and some part of one snapshot, which can probably be easily corrected by moving the system dataset to the boot pool and then back (if you do it right after a reboot, you should be able to get it to happen) and destroying that snapshot.
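
If it helps, a rough sketch of the snapshot part from the shell, once the system dataset has been moved off the pool (the snapshot name below is only an example; use whatever zpool status -v actually reports as damaged):

# List the snapshots on the pool to find the damaged one
zfs list -t snapshot -r Bay.1_3

# Destroy the damaged snapshot (example name only)
zfs destroy Bay.1_3/.system/samba4@wbc-1657896282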

On top of that, it looks like you'll lose a VM and a couple of other things.

Once you have a system dataset that's free of errors, you can work on a scrub.

Before all of that, you probably want to look at why that first disk is throwing so many errors. They are all checksum, so probably cabling.

Then get a mirror going to restore redundancy to your pool.
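
Roughly, and only once the cabling question is settled, those last two steps could look like this (the gptid names are placeholders; take the real ones from zpool status):

# Attach a disk to the now single-disk vdev to rebuild the mirror
zpool attach Bay.1_3 gptid/<existing-member> gptid/<new-disk>

# Then verify the whole pool once the resilver finishes
zpool scrub Bay.1_3
zpool status -v Bay.1_3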
 

irixion

Dabbler
Joined
Jan 28, 2018
Messages
10
I've moved the system dataset onto the boot pool. Moving it back (even after restarting) gives me errors again in the system shell, so I'm leaving it on the boot device for the time being. I've checked my cabling--even changed my SATA cable, checked all connections, all snug. I have these disks sitting in a hot-swappable bay. What would cause these errors to just crop up after the drive was replaced? Perhaps the original resilvering process didn't complete correctly? I did also notice that after switching the system dataset, some more snapshots have appeared, though only for Bay.1_3/.system/samba4. I'm letting it scrub right now. Still some errors cropping up.

Getting this in the system shell during scrub.

Aug 4 22:06:58 freenas 1 2022-08-05T02:06:58.339077+00:00 freenas.local devd 431 - - notify_clients: send() failed; dropping unresponsive client

Is my drive misbehaving?

I've got the system config backed up from before I replaced any of the drives as well. I'm just going to assume that everything I think I know is wrong, so I apologize if some of this is obvious.
 
Last edited:

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
What would cause these errors to just crop up after the drive was replaced? Perhaps the original resilvering process didn't complete correctly?
There may have been connection problems during the resilver or there may even still be problems with the port (can you try other SATA ports?).

If there were problems during the resilver, they should have been detected at that time and it wouldn't have finished properly, but I guess there's some small chance there's something left hanging from that event.

In terms of protecting your important data (assuming some or even all of it is important to you, outside of the .system area, which is only the system dataset and not important), you would do well to copy all of it that can currently be accessed to somewhere safe. There's no real way to bring back disks that were formerly pool members, so you could re-use at least one of those as a copy destination (use rsync -auv or zfs send | zfs recv to get the data copied).
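
As a rough sketch of that copy step, assuming a second pool called "backup" is available to receive the data (the dataset and path names here are placeholders, not from your system):

# File-level copy of whatever is still readable over to the safe pool
rsync -auv /mnt/Bay.1_3/mydata/ /mnt/backup/mydata/

# Or a dataset-level copy, which also carries properties and snapshots
zfs snapshot Bay.1_3/mydata@rescue
zfs send Bay.1_3/mydata@rescue | zfs recv backup/mydata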

After you get all you can off, we can go for more invasive tests

What does zpool status currently look like?
 

irixion

Dabbler
Joined
Jan 28, 2018
Messages
10
There may have been connection problems during the resilver or there may even still be problems with the port (can you try other SATA ports?).

If there were problems during the resilver, they should have been detected at that time and it wouldn't have finished properly, but I guess there's some small chance there's something left hanging from that event.

In terms of protecting your important data (assuming some or even all of it is important to you, outside of the .system area, which is only the system dataset and not important), you would do well to copy all of it that can currently be accessed to somewhere safe. There's no real way to bring back disks that were formerly pool members, so you could re-use at least one of those as a copy destination (use rsync -auv or zfs send | zfs recv to get the data copied).

After you get all you can off, we can go for more invasive tests

What does zpool status currently look like?
I've gone ahead and deleted the snapshots. I've tried different ports as well. zpool status gives me:

[Attached screenshot: zpool status output]


As for moving the data--I've tried getting it off using SMB, but I get I/O errors. Should I wait for it to finish scrubbing and then try to move it? Would I add a blank disk to a separate pool and then do zfs send/receive? How can it self-repair if the disk it was (supposed to be) mirroring is no longer in the pool? Can I interrupt the scrubbing and shut the system down?
 
Last edited:

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Are the USB drives used for booting or storing data?
 

irixion

Dabbler
Joined
Jan 28, 2018
Messages
10
Are the USB drives used for booting or storing data?
The USB drives are strictly to boot from. No data is stored on them. Not sure what this even is, but it's in the system log. My FreeNAS boot sticks are 16GB, and the boot settings/stats say there's under 2GB used.

Aug 6 16:04:40 freenas 1 2022-08-06T20:04:40.658557+00:00 freenas.local zfsd 434 - - CaseFile::Serialize: Unable to open /var/db/zfsd/cases/pool_13619038203381090411_vdev_3676503558838221277.case.
Aug 6 16:05:20 freenas syslog-ng[1044]: Error suspend timeout has elapsed, attempting to write again; fd='23'
Aug 6 16:05:20 freenas syslog-ng[1044]: I/O error occurred while writing; fd='23', error='No space left on device (28)'
Aug 6 16:05:20 freenas syslog-ng[1044]: Suspending write operation because of an I/O error; fd='23', time_reopen='60'
Aug 6 16:05:55 freenas 1 2022-08-06T20:05:55.532440+00:00 freenas.local devd 431 - - notify_clients: send() failed; dropping unresponsive client
Aug 6 16:06:20 freenas syslog-ng[1044]: Error suspend timeout has elapsed, attempting to write again; fd='23'
Aug 6 16:06:20 freenas syslog-ng[1044]: I/O error occurred while writing; fd='23', error='No space left on device (28)'

My pool status now looks like:

root@freenas[~]# zpool status
  pool: Bay.1_3
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Aug 1 23:59:53 2022
        2.68T scanned at 1.96M/s, 2.68T issued at 1.95M/s, 4.93T total
        66.6M repaired, 54.27% done, no estimated completion time
config:

        NAME                                            STATE     READ WRITE CKSUM
        Bay.1_3                                         DEGRADED     0     0     0
          gptid/bc525c46-0d0d-11ed-8e03-1c1b0d9d09dc    DEGRADED     0     0 79.4M  too many errors (repairing)
          mirror-1                                      ONLINE       0     0     0
            gptid/cd6882fb-9992-11e7-be50-1c1b0d9d09dc  ONLINE       0     0     0
            gptid/ce55cd01-9992-11e7-be50-1c1b0d9d09dc  ONLINE       0     0     0

errors: 2423724 data errors, use '-v' for a list

  pool: freenas-boot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 19K in 00:11:45 with 0 errors on Sat Aug 6 03:56:48 2022
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            da0p2     ONLINE       0     0     0
            da1p2     ONLINE       0     0     2

Is one of my USB drives also failing?
 
Last edited:

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
So you are saying that your USB drives are actually not drives but memory sticks? If so, you should replace them with, for example, a USB-to-SATA adapter and a real disk drive ASAP. Unless these memory sticks are enterprise-grade (i.e. expensive), they will wear out rather quickly. Older versions of FreeNAS could be run like this, but that changed a number of years ago.
 

irixion

Dabbler
Joined
Jan 28, 2018
Messages
10
So is my data on the previously removed drives no good then? They weren't wiped. I have backups of most of the data on there. As for going ahead and doing a zpool status -v, it gives me thousands of files. Is there a way to get it to dump all of that into a text file so I at least know what files are affected?
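
On the text-file question, a minimal sketch, assuming the list just needs to be captured for review (the output path is only an example):

# Redirect the full error list to a file instead of the screen
zpool status -v Bay.1_3 > /tmp/bay13_errors.txt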


Doing a zpool history gives me this on the day(s) I moved the drives.

2022-07-15.10:44:41 zpool set cachefile=/data/zfs/zpool.cache Bay.1_3
2022-07-15.10:44:47 zfs snapshot Bay.1_3/.system/samba4@wbc-1657896282
2022-07-15.11:45:45 zpool offline Bay.1_3 /dev/gptid/cd6882fb-9992-11e7-be50-1c1b0d9d09dc
2022-07-15.11:47:32 zpool online Bay.1_3 /dev/gptid/cd6882fb-9992-11e7-be50-1c1b0d9d09dc
2022-07-15.11:50:25 zpool online Bay.1_3 /dev/gptid/a01534d6-7fca-11e7-8eaf-1c1b0d9d09dc
2022-07-15.11:52:36 zpool replace Bay.1_3 12662031583597268007 /dev/gptid/162cb3a7-0456-11ed-a981-1c1b0d9d09dc
2022-07-15.11:53:00 zpool detach Bay.1_3 /dev/gptid/a01534d6-7fca-11e7-8eaf-1c1b0d9d09dc
2022-07-16.21:58:43 zpool import 13619038203381090411 Bay.1_3
2022-07-16.21:58:43 zpool set cachefile=/data/zfs/zpool.cache Bay.1_3
2022-07-16.22:00:21 zfs snapshot Bay.1_3/.system/samba4@wbc-1658023127
2022-07-17.15:37:52 zpool import 13619038203381090411 Bay.1_3
2022-07-17.15:37:52 zpool set cachefile=/data/zfs/zpool.cache Bay.1_3

I could use a bit of guidance. If my previous drives' data can't be recovered then I'll just destroy the pool and start from scratch. I would like to get a list of all the files affected though, if possible.
 