Problem with degraded volume

csoltenborn

Cadet
Joined
Feb 7, 2021
Messages
7
Hi,

I have a degraded TrueNAS pool, and I'm writing in the hope that you guys can help me out. And please let me know in case this isn't the right place to ask...

I'm running TrueNAS Core 12.0-U4 on the following hardware: Asus B450M-A/CSM, AMD Ryzen 5 3400G, 2x 8GB Corsair DIMM DDR4-3000, 2x 16GB SanDisk Ultra Fit (Boot Device), 2x 4TB WD Red Plus SA3, 1x 512GB HDD TOSHIBA MK5061GSYN, 1x 256GB SSD SAMSUNG MMDOE56G5MXP-0VB.

The degraded pool is Main, which consists of the two 4TB WD HDDs (in mirror mode). Relevant structure of Main (not sure how to print that info via zpool):
Main (Filesystem)
-- Media (Filesystem)
-- VMs (Filesystem)
----- HomeAssistant_OS (Volume)
----- Ubuntu-94ajr6 (Volume)
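
For reference, a layout like this is normally printed with zfs list rather than zpool. A minimal sketch, assuming the pool name Main from this post; list_pool_tree is a made-up helper name, not something TrueNAS ships:

```shell
#!/bin/sh
# list_pool_tree is a hypothetical helper name, not a TrueNAS command.
# It prints every filesystem and volume below the given pool or dataset,
# together with its type, so zvols can be told apart from filesystems.
list_pool_tree() {
    zfs list -r -t filesystem,volume -o name,type "$1"
}
```

Usage would be `list_pool_tree Main`; adding `snapshot` to the `-t` list should also show snapshots such as the manual one mentioned below.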

The HomeAssistant_OS volume is an installation of HomeAssistant OS (following this tutorial if I remember correctly: https://community.home-assistant.io...os-on-freenas-without-iocage-or-docker/133738).

Output of "zpool status -v Main":

pool: Main
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 02:02:46 with 15 errors on Wed Jun 16 04:02:46 2021
config:

NAME                                            STATE     READ WRITE CKSUM
Main                                            DEGRADED     0     0     0
  mirror-0                                      DEGRADED     0     0     0
    gptid/60e2112b-0bab-11eb-90a2-3c7c3f7ce9b6  DEGRADED     0     0    52  too many errors
    gptid/60eeea63-0bab-11eb-90a2-3c7c3f7ce9b6  DEGRADED     0     0    52  too many errors

errors: Permanent errors have been detected in the following files:

Main/VMs/HomeAssistant_OS@manual-2021-05-21_14-20:<0x1>
Main/Media:<0x0>

I have also attached the output of smartctl for drives ada2 and ada3 (the WD HDDs) - looks good to me (the drives are pretty new), but I'm not completely sure.

Now, I do not see any issues apart from the fact that my HomeAssistant VM does not start any more: if I boot the machine and connect via VNC, I get lots of errors as seen in the attached screenshot (no idea how to copy VNC's output), and they never appear to stop.

So, here are my questions:
1) I do not understand the file format in the zpool output - what does "@manual-2021-02-21_14-20:<0x1>" mean? The part after the @ obviously points to a snapshot, but what about the rest? Same for the other "file": What is "Main/Media:<0x0>"?
2) Unfortunately, I do not have a backup (this is not a critical system - I have now ordered an external backup drive :smile: ). Is there any way to repair the VM (or at least save my HomeAssistant configuration files from it)?
3) I suspect that the problem has been caused by an "unscheduled system reboot" reported by TrueNAS in the context of the problem (probably either because TrueNAS crashed or because of a power outage!? Not sure). Is there anything I can do to prevent the file system from being corrupted by problems like this?
4) Is there any tutorial which systematically explains how to solve problems like this? I'm a bit surprised by the lack of information/help TrueNAS provides for my problem via the UI...

Thanks in advance
Christian
 

Attachments

  • smartctl_ada2.txt
  • smartctl_ada3.txt
  • VM_Boot_Errors.JPG

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
csoltenborn said: I have also attached the output of smartctl for drives ada2 and ada3 (the WD HDDs) - looks good to me (the drives are pretty new), but I'm not completely sure.
The errors shown are all CRC errors, so you should really start by checking your connections/cabling.
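
One way to see whether a drive itself has logged interface CRC errors is to read SMART attribute 199 from the smartctl attribute table. The sketch below is only an illustration: udma_crc_count is a made-up helper name, and it assumes the drive reports attribute 199 as UDMA_CRC_Error_Count with the raw value in the last column, which is common for SATA drives but not guaranteed.

```shell
#!/bin/sh
# udma_crc_count is a hypothetical helper name.
# Assumption: the drive reports attribute 199 (UDMA_CRC_Error_Count) and the
# raw value is the last column of the `smartctl -A` attribute table.
udma_crc_count() {
    smartctl -A "$1" | awk '$1 == 199 { print $NF }'
}
```

Usage would be `udma_crc_count /dev/ada2` and `udma_crc_count /dev/ada3`. A raw value that is non-zero and keeps rising usually points at the cable or connector rather than at the platters.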

csoltenborn said: What is "Main/Media:<0x0>"?
That's the directory/dataset metadata for that path.

csoltenborn said: what does "@manual-2021-02-21_14-20:<0x1>" mean?
That identifies the snapshot by name and then the first block in the snapshot. It probably just means you should destroy that snapshot.
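
To make the notation concrete: when zpool status -v cannot resolve an entry to a file path, it prints the dataset (optionally with @snapshot) followed by a colon and a hexadecimal object number in angle brackets. The sketch below only splits such an entry into its parts; parse_zfs_error is a made-up helper name and the output format is my own choice.

```shell
#!/bin/sh
# parse_zfs_error is a hypothetical helper name.
# Splits one entry from the "errors:" list of `zpool status -v` into
# dataset, snapshot (or "-" if there is none) and object number.
parse_zfs_error() {
    entry=$1
    object=${entry##*:}     # text after the last colon, e.g. <0x1>
    target=${entry%:*}      # text before the last colon
    case $target in
        *@*) dataset=${target%%@*}; snapshot=${target#*@} ;;
        *)   dataset=$target;       snapshot=- ;;
    esac
    printf 'dataset=%s snapshot=%s object=%s\n' "$dataset" "$snapshot" "$object"
}
```

For example, `parse_zfs_error 'Main/Media:<0x0>'` prints `dataset=Main/Media snapshot=- object=<0x0>`.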
 

csoltenborn

Thanks, @sretalla! A couple of follow-up questions, if I may:

Concerning the cabling: What is there to check? How can I test the connections? Should I just get "more expensive" cables?

Metadata: Does that mean that Main/Media is completely "lost"? I appear to be able to access at least many of the files in that folder... What's the recommended way to get rid of the error message (and have a stable state of the data in that folder)? Please advise me on how to deal with that...

You haven't said anything about accessing my HomeAssistant_OS volume - is that something to ask in the HomeAssistant community?

Thanks in advance!
Christian
 

sretalla

csoltenborn said: Does that mean that Main/Media is completely "lost"? I appear to be able to access at least many of the files in that folder.
Not necessarily.

I would extract all the content you can get from it to a backup location and prepare to destroy/recreate it later after working out the CRC errors.
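
A file-by-file copy tends to be more forgiving than one big recursive cp when some files are unreadable. The sketch below is one possible approach, not a tested recovery procedure: copy_tree_best_effort is a made-up helper name, it logs files that fail instead of stopping, and it skips files that already exist in the destination so it can be re-run after an interruption.

```shell
#!/bin/sh
# copy_tree_best_effort is a hypothetical helper name.
# Copies SRC to DST one file at a time. Files that fail to copy are listed in
# LOG instead of aborting the run, and files already present in DST are
# skipped, so the function can simply be re-run after a crash or reboot.
# Assumption: file names contain no newline characters.
copy_tree_best_effort() {
    src=$1 dst=$2 log=$3
    ( cd "$src" && find . -type f ) | while IFS= read -r f; do
        [ -e "$dst/$f" ] && continue
        mkdir -p "$dst/$(dirname "$f")"
        # Copy to a temporary name first so a half-written file is never
        # mistaken for a finished one on the next run.
        if cp -p "$src/$f" "$dst/$f.part" 2>/dev/null; then
            mv "$dst/$f.part" "$dst/$f"
        else
            rm -f "$dst/$f.part"
            printf '%s\n' "$f" >> "$log"
        fi
    done
}
```

With the paths used elsewhere in this thread, a call might look like `copy_tree_best_effort /mnt/Main/Media /mnt/HDD_512/Temp/Media /mnt/HDD_512/Temp/failed.log`; the log then lists the files that could not be read.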

csoltenborn said: You haven't said anything about accessing my HomeAssistant_OS volume - is that something to ask in the HomeAssistant community?
The corruption there is in a snapshot, so it shouldn't strictly be part of the problem. I would first work out the CRC errors and see if it can just be brought back without further need for recovery.

If not, you may need to see about copying out the zvol with dd and then mounting it on another system (I would wait for that).
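
Copying a zvol out with dd could look roughly like the sketch below. image_volume is a made-up helper name; the options are standard dd options, but the block size is an arbitrary choice.

```shell
#!/bin/sh
# image_volume is a hypothetical helper name.
# conv=noerror keeps dd going after a read error; conv=sync pads each short or
# failed block with zeros so later data stays at the correct offset.
image_volume() {
    dd if="$1" of="$2" bs=1048576 conv=noerror,sync
}
```

On TrueNAS Core (FreeBSD), zvols normally appear under /dev/zvol/, so a call might be `image_volume /dev/zvol/Main/VMs/HomeAssistant_OS /mnt/HDD_512/Temp/HomeAssistant_OS.img`. Both paths are assumptions based on names in this thread, and the target needs at least as much free space as the zvol is large. A smaller block size loses less data around each unreadable spot, at the cost of speed.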

csoltenborn said: Concerning the cabling: What is there to check? How can I test the connections? Should I just get "more expensive" cables?
It's usually not so much the cost of the cable that matters as the connection between the SATA controller and the disks (a SATA cable). Check the connections at both ends.

Also check the output from dmesg at the shell and see if there are any clues there from CAM STATUS messages.
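
A quick way to pull just those lines out of the kernel message buffer is a case-insensitive grep. In this sketch cam_errors is a made-up helper name, and the search string is an assumption about how the messages are worded on this system.

```shell
#!/bin/sh
# cam_errors is a hypothetical helper name.
# Prints kernel messages that mention a CAM status; -i makes the match
# case-insensitive so "CAM status" and "cam status" are both found.
cam_errors() {
    dmesg | grep -i 'cam status'
}
```

No output means nothing matching is currently in the buffer, which does not rule out earlier errors: the buffer is limited in size and starts fresh at boot, so /var/log/messages may be worth searching the same way.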

csoltenborn said: What's the recommended way to get rid of the error message (and have a stable state of the data in that folder)?
You'll need to destroy the dataset and re-create it (populated with the data you could grab).
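
On the command line, that could be sketched as follows. This is destructive: zfs destroy -r irreversibly removes the dataset together with its snapshots, so it should only be run after the rescued data has been verified elsewhere. recreate_dataset is a made-up helper name, and on TrueNAS it may be preferable to do the same through the UI so that shares and permissions stay consistent.

```shell
#!/bin/sh
# recreate_dataset is a hypothetical helper name.
# WARNING: destroys the dataset given as $1 including all of its snapshots.
# Afterwards the pool error counters are reset and a scrub is started so that
# ZFS re-checks everything that is left in the pool given as $2.
recreate_dataset() {
    ds=$1 pool=$2
    zfs destroy -r "$ds" &&
    zfs create "$ds" &&
    zpool clear "$pool" &&
    zpool scrub "$pool"
}
```

A call would be `recreate_dataset Main/Media Main`. The list of permanent errors typically only disappears once a scrub has completed without finding them again, which can take more than one scrub.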

You might also want to share some information about the rest of your hardware so we can help to assess if that's a contributing factor. (see the Forum Rules in red at the top of the page here for guidance on what to share).
 

csoltenborn

Thanks once more... But what exactly do I check the connections at both ends for? Is it merely "pull them off, then push them back carefully, making sure everything looks and feels solid", or is there more that can be done? I've built three computers during my lifetime, but cabling never was an issue up to now, so I'm just not sure what to check for...

Media folder: I will try that!

Missing HW information: I had checked the Forum Rules, and I believe that I provided all recommended information (HD and network controller are the ones of the mainboard) - are you missing anything in particular? Sorry if I've overlooked something obvious...
 

csoltenborn

I've checked the cables of the WD drives and re-plugged them - let's see. But one thing confuses me: To me, it looks like the zpool status output above says that both drives have (the same?!) errors - isn't it rather unlikely that both cables have problems, let alone resulting in the same errors?

The dmesg output does not contain the string "cam". Here's the part for the WD drives:

ada2 at ahcich8 bus 0 scbus4 target 0 lun 0
ada2: <WDC WD40EFRX-68N32N0 82.00A82> ACS-3 ATA SATA 3.x device
ada2: Serial Number WD-WCC7K2JF0DSA
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 3815447MB (7814037168 512 byte sectors)
ada2: quirks=0x1<4K>
ada3 at ahcich9 bus 0 scbus5 target 0 lun 0
ada3: <WDC WD40EFRX-68N32N0 82.00A82> ACS-3 ATA SATA 3.x device
ada3: Serial Number WD-WCC7K4KLH8NF
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 3815447MB (7814037168 512 byte sectors)
ada3: quirks=0x1<4K>

A couple more problems:
- copying the content of /mnt/Main/Media to one of my other drives (cp -Rv /mnt/Main/Media /mnt/HDD_512/Temp) crashes the machine (!) at some point - any suggestions on how to back up the files in a more robust way?
- after deleting the HomeAssistant_OS snapshot, zpool status -v Main now results in the following (the first "file" is kind of scary :smile:)

pool: Main
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 02:02:46 with 15 errors on Wed Jun 16 04:02:46 2021
config:

NAME                                            STATE     READ WRITE CKSUM
Main                                            DEGRADED     0     0     0
  mirror-0                                      DEGRADED     0     0     0
    gptid/60e2112b-0bab-11eb-90a2-3c7c3f7ce9b6  DEGRADED     0     0     0  too many errors
    gptid/60eeea63-0bab-11eb-90a2-3c7c3f7ce9b6  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

<0x1138>:<0x1>
Main/Media:<0x0>
Main/VMs/HomeAssistant_OS:<0x1>
 