SOLVED One or more devices has experienced an error resulting in data corruption. Help???

Alex762

Dabbler
Joined
Jul 31, 2022
Messages
28
Hello. I'm rather new to TrueNAS, and today I discovered that I got an error a few days ago saying "Pool storage state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected."

I tried to run the "zpool status -v" command and this is what I got (sorry for screenshot, couldn't copy/paste the output):
[Screenshot: 1705003227623.png]


So basically a video I took years ago somehow got corrupted. Not a big deal, but what should I do next? The action text says I can restore the file or restore from backup, but how? I have a RAID 1 (mirrored) configuration, but I'm not entirely sure what to do without risking typing the wrong command and losing my data.
Any help will be greatly appreciated, thank you!!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If you're not using snapshots and also don't have a backup, you've lost the non-corrupt version of that file forever now.

The 2 options you have:

1. test playing the file all the way through and if the player can cope with whatever the corruption is, you're in luck, nothing of consequence lost, just copy the file and delete the original (if you can) to make the error go away.

2. accept the loss of the file and delete it directly.
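For option 1, a rough way to check the file end to end without a video player is to read it in chunks and note where reads fail. This is only a sketch: it assumes the filesystem reports damaged blocks as I/O errors rather than returning bad data, and the 128 KiB chunk size is an assumption chosen to match the ZFS default record size. The function name is made up for illustration.

```python
import os


def find_unreadable_ranges(path, chunk_size=128 * 1024):
    """Read `path` end to end; return (offset, length) ranges that raised I/O errors."""
    bad_ranges = []
    size = os.path.getsize(path)
    # buffering=0 so each read maps to one request for the range we asked for
    with open(path, "rb", buffering=0) as f:
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                f.seek(offset)
                data = f.read(length)
            except OSError:
                # Damaged region: record it and skip ahead by one chunk
                bad_ranges.append((offset, length))
                offset += length
                continue
            if not data:
                break  # unexpected end of file
            offset += len(data)
    return bad_ranges
```

An empty list means every chunk could be read; a non-empty list shows roughly which parts of the file are affected, which helps decide between the two options.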

What you're going to want to do more generally is understand how this can happen to you... since it "should" be impossible with a ZFS pool with redundancy and properly designed and implemented hardware.

Perhaps if you share your hardware information, we could help you get to the bottom of that.
 

Alex762

Dabbler
Joined
Jul 31, 2022
Messages
28
Thank you for the suggestion! I feel much dumber right now because I always thought RAID 1 also protected data from this kind of corruption, but I guess it only helps if a drive fails completely. Well, luckily that file wasn't much of an issue anyway. I will try to copy that file to my PC and see if it plays. Still very weird that ZFS didn't do the trick here :/

My setup is a bit sketchy, but it works somehow for what I use it for:
-CPU: Intel Core i3-3240
-RAM: 8GB (planning to upgrade it)
-Storage: 1x Kingston 125 GB SSD (boot/TrueNAS system disk), 2x HDD Western Digital Red 1TB in RAID 1 (data)
It has a single pool in mirrored mode, nothing fancy, really..

One thing that might be related: I also got a warning saying that the system unexpectedly rebooted while I was away for Christmas a few days ago, and then I got this corruption error just two days ago. Still weird how it happened though. I doubt there was a power outage since I have it hooked up to a UPS, so I have literally no idea how all of this happened... Is there any way to avoid it in the future?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Are those very old WD Reds or are they Red Plus newer ones?

If they are SMR (not Plus), then they are the wrong hardware for ZFS and might explain the problem.
 

Alex762

Dabbler
Joined
Jul 31, 2022
Messages
28
Yes they are the Red Plus, I luckily was well aware of the SMR having issues so I avoided them in the first place.

The video plays fine anyway, but when I tried to copy it from the NAS to my PC I got "error 0x8007045D: The request could not be performed because of an I/O device error" (?). I tried transferring another random file just to be sure and had no issues.

As some others suggested, I tried re-running a "zpool scrub" and stopped it after about 10 minutes, but the pool was still in error. Guess I'll try to delete the file and see what happens...
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
You keep saying your disks are set up in "RAID 1". How exactly are the disks attached? Are you using some sort of hardware raid controller? Specifics matter so please be detailed and specific.
 

Alex762

Dabbler
Joined
Jul 31, 2022
Messages
28
I'm not using any hardware controller, all disks are directly attached to the motherboard.
 

Alex762

Dabbler
Joined
Jul 31, 2022
Messages
28
Small update: I have deleted the file in question since I couldn't do anything with it (I didn't need it anyway and it wasn't that important, so yay, more disk space now I guess). I re-ran a scrub, which didn't give me any errors, but the GUI still shows the pool as unhealthy. Another run of zpool status resulted in this:
[Screenshot: 1705010252159.png]

What are those cksum errors? Is my disk about to actually die? (?)
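For anyone who wants to track these counters between scrubs, here is a minimal sketch that pulls the READ/WRITE/CKSUM columns out of `zpool status` text. The sample output is invented for illustration (the pool name `tank` and the device names `sda`/`sdb` are assumptions, not taken from the screenshot), and rows whose counters are abbreviated (for example with a K suffix) are simply skipped.

```python
# Hypothetical sample in the usual `zpool status` layout; names are made up.
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0   220
            sdb     ONLINE       0     0   220

errors: No known data errors
"""


def parse_error_counters(status_text):
    """Return {name: (read, write, cksum)} for each row of the config table."""
    counters = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_table:
            # The table starts right after this header row
            in_table = fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]
            continue
        if not fields:
            break  # a blank line ends the table
        # Only keep rows whose three counters are plain integers
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            read, write, cksum = (int(f) for f in fields[2:5])
            counters[fields[0]] = (read, write, cksum)
    return counters


if __name__ == "__main__":
    for name, (read, write, cksum) in parse_error_counters(SAMPLE_STATUS).items():
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

Comparing the result before and after a scrub makes it easier to see whether the CKSUM column is still growing on one device, on both, or not at all.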
 

Alex762

Dabbler
Joined
Jul 31, 2022
Messages
28
Small update #2: After noticing the 220 checksum errors on both drives (weird?), I ran a SMART test on both drives, and both passed. I have shut down the machine as a precaution. Tomorrow, before booting it back up, I will look for any loose cables, dust, or whatever else might be going on inside. Feel free to leave any suggestions below, and thanks in advance for the help!
 

Alex762

Dabbler
Joined
Jul 31, 2022
Messages
28
Update #3: I checked that everything was OK inside, including cables and connections: no issues whatsoever. I booted the machine back up, and the pool was magically online and healthy again. No errors, zero checksum errors.

BUT- I did another scrub, and I'm back again with this:
[Screenshot: 1705048230393.png]

The pool went into an error state AGAIN, but the drive is still accessible, and this time no data corruption was involved.
I still have 3 checksum errors, which is better than the 220 from yesterday, but they are still there. SMART tests on the drives don't show any issue at all. I checked line 199, UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always, which should be normal (?)
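A note on reading that row: in `smartctl -A` output the columns are ID, name, flag, normalized value, worst, threshold, type, updated, when-failed, and raw value, and the raw value at the end is the actual CRC error count. The row quoted here stops before those last two columns, so the sketch below uses a made-up complete row (the trailing `-` and `0` are assumptions, not the real reading) just to show which field to look at.

```python
# Hypothetical complete row; the last two fields are NOT from the original post.
SAMPLE_ROW = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0"


def parse_smart_attribute(line):
    """Split one `smartctl -A` attribute row into named fields."""
    fields = line.split(None, 9)  # raw value may contain spaces, keep it whole
    if len(fields) < 10:
        raise ValueError("row is truncated: expected 10 columns")
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "flag": fields[2],
        "value": int(fields[3]),    # normalized health value, not a count
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "type": fields[6],
        "updated": fields[7],
        "when_failed": fields[8],
        "raw": fields[9].strip(),   # the actual counter reported by the drive
    }


if __name__ == "__main__":
    attr = parse_smart_attribute(SAMPLE_ROW)
    print(attr["name"], "raw =", attr["raw"])
```

The 200 in the value and worst columns is a normalized score, so it says nothing about how many CRC errors occurred; a raw value that increases over time would point toward the cable or connector rather than the disk itself.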

At this point I really have run out of ideas about wtf is happening :/
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Checksum errors are often, but not always, the result of bad data cabling. I would suggest reseating, then replacing, the SATA cables as a cheap diagnostic.
 

Alex762

Dabbler
Joined
Jul 31, 2022
Messages
28
Thanks a lot for the suggestions. Yesterday I swapped positions and replaced a couple of SATA cables, then re-ran a scrub, and everything went OK: no errors or checksum errors for now. I really hope the issue is solved, but in any case I made a backup to an external hard drive as an extra precaution.

Again, thanks a lot everyone for the help, and I wish you a good day.
 