SOLVED One or more devices has experienced an error resulting in data corruption. Help???

Alex762 · Jan 11, 2024

Hello. I'm rather new to TrueNAS, and today I discovered that a few days ago I got an error saying "Pool storage state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected."

I ran the "zpool status -v" command and this is what I got (sorry for the screenshot, I couldn't copy/paste the output):

[Screenshot of the "zpool status -v" output not preserved.]

So basically a video I took years ago somehow got corrupted. Not a big deal, but what should I do next? The action text says I can restore the file or restore from backup, but how? I have a RAID 1 (mirrored) configuration, but I'm not entirely sure what to do without risking typing the wrong command and losing my data.
Any help will be greatly appreciated, thank you!!

sretalla · Jan 11, 2024

If you're not using snapshots and also don't have a backup, you've lost the non-corrupt version of that file forever now.

The 2 options you have:

1. test playing the file all the way through and if the player can cope with whatever the corruption is, you're in luck, nothing of consequence lost, just copy the file and delete the original (if you can) to make the error go away.

2. accept the loss of the file and delete it directly.

What you're going to want to do more generally is understand how this can happen to you... since it "should" be impossible with a ZFS pool with redundancy and properly designed and implemented hardware.

Perhaps if you share your hardware information, we could help you get to the bottom of that.
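
Option 1 above amounts to checking whether the file can still be read from start to finish. A minimal Python sketch of that check follows. It assumes the pool reports an unrecoverable block to the reader as an I/O error (EIO), and the function name `first_unreadable_offset` is made up for illustration.

```python
import errno


def first_unreadable_offset(path, chunk_size=1 << 20):
    """Read the file from start to end.

    Returns None if every chunk could be read, otherwise the offset of the
    start of the first chunk whose read failed with an I/O error.
    """
    offset = 0
    # buffering=0 gives raw reads, so the reported offset matches the
    # position of the chunk that actually failed.
    with open(path, "rb", buffering=0) as handle:
        while True:
            try:
                chunk = handle.read(chunk_size)
            except OSError as exc:
                if exc.errno == errno.EIO:
                    return offset
                raise  # any other error is not what we are looking for
            if not chunk:
                return None  # reached end of file without errors
            offset += len(chunk)
```

The returned offset is only accurate to within one chunk (1 MiB by default). A result of None means the whole file was readable, so a plain copy should succeed as well.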

Alex762 · Jan 11, 2024

sretalla said:
If you're not using snapshots and also don't have a backup, you've lost the non-corrupt version of that file forever now.

The 2 options you have:

1. test playing the file all the way through and if the player can cope with whatever the corruption is, you're in luck, nothing of consequence lost, just copy the file and delete the original (if you can) to make the error go away.

2. accept the loss of the file and delete it directly.

What you're going to want to do more generally is understand how this can happen to you... since it "should" be impossible with a ZFS pool with redundancy and properly designed and implemented hardware.

Perhaps if you share your hardware information, we could help you get to the bottom of that.

Thank you for the suggestion! I feel much dumber right now because I always thought RAID 1 also protected data from this kind of corruption, but I guess it only helps if a drive fails completely. Luckily that file wasn't important anyway; I will try to copy it to my PC and see if it plays. Still very weird that ZFS didn't do the trick here :/

My setup is a bit sketchy, but it works somehow for what I use it for:
-CPU: Intel Core i3-3240
-RAM: 8GB (planning to upgrade it)
-Storage: 1x Kingston 125 GB SSD (boot/TrueNAS system disk), 2x Western Digital Red 1TB HDD in RAID 1 (data)
It has a single pool in mirrored mode, nothing fancy, really.

Perhaps what happened is related to a warning I also got, saying that the system rebooted unexpectedly while I was away for Christmas a few days ago; then I got this corruption error just two days ago. Still weird how it happened though. I doubt there was a power outage since the NAS is hooked up to a UPS, so I have literally no idea how all of this happened... Is there any way to avoid it in the future?

sretalla · Jan 11, 2024

Are those very old WD Reds or are they Red Plus newer ones?

If they are SMR (not Plus), then they are the wrong hardware for ZFS and might explain the problem.

Alex762 · Jan 11, 2024

sretalla said:
Are those very old WD Reds or are they Red Plus newer ones?

If they are SMR (not Plus), then they are the wrong hardware for ZFS and might explain the problem.

Yes, they are Red Plus. Luckily I was well aware of SMR having issues, so I avoided those drives in the first place.

The video plays fine anyway. I tried to copy it from the NAS to my PC and got "error 0x8007045D: The request could not be performed because of an I/O device error" (?). I tried to transfer another random file just to be sure and had no issues.

As someone else suggested, I tried re-running a "zpool scrub" and stopping it after about 10 minutes, but the pool was still in error. Guess I'll try to delete the file and see what happens...

Jailer · Jan 11, 2024

You keep saying your disks are set up in "RAID 1". How exactly are the disks attached? Are you using some sort of hardware raid controller? Specifics matter so please be detailed and specific.

Alex762 · Jan 11, 2024

Jailer said:
You keep saying your disks are set up in "RAID 1". How exactly are the disks attached? Are you using some sort of hardware raid controller? Specifics matter so please be detailed and specific.

I'm not using any hardware controller, all disks are directly attached to the motherboard.

Alex762 · Jan 11, 2024

Small update: I have deleted the file in question since I couldn't do anything with it (I didn't need it anyway and it wasn't that important, so yay, more disk space now I guess). I re-ran a scrub, which didn't report any errors, but the GUI still shows the pool as unhealthy. Another run of zpool status resulted in this:

[Screenshot of the zpool status output not preserved.]

What are those cksum errors? Is my disk about to actually die? (?)
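
For anyone reading along without the screenshot: the counters in question are the READ, WRITE and CKSUM columns that zpool status prints for the pool and each device. Below is a rough Python sketch for pulling them out of the text output. It assumes the usual "NAME STATE READ WRITE CKSUM" table layout; the function names are made up for illustration, and the values are kept as strings because large counts may be printed in abbreviated form.

```python
HEADER = ["NAME", "STATE", "READ", "WRITE", "CKSUM"]


def parse_error_counters(status_text):
    """Return {name: (read, write, cksum)} from `zpool status` text.

    Assumes the config table starts at the NAME/STATE/READ/WRITE/CKSUM
    header and ends at the next blank line.
    """
    counters = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == HEADER:
            in_config = True
            continue
        if not in_config:
            continue
        if not fields:
            in_config = False  # blank line closes the table
            continue
        if len(fields) < 5:
            continue  # rows without counters are skipped
        counters[fields[0]] = tuple(fields[2:5])
    return counters


def devices_with_errors(counters):
    """Names whose READ, WRITE or CKSUM counter is not "0"."""
    return [name for name, values in counters.items()
            if any(value != "0" for value in values)]
```

To use it, pass in the captured text of zpool status for the pool. A non-zero CKSUM entry means data read from that device did not match its stored checksum.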

Alex762 · Jan 11, 2024

Small update #2: After noticing the 220 checksum errors on both drives (weird?) I ran a SMART test on both, and both passed. I have shut down the machine as a precaution; tomorrow, before booting it back up, I will look for any loose cable, dust or whatever else is going on inside. Feel free to leave any suggestion below and thanks in advance for the help!

Alex762 · Jan 12, 2024

Update #3: I checked that everything was OK inside, including cables and connections: no issues whatsoever. I booted the machine back up and the pool was magically online and healthy again. No errors, zero checksum errors.

BUT I did another scrub, and I'm back again with this:

[Screenshot of the zpool status output not preserved.]

The pool went into an error state AGAIN, but the drive is still accessible, and this time no data corruption was involved.
I still have 3 checksum errors, which is better than the 220 from yesterday, but they are still there. The SMART tests on the drives don't report any issue at all. I checked attribute 199, "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always", which should be normal (?)

At this point I have really run out of ideas about wtf is happening :/
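
On the SMART row quoted above: in smartctl's attribute table the columns are ID, name, flag, normalized value, worst value, threshold, type and update mode, followed by WHEN_FAILED and RAW_VALUE. The quoted row stops before those last two, and for UDMA_CRC_Error_Count it is the raw value that holds the actual CRC error count. A small Python sketch of reading such a row, with hypothetical names and tolerant of the missing trailing columns:

```python
from typing import NamedTuple, Optional


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    flag: str
    value: int            # normalized value
    worst: int            # worst normalized value seen
    thresh: int           # failure threshold for the normalized value
    attr_type: str
    updated: str
    when_failed: Optional[str]
    raw_value: Optional[str]


def parse_attribute_row(row: str) -> SmartAttribute:
    """Split one attribute row into named columns.

    The last two columns are optional so a truncated row still parses.
    """
    fields = row.split()
    if len(fields) < 8:
        raise ValueError("expected at least 8 columns, got %d" % len(fields))
    when_failed = fields[8] if len(fields) > 8 else None
    raw_value = " ".join(fields[9:]) if len(fields) > 9 else None
    return SmartAttribute(int(fields[0]), fields[1], fields[2],
                          int(fields[3]), int(fields[4]), int(fields[5]),
                          fields[6], fields[7], when_failed, raw_value)


# The row exactly as quoted in the post (trailing columns missing).
attr = parse_attribute_row(
    "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always")
print(attr.value, attr.thresh, attr.raw_value)
```

Here the normalized value (200) is well above the threshold (000), which is why the attribute looks normal, but the raw count cannot be read from the row as quoted, so the full line would be needed to say whether any CRC errors were logged.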

NugentS · Jan 12, 2024

Checksum errors are often, but not always, the result of bad data cabling. I would suggest reseating and then replacing the SATA cables as a cheap diagnostic there.

Alex762 · Jan 14, 2024

NugentS said:
Checksum errors are often, but not always, the result of bad data cabling. I would suggest reseating and then replacing the SATA cables as a cheap diagnostic there.

Thanks a lot for the suggestions. Yesterday I swapped the cable positions and replaced a couple of SATA cables, then re-ran a scrub, and this time everything went OK: no errors or checksum errors for now. I really do hope the issue is solved, but in any case I made a backup to an external hard drive as an extra precaution.

Again, thanks a lot everyone for the help, and I wish you a good day.
