SOLVED Another reason to use ECC RAM

Okeur75

Dabbler
Joined
Nov 16, 2022
Messages
36
Hello,

There is no unresolved issue here; I just wanted to share my experience with a recent data corruption problem caused by faulty RAM.

Current data protection setup:
  • Snapshot every week for data on HDD
  • Snapshot every day for VM disks on SSD
  • Scrub every month for data
  • Scrub every week for VM disks
  • (Replication every week to a 2nd server)

It has been running like this for 18 months.

The last scrub finished yesterday and reported errors on all zpools (data and VM disks).
The same number of checksum errors appeared on all disks at once, while the SMART tests reported no issues with the disks themselves.
3 media files were affected, as well as the copies of these files in the last 3 snapshots.

Code:
root@truenas[/home/admin]# zpool status -v Tank
....
        NAME                                      STATE     READ WRITE CKSUM
        Tank                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            deb7a7eb-91bc-4635-bbfe-ff7cdd76276c  ONLINE       0     0   566
            ed462975-79bd-4acf-b659-1dbd732cf096  ONLINE       0     0   566
            921ba2a4-f8a6-4755-a22d-149a6fe6de9a  ONLINE       0     0   566
            b3da2cea-d073-4c4a-8c11-021fe0663471  ONLINE       0     0   567
            a2c15e66-1e2b-4091-879e-6dc3291f90a6  ONLINE       0     0   566
            68db74fc-bcc2-4682-81b5-616ef6bac80b  ONLINE       0     0   566
            d397c37a-6543-42d0-a5da-3e7c3e19cb3e  ONLINE       0     0   566
            d18c8134-878a-48c3-9ffa-6436418fb071  ONLINE       0     0   566

errors: Permanent errors have been detected in the following files:

        /mnt/Tank/Emby/Emby-TVShows/Formula 1 Drive to Survive/Season 06/Formula 1 Drive to Survive S06E04 The Last Chapter.mkv
        /mnt/Tank/Emby/Emby-TVShows/Lycée Toulouse-Lautrec/Saison 2/Lycée Toulouse-Lautrec S02E04 La résilience.mkv
        /mnt/Tank/Emby/Emby-TVShows/From/Season 01/From S01E09 Into the Woods.mkv
        Tank/Emby/Emby-TVShows@auto-2024-04-01_00-00:/Formula 1 Drive to Survive/Season 06/Formula 1 Drive to Survive S06E04 The Last Chapter.mkv
        Tank/Emby/Emby-TVShows@auto-2024-04-01_00-00:/Lycée Toulouse-Lautrec/Saison 2/Lycée Toulouse-Lautrec S02E04 La résilience.mkv
        Tank/Emby/Emby-TVShows@auto-2024-04-01_00-00:/From/Season 01/From S01E09 Into the Woods.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-18_00-00:/Formula 1 Drive to Survive/Season 06/Formula 1 Drive to Survive S06E04 The Last Chapter.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-18_00-00:/Lycée Toulouse-Lautrec/Saison 2/Lycée Toulouse-Lautrec S02E04 La résilience.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-18_00-00:/From/Season 01/From S01E09 Into the Woods.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-25_00-00:/Formula 1 Drive to Survive/Season 06/Formula 1 Drive to Survive S06E04 The Last Chapter.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-25_00-00:/Lycée Toulouse-Lautrec/Saison 2/Lycée Toulouse-Lautrec S02E04 La résilience.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-25_00-00:/From/Season 01/From S01E09 Into the Woods.mkv
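
The pattern in the output above (nearly identical CKSUM counts on every disk at once) can be checked mechanically. Below is a minimal Python sketch, assuming the plain column layout shown above. The function names, the sample trimmed to three disks, and the "within 1 of each other" tolerance are illustrative assumptions, not anything ZFS defines.

```python
import re

# Sample rows copied from the `zpool status` output above (trimmed to three disks).
SAMPLE = """\
        NAME                                      STATE     READ WRITE CKSUM
        Tank                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            deb7a7eb-91bc-4635-bbfe-ff7cdd76276c  ONLINE       0     0   566
            ed462975-79bd-4acf-b659-1dbd732cf096  ONLINE       0     0   566
            b3da2cea-d073-4c4a-8c11-021fe0663471  ONLINE       0     0   567
"""

# name, state, READ, WRITE, CKSUM. Only plain integer counters are handled;
# abbreviated counters (for example with a K suffix) are skipped by this sketch.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def cksum_counts(status_text):
    """Return {device_name: CKSUM count} for rows whose CKSUM column is non-zero."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(5)) > 0:
            counts[m.group(1)] = int(m.group(5))
    return counts


def looks_like_common_cause(counts, tolerance=1):
    """Heuristic (an assumption, not a ZFS rule): two or more devices reporting
    almost the same number of checksum errors hints at something shared,
    such as RAM, a controller or a PSU, rather than one failing disk."""
    if len(counts) < 2:
        return False
    values = list(counts.values())
    return max(values) - min(values) <= tolerance


counts = cksum_counts(SAMPLE)
print(counts)
print(looks_like_common_cause(counts))
```

A result of `True` here is only a hint to go and test the shared components (as the memtest later in this post did), not a diagnosis.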


I was surprised that the SSDs were affected as well...

Code:
root@truenas[/home/admin]# zpool status -v VM-OS
....
errors: Permanent errors have been detected in the following files:

        VM-OS/VM-Disks/emby-ov6oq7:<0x1>
        VM-OS/VM-Disks/emby-ov6oq7@auto-2024-04-01_03-00:<0x1>
        VM-OS/VM-Disks/emby-ov6oq7@auto-2024-03-31_03-00:<0x1>


I took a manual snapshot of the Emby VM and removed the affected snapshots.
I ran a second scrub and the pool came back healthy.
OK, maybe it was legitimate data corruption... So I did the same for the data, removing the 3 affected media files as well as the corresponding snapshots.
I ran the scrub again, and this time a different media file and all of its snapshots going back one year were reported as corrupted!

I decided to shut down the server and run a memtest; it failed almost immediately, at 4%.
I removed some sticks and investigated, and found 3 out of 8 sticks to be defective.

I removed 4 sticks (dual channel, so I cannot remove just 3 out of 8) and ran a scrub again: everything is fine.

Conclusion: My data was not corrupted at all, and the bad RAM corrupted it during the scrub --> Go ECC
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Okeur75 said: "Conclusion: My data was not corrupted at all, and the bad RAM corrupted it during the scrub --> Go ECC"
I would love to hear what more knowledgeable members have to say, but I doubt the actual data got corrupted.
It would have been interesting to pull the damaged files and compare their checksums against known good data.

My bet is that the failing memory made it impossible to read the correct checksum, so the files were ultimately deemed damaged.
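
To make that hypothesis concrete, here is a toy Python model of a single verification step. It is only a sketch under stated assumptions: SHA-256 stands in for whatever checksum the pool uses, `flip_bit` and `verify` are made-up helper names, and real ZFS behaviour (parity reconstruction, retries, self-healing) is not modelled at all.

```python
import hashlib


def checksum(block: bytes) -> str:
    # SHA-256 is a stand-in; the checksum a real pool uses depends on its settings.
    return hashlib.sha256(block).hexdigest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of the block with one bit flipped, imitating a bad RAM cell."""
    damaged = bytearray(block)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def verify(on_disk: bytes, stored_checksum: str, ram_is_faulty: bool) -> bool:
    """Toy scrub step: copy the block into 'RAM', then checksum the in-memory copy."""
    in_memory = flip_bit(on_disk, 3) if ram_is_faulty else on_disk
    return checksum(in_memory) == stored_checksum


block = b"video data that was written correctly months ago"
stored = checksum(block)  # computed at write time, while the RAM was still healthy

print(verify(block, stored, ram_is_faulty=True))   # False: reported as a checksum error
print(verify(block, stored, ram_is_faulty=False))  # True: the on-disk bytes never changed
```

In this model the bytes "on disk" are identical in both calls; only the in-memory copy differs, which would match a pool that scrubs clean again once the bad sticks are removed.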

It's been a while but I remember this as an interesting read:


I advocate the use of ECC memory, don't get me wrong.
 

Okeur75

I haven't checksummed the "corrupted" files, but they were no longer readable: I was unable to download them from Emby, and playback stalled at a specific timestamp.
I considered this as corrupted but I may be wrong.
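
For anyone who wants to run the comparison suggested earlier in the thread, here is a minimal Python sketch that hashes a suspect file and a known good copy and compares the digests. The helper names are illustrative and the paths in the usage note are hypothetical. Ideally it would run on a machine with known good RAM, since hashing through faulty memory could itself produce a mismatch.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MiB chunks so large media files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(suspect_path, reference_path):
    """True if both files have the same SHA-256 digest."""
    return sha256_of(suspect_path) == sha256_of(reference_path)
```

Usage would look like `same_content("/mnt/Tank/suspect.mkv", "/mnt/backup/known-good.mkv")`, where both paths are placeholders for a flagged file and a copy from the replication target or the original source.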
 

chuck32

Okeur75 said: "I considered this as corrupted but I may be wrong."
I agree. And had the files been scrubbed before?

What I was getting at is that I did not think bad memory would corrupt a file and its snapshots after they had been written to the disks (and deemed okay).

Bad memory can certainly corrupt a file when you first write it to disk or change it. But a read-only snapshot that once "survived" a scrub should not be corrupted by a later scrub. At least, this is my understanding.
 