Hello,
No unresolved issue here; I just wanted to share my experience with a recent problem: data corruption caused by faulty RAM.
Current data protection setup:
- Snapshot every week for data on HDD
- Snapshot every day for VM disks on SSD
- Scrub every month for data
- Scrub every week for VM disks
- (Replication every week to a 2nd server)
It has been running like this for 18 months.
The last scrub finished yesterday and reported errors on every pool (data and VM disks).
The same number of checksum errors appeared on all disks at once, while SMART tests reported no issues with the disks themselves.
Three media files were affected, along with the copies of those files in the last 3 snapshots.
Code:
root@truenas[/home/admin]# zpool status -v Tank
....
        NAME                                      STATE     READ WRITE CKSUM
        Tank                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            deb7a7eb-91bc-4635-bbfe-ff7cdd76276c  ONLINE       0     0   566
            ed462975-79bd-4acf-b659-1dbd732cf096  ONLINE       0     0   566
            921ba2a4-f8a6-4755-a22d-149a6fe6de9a  ONLINE       0     0   566
            b3da2cea-d073-4c4a-8c11-021fe0663471  ONLINE       0     0   567
            a2c15e66-1e2b-4091-879e-6dc3291f90a6  ONLINE       0     0   566
            68db74fc-bcc2-4682-81b5-616ef6bac80b  ONLINE       0     0   566
            d397c37a-6543-42d0-a5da-3e7c3e19cb3e  ONLINE       0     0   566
            d18c8134-878a-48c3-9ffa-6436418fb071  ONLINE       0     0   566

errors: Permanent errors have been detected in the following files:

        /mnt/Tank/Emby/Emby-TVShows/Formula 1 Drive to Survive/Season 06/Formula 1 Drive to Survive S06E04 The Last Chapter.mkv
        /mnt/Tank/Emby/Emby-TVShows/Lycée Toulouse-Lautrec/Saison 2/Lycée Toulouse-Lautrec S02E04 La résilience.mkv
        /mnt/Tank/Emby/Emby-TVShows/From/Season 01/From S01E09 Into the Woods.mkv
        Tank/Emby/Emby-TVShows@auto-2024-04-01_00-00:/Formula 1 Drive to Survive/Season 06/Formula 1 Drive to Survive S06E04 The Last Chapter.mkv
        Tank/Emby/Emby-TVShows@auto-2024-04-01_00-00:/Lycée Toulouse-Lautrec/Saison 2/Lycée Toulouse-Lautrec S02E04 La résilience.mkv
        Tank/Emby/Emby-TVShows@auto-2024-04-01_00-00:/From/Season 01/From S01E09 Into the Woods.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-18_00-00:/Formula 1 Drive to Survive/Season 06/Formula 1 Drive to Survive S06E04 The Last Chapter.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-18_00-00:/Lycée Toulouse-Lautrec/Saison 2/Lycée Toulouse-Lautrec S02E04 La résilience.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-18_00-00:/From/Season 01/From S01E09 Into the Woods.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-25_00-00:/Formula 1 Drive to Survive/Season 06/Formula 1 Drive to Survive S06E04 The Last Chapter.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-25_00-00:/Lycée Toulouse-Lautrec/Saison 2/Lycée Toulouse-Lautrec S02E04 La résilience.mkv
        Tank/Emby/Emby-TVShows@auto-2024-03-25_00-00:/From/Season 01/From S01E09 Into the Woods.mkv
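In hindsight, the telling detail in this output is that all eight disks report 566 or 567 checksum errors. For anyone who wants to check for that pattern automatically, here is a rough Python sketch. The function names and the tolerance of 1 are my own assumptions, not part of any ZFS tool, and a match is only a hint that something shared (RAM, controller, cabling) may be involved, not proof.

```python
import re

# One device row of `zpool status`: name, state, READ, WRITE, CKSUM.
# Assumption: counts are plain integers; abbreviated counts such as "1.2K"
# are not handled by this sketch.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

# Device rows copied from the output above.
SAMPLE = """
        NAME                                      STATE     READ WRITE CKSUM
        Tank                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            deb7a7eb-91bc-4635-bbfe-ff7cdd76276c  ONLINE       0     0   566
            ed462975-79bd-4acf-b659-1dbd732cf096  ONLINE       0     0   566
            921ba2a4-f8a6-4755-a22d-149a6fe6de9a  ONLINE       0     0   566
            b3da2cea-d073-4c4a-8c11-021fe0663471  ONLINE       0     0   567
            a2c15e66-1e2b-4091-879e-6dc3291f90a6  ONLINE       0     0   566
            68db74fc-bcc2-4682-81b5-616ef6bac80b  ONLINE       0     0   566
            d397c37a-6543-42d0-a5da-3e7c3e19cb3e  ONLINE       0     0   566
            d18c8134-878a-48c3-9ffa-6436418fb071  ONLINE       0     0   566
"""


def cksum_by_device(status_text):
    """Return {row name: CKSUM count} for every row with a non-zero CKSUM."""
    counts = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and int(match.group(5)) > 0:
            counts[match.group(1)] = int(match.group(5))
    return counts


def looks_like_common_cause(counts, tolerance=1):
    """True when at least two rows report nearly the same CKSUM count."""
    if len(counts) < 2:
        return False
    values = list(counts.values())
    return max(values) - min(values) <= tolerance


if __name__ == "__main__":
    counts = cksum_by_device(SAMPLE)
    print(len(counts), "rows with checksum errors")
    print("near-identical counts:", looks_like_common_cause(counts))
```

A single failing disk would normally show errors on that disk only, which is why near-identical counts everywhere are worth a memtest before deleting anything.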
I was surprised to see that the SSDs were affected as well...
Code:
root@truenas[/home/admin]# zpool status -v VM-OS
....
errors: Permanent errors have been detected in the following files:

        VM-OS/VM-Disks/emby-ov6oq7:<0x1>
        VM-OS/VM-Disks/emby-ov6oq7@auto-2024-04-01_03-00:<0x1>
        VM-OS/VM-Disks/emby-ov6oq7@auto-2024-03-31_03-00:<0x1>
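Both error lists above mix live objects with snapshot entries of the form `dataset@snapshot:...`. Here is a small Python sketch, assuming the entries have already been copied out of `zpool status -v` one per line, that separates the two and lists each affected snapshot once. The function name `split_error_entries` is hypothetical, and the sketch only reports; it does not delete anything.

```python
def split_error_entries(entries):
    """Split error entries into live objects and distinct snapshot names."""
    live = []
    snapshots = set()
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        # Snapshot entries look like "pool/dataset@snap:/path" or
        # "pool/zvol@snap:<0x1>"; the part before the first ":" is the snapshot.
        head, sep, _rest = entry.partition(":")
        if sep and "@" in head and not entry.startswith("/"):
            snapshots.add(head)
        else:
            live.append(entry)
    return live, sorted(snapshots)


# A few entries copied from the two outputs above.
ENTRIES = [
    "/mnt/Tank/Emby/Emby-TVShows/From/Season 01/From S01E09 Into the Woods.mkv",
    "Tank/Emby/Emby-TVShows@auto-2024-04-01_00-00:/From/Season 01/From S01E09 Into the Woods.mkv",
    "Tank/Emby/Emby-TVShows@auto-2024-03-18_00-00:/From/Season 01/From S01E09 Into the Woods.mkv",
    "VM-OS/VM-Disks/emby-ov6oq7:<0x1>",
    "VM-OS/VM-Disks/emby-ov6oq7@auto-2024-04-01_03-00:<0x1>",
]

if __name__ == "__main__":
    live, snapshots = split_error_entries(ENTRIES)
    print("live objects:", live)
    print("snapshots:", snapshots)
```

Given how this story ends, a list like this is better treated as "things to re-check after a memtest" than as a deletion list.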
I took a manual snapshot of the emby VM and removed the affected snapshots.
I ran a second scrub and the pool came back healthy.
OK, maybe it was legitimate data corruption... So I did the same for the data pool, removing the 3 affected media files as well as the corresponding snapshots.
I ran the scrub again, and this time a different media file was reported as corrupted, along with all of its snapshots going back a year!
I decided to shut down the server and run a memtest; it failed almost immediately, at 4%.
I removed some sticks and investigated: 3 of the 8 sticks turned out to be defective.
I removed 4 sticks (dual channel, so I cannot remove just 3 out of 8) and ran another scrub: everything is fine.
Conclusion: my data on disk was never corrupted; the bad RAM corrupted it in memory during the scrub, which is what produced the checksum errors --> Go ECC