The setup:
4x Seagate Constellation ES.3 7200 rpm HDDs (refurbished by the manufacturer, still under warranty.)
Ryzen 3 2200G
1 unspecified NVMe M.2 drive (Some laptop one, this shouldn't be the root of the issue, it works fine.)
GA-A320M-H motherboard.
24GB RAM (At first it was 2 sticks; one was bad, and the system did not detect it or boot. Then I used a single NEW 16GB stick and had the same issue. Now I'm using one 8GB and one 16GB stick.)
TrueNAS-SCALE-23.10.2
Everything runs on a 1 Gbit network connection.
I've set up my first NAS. I installed everything and it worked perfectly. Then I started installing apps, and that's where the trouble began. While doing that, I noticed one of the drives making a weird clicking sound (the sound of the head hitting the surface, something you DON'T want to hear). The drive would drop offline every now and then for a few seconds, which caused some corruption (only in the apps being installed, no data had been moved at that point.)
Now here's my issue. I fixed that drive with a simple cable swap (no new cable, I just swapped the cable with another drive's, and I even swapped the cables around between all the drives). It doesn't go offline anymore. I ran it through HD Tune to look for bad sectors, and none were found in the quick scan (I have not done the long one yet, it will take ages). I think, and hope, that the cables were not the issue. I'm still getting corruption, though, and not just on that 1 specific drive BUT ON 2 DRIVES. Even after moving around the cables I suspected were "bad", only these two specific drives report errors. Sometimes it's only 64 checksum errors, other times it's 424 or a bit more, never anything over 1k over the span of a few days.
Here's what zpool status spews out:
Code:
admin@truenas[~]$ sudo zpool status -v
  pool: DOM_DATA_ALL
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 02:15:33 with 13 errors on Sun Mar 24 22:20:14 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        DOM_DATA_ALL                              DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            d18f6763-6044-408d-b946-567923f24c6a  DEGRADED     0     0   424  too many errors
            4231ec6f-ab3a-4c62-9cc7-39bcb90a4694  DEGRADED     0     0   424  too many errors
            c4cba4c4-866d-4614-9ffd-9322a81c3ee7  ONLINE       0     0     0
            dccafb81-634a-4f72-adb3-674eff475320  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-24_01-00:<0x24e31>
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-24_01-00:<0x24d3e>
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-24_01-00:/github_com_truecharts_catalog_main/stable/rapidphotodownloader/4.2.0
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-24_01-00:/github_com_truecharts_catalog_main/stable/rcon-webadmin/8.1.1
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-24_01-00:/github_com_truecharts_catalog_main/stable/radarr/20.2.0
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-24_01-00:<0x24dcf>
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-24_01-00:/github_com_truecharts_catalog_main/stable/qwantify/3.1.2
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-24_01-00:/github_com_truecharts_catalog_main/stable/rdesktop
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-23_01-00:<0x24e31>
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-23_01-00:<0x24d3e>
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-23_01-00:/github_com_truecharts_catalog_main/stable/rapidphotodownloader/4.2.0
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-23_01-00:/github_com_truecharts_catalog_main/stable/rcon-webadmin/8.1.1
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-23_01-00:/github_com_truecharts_catalog_main/stable/radarr/20.2.0
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-23_01-00:<0x24dcf>
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-23_01-00:/github_com_truecharts_catalog_main/stable/qwantify/3.1.2
        DOM_DATA_ALL/ix-applications/catalogs@auto-2024-03-23_01-00:/github_com_truecharts_catalog_main/stable/rdesktop
        DOM_DATA_ALL/ix-applications/catalogs:<0x24e31>
        DOM_DATA_ALL/ix-applications/catalogs:<0x24d3e>
        /mnt/DOM_DATA_ALL/ix-applications/catalogs/github_com_truecharts_catalog_main/stable/rapidphotodownloader/4.2.0
        /mnt/DOM_DATA_ALL/ix-applications/catalogs/github_com_truecharts_catalog_main/stable/rcon-webadmin/8.1.1
        /mnt/DOM_DATA_ALL/ix-applications/catalogs/github_com_truecharts_catalog_main/stable/radarr/20.2.0
        DOM_DATA_ALL/ix-applications/catalogs:<0x24dcf>
        /mnt/DOM_DATA_ALL/ix-applications/catalogs/github_com_truecharts_catalog_main/stable/qwantify/3.1.2
        /mnt/DOM_DATA_ALL/ix-applications/catalogs/github_com_truecharts_catalog_main/stable/rdesktop

  pool: SSDPOOL
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        SSDPOOL      ONLINE       0     0     0
          nvme0n1p4  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:12 with 0 errors on Tue Mar 19 03:45:14 2024
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p3  ONLINE       0     0     0

errors: No known data errors
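Since the counters change between scrubs, it may help to log them instead of reading them off by eye. Below is a rough Python sketch, not anything TrueNAS ships, that pulls the non-zero CKSUM counts out of saved zpool status text. The names ROW, SAMPLE and checksum_errors are made up for illustration, and it assumes the plain column layout shown above, with whole-number counters.

```python
import re

# One device row of the `zpool status` config table:
#   NAME  STATE  READ  WRITE  CKSUM  [optional note]
# Assumption: counters are plain integers. Abbreviated values such as "1.2K"
# are deliberately not matched; `zpool status -p` prints exact numbers.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)(?:\s|$)"
)

# Trimmed copy of the config table from the output above.
SAMPLE = """
        NAME                                      STATE     READ WRITE CKSUM
        DOM_DATA_ALL                              DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            d18f6763-6044-408d-b946-567923f24c6a  DEGRADED     0     0   424  too many errors
            4231ec6f-ab3a-4c62-9cc7-39bcb90a4694  DEGRADED     0     0   424  too many errors
            c4cba4c4-866d-4614-9ffd-9322a81c3ee7  ONLINE       0     0     0
            dccafb81-634a-4f72-adb3-674eff475320  ONLINE       0     0     0
"""


def checksum_errors(status_text):
    """Return {device_name: cksum_count} for every row with a non-zero CKSUM."""
    result = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and int(match.group(5)) > 0:
            result[match.group(1)] = int(match.group(5))
    return result


if __name__ == "__main__":
    for device, count in checksum_errors(SAMPLE).items():
        print(f"{device}: {count} checksum errors")
```

Feeding it the saved output after each scrub would show whether the two drives always report the same count, as they do in the paste above.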
Now, I know about bad connections; I've read through A LOT of forum posts trying to pin this issue down. I don't think anything is wrong with the cabling. I've re-seated the cables multiple times by now. I've run multiple scrubs and, as you can see, nothing was repaired and multiple errors remain.
I need help moving forward from here. I think I've used up all of my options by now. I haven't done anything through SSH besides clearing the errors to see what would happen. I also haven't swapped the drive for another one, as I don't have the resources to do that for now. A fresh install is also out of the question, since I've got about 3TB of data that I can't store anywhere else for now. It NEEDS to be on that NAS, there's simply no other space to put it. Snapshots are also out of the question, since I've been at this for at least 2 weeks and I've only got 5 snapshots.
I have no clue what the best course of action would be. Run SeaTools to fix the drives up? Contact Seagate about the warranty? Do something specific in the file system? I'm lost.