Stranded Camel
Explorer
- Joined: May 25, 2017
- Messages: 79
My FreeNAS box had been running FreeNAS-11.1-U2 perfectly fine since it came out, and had been up continuously for about three months. Then, a couple of hours ago, BAM! -- it rebooted for no reason I can figure out. (I confirmed that the reboot was real by checking the uptime, which was only a few minutes when I looked afterward.)
The box is connected to a large UPS, and there was no power-related event in any case (unless it was something that doesn't affect the lights, etc. -- but with the UPS, it wouldn't matter anyway). Furthermore, my power supply is rated for a lot more juice than I pull from it. In short, I doubt this was an electrical issue.
Anyway, while the cause of this reboot is something I'd like to eventually figure out and prevent in the future, the issue I'm dealing with right now is the errors that my main pool (5 x 10 TB WD Gold in Z2) experienced as a result of whatever this event was.
Once I unlocked the pool (it's encrypted), I got a warning that a permanent error had been found in the following file: tank/multimedia:<0x2ab15>. From researching this, it seems that the hex address means the error is in metadata.
The FreeNAS system mails me nightly SMART reports, and everything has been in good order. Drive temps are between 27ºC and 32ºC, which is where they always are. All five drives had 0 errors of any type. All had perfectly fine SMART results. They've got about 2800 power-on hours. The latest scrub, which was a week ago, showed 0 errors, as always. In short, the drives were in perfect working order right up until this incident happened.
So I started a scrub. I'm using ZFS for precisely this reason, so I should have no worries! After all, it's designed to prevent errors with features like COW, and it's designed to allow errors to be corrected with all those checksums and such. And I'm using Z2, which gives me a healthy margin for errors.
Well... as the scrub progresses, more and more errors are being detected. Here is where I stand right now:
Code:
$ zpool status -vx tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Jun 30 07:27:34 2018
        8.03T scanned at 1.00G/s, 6.04T issued at 771M/s, 25.8T total
        5.33M repaired, 23.38% done, 0 days 07:28:27 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0   126  too many errors  (repairing)
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0    98  too many errors  (repairing)
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0   103  too many errors  (repairing)
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0    68  too many errors  (repairing)
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0    95  too many errors  (repairing)

errors: Permanent errors have been detected in the following files:

        tank/multimedia:<0x2ab15>
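As a rough sanity check on the scrub figures in that output, the progress percentage and time remaining can be recomputed from the "issued", "total", and rate values. This is only a sketch: the numbers are copied by hand from the scan lines, the helper names are made up for illustration, and it assumes the T and M suffixes are binary units (TiB and MiB/s), which is my understanding of how zpool prints them.

```python
# Figures copied by hand from the `scan:` lines of the zpool status output.
# Assumption: the suffixes are binary units (T = TiB, M = MiB per second).
TIB_IN_MIB = 1024 * 1024

ISSUED_TIB = 6.04        # "6.04T issued"
TOTAL_TIB = 25.8         # "25.8T total"
ISSUE_RATE_MIB_S = 771.0  # "issued at 771M/s"


def percent_done(issued_tib: float, total_tib: float) -> float:
    """Share of the pool's data already issued to the scrub, in percent."""
    return 100.0 * issued_tib / total_tib


def seconds_remaining(issued_tib: float, total_tib: float, rate_mib_s: float) -> float:
    """Time left if the issue rate stayed constant (it usually does not)."""
    remaining_mib = (total_tib - issued_tib) * TIB_IN_MIB
    return remaining_mib / rate_mib_s


def format_hms(seconds: float) -> str:
    """Render a duration in seconds as HH:MM:SS."""
    whole = int(round(seconds))
    hours, rest = divmod(whole, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    pct = percent_done(ISSUED_TIB, TOTAL_TIB)
    eta = seconds_remaining(ISSUED_TIB, TOTAL_TIB, ISSUE_RATE_MIB_S)
    print(f"done: {pct:.2f}%")
    print(f"to go: {format_hms(eta)}")
```

Both results land close to what zpool reports (23.38% done, 07:28:27 to go); the small gap is expected because the inputs are already rounded for display. The estimate also assumes a constant rate, so the real finish time can drift a lot.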
Since this is an excruciatingly slow process, can anyone tell me (1) what to expect, and (2) what I should do if the scrub can't fix all the errors?
Does "5.33M repaired" mean 5.33 MB of data has been repaired, or 5.33 million errors have been repaired? ("M" is a really bad abbreviation in this context!)
Are these CKSUM errors unrecoverably high, or is this par for the course?
Any other suggestions are quite welcome!
(PS: I would love to use ECC RAM, but I live in South America and no retailer in my country sells it or ECC-capable mobos. And B2B sellers won't give me the time of day.)