SteezeMcQueen
Basic details:
FreeNAS 11.3 (both U2 and U3) running in a VM on ESXi 6.7.
Physically, there are two Bulldozer CPUs (Opteron 6276) and 54GB of ECC RAM. The FreeNAS VM has 8 vCPU and 32GB RAM assigned. The LSI 9211-8i HBA, connected to four WD Red 3TB drives, is passed through from ESXi.
FreeNAS has been running happily in this VM for several months, and as a bare-metal install on the same hardware for years prior.
Situation:
Over the past two weeks, I have seen intermittent degradation of my pool. It started on May 16th, when my pool threw this alert: "Pool tycho-pool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error." Logging in and running `zpool status` showed that the pool was Online and that the drive corresponding to /dev/da3 had a nonzero READ or WRITE error count. I unfortunately don't have a record of the exact numbers.
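For anyone reproducing the check: the per-device error counters come from `zpool status`, and the gptid labels it shows can be mapped back to a daX device with `glabel status`. Commands only here, since I don't have the original output saved:
Code:
# Per-device READ / WRITE / CKSUM counters, plus any files with known errors
zpool status -v tycho-pool

# Map the gptid labels in the pool output back to daX devices
glabel status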
I looked at `smartctl -a /dev/da3` and didn't find any failures. After poking around on the FreeNAS forums and Reddit, I chalked it up to gremlins, ran a `zpool clear`, and went about my business to see if it would reappear.
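For completeness, that amounted to:
Code:
# Review SMART attributes and the overall health assessment for the suspect disk
smartctl -a /dev/da3

# Reset the pool's error counters and wait to see if they come back
zpool clear tycho-pool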
On May 19th, I upgraded from 11.3U2 to U3 and rebooted. Later that night, another alert was triggered: "Pool tycho-pool state is ONLINE: One or more devices has experienced an unrecoverable error." Fourteen minutes later, the pool entered the degraded state: "Pool tycho-pool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state."
Again, I don't have historical data for the specific values, but once again /dev/da3 was the drive at fault. However, this time /dev/da3 was no longer showing up as attached to the system; my devices ended at da2. I took this as a sign that this was a legitimate drive failure (as I've experienced in the past), ordered some new drives, and powered down the FreeNAS VM.
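For anyone hitting the same thing, confirming from the shell whether the disk is visible to FreeBSD at all can be done with something like:
Code:
# List every disk CAM currently sees on the passed-through HBA
camcontrol devlist

# Look for detach or timeout messages from the mps driver for that disk
dmesg | grep -iE 'da3|mps'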
Last night, I powered the VM back up. The pool came back in the Online state, all drives present. Looking at `zpool status`, a different drive (da2) showed a nonzero CKSUM value (20-something). I rebooted FreeNAS a second time, and the pool came back Online and fully healthy, with `zpool status` showing all zeros.
I scheduled a battery of Long SMART tests to run across all the pool's drives that night (they all passed as of this morning), and I'm now running a scrub.
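For reference, the shell equivalent of that battery would be something along these lines:
Code:
# Kick off a long (extended) SMART self-test on each pool member
for d in da0 da1 da2 da3; do smartctl -t long /dev/$d; done

# Start the scrub and watch its progress
zpool scrub tycho-pool
zpool status tycho-pool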
`smartctl -a` shows RAW_VALUE 0 for IDs 5, 196, and 197:
Code:
DRIVE    ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
/dev/da0 5   Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
         196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
         197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
/dev/da1 5   Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
         196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
         197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
/dev/da2 5   Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
         196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
         197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
/dev/da3 5   Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
         196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
         197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
All of these attributes are flagged as TYPE Pre-fail or Old_age, which I don't find too shocking considering these drives have roughly 36,000 power-on hours (da2 is the baby at 18,000). Everything I've read indicates that the RAW_VALUE is what matters.
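In case it's useful, something along these lines is a quick way to pull just those three attributes for all four drives:
Code:
# Print attributes 5, 196, and 197 for each pool member
for d in da0 da1 da2 da3; do
    echo "/dev/$d"
    smartctl -A /dev/$d | egrep '^ *(5|196|197) '
done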
For each drive, the last 21 SMART self-tests completed without error. I follow the recommended schedule of Short tests on days 5, 12, 19, and 26 and Long tests on days 8 and 22, so that history extends back well before the initial alert on May 16.
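That per-drive test history is what `smartctl -l selftest` reports, e.g.:
Code:
# Self-test history (result, lifetime hours, LBA of first error) for one drive; repeat for da1-da3
smartctl -l selftest /dev/da0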
So, my conundrum: What's actually failing here? The drives are certainly old enough that I won't be shocked if they're bad, but the SMART data doesn't seem to back that up. It could also be the HBA or the cables; is there a good way to test those that doesn't involve simply replacing them? (I'm not opposed, but I don't have any spares on hand.)
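The only non-destructive checks I can think of in the meantime are watching for controller-level noise in the logs while the scrub runs and double-checking the HBA firmware, something like:
Code:
# CAM retries, timeouts, or bus resets would point at cabling or the HBA rather than the platters
grep -iE 'cam|mps|timeout|retry' /var/log/messages

# Confirm the 9211-8i is on IT-mode firmware (P20 is the usual recommendation), if sas2flash is available
sas2flash -listall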
I have new drives, but if it's not actually the drives that are failing, I'm loath to replace them when I don't really need to.
As I said, I'm running a scrub now. When that completes, my next step is to run `badblocks`, unless there's a better one that I'm unaware of.
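(For clarity, the `badblocks` run I have in mind is the destructive write-mode test, so it would only go against a disk after it has been taken out of the pool. Roughly:)
Code:
# Destructive read-write surface test: writes four patterns and verifies each
# -b 4096 matches the 4K physical sectors, -s shows progress, -o logs any bad blocks found
# WARNING: this wipes everything on the target disk
badblocks -wsv -b 4096 -o da3-badblocks.txt /dev/da3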