Phobos
Dabbler
- Joined: Sep 8, 2014
- Messages: 25
Hi all,
tl;dr: Two disks spontaneously disappeared from my encrypted pool. I backed up the data, detached the pool, and re-imported it, and the two disks reappeared with CKSUM errors. What just happened, and why?
System: 6x4 TB WD Red encrypted pool running RAIDZ2; X10SLH-F; Xeon E3-1231v3; 32 GB Samsung ECC RAM; Seasonic PSU; FreeNAS-9.10.2-U4 (27ae72978). Boot pool is mirrored, I have a UPS, and I regularly run scrubs and SMART tests.
--------------------------------------------------------
Longer version:
Sunday afternoon, two disks spontaneously dropped out of my pool:
The volume tank (ZFS) state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
Running `zpool status` showed the drives as UNAVAIL. Rebooting and attempting to unlock the pool resulted in:
The volume tank (ZFS) state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The drives were still listed as UNAVAIL. However, in both cases the affected drives were recognized by SMART and showed no errors. (Curiously, I noticed that the drives had become unselected in the dialog box used to schedule SMART tests.)
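For reference, the shell-side checks amount to the usual camcontrol/smartctl invocations, roughly like this (ada0 is a placeholder device name, not necessarily one of mine):
Code:
camcontrol devlist          # list the disks the controller currently sees
smartctl -a /dev/ada0       # SMART health, attributes, and error log for one drive
smartctl -t short /dev/ada0 # queue a short self-test on that drive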
In a panic, I spun up a backup server running 3x6 TB WD Reds in RAIDZ1 as a guest under VMware Fusion on an old Mac Pro. (I know, very much not ideal. These three drives were supposed to be for a backup pool attached directly to my server, but I haven't yet gotten an HBA.) The data replicated over in about two days without too much fuss.
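For context, the replication itself boils down to ZFS snapshot send/receive; a simplified sketch, with placeholder dataset and host names rather than my actual setup:
Code:
zfs snapshot -r tank/data@migrate1
zfs send -R tank/data@migrate1 | ssh backupbox zfs receive -F backup/data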
Today, I detached the pool completely, imported it again, provided the GELI key and passphrase, and found that all six drives imported, but two had CKSUM errors:
Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 373M in 0h0m with 0 errors on Tue May 30 16:30:16 2017
config:

        NAME                                                STATE     READ WRITE CKSUM
        tank                                                ONLINE       0     0     0
          raidz2-0                                          ONLINE       0     0     0
            gptid/aaaaaaaa-aaaa-11e4-aaaa-xxxxxxxxxxxx.eli  ONLINE       0     0     0
            gptid/bbbbbbbb-bbbb-11e4-bbbb-xxxxxxxxxxxx.eli  ONLINE       0     0     4
            gptid/cccccccc-cccc-11e4-cccc-xxxxxxxxxxxx.eli  ONLINE       0     0     4
            gptid/dddddddd-dddd-11e4-dddd-xxxxxxxxxxxx.eli  ONLINE       0     0     0
            gptid/eeeeeeee-eeee-11e5-eeee-xxxxxxxxxxxx.eli  ONLINE       0     0     0
            gptid/ffffffff-ffff-11e7-ffff-xxxxxxxxxxxx.eli  ONLINE       0     0     0

errors: No known data errors
Right now I am scrubbing the pool and will follow up with long SMART tests on all drives. smartctl lists no errors for any of the drives, and they are all running below 35°C (although it has been getting warmer over the past few weeks).
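For completeness, those checks are just the standard commands, along these lines (ada0 stands in for each drive):
Code:
zpool scrub tank            # start the scrub
zpool status -v tank        # watch its progress
smartctl -t long /dev/ada0  # queue a long self-test on each drive in turn
smartctl -a /dev/ada0       # review the results once the test completes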
Did two drives really spontaneously fail? Is the issue perhaps the cables, SATA ports, or SATA controller? Should I `zpool clear` the errors and wait and see what happens, or do I have to replace the drives, or even the motherboard? As always, thank you for any insight.