I need help diagnosing a hardware failure in my system and preventing any further data loss. Seemingly random drives are dropping from the pool during heavy load and scrubs.
Setup:
2 raidz2 vdevs with six disks each
FreeNAS 9.2.1.8
Storage controller: IBM M1015 with an Intel SAS expander (RES2SV240)
Case: Norco 4224
Motherboard: Supermicro X9SCM-IIF-B
32GB ECC RAM
Here is my current zpool status. If I reboot, the pool will come back online but doing a scrub will cause one or more disks to drop out again.
Code:
  pool: ZFSVolume1
 state: UNAVAIL
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h0m with 26 errors on Sun Nov 23 00:09:42 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        ZFSVolume1                                      UNAVAIL      0   102    52
          raidz2-0                                      DEGRADED     0     0     0
            gptid/93fb9431-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            11611311607109959336                        REMOVED      0     0     0  was /dev/gptid/56205fa0-3d9d-11e4-a079-0015175189a2
            gptid/96303c42-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/96fd27f5-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/98ba8015-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/9a182567-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
          raidz2-1                                      UNAVAIL      0   220   104
            7412218273736074993                         REMOVED      0     0     0  was /dev/gptid/e023dbdb-b1dc-11e3-9e59-0015175189a2
            gptid/e0ef82fa-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/e1c184dc-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            17620432317619218962                        REMOVED      0     0     0  was /dev/gptid/e286b860-b1dc-11e3-9e59-0015175189a2
            13993555882860681678                        REMOVED      0     0     0  was /dev/gptid/3e1d4644-1f46-11e4-a328-0015175189a2
            10907015668487533868                        REMOVED      0     0     0  was /dev/gptid/c3cd8c69-1a59-11e4-a328-0015175189a2
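To make the status output easier to read at a glance, here is a minimal Python sketch that counts REMOVED disks per vdev and compares the count with the raidz parity level. It is only an illustration, not a FreeNAS tool: the function name `removed_per_vdev` and the trimmed `STATUS` sample (device lines copied from the output above) are my own assumptions.

```python
import re

# Device lines copied from the zpool status output above (header lines trimmed).
STATUS = """\
ZFSVolume1 UNAVAIL 0 102 52
  raidz2-0 DEGRADED 0 0 0
    gptid/93fb9431-4ed5-11e3-acd6-0015175189a2 ONLINE 0 0 0
    11611311607109959336 REMOVED 0 0 0 was /dev/gptid/56205fa0-3d9d-11e4-a079-0015175189a2
    gptid/96303c42-4ed5-11e3-acd6-0015175189a2 ONLINE 0 0 0
    gptid/96fd27f5-4ed5-11e3-acd6-0015175189a2 ONLINE 0 0 0
    gptid/98ba8015-4ed5-11e3-acd6-0015175189a2 ONLINE 0 0 0
    gptid/9a182567-4ed5-11e3-acd6-0015175189a2 ONLINE 0 0 0
  raidz2-1 UNAVAIL 0 220 104
    7412218273736074993 REMOVED 0 0 0 was /dev/gptid/e023dbdb-b1dc-11e3-9e59-0015175189a2
    gptid/e0ef82fa-b1dc-11e3-9e59-0015175189a2 ONLINE 0 0 0
    gptid/e1c184dc-b1dc-11e3-9e59-0015175189a2 ONLINE 0 0 0
    17620432317619218962 REMOVED 0 0 0 was /dev/gptid/e286b860-b1dc-11e3-9e59-0015175189a2
    13993555882860681678 REMOVED 0 0 0 was /dev/gptid/3e1d4644-1f46-11e4-a328-0015175189a2
    10907015668487533868 REMOVED 0 0 0 was /dev/gptid/c3cd8c69-1a59-11e4-a328-0015175189a2
"""

# Matches vdev names such as "raidz2-0"; the digit after "raidz" is the parity level.
VDEV_RE = re.compile(r"^raidz(\d)-\d+$")


def removed_per_vdev(status_text):
    """Return {vdev_name: (parity, removed_count)} for each raidz vdev."""
    result = {}
    current = None
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        match = VDEV_RE.match(name)
        if match:
            # Start counting for a new vdev.
            current = name
            result[current] = (int(match.group(1)), 0)
        elif current is not None and state == "REMOVED":
            parity, removed = result[current]
            result[current] = (parity, removed + 1)
    return result


for vdev, (parity, removed) in removed_per_vdev(STATUS).items():
    verdict = "within parity" if removed <= parity else "exceeds parity"
    print(f"{vdev}: {removed} removed, parity {parity} ({verdict})")
```

On the output above this reports one removed disk in raidz2-0 and four in raidz2-1. A raidz2 vdev tolerates at most two missing disks, which is why raidz2-1 (and with it the whole pool) shows as UNAVAIL.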
The logs are filled with CAM status errors for every disk that dropped out, but they don't point to a single problem drive. SMART checks seem okay.
Can someone help me with the diagnosis process I should follow (to find the problem and prevent data loss)? I'm going to try replacing the power supply next. Let me know if any additional information is required.
More info:
camcontrol devlist
Code:
[root@freenas] /var/log# camcontrol devlist
<ATA Hitachi HDS5C404 A3B0>        at scbus0 target 8 lun 0 (da0,pass0)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 10 lun 0 (da1,pass1)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 11 lun 0 (da2,pass2)
<ATA Hitachi HDS5C404 A3B0>        at scbus0 target 12 lun 0 (da3,pass3)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 13 lun 0 (da4,pass4)
<Intel RES2SV240 0d00>             at scbus0 target 14 lun 0 (ses0,pass5)
<ATA ST4000DM000-1CD1 CC43>        at scbus0 target 15 lun 0 (pass12,da5)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 16 lun 0 (da6,pass7)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 17 lun 0 (da7,pass8)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 18 lun 0 (pass9,da8)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 22 lun 0 (pass11,da9)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 23 lun 0 (pass10,da10)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 24 lun 0 (pass6,da11)
<Kingston DataTraveler 2.0 PMAP>   at scbus8 target 0 lun 0 (pass13,da12)
gpart show (note: it exited early with a segmentation fault)
Code:
[root@freenas] /var/log# gpart show
=>        34  7814037101  da0  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da1  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da2  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da3  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da4  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

Segmentation fault (core dumped)
edit: added more info