Disks dropping from array during scrub

agreenfi · Nov 23, 2014

I need some help in diagnosing a hardware failure in my system, and preventing any further data loss. Drives (seemingly random ones) are dropping from the pool during heavy load/scrubs.

Setup:
2 raidz2 vdevs with six disks each
FreeNAS 9.2.1.8
Storage controller: IBM M1015 with an intel SAS expander (RES2SV240)
Case: Norco 4224
Motherboard: Supermicro X9SCM-IIF-B
32GB ECC ram

Here is my current zpool status. If I reboot, the pool will come back online but doing a scrub will cause one or more disks to drop out again.

Code:

  pool: ZFSVolume1
state: UNAVAIL
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h0m with 26 errors on Sun Nov 23 00:09:42 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        ZFSVolume1                                      UNAVAIL      0   102    52
          raidz2-0                                      DEGRADED     0     0     0
            gptid/93fb9431-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            11611311607109959336                        REMOVED      0     0     0  was /dev/gptid/56205fa0-3d9d-11e4-a079-0015175189a2
            gptid/96303c42-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/96fd27f5-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/98ba8015-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/9a182567-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
          raidz2-1                                      UNAVAIL      0   220   104
            7412218273736074993                         REMOVED      0     0     0  was /dev/gptid/e023dbdb-b1dc-11e3-9e59-0015175189a2
            gptid/e0ef82fa-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/e1c184dc-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            17620432317619218962                        REMOVED      0     0     0  was /dev/gptid/e286b860-b1dc-11e3-9e59-0015175189a2
            13993555882860681678                        REMOVED      0     0     0  was /dev/gptid/3e1d4644-1f46-11e4-a328-0015175189a2
            10907015668487533868                        REMOVED      0     0     0  was /dev/gptid/c3cd8c69-1a59-11e4-a328-0015175189a2

Logs are filled with CAM status errors for all the disks that dropped out, but don't point to a single drive with an issue. Smart checks seem okay.

Can someone help me with the diagnosis process I should follow (to find the problem and prevent data loss)? I'm going to try replacing the power supply next. Let me know if any additional information is required.

More info:

camcontrol devlist

Code:

[root@freenas] /var/log# camcontrol devlist
<ATA Hitachi HDS5C404 A3B0>        at scbus0 target 8 lun 0 (da0,pass0)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 10 lun 0 (da1,pass1)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 11 lun 0 (da2,pass2)
<ATA Hitachi HDS5C404 A3B0>        at scbus0 target 12 lun 0 (da3,pass3)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 13 lun 0 (da4,pass4)
<Intel RES2SV240 0d00>             at scbus0 target 14 lun 0 (ses0,pass5)
<ATA ST4000DM000-1CD1 CC43>        at scbus0 target 15 lun 0 (pass12,da5)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 16 lun 0 (da6,pass7)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 17 lun 0 (da7,pass8)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 18 lun 0 (pass9,da8)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 22 lun 0 (pass11,da9)
<ATA ST4000DM000-1F21 CC54>        at scbus0 target 23 lun 0 (pass10,da10)
<ATA WDC WD40EFRX-68W 0A80>        at scbus0 target 24 lun 0 (pass6,da11)
<Kingston DataTraveler 2.0 PMAP>   at scbus8 target 0 lun 0 (pass13,da12)

gpart show (note: it exited early with segmentation fault)

Code:

[root@freenas] /var/log# gpart show
=>        34  7814037101  da0  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da1  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da2  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da3  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da4  GPT  (3.7T)
          34          94       - free -  (47k)
         128     4194304    1  freebsd-swap  (2.0G)
     4194432  7809842696    2  freebsd-zfs  (3.7T)
  7814037128           7       - free -  (3.5k)

Segmentation fault (core dumped)

edit: added more info

BigDave · Nov 23, 2014

Please post your system specs (hardware), freenas version (i.e. 2.1.x), and don't hold anything back.
The more info you give the easier this will be for others to help;)

Ericloewe · Nov 23, 2014

Are there any backplanes in that chassis? If so, does bypassing them (temporarily removing the drives and connecting them directly) solve the problem?

agreenfi · Nov 23, 2014

I updated the post with some more information. That is a good thought on the backplane. I'll need to get some reverse SAS cables to connect them directly. There are four disks per backplane, so my problem isnt just isolated to one of them.

Ericloewe · Nov 23, 2014

Bad power perhaps? Seems like an awful lot of drives to be failing at once.

rs225 · Nov 23, 2014

Run a memtest too, for the segfault. But power is a good thing to start with. Cables/enclosures is next.

cyberjock · Nov 23, 2014

I have to wonder about your power...

I doubt your RAM is the problem as you have ECC RAM and you haven't said you've gotten any RAM errors. But poor power can cause all sorts of problems. :P

The Norco 4224 has dual power ports for each row of hard drives. I have them all cross connected and such so they all have two independent sources of power from the PSU to a "row" of hard drives.

diehard · Nov 24, 2014

I'm about 90% sure this is a SAS Expander -> SATA problem.

cyberjock · Nov 24, 2014

diehard said:
I'm about 90% sure this is a SAS Expander -> SATA problem.

I'm not so sure of that. I'm using an M1015 with that expander without problems, as are most of my friends. In fact, I'm even using a Norco 4224 chassis.

It's entirely possible that Intel has released a revised hardware version that breaks something, but that's not something I'd hang my hat on based on a single case.

Are you using the P16 firmware on your M1015?

diedrichg · Nov 24, 2014

My first inclination was power supply. Did you happen to post that?

agreenfi · Nov 24, 2014

Thanks for the advice everyone. I'll get a new power supply in there tonight, and see if that fixes it. I also have some reverse SAS breakout cables on the way, which I think will allow me to connect half the drives to the motherboard controller, and the other half to the M1015. If the problem continues at this point, it should narrow it down at least.

The system has been running fine for the last year and half, with the exception of a couple bad Seagate drives. I'm a bit concerned that a bad drive is 'taking out' others in the array. Not sure if that is possible or not.

agreenfi · Nov 24, 2014

Need more help. Replaced the power supply, but scrub ending after just a couple minutes when it should take 12+ hours. No errors have appeared in /var/log/messages or the IPMI display, and no drives have dropped out yet. How can I get scrub to complete??? Here is my current zpool status, a few minutes after executing the scrub command from within the FreeNAS GUI:

Code:

[root@freenas] /var/log# zpool status
  pool: ZFSVolume1
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h1m with 52 errors on Mon Nov 24 16:03:12 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        ZFSVolume1                                      ONLINE       0     0   130
          raidz2-0                                      ONLINE       0     0     0
            gptid/93fb9431-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/56205fa0-3d9d-11e4-a079-0015175189a2  ONLINE       0     0    11
            gptid/96303c42-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/96fd27f5-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/98ba8015-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/9a182567-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0   260
            gptid/e023dbdb-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/e0ef82fa-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/e1c184dc-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/e286b860-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/3e1d4644-1f46-11e4-a328-0015175189a2  ONLINE       0     0     0
            gptid/c3cd8c69-1a59-11e4-a328-0015175189a2  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list
[root@freenas] /var/log#

FYI, the data is still there. Here is the zpool list output. Is it time to delete the array and start over?

Code:

[root@freenas] /var/log# zpool list
NAME         SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
ZFSVolume1  43.5T  20.8T  22.7T    47%  1.00x  ONLINE  /mnt

cyberjock · Nov 24, 2014

Well, considering you have data errors, probably. What's the output of "zpool status -v"?

agreenfi · Nov 24, 2014

The data errors were on a couple files generated by a security camera. I deleted them, and it now reports no known data errors. But the scrub still won't complete.

Code:

[root@freenas] /var/log# zpool status -v
  pool: ZFSVolume1
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h1m with 0 errors on Mon Nov 24 16:31:44 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        ZFSVolume1                                      ONLINE       0     0   182
          raidz2-0                                      ONLINE       0     0     0
            gptid/93fb9431-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/56205fa0-3d9d-11e4-a079-0015175189a2  ONLINE       0     0    11
            gptid/96303c42-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/96fd27f5-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/98ba8015-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
            gptid/9a182567-4ed5-11e3-acd6-0015175189a2  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0   364
            gptid/e023dbdb-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/e0ef82fa-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/e1c184dc-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/e286b860-b1dc-11e3-9e59-0015175189a2  ONLINE       0     0     0
            gptid/3e1d4644-1f46-11e4-a328-0015175189a2  ONLINE       0     0     0
            gptid/c3cd8c69-1a59-11e4-a328-0015175189a2  ONLINE       0     0     0

errors: No known data errors
[root@freenas] /var/log#

cyberjock · Nov 24, 2014

That's likely because you have ZFS metadata corruption. Time to give up on the pool. :/

rs225 · Nov 24, 2014

Run a memory test. This scenario is so bizarre, it must be ruled out if you have any plans of continuing to use the hardware for ZFS.

Mlovelace · Nov 24, 2014

Worth a read, maybe it will help :/
https://docs.oracle.com/cd/E19082-01/817-2271/6mhupg6qg/index.html

cyberjock · Nov 24, 2014

rs225 said:
Run a memory test. This scenario is so bizarre, it must be ruled out if you have any plans of continuing to use the hardware for ZFS.

It's not bizarre at all. You have multiple drives dropping out of the pool randomly all the time without the ability to reconsile with a zpool scrub and you *will* get corruption. There's nothing shocking at all. In fact, when I saw his first output I figured the pool would be done for, the question was whether it would even mount or not. :P

Mlovelace · Nov 24, 2014

Do you have snap shots? You might still have the corrupt file in a snap shot, that will give you a scrub error even after you delete the offending file(s).

If you do try deleting them and scrubbing again.

agreenfi · Nov 24, 2014

Well, a scrub has been running for a couple hours now. I deleted a couple Tb of iscsi esxi datastores; this is where most of the read/write activity is, so I figured a likely spot for corruption. Also, 'zpool status -v' had identified one of these files as corrupted yesterday. Once I'm convinced that hardware issues are resolved, I'll go to the effort of destroying the pool and restoring from backups regardless.

Important Announcement for the TrueNAS Community.

Disks dropping from array during scrub

Cadet

FreeNAS Enthusiast

Server Wrangler

Cadet

Server Wrangler

Guru

Inactive Account

Contributor

Inactive Account

Wizard

Cadet

Cadet

Inactive Account

Cadet

Inactive Account

Guru

Guru

Inactive Account

Guru

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Disks dropping from array during scrub"

Similar threads