Simultaneous drive failures!?

Status
Not open for further replies.

vikingboy

Explorer
Joined
Aug 3, 2014
Messages
71
Would appreciate some expert advice if it isn't too late.

My RAIDZ2 array was running a scrub this morning. UK temps are unusually high today and, for the first time since building this system, I received alarms that the limit of 37 °C had been reached by a number of drives.

Device: /dev/da6 [SAT], Temperature 37 Celsius reached critical limit of 37 Celsius (Min/Max 25/37!)
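(For context, that alert comes from smartd. If I understand it right, FreeNAS builds smartd.conf from the S.M.A.R.T. settings in the GUI, so the 37 °C figure corresponds to a directive roughly like the one below; the device name and extra flags are just an example.)
Code:
# sketch of a smartd.conf entry: -W DIFF,INFO,CRIT sets temperature alerts,
# and the third value (37) is the critical limit that triggered the email
/dev/da6 -a -d sat -W 0,0,37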

At 11:00 am I get a more serious warning.
Code:
The volume RAID (ZFS) state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.


At 11:10 am, it appears that the array has crapped itself.
Code:
The volume RAID (ZFS) state is UNAVAIL: One or more devices are faulted in response to IO failures.


I checked the array status and it looks bad...

Code:
[admin@freenas] /% zpool status
  pool: RAID
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Wed Jul  1 04:00:03 2015
        10.6T scanned out of 20.4T at 422M/s, 6h47m to go
        0 repaired, 51.84% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    RAID                                            UNAVAIL      0     0     0
     raidz2-0                                      UNAVAIL      0     3     0
       gptid/4087080c-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
       gptid/415efb3e-3a74-11e4-b9cb-90e2ba382e3c  FAULTED      7   142     0  too many errors
       3981687536073005009                         REMOVED      0     0     0  was /dev/gptid/42391ba3-3a74-11e4-b9cb-90e2ba382e3c
       gptid/43109c05-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
       6808129795312123271                         REMOVED      0     0     0  was /dev/gptid/43f12866-3a74-11e4-b9cb-90e2ba382e3c
       gptid/44ceabd7-3a74-11e4-b9cb-90e2ba382e3c  FAULTED      6     8     0  too many errors
       gptid/45aecb5f-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
       gptid/4693dff1-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
       16041530080660729023                        REMOVED      0     0     0  was /dev/gptid/477508a3-3a74-11e4-b9cb-90e2ba382e3c
       gptid/48592025-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0

errors: 96 data errors, use '-v' for a list

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jul  1 03:46:06 2015
config:

    NAME                                          STATE     READ WRITE CKSUM
    freenas-boot                                  ONLINE       0     0     0
     gptid/464811d2-cb24-11e4-9959-90e2ba382e3c  ONLINE       0     0     0

errors: No known data errors
[admin@freenas] /% 


I'm due to move country on Friday, so I don't have time to do much troubleshooting.
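For reference, the REMOVED entries mean ZFS lost those providers entirely, so before powering down it's worth seeing which daX devices dropped and when. A minimal sketch (the exact log wording depends on the HBA/driver, so the grep pattern is a guess):
Code:
# look for devices detaching around 11:00-11:10 in the system log
grep -iE 'detach|destroyed|timeout' /var/log/messages
# list the gptid labels still present; the REMOVED ones should be absent
glabel status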
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Please use the code tags to post CLI outputs.

37 °C is fine for the drives. I think the problem comes from something else (HBA, PSU, ...) or some of the drives already had problems for a while but you didn't notice it.
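For example, a quick per-drive check along these lines (da6 is just an example device name; repeat for each disk) would show whether a drive already had bad sectors building up, or whether the errors are CRC/link errors pointing at cabling or the backplane:
Code:
# media problems show up as Reallocated/Pending/Uncorrectable sectors,
# cabling/backplane problems as UDMA_CRC_Error_Count
smartctl -a /dev/da6 | grep -iE 'reallocated|pending|uncorrect|crc'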
 

vikingboy

Explorer
Joined
Aug 3, 2014
Messages
71
Yeah, I'm beginning to suspect something else, such as RAM, the drive controller cards (which are actively cooled) or the motherboard chipset (which is passively cooled). There's good case airflow, but it's in a home office so it's 'tuned' for noise control rather than outright airflow.
I've powered the system off and will reboot it this evening to see what the damage is.
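When it comes back up I'll also grep the logs for controller-level errors rather than plain drive errors. A rough sketch (the driver name depends on the controller, e.g. mps for an LSI HBA, so these patterns are assumptions):
Code:
# CAM retries/resets usually point at the controller, cabling or power
# rather than at the individual drives
grep -iE 'cam status|retrying|reset|timeout' /var/log/messages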
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I don't like the looks of all those "removed" disks.

How many drives is the vdev supposed to have? What has been done to this pool in the past?
 

vikingboy

Explorer
Joined
Aug 3, 2014
Messages
71
There are 10 disks in the Z2 array (a single vdev), consisting of 4 TB WD Reds. They are all installed in Supermicro 5-in-3 bays with fans pulling air over the drives, and usual temps are below 30 °C.
The disks have been bombproof since this server was built approximately 8 months ago. The array runs regular SMART scans as per Cyberjock's guidelines and scrubs every 14 days, with email logging set up. There were no previous signs of issues.
It's all been fine and error free (as far as I know, anyway) until today, when it appears multiple drives failed simultaneously and some reported being disconnected <gulp>.
I'm wondering if the controller card died or overheated, as I can't imagine several drives would die simultaneously like that... although there is probably a law that states it is possible.

The temps have cooled off here since earlier today. I'll fire it back up and see what's what... do I feel lucky... nah. :-(
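First thing after it boots, I'll check that all ten data disks actually show up before looking at the pool; a quick sketch (device names depend on how the controller enumerates them):
Code:
# list all attached disks seen by CAM; there should be ten 4 TB WD Reds
camcontrol devlist
# then check what ZFS thinks of the pool
zpool status RAID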
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
So half of your drives are screwed in one way or another.

Five drives don't just simultaneously die independently, so there are a few scenarios I can think of:
  • One of the 5-in-3 bays has some serious issue. Could be power or data. Try connecting the drives directly, without the bays.
  • Bad SATA/SAS controller
  • Bad PSU/bad connection to one of the backplanes
 

vikingboy

Explorer
Joined
Aug 3, 2014
Messages
71
So I rebooted this evening and this is now being reported...

Code:
[admin@freenas] /% zpool status
  pool: RAID
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Jul  1 04:00:03 2015
        10.6T scanned out of 20.4T at 69.4M/s, 41h10m to go
        0 repaired, 51.94% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    RAID                                            ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/4087080c-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
        gptid/415efb3e-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0    13
        gptid/42391ba3-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     3
        gptid/43109c05-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
        gptid/43f12866-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     2
        gptid/44ceabd7-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
        gptid/45aecb5f-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
        gptid/4693dff1-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
        gptid/477508a3-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0
        gptid/48592025-3a74-11e4-b9cb-90e2ba382e3c  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jul  1 03:46:06 2015
config:

    NAME                                          STATE     READ WRITE CKSUM
    freenas-boot                                  ONLINE       0     0     0
      gptid/464811d2-cb24-11e4-9959-90e2ba382e3c  ONLINE       0     0     0

errors: No known data errors
[admin@freenas] /%


I'm confused; what should my next steps be? Should I let the scrub complete on this hardware?
Should I buy a new server and transfer all the disks? Buy new disks?
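For reference, the commands the 'action' line points at would be something along these lines; I assume I should only run the clear once the current scrub has finished and the cause is understood:
Code:
# list any files affected by the checksum errors
zpool status -v RAID
# once the hardware cause is found and fixed, reset the error counters
zpool clear RAID
# and run a fresh scrub to confirm everything now reads cleanly
zpool scrub RAID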
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Let the scrub complete and then start searching for the source of the problem. First, identify whether the affected drives are all in the same 5-in-3 bay, as @Ericloewe suggested ;)
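A minimal way to do that mapping (a sketch; the daX names are just examples and depend on how the HBA enumerates the disks): match the gptid labels from zpool status to devices, then match devices to the serial numbers printed on the drive labels in each bay.
Code:
# map gptid labels (as shown in zpool status) to daX devices
glabel status
# print the serial number of a drive; repeat for each daX and match it
# against the sticker on the drive in each 5-in-3 bay
smartctl -i /dev/da6 | grep -i 'serial number'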
 