single or ALL drives failed?

Status
Not open for further replies.

neils

Dabbler
Joined
Oct 29, 2012
Messages
46
FreeNAS 9.3 STABLE on a 'backblaze' v5.0 storage pod from Backuppods

I had been receiving the following 'critical alerts' email for a 15-drive RAIDZ2:
---
Code:
Device: /dev/ada3, 40 Currently unreadable (pending) sectors
Device: /dev/ada3, 9 Offline uncorrectable sectors

---
Then a couple of days ago I got:
---
Code:
Device: /dev/ada3, 40 Currently unreadable (pending) sectors
Device: /dev/ada3, 9 Offline uncorrectable sectors
The volume csrpd1 (ZFS) state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
The capacity for the volume 'csrpd1' is currently at 92%, while the recommended value is below 80%.

---
zpool status -v csrpd1 showed
---
Code:
pool: csrpd1
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 35h33m with 2 errors on Mon Aug  1 11:33:58 2016
config:

NAME                                            STATE     READ WRITE CKSUM
csrpd1                                          ONLINE       0     0     2
  raidz2-0                                      ONLINE       0     0     4
    gptid/652599a0-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/6597aff4-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/6610b29c-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/6688829c-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/6702e919-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/677d928e-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/67fbdb3c-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/687b22bd-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/68ec3f39-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/69699a43-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/69e1fe49-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/6a553de4-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/6acc9261-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/6b4adf15-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
    gptid/6bb9e2ea-d68b-11e5-9dde-0cc47a5eccf4  ONLINE       0     0     0
spares
  gptid/6c348586-d68b-11e5-9dde-0cc47a5eccf4    AVAIL

errors: Permanent errors have been detected in the following files:

       /mnt/csrpd1/NEXRAD/level2/2008/200809/20080928/KTWX/KTWX20080928_210731_V03.bz2

---
BTW, /dev/ada3 is gptid/6688829c-d68b-11e5-9dde-0cc47a5eccf4
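(For anyone checking the same thing, the device-to-gptid mapping can be read from glabel; a minimal sketch, assuming the standard FreeNAS GPT labels and the device name from this thread:)

Code:
# Show GEOM labels and the adaXpY providers they map to;
# the gptid for each partition appears next to its device node.
glabel status | grep ada3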

Note zpool status was not indicating a degraded state.
Then today I received:
---
Code:
Device: /dev/ada3, 40 Currently unreadable (pending) sectors
Device: /dev/ada3, 9 Offline uncorrectable sectors
The volume csrpd1 (ZFS) state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
The capacity for the volume 'csrpd1' is currently at 92%, while the recommended value is below 80%.

---
and zpool status -v csrpd1 now shows
---
Code:
pool: csrpd1
state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 35h33m with 2 errors on Mon Aug  1 11:33:58 2016
config:

NAME                                            STATE     READ WRITE CKSUM
csrpd1                                          DEGRADED     0     0    38
  raidz2-0                                      DEGRADED     0     0    76
    gptid/652599a0-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/6597aff4-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/6610b29c-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/6688829c-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/6702e919-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/677d928e-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/67fbdb3c-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/687b22bd-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/68ec3f39-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/69699a43-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/69e1fe49-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/6a553de4-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/6acc9261-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/6b4adf15-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
    gptid/6bb9e2ea-d68b-11e5-9dde-0cc47a5eccf4  DEGRADED     0     0     0  too many errors
spares
  gptid/6c348586-d68b-11e5-9dde-0cc47a5eccf4    AVAIL

errors: Permanent errors have been detected in the following files:

        /mnt/csrpd1/NEXRAD/level2/2008/200809/20080928/KTWX/KTWX20080928_210731_V03.bz2

---
Do I believe my eyes? Have ALL of the drives failed?
What is this zpool status telling me?
 
Last edited by a moderator:

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
Your pool status is telling you that your pool is degraded. One or more drives have either failed or experienced unrecoverable errors. From the manual:
Damaged files may or may not be able to be removed depending on the type of corruption. If the corruption is within the plain data, the file should be removable. If the corruption is in the file metadata, then the file cannot be removed, though it can be moved to an alternate location. In either case, the data should be restored from a backup source. It is also possible for the corruption to be within pool-wide metadata, resulting in entire datasets being unavailable. If this is the case, the only option is to destroy the pool and re-create the datasets from backup.
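For what it's worth, a minimal sketch of the usual recovery for a single corrupt file, assuming a backup copy exists somewhere (the backup path below is purely hypothetical):

Code:
# Remove the damaged file and restore it from a backup copy
rm /mnt/csrpd1/NEXRAD/level2/2008/200809/20080928/KTWX/KTWX20080928_210731_V03.bz2
cp /path/to/backup/KTWX20080928_210731_V03.bz2 /mnt/csrpd1/NEXRAD/level2/2008/200809/20080928/KTWX/
# Clear the logged errors, then scrub and re-check
zpool clear csrpd1
zpool scrub csrpd1
zpool status -v csrpd1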

In the future, please please please please use code tags. It makes everything much easier to read.

Your additional problem is that you are using too much of your pool. You should not use more than 80%.
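You can check how full the pool is from the shell; a quick sketch using the pool name from this thread:

Code:
# Overall pool capacity
zpool list csrpd1
# Per-dataset usage, to see where the space is going
zfs list -o space -r csrpd1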
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
You have a 15-drive-wide RAIDZ2 vdev? :eek:
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Do you have SMART tests configured for your drives? I'm curious what the temps are. @Bidule0hm has some great scripts which make it easy.
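For reference, a minimal sketch of running and reading a SMART test by hand (device name taken from this thread); the periodic tests themselves are scheduled through the FreeNAS GUI:

Code:
# Start a short self-test on the suspect drive
smartctl -t short /dev/ada3
# A few minutes later, check the self-test log and the temperature attribute
smartctl -l selftest /dev/ada3
smartctl -A /dev/ada3 | grep -i temperature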
 

neils

Dabbler
Joined
Oct 29, 2012
Messages
46
Your pool status is telling you that your pool is degraded. One or more drives have either failed or experienced unrecoverable errors. From the manual:


In the future, please please please please use code tags. It makes everything much easier to read.

Your additional problem is that you are using too much of your pool. You should not use more than 80%.

Yes, too many drives in the pool and too much usage. What was I thinking? (Maybe I wasn't.)

I'm curious about the zpool report of 'too many errors' on every drive, when the SMART messages had been specifically about /dev/ada3. Drive STATEs in zpool output are normally ONLINE, UNAVAIL, and so on, but a 'DEGRADED' state for a drive? And for all of them? I haven't seen that before in 'zpool status' reports when a drive went bad.

Is there a relationship between the excessive drive count in this zpool, the excessive capacity usage in this case, the corrupted and uncorrectable file, and the zpool drive state of 'DEGRADED'?

Drive temps are 24-29C

Code:
foreach i ( `seq 0 15` )
foreach? echo == ada$i
foreach? smartctl -a /dev/ada${i} | egrep '194|197|198|199'
foreach? echo
foreach? end
== ada0
194 Temperature_Celsius     0x0002   222   222   000    Old_age   Always       -       27 (Min/Max 18/42)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada1
194 Temperature_Celsius     0x0002   240   240   000    Old_age   Always       -       25 (Min/Max 18/39)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada2
194 Temperature_Celsius     0x0002   240   240   000    Old_age   Always       -       25 (Min/Max 17/41)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada3
194 Temperature_Celsius     0x0002   250   250   000    Old_age   Always       -       24 (Min/Max 18/42)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       40
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       9
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada4
194 Temperature_Celsius     0x0002   240   240   000    Old_age   Always       -       25 (Min/Max 18/41)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada5
194 Temperature_Celsius     0x0002   250   250   000    Old_age   Always       -       24 (Min/Max 18/41)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada6
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 21/43)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada7
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 21/42)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada8
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 19/41)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada9
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 20/43)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada10
194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 18/40)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada11
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 21/43)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada12
194 Temperature_Celsius     0x0002   222   222   000    Old_age   Always       -       27 (Min/Max 20/43)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada13
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 22/42)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada14
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 22/43)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

== ada15
194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 19/40)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
 

neils

Dabbler
Joined
Oct 29, 2012
Messages
46
Your pool status is telling you that your pool is degraded. One or more drives have either failed or experienced unrecoverable errors. From the manual:


In the future, please please please please use code tags. It makes everything much easier to read.

Your additional problem is that you are using too much of your pool. You should not use more than 80%.


To be clear, I've seen degraded pools in the past, where zpool status shows the vdev as DEGRADED and then lists the STATE of each drive, from which one can determine which drive failed. Good/active drives have a state of 'ONLINE'.
There is no such distinction between the drive states in the status output above: they are all 'DEGRADED'. That's not a valid drive state, is it? Is this a bug in FreeNAS 9.3?
 

neils

Dabbler
Joined
Oct 29, 2012
Messages
46
What type of HBA are you using?

From the backuppods.com website:

MB: supermicro MBD-X9SRH-7TF-O
HBA: 3 x Sunrich A-540 SATA III 4 port PCIe (Marvell 9235 chipset)
Backplane: 9 x Sunrich S-331 5 port expansion card (Marvell 9715 chipset)
 

neils

Dabbler
Joined
Oct 29, 2012
Messages
46
From the backuppods.com website:

MB: supermicro MBD-X9SRH-7TF-O
HBA: 3 x Sunrich A-540 SATA III 4 port PCIe (Marvell 9235 chipset)
Backplane: 9 x Sunrich S-331 5 port expansion card (Marvell 9715 chipset)

Also, I should mention that in addition to each drive being listed with state = DEGRADED in the zpool status, when selecting each drive under
Storage > volume > status > drive number
the only button available is 'Replace' rather than 'Offline', suggesting, as the documentation says, that they are already offline.

But the drives couldn't all be offline - the vdev is mounted and available for read/write.

Is FreeNAS 9.3 having a problem with this hardware?
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Is FreeNAS 9.3 having a problem with this hardware?
That's entirely possible. Many users have had problems with Marvell chipsets. If you have access to something like an LSI 9211-8i, IBM M1015, Dell H200, etc. I would try importing the pool using one of those cards. The backplanes might also be dodgy, and you might want to try cutting them out of the loop as well.
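If you do swap controllers, the pool move itself is just an export and an import; a rough sketch with this thread's pool name (in FreeNAS you would normally do the import step through Storage > Import Volume rather than the CLI, so the GUI stays in sync):

Code:
# Cleanly export the pool on the current hardware
zpool export csrpd1
# After recabling the drives to the new HBA, import it again
zpool import csrpd1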
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Also, that's a lot of sata port multipliers, which are not recommended.

from the hardware recommendations:

Avoid using SATA Port Multipliers at all costs. If you don't the cost is very likely to be "your data". These are error-prone and built by the lowest bidder and are about as reliable as you can expect.

It could just be that one of the SATA cards or multiplier cards has died, and it might be as simple as replacing the dead card.

This is one of the reasons it is wise to spread the drives in a vdev across your HBA cards if you can, so that losing an HBA doesn't cost you a vdev.
 
Last edited:

neils

Dabbler
Joined
Oct 29, 2012
Messages
46
Also, that's a lot of sata port multipliers, which are not recommended.

from the hardware recommendations:



It could just be that one of the SATA cards or multiplier cards has died, and it might be as simple as replacing the dead card.

This is one of the reasons it is wise to spread the drives in a vdev across your HBA cards if you can, so that losing an HBA doesn't cost you a vdev.

Good recommendation; at least I did that.
I wonder what would be a good diagnostic for a dead port multiplier card?
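(One rough starting point, sketched from standard FreeBSD tools: check whether every disk still enumerates, and whether the kernel is logging errors that cluster on one controller or channel.)

Code:
# List every disk the kernel currently sees, grouped by bus/controller
camcontrol devlist -v
# Look for ATA/CAM errors or timeouts concentrated on one channel
dmesg | egrep -i 'ahci|timeout|error'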
 

neils

Dabbler
Joined
Oct 29, 2012
Messages
46
That's entirely possible. Many users have had problems with Marvell chipsets. If you have access to something like an LSI 9211-8i, IBM M1015, Dell H200, etc. I would try importing the pool using one of those cards. The backplanes might also be dodgy, and you might want to try cutting them out of the loop as well.

Might I assume the report of 'all' drives as DEGRADED reflects a driver bug, since I can still conduct I/O on the zpool?

Is there any risk in exporting the pool in order to, say, try another OS install and then import the pool from the new OS?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Have you run a memory test on this system? Bad memory is another potential cause of corruption like this.

I think it shows all the drives as degraded because ZFS can't figure out why it can't get a valid checksum from the vdev. This is also why I think your problem is in your CPU/mobo/RAM. If it were your controller cards acting in a somewhat random manner, you would expect some checksum failures that localize on any or all of the drives. None of yours do; they are all at higher levels, which means it is the checksum itself that is wrong. Only the CPU/RAM/mobo can do this, usually. (The PCI bus and the card could still do it, but then you are left wondering how it manages to fail so darn accurately!)
 
Last edited:

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Since no one has mentioned it, I'll say it: your pool is corrupt and probably not coming back. You can try fixing your controller problem or using different hardware, but it's not going to work. You should also check the SMART values for all your drives.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
MB: supermicro MBD-X9SRH-7TF-O
HBA: 3 x Sunrich A-540 SATA III 4 port PCIe (Marvell 9235 chipset)
Backplane: 9 x Sunrich S-331 5 port expansion card (Marvell 9715 chipset)
What RAM are you using? I noticed that this MB can take ECC and Non-ECC...
 

neils

Dabbler
Joined
Oct 29, 2012
Messages
46
Hang on folks ...

I shut down the system to run a memory test, which passed the whole first pass before I bailed.
Then I powered up the system, and the pool came back up *not* DEGRADED, with all drive states ONLINE.

I have offlined and replaced the drive that was generating the sector errors, so it is now resilvering.

What gives?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
See what it shows after resilver finishes.

If it is still clear, then it means the data on the drives is good. The problem, then, was that somehow good data on the drives was being seriously mangled by the time it got to your CPU. And that problem has mysteriously disappeared.
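A sketch of what to look at once the resilver completes, using this thread's pool name:

Code:
# Confirm the resilver finished and see whether any errors remain
zpool status -v csrpd1
# If the error list and counters are clean, reset them
zpool clear csrpd1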
 