Multiple Drives in ONLY one VDEV appear to be failing??

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
Preface: This is in my backup system listed in my signature running TrueNAS-12.0-U3 with 5 x 9 Drive RAIDZ2 VDEVs. I have 140TB usable and about half of that is used.

I got an email saying that my pool was in a degraded state. I logged in and it appears that there is something very strange going on. I have three failing drives, and all of those drives just happen to be in the same VDEV which seems very strange to me. All three are resilvering but one says it is online, one says that it is degraded and one says that it is faulted.

Needless to say, those drives need to be replaced. It is interesting to note that these nine drives are all Seagate OEM drives; all the rest of the drives in the system are WD.

So I guess I am looking for advice on how best to handle this situation. As I stated above, this is a backup to another unit so if I lost all data on it I would lose no data at all, but replicating 70+TB back from my primary server would be a pain.

Should I just wait to see if it resilvers and then replace the drives one at a time? I am thinking all nine of those drives have to go since three of them have already failed. What is the easiest way to replace ALL drives in a VDEV? From reading, I am thinking it is one at a time.

Any advice would be appreciated!

Code:
  pool: vol1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Apr 10 13:28:10 2022
        726G scanned at 556M/s, 321G issued at 246M/s, 97.2T total
        7.78G resilvered, 0.32% done, 4 days 18:34:16 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/9bed4d8b-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/9fdb0a9f-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/a0c02e7b-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/a2accfe9-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/a45f9c8d-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/a560bed0-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/a7b50f54-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/ab823287-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/adeeb776-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/b4853db5-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/b9e9de9c-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/bb972e33-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/c384679f-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/c298bc4f-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/c71378b5-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/c89c5dc2-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/cb5ae9f7-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/ce0c7f7a-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            gptid/d198996f-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/d2664bca-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/d3dc940a-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/d5468178-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/dfad66ba-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/ded262b0-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/e0cddd90-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/e7cb18ae-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/ead47bf6-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
          raidz2-3                                      ONLINE       0     0     0
            gptid/edd89401-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/efce3fb4-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/45ca72b8-81c4-11eb-9763-0007433b1890  ONLINE       0     0     0
            gptid/9dbcb705-15a4-11e7-9dc2-a0369f52eb66  ONLINE       0     0     0
            gptid/f9043142-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/faaca3e0-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/fe51d95e-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/ffdec3a8-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
            gptid/fe4c68e2-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
          raidz2-4                                      DEGRADED     0     0     0
            gptid/60f3bb95-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0   956  (resilvering)
            gptid/608bdff7-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0   795
            gptid/62c7936e-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0   795
            gptid/63121eac-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0   795
            gptid/638c4c58-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0   795
            gptid/64447b50-90d1-11eb-aa66-0cc47abc5340  DEGRADED   145     0   956  too many errors  (resilvering)
            gptid/63b6cfbd-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0   795
            gptid/64b62b27-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0   795
            gptid/64f75ee0-90d1-11eb-aa66-0cc47abc5340  FAULTED  1.11K     0     2  too many errors  (resilvering)
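
For anyone who wants to pull the problem devices out of a long `zpool status` listing like the one above, here is a minimal Python sketch. It is not a TrueNAS tool: `find_suspects` and `parse_count` are hypothetical helper names, the leaf-device filter assumes member names start with `gptid/` as they do in this pool, and the conversion of abbreviated counters such as `1.11K` assumes powers of 1024, so treat those numbers as approximate.

```python
import re

# One row of the config table: NAME STATE READ WRITE CKSUM [optional note].
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)"
    r"\s+(?P<read>[\d.]+[KMGT]?)\s+(?P<write>[\d.]+[KMGT]?)\s+(?P<cksum>[\d.]+[KMGT]?)"
)

# Assumption: abbreviated counters use powers of 1024 (1.11K is roughly 1137).
SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Turn a counter such as '956' or '1.11K' into an approximate integer."""
    if text[-1] in SUFFIX:
        return round(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def find_suspects(status_text):
    """Return leaf devices that are not ONLINE or show any error count."""
    suspects = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match or not match["name"].startswith("gptid/"):
            continue  # skip headers, the pool row, and the raidz2-N rows
        counts = [parse_count(match[k]) for k in ("read", "write", "cksum")]
        if match["state"] != "ONLINE" or any(counts):
            suspects.append((match["name"], match["state"], *counts))
    return suspects


# A few rows copied from the output above (one healthy, three not).
SAMPLE = """\
          raidz2-3                                      ONLINE       0     0     0
            gptid/edd89401-0f58-11e7-96ea-a0369f52eb66  ONLINE       0     0     0
          raidz2-4                                      DEGRADED     0     0     0
            gptid/60f3bb95-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0   956  (resilvering)
            gptid/64447b50-90d1-11eb-aa66-0cc47abc5340  DEGRADED   145     0   956  too many errors  (resilvering)
            gptid/64f75ee0-90d1-11eb-aa66-0cc47abc5340  FAULTED  1.11K     0     2  too many errors  (resilvering)
"""

for name, state, read, write, cksum in find_suspects(SAMPLE):
    print(f"{name}  {state}  read={read} write={write} cksum={cksum}")
```

Note that the first suspect is still listed as ONLINE; it is flagged only because of its checksum count, which is the kind of row that is easy to miss by eye.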
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
A search turned up this, along with other info:
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Also, you did not mention whether you had any SMART failures. Before replacing the drives, it would be good to know whether the drives are actually at fault. When did you last run a SMART Extended/Long test, and did it pass or fail? Do you have any other failing SMART indications? You can replace the drives, but the question is whether they are actually at fault or whether it could be a cable, the controller (as previously indicated), or bad power. You get the idea.

Good Luck, hope all works out.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
What's the model of the Seagate drives? There was a firmware release for the IronWolf series in relation to write cache/NCQ timeouts, but if your drives are white label or shucked OEM that could present a challenge.


Side note: you have two resilvering drives; the "third drive" is ZFS telling you that the vdev itself is degraded overall. Three offline drives in a Z2 make the vdev UNAVAIL.

Given the scale here I'd lean towards "replace with known compatible hardware" - that being the WDs - one at a time.
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
OK, so now I am really confused. I came in this morning and this is what I see:

Code:
  pool: vol1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 28.9G in 03:33:31 with 191 errors on Sun Apr 10 17:01:41 2022


and the offending VDEV:

Code:
          raidz2-4                                      DEGRADED 3.90K     0     0
            gptid/60f3bb95-90d1-11eb-aa66-0cc47abc5340  FAULTED    203     0 3.59K  too many errors
            gptid/608bdff7-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0 3.26K
            gptid/62c7936e-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0 3.26K
            gptid/63121eac-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0 3.26K
            gptid/638c4c58-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0 3.26K
            gptid/64447b50-90d1-11eb-aa66-0cc47abc5340  DEGRADED 6.62K     0 1.09K  too many errors
            gptid/63b6cfbd-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0 3.26K
            gptid/64b62b27-90d1-11eb-aa66-0cc47abc5340  ONLINE       0     0 3.26K
            gptid/64f75ee0-90d1-11eb-aa66-0cc47abc5340  FAULTED  1.11K     0     2  too many errors



Yet all of my data is still available (again, this is just the backup system). I assume that this is because I have only "lost" two drives (faulted), and since I am running Z2, the data is still available until that third drive, the one that is DEGRADED, fails. Once that drive goes, I lose the VDEV and the pool, as I recall.
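
The reasoning above can be written down as a deliberately simplified Python sketch. `raidz_vdev_available` is a hypothetical name, and the model counts only members that ZFS can no longer use at all; it ignores the checksum errors on the remaining members and the 191 errors the resilver already reported, so it says nothing about whether every file is intact.

```python
def raidz_vdev_available(parity, unavailable_members):
    """Simplified rule: a RAIDZ vdev stays available while the number of
    lost members does not exceed its parity level (2 for RAIDZ2)."""
    return unavailable_members <= parity


# raidz2-4 as posted: two members FAULTED, one DEGRADED but still in service.
print(raidz_vdev_available(parity=2, unavailable_members=2))  # still available

# If the DEGRADED member were lost as well, the vdev would exceed its parity.
print(raidz_vdev_available(parity=2, unavailable_members=3))
```

Because the pool stripes across all five vdevs, losing this one vdev would take the whole pool with it, which matches the concern above.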

As a side note, I did take a look at the SMART information (I run both long and short tests), and it does look like these drives are actually bad. I am seeing a high count of reallocated sectors on these drives and zero on all other drives in the system:

Code:
5 Reallocated_Sector_Ct 0x0033 076 076 010 Pre-fail Always - 28152 (0 1)


and

Code:
ATA Error Count: 581 (device log contains only the most recent five errors)
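
To make the attribute row above easier to read, here is a small Python sketch that splits it into named fields. `parse_attribute` and `SmartAttribute` are hypothetical names, and the column order (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw) is assumed from the usual `smartctl -A` table layout rather than taken from the post itself.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value
    worst: int   # worst normalized value seen
    thresh: int  # vendor failure threshold
    raw: int     # raw counter


def parse_attribute(line):
    """Split one attribute row into its fields (assumed column order)."""
    fields = line.split()
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=int(fields[9]),  # first token of the raw column
    )


# The row quoted in the post.
LINE = "5 Reallocated_Sector_Ct 0x0033 076 076 010 Pre-fail Always - 28152 (0 1)"
attr = parse_attribute(LINE)

has_reallocations = attr.raw > 0
below_threshold = attr.value <= attr.thresh

print(f"{attr.name}: raw={attr.raw} value={attr.value} thresh={attr.thresh}")
print(f"reallocated sectors present: {has_reallocations}")
print(f"normalized value at or below threshold: {below_threshold}")
```

On this row the raw count is 28152, yet the normalized value (76) is still above the threshold (10), so a check that looks only at the threshold would not flag the drive. Comparing the raw count against the zero seen on the other drives, as done above, is the more telling signal here.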



So I guess the best course of action (as @HoneyBadger says) is to replace with "known compatible hardware" one drive at a time and hope that the VDEV survives the process. Of course, I now have a totally different problem to deal with. I always list my drives by serial number in a spreadsheet that records where they are located in the system. Unfortunately, it appears that these Seagate OEM drives do not report their serial number:

Code:
da2: Serial 00000000 ; GPTID=gptid/63121eac-90d1-11eb-aa66-0cc47abc5340
da3: Serial 00000000 ; GPTID=gptid/64f75ee0-90d1-11eb-aa66-0cc47abc5340
da4: Serial 00000000 ; GPTID=gptid/608bdff7-90d1-11eb-aa66-0cc47abc5340
da5: Serial 00000000 ; GPTID=gptid/62c7936e-90d1-11eb-aa66-0cc47abc5340
da6: Serial 00000000 ; GPTID=gptid/60f3bb95-90d1-11eb-aa66-0cc47abc5340
da7: Serial 00000000 ; GPTID=gptid/638c4c58-90d1-11eb-aa66-0cc47abc5340
da8: Serial 00000000 ; GPTID=gptid/64b62b27-90d1-11eb-aa66-0cc47abc5340
da9: Serial 00000000 ; GPTID=gptid/64447b50-90d1-11eb-aa66-0cc47abc5340
da10: Serial 00000000 ; GPTID=gptid/63b6cfbd-90d1-11eb-aa66-0cc47abc5340
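
Even without serial numbers, the listing above is enough to translate the gptids that `zpool status` complains about into `daN` device names. A minimal Python sketch, assuming the exact `daN: Serial ... ; GPTID=...` line format shown here; `map_gptid_to_device` is a hypothetical helper, not part of any TrueNAS tooling.

```python
import re

# Matches lines such as: "da3: Serial 00000000 ; GPTID=gptid/64f75ee0-..."
LINE = re.compile(r"^(?P<dev>\w+): Serial (?P<serial>\S+) ; GPTID=(?P<gptid>\S+)$")


def map_gptid_to_device(listing):
    """Build a {gptid: device} lookup from the listing text."""
    mapping = {}
    for line in listing.splitlines():
        match = LINE.match(line.strip())
        if match:
            mapping[match["gptid"]] = match["dev"]
    return mapping


# Rows copied from the listing above.
LISTING = """\
da3: Serial 00000000 ; GPTID=gptid/64f75ee0-90d1-11eb-aa66-0cc47abc5340
da6: Serial 00000000 ; GPTID=gptid/60f3bb95-90d1-11eb-aa66-0cc47abc5340
da9: Serial 00000000 ; GPTID=gptid/64447b50-90d1-11eb-aa66-0cc47abc5340
da2: Serial 00000000 ; GPTID=gptid/63121eac-90d1-11eb-aa66-0cc47abc5340
"""

# The two FAULTED members and the DEGRADED member from the zpool status output.
PROBLEM_GPTIDS = [
    "gptid/60f3bb95-90d1-11eb-aa66-0cc47abc5340",
    "gptid/64f75ee0-90d1-11eb-aa66-0cc47abc5340",
    "gptid/64447b50-90d1-11eb-aa66-0cc47abc5340",
]

lookup = map_gptid_to_device(LISTING)
for gptid in PROBLEM_GPTIDS:
    print(f"{gptid} -> {lookup.get(gptid, 'unknown')}")
```

The device names only identify the disks to the operating system; finding the physical bay still needs something like an enclosure LED or a label.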


I have never seen this before, so now I have absolutely no idea which drive is failing where, since I cannot see the drive serial numbers anywhere!

Code:
=== START OF INFORMATION SECTION ===
Device Model:     OOS12000G
Serial Number:    00000000
LU WWN Device Id: 5 000c50 0b19de024
Firmware Version: OOS1
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 11 03:12:52 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled



Do any of you have good ideas for how to figure this out without smartctl reporting drive serial numbers? I have tried running some of the tools I have used before, but no luck. I assume these tools do not work because the drives are not reporting a serial number, for whatever reason:

Code:
partition  label                                       zpool      device  disk                       size  type  serial           rpm  sas-location   
------------------------------------------------------------------------------------------------------------------------------------------------------
da2p2      gptid/63121eac-90d1-11eb-aa66-0cc47abc5340  vol1       da2     ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
da3p2      gptid/64f75ee0-90d1-11eb-aa66-0cc47abc5340  vol1       da3     ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
da4p2      gptid/608bdff7-90d1-11eb-aa66-0cc47abc5340  vol1       da4     ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
da5p2      gptid/62c7936e-90d1-11eb-aa66-0cc47abc5340  vol1       da5     ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
da6p2      gptid/60f3bb95-90d1-11eb-aa66-0cc47abc5340  vol1       da6     ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
da7p2      gptid/638c4c58-90d1-11eb-aa66-0cc47abc5340  vol1       da7     ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
da8p2      gptid/64b62b27-90d1-11eb-aa66-0cc47abc5340  vol1       da8     ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
da9p2      gptid/64447b50-90d1-11eb-aa66-0cc47abc5340  vol1       da9     ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
da10p2     gptid/63b6cfbd-90d1-11eb-aa66-0cc47abc5340  vol1       da10    ATA OOS12000G             12000  HDD   00000000        7200  SAS3008(0):2#10
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
Try sesutil identify da2 and see if it will light up the indicator LED for a bay.
Thank You, I will give this a try and let you know!
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
The missing drive serial numbers could be because your HBA is not flashed to IT mode?
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
The missing drive serial numbers could be because your HBA is not flashed to IT mode?
No, it was flashed to IT mode well before these drives were added. I have found out that Seagate OEMed these drives and that you can get software to set the drive serial numbers yourself.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I'm curious whether the drive light flashes for you. If not, you could try running a SMART Short test and listening for the drive noise. An engine stethoscope would prove useful. It's not the best way to go about it, but call it a last-ditch option. If you can set your drive serial numbers, that is great news.
 

HeloJunkie

Patron
Joined
Oct 15, 2014
Messages
300
Try sesutil identify da2 and see if it will light up the indicator LED for a bay.
@HoneyBadger - This worked!! Thank you very much for the advice. I was able to identify the drives in question, so now I can go ahead and replace them, since they do, in fact, appear to have errors reported via smartctl.
 