Resilvering to an UNAVAIL disk? What's going on?

Status
Not open for further replies.

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I replaced a faulty disk yesterday, #5 of 6 disks. Admittedly I'd been running the server for two weeks without the disk at all; unfortunately that was out of my control.
I have RAID-Z2 or whatever it's called, so up to 2 disks can go bad on me.

I replaced disk 5, went into the menus, opted to replace it, and off it went.

Today I see the below: both disks 5 AND 6 are giving me issues.
Code:
[root@freenas] ~# zpool status
  pool: ARRAY
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Mar 21 00:22:19 2016
        6.45T scanned out of 23.9T at 202M/s, 25h11m to go
        8.91M resilvered, 26.95% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        ARRAY                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/7a3a09f4-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/7aaa3c1a-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/2a07e64b-9fc5-11e4-8140-28924a2d5aca  ONLINE       0     0     0
            gptid/7b8d77f7-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            11463431290501954902                        UNAVAIL      5   143     0  was /dev/gptid/410e7d98-ee2b-11e5-9faf-28924a2d5aca
            2815441331035575542                         UNAVAIL     46   129     0  was /dev/gptid/9f120a1c-5531-11e5-95cd-28924a2d5aca  (resilvering)

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Sun Mar 20 03:47:20 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/93096e78-b686-11e4-b873-28924a2d5aca  ONLINE       0     0     0
            gptid/ad8cafcc-b687-11e4-ba56-28924a2d5aca  ONLINE       0     0     0

errors: No known data errors



Now that's OK, because hey, maybe the replacement disk is DOA and disk #6 just happened to die at the same time from the strain? It's possible.
However... how is the resilvering process still running? It seems to imply it's still managing to talk to disk #6 ("(resilvering)"), and I've checked multiple times: it's definitely progressing at about 200M/s, yet disks 5 and 6 are UNAVAIL.
Confused?
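As a sanity check, the ETA that `zpool status` reports follows directly from the scanned/total figures and the scan rate. A minimal sketch, assuming the `T` and `M/s` suffixes mean binary units (TiB and MiB/s):

```python
# Sanity-check the resilver ETA reported by `zpool status`.
# Assumption: T means TiB and M/s means MiB/s (binary units).

def resilver_eta(scanned_tib: float, total_tib: float, rate_mib_s: float):
    """Return (hours, minutes) remaining at the current scan rate."""
    remaining_mib = (total_tib - scanned_tib) * 1024 * 1024  # TiB -> MiB
    seconds = remaining_mib / rate_mib_s
    return int(seconds // 3600), int(seconds % 3600 // 60)

# Figures from the output above: 6.45T scanned of 23.9T at 202M/s
h, m = resilver_eta(6.45, 23.9, 202)
print(f"{h}h{m}m to go")  # roughly 25h, in line with the reported 25h11m
```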
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Also, I'd LOVE to just shut the server down, replace two cables to be sure and boot it back up; maybe it's just a weird anomaly, but I'm kinda terrified to shut it down in its current state with 2 disks down. I feel like I need to wait 24 hours while this process finishes and, presumably, it magically brings disk #6 out of an UNAVAIL state (??)

Or is it resilvering disks 1, 2, 3 and 4? (That's also quite terrifying. Why?)

P.S. I also can't check temps, check basic SMART info or really do ANYTHING with disks 5 and 6; it's as if they truly are unavailable. So what the heck is it resilvering?
 

jdong

Explorer
Joined
Mar 14, 2016
Messages
59
It sounds like you can count the lucky stars that you built a Raid-Z2 and anyone reading this with a RAID-Z1 should be sweating bullets. You indeed suffered a second disk failure while your array is resilvering. If you lose another disk, you will lose the pool.
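The redundancy arithmetic behind that warning is simple: a RAID-Z2 vdev carries two parity disks, so it survives any two concurrent disk failures but not a third. A toy sketch (the helper name is mine, not a ZFS API):

```python
def vdev_survives(failed_disks: int, parity: int = 2) -> bool:
    """A RAID-Z vdev with p parity disks tolerates up to p failed disks."""
    return failed_disks <= parity

print(vdev_survives(2))  # True: the two UNAVAIL disks in this thread
print(vdev_survives(3))  # False: one more failure loses the pool
```

The same check with `parity=1` shows why RAID-Z1 users should be "sweating bullets": the second failure is already fatal.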

At this point, I would personally wait for the resilver to complete, then swap out the 5th disk.
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I always thought those were hypothetical situations which MIGHT occur during a resilver on a RAID setup. I mean, yes, there's more strain, but I do thrash my system a fair bit. I guess it really can happen.
I'm confused by the read and write errors rather than checksum errors, though; previously I only ever got checksum errors, not read/write ones.

Are both disks DEFINITELY faulty in this situation? Could it be a cabling issue?
Finally, is it really resilvering to a faulty disk?
 

jdong

Explorer
Joined
Mar 14, 2016
Messages
59
Read/write errors mean either the disk or the storage bus controller explicitly returned an error. Checksum errors mean that nothing reported an error, but ZFS determined the data itself was wrong.
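That distinction shows up directly in the READ/WRITE/CKSUM columns of `zpool status`. A rough parser for one device line of that output; the fixed column order is an assumption based on the listing above, and the sample line is copied from the OP's pool:

```python
def parse_device_line(line: str) -> dict:
    """Split one `zpool status` config line into name, state and the
    READ/WRITE/CKSUM error counters (columns 3-5)."""
    fields = line.split()
    read, write, cksum = (int(x) for x in fields[2:5])
    return {"name": fields[0], "state": fields[1],
            "read": read, "write": write, "cksum": cksum}

sample = ("2815441331035575542  UNAVAIL  46  129  0  "
          "was /dev/gptid/9f120a1c-5531-11e5-95cd-28924a2d5aca")
dev = parse_device_line(sample)
# Nonzero read/write with zero cksum points at the disk or bus, not silent corruption.
print(dev["read"], dev["write"], dev["cksum"])  # 46 129 0
```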

You could possibly have a cabling issue or a controller issue. It is a little worrisome that the new drive you've put in is already starting to show errors...


Sent from my iPad using Tapatalk
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
So I should let the resilver run for 24 hours, possibly get disk #6 back into working condition, then power off and replace 5?
I'm tempted to replace the cables and attempt to resilver 5 again, just to be certain it's not something else. Is that a bad idea (if I specifically wait until after 6 is behaving again)?
 

jdong

Explorer
Joined
Mar 14, 2016
Messages
59
Might not be a bad idea. Also, I think it's worth running a short or long SMART test on the 5th and 6th drives to see if the drives have any errors. That'll quickly tell you whether it's the drives or the cabling.


Sent from my iPad using Tapatalk
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
I'll do that, thanks.
I HAVE to wait until this is complete, though, because as it stands I can't run a SMART test on 5 or 6 (again, hence my confusion that it claims to be resilvering them); the system cannot see either disk with any of the commands I know of.

You can clearly see that when I run this command against disks 5/6, nothing comes up.
Code:
[root@freenas] ~# smartctl -a /dev/ada0 | grep Temperature
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 42 (Min/Max 21/59)
[root@freenas] ~# smartctl -a /dev/ada5 | grep Temperature
[root@freenas] ~# smartctl -a /dev/ada4 | grep Temperature

[root@freenas] ~# smartctl -a /dev/ada3 | grep Temperature
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 42 (Min/Max 19/55)
[root@freenas] ~# smartctl -a /dev/ada2 | grep Temperature
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 42 (Min/Max 19/54)
[root@freenas] ~#
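As an aside, temperature lines in that format can be parsed rather than eyeballed. A sketch that pulls the current and min/max values out of the `smartctl` attribute line shown above; the regex is an assumption fitted to this particular output format:

```python
import re

# Matches the "... - 42 (Min/Max 21/59)" tail of smartctl's attribute line.
TEMP_RE = re.compile(r"Temperature_Celsius.*?-\s+(\d+)\s+\(Min/Max\s+(\d+)/(\d+)\)")

def disk_temps(smartctl_line: str):
    """Return (current, min, max) in Celsius, or None when the disk
    produced no output at all, as with ada4/ada5 above."""
    m = TEMP_RE.search(smartctl_line)
    return tuple(int(g) for g in m.groups()) if m else None

print(disk_temps("194 Temperature_Celsius 0x0022 100 100 000 "
                 "Old_age Always - 42 (Min/Max 21/59)"))  # (42, 21, 59)
print(disk_temps(""))  # None: the disk is gone, not merely cool
```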
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
The 6th disk would've died anyhow based on this data; this has just nudged it earlier, I guess.
I'll replace 5, then 6, over the next 5 days while it resilvers (3 times, seemingly).
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
What's the output of 'camcontrol devlist'? What's the full output of 'smartctl -a /dev/ada4'?

It also looks like ada0, ada2, and ada3 are running warm--you'll want to take a look at system cooling.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
It would be good to see your SMART output on both drives. All of it.

Did disk 5 resilver at all? Is it possible you replaced a working disk 6 with 'new disk 5', which still has problems due to a cable issue? In other words, port 5 is bad, and now you've got a missing 'disk 5', and are replacing a working disk 6 with a 'new disk 5' on the bad port 5?

If that isn't the case, do you possibly have common cables that would affect both?
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Well, this isn't ideal.
It finally finished, and disk 6, which was resilvering, remained in an offline state after completion.

Code:
[root@freenas] ~# zpool status
  pool: ARRAY
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 8.91M in 27h51m with 0 errors on Tue Mar 22 04:14:15 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        ARRAY                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/7a3a09f4-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/7aaa3c1a-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/2a07e64b-9fc5-11e4-8140-28924a2d5aca  ONLINE       0     0     0
            gptid/7b8d77f7-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            11463431290501954902                        UNAVAIL      5   143     0  was /dev/gptid/410e7d98-ee2b-11e5-9faf-28924a2d5aca
            2815441331035575542                         UNAVAIL     46   129     0  was /dev/gptid/9f120a1c-5531-11e5-95cd-28924a2d5aca

errors: No known data errors

 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Now it's all pretty much falling to pieces; I rebooted and both disks remain UNAVAIL.
Disk 0 is claiming it's going to die soon too.

Code:
1 Raw_Read_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0027 100 100 001 Pre-fail Always - 8891
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 44
5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 050 Pre-fail Offline - 0
9 Power_On_Hours 0x0032 064 064 000 Old_age Always - 14484
10 Spin_Retry_Count 0x0033 100 100 030 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 44
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 46
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 15
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 56
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 44 (Min/Max 21/59)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 253 000 Old_age Always - 0
220 Disk_Shift 0x0002 100 100 000 Old_age Always - 0
222 Loaded_Hours 0x0032 064 064 000 Old_age Always - 14484
223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 0
224 Load_Friction 0x0022 100 100 000 Old_age Always - 0
226 Load-in_Time 0x0026 100 100 000 Old_age Always - 204
240 Head_Flying_Hours 0x0001 100 100 001 Pre-fail Offline - 0

SMART Error Log Version: 1
ATA Error Count: 2
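An attribute dump like the one above can be screened automatically for the few raw values that actually predict failure (reallocated, pending and uncorrectable sectors). A sketch, assuming smartctl's table layout (attribute name in the second column, raw value in the last):

```python
# Raw values worth watching; nonzero here usually means the disk is dying.
WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def failing_attributes(table: str) -> dict:
    """Return {attribute: raw_value} for watched attributes whose raw
    value is nonzero, given smartctl's attribute-table text."""
    bad = {}
    for line in table.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCH and int(fields[-1]) > 0:
            bad[fields[1]] = int(fields[-1])
    return bad

dump = """5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0"""
print(failing_attributes(dump))  # {'Current_Pending_Sector': 16}
```

Run against the ada0 dump above, the only red flag is the 16 pending sectors, which matches the CRITICAL alert quoted in the next post.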
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
" CRITICAL: Device: /dev/ada0, 16 Currently unreadable (pending) sectors"
My understanding is, as long as disks 0, 1, 2 and 3 continue to churn, I'm OK, but only barely?
I need to replace the failing disks pretty much SUPER asap.

I wonder if something bigger is going on causing this, though. I can imagine a faulty board or cables causing issues reading or writing, but bad sectors?
That's why someone earlier suggested a SMART test, right? It's an internal test run by the disk itself, excluding external factors (except power), so any issues from a full SMART test would indicate it's really the disk that has a problem, no?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Could your power supply be bad, or cables shared on disks 5 & 6?

Would it be a good idea to power cycle the whole system and see if 5 & 6 reappear? If they don't but the system sees them, then another replace for both would be necessary.

You definitely don't want to do anything with disks 1 to 4.

Resilvered 8.9MB? Strange.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
You really need to do something about your drive temperatures--I can't say how much (if at all) they're contributing to the failures you're seeing, but they certainly aren't helping. You should also be backing up your data if you haven't already, as you have no redundancy at this point.

To repeat my previous questions: What's the output of 'camcontrol devlist'? Are ada5 or ada6 listed? If so, what's the full output of 'smartctl -a /dev/ada5'?
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
Could your power supply be bad, or cables shared on disks 5 & 6?

Would it be a good idea to power cycle the whole system and see if 5 & 6 reappear? If they don't but the system sees them, then another replace for both would be necessary.

You definitely don't want to do anything with disks 1 to 4.

Resilvered 8.9MB? Strange.

It's certainly a possibility the supply is going, but I'd expect other errors, system reboots, jail crashing etc.
It could just be bad luck, I don't know.
The system was power cycled and one of the disks has been replaced with a brand new one today; I'm resilvering only one at a time. This could be a dumb idea, I don't know.

As for the 8.9MB, my theory is it's the 16 pending dead sectors on disk 0, perhaps?
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
You really need to do something about your drive temperatures--I can't say how much (if at all) they're contributing to the failures you're seeing, but they certainly aren't helping. You should also be backing up your data if you haven't already, as you have no redundancy at this point.

To repeat my previous questions: What's the output of 'camcontrol devlist'? Are ada5 or ada6 listed? If so, what's the full output of 'smartctl -a /dev/ada5'?

Agree, I solved the temp issue a long while back. I had previously tried to make the cooling quiet in the system and that didn't work out. Disks won't exceed 45 / 50c at the very hottest summer day now. A far cry from the 59c they hit last summer.
That 50c is an extreme btw, 99.9% of the time they'll be around 42 / 45c - that's as good as I can do it unfortunately.

"smartctl -a /dev/ada5" for both 4/5 (disk 5/6) didn't work, period - the disks were shut offline by the server entirely.

I am apparently, successfully slivering now.
Code:
[root@freenas] ~# zpool status
  pool: ARRAY
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Mar 22 11:34:12 2016
        90.4G scanned out of 24.0T at 218M/s, 31h56m to go
        15.0G resilvered, 0.37% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        ARRAY                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/7a3a09f4-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/7aaa3c1a-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/2a07e64b-9fc5-11e4-8140-28924a2d5aca  ONLINE       0     0     0
            gptid/7b8d77f7-10cb-11e4-bc4d-441ea13cac6b  ONLINE       0     0     0
            gptid/c366811f-efc5-11e5-aa4f-28924a2d5aca  ONLINE       0     0     0  (resilvering)
            2815441331035575542                         UNAVAIL      0     0     0  was /dev/gptid/9f120a1c-5531-11e5-95cd-28924a2d5aca

errors: No known data errors



Also interesting: there's a firmware change on the new drive.

Code:
[root@freenas] ~# camcontrol devlist
<TOSHIBA MD04ACA500 FP1A>          at scbus0 target 0 lun 0 (ada0,pass0)
<TOSHIBA MD04ACA500 FP1A>          at scbus1 target 0 lun 0 (ada1,pass1)
<TOSHIBA MD04ACA500 FP1A>          at scbus2 target 0 lun 0 (ada2,pass2)
<TOSHIBA MD04ACA500 FP1A>          at scbus3 target 0 lun 0 (ada3,pass3)
<TOSHIBA MD04ACA500 FP2A>          at scbus4 target 0 lun 0 (ada4,pass4)
<General USB Flash Disk 1.00>      at scbus7 target 0 lun 0 (pass5,da0)
<General USB Flash Disk 1.00>      at scbus8 target 0 lun 0 (pass6,da1)
 

diskdiddler

Wizard
Joined
Jul 9, 2014
Messages
2,377
*IF* this all works out and my data makes it back to a safe status (6 x 100% working disks), I'm going to be extremely grateful to cyberjock (IIRC?), who recommended I go with RAID-Z2 vs RAID-Z1 when I set it all up.
I purchased 6 disks with the intention of using 5 for storage and one as a spare, but someone said that would run awfully slow and be misconfigured, and that a 4+2 configuration was better, hence that move.
I've managed to survive without that extra disk's worth of space, thank goodness.

I am still marginally concerned there are bigger issues here than just disks, and that it wouldn't matter if I had 2 spares or 9; eventually my data will go, due to the weird behaviour.
I do have backups of the super-critical stuff. Regardless, it'd be sad to lose my system; it's 'my baby' and well set up for me at the moment.


Final question, guys and girls: the 16 bad clusters/sectors on disk 0 occurred, to my knowledge, during a phase where ONLY 4 disks of the array were in working order. Surely this means I have in fact corrupted some data, correct? Can anyone explain this to me?
 