Replacing Failed Disk Issue

dddza · Mar 6, 2016

Hello,

I encountered a very strange issue while replacing a disk on a live FreeNAS machine, and I would appreciate help understanding it:

FreeNAS-9.2.1.7-RELEASE-x64 (fdbe9a0)
X9DRi-LN4+
WD 3TB RED disks
LSI SAS2X28

We found our pool degraded with the following:

dmesg:

(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 19 a4 b0 00 01 00 00 length 131072 SMID 713 terminated ioc 804b scsi 0 state 0 xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 19 a5 b0 00 01 00 00 length 131072 SMID 513 terminated ioc 804b scsi 0 state 0 xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 a4 69 fe 18 00 00 68 00 length 53248 SMID 230 terminated ioc 804b scsi 0 state 0 xfer 0
(da27:mps0:0:36:0): WRITE(10). CDB: 2a 00 68 5b 39 d0 00 00 08 00
(da27:mps0:0:36:0): CAM status: SCSI Status Error
(da27:mps0:0:36:0): SCSI status: Check Condition
(da27:mps0:0:36:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da27:mps0:0:36:0): Info: 0x685b39d0
(da27:mps0:0:36:0): Error 22, Unretryable error
(da27:mps0:0:36:0): WRITE(10). CDB: 2a 00 68 5b 89 60 00 01 00 00 length 131072 SMID 438 terminated ioc 804b scsi 0 state c xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 e0 38 f0 00 00 18 00 length 12288 SMID 606 terminated ioc 804b scsi 0 state c xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 61 32 88 00 00 10 00 length 8192 SMID 396 terminated ioc 804b scsi 0 state c xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 e0 3a 80 00 00 10 00 length 8192 SMID 383 terminated ioc 804b scsi 0 state c xfer 0
mps0: IOCStatus = 0x4b while resetting device 0x27
da27 at mps0 bus 0 scbus0 target 36 lun 0
da27: <ATA WDC WD30EFRX-68E 0A82> s/n WD-WCC4NNPRU8XK detached

Status:

mirror-12 DEGRADED 0 0 0
gptid/73b167d6-5928-11e4-b007-002590e077bc ONLINE 0 0 0
4602108141042555384 REMOVED 0 0 0 was /dev/gptid/7485d723-5928-11e4-b007-002590e077bc

da27 failed.
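For anyone retracing this, here is a minimal sketch of pulling the detached device and its serial number out of kernel messages. It assumes the "daN: <model> s/n SERIAL detached" line format seen in the log above; the function name `detached_devices` is made up for illustration.

```shell
# Hypothetical helper: list devices the kernel reported as detached.
# Reads dmesg-style text on stdin and prints "<device> <serial>" per match.
# Assumes the line ends with "s/n SERIAL detached", as in the log above.
detached_devices() {
    awk '/ detached$/ { sub(/:$/, "", $1); print $1, $(NF-1) }'
}

# Two lines copied from the log above, used as sample input.
sample='da27 at mps0 bus 0 scbus0 target 36 lun 0
da27: <ATA WDC WD30EFRX-68E 0A82> s/n WD-WCC4NNPRU8XK detached'
printf '%s\n' "$sample" | detached_devices
```

On a live system the same function could be fed with `dmesg | detached_devices`.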

We could not smartctl the disk or query it in any way to verify details about it.

The Volume Status screen offered no "OFFLINE" option, only "REPLACE". The FreeNAS documentation says:

"Once the disk has been replaced and is showing as OFFLINE, click the disk again and then click its “Replace” button. Select the replacement disk from the drop-down menu and click the “Replace Disk” button. Once you click the “Replace Disk” button, the ZFS pool will start to resilver and the status of the resilver will be displayed."

Considering this, I pulled the disk and replaced it. The disk LED blinked and that was it.

The disk was not recognized. Nothing appeared in dmesg, "View Disks" or Volume Status >> Replace.

I tried replacing again, nothing.

I used a different disk, nothing.

Reading further, I found that the disk should be taken OFFLINE first, which I then did:

# zpool offline tank /dev/gptid/7485d723-5928-11e4-b007-002590e077bc

mirror-12 DEGRADED 0 0 0
gptid/73b167d6-5928-11e4-b007-002590e077bc ONLINE 0 0 0
4602108141042555384 OFFLINE 0 0 0 was /dev/gptid/7485d723-5928-11e4-b007-002590e077bc
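As a rough sketch, the unhealthy members can be picked out of `zpool status` output with a small filter. It assumes the column layout shown above (name, state, three counters, optional "was <path>"); the function name `unhealthy_members` and the list of states it matches are illustrative assumptions.

```shell
# Hypothetical helper: print pool members that are not serving data.
# Reads `zpool status`-style text on stdin and prints
# "<name-or-guid> <state> <previous-path-or-dash>" for each match.
unhealthy_members() {
    awk '$2 ~ /^(OFFLINE|REMOVED|FAULTED|UNAVAIL)$/ {
        prev = ($(NF-1) == "was") ? $NF : "-"
        print $1, $2, prev
    }'
}

# Sample input copied from the status output above.
status='mirror-12 DEGRADED 0 0 0
gptid/73b167d6-5928-11e4-b007-002590e077bc ONLINE 0 0 0
4602108141042555384 OFFLINE 0 0 0 was /dev/gptid/7485d723-5928-11e4-b007-002590e077bc'
printf '%s\n' "$status" | unhealthy_members
```

On a live system this could be run as `zpool status tank | unhealthy_members`.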

I attempted pulling the disk again, still nothing was recognized.

According to our pool (zpool status) and the person who originally set it up, we luckily had a spare available:

spares
gptid/c2e6a3b0-592a-11e4-b007-002590e077bc AVAIL

This ended up being a spare SSD that was once used as cache but then removed. This was labelled as da25.

I then pulled disk da25, and FreeNAS registered the removal:

da25 at mps0 bus 0 scbus0 target 34 lun 0
da25: <ATA Samsung SSD 850 1B6Q> s/n S1SUNWAF705527B detached
(da25:mps0:0:34:0): Periph destroyed

I used this slot for the new WD disk, and it was instantly recognized:

da25 at mps0 bus 0 scbus0 target 47 lun 0
da25: <ATA WDC WD30EFRX-68A 0A80> Fixed Direct Access SCSI-6 device
da25: Serial Number WD-WMC1T1342056
da25: 600.000MB/s transfers
da25: Command Queueing enabled
da25: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)
da25: quirks=0x8<4K>

Going back into Volume Status and hitting Replace showed the new WD 3TB disk. A resilver has now started:

Mar 6 13:12:16 san3 notifier: 1+0 records in
Mar 6 13:12:16 san3 notifier: 1+0 records out
Mar 6 13:12:16 san3 notifier: 1048576 bytes transferred in 0.004511 secs (232442604 bytes/sec)
Mar 6 13:12:16 san3 notifier: dd: /dev/da25: short write on character device
Mar 6 13:12:16 san3 notifier: dd: /dev/da25: end of device
Mar 6 13:12:16 san3 notifier: 5+0 records in
Mar 6 13:12:16 san3 notifier: 4+1 records out
Mar 6 13:12:16 san3 notifier: 4677632 bytes transferred in 0.077173 secs (60612291 bytes/sec)
Mar 6 13:12:40 san3 kernel: GEOM_ELI: Device da25p1.eli created.
Mar 6 13:12:40 san3 kernel: GEOM_ELI: Encryption: AES-XTS 128
Mar 6 13:12:40 san3 kernel: GEOM_ELI: Crypto: hardware

state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Mar 6 13:12:33 2016
2.54T scanned out of 7.01T at 422M/s, 3h5m to go
151G resilvered, 36.24% done
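The "3h5m to go" estimate can be reproduced from the scanned, total and rate figures. This is a sketch that assumes zpool's T and M suffixes are binary units (TiB and MiB); the function name `resilver_eta` is made up for illustration.

```shell
# Hypothetical helper: estimate remaining resilver time.
# $1 = scanned (TiB), $2 = total (TiB), $3 = scan rate (MiB/s)
# Assumes binary units: 1 TiB = 1024 * 1024 MiB.
resilver_eta() {
    awk -v s="$1" -v t="$2" -v r="$3" 'BEGIN {
        secs = (t - s) * 1024 * 1024 / r
        printf "%dh%dm\n", int(secs / 3600), int(secs % 3600 / 60)
    }'
}

# Figures from the status output above: 2.54T of 7.01T at 422M/s.
resilver_eta 2.54 7.01 422   # prints 3h5m
```

The estimate assumes the scan rate stays constant, which it rarely does over a multi-hour resilver.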

mirror-12 DEGRADED 0 0 0
gptid/73b167d6-5928-11e4-b007-002590e077bc ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
4602108141042555384 OFFLINE 0 0 0 was /dev/gptid/7485d723-5928-11e4-b007-002590e077bc
gptid/4981e7a1-e38c-11e5-b007-002590e077bc ONLINE 0 0 108 (resilvering)

However, the spare is still listed in zpool status:

spares
gptid/c2e6a3b0-592a-11e4-b007-002590e077bc AVAIL

Questions remain:

- Why was the replacement disk not automatically detected when it was inserted into the original slot?
- What if another disk fails and needs replacing? We have no other slots available, so using a 'spare' slot would not be possible.

Thank you in advance.

Robert Trevellyan · Mar 6, 2016

dddza said:
Why was the replacement disk not automatically detected when it was inserted into the original slot?

Did you test the removed disk in a different machine? Maybe that port/slot is bad, and the original disk is actually fine?

Did you reboot at any stage of this process? If not, maybe some of your ports don't support hot-swap.

Ericloewe · Mar 6, 2016

Woah, hold it.

If I'm understanding your wall of text correctly, there are a few questions that need to be addressed.

Never use the CLI to mess with your pool. If the GUI won't let you offline the drive, just skip that step.
You seem to be hot swapping the drives. Did you validate this? If not, do not attempt to hot swap.
Prior to physically removing the spare, you should've removed it from the pool.

DrKK · Mar 6, 2016

I read his post this morning like 5 times, and didn't understand it all. Then, we looked at it in IRC, and didn't understand it at all. If you are understanding it, Eric, then kudos to you sir.

Ericloewe · Mar 6, 2016

DrKK said:
I read his post this morning like 5 times, and didn't understand it all. Then, we looked at it in IRC, and didn't understand it at all. If you are understanding it, Eric, then kudos to you sir.

I doubt that I am, to be honest. Just picked up a few red flags, but I did not get a coherent timeline out of it.

dddza · Mar 7, 2016

Thank you for the feedback. Apologies if something is not clear. I tried to provide as much detail as possible about my findings.

In short:
The pool was degraded. This was confirmed by doing a 'zpool status'.
The failed disk, da27, was identified and pulled from the machine. A new disk was inserted, but the system did not detect it. There was no report or indication that a new disk had been added.
Then I found that an SSD was marked as a 'spare', so I pulled that device (da25) and put the replacement WD disk in that slot, where it was instantly identified.
I then used the CLI to "replace" the failed disk (da27) with it, and the resilver started and completed.

I will be testing the drive a bit later to confirm whether the disk has failed or whether we have a backplane/slot problem.

I really hope this sums up the problem.

dddza · Mar 7, 2016

Robert Trevellyan said:
Did you test the removed disk in a different machine? Maybe that port/slot is bad, and the original disk is actually fine?

Did you reboot at any stage of this process? If not, maybe some of your ports don't support hot-swap.

The system was not rebooted. Hot-swap is definitely supported as we set it up this way and also confirmed it by pulling each disk before creating the pool to 100% confirm drive number/slots etc. Each disk was identified and instantly recognized by the system when testing.

Klontje · Mar 7, 2016

Reading your story, I think your cable, hot-swap bay or backplane might be faulty, since the replacement disk wasn't identified when put in the same slot as your defective drive but did work in another slot. Long story short, your 'defective' disk might be good, and one of your other components may have failed you.

Robert Trevellyan · Mar 7, 2016

dddza said:
The system was not rebooted.

What about the other question?
