Hello,
I ran into a very strange issue while replacing a disk on a live FreeNAS machine, and I would appreciate some help understanding it. Our setup:
FreeNAS-9.2.1.7-RELEASE-x64 (fdbe9a0)
X9DRi-LN4+
WD 3TB RED disks
LSI SAS2X28
We found our pool degraded with the following:
dmesg:
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 19 a4 b0 00 01 00 00 length 131072 SMID 713 terminated ioc 804b scsi 0 state 0 xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 19 a5 b0 00 01 00 00 length 131072 SMID 513 terminated ioc 804b scsi 0 state 0 xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 a4 69 fe 18 00 00 68 00 length 53248 SMID 230 terminated ioc 804b scsi 0 state 0 xfer 0
(da27:mps0:0:36:0): WRITE(10). CDB: 2a 00 68 5b 39 d0 00 00 08 00
(da27:mps0:0:36:0): CAM status: SCSI Status Error
(da27:mps0:0:36:0): SCSI status: Check Condition
(da27:mps0:0:36:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da27:mps0:0:36:0): Info: 0x685b39d0
(da27:mps0:0:36:0): Error 22, Unretryable error
(da27:mps0:0:36:0): WRITE(10). CDB: 2a 00 68 5b 89 60 00 01 00 00 length 131072 SMID 438 terminated ioc 804b scsi 0 state c xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 e0 38 f0 00 00 18 00 length 12288 SMID 606 terminated ioc 804b scsi 0 state c xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 61 32 88 00 00 10 00 length 8192 SMID 396 terminated ioc 804b scsi 0 state c xfer 0
(da27:mps0:0:36:0): READ(10). CDB: 28 00 37 e0 3a 80 00 00 10 00 length 8192 SMID 383 terminated ioc 804b scsi 0 state c xfer 0
mps0: IOCStatus = 0x4b while resetting device 0x27
da27 at mps0 bus 0 scbus0 target 36 lun 0
da27: <ATA WDC WD30EFRX-68E 0A82> s/n WD-WCC4NNPRU8XK detached
Status:
mirror-12 DEGRADED 0 0 0
gptid/73b167d6-5928-11e4-b007-002590e077bc ONLINE 0 0 0
4602108141042555384 REMOVED 0 0 0 was /dev/gptid/7485d723-5928-11e4-b007-002590e077bc
da27 failed.
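For anyone who wants to spot this state from a script rather than by eye, here is a minimal sketch that picks the non-ONLINE entries out of zpool status text. The function name degraded_devices and the set of states it checks are my own assumptions for illustration, not part of FreeNAS or ZFS:

```python
def degraded_devices(status_text):
    """Return (name, state) pairs for every device line whose state is not ONLINE."""
    # Assumed set of states worth recognizing; extend as needed.
    known_states = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"}
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        # Device lines look like: NAME STATE READ WRITE CKSUM [notes...]
        if len(fields) >= 5 and fields[1] in known_states and fields[1] != "ONLINE":
            found.append((fields[0], fields[1]))
    return found


# Sample text copied from the status output above.
sample = """\
mirror-12 DEGRADED 0 0 0
gptid/73b167d6-5928-11e4-b007-002590e077bc ONLINE 0 0 0
4602108141042555384 REMOVED 0 0 0 was /dev/gptid/7485d723-5928-11e4-b007-002590e077bc
"""

for name, state in degraded_devices(sample):
    print(name, state)
```

On the sample it reports the mirror itself and the removed member, which matches what the GUI showed us.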
We could not run smartctl against the disk or query it in any other way to verify details about it.
There was no "OFFLINE" option under Volume Status, only "REPLACE", whereas the FreeNAS documentation says:
"Once the disk has been replaced and is showing as OFFLINE, click the disk again and then click its “Replace” button. Select the replacement disk from the drop-down menu and click the “Replace Disk” button. Once you click the “Replace Disk” button, the ZFS pool will start to resilver and the status of the resilver will be displayed."
Considering this, I pulled the disk and replaced it. The disk LED blinked and that was it.
The new disk was not recognized at all: nothing in dmesg, "View Disks" or Volume Status >> Replace.
I tried replacing again, nothing.
I used a different disk, nothing.
Reading further, I found that the disk should be taken OFFLINE first, which I then did:
# zpool offline tank /dev/gptid/7485d723-5928-11e4-b007-002590e077bc
mirror-12 DEGRADED 0 0 0
gptid/73b167d6-5928-11e4-b007-002590e077bc ONLINE 0 0 0
4602108141042555384 OFFLINE 0 0 0 was /dev/gptid/7485d723-5928-11e4-b007-002590e077bc
I pulled and reinserted the disk again, but still nothing was recognized.
Luckily, according to zpool status and the person who originally set up the pool, we had a spare available:
spares
gptid/c2e6a3b0-592a-11e4-b007-002590e077bc AVAIL
This turned out to be a spare SSD that was once used as cache but was later removed. It was labelled da25.
I then pulled da25, and FreeNAS confirmed that it noticed the disk being pulled:
da25 at mps0 bus 0 scbus0 target 34 lun 0
da25: <ATA Samsung SSD 850 1B6Q> s/n S1SUNWAF705527B detached
(da25:mps0:0:34:0): Periph destroyed
I put the new WD disk into this slot, and it was instantly recognized:
da25 at mps0 bus 0 scbus0 target 47 lun 0
da25: <ATA WDC WD30EFRX-68A 0A80> Fixed Direct Access SCSI-6 device
da25: Serial Number WD-WMC1T1342056
da25: 600.000MB/s transfers
da25: Command Queueing enabled
da25: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)
da25: quirks=0x8<4K>
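As a sanity check that the replacement really is a full 3 TB drive, the size line above can be reproduced from the sector count. This is just arithmetic on the numbers dmesg printed; the variable names are mine:

```python
# Values copied from the dmesg line for da25.
sectors = 5860533168
sector_size = 512            # bytes per sector, as reported
heads, sectors_per_track = 255, 63

size_bytes = sectors * sector_size
size_mib = size_bytes // (1024 * 1024)                    # dmesg "MB" here is MiB
cylinders = sectors // (heads * sectors_per_track)

print(size_bytes)   # total capacity in bytes
print(size_mib)     # should match the 2861588MB figure
print(cylinders)    # should match the 364801C figure
```

Both derived figures agree with what the kernel reported, so the disk is presenting its full capacity in the spare slot.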
Going back into Volume Status and hitting Replace showed the new WD 3TB disk. A resilver has now started:
Mar 6 13:12:16 san3 notifier: 1+0 records in
Mar 6 13:12:16 san3 notifier: 1+0 records out
Mar 6 13:12:16 san3 notifier: 1048576 bytes transferred in 0.004511 secs (232442604 bytes/sec)
Mar 6 13:12:16 san3 notifier: dd: /dev/da25: short write on character device
Mar 6 13:12:16 san3 notifier: dd: /dev/da25: end of device
Mar 6 13:12:16 san3 notifier: 5+0 records in
Mar 6 13:12:16 san3 notifier: 4+1 records out
Mar 6 13:12:16 san3 notifier: 4677632 bytes transferred in 0.077173 secs (60612291 bytes/sec)
Mar 6 13:12:40 san3 kernel: GEOM_ELI: Device da25p1.eli created.
Mar 6 13:12:40 san3 kernel: GEOM_ELI: Encryption: AES-XTS 128
Mar 6 13:12:40 san3 kernel: GEOM_ELI: Crypto: hardware
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Mar 6 13:12:33 2016
2.54T scanned out of 7.01T at 422M/s, 3h5m to go
151G resilvered, 36.24% done
mirror-12 DEGRADED 0 0 0
gptid/73b167d6-5928-11e4-b007-002590e077bc ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
4602108141042555384 OFFLINE 0 0 0 was /dev/gptid/7485d723-5928-11e4-b007-002590e077bc
gptid/4981e7a1-e38c-11e5-b007-002590e077bc ONLINE 0 0 108 (resilvering)
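The progress figures in that scan line hang together, which can be checked with a little arithmetic. This is a rough sketch that assumes the T and M units zpool prints are binary (TiB and MiB/s); the function name resilver_estimate is made up for the example:

```python
MIB_PER_TIB = 1024 * 1024


def resilver_estimate(scanned_tib, total_tib, rate_mib_s):
    """Return (percent_done, hours_left, minutes_left) from the scan-line figures."""
    percent_done = 100 * scanned_tib / total_tib
    remaining_mib = (total_tib - scanned_tib) * MIB_PER_TIB
    seconds_left = int(remaining_mib / rate_mib_s)
    hours, leftover = divmod(seconds_left, 3600)
    return percent_done, hours, leftover // 60


# Figures copied from the zpool status output above.
percent, hours, minutes = resilver_estimate(2.54, 7.01, 422)
print(f"{percent:.2f}% done, about {hours}h{minutes}m to go")
```

The estimate lands on the same 3h5m that zpool reports; the percentage differs slightly in the last digit because the printed sizes are already rounded.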
However, the spare still shows up in zpool status:
spares
gptid/c2e6a3b0-592a-11e4-b007-002590e077bc AVAIL
Questions remain:
- Why was the replacement disk not automatically found/recognized after I pulled the failed disk from its original slot and inserted the new one?
- What if another disk fails and we need to replace it? We have no other slots available, so it would not be possible to use a 'spare' slot again.
Thank you in advance.