Unable to ONLINE replacement disk on ~100TB volume

Status
Not open for further replies.

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
I have a drive whose SMART error counts started to increase, so as a preventive measure I added a replacement disk in a spare bay and resilvered to it. I then OFFLINED the new drive, pulled out the old drive, and inserted the new one in its place. But when I then attempted to ONLINE the new drive, I got the following error:

Code:
Jun 26 12:10:01 freenas notifier: swapoff: /dev/da28p1.eli: No such file or directory
Jun 26 12:10:01 freenas notifier: geli: No such device: /dev/da28p1.
Jun 26 12:10:01 freenas notifier: 1+0 records in
Jun 26 12:10:01 freenas notifier: 1+0 records out
Jun 26 12:10:01 freenas notifier: 1048576 bytes transferred in 0.471564 secs (2223613 bytes/sec)
Jun 26 12:10:01 freenas notifier: dd: /dev/da28: short write on character device
Jun 26 12:10:01 freenas notifier: dd: /dev/da28: end of device
Jun 26 12:10:01 freenas notifier: 5+0 records in
Jun 26 12:10:01 freenas notifier: 4+1 records out
Jun 26 12:10:01 freenas notifier: 4284416 bytes transferred in 0.012649 secs (338714200 bytes/sec)
Jun 26 12:10:06 freenas notifier: swapoff: /dev/da28p1.eli: No such file or directory
Jun 26 12:10:06 freenas notifier: geli: No such device: /dev/da28p1.
Jun 26 12:10:06 freenas manage.py: [middleware.exceptions:38] [MiddlewareError: Disk replacement failed: "invalid vdev specification, use '-f' to override the following errors:, /dev/gptid/cdba2e5c-1c1d-11e5-8714-0cc47a3311b4 is part of active pool 'v1', "]


Any ideas?

Here's the action I'm trying to perform that throws the above error:

onlineerror.PNG


This is the third drive I have replaced, following the exact same procedure. The only difference I can think of is that I'm on a newer stable release now (FreeNAS-9.3-STABLE-201506232120) than I was when replacing the previous 2 disks.

Should I reboot the server and then attempt another ONLINE action?

Resilver itself completed with no errors. In fact, it was flying this time:

Code:
  pool: v1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 26 06:22:25 2015
        17.0T scanned out of 58.4T at 3.82G/s, 3h5m to go
        208G resilvered, 29.10% done
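
As a sanity check, the "3h5m to go" figure in that output can be reproduced by hand. This is a minimal sketch, assuming zpool reports binary units (1T = 1024G) and that the scan rate holds steady; `resilver_eta` is a hypothetical helper name, not part of any ZFS tooling:

```python
def resilver_eta(scanned_t, total_t, rate_g_per_s):
    """Estimate time remaining as (hours, minutes), rounded to the nearest minute.

    Assumes binary units (1T = 1024G) and a constant scan rate.
    """
    remaining_g = (total_t - scanned_t) * 1024
    minutes = round(remaining_g / rate_g_per_s / 60)
    return divmod(minutes, 60)


# Figures from the status output above: 17.0T of 58.4T scanned at 3.82G/s.
hours, minutes = resilver_eta(17.0, 58.4, 3.82)
print(f"{hours}h{minutes}m to go")
```

The scan rate can change as the resilver proceeds, so treat the result as a rough guide only.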
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
I attempted to do the ONLINE from the command line as follows:

Code:
[root@freenas] /mnt/v1/jails/transmission_1/media/incomplete# zpool online v1 10187186815997073956
warning: device '10187186815997073956' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

And a zpool status now shows:

Code:
raidz2-2  DEGRADED  0  0  0
  gptid/feb677d8-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  gptid/00250d90-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  gptid/019ce015-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  gptid/0369152c-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  gptid/04b6d537-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  gptid/055877c8-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  gptid/05f94c13-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  10187186815997073956  UNAVAIL  0  0  0  was /dev/gptid/3afea948-1bed-11e5-8714-0cc47a3311b4
  gptid/077b67b3-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  gptid/08454fcc-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0


Did the replacement disk I just resilvered just die? I tested it only a few weeks ago, ran badblocks over the whole thing, and it passed with flying colors.

I wonder if I should put the old drive back in for now and do another resilver to a fresh spare?
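
In a long status listing the one unhealthy line is easy to miss. Here is a minimal sketch for picking it out, assuming the usual `zpool status` layout where each device row starts with a name followed by a state; `unhealthy_devices` is a hypothetical helper, and the sample text is trimmed from the listing above:

```python
def unhealthy_devices(status_text):
    """Return (name, state) pairs for every vdev or device that is not ONLINE."""
    known_states = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].endswith(":"):
            continue  # skip blanks and header lines such as "state: DEGRADED"
        if fields[1] in known_states and fields[1] != "ONLINE":
            found.append((fields[0], fields[1]))
    return found


sample = """\
raidz2-2  DEGRADED  0  0  0
  gptid/feb677d8-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
  10187186815997073956  UNAVAIL  0  0  0  was /dev/gptid/3afea948-1bed-11e5-8714-0cc47a3311b4
  gptid/077b67b3-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
"""

for name, state in unhealthy_devices(sample):
    print(name, state)
```

The degraded vdev is reported along with the missing member, which shows which raidz2 group has lost redundancy.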
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
pclausen said: "I wonder if I should put the old drive back in for now and do another resilver to a fresh spare?"
I suggest carefully following the directions for replacing a failed drive, as if you're starting from scratch, but at step 2:
  1. Remove the disk you added.
  2. Wipe said disk, e.g. with a SATA secure erase command, or DBAN.
  3. Pretend it's a new disk and proceed to step 3.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
I agree, you need to wipe the "new" disk and start over.

My additional advice is that when you replace a failing hard drive, do not replace it the way you did. You installed a new drive, then you resilvered it, then you offlined it. Why would you offline it just to physically move the drive? I would have removed the failed drive and placed the new drive there in the first place, and then you wouldn't have this issue. Also, if I needed to physically relocate a drive in the system, I'd shut down my system first, then relocate it. This would ensure I didn't introduce issues into the system.
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
Thanks guys. Just to be clear, the original drive had not failed. I was just getting SMART errors and decided to be proactive by going ahead and replacing it, following the "Replacing Drives to Grow a ZFS Pool" procedure.

So your suggestion is to instead shut down the whole system, pull the drive with SMART errors (still in good working condition), insert the new drive, and then resilver with the vdev in a degraded state?

I thought it was safer to do a replace so that the vdev retained the dual parity protection during the entire procedure. I did have a hiccup during my 2nd hot swap replacement, but the system recovered immediately. I think I have learned my lesson and will always shut the system down completely now before removing a disk with SMART errors and replacing it with a new drive.

Anyway, I pulled another cold spare off the shelf and replaced the UNAVAIL drive with it, powered back up, and noted there was no "Offline" button, only a "Replace" one, so I proceeded to click that and kicked off the resilvering process. It seems to be moving along nicely.

Code:
  pool: v1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 26 15:58:14 2015
        1.72T scanned out of 58.4T at 2.55G/s, 6h20m to go
        20.7G resilvered, 2.95% done
config:

  NAME  STATE  READ WRITE CKSUM
  v1  DEGRADED  0  0  0
    raidz2-0  ONLINE  0  0  0
      gptid/46e89e6f-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/4867c943-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/49d42bec-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/4b479933-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/4cade6d6-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/4e19bc54-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/4f8ad88f-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/50f364ea-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/525bf8ae-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/53c8e5d1-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
    raidz2-1  ONLINE  0  0  0
      gptid/9c79397f-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/9df01de9-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/9f4513d9-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/a0b933da-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/a2170ba8-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/a38a1d1f-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/a4d74104-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/a63d77df-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/a7b8fd94-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/a924bab6-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
    raidz2-2  DEGRADED  0  0  0
      gptid/feb677d8-0af7-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/00250d90-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/019ce015-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/0369152c-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/04b6d537-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/055877c8-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/05f94c13-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      replacing-7  UNAVAIL  0  0  0
        10187186815997073956  UNAVAIL  0  0  0  was /dev/gptid/3afea948-1bed-11e5-8714-0cc47a3311b4
        gptid/ab2d8839-1c3d-11e5-aa3d-0cc47a3311b4  ONLINE  0  0  0  (resilvering)
      gptid/077b67b3-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
      gptid/08454fcc-0af8-11e5-b7f4-0cc47a3311b4  ONLINE  0  0  0
    raidz2-3  ONLINE  0  0  0
      gptid/ce748bee-13dd-11e5-a72f-0cc47a3311b4  ONLINE  0  0  0
      gptid/8cf979ca-0f5e-11e5-b736-0cc47a3311b4  ONLINE  0  0  0
      gptid/8d7c3c2d-0f5e-11e5-b736-0cc47a3311b4  ONLINE  0  0  0
      gptid/8e096da8-0f5e-11e5-b736-0cc47a3311b4  ONLINE  0  0  0
      gptid/8e8350e0-0f5e-11e5-b736-0cc47a3311b4  ONLINE  0  0  0
      gptid/8f0947fd-0f5e-11e5-b736-0cc47a3311b4  ONLINE  0  0  0
      gptid/dc7fa34f-13a6-11e5-a72f-0cc47a3311b4  ONLINE  0  0  0
      gptid/9296a811-0f5e-11e5-b736-0cc47a3311b4  ONLINE  0  0  0
      gptid/944d1606-0f5e-11e5-b736-0cc47a3311b4  ONLINE  0  0  0
      gptid/958b7e4c-0f5e-11e5-b736-0cc47a3311b4  ONLINE  0  0  0
    raidz2-4  ONLINE  0  0  0
      gptid/6f420640-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/70132a6b-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/70e74cfa-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/71b38ef5-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/728e4dc6-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/736eadd4-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/7451c560-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/752e29c9-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/760c9916-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0
      gptid/76f39a48-107f-11e5-a691-0cc47a3311b4  ONLINE  0  0  0

errors: No known data errors


Hopefully I'll be able to stop biting my nails in another 6 hours or so.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
pclausen said: "Thanks guys. Just to be clear, the original drive had not failed. I was just getting SMART errors and decided to be proactive by going ahead and replacing it, following the Replacing Drives to Grow a ZFS Pool procedure."
In that case, perhaps the only mistake you made was offlining the new drive. There's no need to do that just to physically move a drive. Offlining a drive means you're telling ZFS that you're removing it from the array.
pclausen said: "I thought it was safer to do a replace so that the vdev retained the dual parity protection during the entire procedure."
I believe it is, if you have a spare drive port.
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
Ok thanks. Yes, the only reason for offlining the drive was to move it to the bay previously occupied by the drive I was replacing. As I expand the volume by adding additional vdevs, I'd like the vdevs to be clustered together physically within each enclosure, but that's just me being anal I suppose. If I do decide to maintain a certain physical order in the future, I'll just shut down the system first, or at least the chassis holding the FreeNAS mobo.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
pclausen said: "the only reason for offlining the drive was to move it to the bay previously occupied by the drive I was replacing"
Yup, completely unnecessary, and the cause of your problem. ZFS places a unique ID on each drive, so you can move them around all you want and everything 'just works'. But I agree with @joeschmuck, it's best to shut down the box first. One reason is that there's always a chance that the swap partition is in use.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
pclausen said: "I thought it was safer to do a replace so that the vdev retained the dual parity protection during the entire procedure."
I agree, it is safer to do this, but step 5 of the procedure wasn't followed correctly: you offlined the drive you had just added, which, as explained above, was incorrect. If it's not reasonable to shut down a system like yours because the data always needs to be available, you can leave your drives in different locations, keep a record of where they are physically located, and move them back when you can schedule a shutdown period. A label maker works wonders for marking the last 4 digits of each drive's serial number on the front of the drive cage, or I like the FreeNAS GUI, where you can add a comment to the drive such as "Bay 3".
 

pclausen

Patron
Joined
Apr 19, 2015
Messages
267
Thanks guys. I'll definitely modify my process going forward. I do maintain a spreadsheet where I track changes, GUIDs, etc.:

freenaspooldisk6-26-15.PNG


Yes, I plan to update the comment for the drive to include chassis and bay #s from the spreadsheet.

Here's my "server room":

servers6-26-15.JPG


So there are a lot of bays to track!

7 more minutes and the 120TB volume should no longer be in a degraded state!
 
Joined
Oct 2, 2014
Messages
925
mhmmmmmm so much storage
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Holy crap, now that is a server room! Yea, my wife would not be happy if I took a room and did something like that to it. Also I have no idea what I'd do with all that storage, maybe get a few high speed internet lines and rent out server space?
 
Joined
Oct 2, 2014
Messages
925
joeschmuck said: "Holy crap, now that is a server room! Yea, my wife would not be happy if I took a room and did something like that to it. Also I have no idea what I'd do with all that storage, maybe get a few high speed internet lines and rent out server space?"
Download all the internets, and she doesn't have to know... If you ever renovate, just make a secret room. There are plenty of Instructables ideas out there :P
 