SOLVED ada drive ordering changed during hotswap

Status
Not open for further replies.

jnitis

Dabbler
Joined
Aug 12, 2012
Messages
12
I'm using:
Code:
FreeBSD freenas.minifrizzle 9.1-STABLE FreeBSD 9.1-STABLE #0 r+16f6355: Tue Aug 27 00:38:40 PDT 2013    root@build.ixsystems.com:/tank/home/jkh/src/freenas/os-base/amd64/tank/home/jkh/src/freenas/FreeBSD/src/sys/FREENAS.amd64  amd64

on this motherboard:
Code:
E350IA-E44

I have a hot-swap Lian Li case with 5 bays, but only 4 are occupied (the motherboard only has 4 SATA ports). The disks show up in the system as ada0-3. I have a single RAIDZ2 array configured across four 3TB disks.

ISSUE: One fine day ada2 started acting up and then failed completely. I ordered a new drive, and when I removed the failed drive (verified by serial #) and replaced it with the new one (all without rebooting), my drive ordering changed from:

ada0 (disk A)
ada1 (disk B)
ada2 (failed disk C)
ada3 (disk D)

to:

ada0 (disk A)
ada1 (disk B)
ada2 (disk D)
ada3 (brand new replacement disk)

As a result, ZFS lost track of disk D because it changed from ada3->ada2. This knocked the RAIDZ2 array down to 2 drives, but luckily I didn't lose any data. I find it very odd that ZFS lost track of the drive, because it certainly knows the drives by more than just the ada#; it also has the gptid (among possibly other IDs), right?

QUESTION: How do I either a) stop this device re-ordering from happening (is this the answer?) or b) make ZFS recognize the drives not by the ada#, which may change, but by their GUID, gptid, WWID, or some other ID?

End result: I ended up rebooting the system, the drives showed back up in their original order, and I kicked off a resilver. I view this as a hack, and suboptimal when hot swapping should work just fine.

Thank you.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You don't lock in the devices. ZFS should be using their gptids. If it isn't, you didn't create the pool with the FreeNAS 8.x+ GUI and you are a bad boy/girl.

Normally, if you create your pool using the GUI, it will use the gptids. What I've recommended in the past is to "remove" and "replace" each disk in the pool one at a time. As each one is resilvered, the gptids will be used. There's probably a better way, but I don't know what it is... but I've got a hunch Dusan is about to show up and blow us away with some cool knowledge.
 

jnitis

Dabbler
Joined
Aug 12, 2012
Messages
12
I had a hunch you'd be the first to reply. :)

I built the box in Aug 2012, when I think we were up to FreeNAS 8.x, and I did indeed build the pool through the GUI. Looking at zpool status, it does look like it sees each disk by its gptid, so why did it lose the ada3 disk when it temporarily got moved (while the system was online) to ada2? Its gptid should have remained the same. Perhaps some manual intervention was required (other than what was needed to replace the failed drive and kick off the resilver)?

In the output below, line 19 (the entry marked "resilvering") is the new disk, and line 18 (the UNAVAIL entry) appears to be the old disk. Why does its id now look like a funky string instead of its original gptid?
Code:
[root@freenas] ~# zpool status -v
  pool: data
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Dec 12 16:26:21 2013
        815G scanned out of 2.14T at 110M/s, 3h33m to go
        197G resilvered, 37.21% done
config:
 
        NAME                                              STATE     READ WRITE CKSUM
        data                                              DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/c710cb49-da87-11e1-a174-8c89a51b863f    ONLINE       0     0     0
            gptid/c79687c1-da87-11e1-a174-8c89a51b863f    ONLINE       0     0     0
            replacing-2                                   DEGRADED     0     0     0
              8603898949929911385                         UNAVAIL      0     0     0  was /dev/gptid/c8222dea-da87-11e1-a174-8c89a51b863f
              gptid/1429e5ab-6307-11e3-bcd9-8c89a51b858c  ONLINE       0     0     0  (resilvering)
            gptid/c8bb0bfb-da87-11e1-a174-8c89a51b863f    ONLINE       0     0     0
 
errors: No known data errors
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
ZFS should never get confused when device names change. It doesn't care about device names or GPTIDs; it uses its internal GUIDs, which are stored in the vdev labels. The FreeNAS GUI, on the other hand, can get confused when device names change and you didn't use GPTIDs, but you did use them. The "funky number" you see for the original disk is the ZFS internal GUID (the last known device name is listed to its right: "was /dev/gptid/...").
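
To make the GUID point concrete, here is a minimal Python sketch that scans saved "zpool status" text for vdevs listed by their numeric GUID and reports the last known device path. The sample text is copied from the output posted above; the function name find_guid_vdevs and the regular expression are illustrative assumptions, not part of ZFS or FreeNAS, and the pattern only handles plain integer error counters.
Code:
```python
import re

# Excerpt of the `zpool status -v` output posted earlier in this thread.
SAMPLE = """\
        NAME                                              STATE    READ WRITE CKSUM
        data                                              DEGRADED    0    0    0
          raidz2-0                                        DEGRADED    0    0    0
            gptid/c710cb49-da87-11e1-a174-8c89a51b863f    ONLINE      0    0    0
            replacing-2                                  DEGRADED    0    0    0
              8603898949929911385                        UNAVAIL      0    0    0  was /dev/gptid/c8222dea-da87-11e1-a174-8c89a51b863f
              gptid/1429e5ab-6307-11e3-bcd9-8c89a51b858c  ONLINE      0    0    0  (resilvering)
            gptid/c8bb0bfb-da87-11e1-a174-8c89a51b863f    ONLINE      0    0    0
"""

# Hypothetical pattern: a vdev whose name is a bare numeric GUID, followed by
# its state, three integer counters, and a "was <path>" note.
GUID_LINE = re.compile(
    r"^\s*(?P<guid>\d+)\s+(?P<state>[A-Z]+)\s+\d+\s+\d+\s+\d+\s+was\s+(?P<path>\S+)\s*$"
)


def find_guid_vdevs(status_text):
    """Return (guid, state, last_known_path) for each vdev listed by GUID."""
    found = []
    for line in status_text.splitlines():
        match = GUID_LINE.match(line)
        if match:
            found.append((match.group("guid"), match.group("state"), match.group("path")))
    return found


if __name__ == "__main__":
    for guid, state, path in find_guid_vdevs(SAMPLE):
        print(f"{guid} is {state}; last seen as {path}")
```
Disks that ZFS can still open keep showing up under their gptid path, so only the missing one falls back to the bare GUID.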
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Hmm, your case got me thinking. FreeBSD should never rename a connected device (ada3->ada2). The only explanation I can think of is a problem with disk D's power connector: when you removed disk C, something got bumped and disk D lost power for a brief period. That would remove the device from the system, and it could be assigned a new name when it reappeared. ZFS will notice that it lost a device, but it will not automatically "reconnect" it when it shows up again. You need to run "zpool online <pool> <device>" to do that.
If this ever happens again, please post the output of dmesg. That could help us figure out what is going on.
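
For illustration, a small Python sketch of that "zpool online" step. It only builds and prints the command unless dry_run is turned off. The helper names zpool_online_cmd and bring_online are hypothetical. The pool name "data" comes from the zpool status output above, and the gptid used in the example is only assumed to be disk D (it is the last member listed), so check it against your own pool first.
Code:
```python
import subprocess


def zpool_online_cmd(pool, device):
    """Build the argument list for `zpool online <pool> <device>`."""
    if not pool or not device:
        raise ValueError("pool and device must be non-empty")
    return ["zpool", "online", pool, device]


def bring_online(pool, device, dry_run=True):
    """Print the command; run it only when dry_run is False (needs root on the NAS)."""
    cmd = zpool_online_cmd(pool, device)
    print(" ".join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if zpool exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Assumption: this gptid is disk D; verify with `zpool status` before running for real.
    bring_online("data", "gptid/c8bb0bfb-da87-11e1-a174-8c89a51b863f")
```
Defaulting to a dry run keeps the sketch safe to try on a machine that has no pool at all.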
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I was wondering if a RAID controller in JBOD mode is being used. It's possible that when any disk is removed, the RAID controller does something funky and reassigns some disks. Are you using a RAID controller, jnitis?
 

jnitis

Dabbler
Joined
Aug 12, 2012
Messages
12
Bingo Dusan, you are the man. That sounds like what happened. I actually had the dmesg output on screen (it was also in /var/log/messages), but I didn't make an effort to save it, thinking /var/log would be stored on a persistent volume. Apparently it's not. By the time I realized this, it had already disappeared from my PuTTY scroll-back buffer.

CJ: FYI I'm using a Lian Li PC-Q25B case. No RAID controller, mobo mentioned in OP set to AHCI mode.
 