Re-adding cache devices seems to hang zpool.

Status
Not open for further replies.

Enlightend

Dabbler
Joined
Oct 30, 2013
Messages
15
Hey guys,

I updated my test server to 9.2-BETA yesterday and all ran well. Then I went looking around and noticed that there were firmware updates for nearly all my drives, SSDs and controllers, so I thought I'd run those while I was at it.

I removed my cache devices from the pool, updated them, secure erased them, and tried to re-add them to the pool, which promptly hangs the zpool.

Once this happens I can no longer use "zfs" or "zpool" unless I reboot the entire system.
Nor can I cd into the mounted volumes without my console hanging (at which point I can still SSH into the system and do things, as long as I do not touch anything ZFS related).

At this point my NFS, iSCSI and CIFS mounts also become unreachable.
There is no CPU usage in top to indicate that zpool is doing anything to prep the drives.
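
For reference, this is roughly the equivalent of what I'm doing (I go through the GUI; the pool and device names below are just placeholders):

zpool remove tank da6        # detach the SSD that was the L2ARC device - this part works
# (firmware update + secure erase of the SSD happens in between)
zpool add tank cache da6     # re-adding it is the step that hangs
zpool status tank            # once it hangs, even this blocks until a reboot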


I was thinking the firmware update had screwed up the SSDs, but then I did a fresh reboot, added those drives to a new zpool as member disks instead, and that operation worked without issue and blazing fast.

I removed the drives from the system, did another secure erase under Windows, created partitions, and ran tests to confirm that I could read/write/delete data on them.



After all the tests I have done, I can only conclude that the SSDs are perfectly fine and the problem is with FreeNAS.

I tried adding the SSDs as cache devices both one by one and all together.


Any logs, outputs, files, info I can post to help diagnose this problem?
Anyone else maybe have this problem with the BETA or RC versions?

EDIT: Additional information:

When I reboot, the system comes up and my zpool is accessible as normal; it just doesn't have the cache drives added.

I also just tried adding other disks to the zpool as cache, and the same thing happens.
The disks I tried ranged from a 320 GB HDD to 128 GB SSDs and even some USB sticks.
So the problem is clearly not with the disks, but with adding cache devices to an existing zpool.

Also, when I did the fresh update, FreeNAS was nagging me that my L2ARC drives weren't using the correct sector size (512 instead of 4K).
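
In case it matters, this is how I checked what the drives report for sector size (ada1 here is just a placeholder for one of the cache SSDs):

diskinfo -v /dev/ada1    # the 'sectorsize' and 'stripesize' lines show whether the SSD reports 512 or 4096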
 

dlavigne

Guest
Sounds like a bug. Please create a report at bugs.freenas.org and post the issue number here.
 

Enlightend

Dabbler
Joined
Oct 30, 2013
Messages
15
Will do when I get the time.

By now I've removed and re-added the pool, tried to add drives as ZIL, etc., and it's still no go.

EDIT: Ya know, I think I once had this on Solaris too, and the only way I could fix it was to destroy the pool and rebuild it.

I'm in training at the moment (all of C# in a month :p) but I should have a 3 day weekend this weekend, so I'll further debug this and file a bug report.
 

Xaro

Cadet
Joined
Dec 9, 2013
Messages
7
Exactly the same issue here with FreeNAS-9.2.0-RC-x64 (93440a9).

Creating raidz2 pool + cache. OK
Removing cache device. OK
Adding same cache device. ZFS hangs

Also reproducible via the CLI.
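Roughly the CLI sequence (disk names are just examples from my test box):

zpool create testpool raidz2 da1 da2 da3 da4    # create the pool
zpool add testpool cache da5                    # add the cache device - OK
zpool remove testpool da5                       # remove it - OK
zpool add testpool cache da5                    # add the same device again - ZFS hangs here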

Did you already create a ticket?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
If nobody has a ticket yet, please don't make one.

I'll try to test this with my test system :)

BRB
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I just removed, re-added, then removed again an L2ARC without problems.

Are there any footer errors you can provide? Mine looks fine:


Dec 10 14:50:37 freenas notifier: ada1 destroyed
Dec 10 14:50:37 freenas notifier: ada1 created
Dec 10 14:50:37 freenas notifier: ada1 destroyed
Dec 10 14:51:15 freenas kernel: GEOM_ELI: Device ada0p1.eli destroyed.
Dec 10 14:51:15 freenas kernel: GEOM_ELI: Detached ada0p1.eli on last close.
Dec 10 14:51:15 freenas notifier: swapoff: removing /dev/ada0p1.eli as swap device
Dec 10 14:51:15 freenas notifier: 1+0 records in
Dec 10 14:51:15 freenas notifier: 1+0 records out
Dec 10 14:51:15 freenas notifier: 1048576 bytes transferred in 0.007186 secs (145915746 bytes/sec)
Dec 10 14:51:15 freenas notifier: dd: /dev/ada1: short write on character device
Dec 10 14:51:15 freenas notifier: dd: /dev/ada1: end of device
Dec 10 14:51:15 freenas notifier: 5+0 records in
Dec 10 14:51:15 freenas notifier: 4+1 records out
Dec 10 14:51:15 freenas notifier: 4800512 bytes transferred in 0.022631 secs (212121729 bytes/sec)
Dec 10 14:51:18 freenas kernel: GEOM_ELI: Device ada0p1.eli created.
Dec 10 14:51:18 freenas kernel: GEOM_ELI: Encryption: AES-XTS 256
Dec 10 14:51:18 freenas kernel: GEOM_ELI: Crypto: hardware
Dec 10 14:51:19 freenas notifier: Stopping collectd.
Dec 10 14:51:23 freenas notifier: Waiting for PIDS: 3782.
Dec 10 14:51:23 freenas notifier: Starting collectd.
Dec 10 14:51:25 freenas notifier: geli: Cannot access ada0p1 (error=1).
Dec 10 14:51:25 freenas notifier: Stopping collectd.
Dec 10 14:51:33 freenas notifier: Waiting for PIDS: 4431.
Dec 10 14:51:33 freenas notifier: Starting collectd.
Dec 10 14:52:22 freenas notifier: ada1 destroyed
Dec 10 14:52:22 freenas notifier: ada1 created
Dec 10 14:52:22 freenas notifier: ada1 destroyed

Some footer info would be useful
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
One note: I did work with someone a few weeks ago on an issue like this. Devices just wouldn't remove from the pool. They were trying to remove a failed disk and couldn't remove it from the CLI either; the command would return without any message.

Ultimately we had to shut down the machine and pull the bad disk so it would be offline. When he started resilvering he had tons of pool errors. He wasn't sure if/when the last scrub had occurred, but something was very wrong. Ultimately he lost some files; he's not sure how long they may have been damaged, since he hadn't accessed them in months.
 

Xaro

Cadet
Joined
Dec 9, 2013
Messages
7
Here, the footer shows nothing after the dd results. In the CLI, it hangs at zpool add -f poolName cache /dev/gptid/...

Is your SSD cache "full 4K compliant"?! ;-) More seriously, maybe it has to do with our conversation yesterday... remember?
 

Xaro

Cadet
Joined
Dec 9, 2013
Messages
7
A screenshot is the only info I can provide in this situation.

If I have time tomorrow, I will try with another device as cache.
 

Attachments

  • Screen Shot 2013-12-10 at 22.29.24.png (34.2 KB)

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No. I can add one, but I doubt it will matter... be right back...
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, I just created a zvol, then tried to add the SSD again. Game over. The GUI is permafrozen, and I tried to do a "zpool status" from the CLI and it's dead.
Ticket: https://bugs.freenas.org/issues/3656
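
For the record, roughly what I did on the test box (device names and sizes are placeholders, from memory):

zpool add testpool cache ada1       # with no zvol on the pool: add/remove works repeatedly
zpool remove testpool ada1
zfs create -V 10G testpool/testvol  # create a zvol on the pool
zpool add testpool cache ada1       # now this hangs, and 'zpool status' hangs with it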
 

Enlightend

Dabbler
Joined
Oct 30, 2013
Messages
15
No, sorry, I wasn't able to make a ticket yet.
I'm in the middle of exams right now. But the hanging of the pool (and of the GUI, if you do it through the GUI) is exactly what I'm getting.

Other than the records in/out, it doesn't seem to do a damn thing, and yes, the only way to get out of the hang seems to be a reboot.
Hence why I asked if we can somehow set the verbosity level to 11 somewhere to see what the heck the system is trying to do.
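I guess the closest thing to verbosity 11 would be to grab the kernel stack of the stuck command next time it hangs, something like this (the PID is whatever the hung zpool shows in ps):

ps axl | grep zpool      # find the PID and wait channel (MWCHAN) of the stuck 'zpool add'
procstat -kk <PID>       # dump the kernel stack of that process to see where it's blocked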

Does this have anything to do with the message nagging that the cache drives aren't 4K aligned?
 

Xaro

Cadet
Joined
Dec 9, 2013
Messages
7
BINGO! I must never doubt my assumptions :tongue:

I don't know if this is related to #3626, but in my case that's what I'm hitting. Can cyberjock perhaps confirm whether he's using a 4K SSD that doesn't lie to the OS and still sees the same phenomenon?
 

Enlightend

Dabbler
Joined
Oct 30, 2013
Messages
15
It seems they found the underlying cause of the problem, which is fairly deep (related to zvols); it was temporarily fixed (a bit dirty) but noted for a proper fix later.

Loading RC2 now, in which this patch should be live, and testing whether it works.

EDIT: RC2 loaded, tried adding the cache drives, still the same issue.

It was stated in the ticket that this is related to zvols existing in the pool.
I'll remove the zvol and see if it works then.

If it does, the supposed fix applied in RC2 does not work in my case and further study needs to be done.
 

TheMattS

Cadet
Joined
Dec 30, 2013
Messages
1
I just ran into the same issue with the final 9.2 release (FreeNAS-9.2.0-RELEASE-x64). I migrated from nas4free, went to re-add my cache device to fix the 4K alignment, and got the hang. Once I removed the zvols in the pool, I was able to add the cache device without issue.
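
In other words, the workaround order that worked for me was roughly this (pool and zvol names and sizes below are just examples, and obviously back up or migrate the zvol data first):

zfs list -t volume              # find the zvols living on the pool
zfs destroy tank/myzvol         # remove them
zpool add tank cache ada1       # the cache device now adds without hanging
zfs create -V 200G tank/myzvol  # recreate the zvol afterwards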
 

Jano

Dabbler
Joined
Jan 7, 2014
Messages
31
Hi

The problem I have looks quite similar to this one, but in a different scenario.

FreeNAS 9.2 (latest)
Pool: 4x1TB RAID 10, no ZIL, no L2ARC.
On this pool there is one zvol (64 KB block size).
What seems important: there is data on this zvol (an ESXi LUN).

I'm trying to add a new vdev (mirror) to this RAID 10. The result (the same from the GUI or the console) is a hang. From a new console, calling zpool does nothing, and "top" shows no load from any zpool processes.
gstat shows no disk activity either (except the system from time to time).
The only solution is to restart the system.
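
For completeness, the console version of what I'm trying (pool and mfidX names below are just placeholders for mine):

zpool add -f mypool mirror mfid6 mfid7   # add another mirror vdev - this hangs when the zvol holds data
zpool status mypool                      # hangs as well; gstat shows no disk activity in the meantime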

The same operation when there is no data on the zvol works correctly (a few seconds and it's done).

Testing with the old-style 512 B block size makes no difference either.
 

Jano

Dabbler
Joined
Jan 7, 2014
Messages
31
Hi, I will add one more interesting thing... After several tries at extending my RAID 10, just for a test I created a new pool called "aaa".

"aaa" was later removed with zpool destroy and is no longer visible in the list of pools... but it appears again after each reboot of the system:

Checking status of zfs pools:
NAME      SIZE   ALLOC   FREE  CAP  DEDUP  HEALTH   ALTROOT
aaa          -       -      -    -      -  FAULTED  -
dev3     1.81T    755G  1.08T  40%  1.00x  ONLINE   /mnt
device1  3.62T   1.19T  2.43T  32%  1.00x  ONLINE   /mnt

  pool: aaa
 state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

        NAME                      STATE    READ WRITE CKSUM
        aaa                       UNAVAIL     0     0     0
          mirror-0                UNAVAIL     0     0     0
            17015168727627443149  UNAVAIL     0     0     0  was /dev/mfid6
            17706946153659279633  UNAVAIL     0     0     0  was /dev/mfid7
            10635481950712008684  UNAVAIL     0     0     0  was /dev/mfid8
            5120756702795181700   UNAVAIL     0     0     0  was /dev/mfid9

-- End of daily output --

Waiting for 9.2.1 final to check results.
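
If 9.2.1 doesn't clear it, I suppose I can also check whether the old ZFS labels are still sitting on the disks and wipe them by hand, something like this (untested, and only if the mfidX devices really aren't used by anything else anymore and the labelclear subcommand exists in this build):

zpool import                     # check whether 'aaa' still shows up as an importable/faulted pool
zpool labelclear -f /dev/mfid6   # wipe the stale ZFS label from each former member disk (repeat for mfid7-9)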
 