Urgent help - pool looks broken - what should my next steps be with these odd symptoms?

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
My pool has decided to become randomly f**ed. I have no idea what my next steps should be, and I don't want to execute a scrub if it might inadvertently "fix" it by deleting possible recovery data. I could do with some urgent help understanding what might be going on, and what my approach and diagnostic method should be, please!

It could be as simple as a mounting issue (mountpoint somehow lost/incorrect/changed and just needs fixing), or as serious as the pool being f**ed up. I don't know, so I'm checking before doing a thing.

This is what I know/have noticed.

The system

In my signature. Uses heavy-duty enterprise hardware: Xeon, Supermicro, 3-way mirrored pool, mirrored "special vdev" SSDs and logs. 256 GB RAM and the pool is barely 50% full, so no issue with resource starvation. The server has barely been used for 2-3 weeks, except to retrieve the odd media file to watch or save the odd Word document over SMB. Services are basically SSH and SMB, not a lot else. No VMs or jails. No known hardware problems. "zpool status" reports no known data errors, no known device issues, no checksum issues. In other words, overspecified and underused :)

I also haven't logged in for ages, or saved my PW on another device; as far as I know, nothing malicious has been done (if it has, then it's odd that a BSD server on the LAN is hit but no PCs).

Recent events:

All was normal as far as I know, at least till very recently. With such light usage it's hard to be sure. About a week ago, SMB stopped being able to save files on the server. Sometimes services screw up, so I rebooted the server and thought nothing of it. A few days ago (or maybe even late last week??) I figured the box wasn't doing anything, so I might as well run a scrub or two on the pools in all that idle time. Kicked them off, thought nothing of it.

(Relevant background: there have been past issues where scrubs take a lot of CPU and drive the system into a "CPU starvation catatonic" state; iX devs fixed many of those in U4. I haven't tested any of it, but reckon they will have done so, as would the OpenZFS community. So I wasn't worried that a scrub would do anything, and to be fair, it's never caused a problem. At worst it's made the UI/console slow and unresponsive. Never worse.)

Today I wanted to copy a file off the server. First use of it for ages. The server couldn't be reached on the LAN. The console was responsive but locked up easily (Alt-F1, then Alt-F2 when that locked up, etc). I noticed the 2 scrubs were still ongoing - they should easily have ended by now. They seemed to be stalled, nowhere near as far along as they should be. top showed 100% idle, then that console froze. No idea. Maybe it was idle, maybe not. So I rebooted (reboot command at console). It went catatonic during boot. It got to the point where it had identified and loaded the pools and set up GEOM for the swap devices, then... just went dead, totally unresponsive to console, SSH, Alt-F2, anything. Gave it 20 mins, then had to hard reset with the PSU switch. Figured no harm likely - I've written no new data to the pool for ages, and it seemed to have hard-frozen.

I let it reboot into single user mode, manually imported the pools (zpool import -c /data/zfs/zpool.cache -a), cancelled the scrubs on any pools that had them active (zpool scrub -s POOLNAME) and continued normal boot (exit).
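For reference, this is roughly what I typed at the single user prompt (pool name is a placeholder; the cache file path is the usual FreeNAS/TrueNAS one):

Code:
# import all pools listed in the system's cache file
zpool import -c /data/zfs/zpool.cache -a
# stop the stalled scrub on each pool that still had one running
zpool scrub -s WORKPOOL
# leave single user mode and continue the normal boot
exit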

Current status

My main working pool is the one with the issue, so I'll focus on that.

  • zpool list shows all pools exist, and the sizes are as expected. Main working pool exists and looks sensible too.
  • zpool status -v shows all vdevs exist, no offline devices, no checksum errors, and cancelled scrub (as expected)
  • zfs list -r WORKPOOL lists all expected datasets and their sizes, looks completely normal
  • zfs list -t snap -r WORKPOOL | less shows most snaps exist, including snaps taken since the last reboot, for all datasets and subdatasets. It tentatively looks like some automated snaps that should exist haven't been taken, but there's nothing to suggest a corruption issue - it's consistent across all datasets. For example, there should be 3 days of 15-minute snaps, and instead there's only 1 hour of them. But that hour's worth is there for all datasets.
  • zfs get all MYPOOL/sample_dataset shows two oddities: the datasets aren't mounted. And the mountpoints are all wrong (/MYPOOL/dataset, not /mnt/MYPOOL/dataset). The mount command confirms, none are mounted from my working pool.
  • mc, SMB and ls all confirm that the only directory that exists on my working pool is .recycle. That I don't get, because surely if snaps exist, then MYPOOL/dataset/.zfs should exist (it's visible in all dataset settings) - unless the issue is non-mounting again, which I guess is likely. (The commands I used for these checks are collected just below.)
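For completeness, the checks above were roughly these commands (the dataset name is just a sample; the mount output I checked by eye):

Code:
zpool list
zpool status -v
zfs list -r WORKPOOL
zfs list -t snap -r WORKPOOL | less
zfs get all MYPOOL/sample_dataset
mount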


So my suspicion and possible options seem to be...... (and this is where I need help!)


My suspicion is that I'm not seeing any datasets because, after exiting single user mode and resuming boot into the usual OS, the datasets on my working pool weren't mounted (and had the wrong mountpoints anyway), and SMB is automatically exposing a recycler, which is the exception.

But what's a safe way to test this? How do I get back to a visible pool if it's in good order and only mounting is f**ed, without compromising any potentially needed repair/rollback if I try that and it still doesn't work?

Alternatively, is it safe to manually mount a specific dataset from the command line, to test it? If so, what's the command?

Or should I just export the entire pool, then re-import using the TrueNAS UI? But then I'll lose settings related to that pool, such as SMB shares, won't I?

What should I do to restore usability (especially if it is just a mounting issue) without losing anything, in case the pool does in fact have a deeper problem?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
zpool status -v shows all vdevs exist, no offline devices, no checksum errors, and cancelled scrub (as expected)
So your data is fine.

zfs get all MYPOOL/sample_dataset shows two oddities: the datasets aren't mounted. And the mountpoints are all wrong (/MYPOOL/dataset, not /mnt/MYPOOL/dataset). The mount command confirms, none are mounted from my working pool.
A few points... you mention MYPOOL and WORKPOOL and "my working pool"... are they all the same pool?

Or should I just export the entire pool, then re-import using the TrueNAS UI? But then I'll lose settings related to that pool, such as SMB shares, won't I?

What should I do to restore usability (especially if it is just a mounting issue) without losing anything, in case the pool does in fact have a deeper problem?
If the mountpoint needs changing, stop the shares/jails/VMs that need that pool and export it, then zpool import -R /mnt POOL, then zpool export POOL, re-import it via the GUI and restart all your services, etc.
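Something like this (with POOL as your pool name, and SMB and anything else using the pool already stopped):

Code:
zpool export POOL
# -R sets an altroot of /mnt, which puts the mountpoints back under /mnt/POOL
zpool import -R /mnt POOL
zpool export POOL
# then re-import the pool from the TrueNAS GUI and restart your services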

You won't lose any config.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
So your data is fine.


A few points... you mention MYPOOL and WORKPOOL and "my working pool"... are they all the same pool?


If the mountpoint needs changing, stop the shares/jails/VMs that need that pool and export it, then zpool import -R /mnt POOL, then zpool export POOL, re-import it via the GUI and restart all your services, etc.

You won't lose any config.
Yes, all the same pool.

Followed your advice, THANK YOU, looks okay now.

One change, which doesn't worry me: I did the exports and imports in single user mode, for cleanness and to ensure no services or other stuff had the pools in use (had to remount / as rw, trivial). Then I rebooted to ensure clean running and no lingering effects of the single user mode work. When it rebooted after the final export and I went into the GUI, it had already imported the pools; I didn't have to do that in the GUI. Looks good now. Scrubbing again - and this time it seems to be scrubbing with proper, expected progress.
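For anyone hitting the same thing later, the single user mode part was roughly this (from memory, pool name substituted):

Code:
# single user mode mounts / read-only, so remount it read-write first
mount -u -o rw /
# export and re-import with an altroot of /mnt to fix the mountpoints
zpool export WORKPOOL
zpool import -R /mnt WORKPOOL
zpool export WORKPOOL
# then reboot; the GUI had already re-imported the pool by itself
reboot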

THANK YOU

Any guesses what the heck could have happened? I can imagine it locked up on scrub (that'd be a Jira issue if that problem's back). But losing mounting / failing to mount / losing mountpoints? That's plain weird.
 