My pool has decided to become randomly f**ed. I have no idea what my next steps should be, and I don't want to execute a scrub if it might inadvertently "fix" it by deleting possible recovery data. I could do with some urgent help: what might be going on, and what should my approach and diagnostic method be, please!
It could be as simple as a mounting issue (mountpoint somehow lost/incorrect/changed and just needs fixing), or as serious as the pool being f**ed up. I don't know, so I'm checking before doing anything.
This is what I know/have noticed.
The system
In my signature. Heavy-duty enterprise hardware: Xeon, Supermicro, 3-way mirrored pool, mirrored "special vdev" SSDs and logs. 256 GB RAM, and the pool is barely 50% full, so no resource starvation. The server has barely been used for 2-3 weeks, except to retrieve the odd media file to watch, or save the odd Word document to it (SMB). Services are basically SSH and SMB, not a lot else. No VMs or jails. No known hardware problems. "zpool status" reports no known data errors, no known device issues, no checksum issues. In other words, overspecified and underused :)
I also haven't logged in for ages, or saved my PW on another device; as far as I know, nothing malicious has been done (if it had been, it would be odd for a BSD server on the LAN to be hit but no PCs).
Recent events:
All was normal as far as I know, at least until very recently. With such light usage it's hard to be sure. About a week ago, SMB stopped being able to save files on the server. Sometimes services screw up; I rebooted the server and thought nothing of it. A few days ago (or maybe even late last week??) I figured it wasn't doing anything, so I may as well run a scrub or two on the pools in all that idle time. Kicked them off, thought nothing of it.
(Relevant background: there have been past issues where scrubs take a lot of CPU and drive the system into a "CPU starvation catatonic" state; the iX devs fixed many of those in U4. I haven't tested any of it, but reckon they will have done so, as would the OpenZFS community. So I wasn't worried that the scrub would do anything, and to be fair, it's never caused a problem. At worst it's made the UI/console slow and unresponsive. Never worse.)
Today I wanted to copy a file off the server. First use of it for ages. The server couldn't be reached on the LAN. The console was responsive but locked up easily (Alt-F1, then Alt-F2 when that locked up, etc.). I noticed 2 scrubs were still ongoing - they should easily have ended by now. They seemed to be stalled, nowhere near where they should be. top showed 100% idle, then that console froze. No idea. Maybe it was idle, maybe not. So I rebooted (reboot command at the console). It went catatonic during boot. It got to the point where it had identified and loaded the pools and set up GEOM for the swap devices, then... just went dead, totally unresponsive to console, SSH, Alt-F2, anything. I gave it 20 mins, then had to hard reset with the PSU switch. Figured no harm likely - I'd written no new data to the pool for ages, and it seemed to have hard-frozen.
I let it reboot into single-user mode, manually imported the pools (zpool import -c /data/zfs/zpool.cache -a), cancelled the scrubs on any pools that had them active (zpool scrub -s POOLNAME) and continued the normal boot (exit).
Current status
My main working pool is the one with the issue, so I'll focus on that.
- zpool list shows all pools exist, and the sizes are as expected. Main working pool exists and looks sensible too.
- zpool status -v shows all vdevs exist, no offline devices, no checksum errors, and cancelled scrub (as expected)
- zfs list -r WORKPOOL lists all expected datasets and their sizes, looks completely normal
- zfs list -t snap -r WORKPOOL | less shows most snaps exist, including snaps taken since the last reboot, for all datasets and subdatasets. Tentatively it looks like some automated snaps that should exist haven't been taken, but nothing suggests a corruption issue; it's consistent across all datasets. For example, there should be 3 days of 15-minute snaps, and instead there's 1 hour of them. But that hour's worth is there for all datasets.
- zfs get all MYPOOL/sample_dataset shows two oddities: the datasets aren't mounted, and the mountpoints are all wrong (/MYPOOL/dataset, not /mnt/MYPOOL/dataset). The mount command confirms that none are mounted from my working pool.
- mc, SMB and ls all confirm that the only directory that exists on my working pool is .recycle. That I don't get, because surely if snaps exist then MYPOOL/dataset/.zfs should exist (it's visible in all dataset settings) - unless the issue is non-mounting again, which I guess is likely.
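To sanity-check the mounting theory, I put together this little check. The sample data below is made up (dataset names are placeholders); on the real box the input would be the output of zfs list -H -o name,mountpoint,mounted -r WORKPOOL. It flags any dataset that is unmounted, or whose mountpoint is missing the /mnt prefix:

```shell
# Hypothetical sample of 'zfs list -H -o name,mountpoint,mounted -r WORKPOOL'
# output; tab-separated, dataset names made up for illustration.
sample=$(printf 'WORKPOOL\t/WORKPOOL\tno\nWORKPOOL/media\t/WORKPOOL/media\tno\nWORKPOOL/docs\t/mnt/WORKPOOL/docs\tyes')

# Print any dataset that is unmounted or whose mountpoint lacks /mnt.
result=$(echo "$sample" | awk -F'\t' '$2 !~ /^\/mnt\// || $3 != "yes" {print $1, $2, $3}')
echo "$result"
```

With my pool's actual output piped in, every dataset currently shows up in that list, which fits the "nothing is mounted and the mountpoints lost their /mnt prefix" picture.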
So my suspicions and possible options seem to be as follows (and this is where I need help!)
My suspicion is that I'm not seeing any datasets because, after exiting single-user mode and resuming the boot into the usual OS, the datasets on my working pool weren't mounted (and had the wrong mountpoints anyway), and SMB is automatically exposing a recycler, which is the one exception.
But what's a safe way to test this? How do I get back to a visible pool if it's in good order and only the mounting is f**ed, without compromising any potentially needed repair/rollback if I try that and it still doesn't work?
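If that suspicion is right, I think the root cause is that my manual single-user import skipped the altroot: as I understand it, TrueNAS normally imports pools with -R /mnt, which is what prefixes every mountpoint with /mnt. So what I imagine the fix looks like (shown as text only, NOT run; WORKPOOL is a placeholder for my pool name, and I'm asking first) is:

```shell
# Shown as text only - I have not run these. WORKPOOL is a placeholder.
cmds=$(cat <<'CMDS'
zpool export WORKPOOL
zpool import -R /mnt WORKPOOL   # -R sets altroot, restoring the /mnt prefix
zfs mount -a                    # mount every dataset with canmount=on
CMDS
)
echo "$cmds"
```

Is that the right idea, or does exporting/importing behind TrueNAS's back cause its own problems?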
Alternatively, is it safe to manually mount a specific dataset from the command line, to test it? If so, what's the command?
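For reference, here's what I think the safe single-dataset test would be: mount one dataset read-only at a scratch directory, so nothing on the pool can be written even if something is wrong. Shown as text only (not run; the dataset name is a placeholder), and I'd like confirmation before trying it:

```shell
# Shown as text only - not run. WORKPOOL/somedataset is a placeholder.
cmds=$(cat <<'CMDS'
mkdir -p /tmp/zfs-test
mount -t zfs -o ro WORKPOOL/somedataset /tmp/zfs-test  # FreeBSD mount syntax
ls -la /tmp/zfs-test     # expect the dataset contents to be visible
umount /tmp/zfs-test
CMDS
)
echo "$cmds"
```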
Or should I just export the entire pool, then re-import it using the TrueNAS UI? But then I'll lose settings related to that pool, such as SMB shares, won't I?
What should I do to restore usability (especially if it is just a mounting issue) without losing anything, in case the pool does in fact have a problem?