FreeNAS 9.3 filesystem hang

Status
Not open for further replies.

peterh

Patron
Joined
Oct 19, 2011
Messages
315
A newly installed FreeNAS froze and could barely be restarted.
It is used as an NFS server for 2 VMware hosts, uses only NFS
for sharing, and has a small number of filesystems. Total disk is 29TB.

The sequence as we experienced it started Monday morning.
A migration of a number of VMware hosts was started (from another
fileserver), which caused some load.
During this transfer the decision was taken to remove 2 large files
that had been created "to have something to scrub on", 15TB in total.
The files were removed with 'rm BIG BIG2' from a shell.
The 'rm' did not immediately return to a shell prompt, which did
not concern me, as I could continue with other tasks.

Some minutes later the VMware hosts began to fail.
The shell worked, the GUI worked but was very slow.
Trying to mount a filesystem from another machine failed
(fs01, 10.220.55.18, is the failing server):
[root@fs02] ~# mount 10.220.55.18:/mnt/vol0/vmware-vmdk02 /tmp/fs01
[tcp] 10.220.55.18:/mnt/vol0/vmware-vmdk02: NFSPROC_NULL: RPC: Unable to receive; errno = Connection reset by peer
^C
[root@fs02] ~# mount 10.220.55.18:/mnt/vol0/vmware-vmdk02 /tmp/fs01
[tcp] 10.220.55.18:/mnt/vol0/vmware-vmdk02: RPCPROG_NFS: RPC: Program not registered

From the still-working GUI we tried to stop/start the NFS service; it never
completed. A reboot from the console was attempted.
It got as far as "syncing disks" but never rebooted. A hard reset was done
after several minutes.

After boot it hung on 'Mounting local file systems:'.
^T showed
load 1.1 cmd zpool: 884 [zio->io_cvl]
for a while, then
load 3.1 cmd zpool: 884 [tx->tx_sync_done_cv]
which after a long wait stopped accumulating process time. At that point
we did Ctrl-Alt-Del and the box rebooted.

The next boot also showed a long time in [zio->io_cvl],
but later it became [spa->spa_scrub_in_cv] and finally the system came alive.

The large files were gone, but everything else seemed to be correct.

What happened? And will it come back?


Particulars: this is FreeNAS 9.3
System Information
Hostname fs01.vtd.volvo.se
Build FreeNAS-9.3-STABLE-201503071634
Platform Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
Memory 65476MB

Hardware is a Supermicro 4U ZFS Server (Xeon E5-2600 series)
CPU: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (2100.04-MHz K8-class CPU)
FreeBSD/SMP: Multiprocessor System Detected: 24 CPUs
FreeBSD/SMP: 2 package(s) x 6 core(s) x 2 SMT threads
igb0: <Intel(R) PRO/1000 Network Connection version - 2.4.0>
igb1: <Intel(R) PRO/1000 Network Connection version - 2.4.0>
igb2: <Intel(R) PRO/1000 Network Connection version - 2.4.0>
igb3: <Intel(R) PRO/1000 Network Connection version - 2.4.0>
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15>
ahci0: <Intel Patsburg AHCI SATA controller>
mps2: <LSI SAS2308>
mps2: Firmware: 20.00.00.00, Driver: 16.00.00.00-fbsd
mps2: IOCCapabilities: 5285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
da0 at mps0 bus 0 scbus0 target 0 lun 0
da0: <SEAGATE ST3000NM0023 0004> Fixed Direct Access SCSI-6 device
da0: Serial Number Z1Z4QYS40000C451AW6G
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)

(repeated through da13)

booted from:
da15 at umass-sim0 bus 0 scbus11 target 0 lun 0
da15: <Kingston DataTraveler 3.0 PMAP> Removable Direct Access SCSI-6 device
da15: Serial Number 08606E6D414CBEA1470C93FD
da15: 40.000MB/s transfers
da15: 30008MB (61457664 512 byte sectors: 255H 63S/T 3825C)
da15: quirks=0x2<NO_6_BYTE>
which is mirrored with:
da16 at umass-sim1 bus 1 scbus12 target 0 lun 0
da16: <Kingston DataTraveler 3.0 PMAP> Removable Direct Access SCSI-6 device
da16: Serial Number 08606E6D418ABEA14717F4DE
da16: 40.000MB/s transfers
da16: 30008MB (61457664 512 byte sectors: 255H 63S/T 3825C)
da16: quirks=0x2<NO_6_BYTE>
log & sil on:
ada0 at ahcich0 bus 0 scbus3 target 0 lun 0
ada0: <INTEL SSDSC2BA100G3 5DV10270> ATA-9 SATA 3.x device
ada0: Serial Number BTTV434403FP100FGN
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 95396MB (195371568 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
ada1 at ahcich1 bus 0 scbus4 target 0 lun 0
ada1: <SAMSUNG MZ7WD960HMHP-00003 DXV8C03Q> ATA-9 SATA 3.x device
ada1: Serial Number S1E4NYAFA01902
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 915715MB (1875385008 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad6
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Can you provide a debug file from your server?
 

peterh

Patron
Joined
Oct 19, 2011
Messages
315
Here is one, generated with '-ghTnNStz'.
 

Attachments

  • fndebug-ghTnNStz.tgz
    61.8 KB

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You should use the WebGUI button: System -> Advanced -> Save Debug. :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I have seen the removal of a multiterabyte file take an unreasonably large amount of time; I never looked at it in more detail since we don't normally do this. I'm guessing it is reproducible.

Trying to reboot the server while ZFS is desperately trying to do some major metadata update (my best guess) is a sure recipe for pain, though.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The actual deletion of files (zvols, datasets, etc.) shouldn't take a long time unless the ZFS flag async_destroy is disabled.

You can check your pool with the command

# zpool get all | grep async

If your pool is running the latest flags, this is certainly enabled.
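
A quicker, more targeted check is also possible (a rough sketch; the pool name vol0 is simply the one used later in this thread, so substitute your own). VALUE should read enabled, or active while a background destroy is actually in progress:

# zpool get feature@async_destroy vol0
NAME  PROPERTY               VALUE    SOURCE
vol0  feature@async_destroy  enabled  local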

If this is disabled you have to wait for ZFS to find all the fragments to clear them, which will lock the pool up while ZFS does its internal transaction stuff. Unfortunately, if you reboot the machine because you don't want to wait, you'll only reset the work it has done. It will lock up again on bootup and the only fix is to wait it out.

I have seen people delete a dataset with multi-TBs of data and it has locked a pool up for days. This is why the async_destroy feature was added.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I have seen people delete a dataset with multi-TBs of data and it has locked a pool up for days. This is why the async_destroy feature was added.

That makes async_destroy seem like a bit of a landmine. It still has to do all that metadata work in the background, so if there's a panic or a call for reboot while that is going on, it seems like you'd be in for the hurt during the pool import, since the operation would need to be completed at that time.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Why a landmine? It's the solution to the problem. You destroy the dataset and it immediately disappears. The free space is given back over time as ZFS finds all the blocks and frees them. Pretty cool to watch, actually. I deleted a 2TB dataset once, and it took about 15 minutes for it to clear everything up. :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's a landmine. You need to think operationally.

We've already established that deleting a large dataset can take substantial amounts of time. Even your relatively optimistic "15 minutes" is a fairly long period of time. We presumably agree that this is because of metadata traversal and updating to deallocate newly freed blocks (and in more complex situations to manage snapshot metadata). You just finished indicating that you had seen it take more than a day to do such an update on a multi-TB dataset. Let's imagine neither extreme but assume a 12-hour operation needs to take place.

Pretend you're a service provider.

So at around midnight, automated customer deprovisioning scripts call for the deletion of a dataset, because the customer discontinued service and migrated the data elsewhere. The service provider is not particularly aware that the customer's dataset consists of an incredibly fragmented SQL database that is several terabytes large. Most customer datasets are a few hundred GB and lightly fragmented, and vaporize in 10 minutes or less.

Two AM rolls around, and with it, the four hour window for scheduled downtime maintenance. A critical fix has been released for SSL, and updates are being rolled out across the board. This means operations wants to deploy the latest FreeNAS update and a reboot is imminent.

So. There are two ways I see this unfolding.

1) If async_destroy is not set, then the deprovisioning script hasn't completed, because the command to destroy the dataset hasn't completed, and the provider's provisioning system is therefore aware that there are outstanding major operations on the array. This would preclude a shutdown of the array because the array is busy. This probably greatly annoys the operators on duty, but someone actually looks at the situation and determines that the upgrade cannot happen during this downtime window because the array is busy. This is an "acceptable" outcome.

2) If async_destroy is set, then the deprovisioning script has completed. At 2:10AM, an operator initiates the firmware update. The array, however, is busy in the background. At the end of the firmware update, the update script of course reboots into the new firmware. The FreeBSD kernel begins the userland shutdown, but probably appears to "hang" on the reboot because it is waiting for ZFS to finish the async operation. The operator looks at the box, after it's failed to reboot in a few minutes, and calls for a hard reset. FreeNAS boots and begins the ZFS import. The import, of course, needs to complete the metadata update transaction, so the box is now in limbo for up to the next 12 hours while the operation commits. This takes customer storage offline for a period of at least eight hours beyond the scheduled downtime.

Case 2) is an operational landmine and a disaster.

And, yes, service providers do actually use complex provisioning systems to manage their resources. I've worked on designing and implementing such systems, and the edge cases are always the things that screw you.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think we all agree that #1 is a very bad thing (sync destroy is a cardinal sin for multiple TBs of data). The zpool going offline while the delete operation completes is not an option. I saw this first hand on a system running a version of FreeNAS that was so old that async_destroy didn't exist yet.

For #2, as far as I know, even if you reboot while the free space is being freed, it does not prevent the box from shutting down or starting up. Async destroy is meant to ensure that deletions, regardless of the quantity of data deleted, do not result in the "operational landmine" you are theorizing.

For async destroy, the pool is totally usable and online (free space is freed as it goes). You can export and import the pool at will without problems. The only downside is that while the pool is freeing space it is a bit slower than normal and the space isn't totally deallocated yet.

For those without the flag, the pool goes unresponsive because the space being freed has to occur in a single transaction. So you are down for the count, until the blocks are all rounded up and cleared.

So I think the landmine you are referring to is not an issue at all because it doesn't exist. Am I missing something?
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
The zpool-features man page seems to side with cyberjock on this one:

"When async_destroy is enabled the file system's data will be
reclaimed by a background process, allowing the destroy opera-
tion to complete without traversing the entire file system.
The background process is able to resume interrupted destroys
after the pool has been opened, eliminating the need to finish
interrupted destroys as part of the open operation."

But I suppose being in a man page doesn't necessarily make it true!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think we all agree that #1 is a very bad thing (sync destroy is a cardinal sin for multiple TBs of data).

No, not really, blocking while you perform an operation is generally healthy. You wouldn't expect "dd if=/dev/random of=/mnt/pool/data bs=1048576 count=9999999999" to return instantly, for example. It can be ANNOYING for an operation to block until it is completed. That kind of depends on the context.

zpool going offline while the delete operation completes is not an option.

Yes, bad.

For #2, as far as I know, even if you reboot while the free space is being freed, it does not prevent the box from shutting down or starting up. Async destroy is meant to ensure that deletions, regardless of the quantity of data deleted, do not result in the "operational landmine" you are theorizing.

That implies that there's yet another thread running in the background that might be freeing up space on a pool on which there's no userland activity? Uggh, that introduces other ugliness. Too much for a Sunday morning.
 

peterh

Patron
Joined
Oct 19, 2011
Messages
315
The files removed were about 20TB in size.
async_destroy was enabled by default:
[root@fs01] /var/tmp# zpool get all | grep async
freenas-boot feature@async_destroy disabled local
vol0 feature@async_destroy enabled local

So what you are telling me is that I should expect removal of huge datasets to take many hours? And in the meantime
the system is useless? (The first indication we had that something was wrong was that VMware lost its storage
and no one else could mount datasets from the FreeNAS box.) Is this what is expected from ZFS? Or is it a bug
that was introduced after the fork from Sun?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So what you are telling me is that I should expect removal of huge datasets to take many hours?

Yes, because you've still got to identify what blocks belonged to that dataset and free them.

And in the meantime the system is useless?

No, it shouldn't be. The case Cyberjock and I were discussing is a bit of an edge case. I haven't looked at it in detail and he says it's handled, though as noted below I have doubts.

(The first indication we had that something was wrong was that VMware lost its storage
and no one else could mount datasets from the FreeNAS box.)

Well, now, that's the thing. According to Cyberjock, deletion of a huge dataset can be handled in the background by a task. For me, this brings up more questions than it answers:

Too many ZFS pools are designed without sufficient capacity to perform well under heavy load. In such a case, the existence of a background thread adding load to the pool is problematic.

Being interrupted usually means ZFS has to complete an operation during the pool import, which, for a dataset deletion, could mean a very large delay as I outlined above. Cyberjock says this isn't a problem but your reboot experience seems to suggest otherwise.

There's a reason that scrub and resilver report that an operation is in progress in the zpool status, and why those operations can be stopped from the command line. They have the opportunity to slag pool performance. This is bad enough that eventually tunables were added to help limit it to less-busy periods of pool activity.
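
As an aside on those tunables: on FreeBSD they are sysctls, and the exact names vary between releases, so treat the following as an illustrative sketch rather than a definitive list. The easiest way to see what a given build actually exposes is something like:

# sysctl -a | grep vfs.zfs | egrep 'scrub|resilver|scan'
vfs.zfs.scan_idle: 50
vfs.zfs.scrub_delay: 4
vfs.zfs.resilver_delay: 2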

So I am immediately left wondering things like, how do you tell if a background thread is asynchronously destroying a dataset? What limits or controls exist to moderate the load imposed on a small pool? Can the operation be paused or cancelled? What does the system report for free space, the actual current free space or the computed future free space?

I have no idea whether the problem you experienced was brought on by this potential issue. It could have been unrelated.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No, not really, blocking while you perform an operation is generally healthy. You wouldn't expect "dd if=/dev/random of=/mnt/pool/data bs=1048576 count=9999999999" to return instantly, for example. It can be ANNOYING for an operation to block until it is completed. That kind of depends on the context.

That is not a valid comparison. The block size of 1MB means that each 1MB write is, potentially, a separate transaction. The problem with deleting datasets is that a dataset deletion used to have to be performed in a single transaction. So the transaction opened, then all the blocks had to be found, then all the blocks were freed, as a single transaction. This can, potentially, take a very long time.

The alternative, before async_destroy existed, was to run "rm -R /zpool/dataset/*" and let that finish. Each file deletion was potentially a different transaction, and you could delete 20TB of data in the dataset without locking up the system. Then, after the deletion was complete, you could destroy the dataset itself; by that point it would be 4KB or so and very quick and easy to delete. Basically, the practice was to never delete a dataset that wasn't empty.
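
A rough sketch of that old workaround, assuming a dataset mounted at /mnt/vol0/olddata (the name is made up for illustration; adjust to your layout, and note that any snapshots of the dataset keep pinning the old blocks until they are destroyed too):

# rm -R /mnt/vol0/olddata/*
(each file deletion is its own small transaction, so the pool stays responsive)
# zfs destroy vol0/olddata
(the now nearly-empty dataset destroys quickly)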

For a dd write like you are doing, you are going to see a transaction open, one (or more) blocks written, then the transaction closed. Then another transaction with one (or more) blocks would be created. But if you did something like "dd if=/dev/random of=testfile bs=15TB count=1" (and assuming you had enough RAM to hold a block of that size), it *would* be a single transaction, and things would lock up while dd did the actual write to the pool. That single giant transaction would be a messy affair.

Scrubs can be stopped (zpool scrub -s <poolname>). Resilvers, however, cannot.

I can't tell you how to find out if something is asynchronously destroying a dataset or not. Sorry. That would be something to ask a ZFS dev. I'll ask one of our guys and see if they know how.

I am not sure what moderates the pool workload while the async_destroy process is running. In my experience it seems to run with something that approximates the load of a scrub (resilvers by default are more intrusive than scrubs).

The deletion cannot be paused or cancelled as far as I know. I could maybe believe that a "pause" exists. But I doubt a cancellation is possible since the dataset is already deleted. The async cleanup is nothing more than housekeeping as the actual "deletion" is already done and over.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You can see what space is still allocated but waiting to be freed with the command "zpool get freeing <pool>".
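
For example (illustrative only: the pool name and the number here are made up, and the value counts down toward zero as the background destroy works through the blocks):

# zpool get freeing vol0
NAME  PROPERTY  VALUE  SOURCE
vol0  freeing   1.82T  -

Re-running it every few minutes shows the backlog shrinking.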
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That is not a valid comparison. (followed by a bunch of obsession over blocksize rather than the overall idea of shoving a massive transaction at the pool)

You got the point in the third paragraph, thankfully.

You can see what space is still allocated but waiting to be freed with the command "zpool get freeing <pool>".

That's something I'd never seen before. Interesting.

I am not sure what moderates the pool workload while the async_destroy process is running. In my experience it seems to run with something that approximates the load of a scrub (resilvers by default are more intrusive than scrubs).

That's still uncomfortable, but on the other hand, there's no way to moderate the load on the pool when you're doing it synchronously, either. One might consider the absence of I/O quotas to be a design defect in ZFS ... but now I'm chasing a different bus.
 

peterh

Patron
Joined
Oct 19, 2011
Messages
315
The real issue then seems to be that async_destroy runs at too high a priority, and if a large enough file is deleted,
disturbances are to be expected.
We saw a "dead system" while this happened: NFS mounts would time out and new mounts were refused.
Looks like a bug to me, but not an impossible-to-fix one.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Is your zpool made up of striped mirrors? I've done some large nukes on zpools with async_destroy, and I've definitely seen spikes in latency but nothing that completely shut down access. Mine was via block (iSCSI/FC), not file, but I can't imagine it making a huge difference on that front.
 