ZFS: varying RAIDZ1 access speeds, eventually crashes


Thibaut

Dabbler
I've been a happy FreeNAS user for more than 3 years now and feel pretty comfortable using it. To be honest, I'm more often at the command line than in FreeNAS' web interface, since the operations I perform most often are transfers of pretty large datasets (>1T) from one pool to the other.

So I've become acquainted with the zfs, zpool and even sometimes zdb commands and a few of their options. I generally ssh into the server and create a tmux "dashboard" that gives me a good view of the overall server load: zpool list, zpool iostat PoolName 5 and top, each running side by side in its own tmux pane.
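For the record, the dashboard is nothing fancy; something along these lines, where the session and pool names are just examples:

    # Build the monitoring layout in a detached tmux session, then attach.
    tmux new-session -d -s nasdash 'while :; do clear; zpool list; sleep 5; done'
    tmux split-window -h -t nasdash 'zpool iostat Pool-1 5'
    tmux split-window -v -t nasdash 'top'
    tmux attach -t nasdash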

The server is a SuperMicro, dual Xeon E5-2620v2@2.10GHz (2x16 threads) with 192GB RAM and 36 HDs, running FreeNAS-9.2.1.7-RELEASE-x64.
There are 2 Pools, both configured as Striped RAIDZ(1) of 5x3TB disks each.
Pool-1 aggregates 4 RAIDs for a total capacity of 54.5T
Pool-2 aggregates 3 RAIDs for a total capacity of 40.9T
And, yes, there is one spare disk left for those who counted ;)

When moving large datasets (>1T) from Pool-1 to Pool-2, I observe drastic variations in the read/write speeds reported by zpool iostat. They usually rocket up to 700M/s shortly after the zfs send / receive command is issued, then oscillate around 300M/s for a few minutes, then plummet to around 60M/s, sometimes climbing back to slightly above 100M/s.
From time to time they'll go back up to 300 or even 500M/s for a moment, then back down again.
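For context, a typical transfer is issued roughly like this (dataset and snapshot names are made up for the example):

    # Recursive snapshot of the source, then a full replication stream
    # piped straight into the destination pool.
    zfs snapshot -r Pool-1/projects@xfer
    zfs send -R Pool-1/projects@xfer | zfs receive -Fuv Pool-2/projects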

This wouldn't be too much of a concern in itself, but sometimes the whole zfs service crashes!

Apparently at random, the zfs service stops responding: zpool iostat displays all zeros in its operations and bandwidth read/write columns, the pools show no activity on the hard drives' LED indicators, and any attempt to access datasets from the command line hangs.
The system itself keeps running, though, as top clearly shows other, still active, processes.

Trying to reboot the system from the command line via ssh, or using option 10 (Reboot) in FreeNAS' console menu on the server's own terminal, always gets stuck trying to kill some remaining processes. A hard reset becomes the only option at that point.

I've even tried service zfs restart in this situation, hoping it would restore access to the pools, but to no avail :(

I've observed that, right before some crashes, there was a burst in read/write speeds, reaching around 500M/s before all activity stopped, but that was not always the case.
On some other occasions I noticed the Free Mem reported by top flirting dangerously with 0G, but again, this was not always the case. What is certain is that swap was never used at any time.

I've been observing the system as closely as possible and the only correlation I could make was the following:
When the zfs send / receive command is issued, two zfs processes are spawned, which seems pretty logical, one for send (read) and one for receive (write). The WCPU percentage of those two processes, as reported by top, is clearly proportional to the read/write speeds reported by zpool iostat, which also makes perfect sense.

Although Free Mem, ARC Total and MRU show noticeable fluctuations (tens of GB) during the operation, I couldn't see a clear correlation between, let's say, Free Mem and the actual access speeds.

The described behavior is so unpredictable, other than the fact that it usually happens when moving large datasets, that I've become scared to operate on the server remotely. Having to hit the road at night to the office to physically push the server's power button is a nightmare I've faced more than once!

If you've encountered similar behavior, or have the slightest idea of what might be done to avoid this situation, please post here.
I'd so much like to feel confident that any further zfs send / receive will carry on without bringing the house down; I'd be forever grateful to my savior!
 

Robert Trevellyan

Pony Wrangler
Have you monitored power supply stability and system temperature during these bulk transfers?
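For example, something along these lines while a transfer is running, assuming the coretemp module loads on that board and ipmitool is installed:

    # CPU core temperatures (needs the coretemp kernel module).
    kldload coretemp 2>/dev/null
    sysctl dev.cpu | grep temperature
    # Chassis temperatures, fan speeds and PSU sensors via the BMC.
    ipmitool sdr type Temperature
    ipmitool sdr type "Power Supply"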
 

Thibaut

Dabbler
To be honest, no, I didn't.

However, since the system has redundant power supplies hooked to a sturdy UPS and is enclosed in a refrigerated cabinet, I would be surprised if those were the cause of the crashes.

Regarding CPU usage, the two spawned zfs processes each run on a separate thread. The highest consumption reported by top for each of them peaked at around 70%, and they cycle over only 2 of the 32 available threads in the system.
This means the maximum total load observed during those operations is only around 20% of the system's potential.
 

Stux

MVP
Try disabling UMA. And if that fixes it, report this as a bug.

(Attached screenshot: Screen Shot 2017-05-09 at 1.37.49 PM.png)


https://forums.freenas.org/index.php?threads/swap-with-9-10.42749/page-3

TL;DR: UMA + ARC is buggy and causes memory exhaustion when a server is thrashed.
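If I remember correctly, the knob involved is the vfs.zfs.zio.use_uma loader tunable; please double-check it against the thread and screenshot above. Setting it would look roughly like this:

    # Add as a loader tunable (System -> Tunables in the web UI, or
    # /boot/loader.conf), then reboot:
    vfs.zfs.zio.use_uma="0"
    # After the reboot, confirm it took effect:
    sysctl vfs.zfs.zio.use_uma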
 

Thibaut

Dabbler
@Stux
Thanks for pointing me to that thread.
I've read the thread to better understand what changing this variable means. I'll probably test it on a VM before applying it to the server.
I'm not yet 100% sure I've got it all clear, but it should become clearer while playing with the VM.
I'll certainly report back here if I notice it helps prevent ZFS from crashing!

@rs225
Nope, they're not resumable receives.
The system is still FreeNAS-9.2.1.7, thus FreeBSD 9.2-RELEASE-p10, and I think resumable receives only appeared in FreeBSD 10.3, which would be FreeNAS 9.10.
 

SweetAndLow

Sweet'NASty
There are 2 Pools, both configured as Striped RAIDZ(1) of 5x3TB disks each.
Pool-1 aggregates 4 RAIDs for a total capacity of 54.5T
Pool-2 aggregates 3 RAIDs for a total capacity of 40.9T


What does this mean? How are your disks connected to your motherboard? You do know that you shouldn't use RAID with ZFS?
 

Thibaut

Dabbler
@SweetAndLow
In case you want to better understand what it means, you can find more details here: https://www.zfsbuild.com/2010/06/03/howto-create-striped-raidz-nested-zpool/

Regarding the disk connections, the system is equipped with 2 expander backplanes hosting a total of 36 SATA connections, both backplanes being driven by an LSI host bus adapter card.

And yes, I'm well aware it's a bad idea to put ZFS on top of a hardware RAID configuration, or to have RAID hardware managing the disks. It's even in the post's title: I'm talking RAIDZ all the way, as in the above-mentioned article.
Now, if your question concerns the chosen pool layout, that's a completely different debate: picking an optimal configuration with a large number of disks has many implications and depends greatly on the specific use case.
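To make it concrete, "Striped RAIDZ(1)" here simply means a single pool built from several RAIDZ1 vdevs that ZFS stripes across. Pool-1, for instance, is structurally equivalent to the following (device names are placeholders; FreeNAS actually references the disks by gptid):

    zpool create Pool-1 \
        raidz1 da0  da1  da2  da3  da4  \
        raidz1 da5  da6  da7  da8  da9  \
        raidz1 da10 da11 da12 da13 da14 \
        raidz1 da15 da16 da17 da18 da19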
 

SweetAndLow

Sweet'NASty
I understand; you should just stop using the word RAID when it doesn't represent what you are doing.

 

Thibaut

Dabbler
It's been a while since my initial question, and while I'm not keen on reviving an old topic, I think I should share what I've found since, in case it might be of use to others.

What I've observed is the following:
  • When transferring large recursive datasets from pool to pool, specifically ones containing one large zvol dataset (> 750G), the full data transfer takes place and memory is correctly managed, whatever the transfer rate (100MB/s to 700MB/s).
  • Once all the data has been copied, some time seems to be required to "finalize" the receive. I don't understand the intricacies of ZFS well enough to pinpoint exactly what goes on under the hood, but what I observe is that, for an amount of time proportional to the volume of data transferred, it becomes impossible to write to the destination pool.
    This can last as long as a few minutes when the zvols are over a terabyte.
  • The crash occurs when the next recursive copy launches immediately after a previous, large one. Since at that moment it's impossible to write to the destination pool, the newly received data gets buffered in memory, rapidly filling the remaining available RAM (easy to watch from a second shell; see the sketch after this list)... until there is none left!
  • At this point, even though the OS still responds, all ZFS pools become inactive, showing zero activity in zpool iostat. Restarting the zfs service has no effect, and the only way to revive the pools is to restart the system, which means a hard reboot, since a clean reboot of our FreeNAS system requires access to directories residing on a ZFS pool.
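For anyone trying to reproduce this, the memory drain is easy to watch from a second shell while the copies run; a minimal sketch using the standard FreeBSD counters:

    # Free RAM in bytes = free page count * page size.
    echo $(( $(sysctl -n vm.stats.vm.v_free_count) * $(sysctl -n hw.pagesize) ))
    # Current ARC size versus its configured ceiling.
    sysctl -n kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max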
After observing this behavior, I thought it should be possible to preserve more free RAM so that the system can buffer more data until the destination pool becomes writable again. After some research, it appears there is a tunable (vfs.zfs.arc_max) that is supposed to define the maximum amount of RAM the ZFS ARC cache may use.

As the system I'm working on has 192GB of ECC RAM, I tried setting the tunable as follows:
  • variable: vfs.zfs.arc_max
  • value: 152471339008
  • comment: ZFS ARC limit to avoid out-of-memory crashes
The value is supposed to correspond to 142G, which should leave about 50G of the system's RAM free. Unfortunately, although the system was rebooted with this value set, the observed average free memory is around 30G.
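Concretely, this is what the setting amounts to (I added it via the web UI's Tunables page; setting it directly as a loader tunable would look the same), plus the check I run after rebooting:

    # Loader tunable, equivalent to this line in /boot/loader.conf:
    vfs.zfs.arc_max="152471339008"
    # After the reboot, confirm the kernel picked it up:
    sysctl vfs.zfs.arc_max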

Up until now, the only way I've found to avoid a crash is to manually launch a single transfer for each large zvol dataset, which is quite a pain in my situation.

Any information on how to effectively limit average ZFS ARC memory usage would be welcome. Some sort of script could probably be written to manage the situation, but that's unfortunately a bit beyond my scripting abilities :-( The rough idea would be something like the sketch below.
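A rough, untested sketch of what I have in mind: snapshot once, send the child datasets one at a time, and wait for write activity on the destination pool to die down before starting the next send. Dataset and snapshot names are placeholders, and the "pool is idle" test is crude (any other writer on Pool-2 would defeat it):

    #!/bin/sh
    # Transfer each child dataset separately and let the destination
    # pool settle between sends. Names below are placeholders; the
    # destination parent dataset (DST) must already exist.
    SRC="Pool-1/projects"
    DST="Pool-2/projects"
    SNAP="xfer"

    zfs snapshot -r "${SRC}@${SNAP}"

    # List the direct children of SRC, skipping SRC itself.
    for child in $(zfs list -H -r -d 1 -o name "${SRC}" | tail -n +2); do
        name="${child##*/}"
        echo "Sending ${child}@${SNAP} -> ${DST}/${name}"
        zfs send -R "${child}@${SNAP}" | zfs receive -Fu "${DST}/${name}"

        # Wait until a 5-second zpool iostat sample reports zero write
        # bandwidth on the destination pool, i.e. the flush is done.
        while :; do
            wbw=$(zpool iostat Pool-2 5 2 | tail -1 | awk '{print $7}')
            [ "${wbw}" = "0" ] && break
        done
    done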
 