Thibaut
Dabbler
- Joined
- Jun 21, 2014
- Messages
- 33
I've been a happy FreeNAS user for more than 3 years now and feel pretty comfortable using it. To be honest, I'm more often sitting at the command line than using FreeNAS' interface, since the most common operations I have to perform are transfers of pretty large datasets (>1 TB) from one pool to the other.
So I've become acquainted with the zfs, zpool and even sometimes zdb commands and a few of their options. I generally ssh into the server and create a tmux "dashboard" that gives me a nice view of the overall server load: zpool list, zpool iostat PoolName 5 and top, each running next to the other in its own tmux pane.
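For reference, a minimal sketch of such a tmux dashboard (the session name and the pool name "Pool-1" are placeholders, not my actual names):

```shell
#!/bin/sh
# Sketch of a three-pane tmux "dashboard": pool summary, live pool I/O,
# and top. Session name "zfsdash" and pool name "Pool-1" are placeholders.
tmux new-session -d -s zfsdash "while :; do clear; zpool list; sleep 5; done"
tmux split-window -h -t zfsdash "zpool iostat Pool-1 5"
tmux split-window -v -t zfsdash "top"
tmux attach -t zfsdash
```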
The server is a SuperMicro, dual Xeon E5-2620v2@2.10GHz (2x16 threads) with 192GB RAM and 36 HDs, running FreeNAS-9.2.1.7-RELEASE-x64.
There are 2 pools, both built from striped RAIDZ1 vdevs of 5x3TB disks each.
Pool-1 aggregates 4 RAIDs for a total capacity of 54.5T
Pool-2 aggregates 3 RAIDs for a total capacity of 40.9T
And, yes, there is one spare disk left for those who counted ;)
When moving large datasets (>1 TB) from Pool-1 to Pool-2, I observe drastic variations in the read/write speeds reported by zpool iostat. They usually rocket up to 700 MB/s shortly after the zfs send / receive command is issued, then oscillate around 300 MB/s for a few minutes, then plummet to around 60 MB/s, sometimes climbing back to slightly above 100 MB/s.
Then, from time to time, they'll go back up to 300 or even 500 MB/s for a moment, then back down again.
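For context, the transfers take the usual snapshot-then-pipe form (the dataset and snapshot names below are placeholders, not my actual ones):

```shell
# Snapshot the source dataset, then stream it into the destination pool.
# "mydata" and "@migrate" are placeholder names.
zfs snapshot Pool-1/mydata@migrate
zfs send Pool-1/mydata@migrate | zfs receive Pool-2/mydata
```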
This wouldn't be too much of a concern in the end, but sometimes the whole zfs service crashes!
Apparently at random, the zfs service will stop responding: zpool iostat displays all zeros in its operations and bandwidth read/write columns, the pools show no activity on the hard drives' LED indicators, and all attempts to access datasets from the command line hang.
The system itself is still running, though, as top clearly shows other, still-active processes.
Trying to reboot the system from the command line via ssh, or using option 10 (reboot) at FreeNAS' command line interface on the server's own terminal, always gets stuck trying to kill some remaining processes. A hardware reset becomes the only option at that point.
I've even doubtfully tried service zfs restart in this situation, hoping it would restore access to the pools, but to no avail :(
I've observed that, right before some crashes, there was a burst in read/write speeds, reaching back to around 500 MB/s before all activity stopped. But that was not always the case.
Also, on some other occasions, I noticed the Free Mem reported by top dangerously flirting with the 0G limit, but once again, this was not always the case. What is certain is that swap was never used at any time.
I've been observing the system as closely as possible and the only correlation I could make was the following:
When the zfs send / receive command is issued, two zfs processes are spawned, which seems pretty logical, one for send (read) and one for receive (write). The WCPU percentage of those two processes, as reported by top, is clearly proportional to the read/write speeds reported by zpool iostat, which also makes perfect sense.
Although Free Mem, ARC Total and MRU display noticeable fluctuations (tens of GB) during the operation, I couldn't find a clear correlation between, let's say, Free Mem and the actual access speeds.
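In case it helps anyone reproduce my observations, this is the kind of crude logging loop one could use to capture those metrics side by side for later correlation (the log path and pool name are placeholders; `top -b -d 1` is FreeBSD top's batch mode, printing one display and exiting):

```shell
#!/bin/sh
# Append a timestamped snapshot of top's memory/ARC header lines and the
# pool's I/O stats every 10 seconds. "Pool-1" and the log path are
# placeholders; adjust to your setup.
while :; do
    date >> /var/log/zfs-watch.log
    top -b -d 1 | head -n 8 >> /var/log/zfs-watch.log
    zpool iostat Pool-1 >> /var/log/zfs-watch.log
    sleep 10
done
```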
The described behavior is so unpredictable, except that it usually happens when moving large datasets, that I've become scared to operate on the server remotely. Having to hit the road at night to the office to physically push the server's power button is a nightmare I've faced more than once!
In case you've encountered a similar behavior, or have the slightest idea of what might be done to avoid this situation, please post here.
I'd so much appreciate feeling confident that any further zfs send / receive command will carry on without bringing the house down. I'd be forever grateful to my savior!