FreeNAS Crashing on heavy data transfers

Status
Not open for further replies.

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Now I suspected the MTU=9000 and set it back to default; again, things seem stable (oddly, the ifconfig below shows "JUMBO_MTU" in options, but I'm not sure where it might be getting that).

It means the hardware's capable of it. Doesn't mean it's actually a good idea to use; might be a good/great idea to get rid of it.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
dedup is off.

And it's never been on?

And whatever process is eating the memory is invisible to "ps aux". "top" shows the memory disappearing... and even at the point that it starts to swap, "ps aux | sort -nk +4" continues to show manage.py as the top memory consumer, sitting at just 0.8% of memory even as free memory slides to 0. So whatever is using it appears to be invisible to the ps command.

Try inspecting with top, which reports ARC statistics on the fifth line; that's probably where the memory is going. Swapping is perfectly acceptable and expected on a small-memory system under extreme memory pressure.
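
On FreeBSD's top the ARC line looks something like this; the numbers here are purely illustrative:

ARC: 24G Total, 14G MFU, 8650M MRU, 1120K Anon, 181M Header, 1020M Other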

except vfs.zfs.arc_max which I set to 16GB

You have no reason to do that, so don't. It increases memory pressure even more. Delete this tunable.
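
If you want to confirm what the kernel actually ends up using after removing it, something like this at the CLI prints the current cap in bytes:

sysctl vfs.zfs.arc_max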

My thought at this point: perhaps the test iSCSI target I built on a zvol, with the block size intentionally set to 512 bytes to match iSCSI and using 100% of the zvol, is particularly unstable?

I only wanted one RAID'd file storage system to maintain on my network. This behavior makes me think that I can't have iSCSI on FreeNAS and will need to install a hardware RAID card and a couple of HDDs in the ESXi host to get the storage off FreeNAS.

Use a larger block size. The default that is recommended at volume creation time is "probably" fine, maybe not optimal, but shouldn't cause significant issues.
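
If you're curious what a given zvol was actually created with, something like this shows it (the pool/zvol name here is just a placeholder):

zfs get volblocksize tank/iscsi-test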

I notice mention of 2 x 256GB SSD in your signature. Of course you are not using that for L2ARC because you have a tiny amount of RAM, and the manual says not to add L2ARC on a system with less than 64GB RAM. That's because in order to support 512GB of L2ARC you'd need ... oh maybe 96-128GB of RAM. Anything smaller than that could result in all sorts of fun such as lockups. Especially if you had made the tremendous error of trying dedup on the pool at one point.

My eyes go crossed trying to search for stuff in this thread, but if your pool is fairly full it'll also have all sorts of other interesting problems. A pool containing iSCSI probably needs a minimum of 50% free headroom in order to perform well over time, and anything less than 64GB of RAM can get kind of dodgy in various ways. You can absolutely have iSCSI on FreeNAS and have it be awesome, but it takes big resources. The 7TB of VMware iSCSI storage here takes 52TB of raw disk space, 128GB RAM, an E5-1650v3, and 2 x 512GB Samsung 950 SSD's and we have a pretty nice box out of it.
 

rruss

Dabbler
Joined
Sep 14, 2015
Messages
35
jgreco - I was originally reading it as an option that was enabled rather than an option that it is capable of.

For this description, I believe it is running with default MTU=1500:
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO>

With the memory overflowing, I'm pretty sure this isn't a network problem at this point. I think the networking issue was just a symptom of the memory overflow, which I wasn't able to see while monitoring FreeNAS from the web GUI. By the time the system started swapping, the graphical reporting was dead, so I was assuming that networking was the first thing to go when in reality it was memory filling up and swapping.
 

rruss

Dabbler
Joined
Sep 14, 2015
Messages
35
And it's never been on?

To my knowledge, it has never been on. Is there a way to check this from the command line? The GUI shows "zfs deduplication" as inherit (off) everywhere.

Try inspecting with top, which reports ARC statistics on the fifth line; that's probably where the memory is going. Swapping is perfectly acceptable and expected on a small-memory system under extreme memory pressure. You have no reason to do that, so don't. It increases memory pressure even more. Delete this tunable.

I will delete the tunable and reboot. Once I figured out that the crash was memory related, I watched top with high update rates (top -s1), to see what was happening as the system was stressed with the iSCSI write. Swapping appears to be a hard wall... and when it happens, freenas stops serving iSCSI, NFS, CIFS... and every machine on the network that talks to freenas comes to a screeching halt. And if I don't catch it and stop the write within a split second of it starting to swap, it will lock up hard, dropping its network connections, etc.

Use a larger block size. The default that is recommended at volume creation time is "probably" fine, maybe not optimal, but shouldn't cause significant issues.

The primary iSCSI target was default block size. I created a secondary target with block size = 512 to experiment and see if write speeds would improve. Writing to this target with bs=512 seems to lock up the system very quickly. The target with bs=16k seemed more resilient, but would still lock things up if FreeNAS had other NFS activity going on at the same time. I was going to confirm this tonight - I created 2 identical bare-bones Linux VMs on the ESXi. One on the bs=16k iSCSI datastore, and the other on the bs=512 iSCSI datastore.

I notice mention of 2 x 256GB SSD in your signature. Of course you are not using that for L2ARC because you have a tiny amount of RAM, and the manual says not to add L2ARC on a system with less than 64GB RAM. That's because in order to support 512GB of L2ARC you'd need ... oh maybe 96-128GB of RAM. Anything smaller than that could result in all sorts of fun such as lockups. Especially if you had made the tremendous error of trying dedup on the pool at one point.

The SSDs are simply another RAID mirror on the system with data served over NFS. The data on these disks is a cache that we wanted the fastest possible random read access to, so we put it on SSDs. I simply haven't been talking about them because they didn't seem pertinent to the discussion. The iSCSI datastore is on the 4x4TB HDDs.

My eyes go crossed trying to search for stuff in this thread, but if your pool is fairly full it'll also have all sorts of other interesting problems. A pool containing iSCSI probably needs a minimum of 50% free headroom in order to perform well over time, and anything less than 64GB of RAM can get kind of dodgy in various ways. You can absolutely have iSCSI on FreeNAS and have it be awesome, but it takes big resources. The 7TB of VMware iSCSI storage here takes 52TB of raw disk space, 128GB RAM, an E5-1650v3, and 2 x 512GB Samsung 950 SSD's and we have a pretty nice box out of it.

At the end of the day, if what I'm trying to do is beyond the scope of what FreeNAS can do on this platform with 32GB of RAM - that's fine. I'm happy to reconfigure things... but there is time and money involved with changing it, so I felt I needed to understand what might be happening. It didn't feel to me that I could be overwhelming the disk I/O or RAM on the system with just 1Gb/s NICs. It seems so lightly loaded moving NFS traffic while flooding the NIC that serving 1TB over iSCSI with a dedicated NIC link didn't strike me as something that would require a huge jump in compute power and storage.

Thanks for your thoughts jgreco - I appreciate the time to work through this with me.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
jgreco - I was originally reading it as an option that was enabled rather than an option that it is capable of.

For this description, I believe it is running with default MTU=1500:
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO>

It's a list of options or features or whateveryouwanttocallthem that are available to the driver on that particular chipset, and this is important to know because in some cases the feature requires manual configuration to actually enable it, as with MTU, VLANs, etc.
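
For example, the MTU only changes if you explicitly set it - something like the line below (purely illustrative, and not a suggestion to turn jumbo frames back on):

ifconfig igb0 mtu 9000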
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
To my knowledge, it has never been on. Is there a way to check this from the command line? The GUI shows "zfs deduplication" as inherit (off) everywhere.

If you can type "zpool list" at the CLI and see "1.00x" in the DEDUP column for every pool, let's consider that sufficient as a sanity check.
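
The output looks roughly like this (columns vary a bit by version, and the pool name and numbers here are made up):

NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
tank  14.5T  6.21T  8.29T  42%  1.00x  ONLINE  -

If you want to be extra thorough, "zfs get -r dedup tank" will also list the property for every dataset in the pool.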

I will delete the tunable and reboot. Once I figured out that the crash was memory related, I watched top with high update rates (top -s1), to see what was happening as the system was stressed with the iSCSI write. Swapping appears to be a hard wall... and when it happens, freenas stops serving iSCSI, NFS, CIFS... and every machine on the network that talks to freenas comes to a screeching halt. And if I don't catch it and stop the write within a split second of it starting to swap, it will lock up hard, dropping its network connections, etc.

It shouldn't do that, obviously. The swapping isn't a great thing, but it is fairly normal for some modest amount of swapout to occur over time. The ~4GB that unused bits of the FreeNAS middleware seem to like to occupy is the usual target. This is because there's a lot of stuff on a NAS that isn't used by your *particular* configuration.

Is there any chance that when it "crashes", it recovers over time? It's possible that you're running into some variation of the issues in bug 1531 relating to transaction group writes, which are supposed to be addressed by the new write throttle mechanism, but if you're maybe catching it before it is able to measure and adjust, it's very possible you could create a situation where the system might go catatonic for ... I'm just going to guess at 30-180 seconds. In such a case, what's actually happening is that one transaction group is being flushed to disk and another full transaction group has been created in the meantime. At that point, ZFS *must* pause, because it isn't committing to disk quickly enough.
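
For what it's worth, the knobs for the new write throttle should be visible via sysctl on recent builds (names may vary slightly by version):

sysctl vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_percent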

You can test that by running "iostat 1" on some of your disks, or gstat, on the console and then causing this to happen. If that keeps running, and shows your disks very busy, .... bingo.
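
Something along these lines in a spare console session works; the device names are just examples, substitute your own:

iostat -x da0 da1 da2 da3 1
gstat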

That can be at least partially addressed by reducing the transaction group window to one second. But you're still basically doing what Stanley does to this kid:


to your pool. Seriously, do not just sit your ZFS pool in front of the data firehose and then suddenly turn it on full blast. While the system warms up (after any reboot or pool import), try some smaller heavy bursts of traffic to help ZFS learn the characteristics of your pool, at which point it is likely to smarten up a good bit and not be soaking up totally stupid sized transaction groups. Once you warm up the pool and ZFS has learned its performance, it's fairly good about not totally misjudging transaction group sizes. But it can, and does, when cold.
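
For reference, the "transaction group window" mentioned above is the vfs.zfs.txg.timeout sysctl (normally 5 seconds); setting it by hand to experiment looks like this, and it can be made persistent as a sysctl-type tunable in the GUI:

sysctl vfs.zfs.txg.timeout=1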

The primary iSCSI target was default block size. I created a secondary target with block size = 512 to experiment and see if write speeds would improve. Writing to this target with bs=512 seems to lock up the system very quickly. The target with bs=16k seemed more resilient, but would still lock things up if FreeNAS had other NFS activity going on at the same time. I was going to confirm this tonight - I created 2 identical bare-bones Linux VMs on the ESXi. One on the bs=16k iSCSI datastore, and the other on the bs=512 iSCSI datastore.

K.

The SSDs are simply another RAID mirror on the system with data served over NFS. The data on these disks is a cache that we wanted the fastest possible random read access to, so we put it on SSDs. I simply haven't been talking about them because they didn't seem pertinent to the discussion. The iSCSI datastore is on the 4x4TB HDDs.

How refreshing. Usually my cynical crystal ball is right on the money. This is suddenly a bunch less tedious. :smile:

At the end of the day, if what I'm trying to do is beyond the scope of what FreeNAS can do on this platform with 32GB of RAM - that's fine. I'm happy to reconfigure things... but there is time and money involved with changing it, so I felt I needed to understand what might be happening. It didn't feel to me that I could be overwhelming the disk I/O or RAM on the system with just 1Gb/s NICs. It seems so lightly loaded moving NFS traffic while flooding the NIC that serving 1TB over iSCSI with a dedicated NIC link didn't strike me as something that would require a huge jump in compute power and storage.

Well, it's a bit perverse but experience suggests that you just need ... more ... for iSCSI. Whether or not it will work for your scenario is best left to testing. Things here worked okay with 64GB of RAM but I pushed it out to 128 anyways.
 

rruss

Dabbler
Joined
Sep 14, 2015
Messages
35
jgreco - First, thank you for the detailed reply. I really appreciate the help. I apologize for not replying sooner - I saw good things and wanted to confirm the results before cluttering the thread with premature celebrations.

Short summary - I think it might be fine with 32GB of RAM and no tunables set!

The details - I cleaned things up to go back to just the original NFS/CIFS shares and the original iSCSI target (on port 3260), which was created with 16k block size (the default). At 32GB of RAM, the iSCSI write from the VM on the ESXi seems to be very fast, and FreeNAS is handling the ARC and memory exactly as I would expect it to. I do the 30GB file transfer, writing to the iSCSI target, and I don't see any signs of memory overflow; the ARC just blips down periodically to keep the free memory up (watching with top -s1), and the data keeps writing. IOPS are as expected for the data making it onto disk (watching with gstat in another terminal). And nothing seems to get near the point of swapping.

Originally, I created the secondary iSCSI target with 512-byte block size (on port 3270) to debug the slow I/O speed on the primary iSCSI target, which was occasionally hanging the system. But the secondary iSCSI target was even worse on the system when it had 16GB of RAM. For anyone with insight into the code, perhaps something is being handled differently when iSCSI is on a non-standard port, or when the block size is only 512 bytes? A memory leak or something of that nature.

I'll test some more over the weekend, but I'm fairly confident that the added memory stabilized the system.

If you can type "zpool list" at the CLI and see "1.00x" in the DEDUP column for every pool, let's consider that sufficient as a sanity check.

It did indeed read 1.00x across the board.

You can test that by running "iostat 1" on some of your disks, or gstat, on the console and then causing this to happen. If that keeps running, and shows your disks very busy, .... bingo.

gstat was very helpful - I wish I'd known about the command sooner! For as long as I've used Linux (I know FreeBSD isn't Linux, but still), I'm always amazed by what's hidden in there that I don't know about!

That can be at least partially addressed by reducing the transaction group window to one second. But you're still basically doing what Stanley does to this kid:

Agreed - but there should be a mechanism for the system to slow down the incoming data as the bucket fills up. And with the system stable and using sufficient RAM, it does indeed seem to do just that.


How refreshing. Usually my cynical crystal ball is right on the money. This is suddenly a bunch less tedious.

I try to do as much homework as possible. I realize folks are volunteering their time and energy to make life easier for me on a forum like this. I know it's hard not to default to the assumption that it's yet another person ignoring the guidelines, but I only raised the problem after really struggling to understand it and thinking I was doing everything the forum recommended. :)

Well, it's a bit perverse but experience suggests that you just need ... more ... for iSCSI. Whether or not it will work for your scenario is best left to testing. Things here worked okay with 64GB of RAM but I pushed it out to 128 anyways.

So - 32GB really seems like a minimum amount of RAM for my needs.
 

rruss

Dabbler
Joined
Sep 14, 2015
Messages
35
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

This is filed against 10.0; I don't know whether it would also be the same in the 9.x branches, or what FreeNAS has done itself.

Thanks for including the link here. That behavior does sound exactly like what my system was doing.

Perhaps adding the RAM just kicks the can down the road, or perhaps it wouldn't surface at 32GB of RAM unless the network I/O allowed data to come in faster or the drives were slower.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
I re-read a bit of the comments, and it seems the issue extends back to 8.x, so it is probably present in FreeNAS unless they have done something to fix it.

Also, your small blocksize may be related to the crash (the PR doesn't describe a crash, I think; it describes stalls). I recall that there is some issue with the ZFS blocksize and the various-granularity pools of memory the kernel keeps handy for different allocation sizes. If you actually managed to set a 512-byte blocksize (or any small blocksize, really), then it is probably contributing to that problem. Essentially, fragmentation of the kernel address space, leading to the inability to service a request to allocate kernel memory once you approach zero free memory.
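
If anyone wants to poke at those per-size kernel allocation pools the next time memory gets tight, the standard FreeBSD tools give a view of them:

vmstat -z | head -40
vmstat -m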
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Agreed - but there should be a mechanism for the system to slow down the incoming data as the bucket fills up. And with the system stable and using sufficient RAM, it does indeed seem to do just that.

That actually IS the mechanism for the system to slow down. It has to hold off at some point. The problem is that if the mechanism hasn't learned enough about your system's performance, it can potentially make totally bad guesses. Reducing the transaction group flush interval to 1 second makes it look more carefully at that, more often. Kinda.

I re-read a bit of the comments, and it seems the issue extends back to 8.x, so it is probably present in FreeNAS unless they have done something to fix it.

That's interesting; it may go a long way toward explaining the other parts of some behaviours that bug 1531 never seemed to fully explain.
 

rruss

Dabbler
Joined
Sep 14, 2015
Messages
35
I was admittedly sloppy in this thread with my use of the terms "crash" and "hang", sorry. A system stall (with the networking crashing or stopping) is a more accurate description.

There was never a sign of data loss or even a single error on the pool... so that's a silver lining.

The FreeNAS system never actually rebooted itself, and when it stalled it did respond to keyboard input at the console, although it was usually swapping so badly that I couldn't actually interact with it. Even after waiting a VERY long time, it never came back.

In my environment, FreeNAS is the center of the universe. All RAID'd storage is there, and it is the machine physically connected to the UPS. When it started swapping, the first sign to a user was that it stopped communicating with the NUT clients (our workstations), and they all reported losing touch with the UPS master. All NFS, CIFS, and iSCSI were then unreachable. This made our domain controller (a VM on ESXi) lock up, so even logging into a Linux workstation was impossible. Even if you logged in, the home directory is an NFS mount back to FreeNAS, so the login would just hang anyway. So when it swapped like this, everything came to a grinding halt as though it had crashed, and that's how I thought about it.

As I noted at the top, the one thing that did seem to actually crash was the networking on FreeNAS. When I could recover from a stall, the networking was often offline as reported by ifconfig.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
That's interesting; it may go a long way toward explaining the other parts of some behaviours that bug 1531 never seemed to fully explain.

Maybe you should update 1531 with a notice to the PR, or maybe open a new bug referencing that PR, with a suggestion to consider adopting one of the patches?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
1531 has mostly been addressed by the new ZFS write throttle. I'm talking more about other performance issues I had noticed while trying to characterize 1531...
 

rruss

Dabbler
Joined
Sep 14, 2015
Messages
35
Just a quick follow-up to say that it's been about 2 months since I bumped the server to 32GB of RAM and performance has been flawless.
 
Status
Not open for further replies.
Top