FreeNAS datastore(s) becoming inaccessible during/after heavy load

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
I've occasionally been having an issue for a while where, during or after heavy writes, one or more FreeNAS pools become inaccessible. Typically, the only time I've seen it is in the few instances where I've had a power failure and I'm trying to suspend a bunch of VMs before my UPS gives out, but during the suspend of the machines, everything grinds to a halt and crashes anyway. Fast forward to earlier this week: I was migrating a powered-off machine's storage from two points on one datastore to a single point on another datastore, and this same 'crash' occurred.

Once I got everything back up, I built a second FreeNAS machine running FreeNAS 11.1u7 on a Poweredge R620 with 224GB of RAM, connected to a Supermicro JBOD (my main FreeNAS is a Poweredge FC630 with 256GB RAM and a Xeon E5-2699v3 running 11.1u7, connected to two Supermicro JBODs). On the new machine I created two pools consisting of two striped mirrors each in order to do some testing. I moved a number of 'unimportant' machines over to the new FreeNAS to test with, let it sit for a while, then set VMWare to move the disks for all 11 machines from one datastore on the test FreeNAS to the other.

It ended up doing the same 'crash', and this time I noticed that on one interface, VMWare was reporting five devices/five paths, but the other interface was reporting 3/3 (neither datastore on the test FreeNAS was available on one interface). I was also unable to browse the datastores on the test FreeNAS. I also found a lot of datastore errors in VMWare indicating path failure and datastore inaccessible, and thought maybe it's VMWare, so I rebooted the VMWare host. After that, it was showing 5/5, but I got sidetracked to another task.

After I finished up what I had to do and came back to this, I noticed that I still couldn't browse the datastores, despite VMWare saying all paths were up. I then took my second host out of maintenance mode and found that it didn't start balancing the VM load like usual. So I rebooted the test FreeNAS machine, and as soon as it went down, VMs started migrating between hosts like normal, and once the test FreeNAS machine came back up, the datastores were fully accessible again. So it seems like it's actually FreeNAS that's having issues, and that it was causing the VMWare hosts to hang some functions, most notably vMotion.

My first thought was that maybe the 'tunables' that are in place could have had something to do with the issues (they are autotune entries, quite possibly from as far back as when I was running FreeNAS on an R710), but there are no tunables in place on the test machine, so that likely rules that out. I then found another thread while starting mine; while I haven't dug in to look for the errors that poster found, the symptoms seem nearly identical. In his case, he was running 11.1u4 and apparently adding a SLOG fixed it. So I'm not sure if my issue is the same as that poster's, but it seems possible. While adding a SLOG is probably not a bad idea, one would think that once it got to the point where it ran out of cache, the data transfer would slow down, not crash.

Does anyone have any ideas or suggestions? Is any other info needed?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
No, sorry, I don't have any suggestions other than to ask for clarification on your setup.

You are using bare metal FreeNAS, correct?
The VMWare hosts are separate computers, correct?
You are connecting the VMWare VMs to space on FreeNAS, correct?

What is used on FreeNAS for the VM storage, zVols via iSCSI or files via NFS?

Perhaps someone else, (with some or all of my questions answered), can help.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
No, sorry, I don't have any suggestions other than to ask for clarification on your setup.

You are using bare metal FreeNAS, correct? - Absolutely - running on a Poweredge FC630
The VMWare hosts are separate computers, correct? - Absolutely - two other Poweredge FC630's
You are connecting the VMWare VMs to space on FreeNAS, correct? - The FreeNAS zvol volumes are presented to the VMWare hosts via iSCSI

What is used on FreeNAS for the VM storage, zVols via iSCSI or files via NFS? - zvols presented to VMWare by way of iSCSI

Perhaps someone else, (with some or all of my questions answered), can help.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
So I think I may have some sort of 'high level' idea of what's going on here.

So my main VMWare pool is a ten-disk pool - what I believe FreeNAS would call striped mirrors (in 'traditional speak', RAID10). Anyway, on that pool, I have two zVols, VMWare1 and VMWare2, as VMWare volumes. That's where I first encountered this issue.

I then set up a second FreeNAS machine running the same version (11.1u7) and started doing some testing. Initially, I wasn't as methodical as I should have been, so moving past that: I had set up a striped mirror consisting of a total of four disks, and a second duplicate, both of which were provisioned to VMWare as 'FreeNAS2-1' and 'FreeNAS2-2', for demonstration purposes. I proceeded to move a dozen VMs to one of the volumes, and from there, storage vMotioned all dozen back and forth, in the same batch, as many as VMWare would do at one time (I didn't pay attention to how many it would move at once, but it was more than two or three). Doing this, they vMotioned back and forth without issue in my testing, many times - in every case, I was selecting all twelve machines. It never blew up.

Then I rebuilt them to more closely mirror my actual setup - the real setup, again, is ten disks in a striped-mirror config with two zVols on it - so the 'replica' was eight disks in a similar config (one eight-disk striped mirror pool). After I set that up and moved three VMs to it, that went fine... but then I did a storage vMotion from one datastore (zVol) to the other in the same disk pool, and roughly 10 minutes in it blew up. Same symptoms. It hasn't been consistent about when it'll blow up: this morning, it blew up on the first attempt to vMotion the machines, but this afternoon it did blow up again, though only after quite a few storage vMotions (with various lengths of time in between) - that was probably the third or fourth time today it blew up. While the 'when it would blow up' has been inconsistent, the fact is, in my testing, it would eventually blow up. I was rather surprised at how many moves it took for the latest one after the first one blew up ten minutes in... It took quite a few attempts before it finally blew up this last time, but it happened.

So based on my testing, it seems like this issue, at least in my case, has something to do with having a single pool with multiple zVols on it and high R/W I/O between the two zVols. I can't say I've seen anything remotely similar with my Z3 volume - but, outside of 'rare' circumstances, it only sees run-of-the-mill file and/or video data. Nothing to write home about.

So that brings me to a question I've been thinking about: is it better, performance-wise, to have a single RAID10 pool (I apologize, just using the simplest nomenclature) consisting of, say, eight or ten disks, with two zvols on it, or two RAID10 pools of four disks each, each with its own zvol? 'Traditional logic' suggests that the more spindles you have in a RAID array/pool/etc., the better off you'll be, but my testing seems to suggest that reading from one zvol and writing to another in the same pool at any 'semi high' rate doesn't end well.

Any thoughts?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If your aim is to have things moving from one place to another, there's probably benefit to having those 2 places not using the same "pool" of IOPS to do the work, so 2 separate pools will put a clear line of demarcation between the "reader" and the "writer".

Of course that means the combined performance of one larger pool is never available, but the performance of each smaller pool is always kept separate, so both pools' maximum performance is available at the time of vMotion.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Well, my main goal is stability - as I mentioned in my original post, this behavior was first noticed in the rare instances when I had to shut everything down in a hurry and was trying to suspend some machines while shutting others down all at once, but then I had it happen when I was cloning a VM. I found that the way to test it was vMotioning, which yielded pretty repeatable results - basically, the same behavior I saw with the rapid shutdowns, which is the datastore 'going away'.

I don't really know WHY it's happening, but it seems related to the two zVols/Datastores on the single pool, because when I split it out, it didn't 'crash' on me once, and I was hitting it harder with that test than I am right now.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I don't really know WHY it's happening, but it seems related to the two zVols/Datastores on the single pool, because when I split it out, it didn't 'crash' on me once, and I was hitting it harder with that test than I am right now.
I suspect what is happening there is that VMware is seeing timeout limits being hit when the single pool gets too much IO at once, so with it split to 2 pools, you don't reach that same level which triggers the timeout, hence stability.

There may be some possibility to futz around with settings in VMware to extend that timeout, or consider looking at your overall pool setup. You're probably better off looking into adding RAM if you'd rather have a single pool and more robust performance.
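
For what it's worth, something like this is where those timeouts live on the ESXi side. Treat it as a sketch, not a recommendation of specific values: vmhba64 is just a placeholder for your software iSCSI adapter, and parameter names/defaults can vary by ESXi release.

# Find the software iSCSI adapter name
esxcli iscsi adapter list

# Show per-adapter parameters, including RecoveryTimeout (how long ESXi waits
# on a stalled session before failing the path) and the NoopOut settings
esxcli iscsi adapter param get --adapter=vmhba64

# Example only: raise RecoveryTimeout from the usual 10-second default to 25
esxcli iscsi adapter param set --adapter=vmhba64 --key=RecoveryTimeout --value=25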
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Well, my main machine already has 256GB, the test one has 128GB, so they're not really short in the RAM department lol.

I think it's something in FreeNAS that's going sideways, because once it happens, the only way to get the datastores back is to reboot everything - they never 'just return', and rebooting only the VMWare hosts doesn't get them back; FreeNAS must be rebooted as well.

Seeing as this issue doesn't SEEM to be occurring when each zVol has its own pool, I think I'm going to end up splitting them out as a 'fix', as I don't think having the larger pool will give me much benefit I/O-wise, since I don't typically hit it all that hard. But I'm curious if others have seen this sort of behavior, and if maybe it could be some sort of bug that may have been fixed at some later date.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
maybe it could be some sort of bug that may have been fixed at some later date.
You're certainly leaving a lot of room for that since you're on a version that is long past its end of support.

Also curious as to the version of VMware. Newer versions handle disappearing datastores much better. Were you using @Spearfoot's vmware scripts to re-attach the datastore (or try to)? (https://www.truenas.com/community/resources/utility-scripts-for-freenas-and-vmware-esxi.29/)
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Well, I'm a bit gun-shy about upgrading FreeNAS, to be honest. I had a LOT of issues with one of the newer FreeNAS versions (I forget at this point if it was 11.2 or 11.3 - there should be a thread around here somewhere about it) where it would panic and reboot randomly, which caused a domain controller to keep getting corrupted, and nothing could ever be figured out, so I reverted to 11.1u7 and that's been fine ever since (other than this issue). That was a different setup that I had created for work that has since been decommissioned, but since no explanation or cause could ever be determined, I didn't want to risk that happening with my environment, hence keeping my environment at 11.1u7. I had also investigated upgrading to TrueNAS Core and ran into some 'issues', which again, no one could offer any sort of info on, so I put that idea on the back burner and am still on 11.1u7, which again, aside from this particular issue, has been absolutely rock solid. I DID also test with 11.3, and I was still able to get it to 'crash' on that version as well.

I'm running vSphere 7, some time in the next few months it'll get upgraded to vSphere 8. I was not aware of those scripts, I might take a look.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
iSCSI drops due to high traffic conditions are a classic sign of a stressed out pool. Are you seeing warnings from ESXi and/or FreeNAS about iSCSI timeouts?
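
If you want to check, something like this should surface them (assuming stock log locations on both systems):

# On the ESXi host - iSCSI path/timeout complaints land in the vmkernel log
grep -i iscsi /var/log/vmkernel.log | tail -n 50

# On the FreeNAS box - the ctld iSCSI target logs to the system log
grep -i -e ctld -e iscsi /var/log/messages | tail -n 50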

The usual problem is that there is insufficient pool I/O capacity. If you are having problems, try again with 25% pool occupancy (I do not see any description of your system or pool status) and then again with 10% if problems persist. Adding L2ARC can increase the ratio of reads to writes that your pool can sustain as well.

Your footer suggests you have two 12-bay chassis, which means you might have as many as 24 drives, or 12 mirror vdevs. If you generate a lot of fragmentation, such as by filling the pool beyond maybe 60-70% and then doing periodic overwrites, such a small number of HDDs may not be able to sustain high write volumes due to CoW fragmentation. People tend to think "oh, my 24 drives can sustain 200MBytes/sec * 24 = 4800MBytes/sec" but the reality is more like "12 vdevs writing can sustain maybe 100 IOPS each or about 1000 IOPS, and at 4KB block sizes that might only be 4MBytes/sec." Once you exceed the pool's capacity to soak up writes, iSCSI will stall and the iSCSI initiators tend to freak.
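
If it helps, the numbers being asked about here (occupancy, fragmentation, and how many vdevs are actually sharing the write load) can be pulled straight from the FreeNAS shell; 'poolname' below is a placeholder:

# Occupancy (CAP) and fragmentation (FRAG) per pool
zpool list -o name,size,allocated,free,fragmentation,capacity,health

# Per-vdev layout, to count the mirror vdevs doing the work
zpool status poolname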
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't really know WHY it's happening, but it seems related to the two zVols/Datastores on the single pool, because when I split it out, it didn't 'crash' on me once, and I was hitting it harder with that test than I am right now.

That's also quite possible. VMware has recommendations about the number of LUNs to maintain on iSCSI; there are inherently some limits as to how well iSCSI can work because there are queue depth and timeout considerations. If you have an initiator that is very busy and is sending all its traffic over a single iSCSI session, you run out of queue very quickly. If you have an initiator that is putting half its traffic out each of two iSCSI sessions, you get double the tolerance for real-world latency and delays. Obviously this also works somewhat better if you have 10G rather than 1G, 25G rather than 10G, more hard drives rather than fewer, two storage networks rather than one, etc.
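
A rough way to sanity-check that from the ESXi side (sketch only; field names in the output may vary slightly between releases):

# How many iSCSI sessions the initiator is actually spreading traffic across
esxcli iscsi session list

# Per-device queue depth as ESXi sees it
esxcli storage core device list | grep -i -e "Display Name" -e "Queue Depth"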
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
I added pool info to my sig, but the pools for my main FreeNAS are:

NAME         SIZE   ALLOC   FREE    EXPANDSZ   FRAG   CAP   DEDUP   HEALTH   ALTROOT
DVR          2.72T  1.79T   955G    -          54%    65%   1.00x   ONLINE   /mnt
Data         80T    23.8T   56.2T   -          4%     29%   1.00x   ONLINE   /mnt
VMWareData   9.06T  3.02T   6.04T   -          65%    33%   1.00x   ONLINE   /mnt

But on the test one, it's just eight 4TB SAS disks, set up as a single striped mirror and two 4TB zVols for the datastores, and the data I'm working with is only like 150GB total, so there's more than enough free space in the pools.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
there's more than enough free space in the pools

Maybe, but you're showing 33% allocated, and you only have four vdevs. This can turn into a big problem if your data is fragmented. What does your array look like running gstat under heavy write load?
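
For reference, something along these lines from the FreeNAS shell while the storage vMotion is running should show it (flags per the FreeBSD gstat man page; watch the %busy and ms/w columns):

# Physical disks only, refreshing once per second
gstat -p -I 1s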
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
The 'there's more than enough free space in the pools' comment was actually referring specifically to the test FreeNAS that has the eight 4TB disks and only about 150GB of data being shuffled around, which I can get to crash pretty reliably. I probably should have said 'pool', not 'pools' and/or zVols. I'm not running this kind of test on my main one knowing that it may bring it down :smile:. The 'VMWareData' pool is actually five vdevs (10 disks, mirrored and striped).

That being said, here's a clip from the 'basically empty' test FreeNAS when I have the three VMs being relocated from one datastore to the other.
[Attached screenshot of gstat output during the migration]
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Welp, I'm going to have to backtrack... I killed the single pool, created two four-disk pools, reconnected and did more testing, and it blew up. No idea why it seemed to work non-stop when I was sending a dozen machines back and forth as fast as it could, but this time around, three machines blew it up... Back to the drawing board.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Your disks are running full tilt. You need to at least double and probably quadruple the number of disks to support this workload, or find other ways to mitigate the I/O load towards the pool.
 
Joined
Jun 15, 2022
Messages
674
Whenever I have a system fail due to excessive workload (over 100% for sustained periods) it turns out to be firmware crashes on less-than-premium hardware. A hardware reset on one piece of equipment can trigger a condition that clears the firmware condition on the actual offending hardware in a different piece of equipment.

I think that's part of why Cisco switches cost far more than competitors', why high-end server boards cost more than budget ones, and even why HGST drives and WD Gold require actual gold to purchase yet never fail under the strain of running full tilt.

Even discount cables can develop signaling issues (or come with them out of the box).

---
The exception to firmware issues might be "anything made by Microsoft." The kids in Redmond need to mature significantly.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Whenever I have a system fail due to excessive workload (over 100% for sustained periods) it turns out to be firmware crashes on less-than-premium hardware.

Maybe. However, I'm not sure that's relevant to the timeout issue. Due to the way iSCSI is designed, operating over TCP, there isn't really a way to prioritize iSCSI keepalive traffic. I ran into this many years ago, which I characterized in Bug #1531, where you could easily get the ZFS system to go into a complete stall for extended periods of time. This led to some additional tuning and tweaking, and ultimately contributed to calls for a rewrite of the ZFS write throttle. If you understand what's going on underneath the sheets, you can still trick ZFS into stalls but it is admittedly much harder to pull that trick on modern ZFS.

So basically if you are pushing too much traffic at the pool (mostly concerned with seek-inducing IOPS), eventually the latency becomes high enough that you are unable to clear transactions within ten seconds, and this is just fundamentally incompatible with iSCSI and its timeout design. Nothing to do with firmware or less-than-premium hardware, except insofar as maybe you don't have enough hardware to support the workload.

Fixes include (not necessarily in order of importance):

* Don't use RAIDZ (bad for a number of reasons)
* Reduce the transaction group time window (I still use 1 second rather than 5, see Bug #1531 discussion; see the sketch after this list)
* Optimize block sizes so that you do not incur extra read-write penalties
* Add more drives
* Increase ARC/L2ARC resources to reduce unnecessary read calls to pool
* Use larger drives and maintain gobs of free space, like only 10-25% used is really a performance booster, tends to dramatically reduce seeks
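
As an illustration of the transaction group item above (not a prescription - test before making it permanent), on FreeNAS 11.x the window is the vfs.zfs.txg.timeout sysctl, which can be tried live from the shell and then persisted as a sysctl-type Tunable under System -> Tunables:

# Show the current transaction group window (default is 5 seconds)
sysctl vfs.zfs.txg.timeout

# Example only: try a 1-second window
sysctl vfs.zfs.txg.timeout=1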
 
Joined
Jun 15, 2022
Messages
674
@jgreco : Thank you for the insight. It seems the best solution is to scale the hardware to manage the worst-case scenario, which would theoretically also decrease normal daily response latency.
 