ESX guest problems with too many ZFS snapshots


jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Has anyone ever had problems with ESX guest OSes being able to write to their drives because there were too many ZFS snapshots?

I remember we had this problem about two years ago when we were hosting ESX guests on 7200 RPM SAS drives. The issue ended up being that with too many ZFS snapshots it would take too long for writes to complete. Eventually the Linux ESX guest would time out and throw its file system into read-only mode.
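(For what it's worth, the read-only flip is typically the guest's ext3/ext4 "errors=remount-ro" mount option kicking in after a SCSI command times out. A rough sketch of checking and raising the guest-side disk timeout, with sda/sda1 used only as example device names:)
Code:
# inside the Linux guest: current SCSI command timeout, in seconds
cat /sys/block/sda/device/timeout

# raise it to 180s for this boot (VMware Tools normally sets this via a udev rule)
echo 180 > /sys/block/sda/device/timeout

# see how the root filesystem is set to react to I/O errors
tune2fs -l /dev/sda1 | grep -i "errors behavior"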

I thought that with our new setup, being an all-flash array, we wouldn't have to worry about that so much. But we started having problems two days ago, and today it got much worse.

We run about 50 VMs, and their usage is pretty low-volume. Looking at the network traffic to the NFS share, it's only pushing 100 Mbit/s on average.


I was taking a snapshot once an hour and keeping them for about 10 days. Today is day 7 and it's acting up. That's about 160 snapshots total, each holding about 500 MB worth of changes; below is an example:
Code:
tank1/esx@auto-20140406.0732-10d            561M      -  1.13T  -
tank1/esx@auto-20140406.0832-10d            541M      -  1.13T  -
tank1/esx@auto-20140406.0932-10d            541M      -  1.13T  -
tank1/esx@auto-20140406.1032-10d            533M      -  1.13T  -
tank1/esx@auto-20140406.1132-10d            546M      -  1.13T  -
tank1/esx@auto-20140406.1232-10d            586M      -  1.13T  -
tank1/esx@auto-20140406.1332-10d            605M      -  1.13T  -
tank1/esx@auto-20140406.1432-10d            606M      -  1.13T  -
tank1/esx@auto-20140406.1532-10d            555M      -  1.13T  -
tank1/esx@auto-20140406.1632-10d            587M      -  1.13T  -
tank1/esx@auto-20140406.1732-10d            564M      -  1.13T  -
tank1/esx@auto-20140406.1832-10d            574M      -  1.13T  -
tank1/esx@auto-20140406.1932-10d            563M      -  1.13T  -
tank1/esx@auto-20140406.2032-10d            575M      -  1.13T  -
tank1/esx@auto-20140406.2132-10d            538M      -  1.13T  -
tank1/esx@auto-20140406.2232-10d            512M      -  1.13T  -
tank1/esx@auto-20140406.2332-10d            511M      -  1.13T  -
tank1/esx@auto-20140407.0032-10d            514M      -  1.13T  -
tank1/esx@auto-20140407.0132-10d            578M      -  1.13T  -
tank1/esx@auto-20140407.0232-10d            536M      -  1.13T  -
tank1/esx@auto-20140407.0332-10d            491M      -  1.13T  -
tank1/esx@auto-20140407.0432-10d            575M      -  1.13T  -
tank1/esx@auto-20140407.0532-10d            671M      -  1.13T  -
tank1/esx@auto-20140407.0632-10d            699M      -  1.13T  -
tank1/esx@auto-20140407.0732-10d            809M      -  1.13T  -
tank1/esx@auto-20140407.0832-10d          1.04G      -  1.15T  -
tank1/esx@auto-20140407.0932-10d          1012M      -  1.15T  -
tank1/esx@auto-20140407.1032-10d            940M      -  1.15T  -
tank1/esx@auto-20140407.1132-10d            961M      -  1.15T  -
tank1/esx@auto-20140407.1232-10d            999M      -  1.15T  -
tank1/esx@auto-20140407.1332-10d          1.01G      -  1.15T  -
tank1/esx@auto-20140407.1432-10d          1.14G      -  1.15T 
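
(A listing like the one above can be pulled with something along these lines; sorting by size makes the growth easier to spot:)
Code:
zfs list -t snapshot -r tank1/esx
# or, sorted by the space each snapshot holds:
zfs list -t snapshot -o name,used,refer -s used -r tank1/esx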


So I guess I'm gonna ramp down the number of snapshots. Just wondering if anyone else has had this problem.

System Info:
FreeNAS-9.2.1.3-RELEASE-x64
Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Memory: 262088 MB
JBOD connected via external SAS.
24 - SAMSUNG 840 Pro Series MZ-7PD512BW 2.5" 512GB SATA III MLC SSDs
Set up as 6 RAIDZ2 vdevs of 4 drives each.
Mirrored ZIL (SLOG) devices are HGST s840Z 2.5″ SAS SSDs.
Cache (L2ARC) drive is a 200 GB HGST s840 2.5″ SAS SSD.
NFS export to VMware ESX servers via a 1 Gbit NIC.

Code:
NAME          STATE     READ WRITE CKSUM
tank1         ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    da0       ONLINE       0     0     0
    da1       ONLINE       0     0     0
    da2       ONLINE       0     0     0
    da3       ONLINE       0     0     0
  raidz2-1    ONLINE       0     0     0
    da4       ONLINE       0     0     0
    da5       ONLINE       0     0     0
    da6       ONLINE       0     0     0
    da7       ONLINE       0     0     0
  raidz2-2    ONLINE       0     0     0
    da8       ONLINE       0     0     0
    da9       ONLINE       0     0     0
    da10      ONLINE       0     0     0
    da11      ONLINE       0     0     0
  raidz2-3    ONLINE       0     0     0
    da12      ONLINE       0     0     0
    da14      ONLINE       0     0     0
    da15      ONLINE       0     0     0
    da16      ONLINE       0     0     0
  raidz2-4    ONLINE       0     0     0
    da17      ONLINE       0     0     0
    da18      ONLINE       0     0     0
    da19      ONLINE       0     0     0
    da13      ONLINE       0     0     0
  raidz2-5    ONLINE       0     0     0
    da20      ONLINE       0     0     0
    da21      ONLINE       0     0     0
    da22      ONLINE       0     0     0
    da23      ONLINE       0     0     0
logs
  mfid0       ONLINE       0     0     0
  mfid2       ONLINE       0     0     0
cache
  mfid1       ONLINE       0     0     0
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Wow, that's kind of interesting...

I have three questions:

1) What's the capacity of the pool at, as reported by "zpool list"?

2) What's the normal amount of I/O headed to/from the pool, and what's it like when things get bad ("zpool iostat 1")?

3) What's gstat showing for your log device utilization percentage?
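
Roughly, something like this should cover all three (pool name assumed to be tank1):
Code:
zpool list tank1               # check the CAP column
zpool iostat -v tank1 1        # per-vdev I/O at one-second intervals
gstat -f 'mfid|da'             # watch the %busy column on the log and data devices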
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Umm.. I'm very wide-eyed.. your post scares the living bejesus out of me.

Whole disk devices in a pool? Not proper for ZFS.

RAIDZ2 for VMs? Not recommended. Even less recommended for flash based storage.

No clue what mfid0, 1, and 2 are, but again, that scares me, as it's not a standard FreeNAS design.

To be honest, I'm not buying that snapshots are the problem directly. They may appear to be by correlation. But as we all know "correlation is not causation".

If I could be so bold: in my experience, people showing the kinds of surface-level red flags I'm seeing here have generally made deeper mistakes as well.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It actually makes sense. With a seriously well-provisioned base system there's no point in swap, so even though we're not used to seeing whole-disk vdevs with FreeNAS, ZFS doesn't forbid them.

The 4-drive RAIDZ2 vdev is a common compromise to reach a certain tier of reliability; the only other option at that tier would be a three-way mirror. With six of them he basically has six high-availability vdevs making up the pool. That should be way fast.
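
To illustrate the two layouts being compared (device names are placeholders, not his actual setup):
Code:
# six 4-disk RAIDZ2 vdevs: any two disks per vdev can fail
zpool create tank raidz2 da0 da1 da2 da3 raidz2 da4 da5 da6 da7   # ...and so on for the rest

# the alternative at the same redundancy tier: three-way mirrors
zpool create tank mirror da0 da1 da2 mirror da3 da4 da5           # ...and so on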

mfi devices are MegaRAID Firmware Interface virtual devices.
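
If it helps, the mapping from the mfidX names back to the physical SSDs can usually be pulled from the controller, e.g.:
Code:
mfiutil show volumes    # logical drives backing mfid0/mfid1/mfid2
mfiutil show drives     # physical disks attached to the MegaRAID controller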

There's something going on here, but, you'll have to excuse me for sayin' it, since I've been savin' this for awhile, I think you're barkin' up the wrong tree. :smile:
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Wow, that's kind of interesting...

I have three questions:

1) What's the capacity of the pool at, as reported by "zpool list"?

2) What's the normal amount of I/O headed to/from the pool, and what's it like when things get bad ("zpool iostat 1")?

3) What's gstat showing for your log device utilization percentage?


RE: 1)
Code:
[root@ds0] ~# zpool list
NAME    SIZE  ALLOC  FREE    CAP  DEDUP  HEALTH  ALTROOT
tank1  11.2T  5.23T  5.92T    46%  1.00x  ONLINE  /mnt


RE: 2)
I was running zpool iostat 1 while the problem was happening today. VMs were wonky for about 30-45 minutes. When I looked over at zpool iostat, it didn't seem like there was a bunch of data being read or written; at least nothing to be alarmed at. I didn't document it, as I wasn't sure what the problem was at the time and was looking into other possible causes. It was only afterward that I remembered our issue from a few years ago and how we fixed it.

RE: 3)
Again, gstat looks fine right now, but I've also deleted a bunch of the snapshots. Crap. At least I'll know to look for it next time.
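
For next time, something like this (file paths are just examples) would capture the numbers while it's happening:
Code:
# run in separate tmux/screen sessions while the VMs are acting up
zpool iostat -v tank1 1 | tee /var/tmp/iostat.log
gstat -b -I 1s > /var/tmp/gstat.log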
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Umm.. I'm very wide-eyed.. your post scares the living bejesus out of me.

Whole disk devices in a pool? Not proper for ZFS.

I've seen you mention this before on the forums. But I've seen lots of documentation online saying stuff like the following:
"To create a RAID-Z pool, use this command, specifying the disks to add to the pool:
# zpool create storage raidz da0 da1 da2"

http://www.allanjude.com/bsd/zfs-quickstart.html
RAIDZ2 for VMs? Not recommended. Even less recommended for flash based storage.

Why not RAIDZ2 for VMs, especially since they are flash? The reason I didn't go with mirrors is that if you lose the second drive in a mirror (like while resilvering) you lose the whole pool. The reason not to do RAIDZ2 would be the loss of IO. But since I'm using flash, IO shouldn't really be an issue, and I gain the ability to lose more drives without the risk of losing the entire pool.

No clue what mfid0, 1, and 2 are, but again, that scares me, as it's not a standard FreeNAS design.
mfid0, 1, and 2 are whatever FreeNAS decided to name the STEC drives.

To be honest, I'm not buying that snapshots are the problem directly. They may appear to be by correlation. But as we all know "correlation is not causation".

If I could be so bold: in my experience, people showing the kinds of surface-level red flags I'm seeing here have generally made deeper mistakes as well.
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
There's something going on here, but, you'll have to excuse me for sayin' it, since I've been savin' this for awhile, I think you're barkin' up the wrong tree. :)


Thanks for taking the time to read my post and responding. Maybe you're right, and I am barking up the wrong tree. I guess I only have two ways to retest this.

#1 See how things go with fewer snapshots.
And / or
#2 Bring the number of snapshots up again, wait for it to become a problem again, and document things a bit more.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yes, that zpool creation command is valid. It works great... for FreeBSD. FreeNAS, though, does a lot more and has expectations. It expects you to use the WebGUI. We've seen numerous users do exactly what you did, use the CLI, and then later they had a reboot, the pool was unmountable, and all the drives were listed as CORRUPT or UNAVAIL. No explanation for why other reboots had no problem and then suddenly, that one day, it went to poop. The short answer is "don't do that."

The FreeNAS manual gives steps for creating the pool from the WebGUI, and that's the only way you should be doing it. It doesn't include steps for creating a pool from the CLI, and for good reason: you shouldn't be doing that. Generally, if the WebGUI can do what you want to do, you should be using it. Doing user adds, password changes, and whatnot from the CLI also makes a mess and can leave you with an unusable system.

If you want to use FreeBSD commands, you are welcome to use the full-fledged FreeBSD OS. But the reason people use FreeNAS is to avoid the CLI. Choosing to then do FreeBSD stuff on FreeNAS is very dicey and has resulted in plenty of lost data. Generally, anything from the CLI should be done only after seriously considering it. Anything done from the CLI is basically done behind FreeNAS' back, outside of FreeNAS' config file, and possibly breaks assumptions FreeNAS makes. FreeNAS may not like what you do, and you may end up very unsatisfied with the results later on.

Flash drives have limited writes. On RAIDZ2 you'll be writing your data plus at least 2 additional parity blocks. If the data goes in a single block, you're actually writing that block plus two parity blocks, so one write is amplified into three. Not a very ideal situation for SSDs. Additionally, RAIDZ2 adds processing overhead (minimal, but noticeable in certain workloads) and cuts down on the I/O you can throw at a vdev because of the parity writes. For a situation like yours it's less than ideal. It can work to various degrees, ranging from very well to very poorly. But when you look at the pluses and minuses of RAIDZ2 over a 2-disk (or even 3-disk) mirror, there's little reason to go with RAIDZ2 when the "cost vs. benefit" is examined closely (and I'm not talking financially, although that is a small aspect of it).
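
As a rough back-of-the-envelope illustration (assuming a 4-disk RAIDZ2 with 4K sectors, and ignoring RAIDZ padding):
Code:
#  4K guest write -> 1 data sector + 2 parity sectors             = 3 sectors written (3x)
# 16K guest write -> 4 data sectors + 2 parity sectors per stripe
#                    across 2 stripes                             = 8 sectors written (2x)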

The mfidX devices scare me because they should have been gptids; FreeNAS should have labeled them as such, and it didn't, presumably because you did CLI stuff. Please see the first paragraph for warnings on why that's not the most ideal option.

I will say this, as I'm somewhat sleepy and it popped into my head when I started writing. When snapshots expire and are deleted, it can put serious strain on the pool during the deletion process. Really, really big pools that have huge snapshots can take a very long time (apparently pools going unresponsive for hours has been seen in some cases!). I'm wondering if the snapshots themselves aren't the problem so much as the deletion process when they expire.

Deletions are very I/O intensive because ZFS has to cross-check the zpool metadata and determine whether each block can actually be freed, free the block and update the metaslabs, etc. I'm wondering if your pool can't keep up with your VMs plus the deletion process, and that's causing your problem. Because of how ZFS is designed, the deletion is handled in a single transaction. This can hold up further transactions, causing problems.

Last I read, there was a feature flag that was going to be written to make deletions work a little differently: the deletion would actually run in the background, cleaning up and freeing the pool's blocks after the transaction cleared. I have no clue if or when that's going to be done, or if it's already been implemented. I've never personally had a problem with snapshots, so I've never bothered to investigate it. Might be something I'll have to look at tomorrow...
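
For what it's worth, the feature flag that sounds like this is async_destroy (background freeing for destroys; whether it covers snapshot deletes I'm not certain). Whether it's on, and whether the pool is still freeing blocks in the background, can be checked with something like:
Code:
zpool get feature@async_destroy tank1   # 'enabled' or 'active' if the feature is available/in use
zpool get freeing tank1                 # space still waiting to be reclaimed after a destroy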
 