SAS Writeback cache performance & Device vs file extent performance disparities

Status
Not open for further replies.

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
After having badblocks running for 500+ hours on my first set of SAS disks, and after several searches trying to figure out why they were running so slowly came up empty handed, I finally found a post where someone else was seeing the same thing with some new WD SAS disks and discovered that the writeback cache was disabled on them. After enabling it, his badblocks run went from 8Mbps (he posted it as Mbps, so I don't know if he meant megabits or megabytes, but it doesn't really matter - either is really slow for a SAS disk) to 150MB/s. I found that the writeback cache was disabled on my disks too, and after enabling it I saw the same thing - they took off and finished in no time.
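For anyone else who runs into this, the current cache settings can be read per disk with camcontrol (da0 here is just an example device name - substitute each of your disks):
Code:
camcontrol modepage da0 -m 0x08 | grep -E "WCE|RCD"
WCE: 0 means the write cache is disabled; flipping it to 1 is what made the difference for me.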

That being said, while it obviously had a HUGE performance impact on badblocks, will it have a big impact on FreeNAS performance if it's disabled? Or does the way FreeNAS works negate the benefit it clearly has in cases like badblocks - or might it even be better left off?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I'm curious about the answer here. To me this sounds like a testing opportunity.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Any thoughts on how I might go about testing? I've got these SAS disks set up as a 7-disk RAIDZ2 with a device extent set up as an iSCSI target over 10G Ethernet.

So far I've moved a VM onto the pool, then off of it, then enabled the write cache and moved it back on. Both moves onto this pool were slower than the move off of it, and the move back on was actually slower with the write cache enabled.
VM Move.JPG

The bottom row is the move to the new SAS pool (writeback off), the middle row is the move off it to the existing 4-disk SATA RAIDZ1 pool, and the top row is the move back onto the new pool. The first move to it took about 7.5 minutes, the move off was just under 4 minutes, and the move back with the writeback cache on was just under 10 minutes.

I then tried using ATTO on a VM to see what I would get.

On the new pool with Writeback cache on:
SAS-ZPool-WCacheON.JPG


Writeback cache off:
SAS-ZPool-WCacheOFF.JPG


The original 4 disk SATA pool:
SATA-ZPool.JPG


Then I tested against my Promise:

Six disk RAID50:
Promise-RAID50.JPG


Five disk RAID5 SAS:
Promise-RAID5-SAS.JPG


So in the Promise, the five-disk RAID5 was a decent bit faster than the six-disk RAID50, but that could have something to do with the fact that the RAID5 is made up of HGST SAS disks, versus a mix of various 'desktop' HDDs for the RAID50.

The thing is, I don't think the ATTO tests were even making it to the disks. The benchmarks were 'identical' between the four-disk SATA pool (3TB drives - two WD Blue, a WD Red and an HGST) and the seven-disk SAS pool of HGST 7K3000 series disks, and I really didn't see any disk activity in the reporting on the FreeNAS box, so I suspect the entire test took place in RAM on the FreeNAS. I'm also not sure that even the seven-disk pool, even if set up as a non-redundant stripe instead of a double-parity RAIDZ2, could transmit or receive enough data to saturate a 10Gbps link - the four-disk RAIDZ1 certainly could not. Assuming perfect balance across all disks, each might yield a maximum of maybe 150MB/s, so perhaps 450MB/s out of that pool?
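One way to sanity-check that, rather than trusting the reporting graphs, would be to watch the disks live from an SSH session while the benchmark runs - if gstat shows the drives basically idle, the test is being absorbed by RAM/ARC:
Code:
# Show only physical disks, refreshing every second
gstat -p -I 1s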

The transfer between the pools did seem rather slow - write speeds ran between 10 and 20MB/s and read speeds between 20 and 30MB/s (per drive). Perhaps I have something configured wrong.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
I ran into similar issues years back - I ended up with a pile of HGST SAS drives that were intended for use in a large SAN. I first had to reformat them from the 528-byte sectors they came with to 512-byte sectors. Then I, too, discovered horrible performance. Enabling the write cache fixed it... but for whatever reason, the setting doesn't persist across reboots. So I added a post-init script to enable the write cache at each boot:
Code:
#!/bin/sh
# Set the write cache enable (WCE) bit in the Caching mode page (0x08) on every da device
for file in /dev/da? /dev/da??; do echo "WCE: 1" | camcontrol modepage "$file" -m 0x08 -e; done
Never had an issue with it, and it solves the issue. Is there a more elegant solution? Maybe, but I haven't found it.
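If you want to confirm the setting actually stuck after a reboot, the same mode page can be read back for every da device:
Code:
# Print the current write-cache-enable bit for each da device
for file in /dev/da? /dev/da??; do echo -n "$file: "; camcontrol modepage "$file" -m 0x08 | grep WCE; done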
 
Last edited by a moderator:

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
These also originally came from a SAN - I believe an EMC CLARiiON, based on the model the firmware reports. I discovered their 520-byte block size, re-formatted them to 512 bytes and then ran badblocks on them. Badblocks returned no errors on all of them (one of the eight I had purchased did start returning errors in badblocks and was returned). The strange thing is that enabling the writeback cache actually seems to make them slower - I did a clone from the SATA pool to the SAS pool, and with the write cache off it averaged about 20MB/s per disk. After enabling the writeback cache and starting another clone, there was a VERY brief burst up to about 35MB/s, then it dropped and averaged around 15MB/s.
WCacheoff-on.JPG


I did notice that all but two of them have 'fairly high' ECC corrected error counts, but from reading around, it sounds like those aren't generally something to be concerned about.
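For reference, those counters come from the drives' SCSI error counter logs, which can be dumped per disk with smartctl (da0 as an example):
Code:
smartctl -a /dev/da0
The 'errors corrected by ECC' columns are the ones that tend to run high on ex-SAN drives like these; 'total uncorrected errors' is generally the column worth keeping an eye on.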
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Wow - this is interesting. Either I'm doing something really wrong, or there's a HUGE difference in performance between device extents and file extents. I removed the extent and the ZVOL, created a dataset in its place, and then created a file-based extent on it (all without touching the underlying RAIDZ2 disk pool), recreated a VMware datastore and did another clone of the same VM - this one completed in about three minutes versus roughly eight minutes before, with the reporting showing a pretty significant difference in write speed.
File-Extent.JPG


The documentation claims that device extents with ZVols provide the best performance and the most features - like I said, either I'm doing something wrong, or not so much.
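For context, the two setups only differ in the backing object on the pool; from the shell they amount to roughly the following (pool name, dataset names and size are placeholders - I created the actual extents through the GUI):
Code:
# Device extent: backed by a zvol (-V sets the volume size, -s makes it sparse)
zfs create -s -V 2T tank/iscsi-zvol

# File extent: backed by an ordinary file sitting on a dataset
zfs create tank/iscsi-ds
truncate -s 2T /mnt/tank/iscsi-ds/extent0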

And even more interesting, disabling the write cache actually did seem to speed it up a bit:
File-Extent-nowcache.JPG
 
Last edited:

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Any thoughts on how I might go about testing?
I think you are doing fine with the tests you are performing. My only advice is to compare these SAS drives only against themselves, not against other systems. I think the goal here would be to test your SAS drives under different configurations, with the write cache disabled first and then enabled. I would create the script that @tvsjr provided and then just enable/disable it as needed for your testing.

Think about different ways to configure your drives and then test them. Be consistent in your testing as well. I used the Intel NAS Performance Toolkit (NASPT) when I was doing a lot of drive throughput testing to figure out what works best overall for me. You can use whatever type of testing you like, just do it all the same way. I don't have SAS drives, especially not the ones you have, so I can't do any performance testing to help you out.

@tvsjr Nice script, glad you provided it here.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
So I'm thinking now that the problem isn't with the drives or whether the writeback cache is enabled or disabled, but with FreeNAS or ZFS. Through all my testing, nothing I did could get much better than 50-60MB/s per disk - even when cloning a VM small enough that the entire thing fit into RAM on the FreeNAS box (subsequent clones did not appear to perform any reads on the source four-disk RAIDZ volume). I then blew away the entire pool and set up a six-disk pool consisting of three mirrors - after that, all the drives peaked at around 115MB/s for the same transfers.
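To take iSCSI, VMware and the network out of the equation, a local write test straight to the pool is one option (pool and dataset names are placeholders, and compression has to be off or the zeros just get compressed away):
Code:
# Throwaway dataset with compression disabled
zfs create -o compression=off tank/speedtest

# Write 20GB and note the rate dd reports at the end
dd if=/dev/zero of=/mnt/tank/speedtest/testfile bs=1M count=20480

# In a second session, watch how the writes spread across the vdevs
zpool iostat -v tank 1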

After doing some reading, it appears that RAIDZ/Z2/Z3, as implemented by FreeNAS/ZFS, supposedly does not give ANY performance benefit over a single drive. The 'good performance' I had been seeing with the four-disk RAIDZ was probably due to the fact that everything I had transferred was small enough to fit in RAM, so it could write to the disks on its own terms. Now, I'm not entirely sure that's really how it works, but if it is, it's broken. Badly. I have never before encountered a setup where striping data across multiple drives didn't give ANY performance benefit over a single drive.

Similarly, the old belief that RAID10 is far superior to any other RAID level is outdated (and NOT universally true these days). Back in the day it was true: with RAID5 you always pay some I/O penalty for writing the parity data, so you never get five drives' worth of performance out of a five-drive RAID5 array, but the bigger hit back then was generating that parity in the first place. When controllers didn't have processors as powerful as they do now, that generation was a much bigger deal. In the case of FreeNAS, or say a Compellent (which at its most basic hardware level is the same thing as FreeNAS/TrueNAS), generating those checksums is trivial compared to older standalone controllers - probably even compared to current standalone controllers like a PERC H730. No standalone controller that I'm aware of has eight 2.3GHz CPUs and 24GB+ of RAM. Take an array made up of SSDs - yes, a RAID10 should give you blistering IOPS and transfer speeds, but a RAID5 or 6 array of SSDs will also be blisteringly fast. So while the RAID10 may be crazy fast, it's very likely that a RAID5 or 6 of SSDs will still be far faster than necessary for a LOT of situations.

Enter my Promise array as an example. When I got it way back when, I was doing similar testing to figure out what kind of RAID I wanted to set up, and being familiar with the idea that RAID10 offered the best performance (at the time), I set up a six-disk RAID10 array for my VMs to run on. Performance SUCKED. BIG TIME. I switched from 7.2K drives to some 15K drives to test some more. It still SUCKED. I then built a RAID5 array with the 7.2K drives - that RAID5 array blew the doors off the 15K RAID10 array. It was like nothing I'd ever seen. So I called Promise support and explained it all to them. Their response? That's normal - the VTrak is optimized for RAID5. With further testing, I determined that RAID50 gave a bit more performance than RAID5 with those disks at the time, so that's what I've been running my VMs on. The RAID5 array on the Promise holds two large VMDKs that contain my data and media library, and with the right source and destination, I can saturate the 4Gb FC links with it.

I'm probably going to have to bite the bullet and get a few more 2TB SAS drives to build a mirrored pool large enough to give me the usable space my VMs need, but that's a terrible waste of space. For the load I'm running, RAID5 works just fine on my Promise - if RAIDZ/Z2 behaved the way RAID5/6 does on every other controller I've encountered, RAIDZ or Z2 would be fine for what I'm doing.

I just can't wrap my head around the idea that ZFS, with the memory and processing power it has at its disposal, is incapable of doing something that a PERC H710 - which you can get for $150 and has only 512MB of cache and a dual-core 800MHz PowerPC processor - can do.

I would have to imagine there's some way to tune ZFS so that it performs at least as well with RAIDZ/Z2 as any other RAID controller out there does with RAID5/6 (probably better, actually).
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
Keep in mind that ZFS is inherently different, because it's utterly paranoid about your data. I would expect other RAID controllers to be able to do things faster... but can they do it better? Not so much, if your measure of "better" involves the accuracy of your data.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
True - I'm not going to argue with you on that, because you're absolutely correct. But at the same time, ZFS has WAY more processing power and memory at its disposal than any typical RAID controller, which is why I was taken aback when I started reading that RAIDZ only offers data protection and capacity via RAID5/6-like spanning, without ANY of the performance benefit you typically get with RAID5/6 - granted, that benefit isn't always large (it depends on the controller), but you typically get SOMETHING.

If we were talking about a typical hardware RAID controller trying to do RAID5/6 calculations as well as ZFS's integrity calculations, I'd expect it to choke. But take my system - it's not the best or fanciest, but it's no slouch either. I've been trying to beat it up as best I can, and my CPU usage peaked at something like 23%, once, very briefly (it appears to have been a single poll). The system in your sig, under the same load, probably wouldn't have broken 10%. It's like we have all this processing power sitting ready to go, yet in the end a Xeon 5140 from 12 years ago probably wouldn't break much of a sweat. Personally, if it were possible, I wouldn't mind having an option to 'make my CPU hurt' and get the best possible performance from a RAIDZ/Z2 pool, versus 'play it safe' for less-capable systems (like maybe the FreeNAS Mini, which can only have so much CPU power), just to see what 'make my CPU hurt' would really do. Maybe it would genuinely hurt the CPU and not help performance - but then again, maybe pushing the CPU load way up to increase RAIDZ performance would leave the box humming along like 'is that all you've got?'.
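As an aside, per-thread and per-core usage during a transfer can be watched from the console, which makes it easier to spot a single pegged ZFS or iSCSI thread even when the overall average looks low:
Code:
top -SHIP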

Even if it gave near RAID5/6-like read performance but poor write performance, I could be wrong, but I think that could be mitigated fairly easily with a few relatively small SSDs as a SLOG (I think?), so that writes land on SSD first and get moved off to the HDDs at their leisure.

And to be fair, I did rather suspect it had more to do with ZFS which FreeNAS uses/is built on rather than FreeNAS itself.
 
Last edited by a moderator:

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
So, to continue with the performance testing, I now have ten 2TB HGST SAS disks and some more data. When I set them up as mirrors so that the pool size was just over 9TB (half the total ~20TB across the ten disks - ZFS's equivalent of RAID10), I could not get anything over about 50MB/s out of any of the disks. Then I realized I probably wasn't able to feed the pool data fast enough to stress the drives - the M4's read speed of almost 300MB/s works out to about 60MB/s across five disks, which is about what I was seeing. So I re-configured into a four-disk 'RAID10' (and later two separate four-disk 'RAID10's, which made no performance difference versus a single pool for the purposes of this testing).

I set it up with a 256GB Crucial M4 SSD in one of the empty bays, as a lonely one-disk stripe, to have a read source with 'fairly serious' speed compared to anything else in the box. That's the source for the clones of the 20GB VM I've been testing with - which, reads aside, fits in its entirety into the 72GB of RAM the system now has. Oddly, some of the clones appeared to show ZERO reads from the SSD, which would seem to indicate the data was coming entirely from RAM. I would think that, recipient disks willing, RAM-to-disk would be able to saturate the 24Gbps SAS link - with a ten-disk 'RAID10', I'd expect no trouble pushing all of them to 125MB/s+. Or maybe I'm missing something, as usual, lol.

Here's what I've found so far: file-based iSCSI extents are WAY faster than device-based extents - more than twice as fast, roughly 130-145MB/s versus about 55MB/s, doing the same clone over and over. I'm not sure why, but that's what the data seems to show. The writeback cache didn't have a massive effect on performance (certainly nothing like the difference it made with badblocks) - the file vs. device extent choice had a far greater effect. I'm also not sure why that is, when with badblocks the difference was HUGE (~550 hours and only partway through the fourth write pass, versus less than 48 hours all in).

I also switched the extents around to make sure the difference wasn't related to the disk pools themselves, like a flaky disk or something. I removed everything down to the base disk pools, set up a dataset where I had previously had a ZVOL and a ZVOL where there had been a dataset, set up the extents accordingly, and the performance difference flipped with the change - so it's not the disks.
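One more thing that might be worth ruling out is a difference in the properties of the two backing objects, since a zvol and a dataset don't necessarily end up with the same block size or sync behaviour (names below are placeholders for whatever the extents are actually backed by):
Code:
zfs get volblocksize,sync,compression tank/iscsi-zvol
zfs get recordsize,sync,compression tank/iscsi-ds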

In the chart below, da1 is one of the disks in the first pool and da2 is one of the disks in the second pool. All other disks in the respective pools have essentially identical charts.

The first two transfers shown on da1 (the ones at 125-135MB/s) are from when the file-based extent was on that pool, and the first transfer on da2 (the one centered over 15:40) is from when the device-based extent was on that pool. The large gap in between is when I was flipping the extents between the pools. Of the twelve transfers after the swap, the first two on each pool are with the write cache off, the next two are with it on, and the last two are with it turned back off. You can see on the transfers to the device-based extent that the write cache appears to have given a small boost at the very beginning, but then it actually settled down a little slower than with it off. The file-based extent behaved similarly.

DiskPerf1b.JPG


Now, the documentation claims, in section 10.5.6, that device extents with ZVols provide the best performance and the most features, but in my testing, unless I'm doing something really wrong, that is not at all the case.

Anyone have any other thoughts/feedback/etc?
 
Last edited:

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
I'm super curious about this. I'd love for one of the devs (maybe @mav@) to chime in on this issue. I've always used device-based extents, especially for VMware, as they are the only type that supports the VAAI UNMAP primitive. But the speed difference between the two seems crazy. When I have a little spare time I'll test this on my system as well.
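(If anyone wants to confirm what's actually negotiated, an ESXi host will list the per-device VAAI status - this is standard esxcli, nothing FreeNAS-specific:)
Code:
esxcli storage core device vaai status get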

Any reason you haven't upgraded to 11.1-U1? Your sig says 11.1, so wondering if you've already updated or not...a lot of fixes in that release.
 
Last edited:

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
I'm super curious about this. I'd love one of the devs (maybe @mav@) to chime in on this issue. I've always used device based extents, especially for VMware as they are the only type that supports the VAAI primitives (so says the documentation...I've never actually confirmed it). But the speed difference between the two seems crazy. When I have a little spare time I'll also test the theory on my system.

Any reason you haven't upgraded to 11.1-U1? Your sig says 11.1, so wondering if you've already updated or not...a lot of fixes in that release.

Actually, I am on 11.1-U1. Just neglected to update that line of my sig. Fixed that and added a few other bits.

If I'm doing something wrong that is causing this disparity, I'd love to be enlightened - but I'm not sure what that might be.
 
Last edited:

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
No other thoughts on the performance differences between the device extent and the file extent?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
File extents are obsolete. It's in the manual:
http://doc.freenas.org/11/sharing.html#extents

There are no good reasons to use files instead of zvols.

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
File extents are obsolete. It's in the manual:
http://doc.freenas.org/11/sharing.html#extents

There are no good reasons to use files instead of zvols.

Yes, I read that - except that in my testing, unless I'm doing something wrong, the file-based extent on a dataset is a solid two times faster than the device-based extent on a ZVol, on the same disk pool. Can anyone explain that one? Like I said before, if I'm doing something wrong that's causing the substantial disparity, I'd love to be enlightened. But as best I can tell, the 'good reason to use files instead of zvols' would be performance - again, unless I'm somehow doing something horribly wrong.
 
Last edited:

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
You should give this thread a new title. The discussion is no longer related to SAS disk cache, and many knowledgeable people who could comment are not necessarily looking at it.

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It's hard to say without combing through this, since there are a lot of moving parts, but your results are definitely unexpected.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Any experts out there have anything to contribute? As Ericloewe pointed out, the results I've seen are quite unexpected - you'd think that a device-based extent would have a performance edge over writing into a file on a filesystem, but from my testing the file extent is ridiculously faster than the device extent. As best I can tell, the file extent is twice as fast, and that's nuts. You might expect SOME difference between the two, but that's a LOT (and you'd expect the device extent to have the edge, if anything).

Another thing I've noticed is that there seems to be a bottleneck somewhere. In my testing with four-drive 'RAID10' pools (I use 'RAID10' to describe the mirror setup, understanding that it's not exactly RAID10), cloning the VM - which, as best I could tell from the per-drive read/write stats, was coming 100% from memory - peaked the drives at 125-135MB/s or so. They may be capable of more, they may not, but that's quite decent IMO. Thinking traditionally, you'd expect that expanding to a ten-drive 'RAID10' would still peak the drives at the same rate, just for less time. For example, a four-drive pool might write 20GB of data in about 80 seconds (20GB / 250MB/s, with 250MB/s being the combined write rate of two of the four drives), which isn't far off from my testing, so you'd think a ten-drive pool could write the same data in about 32 seconds (give or take). But what I saw was that the overall clone time to the larger pool didn't really change, and the peak rate to the drives dropped - a lot, down to I think 50MB/s or so each. Now, my machine isn't exactly the latest and greatest dual-socket, 24-core-per-socket, 256GB-RAM, 12Gb SAS system, but it's also not a single dual-core Xeon from forever ago, so the RAM or CPUs don't seem like the bottleneck, and in theory the single four-lane SAS link to the backplane should support more than 2GB/s between the controller and the backplane - so honestly, I'm not sure where the bottleneck could be. Even going from a single SSD to two SSDs in a 'RAID0' (for testing, to try to get the best possible throughput and stress the HDDs as much as possible - which was a FAIL; and yes, I had set them up as a stripe, not a mirror) didn't push the read rate from the ~300MB/s I was seeing from the one SSD up to ~600MB/s - it was actually more like 150-200MB/s each, for a total not much higher than the single SSD.
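One way to pin down whether the source side is the limit would be to read the VM's files back locally and see what the SSD pool alone can deliver (the path is a placeholder for wherever the datastore file actually lives):
Code:
# Read the source back and discard it; dd reports the achieved rate
# (a second run will mostly come out of ARC, so only the first pass says much about the SSDs)
dd if=/mnt/ssdpool/datastore/testvm-flat.vmdk of=/dev/null bs=1M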

I'm getting to the point where I'd like to migrate over to my FreeNAS machine, but I want to set it up right, and frankly the guides have been the opposite of helpful, mainly because they 'recommend' device extents for iSCSI while my testing shows them to be WAY slower. Let's be honest - if my testing had shown device extents to be 10% slower, with the benefit of VMware hardware acceleration support, it would be a non-issue; but the testing I did showed device extents to be WAY slower than file extents.

If I'm doing something wrong, I'm open to being schooled, but I don't know what I don't know... sooo...
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
@Stux Do you have any insight into this question?
 