SOLVED Xeon D-1541 vs E3-1231v3 in an all-flash pool?

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
I am looking at replacing my existing E3-1231v3 system with its LSI SAS2008 (in signature) with a D-1541 solution that has an on-board LSI SAS3008 controller, and I am not sure whether I'd be getting any improvement at all with this move, so I would appreciate some thoughts. This will be a brand-new pool on TrueNAS 12-Ux.x.

The D-1541 board I am looking at is the AsRock Rack D1541D4U-2T8R. I'd be flashing the SAS3008 on that board to IT mode and deploying a mirror of Optane 900p drives plus 8 (consumer) SSDs in a pool of striped 2-way mirrors, as I am concerned the SAS2008 might choke off the pool's capabilities with its IOPS ceiling (~290 kIOPS on the SAS2008 vs > 1 MOPS on the SAS3008).
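
(For reference, the crossflash itself would be done with Broadcom's sas3flash from an EFI shell or the OS - the sequence below is only a rough sketch, the firmware/BIOS file names are placeholders, and whatever procedure AsRock Rack documents for this specific board should take precedence.)

Code:
# confirm the SAS3008 is visible and note the current firmware/mode
sas3flash -listall
# write the IT-mode firmware and (optionally) the boot ROM - file names are placeholders
sas3flash -o -f SAS3008_IT.bin -b mptsas3.rom
# verify the controller now reports IT firmware
sas3flash -listall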

The board is perfect in almost every respect and I'd be pulling the trigger already were it not for the large disparity in core speeds (the E3 clocks at 3.4 GHz vs 2.1 GHz on the D-1541). Since this will be a pool serving iSCSI zvols all the way, essentially backing Kubernetes VM nodes, I just don't know how much of an impact such a drop in core speed will have on latency and IOPS (I am not concerned about the network, as this will have a SolarFlare 10G SFP+ NIC). I do know that single-threaded processes (such as SMB) would suffer and that scrubbing might be slower (?), but I have found very little discussion on what iSCSI zvols favour - core speed or core count?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Replacing a 3.4GHz CPU with a 2.1GHz CPU seems like a losing game.

iSCSI is in-kernel and very efficient, and unless you're trying to push full 10G, I doubt you'd notice a difference. It seems like picking up an LSI 3008 for your E3 system might be a better way to go, or even just try it with the 2008 and see if it's a problem.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Thanks - I suspected as much.

What would be a good way of testing the limits of that SAS2008 on that pool?
I am thinking of tools like `fio`, `zpool iostat`, and `gstat -p`, and of simulating large concurrent writes from multiple consoles with `dd`. Is there a better way?
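
Roughly what I had in mind (only a sketch - the pool name and file paths are placeholders, and the dataset would need compression off or a non-zero data source for the write numbers to mean anything):

Code:
# one console: per-disk busy% and latency
gstat -p
# another console: per-vdev throughput on the pool
zpool iostat -v tank 1
# a few more consoles: large sequential writes into the pool
dd if=/dev/zero of=/mnt/tank/scratch/testfile1 bs=1m count=32768
dd if=/dev/zero of=/mnt/tank/scratch/testfile2 bs=1m count=32768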
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Unless your nodes appreciate the higher core count and/or more (RDIMM) RAM, that's not a good move.
Note that the D1541D4U-2T8R has two Intel 10 GbE ports on-board (no need for a SolarFlare!), but that the two M.2 slots are x1: fine for boot, not for NVMe SLOG/L2ARC.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Mind you - the nodes will not be deployed on this system; this will be a SAN for those nodes. What I am suspecting/hearing still holds, though - that the large drop in frequency won't be offset by the extra core count for iSCSI zvols?

I've got a 10G fibre switch, not copper - I know there is an RJ-45 -> SFP converter, but short of media converters, were you thinking of some kind of an inverse converter, i.e. SFP -> RJ-45? I haven't come across these.

M.2 slots would be boot only - the plan was to connect the Optane mirror to a dual-slot NVMe card (with a complicated U.2 -> SFF-8643 -> M.2 adaptation route) and put that card into the x8/x16 slot. Any reason why that's a bad idea?
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I've got a 10G fibre switch, not copper - I know there is an RJ-45 -> SFP converter, but short of media converters, were you thinking of some kind of an inverse converter, i.e. SFP -> RJ-45? I haven't come across these.
No, I just missed the "SFP+" part.
You misspelled "two crappy copper 10GbE ports on-board (totally need a better network option)"
To complete my education here, is "crappy" related to "copper" or to having a PHY working with the D-1541 SoC rather than a full-featured NIC?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
10GBASE-T is a crappy technology. It burns watts, has increased latency, doesn't support PoE, and even though it has been around for about a decade, it hasn't seen significant uptake. I don't have a single resource that documents all the ways in which it sucks, but if you look for articles on the forum by me containing the term "10GBASE-T", you will probably find lots of relevant tidbits.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Thank you all for chiming in - I won't be pursuing this hardware combination. It seems something like Supermicro X10SRi + E5-1620 v4 might be a better choice for not much more money.

Would appreciate any advice on how to test if that SAS2008 is an IOPS bottleneck.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Would appreciate any advice on how to test if that SAS2008 is an IOPS bottleneck.

Keeping it simple:

If you can arrange a spare SSD to be attached to it, use that as your testing baseline. This might require moving one of your pool disks to a mainboard port temporarily or something like that.

Using ssh or the actual console, and NOT the web UI:

When the system is largely idle, run a "dd" from the raw disk device to /dev/null with a large blocksize, i.e.

Code:
# dd if=/dev/da8 of=/dev/null bs=1048576
load: 1.01 cmd: dd 97503 [physrd] 5.01r 0.00u 0.10s 0% 3268k
521+0 records in
521+0 records out
546308096 bytes transferred in 5.022194 secs (108778781 bytes/sec)
^C848+0 records in
848+0 records out
889192448 bytes transferred in 7.810964 secs (113839011 bytes/sec)
#

You can use control-T to get the mid-run status report, or control-C to stop the command. This is entirely harmless to a disk as long as you are making sure to READ from the disk ("if" is the input device). My example uses a HDD, so you can see the speed isn't stellar.

So once you are feeling comfortable with that, you can do a few things to test:

1) Note the dd speed when the system is largely idle

2) Start a scrub of your pool and then note the speed

3) Move the test SSD to a mainboard SATA port and see how much faster that dd speed is.

If there is a significant difference between 1 and 2, like more than, let's say, a 10% reduction, your HBA is bottlenecking (as opposed to just being busy due to lots of I/O). This may not cause problems during normal non-scrubbing NAS operations, but ZFS does do periodic scrubs.

I expect there to be some difference between 1 and 3, and it may even be as large as maybe 25%. The SAS2008 was designed for HDD attachment and the CPU isn't really up to snuff for modern "full SATA speed" SSD's. Whether this bothers you is really up to you. The aggregate performance of the SAS2008 across its 8 6Gbps lanes is 48Gbps, so even if it were to be underperforming by 50%, it should still be able to cope reasonably with a 10Gbps ethernet network connection.
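
(If you want to stress the HBA itself rather than a single drive, one further variation - just a sketch, assuming the pool disks show up as da0 through da7 - is to read from all of them at once and compare the per-disk numbers against the single-disk baseline:)

Code:
# reads only, so harmless; each dd prints its own throughput when it finishes
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    dd if=/dev/$d of=/dev/null bs=1048576 count=8192 &
done
wait
# if each disk drops well below its solo speed, the controller (or its PCIe
# link) is the aggregate bottleneck rather than the individual drives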
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Thanks for this - very useful! Finding a spare SSD won't be a problem. I'll share the results when I have something.
I suspect any differences/bottlenecking shown by a single SSD would be amplified with a set of disks?

Related to this - I've found what seems to me like an excellent fio write-up by the resident mercenary_sysadmin - attaching the link here.
Also, I found an interesting comparison of a hefty slew of HBA controllers' throughputs during "parity check" on a neighbouring forum - attaching the parts relevant to this discussion down below (note - the post is from 2015, but it seems relevant for these SAS chipsets at least).

[attached image: HBA throughput comparison during "parity check"]


I'm not exactly sure what kind of workload this particular test runs (the details of the tests and of the test platform itself are pretty skimpy), but it seems to me like a combination of sequential reads + CPU parity calculation.

I am surprised to see such a drop in speeds with the increase in lane occupation on these SAS controllers - aren't all lanes supposed to get full speed (which is 6 Gbps ~= 700 MB/s)? Is this related to the controller itself or to SAS (full-duplex) vs SATA (half-duplex) protocol differences?

One other interesting thing to note is the impact of the DMI (the Intel bus between the PCH/southbridge and the CPU) generation (1.0 vs 2.0 vs 3.0) - in particular, DMI 2.0 has a max bandwidth of 2 GB/s, yet it seems like 4 drives connected to it are capped at 270 MB/s each (which, in aggregate, is a little more than half of the nominal bandwidth). Not sure if that test is relevant only for SATA-2 drives and whether four (4) SATA-3 drives would get the full bandwidth at all times when connected to that bus.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I am surprised to see such a drop in speeds with the increase in lane occupation on these SAS controllers - aren't all lanes supposed to get full speed (which is 6 Gbps ~= 700 MB/s)? Is this related to the controller itself or to SAS (full-duplex) vs SATA (half-duplex) protocol differences?

No. As wonderful as it would be if everything went full speed all the time, what's actually happening with an LSI RAID controller is that you have a MIPS-flavored Raspberry PI-like device attached to some HDD ports. It communicates with the device driver on the host platform, moving data back and forth. When the host says "Read block 2414892 of virtual disk 2", it does some quick computation to determine which physical HDD and LBA that is, issues a read command, reads the data, and then feeds that to the host's driver. If there was an error, it attempts to read the data from the RAID redundancy. Maybe replace a drive if it's failed. Etc.

I've explained LSI RAID in this manner so that you can see that there is no ACTUAL direct path between the host and the HDD ports. The host cannot talk to the disks DIRECTLY.

HBA IT mode uses a different firmware, so that communication is parroted through the LSI's CPU. The host driver says "I want to read LBA 231324 of physical disk 5" to the HBA CPU, then the HBA CPU issues that actual read command down the correct SAS lane, the drive responds with the data (or an error), and the HBA CPU parrots that back to the host.

This effectively nullifies the fancy RAID features that the LSI CPU could be doing, but still allows it to do stuff like SAS management. This is why you can hook 24 HDD's to an LSI 9211-8i via an SAS expander.

This design comes at the cost of the LSI CPU "touching" every I/O back and forth to the actual drives. Therefore, there is an ultimate limit for performance -- whatever point the LSI CPU tops out at.

Now, LSI did make sure that the 2008 could handle reasonable common workloads such as 8 directly attached hard drives, but it has been known for a long time that the 2008 is not actually capable of shoveling around the 48Gbps that might potentially be required if you hooked up 100 HDD's on SAS expanders. Or 8 SSD's speaking at full 6Gbps, but that's more of a recent development with the advent of full-speed SATA SSD's.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
This makes perfect sense to me - I was wondering why an HBA in IT mode would have its CPU involved in block translations (and, by extension, in each disk I/O operation).

I've just had a thought - would it be useful to split the load between the SAS2008 HBA and the on-board HBA controller?
Since I'll be splitting the 8 SSDs into 4 vdevs of 2-way mirrors (with some OP), what if I had one half of each mirror sit on the SAS2008 and the other half on the on-board HBA?
Conveniently, the X10SLM+-F board has four Intel HBA ports (to which I'd connect the other half of each mirror) and two Marvell-powered ports (that I'd use for the boot mirror).
An added benefit of doing this would be splitting the I/O queue between two HBAs.
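
Something along these lines is what I have in mind (a sketch only - pool and device names are made up, with the adaX disks on the Intel SATA ports and the daX disks on the SAS2008; on TrueNAS the pool would actually be built through the UI, but the layout is the point):

Code:
# each mirror vdev pairs one disk on the on-board Intel ports (adaX)
# with one disk on the SAS2008 (daX), so no single controller holds
# both halves of any mirror
zpool create tank \
    mirror ada0 da0 \
    mirror ada1 da1 \
    mirror ada2 da2 \
    mirror ada3 da3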

Would this even be a worthy test case?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Let's also not forget: The SAS2008 is a PCIe 2.0 x8 device (sometimes even x4). That means it's limited to an absolute maximum of ~4 GB/s (8 x ~500 MB/s).

That's why the SAS2308 exists: PCIe 3.0 and a beefed-up CPU for extra performance with SSDs. Also a couple of new features that don't really matter much, and support for more devices (something like 1024 instead of 128).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This makes perfect sense to me - I was wondering why an HBA in IT mode would have its CPU involved in block translations (and, by extension, in each disk I/O operation).

Which is why I thought it worth the time to pound out that long-winded explanation.

I've just had a thought - would it be useful to split the load between the SAS2008 HBA and the on-board HBA controller?

Certainly. That significantly ups the game.

The SAS2008 is a PCIe 2.0 x8 device (sometimes even x4).

This is also true, but overall, PCIe 2.0 x8 is not horribly slow, so it is important to understand that you can still extract value by simply being aware of the limitations. I feel that the performance bottleneck @dxun is seeking will happen at a point well before the PCIe 2.0 x8 speed limit becomes a factor.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
In an effort to test the performance deltas between controllers, I've finally had a chance to sit down and dive a bit deeper into `fio`. So far, I've gotten results that make me question whether I am running these tests even half-decently, so I thought it reasonable to ask here before I invest more time.

I've started running several test cases (inspired by JRS articles on Ars Technica and some other sites I deemed useful) on an SSD in the ZFS pool, but the numbers were so surprisingly low that I wanted to see how a decent single NVMe drive would behave.

For this test, I have a Sabertooth X99 with i7-5820k and a Samsung 980 PRO 1TB connected via M.2 slot - it's a PCIe Gen3 x4 connection. The disk is in steady-state with roughly 240 GB free space. OS is Windows 10, latest stable build.

I'll be posting the test cases with full command-line input, as I'd like to know whether I am making a mistake somewhere or the numbers are really supposed to be this low. Also, I was doing a manual trim after each test (with the "Optimise Drives" app built into Windows).

Code:
1. [Single 4KiB random write process] fio.exe --name=random-write --ioengine=windowsaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
2. [16 parallel 64KiB random write processes] fio --name=random-write --ioengine=windowsaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
3. [Single 1MiB random write process] fio --name=random-write --ioengine=windowsaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
4. [Random Read/Write Operation Test - 75/25] fio --randrepeat=1 --ioengine=windowsaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
5. [Random Read/Write Operation Test - 50/50] fio --randrepeat=1 --ioengine=windowsaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=50
6. [Random Read/Write Operation Test - 25/75] fio --randrepeat=1 --ioengine=windowsaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=25


And here are the matching results for each run (condensed output).

Code:
1. write: IOPS=43.9k, BW=171MiB/s (180MB/s)(12.1GiB/72082msec); 0 zone resets
   WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=12.1GiB (13.0GB), run=72082-72082msec
 
2. IOPS => (min 3.8k - max 9 k) => avg ~6 k
   WRITE: bw=6443MiB/s (6756MB/s), 239MiB/s-558MiB/s (250MB/s-586MB/s), io=387GiB (416GB), run=61128-61569msec
 
3. write: IOPS=285, BW=286MiB/s (300MB/s)(16.8GiB/60303msec); 0 zone resets
   WRITE: bw=286MiB/s (300MB/s), 286MiB/s-286MiB/s (300MB/s-300MB/s), io=16.8GiB (18.1GB), run=60303-60303msec
 
4. read: IOPS=30.2k, BW=118MiB/s (124MB/s)(6141MiB/52134msec)
   write: IOPS=10.1k, BW=39.3MiB/s (41.2MB/s)(2051MiB/52134msec); 0 zone resets
 
   READ: bw=118MiB/s (124MB/s), 118MiB/s-118MiB/s (124MB/s-124MB/s), io=6141MiB (6440MB), run=52134-52134msec
   WRITE: bw=39.3MiB/s (41.2MB/s), 39.3MiB/s-39.3MiB/s (41.2MB/s-41.2MB/s), io=2051MiB (2150MB), run=52134-52134msec
 
5. read: IOPS=22.7k, BW=88.5MiB/s (92.8MB/s)(4098MiB/46285msec)
   write: IOPS=22.6k, BW=88.4MiB/s (92.7MB/s)(4094MiB/46285msec); 0 zone resets
 
   READ: bw=88.5MiB/s (92.8MB/s), 88.5MiB/s-88.5MiB/s (92.8MB/s-92.8MB/s), io=4098MiB (4298MB), run=46285-46285msec
   WRITE: bw=88.4MiB/s (92.7MB/s), 88.4MiB/s-88.4MiB/s (92.7MB/s-92.7MB/s), io=4094MiB (4292MB), run=46285-46285msec
 
6. read: IOPS=9892, BW=38.6MiB/s (40.5MB/s)(2050MiB/53059msec)
   write: IOPS=29.6k, BW=116MiB/s (121MB/s)(6142MiB/53059msec); 0 zone resets
 
   READ: bw=38.6MiB/s (40.5MB/s), 38.6MiB/s-38.6MiB/s (40.5MB/s-40.5MB/s), io=2050MiB (2150MB), run=53059-53059msec
   WRITE: bw=116MiB/s (121MB/s), 116MiB/s-116MiB/s (121MB/s-121MB/s), io=6142MiB (6440MB), run=53059-53059msec


The reason I am suspecting misconfiguration here is that the IOPS are so far off the stated specification that either Samsung is criminally inflating their spec or my test cases are completely flawed. The 980 PRO has a stated "up to 1 MOPS" for random reads and writes - the highest IOPS I've seen is around 43 k, which is not even 5% of the stated maximum, so surely I am doing something wrong?

On the other hand, if these tests are representative of actual physical disk capabilities (and the numbers Samsung states are just a brazen marketing lie), then I question the purpose of even considering anything better than an LSI 2008 for an all-flash pool of consumer SSDs. If the numbers are this meagre with a single NVMe drive that seems to be a state-of-the-art consumer SSD, then what hope is there of ever reaching the stated limit of 300 k IOPS that the LSI 2008 is supposed to be capable of delivering with an array of SATA SSDs?

Is what I am seeing perhaps a limitation of the onboard controller I am using on X99?

Lastly, if my tests are flawed and the stated IOPS numbers can be reached, how do I do that?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The 980 PRO has a stated "up to 1 MOPS" for random reads and writes - the highest IOPS I've seen is around 43 k, which is not even 5% of the stated maximum, so surely I am doing something wrong?

I'm sure that in some testing scenario designed by the people who built the thing and who optimized the heck out of it, then ran the test, tied a rocket to it, and fired it downwards from the ISS towards Earth, a 980 PRO probably momentarily blipped up to speeds of 1M IOPS. And while we're at it, I'm sure that they counted an "IO" as a 512-byte sector, and that they requested large contiguous ranges, and that the large contiguous ranges had been seeded into the controller in an optimal manner too.
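
(Purely as an illustration of that point - not a recommendation - vendor IOPS figures are also quoted at very high parallelism, so something along these lines is much closer to their test conditions than a single job at queue depth 1; the file name and sizes are placeholders:)

Code:
fio --name=highqd-randread --ioengine=windowsaio --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=8 --group_reporting --size=8g --runtime=60 --time_based --filename=testfio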

Putting things into a full stack usually doesn't work that well. I talk about "laaaaaatency" in the SLOG/ZIL post, and while this is not the same thing, many similar issues are involved. You lose significant speed by adding layers of complexity. Each layer can have its own mitigation strategies to minimize the speed lost at that layer, but taken in aggregate, you're still going to lose out.

Let's take FreeNAS completely out of the equation for a moment. I have an ESXi hypervisor host with three development VM's on a 980 PRO. There is nothing going on on the datastore (but I can't stop the hypervisor's other VM's which are mildly busy).

A FreeBSD VM with thin provisioned space writes zeroes at 865MBytes/sec, reads it back at 1107MBytes/sec, and overwrites it at 870MBytes/sec.

So where's the 7GBytes/sec read speed? It's lost due to ESXi overhead and suboptimal data placement. I am not entirely disappointed with the numbers above, because I am also aware that I can run two VM's in parallel and get some additional speed out of it. This is definitely better than the 500MBytes/sec peak I see with SATA SSD's, but certainly nowhere near the potential of the NVMe.

You are never going to get a million IOPS out of this drive on a TrueNAS system. There's just way too much stuff in between clients and the flash.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Agreed 100% and just to be clear - I wasn't expecting to. But I also didn't expect 5% of that maximum.

The issue is reasoning about these data - I don't dispute the accuracy of fio, but then again, I have no good background to reason about IOPS in general.

I'd just like to hammer the crap out of my pool and see how it behaves under _extreme_ duress, but with the IOPS numbers I am seeing on a drive that I _expected_ to be far faster than my whole pool of 8 SATA SSDs, I am questioning whether I'll even see any IOPS deltas once I switch through these scenarios:
- 1 SATA SSD drive on on-board HBA
- 1 SATA SSD drive on LSI 2008
- 1 SATA SSD drive on LSI 3008
- 4 SATA SSD drives on on-board HBA + 4 SATA SSD drives on LSI 2008
- 8 SATA SSD drives on LSI 2008
- 8 SATA SSD drives on LSI 3008
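
One way to keep those runs comparable might be to reuse the exact same fio invocation for every configuration and change only the target device - a rough sketch (device path, block size and parallelism are placeholders; random reads against the raw device are non-destructive, but keep write tests away from pool members):

Code:
fio --name=hba-compare --filename=/dev/da0 --ioengine=posixaio --rw=randread --bs=4k --iodepth=32 --numjobs=4 --group_reporting --runtime=60 --time_based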

Any advice/thoughts?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
but then again, I have no good background to reason about IOPS in general.

That's fine, I'm pretty sure most of the people who work in storage don't either. I see all sorts of ridiculous claims and inconsistencies. Some people define the IO in IOPS to mean a 512 byte sector, others a 4K sector, yet others a larger stripe (as in a RAID), while even others use it to mean a seek. I tend towards use of seek since it is the pessimistic value in all cases. :smile: But you can get pretty wack very quickly talking to storage vendors who want to misrepresent. Oh, hey, and suddenly we're here in this thread. Welcome to "what does IOPS mean" hell.

If we define IOPS to mean operations to non-contiguous locations, then for HDD's that involves a seek, and for SSD's, it means retrieval from flash. This is not always the most pessimistic case, because the data might have been in cache, and for HDD's, a long start-to-end-of-disk seek takes longer than a next-track-over seek, but it lets us establish that even the fastest HDD's are only reliably capable of a few hundred IOPS. SSD's will typically be capable of "much" more due to their lack of a need to seek. These are what *I* think of when I think of IOPS. It doesn't really matter too much to me if a HDD does a seek and gets 512bytes or does a seek and gets 64KBytes, either way, it's really the seek that is the constraining item. In that same way, when people get all OCD about whether their HDD is capable of 200MBytes/sec, I'm not that impressed, because that will only ever happen for true sequential workload at the edge of the platter, and in my experience, that's not really all that realistic. I'm more interested in the pessimistic 250 IOPS * 512bytes = 128KBytes/sec minimum throughput a HDD can "guarantee", and where between that and the 200MBytes/sec a realistic workload will land you. There's a crash course in the meaninglessness-of-IOPS-as-used-by-industry and some glimpse of a more reasonable practical interest in the topic. :smile:

The LSI 2008 vs 3008 is likely to be a very noticeable difference. The 2008 was created when HDD and PCIe 2 ruled, and so while it has theoretical maximum 6Gbps per SAS channel throughput, in practice when under full load it's maybe 2/3rds that, some people have spotted it at half or three quarters or whatever. This is typically still fine if you're only using it to connect 8x HDD, which never exceed 3Gbps per channel. So it's popular for SOHO/hobbyist use.

The 2308 was a significant improvement in CPU and PCIe 3, and the 3008 was an improvement on THAT, to be able to support 12Gbps. I would expect the ability to fully squeeze 8x SATA SSD's for all they are worth out of a 3008. Due to the nature of overhead, even that might be just a bit shy of what you can get out of Intel PCH SATA, but it should be really close.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Due to the nature of overhead, even that might be just a bit shy of what you can get out of Intel PCH SATA, but it should be really close.
Depends on the PCH, especially since other things are hanging off the PCH. Less-than-recent ones used a PCIe 3.0 x4 link to the CPU [32 Gb/s nominal], and something like a C236 takes up to 8 SATA 6 Gb/s ports [48 Gb/s nominal], whereas an SAS2308 or SAS3008 can do PCIe 3.0 x8 [48 Gb/s nominal]. More recent PCHs can go to PCIe 3.0 x8 or even PCIe 4.0 x8, but that stuff is almost too new to be relevant.
Not that it's going to make a huge difference either way: a handful of decent NVMe SSDs will just destroy anything SATA can muster, with less complexity, if you can get the PCIe configuration to support it.
 