SOLVED Help me find my bottleneck?

Status
Not open for further replies.

Greg10

Dabbler
Joined
Dec 16, 2016
Messages
24
Please help me determine why my IOmeter test configuration never gives me more than 31-33,000 total IOPS.

The specs of the drives (4x 250 GB EVO 850 SSDs in RAID-0) say they are supposed to deliver "up to" 98,000 IOPS, so where is my bottleneck?

I've just stood up a FreeNAS box and connected to it via iSCSI. I created four extents and mounted them on a Windows Server 2016 box. I have tried tests with 2, 3 and 4 SSDs in RAID-0, with a SATA II (3 Gb/s) RAID card and with a SATA III (6 Gb/s) card. I have tried varying the transfer size from 512 bytes to 16 KB, and I have tried both 100% random reads and 100% sequential reads.


Here's my setup:

FreeNAS box:
AMD Athlon II X4 640 Processor (4 cores @ 2+GHz)
MSI 870-G45 Motherboard
8GB RAM
Samsung Evo 850 OS drive
IBM M1015 flashed to IR mode
4x EVO 850 250 GB drives in RAID-0
2-port Broadcom NIC

Windows Server box:
Dell Precision T5400
2 Xeon E5405 CPU @ 2GHz
20GB RAM
2-port Broadcom NIC


Each box is set up with the on-board NIC as a management interface, and both ports on the Broadcom NIC are set up in individual VLANs (so iSCSI1 is on 192.168.50.x and iSCSI2 is in a different VLAN on 192.168.51.x).

FreeNAS is set up with both NIC ports in one target portal, and the Windows box is set up using MPIO across both iSCSI interfaces with a round-robin policy. The RAID card and the NIC are in the PCIe x16 slots on the motherboard, and the four SSDs are in straight passthrough mode.

When I run the IOmeter test, CPU on the FreeNAS box is around 30%, wired memory is less than 1 GB, and there is no swap usage. Network utilization on FreeNAS is about 36 Gb on each iSCSI interface.

On the Windows box, CPU utilization is around 20%, memory usage is less than 2 GB, and network traffic across both iSCSI interfaces matches the FreeNAS box at about 36 Gb each.

IOmeter is set up with 8 workers (1 for each core on the Dell), each with the following worker specification:
100% read
100% random
512 B transfer size
Queue depth of 32
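
(For anyone who wants to reproduce this from a shell instead of IOmeter, a roughly equivalent fio job run directly on the FreeNAS box might look like the sketch below - the target device name is just a placeholder.)

# Rough fio equivalent of the IOmeter job above (sketch only):
# 100% random reads, 512 B blocks, 8 workers at queue depth 32.
# /dev/da1 is a placeholder device name - substitute one of the SSDs.
fio --name=randread512 --filename=/dev/da1 \
    --rw=randread --bs=512 --ioengine=posixaio \
    --iodepth=32 --numjobs=8 \
    --runtime=60 --time_based --group_reporting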
 

snaptec

Guru
Joined
Nov 30, 2015
Messages
502
Don't use IR mode. You need IT.
Don't use RAID, or do you mean stripes?
How many IOPS do you get on FreeNAS itself?
Did you run iperf tests?


Sent from my iPhone using Tapatalk
 
Last edited by a moderator:

Greg10

Dabbler
Joined
Dec 16, 2016
Messages
24
Hey Snaptec,

Thank you for your response.

Don't use IR mode. You need IT

According to this article, IR mode delivers passthrough just like IT mode, but you also get RAID options. IR mode vs. IT mode shouldn't make a difference if the disks are in passthrough mode.

Even if they did, for another test I used an entirely different card (an HP P420 with 1 GB of flash-backed write cache) with the drives configured as four single-disk RAID-0 arrays (to simulate JBOD mode) as well as a single RAID-0 array, with similar results. Therefore I have a hard time believing that IR vs. IT mode will make any substantial difference in performance.
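
(Side note: if anyone wants to confirm what firmware a card like the M1015 is actually running, LSI's sas2flash utility can list it - this is just a sketch from memory, assuming sas2flash is installed on the system.)

# List the LSI SAS2 controllers sas2flash can see, including firmware
# version (the product string shows whether it is IR or IT firmware).
sas2flash -listall

# More detail on the first controller:
sas2flash -list -c 0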

Don't use raid or do you mean stripes?

RAID-0 is striping.

How many iops do you get on FreeNAS itself?

Good question. Is there a way to pull statistics from the FreeNAS box itself? I'm attaching volumes to a Windows box and running IOmeter there because that's the only way I could figure out how to pull the information.

Did you ran iperf tests?

No, but in other IOmeter tests using larger transfers (16 KB) I was able to saturate both iSCSI links, with bandwidth utilization hitting approximately 980 Mb/s on each. During the 100% read / 100% random / 512 B / QD32 test, the iSCSI links reached no more than 70 Mb/s sent / 11 Mb/s received.

So it isn't the iSCSI network blocking things.
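
(If I do run iperf later, I'd expect it to look something like the sketch below, testing each iSCSI VLAN separately - the host addresses are placeholders, and it assumes iperf3 is available on both boxes.)

# On the FreeNAS box: start a listener.
iperf3 -s

# On the Windows box: test each iSCSI path on its own.
# 192.168.50.x / 192.168.51.x are the two iSCSI VLANs from my setup;
# the exact addresses below are placeholders.
iperf3 -c 192.168.50.10 -t 30 -P 4
iperf3 -c 192.168.51.10 -t 30 -P 4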
 
Last edited:

snaptec

Guru
Joined
Nov 30, 2015
Messages
502
First of all, it may not be your problem, but using RAID cards with FreeNAS is the worst thing you could do.

ZFS can do really strange things when you simulate devices instead of giving FreeNAS total control of your disks.

Please read the Hardware Recommendations guide. Nevertheless:

With my RAID/stripes question I wanted to make sure that you are using ZFS stripes. As you said, you are using RAID. That is not supported.

You can test IOPS on FreeNAS itself, e.g. with the fio tool from the shell.
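
A minimal sketch of what I mean (the dataset path and size are placeholders - point it at your own pool):

# Quick random-read IOPS test against a file on the pool.
# /mnt/tank is a placeholder dataset path.
fio --name=iopstest --directory=/mnt/tank --size=4G \
    --rw=randread --bs=4k --ioengine=posixaio \
    --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting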
 
Last edited by a moderator:

Greg10

Dabbler
Joined
Dec 16, 2016
Messages
24
With my RAID/stripes question I wanted to make sure that you are using ZFS stripes.

I am using stripes. I had the alternate setup on the HP card to see if there was a performance implication, and it turned out there wasn't. If RAID-0 on the HP card, JBOD on the HP card and striping on the M1015 in passthrough mode all produce similar results, then the bottleneck isn't in the disk subsystem.
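
(To be explicit about what I mean by stripes: the pool is just the four SSDs as four top-level vdevs, i.e. something like the sketch below - the device names are placeholders, and FreeNAS normally builds this from the GUI using gptids rather than daN names.)

# A plain four-disk stripe in ZFS: four top-level vdevs, no redundancy.
# da1-da4 are placeholder device names.
zpool create tank da1 da2 da3 da4
zpool status tank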

You can test IOPS on FreeNAS itself, e.g. with the fio tool from the shell.

Thanks! I'll try this and see what it reveals.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
Keep in mind that IR mode often doesn't let the system monitor the drives properly, etc. You should *not* be using a card that doesn't function 100% as an HBA - no RAID functionality.
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
AMD Athlon II X4 640 Processor (4 cores @ 2+GHz)
MSI 870-G45 Motherboard
8GB RAM
You're not going to get good iSCSI performance on this setup. Typically at least 32 GB of RAM and Intel NICs are recommended.
Don't use IR mode. You need IT
Keep in mind that IR mode often doesn't let the system monitor the drives properly, etc. You should *not* be using a card that doesn't function 100% as an HBA - no RAID functionality.
False. IR mode is the best of both worlds: it lets you use the HBA's native RAID functionality, while at the same time unassigned disks are presented as normal, allowing both ZFS and SMART to use them.
 

Greg10

Dabbler
Joined
Dec 16, 2016
Messages
24
So I ended up flattening the box, installing Windows on it and running IOmeter locally, thereby eliminating any potential iSCSI or networking bottlenecks.

Running IOmeter with the M1015 in JBOD with software RAID (as a storage group) and with the HP P410 in four single-disk RAID-0 arrays as well as a single RAID-0 array still produced results in the 30-33,000 IOPS range.

I looked up the specs on both cards as well as the PCIe x8 specification, and they should allow far more IOPS and throughput than these tests suggest.

I'm in the process of reinstalling FreeNAS and will run fio from the shell to see what it shows, but it looks like no amount of fiddling will get me past 33k read IOPS.

I'd really like to know what is keeping me from the advertised 98,000 IOPS. I understand that it is marketing, but still, I'm getting less than a third of the advertised performance.
 

JustinClift

Patron
Joined
Apr 24, 2016
Messages
287
Looking through this, one thought pops up. Are you able to double-check whether your HBA is connected at x4 or x8 in its PCIe slot?

I can give you the steps to check this on Linux if it helps. Not sure how on BSD (yet), nor Windows.
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
4x EVO 850 250 GB drives - those average about 9k IOPS each, so four of them should give you a peak of a little under 40k IOPS.

Oh, you mean the advertised IOPS, lol. I have another post on this forum testing Samsung NVMe (950 Pro) vs. Intel NVMe (DC edition) drives. The Samsung drives are consumer grade, and the benchmarks only test non-sync writes, so yes, they do get 100k IOPS on non-sync writes for very short periods of time.

The Samsung chips have 1-4 GB of RAM that is NOT battery backed, so when you benchmark you're really testing the speed of a stick of RAM. After you write about 1 GB of data to it really fast, the performance statistics collapse like an overcooked soufflé.

When you use it for 'real' storage, you tell the chip: hey, I want to make sure this is committed to disk before you return. Now the chip has to do a read/modify/write of some chunk of its internal data, because these SSDs (3D V-NAND or whatever they call it now) have multiple data cells stacked, and changing a single bit requires rewriting the entire stack of cells. For non-synced writes this gets caught by the RAM, which then flushes it 'whenever' (or tries to combine writes) - kind of like a ZFS ZIL. For storage/ZFS use you want to make sure your data is actually on the disk before promising your 'clients' that it's there; if the power gets cut or the system crashes, you don't want to lose the data.

If you want the promised 100k performance, use the Intel NVMe DC edition; they DO get 100k IOPS on synced writes. They're also almost an order of magnitude more expensive, but their RAM has a large capacitor that can finish writes even if power gets cut, and they aren't built to save cost - they're built for performance, so the cells are smaller (meaning fewer rewrites when something changes). I don't think there are any SLC SSDs anymore - they're all MLC - but if you really want good, steady performance, an SLC drive would be even better.
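
(If you want to see the difference for yourself, a pair of fio jobs like the sketch below makes it obvious - the same random-write workload run buffered/async and then with every write synced. The file path and size are placeholders.)

# Async/buffered random writes - roughly what the marketing numbers
# measure; the drive's RAM cache absorbs these for a while.
fio --name=async-write --filename=/mnt/tank/fio.test --size=4G \
    --rw=randwrite --bs=4k --ioengine=posixaio --iodepth=32 \
    --runtime=60 --time_based --group_reporting

# Same job, but every write must be committed before it returns
# (--sync=1 opens the file O_SYNC). Expect far lower IOPS on a
# consumer SSD.
fio --name=sync-write --filename=/mnt/tank/fio.test --size=4G \
    --rw=randwrite --bs=4k --ioengine=posixaio --iodepth=32 \
    --sync=1 --runtime=60 --time_based --group_reporting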
 
Last edited:

snaptec

Guru
Joined
Nov 30, 2015
Messages
502
Please also tell us which options you are running fio with - that makes a huge difference.
 

Greg10

Dabbler
Joined
Dec 16, 2016
Messages
24
Looking through this, one thought pops up. Are you able to double-check whether your HBA is connected at x4 or x8 in its PCIe slot?

I can give you the steps to check this on Linux if it helps. Not sure how on BSD (yet), nor Windows.

This gave me some food for thought about PCIe lanes and the like.

So I went back to my rig, pulled out every PCIe card except the HBA, and moved the card to PCIE_2 (the first x16 slot). I ran the 100% read / 100% sequential / 512 B test across four individual drives and the result jumped to ~65k IOPS!

So clearly there is some kind of bottleneck in the PCIe subsystem.

JustinClift, I'd love to know the Linux way to tell how a card is connected. It might help me figure out how to do it in Windows.


edit: Adding my 2-port NIC to the second available PCIe slot dropped the test to ~63k IOPS, even though it wasn't carrying any network traffic. But since adding another card affected IOPS, it's got to be the PCIe subsystem causing the bottleneck.

edit2: Sending traffic across the NIC drops IOPS even further. Interesting to think that I am saturating the PCIe interconnect of this motherboard. I may need to rethink this combination of hardware. Or build it anyway and be done with it. :D
 
Last edited:

JustinClift

Patron
Joined
Apr 24, 2016
Messages
287
JustinClift, I'd love to know the Linux way to tell how a card is connected. It might help me figure out how to do it in Windows.

It turns out to be the same command, lspci -vv, in both FreeBSD and Linux. I'd just forgotten about it since it had been so long since I needed to use it. ;)

"lspci" is part of the "pciutils" package. In FreeBSD it can be installed using "pkg install pciutils". In FreeNAS itself, I think it's part of the base already (haven't checked).

Use "lspci" by itself to list the PCI devices in your system. For example, on a FreeBSD system here:

# lspci
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
00:08.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 8 (rev 22)
00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22)
00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)
00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 1
00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
01:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
06:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
08:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
08:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)


To look at the speed actually in use by the Mellanox InfiniBand controller (third from the bottom in the list above), we use the address shown there ("06:00.0") along with lspci -vv. Look for the "LnkSta" (Link Status) line:


# lspci -s 06:00.0 -vv
06:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
Subsystem: Mellanox Technologies MT25208 [InfiniHost III Ex]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin A routed to IRQ 26
Region 0: Memory at fbc00000 (64-bit, non-prefetchable)
Region 2: Memory at f9800000 (64-bit, prefetchable)
Expansion ROM at fbb00000 [disabled]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Not readable
Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [84] MSI-X: Enable+ Count=32 Masked-
Vector table: BAR=0 offset=00082000
PBA: BAR=0 offset=00082200
Capabilities: [60] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset- SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 2.5GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-


So, this particular card is running at PCI-E x4. I should probably move it to a different slot. ;)
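
(If you only want that one line rather than the full dump, a quick filter like this should do it:)

# Show just the negotiated link speed/width for the card at 06:00.0.
lspci -s 06:00.0 -vv | grep -i lnksta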

Does that help? :)
 
Last edited:

Holt Andrei Tiberiu

Contributor
Joined
Jan 13, 2016
Messages
129
You can start by changing the MB.
According to MSI:
https://www.msi.com/Motherboard/870G45.html#hero-specification

That motherboard has two PCIe x16 slots (physical length x16), but the first one is x16 electrical and the other is only x4.



AMD 770 chipset specs:
  • Codenamed RX780; final product name revealed by ECS
  • Single AMD processor configuration
  • One physical PCIe 2.0 x16 slot, one PCIe 2.0 x4 slot and two PCIe 2.0 x1 slots; the chipset provides a total of 22 PCIe 2.0 lanes plus 4 PCIe 1.1 lanes for A-Link Express II, solely in the Northbridge
  • HyperTransport 3.0 and PCI Express 2.0
  • AMD OverDrive
  • Energy efficient Northbridge design
    • 65 nm CMOS fabrication process manufactured by TSMC
So you need another setup, or go with this as it is.

You could try an MB based on the 990FX, so you can keep the CPU.

990FX
  • Codenamed RD990
  • Four physical PCIe 2.0 x16 slots @ x8 electrical, which can be combined into two PCIe 2.0 x16 slots @ x16 electrical, one PCIe 2.0 x4 slot and two PCIe 2.0 x1 slots; the chipset provides a total of 38 PCIe 2.0 lanes and 4 PCIe 2.0 lanes for A-Link Express III, solely in the Northbridge
  • HyperTransport 3.0 up to 2600 MHz and PCI Express 2.0
  • ATI CrossFireX supporting up to four graphics cards
  • 19.6 Watt TDP
  • Southbridge: SB950
  • Enthusiast discrete multi-graphics segment
 

Greg10

Dabbler
Joined
Dec 16, 2016
Messages
24
So it's been a while, and after a bunch of tests with various hardware combinations it turned out that the motherboard's PCIe interconnect was the bottleneck.

I ended up using a newer Xeon-based motherboard, paying careful attention to the lanes available per slot, and I was able to put a disk controller HBA, a Mellanox network HBA and a dual-port NIC into two x8 slots and an x4 slot and saw no drop in IOPS.

So thanks to everyone in this thread for chiming in.
 