64TB iSCSI targets on Storinator

Status
Not open for further replies.

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
Hi all,
I've been breaking my head over this case, researching heavily for a good 40+ hours, and finally decided to reach out for help.
TL;DR: We're getting 60-100MB/s reads over iSCSI across dual direct-wired Intel 10GbE links, when we're expecting 500+MB/s.

I inherited this [unique?] case on a "Storinator".
Current config:
- 128GB RAM
- 2x Highpoint Rocket 750 HBAs
- 30x 6TB WD Se drives (30 more installed, not in use yet)
- ZPool of 3 vdevs, 10 disks each, in RAIDZ2
- 1 SSD for ZIL (SLOG)
- 1 SSD for L2ARC
- 2x 10GbE Intel NICs, direct-wired to the Windows server
- Exported as a single 64TB iSCSI target to Windows Server 2012, formatted NTFS with a 64k allocation unit size

We are storing and processing backups for ~60 servers.
Each backup set consists of a base file (usually 100GB to >1TB) plus daily incrementals.
The nightly backups are verified after each run.
Every Saturday the incrementals are rolled up into a weekly backup, and at the end of each month those are consolidated further, clearing out old data per the retention policy.

This is a really disk-intensive process, since the software has to read the base file, walk the chain of past monthly/weekly/daily backups, and write out a new consolidated file.
We process up to 10 backups like this simultaneously.

We were seeing dismal performance - a peak of 150MB/s read, max.
After tons of research, tweaking, and watching (over and over), I concluded that the RAM can't effectively cache such a huge iSCSI target, and I made the changes below.

I don't know if throwing more RAM at the machine will help, but we're ready to spend the money if it will deliver.

Prevent too many queued commands to disk
sysctl vfs.zfs.vdev.max_active=10
I played with this one a LOT - tried 1, 2, and finally left it at 10.
https://forums.freenas.org/index.ph...ve-previously-vfs-zfs-vdex-max_pending.19212/

Only cache metadata
zfs set primarycache=metadata zpool
zfs set secondarycache=metadata zpool
I think this is the only thing that made a real difference so far

Allow prefetch
sysctl vfs.zfs.l2arc_noprefetch=0

Let L2ARC fill up faster
sysctl vfs.zfs.l2arc_write_max=67108864
sysctl vfs.zfs.l2arc_write_boost=67108864

Increase streams to total CPU cores
sysctl vfs.zfs.zfetch.max_streams=24
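
In case anyone wants to sanity-check my settings, this is roughly how I've been reading them back from the shell (pool name and sysctls as above; the sysctl values are live only and won't survive a reboot unless they're added as tunables):
Code:
# confirm the dataset-level cache settings
zfs get primarycache,secondarycache zpool
# confirm the live sysctl values
sysctl vfs.zfs.vdev.max_active vfs.zfs.l2arc_noprefetch
sysctl vfs.zfs.l2arc_write_max vfs.zfs.l2arc_write_boost vfs.zfs.zfetch.max_streams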


Since making the changes - mostly setting primary/secondary cache to metadata - we've been seeing a very slow rise in performance, which I attribute to the cache filling with metadata.
I don't know if it will keep getting faster as more metadata gets cached.
Am I doing this right? Should we be using CIFS instead, so that ZFS can track metadata for the actual files rather than for a giant iSCSI target it knows nothing about?
If we can't get 500-600MB/s read out of this machine, we're going to have to scrap it.

Anything would help at this point.
Thanks!
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
Right off the bat, I have two major concerns. First is the Highpoint HBA - many of their products are known for being crap, and I don't know how this specific card behaves.
Second, you're demanding an awful lot from 3 vdevs. When looking at performance, bandwidth is only one factor... you also have to consider IOPS and latency. You say you're doing a very disk-intensive process and running 10 of them in parallel - that's most likely a lot of IOPS. A RAIDZ vdev only delivers roughly the random IOPS of its slowest member drive, so with three vdevs you've only got ~150 IOPS for the whole pool.
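
Quick back-of-the-envelope math, assuming ballpark figures of ~50 random IOPS per 7200RPM drive and roughly one drive's worth of random IOPS per RAIDZ2 vdev (rough assumptions, not measurements):
Code:
3 vdevs x ~50 random IOPS per vdev ~= 150 random IOPS for the whole pool
spread across 10 parallel consolidation jobs ~= 15 IOPS per job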

FN9.10 added new graphs in Reporting/Disk for Disk Busy, Disk Latency, and Disk Operations. What do the various disks show when under load? That may well tell the story.

What's the utilization of your pool? The usual guidance is no more than 50% utilization for NFS/iSCSI stores.

Beyond that, you'll need to start breaking it down step by step. First, use dd against a new dataset with compression disabled to test raw read and write performance. If that looks good, use iperf to check the network path. Work outward, step by step, until you find the problem.
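
For the network leg, something along these lines is what I have in mind - the bundled iperf is enough, and the address below is just an example, substitute whichever 10GbE interface you're testing:
Code:
# on the FreeNAS box
iperf -s
# on the Windows box (or any other host on that 10GbE link)
iperf -c 10.0.0.1 -P 4 -t 30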
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
First of all, thank you for responding. I greatly appreciate your time and help. This forum is a treasure trove of info!

I read about the general Highpoint issues. This card is the one Backblaze uses, and it came as a package with the machine. We're obviously willing to swap it, but from testing it doesn't seem to be the issue at all.

The speeds I documented are from an otherwise idle system - just the iSCSI target mounted on Windows, copying to a local SSD on the Windows box. Dismal. I was seeing 12-15MB/s before making the changes I documented; now we're getting closer to 120MB/s. Write speeds had been literally double those dismal read speeds.

We're utilizing 74% of the pool, and I don't think shrinking is an option. My plan was to build a second pool (in either config mentioned above) on the remaining 30 disks and move everything over - BUT at the read speeds we're getting, it would literally take almost a YEAR to move the data.

I am the second guy to work through every step - disks, iperf, NIC, Windows, everything. Something is not set correctly, or we're missing something. The disks are barely reading at 4MB/s each after my changes; before, they were barely doing 1-2MB/s. (I haven't upgraded from 9.3 yet, so I don't have those graphs.)
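
(For anyone following along: the per-disk numbers above are easy to watch live from the FreeNAS shell, e.g. with gstat:)
Code:
# per-disk ops/sec, KB/s and %busy, refreshed live (physical providers only)
gstat -p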

The gains we've seen from the changes I documented are more than anyone else working on this box had been able to achieve.

Ultimately, I'm asking:
Is this a generic issue with iSCSI, since ZFS can't cache the entire 64TB volume in RAM?
A lot of the data is actually read fairly often, so it's not just a matter of caching the MFU bits.
Do we need to throw more RAM at it?
Coming from a database background, I understand primarycache=metadata to be like an index on a table, and it does seem to be helping. Maybe it needs a few days to warm up the cache? I am seeing a slight uptick in performance, but it's been over 24 hours and it's gaining traction very slowly.
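
In case it's useful to anyone advising, the warm-up can be eyeballed via the raw ARC/L2ARC counters - plain FreeBSD kstat sysctls, cumulative since boot, so it's the trend that matters:
Code:
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses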

I did a LOT of research, but I'm totally stumped. Thanks!
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
Try this. Create a new dataset with compression disabled. Then, SSH in and:
dd if=/dev/zero of=/mnt/some_dataset/zerofile bs=1M count=10000
dd if=/mnt/some_dataset/zerofile of=/dev/null bs=1M count=10000

Post the results from both. Obviously, replace some_dataset with the name of the new dataset you created. Make sure there's NO traffic to/from the pool when you do this.
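
If you'd rather do it from the shell than the GUI, the no-compression dataset can be created like this (pool/dataset names are just placeholders):
Code:
zfs create -o compression=off yourpool/NoCompression
zfs get compression yourpool/NoCompression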

You also didn't post full system specs. What motherboard, CPU, etc.?

74% of the pool is likely a challenge, especially if you're running iSCSI. You mentioned moving to CIFS... is there a benefit to mounting the data via iSCSI vs. CIFS? If not, I would consider at least testing the CIFS option.

As for the HBA, search around the forum and you'll discover much gnashing of teeth from cyberjock, jgreco, and others far smarter than I about its potential performance issues. The gold standard M1015/9211-8i/etc. and an expander may be worth trying.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
Thanks. I've been doing exactly this for the past week.
I don't know the motherboard specs. CPU: E5-2620 v3 @ 2.40GHz.

Made a pool of striped mirrors: 15 vdevs of 2 disks each (WD Se 6TB).
Ran it 3 times; all runs were within a split second of each other.

# dd if=/dev/zero of=/mnt/r10/zerofile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 9.052913 secs (1158274696 bytes/sec)
# dd if=/mnt/r10/zerofile of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 36.505526 secs (287237608 bytes/sec)

Primary and secondary cache are both set to metadata on this pool.
There is no separate ZIL (SLOG), but there is a cache (L2ARC) SSD.
(BTW - I'm not loving this read speed, but if we could get this level of performance from the other pool, we'd have something to discuss.)

This was run while the other pool was under load.
 

Deadringers

Dabbler
Joined
Nov 28, 2016
Messages
41

I came across this post while looking for performance issues myself.

FYI here are my results:

[root@freenas] /mnt/VM-Store-01/test-1234# dd if=/dev/zero of=/mnt/VM-Store-01/test-1234/zerofile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 19.161680 secs (547225505 bytes/sec)

[root@freenas] /mnt/VM-Store-01/test-1234# dd if=/mnt/VM-Store-01/test-1234/zerofile of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 2.322381 secs (4515090294 bytes/sec)


This is on striped mirrors (RAID10) with 4x 1.2TB 10k SAS disks.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
I'm still thinking your HBA is an issue. Even Highpoint is pretty clear that the 750 is built for lots'o'drives, not for performance.

Just for reference, here's a similar test on my system. Specs (also in my footer):
Chassis - Supermicro CSE-847E16-R1K28LPB
Motherboard - Supermicro X9DRi-LN4F+
CPU - dual E5/2670 (8C/16T @ 2.6GHz)
RAM - 128GB ECC RDIMM
Fast storage (Tier2) - 12x HGST NAS 3TB 7.2K SATA in striped mirrors, Intel S3700 200GB downprovisioned to 16GB for SLOG, Intel S3700 200GB for L2ARC (which has a whopping 2.4% hit rate right now... meh)
Slow storage (Tier3) - 6x HGST 4TB 7.2K SAS in RAIDZ2
HBA - 9211-8i variant HBA (M1015 I think, can't remember)

Tier2 (12 drive striped mirror):
Code:
[root@freenas] ~# dd if=/dev/zero of=/mnt/Tier2/NoCompression/zerofile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 11.404259 secs (919460004 bytes/sec)
[root@freenas] ~# dd if=/mnt/Tier2/NoCompression/zerofile of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 1.642550 secs (6383830053 bytes/sec)


Tier3 (6 drive RAIDZ2):
Code:
[root@freenas] ~# dd if=/dev/zero of=/mnt/Tier3/NoCompression/zerofile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 21.392384 secs (490163227 bytes/sec)
[root@freenas] ~# dd if=/mnt/Tier3/NoCompression/zerofile of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 1.653532 secs (6341432360 bytes/sec)


I haven't gone out and conclusively proven it, but I believe my read speed is actually being limited by the available SAS/SATA channels, or PCIe lanes. Haven't cared enough to track it down. All of my content is being served over GbE today, maybe 10GbE soon... that's my limiting factor.

Point being, your system is comparable to or hotter than mine, and you're running quite a number of spindles. If you aren't seeing similar performance, I'd be thinking HBA. I haven't really done any tuning to this system either... built it up based on lots of reading here, got it running, and it's happy. Just another case of FreeNAS being awesome. I certainly haven't been messing with metadata caching, etc. (and I would suggest that, once we figure out this problem, you consider backing a lot of that stuff out... FreeNAS expects things to run the way it likes, and such modifications may pose challenges down the road).

Just another random thought, and the reason I asked about the motherboard. What sort of slot is the HBA in? The card itself is PCIe 2.0 x8, which should be good for a theoretical 32Gbps max, or 4GBps. If you had it in a slot that was, for example, x4 lanes in a x8 slot, that's 2GBps... and starting to get down to the read speeds you're seeing.
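
The rough math behind that (PCIe 2.0, 8b/10b encoding):
Code:
PCIe 2.0 = 5 GT/s per lane, 8b/10b -> ~4 Gbps usable per lane
x8 electrical: 8 x 4 Gbps = 32 Gbps ~= 4 GB/s
x4 electrical: 4 x 4 Gbps = 16 Gbps ~= 2 GB/s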

I'd pick up a $100 9211/M1015/etc. HBA and give that a shot. That seems to be the most likely thing, and the one key component you haven't yet swapped.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45

Thanks for the info.

I don't think you can get those read numbers without reading from cache.
The HBA is in the correct slot, and I've pushed the disks individually - up to 10 disks running simultaneous dd reads to /dev/null - and they were each getting >160MB/s. So I'm not looking at the HBA as the cause.
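
(That test was essentially the following, run under sh - the daX names are just examples, substitute your own devices:)
Code:
# read 10 disks raw, in parallel, straight to /dev/null
for d in da0 da1 da2 da3 da4 da5 da6 da7 da8 da9; do
  dd if=/dev/${d} of=/dev/null bs=1M count=10000 &
done
wait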

My concern is that we're reading a large portion of the 64TB volume every few days, while adding and modifying data over the same period.
It's all over iSCSI, so ZFS has no idea about the actual data layout inside NTFS.
I'm betting the caches can't keep up with only 128GB of RAM. I'm looking for further guidance before spending what will likely be a couple of $K on RAM.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
Yep, you're right. Here are some additional tests with 1TB files - well beyond my RAM and L2ARC (and only Tier2 has L2ARC anyway). During this run, the individual Tier2 disks were reading about 120MB/sec and writing about 90MB/sec; Tier3 was writing at 75MB/sec and reading around 120MB/sec. Tier2 does have several production VMs running on it that are somewhat IO-intensive (Splunk, Graylog, mail server, Percona DB cluster, etc.), so that will skew things slightly. Shutting those down for testing isn't a great idea :)

Tier2:
Code:
[root@freenas] /mnt/Tier2/NoCompression# dd if=/dev/zero of=/mnt/Tier2/NoCompression/zerofile bs=1M count=1000000
1000000+0 records in
1000000+0 records out
1048576000000 bytes transferred in 1464.591686 secs (715951080 bytes/sec)
[root@freenas] /mnt/Tier2/NoCompression# dd if=/mnt/Tier2/NoCompression/zerofile of=/dev/null bs=1M count=1000000
1000000+0 records in
1000000+0 records out
1048576000000 bytes transferred in 969.314953 secs (1081770168 bytes/sec)


Tier3:
Code:
[root@freenas] /mnt/Tier2/NoCompression# dd if=/dev/zero of=/mnt/Tier3/NoCompression/zerofile bs=1M count=1000000
1000000+0 records in
1000000+0 records out
1048576000000 bytes transferred in 3568.089240 secs (293876058 bytes/sec)
[root@freenas] /mnt/Tier2/NoCompression# dd if=/mnt/Tier3/NoCompression/zerofile of=/dev/null bs=1M count=1000000
1000000+0 records in
1000000+0 records out
1048576000000 bytes transferred in 2005.229230 secs (522920764 bytes/sec)


Still a fairly respectable 1,032MBps read and 682MBps write on Tier2, and 499MBps read/280MBps write on Tier3.

It still seems very odd to me that your writes are faster than your reads. That pretty much says something is wrong - and by testing locally on the box, we've eliminated all the iSCSI vs. CIFS questions, network problems, etc.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
Right! That's the crazy thing. We're getting horrible performance from a pretty powerful machine. HELP! :(
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
I guess that's my point, something's outside the norm. Throwing RAM at it won't fix this problem.
 

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
Are you using one controller per 30 drive pool? or do you have both controllers serving the current 30 drive pool?
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
We're using one HBA per pool; the second one is totally idle apart from the testing we did on it.
The thing is, we have two of these machines - same HBA, entirely different disk subsystem - and they both perform the same. They were not set up by FreeNAS pros, and I'm definitely not a pro either. I've literally spent over 45 hours researching this and trying different things.

Honestly, I am seeing a slight uptick in performance since setting both caches to metadata, but it is VERY slow - it's been more than 50 hours and it only seems a little faster.
I'm cautiously optimistic and will watch it over the next few days. Hopefully I'll have some good news to share.
 

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
Since the other HBA is idle, I'd be curious to see if you gain any performance by splitting the drives of the current 30-disk pool across both HBAs. I bet you'd see a good boost.
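
If you want to confirm which controller each disk currently sits behind before re-cabling, the standard FreeBSD CAM tooling will show the bus layout for CAM-attached devices (I'm not sure how the Highpoint driver presents its disks, so this may or may not cover them):
Code:
camcontrol devlist -v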
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
Maybe. So you think the HBA is the cause?
The idle HBA is hardly delivering any performance either - reading at 275MB/s from 30 enterprise disks in mirrored vdevs is not that exciting, especially considering each drive can technically push about 220MB/s on its own. I'm down for swapping the card for a higher-performing one and testing. What do you suggest that can handle 30 disks?
 

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
Not sure, but it's easy to find out if you split the load for that one pool across both HBAs - if performance increases, you've got your answer. It would also be worth getting an LSI-chipped card to test; it'd be cheap, and it would let you test 8 drives if you can destroy the pool that's not in use and set up a new 8-disk pool. A Dell PERC H310 can be had for about $40 on eBay and crossflashed to LSI 9211 IT mode for testing.
 

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
Another thing I noticed: that HBA's spec sheet says "Disk Format compatible: 512, 512e". WD lists the Se drives as "Advanced Format" but doesn't give better detail. You should check one of the disks with "smartctl -i /dev/yourDev" (you can run "smartctl --scan" to list all drives) and see what sector sizes it reports. Maybe not the issue, but if the drives are 4K native (4096 bytes logical and physical) there could be compatibility problems. Seems unlikely given that Backblaze uses them, but worth a check since WD doesn't say explicitly on their website.
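
Something like this will dump the sector sizes for every drive in one pass (sh syntax; kern.disks is just FreeBSD's list of disk devices):
Code:
for d in $(sysctl -n kern.disks); do
  echo "=== ${d} ==="
  smartctl -i /dev/${d} | grep -i "sector size"
done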
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
Just ran it:
Sector Sizes: 512 bytes logical, 4096 bytes physical
Does that make a difference?
Also, we're configured for a 128k record size on the iSCSI zvol. It's exported with a 4096-byte logical block size in iSCSI, and NTFS is formatted with a 64k allocation unit size.
Are those compatible? If not, what's optimal?
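
(For reference, those zvol-side values can be read back like this - the zvol path is a placeholder:)
Code:
zfs get volblocksize,volsize yourpool/youriscsizvol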

Thanks
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I would try a different HBA. Also test with iperf. What kind of NIC bonding is configured? Disable all of that and test just one interface. Remove any system config changes as well.

Sent from my Nexus 5X using Tapatalk
 