SOLVED Rebuild of current system with different layout and added hardware

Herr_Merlin

Hi all,

We have been suffering from bad performance with our current FreeNAS build, which has been upgraded on the fly over the last one and a half years.
After posting here about the bad performance without getting a real answer ( https://www.ixsystems.com/community...-preformance-how-to-improve-vmware-nfs.79863/ ), we eventually figured it out ourselves. Please check below whether we really did...
As the system has been upgraded from 14 to 51 disks over time, we made some mistakes along the way.

So the plan is to reuse as much of the currently present hardware as possible and to build two pools, one for backups and one for VMs. Remember, our main VM storage is an ~2 million IOPS all-flash 8Gb FC SAN, which is mostly limited by the dual 8Gb links to each ESXi host.
90% of the production VMs will continue to reside on the FC SAN, as well as about 20% of the test VMs. The backups of those VMs should go to the backup pool.
10% of the non-critical production VMs should reside on the second pool, together with 80% of the test VMs. The backups of those production VMs should also go to the backup pool. If the FreeNAS goes tits up it won't be an issue, as long as the FC SAN is still there.

The hardware list (hardware currently lying around and already in use, aggregated):
HP DL 380e Gen8 14x LFF SAS 6G
- 2x Intel Xeon E5-2430L 6x2Ghz
- 12x 16GB DDR3L ECC reg PC12800
- 1x HP dual-port 10G SFP+ card
- 1x HP P420 512MB running in IT Mode SAS 6G
- 1x LSI SAS9200-8e
- 1x LSI SAS9207-8e
- 1x SLC SATA DOM with FreeNAS
Disk Expansion Shelves:
- HP D2700 4x SAS 6G in / 4x SAS 6G out 25x SFF SAS 6G
- Xyratex HB-1235 6x SAS 6G in / 6x SAS 6G out 12x LFF SAS6
- HP MSA70 4x SAS 3G in / 4x SAS 3G out 25x SFF SAS 3G (won't be used in this build)
Disks and SSDs
- 21x HGST SAS 12G 7.2k 4TB
- 29x Seagate SAS 12G 10k 1.8TB
- 5x Intel SATA3 DC S3500 80GB with PLP (won't be used in this build)
- 2x Samsung SATA2 MZ5S7100XMCO 100GB SLC with PLP
- 2x Samsung SAS 12G PM1633a 480GB with PLP
- 1x Samsung SATA3 PM863a 480GB with PLP
- 1x HGST SAS 12G ZeusIOPS S842E 800GB with PLP
- 2x Intel SATA2 320 Series 160GB (won't be used in this build)
- 1x Samsung SATA3 850 EVO 250GB (won't be used in this build)

Code:
So after lots of reading and calculating performance and IOPS, the following idea came up (a rough CLI sketch of the layout follows after the outline):
    1) The pool for backups:
        a. This should deliver over 500 MB/s of sustained writes with lz4 compression and forced sync writes (sync=always)
            i. 3x RAID Z2 of 6 x HGST SAS 12G 7.2k 4TB
                1) 12 Disks within the Xyratex HB-1235, which should be connected to the LSI SAS9200-8e with 2 cables
                2) 6 Disks directly within the HP DL 380e connected to the HP P420 in IT Mode
            ii. 3x HGST SAS 12G 7.2k 4TB as hot spares
                1) Those 3 disk reside within the HP DL380e connected to the HP P420 in IT Mode
            iii. For SLOG 2x Samsung SATA 2 MZ5S7100XMCO 100GB SLC as mirror
                1) The SLOG mirror would be placed at the HP P420 in IT Mode directly at the HP DL380e
            iv. For L2ARC the Samsung SATA3 PM863a 480GB
                1) Drive would be installed in the HP DL380e and connected to the P420 in IT Mode
        b. Current math hints at the following performance, looking only at the spinning disks in the RAID layout
            i. IOPS 4K
                1) Read: 300
                2) Write: 300
            ii. Streamed throughput
                1) Read: 780 MB/s
                2) Write: 780 MB/s
            iii. Pool size
                1) About 44TB raw
                2) Usable ~36TB
    2) The pool for the VMs:
        a. This pool should deliver far greater IOPS while still being large. Compression (lz4) and forced sync writes (sync=always) will be used:
            i. 7x RAID Z1 of 3 x Seagate SAS 12G 10k 1.8TB
                1) 21 Disks within the HP D2700 connected to the LSI SAS9207-8e with 2 cables
            ii. 2x Hotspare of the Seagate SAS 12G 10k 1.8TB
                1) 2 Disk within the HP D2700 connected to the LSI SAS9207-8e with 2 cables
            iii. For SLOG 2x Samsung SAS 12G PM1633a 480GB with PLP as mirror
                1) Those 2 SSDs would as well go into the HP D2700
            iv. For L2ARC 1x HGST SAS 12G ZeusIOPS S842E 800GB
                1) This SSD would as well go into the HP D2700 connected with 2 cables to the LSI SAS9207-8e
        b. Current math hints at the following performance, looking only at the spinning disks in the RAID layout
            i. IOPS 4K
                1) Read: 1750
                2) Write: 2870
            ii. Streamed throughput
                1) Read: 3444 MB/s
                2) Write: 1260 MB/s
            iii. Pool size
                1) About 22TB raw
                2) Usable ~18TB
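
For reference, a rough CLI sketch of the planned layout. Device and pool names are placeholders, and in practice the pools would of course be built through the FreeNAS GUI; this is just the zpool/zfs equivalent:
Code:
# Backup pool: 3x 6-wide RAIDZ2 + 3 hot spares + mirrored SLOG + L2ARC
zpool create backup \
    raidz2 da0 da1 da2 da3 da4 da5 \
    raidz2 da6 da7 da8 da9 da10 da11 \
    raidz2 da12 da13 da14 da15 da16 da17 \
    spare da18 da19 da20 \
    log mirror da21 da22 \
    cache da23
zfs set compression=lz4 backup
zfs set sync=always backup

# VM pool: 7x 3-wide RAIDZ1 + 2 hot spares + mirrored SLOG + L2ARC
zpool create vmpool \
    raidz1 da24 da25 da26  raidz1 da27 da28 da29  raidz1 da30 da31 da32 \
    raidz1 da33 da34 da35  raidz1 da36 da37 da38  raidz1 da39 da40 da41 \
    raidz1 da42 da43 da44 \
    spare da45 da46 \
    log mirror da47 da48 \
    cache da49
zfs set compression=lz4 vmpool
zfs set sync=always vmpool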


Would this work?
What could be tweaked even further?
Which settings could be optimized?
Any suggestions on hardware?
Would you recommend changing the mix of HDDs + SSDs in a pool?

BR
 

HoneyBadger

Reviewing your expectations about bandwidth/IOPS I think the behavior of a ZFS SLOG and sync writes needs some explanation.

When you enable sync writes, all writes smaller than vfs.zfs.zil_slog_bulk (at default tunables, 768KB in size) will flow through the SLOG device. That means your back-end disks will never be able to write faster than your SLOG device(s) can. Sync writes to the SLOG are also done without queuing - and the big numbers that vendors like to advertise for their SSDs are generally done at higher queue depths. Think "QD32, unless specified otherwise." This is because ZFS wants assurance that the blocks written are on stable storage before it will pass that acknowledgement back up the proverbial food chain.
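
If you want to see where that cutoff currently sits on your box, the tunable is readable from the FreeNAS shell (value is in bytes; 786432 = 768KB on default tunables):
Code:
# Show the size threshold below which sync writes are sent to the SLOG
sysctl vfs.zfs.zil_slog_bulk
# Expected on defaults:
# vfs.zfs.zil_slog_bulk: 786432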

Cutting out as much latency for your writes as possible should be the goal. This might be done best by finding a way to get your SLOG devices into the DL380e head unit, so they aren't competing for SAS bandwidth on the downstream link with the capacity drives themselves. The 100GB SM825 drives might see 150MB/s at 128KB recordsize. The PM1633a should give significantly better results but definitely nowhere near the 1GB/s speed you're hoping to get. For those kind of speeds you're going to have to get off of SAS and onto the PCIe bus, preferably with an Optane device like the P4801X or maybe even the full-powered P4800X for the running VM pool. There's also the more exotic NVRAM cards but you're unlikely to find any kind of official vendor support or warranty on anything you pick up there.
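
As a rough way to compare candidate SLOG devices, the benchmarking thread relies on FreeBSD's diskinfo sync-write test; the device name below is just a placeholder, and note that it is a destructive write test, so only run it against a drive that is not a member of any pool:
Code:
# Synchronous random-write test (the "ZFS SLOG test") - destructive,
# run it only on an empty drive that is not part of any pool
diskinfo -wS /dev/da21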

While sync writes are certainly a requirement for your running VMware pool, I don't know that they are a necessity for the backup pool. You're never likely to overwrite a previous backup file until the new one has been completed, so in the worst case scenario, a failure during writing of backups would mean having to remove the "current" backup and take it again. Previous copy shouldn't be impacted there.
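
That would simply mean leaving sync at the default on the backup datasets while forcing it on the VM datasets; a minimal sketch with made-up pool/dataset names:
Code:
# Backup pool: honour only what the client explicitly requests
zfs set sync=standard backup
# VM pool: force every write onto stable storage before acknowledging
zfs set sync=always vmpool/nfs-vm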

Regarding the pool layouts, strongly prefer mirrors for the running VMs. The "back-end" (non-cached/logged) IOPS potential of a pool is largely driven by the number of vdevs, not the number of drives.
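
To illustrate, the same 21 drives laid out as 2-way mirrors give ten vdevs plus a spare instead of seven vdevs, so roughly ten drives' worth of back-end random IOPS instead of seven; device names below are placeholders:
Code:
# Ten 2-way mirrors from 20 of the 1.8TB drives, plus one hot spare -
# roughly 10x single-drive random IOPS versus ~7x for the 7x RAIDZ1 layout
zpool create vmpool \
    mirror da24 da25  mirror da26 da27  mirror da28 da29 \
    mirror da30 da31  mirror da32 da33  mirror da34 da35 \
    mirror da36 da37  mirror da38 da39  mirror da40 da41 \
    mirror da42 da43 \
    spare da44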

Overall performance in a pool will tend to decrease as free space decreases as well. This is largely due to free space fragmentation - for heavily random I/O like running VMs, try to keep as much free space as possible. On spinning disks this has often been stated as "50% free" - but the more, the merrier.

Recordsize is also a consideration. The smaller the writes, the slower they tend to be, since each one has to traverse the network, be written to stable storage, and acknowledged. There's a thread in my signature regarding SLOG benchmarking that might provide some context to this, but you could expect very small (4K/8K) write benchmarks to turn in a speed of maybe 25-33% of your 128K results.
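
Recordsize is a per-dataset property, so the VM datastore can be tuned independently of the backup target; the values below are only commonly suggested starting points, not gospel, and they only affect newly written blocks (dataset names are placeholders):
Code:
# Smaller records for the random-I/O NFS VM datastore
zfs set recordsize=16K vmpool/nfs-vm
# Large records suit the sequential backup target (1M needs the large_blocks feature)
zfs set recordsize=1M backup/vm-backups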

Obviously this will never approach the performance of an all-flash-array, but it should help improve things somewhat.
 

Herr_Merlin

Reviewing your expectations about bandwidth/IOPS I think the behavior of a ZFS SLOG and sync writes needs some explanation.

When you enable sync writes, all writes smaller than vfs.zfs.zil_slog_bulk (at default tunables, 768KB in size) will flow through the SLOG device. That means your back-end disks will never be able to write faster than your SLOG device(s) can. Sync writes to the SLOG are also done without queuing - and the big numbers that vendors like to advertise for their SSDs are generally done at higher queue depths. Think "QD32, unless specified otherwise." This is because ZFS wants assurance that the blocks written are on stable storage before it will pass that acknowledgement back up the proverbial food chain.
What value would be ideal to tune it to?
Cutting out as much latency for your writes as possible should be the goal. This might be done best by finding a way to get your SLOG devices into the DL380e head unit, so they aren't competing for SAS bandwidth on the downstream link with the capacity drives themselves.
The idea here was to have one controller per shelf / the head unit, so that each controller only has that many drives to manage.

The 100GB SM825 drives might see 150MB/s at 128KB recordsize. The PM1633a should give significantly better results but definitely nowhere near the 1GB/s speed you're hoping to get. For those kind of speeds you're going to have to get off of SAS and onto the PCIe bus, preferably with an Optane device like the P4801X or maybe even the full-powered P4800X for the running VM pool. There's also the more exotic NVRAM cards but you're unlikely to find any kind of official vendor support or warranty on anything you pick up there.
Sadly the HP DL380e does not support NVMe drives, so I didn't look into this. It is an old platform by today's standards...
While sync writes are certainly a requirement for your running VMware pool, I don't know that they are a necessity for the backup pool. You're never likely to overwrite a previous backup file until the new one has been completed, so in the worst case scenario, a failure during writing of backups would mean having to remove the "current" backup and take it again. Previous copy shouldn't be impacted there.
Indeed, that should do it.
Regarding the pool layouts, strongly prefer mirrors for the running VMs. The "back-end" (non-cached/logged) IOPS potential of a pool is largely driven by the number of vdevs, not the number of drives.
Using about 20 drives of 1.8TB each will only give ~1.6TB of usable capacity per disk. Mirroring 20 disks would then lead to only 16TB of storage, of which only 8TB would be advisable to use, which is a little too little.

Overall performance in a pool will tend to decrease as free space decreases as well. This is largely due to free space fragmentation - for heavily random I/O like running VMs, try to keep as much free space as possible. On spinning disks this has often been stated as "50% free" - but the more, the merrier.
As far as I understood it, it should be 20% free for NFS and 50% for iSCSI?
This FreeNAS would only be connected via NFS.
Recordsize is also a consideration. The smaller the writes, the slower they tend to be, since each one has to traverse the network, be written to stable storage, and acknowledged. There's a thread in my signature regarding SLOG benchmarking that might provide some context to this, but you could expect very small (4K/8K) write benchmarks to turn in a speed of maybe 25-33% of your 128K results.
True on that. That's why I want to use the fastest drives we currently have lying around for the SLOG.
Obviously this will never approach the performance of an all-flash-array, but it should help improve things somewhat.
That was the idea.
Right now there are 14 disks on the HP P420 and the other 37 disks are connected via one LSI controller and limited to SAS 3G because the HP MSA70 is in use.
So the plan is to split the disks and SSDs more evenly across the controllers and replace the MSA70 with a D2700 to get SAS 6G.

BR
 

Herr_Merlin

So I did it as described above.
With a positive, a strange and a negative outcome.
Negative:
The pool of 4TB disks is way too slow; it tops out at 70 MB/s for long streaming writes.
I will test removing the SLOG mirror and see if that improves things... maybe those SATA2 SSDs are just too slow.
Positive:
The pool of 1.8TB disks performs faster than I expected... but that might be because it's only 30% used right now.
Strange:
I have added two 600GB 15k SAS drives as a mirror with no SSD cache etc., with sync writes. Performance of the mirror itself is as expected and fine.
But as soon as I copy something to this mirror, transfers and IOPS collapse for the other pools... pools that have SSDs and are not even on the same controller.
Does anyone have an idea?
I have already checked CPU usage: never above 60%, usually around 30%.
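
For reference, to see where it stalls I can watch per-disk latency and per-vdev throughput while the copy runs (standard FreeBSD tools, nothing special):
Code:
# Per-physical-disk busy %, queue length and latency, refreshed every second
gstat -p -I 1s
# Per-vdev read/write ops and bandwidth for all pools, one-second intervals
zpool iostat -v 1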
 

Herr_Merlin

Does anyone have an idea regarding the mirror?
 

HoneyBadger

The description of the IOPS crashing would imply that there's a controller in place with an insufficient queue depth. You're certain the HP P420 is in HBA mode and it's not trying to do something silly like put its own write cache in the way of every device?
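
One way to check the queue-depth side of that from the FreeNAS shell, assuming the disks show up as plain da devices (da0 below is a placeholder):
Code:
# List attached disks and which controller/driver they hang off
camcontrol devlist -v
# Show the command queue depth ("dev_openings") negotiated for a given disk
camcontrol tags da0 -v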

What value would be ideal to tune it to?

Wherever your back-end vdevs start becoming faster than your SLOG device ... 1M is usually the point where a hard drive's sequential write capability is considered sufficient.
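
As an example, raising the cutoff to 1M would look like the line below; on FreeNAS you would normally add this as a sysctl tunable in the GUI so it survives a reboot, rather than setting it by hand:
Code:
# Send sync writes up to 1 MiB through the SLOG instead of straight to the pool
sysctl vfs.zfs.zil_slog_bulk=1048576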

The idea here was to have one controller per shelf / the head unit, so that each controller only has that many drives to manage.

A good idea but bear in mind where the latency is most felt - on the incoming writes to the SLOG device(s). Once the write is acknowledged onto stable storage, it is spooled to the vdevs asynchronously. Keep your SLOG as close to the compute/head as possible.

Sadly the HP DL380e does not support NVMe drives, so I didn't look into this. It is an old platform by today's standards...

It may not support booting from one or have any U.2 hotswap bays, but it should support them in a PCIe slot at least.

Using about 20 drives of 1.8TB each will only give ~1.6TB of usable capacity per disk. Mirroring 20 disks would then lead to only 16TB of storage, of which only 8TB would be advisable to use, which is a little too little.

As far as I understood it, it should be 20% free for NFS and 50% for iSCSI?
This FreeNAS would only be connected via NFS.

What is the required amount of space here? The recommendation for free space is not an "on/off" switch, where everything is fine up until a certain level and then nosedives as soon as you write 50% plus one byte - it tends to degrade, most specifically in the write speed, as free space goes down. You might find acceptable performance with higher usage if your workload is mostly reads with few rewrites, but a "test VM datastore" implies a lot of updates, edits, and deletes.
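
Both capacity and free-space fragmentation are easy to keep an eye on as the pool fills (the pool name below is a placeholder):
Code:
# Capacity, free space and free-space fragmentation for every pool
zpool list -o name,size,alloc,free,frag,cap
# Or just the fragmentation of a single pool
zpool get fragmentation vmpool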

True on that. That's why I want to use the fastest drives we currently have lying around for the SLOG.

You may need to invest in some faster ones, unfortunately. They don't necessarily need to be NVMe, but NVMe devices are the fastest, hottest ticket on the market right now.

That was the idea.
Right now there are 14 disks on the HP P420 and the other 37 disks are connected via one LSI controller and limited to SAS 3G because the HP MSA70 is in use.
So the plan is to split the disks and SSDs more evenly across the controllers and replace the MSA70 with a D2700 to get SAS 6G.

BR
None of the SSDs being used for SLOG are on the SAS 3G controller, are they?
 

Herr_Merlin

I replaced the SAS 3G chassis with a SAS 6G one ;)
The mirror is not connected to the P420.
There are now 3 cards: the P420 and the 2 LSI HBAs.
One LSI HBA has the pool with the 1.8TB 10k drives, the other has the mirror and some of the 4TB disks; the rest of the 4TB disks are connected to the P420.
As soon as I write a huge (100GB) file to the mirror, the performance of both other pools goes way down. That is the part I do not understand.
Plus, they are not fighting over PCIe bus bandwidth...
 