dxun (Explorer) - Joined: Jan 24, 2016 - Messages: 52
Trying to understand bottlenecks in a future all-flash pool of consumer SSDs, I've come across an interesting question - how much does a single LSI 2008 throttle back a mirrored pool of 8 SSDs? The idea is to use this pool as a fast SAN for VMs. After spending a fair amount of time on these tests, I wanted to share my current results and observations in the hope of finding any errors I might have made in testing. I am also curious about interpretations of some of the things I've found.
The LSI 2008's specified maximum throughput tops out at 290 kIOPS, which I initially suspected would hold back an all-flash pool, so I set out to investigate the maximum IOPS capabilities of three scenarios:
a) 4 SATA SSD drives on on-board HBA + 4 SATA SSD drives on LSI 2008
b) 8 SATA SSD drives on LSI 2008
c) 8 SATA SSD drives on LSI 3008
The SATA SSDs are all consumer-grade Samsung drives - a mix of five 256 GB and three 512 GB units (some PROs, some EVOs). Basically, these are just left-over drives I am trying to consolidate into a single fast pool.
The system under test is an E3-1231 v3 on a Supermicro X10SLM-F with 32 GB of DDR3. Each scenario was run on a fresh pool with ashift=12, encryption=off, atime=off, recordsize=128k and auto-trim disabled. The pool configuration is four 2-way mirror vdevs, set up so that the pool loses the least amount of space with the available hardware. Additionally, for scenario a), I distributed the pool so that half targets the on-board HBA and the other half the LSI HBA (basically, each 2-way mirror was "sitting" on both HBAs, one disk per HBA). The enclosure is a Supermicro Mobile Rack CSE-M35TQB (not SAS3-capable, but it is a SATA/SAS enclosure).
The testing platform is the latest stable TrueNAS (12.0-U7).
I ran the tests in each scenario on two separate datasets - both with recordsize=4k, one with sync=disabled, the other with sync=always.
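For reference, the dataset setup can be sketched roughly like this (pool and dataset names are hypothetical, not the ones I actually used):

```
# hypothetical pool/dataset names - adjust to your layout
zfs create -o recordsize=4k -o sync=disabled tank/bench-async
zfs create -o recordsize=4k -o sync=always  tank/bench-sync
```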
Right off the bat - I am not going to tabulate scenario b). The results were so clearly inferior to the other two that I didn't even write them down (from memory, I'd say something like 30-50% poorer than a) and c)). That was certainly unexpected.
All fio tests are 4 KiB random write tests (bs=4k). Reported IOPS are rounded to the nearest thousand - where a range is given, those are the lowest/highest observed values.
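A representative fio invocation for the 16-job case would look roughly like this - this is a reconstruction on my part, so the I/O engine, runtime and directory path are assumptions; only rw/bs/numjobs/iodepth/size match the table below:

```
fio --name=bench-randwrite --directory=/mnt/tank/bench-sync \
    --rw=randwrite --bs=4k --numjobs=16 --iodepth=16 --size=256m \
    --ioengine=posixaio --time_based --runtime=60 --group_reporting
```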
Scenario | sync | jobs | iodepth | file_size | compression | IOPS reported |
a) | DISABLED | 1 | 1 | 4g | OFF | 73k |
a) | DISABLED | 16 | 16 | 256m | OFF | 82k - 161k |
a) | DISABLED | 64 | 64 | 256m | OFF | 168k - 183k |
a) | ALWAYS | 1 | 1 | 4g | OFF | 0.5k |
a) | ALWAYS | 1 | 1 | 4g | ZSTD | 0.5k |
a) | ALWAYS | 16 | 16 | 256m | OFF | 16k |
a) | ALWAYS | 16 | 16 | 256m | ZSTD | 15k |
c) | DISABLED | 1 | 1 | 4g | OFF | 72k - 103k |
c) | DISABLED | 16 | 16 | 256m | OFF | 131k - 184k |
c) | DISABLED | 64 | 64 | 256m | OFF | 81k - 122k |
c) | ALWAYS | 1 | 1 | 4g | OFF | 0.5k |
c) | ALWAYS | 16 | 16 | 256m | OFF | 17.7k |
c) | ALWAYS | 16 | 16 | 256m | ZSTD | 8.1k (!) |
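As a sanity check on what these numbers could plausibly be, here is the rough model I am working from (my assumption, not a measurement): in a pool of striped 2-way mirrors, every write lands on both disks of one mirror, so random-write IOPS scale with the number of vdevs, not the number of disks. The ~90k per-disk figure below is just a consumer SATA spec-sheet ballpark.

```python
# Rough model: random-write IOPS ceiling for a pool of striped 2-way mirrors.
# The per-disk figure (~90k 4k write IOPS for a consumer SATA SSD at high
# queue depth) is a vendor-sheet ballpark, not a measured value.
def pool_write_iops(vdevs: int, per_disk_iops: int) -> int:
    # Each write hits both disks of one mirror, so a 2-way mirror vdev
    # delivers roughly one disk's worth of write IOPS; vdevs stripe.
    return vdevs * per_disk_iops

print(pool_write_iops(4, 90_000))  # ceiling for four mirror vdevs: 360000
```

Even that ceiling (~360k for four vdevs) assumes the HBA, CPU and ZFS itself add no overhead at all - so the async numbers above are not as far off as they first seem, but the sync=always ones still are.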
Here is what I found interesting (on this system):
- the combined LSI 2008 + Intel PCH performs better than the LSI 3008... in some cases even 50% better!
- with sync=always, ZSTD compression comes basically for free (no reason to turn it off)
- sync=always absolutely devastates IOPS, irrespective of the HBA
- the single LSI 2008 was considerably inferior to the combined LSI 2008 + Intel PCH setup. I expected some variance, but not 30+ percent.
- none of the reported IOPS came even close to the manufacturers' stated maximums
- the CPU was maxed out in any test with jobs/iodepth > 1 (Hyper-Threading was ON for all tests)
One big piece of this puzzle is latency - and I haven't tracked it rigorously (too much work, too little time). What I could visually scan looked roughly comparable across scenarios, though (i.e. sub-ms latencies in most cases, except for the 64 jobs / 64 iodepth combination, where latencies would shoot up into 10 ms territory).
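For what it's worth, the latencies and IOPS have to agree via Little's law (outstanding I/Os = IOPS × average latency), so the average completion times can be back-calculated from the table above - a quick sketch using two of the scenario a) figures:

```python
# Little's law: outstanding I/Os = IOPS * average latency.
# Solving for latency gives the average completion time implied by an
# observed IOPS figure at a given jobs*iodepth concurrency.
def implied_latency_ms(jobs: int, iodepth: int, iops: float) -> float:
    outstanding = jobs * iodepth
    return outstanding / iops * 1000.0

# 16 jobs x 16 iodepth at ~161k IOPS -> roughly 1.6 ms average
print(round(implied_latency_ms(16, 16, 161_000), 2))
# 64 jobs x 64 iodepth at ~183k IOPS -> roughly 22 ms average,
# consistent in order of magnitude with the "10 ms territory" observation
print(round(implied_latency_ms(64, 64, 183_000), 2))
```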
Questions:
0) Is there anything wrong with these tests, or are these SSDs just incapable of reaching higher IOPS in these settings? Why is this pool maxing out far below the stated IOPS limits of the HBAs? I understand that there is a lot going on under the bonnet of such a pool and that the definition of an IOPS is... relaxed... but what I found absolutely shocking are the sync=always results, which feel tremendously poor (500 - 15,000 IOPS... for a pool of 8 SSDs on an LSI 3008?).
1) what is the purpose of having SANs with HBAs that can nominally reach hundreds of thousands or even millions of IOPS, when sync=always reduces these to a fraction of their capability?
2) what would it take to reach 100 kIOPS with sync=always? Is this NVMe-only territory?
3) why would an LSI+PCH combination defeat an LSI 3008 that should be almost three times faster than the 2008? I expected the LSI 3008 to shred everything.
4) what tuning could I try to eke more out of this setup? I still have a hunch it should be capable of far more IOPS.
5) it seems to me the CPU is the most limiting factor in this system - if going dual-processor (say, a Supermicro X10DRH-CT), which would be the better upgrade for higher IOPS in a moderately busy pool (i.e. what is represented by fio jobs=16 and iodepth=16)?
- 2x E5-2637 v4 (3.5 GHz, quad core)
- 2x E5-2640 v4 (2.4 GHz, 10-core)
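On question 2) specifically, a simple latency-budget calculation (my own arithmetic, assuming one synchronous commit per write at queue depth 1) shows why this looks like NVMe/Optane SLOG territory:

```python
# At queue depth 1, sync write IOPS = 1 / per-write commit latency,
# since each write must be flushed to stable storage before the next.
def commit_latency_us(iops: float) -> float:
    """Average commit latency (microseconds) implied by a qd=1 IOPS figure."""
    return 1_000_000 / iops

print(commit_latency_us(500))      # the observed ~0.5k IOPS -> 2000 us (2 ms)
print(commit_latency_us(100_000))  # a 100k IOPS target -> 10 us per commit
```

Ten microseconds per flushed write is far below what a consumer SATA SSD's cache flush can deliver, which is why a power-loss-protected NVMe or Optane SLOG is the usual answer for fast sync writes.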