Very Slow Enterprise SSD ZIL speeds

TheKurrgan

Cadet
Joined
Jan 20, 2021
Messages
1
I'm building a storage server for ESXi virtualization and need sync writes to be fast, as this is a VERY write-intensive workload. I purchased 12 Seagate 1200.2 800GB SSDs, which are SAS 12Gb/s. The specs indicate a maximum of 850MB/s sequential write, so I divided that in half for good measure. The literature says they have power-loss protection, so they should work well as SLOG for ZFS in an all-sync-write situation with ESXi. However, they are performing very poorly: each drive maxes out at around 30MB/s during sequential writes with 64KB block sizes. Just to confirm the pool itself isn't the source of the problem, I ran zfs set sync=disabled on it and, voila, an instant 10Gbit ingest rate with no problem.
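For reference, the sync toggle test was just the standard ZFS property, roughly like this (Pool1 is my pool; while sync=disabled is set, sync-write safety is gone, so it was only for testing):

Code:
# Treat all writes as async to rule out the data vdevs (testing only)
zfs set sync=disabled Pool1

# ...re-run the ESXi/NFS ingest and watch per-device throughput...
zpool iostat -v Pool1 1

# Restore the default behaviour afterwards
zfs set sync=standard Pool1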
I've checked mode page 0x08 (the caching page) on the drives and found nothing obviously broken there.
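For anyone who wants to look at the same thing, this is roughly how the caching mode page can be inspected on FreeBSD/TrueNAS CORE (da0 is just an example device):

Code:
# Dump mode page 0x08 (Caching) and check the WCE (write cache enable) bit
camcontrol modepage da0 -m 8

# The page can also be opened in $EDITOR if a bit needs toggling
camcontrol modepage da0 -m 8 -e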
It's as if something is denying the drives the use of their internal cache, as I would expect to see on consumer SSDs without PLP.
The drives are connected to an LSI 2308 card, which connects to an expander backplane with 4 channels, which in turn connects to the 12 SSDs. The card is in IT mode with firmware version 20.

Transport used is NFSv3, with ESXi defaults.
Has anyone out there had this problem with enterprise SSDs and ZFS?
Any thoughts are appreciated.

System Hardware:
Motherboard: Supermicro X9DBS-F
CPU: 2x Intel Xeon E5-2420 v2
RAM: 96GB DDR3-1333 Hynix ECC memory, single-rank
NIC: Intel X540-T2 configured in a lagg (LACP), jumbo frames enabled
Drive listing:
<SEAGATE ST800FM0053 XGEG> at scbus0 target 8 lun 0 (pass0,da0)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 9 lun 0 (pass1,da1)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 10 lun 0 (pass2,da2)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 11 lun 0 (pass3,da3)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 12 lun 0 (pass4,da4)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 13 lun 0 (pass5,da5)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 14 lun 0 (pass6,da6)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 15 lun 0 (pass7,da7)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 16 lun 0 (pass8,da8)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 17 lun 0 (pass9,da9)
<SEAGATE ST800FM0053 XGEG> at scbus0 target 18 lun 0 (pass10,da10)
<SEAGATE ST800FM0053 0006> at scbus0 target 19 lun 0 (pass11,da11)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 20 lun 0 (pass12,da12)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 21 lun 0 (pass13,da13)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 22 lun 0 (pass14,da14)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 23 lun 0 (pass15,da15)
<LSI SAS2X36 0e1b> at scbus0 target 24 lun 0 (pass16,ses0)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 25 lun 0 (pass17,da16)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 26 lun 0 (pass18,da17)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 27 lun 0 (pass19,da18)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 28 lun 0 (pass20,da19)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 29 lun 0 (pass21,da20)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 30 lun 0 (pass22,da21)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 31 lun 0 (pass23,da22)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 32 lun 0 (pass24,da23)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 33 lun 0 (pass25,da24)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 34 lun 0 (pass26,da25)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 35 lun 0 (pass27,da26)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 36 lun 0 (pass28,da27)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 37 lun 0 (pass29,da28)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 38 lun 0 (pass30,da29)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 39 lun 0 (pass31,da30)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 40 lun 0 (pass32,da31)
<LSI SAS2X28 0e12> at scbus0 target 41 lun 0 (pass33,ses1)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 42 lun 0 (pass34,da32)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 43 lun 0 (pass35,da33)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 44 lun 0 (pass36,da34)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 45 lun 0 (pass37,da35)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 46 lun 0 (pass38,da36)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 47 lun 0 (pass39,da37)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 48 lun 0 (pass40,da38)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 49 lun 0 (pass41,da39)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 50 lun 0 (pass42,da40)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 51 lun 0 (pass43,da41)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 52 lun 0 (pass44,da42)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 53 lun 0 (pass45,da43)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 54 lun 0 (pass46,da44)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 55 lun 0 (pass47,da45)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 56 lun 0 (pass48,da46)
<SEAGATE ST14000NM0288 ET02> at scbus0 target 57 lun 0 (pass49,da47)
<LSI SAS2X28 0e12> at scbus0 target 58 lun 0 (pass50,ses2)
<SATADOM-MV 3ME S130710> at scbus1 target 0 lun 0 (pass51,ada0)
<SATADOM-MV 3ME S130710> at scbus2 target 0 lun 0 (pass52,ada1)
<AHCI SGPIO Enclosure 2.00 0001> at scbus7 target 0 lun 0 (pass53,ses3)
<ADATA USB Flash Drive 1100> at scbus9 target 0 lun 0 (pass54,da48)

Pool as currently created (I've tried almost every possible configuration, including a striped pool containing only the SSDs):

root@truenas[/dev]# zpool status
pool: Pool1
state: ONLINE
config:

NAME STATE READ WRITE CKSUM
Pool1 ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    gptid/f10dfee4-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f13db027-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f145c7a0-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f18ac91b-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f4e848d9-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f4ded41e-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f4a6ee0f-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f4b017b0-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f5d27d06-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  raidz2-1 ONLINE 0 0 0
    gptid/f61cd5f5-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f5474cea-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f689c6de-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f6f8efae-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f798ddc6-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f77a1011-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/f7c07d97-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/fac906da-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/fe65c5f4-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  raidz2-2 ONLINE 0 0 0
    gptid/fb44c6ac-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/fe8464f1-5abb-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/00058e6d-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/0075fd01-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/01208b0b-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/01fd482b-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/0290317c-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/03cc3b17-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/034672f0-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  raidz2-3 ONLINE 0 0 0
    gptid/03adb6d1-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/037cdf89-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/0407f08d-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/04814688-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/04df525f-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/050a9d15-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/0579bb8a-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/090c4c5c-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/0a942462-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
special
  mirror-8 ONLINE 0 0 0
    gptid/0e167fe9-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
    gptid/0dfa2f2a-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
logs
  gptid/0d5bb269-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0d9af509-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0d8aaf88-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0db6bba3-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
cache
  gptid/0af5c0ee-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0c3778a9-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0c48c6cc-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0cc3cf1b-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0ccd7891-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0d53d5d8-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0

errors: No known data errors

pool: boot-pool
state: ONLINE
config:

NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
  mirror-0 ONLINE 0 0 0
    ada0p2 ONLINE 0 0 0
    ada1p2 ONLINE 0 0 0

errors: No known data errors
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
What drives are you using for the special VDEV?

You should be aware that it's a two-way mirror, which doesn't match the redundancy level of the other VDEVs in your pool (if you lose both of those drives, your pool is dead).
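If you keep it, one option would be to attach a third member so the special VDEV can survive two failures like your RAIDZ2 VDEVs. As a rough sketch (the gptid is one of the existing mirror members from your zpool status; daX stands in for a spare SSD):

Code:
# Attach a third disk to the existing special mirror (device names illustrative)
zpool attach Pool1 gptid/0e167fe9-5abc-11eb-89ab-0cc47a3ae82c daX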

Is that special VDEV a metadata VDEV?

Also, you should probably be aware that RAIDZ2 isn't a good match for block storage / IOPS performance.
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Check out the SLOG benchmark topic (linked at the bottom of this post) for SLOG performance numbers. Various enterprise SSDs (SAS, NVMe, SATA) may perform poorly as SLOG because they are designed for a different workload.
Additionally, you want to go with mirrored vdevs for block storage.

Looking at this:
Code:
logs
  gptid/0d5bb269-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0d9af509-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0d8aaf88-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0db6bba3-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
cache
  gptid/0af5c0ee-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0c3778a9-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0c48c6cc-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0cc3cf1b-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0ccd7891-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0
  gptid/0d53d5d8-5abc-11eb-89ab-0cc47a3ae82c ONLINE 0 0 0


You have striped those log devices. Why?
And why more than one L2ARC drive?

From my point of view, you have not understood how those drives work.
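If the intent is a redundant log rather than a striped one, the change would look roughly like this (gptids taken from the posted zpool status, daX/daY are placeholders; removing log and cache vdevs is non-destructive):

Code:
# Remove the striped log devices (repeat for each of the four log gptids)
zpool remove Pool1 gptid/0d5bb269-5abc-11eb-89ab-0cc47a3ae82c

# Re-add two suitable devices as a mirrored SLOG instead
zpool add Pool1 log mirror daX daY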

Ideas:
You have 12 SSDs with 800GB each.
You have 36 HDDs with 14TB each.

So I would go for roughly the following setup:
Pool layout:
- 18x mirror vdevs for data, each consisting of 2x 14TB HDDs
- 2x Intel Optane 100GB P48x for ZIL/SLOG
- 1x enterprise NVMe drive for L2ARC

If you can, max out the memory, which would be 192GB for your setup, or get a different board and go for 384GB.

Those 800GB SAS SSDs are too small for a special (metadata) vdev on a pool of your size (252TB).
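To get a rough idea of how much metadata the pool actually carries (and so how much special vdev capacity is really needed), zdb's block statistics can help. A sketch only; this walks the whole pool and can take a very long time at this size, and on TrueNAS CORE zdb may need to be pointed at the system's pool cache file:

Code:
# Per-type block statistics, including metadata totals (slow on large pools)
zdb -bb Pool1

# On TrueNAS CORE the pool cache file usually lives here:
zdb -U /data/zfs/zpool.cache -bb Pool1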
If you don't have the extra money, go for the following:

- 18x mirror vdevs for data, each consisting of 2x 14TB HDDs
- 2x 800GB SAS SSDs for ZIL/SLOG
- 1x 800GB SAS SSD for L2ARC (see the sketch below)
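A rough command-line sketch of that layout (device names are placeholders; on TrueNAS you would normally build this through the GUI, which also handles partitioning and gptids):

Code:
# 18 two-way mirror data vdevs -- repeat "mirror <hdd> <hdd>" for all 18 pairs
zpool create Pool1 \
    mirror da12 da13 \
    mirror da14 da15
    # ...and so on for the remaining 16 HDD pairs...

# Mirrored SLOG from two of the 800GB SAS SSDs
zpool add Pool1 log mirror da0 da1

# A single L2ARC device
zpool add Pool1 cache da2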

Once you have deleted your pool, can you test the write speed of a single SAS SSD?
diskinfo -wS /dev/XXX

So in your case it could be: diskinfo -wS /dev/da4
Please post the results here and in the SLOG benchmark thread linked below.

And then recreate the pool as mentioned and test again?


Slog benchmark thread: https://www.truenas.com/community/threads/slog-benchmarking-and-finding-the-best-slog.63521/


Maybe it's possible to create multiple mirrored vdevs for metadata. In that case, that might be a good use case for those 800GB SSDs. @HoneyBadger might know.

Looking at a review of those SSDs, they don't seem to be great at all: https://www.storagereview.com/review/seagate-1200-enterprise-ssd-review
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Maybe it's possible to create multiple mirrored vdevs for metadata. In that case, that might be a good use case for those 800GB SSDs. @HoneyBadger might know.
I really need to figure out why I'm not getting notified when people @ tag me. (Unless that was edited in after the initial post; an edited-in tag won't trigger the alert.)

@TheKurrgan -

I suspect there's something in the Seagate SSD firmware that's causing them to actually flush to NAND immediately rather than acknowledge the flush request and do it asynchronously with the knowledge that their PLP capacitors will hold up in case of unexpected power loss. I'd like to see the result of the diskinfo -wS /dev/XXX test (it's destructive, so make sure you cleanly remove the test-candidate drive from the pool first) to see if that might be the case.
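For clarity, the sequence would be roughly this, assuming one of the log devices is sacrificed for the test (gptid taken from the posted zpool status; replace da0 with whichever daX device that gptid corresponds to; diskinfo -wS overwrites data on the target):

Code:
# Removing a log vdev from the pool is safe and non-destructive
zpool remove Pool1 gptid/0d9af509-5abc-11eb-89ab-0cc47a3ae82c

# Destructive synchronous-write test on the freed-up disk
diskinfo -wS /dev/da0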

Mirrors are strongly encouraged for block data, as mentioned by @Herr_Merlin; see also the forum resource on why mirrors are preferred.
Using RAIDZ2, even with multiple vdevs, will cost you a significant amount of back-end vdev speed and result in much worse space efficiency than you are expecting.

For your metadata, you can absolutely set up multiple mirrored vdevs for special as well in order to have more metadata space. Multiple L2ARC/cache drives are fine, and they are fine as a stripe; losing an L2ARC drive is only a performance loss, not a data-integrity issue. Multiple SLOG devices can improve performance if the drives themselves are up to the task, but as I mentioned initially, I suspect the firmware is trying to be almost "too honest" in its response to a cache flush. A diskinfo test might suss that out.
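As a sketch of the special part, extra metadata mirrors are just added to the pool as additional special vdevs (device names are placeholders; match the redundancy you want for metadata, since losing a special vdev loses the pool):

Code:
# Each command adds another mirrored metadata vdev to the special class
zpool add Pool1 special mirror da4 da5
zpool add Pool1 special mirror da6 da7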
 