Which M.2 PCIe (2280) SSD for fast sync writes?

saveZFS

Explorer
Joined
Jan 6, 2022
Messages
87
I currently have a mirror of two Gigabyte SSDs (GIGABYTE NVMe SSD 512GB) and I am not at all satisfied with the sync-write performance (approx. 50 MB/s).
I think at this write rate even a SLOG won't be able to improve things much.
So, unfortunately, I will probably have to buy new SSDs. However, I don't want to make another bad purchase.
Can someone please give me a recommendation for M.2 PCIe SSDs (2280 form factor) that handle sync writes very well?

System (ESXi-Server):
Intel Xeon E3-1245v5
64GB DDR4 ECC
Boot: SSD Intel SSD DC S3610 Series 200GB
with TrueNAS-VM:
2 Cores
16 GB RAM
HBA: IBM M1215 IT-Mode (Passthrough) with: Pool1: 6 x WD Red 4TB (Raid-Z2) & Pool2: 2 x WD Red 5TB (Mirror)
Pool3: 2 x GIGABYTE NVMe SSD 512GB (Mirror) (Passthrough)
 
Joined
Oct 22, 2019
Messages
3,641
Not to go off topic, but do you believe the constant write benchmarks without trims in between might contribute to the slow write speeds you're seeing? After all, the NVMe SSD needs to clear and reset the cells before writing new data in the same areas.
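A quick way to check would be something like this (just a sketch, assuming the NVMe mirror really is named pool3 as in your list):
Code:
zpool get autotrim pool3   # is automatic TRIM enabled on the pool?
zpool status -t pool3      # shows the TRIM state of each device in the pool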
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Is the intention here to use the TrueNAS VM to present NFS/iSCSI storage back to the ESXi host (the "all-in-one" approach)?

16GB is very light on the RAM for serving that kind of workload from a read perspective. If your VMs have a light workload it may work just fine though.

Cheapest M.2 2280 option would be a pair of 16GB Optane M10 cards. Good for about 100MB/s sync writes (at 4K recordsize)
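If you went that route, attaching them would look roughly like this (only a sketch; the nvd device numbers are placeholders for whatever the M10s show up as):
Code:
zpool add pool3 log mirror nvd2 nvd3   # add the two M10s as a mirrored SLOG
zpool status pool3                     # a new "logs" section should show the mirror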
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Plus, 100 MB/s is already pretty good, IMHO. Since the OP, who is also a German speaker, asked me directly, I'll just share what I got with my Samsung 970 EVO Plus drives:
Code:
root@freenas[~]# nvmecontrol devlist
[...]
 nvme1: Samsung SSD 970 EVO Plus 1TB
    nvme1ns1 (953869MB)
 nvme2: Samsung SSD 970 EVO Plus 1TB
    nvme2ns1 (953869MB)
root@freenas[~]# zpool status ssd
  pool: ssd
 state: ONLINE
  scan: scrub repaired 0B in 01:10:58 with 0 errors on Sun Apr 10 01:11:08 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    ssd                                             ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/8d299abe-e22e-11ea-9ee7-ac1f6b76641c  ONLINE       0     0     0
        gptid/ffc026aa-5b0d-11eb-b6a3-ac1f6b76641c  ONLINE       0     0     0

errors: No known data errors
root@freenas[~]# zfs create ssd/test
root@freenas[~]# zfs set primarycache=none ssd/test
root@freenas[~]# zfs set compression=off ssd/test
root@freenas[~]# zfs set sync=always ssd/test
root@freenas[~]# dd if=/dev/zero of=/mnt/ssd/test/foo bs=10m count=1024
1024+0 records in
1024+0 records out
10737418240 bytes transferred in 81.451770 secs (131825474 bytes/sec)

That's 130 MB/s. I mean, 10G in just under one and a half minutes sustained write ... I wonder what the expectation is, here?
If an Optane can deliver 100MB/s then 50MB/s for some second tier brand "chips of the day" product is not particularly bad ...

I did not want to remove one of the drives from my productive mirror for a real "raw" write test. I hope the above approximates that well enough.

Kind regards,
Patrick
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That's 130 MB/s. I mean, 10G in just under one and a half minutes sustained write ... I wonder what the expectation is, here?
If an Optane can deliver 100MB/s then 50MB/s for some second tier brand "chips of the day" product is not particularly bad ...

I did not want to remove one of the drives from my productive mirror for a real "raw" write test. I hope the above approximates that well enough.

Kind regards,
Patrick
Do dd if=/dev/zero of=/mnt/ssd/test/foo bs=4k count=1m and see how the Samsungs hold up.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
So I got impatient and hit Ctrl-T ...
Code:
root@freenas[~]# dd if=/dev/zero of=/mnt/ssd/test/foo bs=4k count=1m
load: 0.65  cmd: dd 64991 [zcw->zcw_cv] 248.96r 0.24u 3.86s 1% 2552k
61627+0 records in
61627+0 records out
252424192 bytes transferred in 248.950508 secs (1013953 bytes/sec)


Well ...
 

saveZFS

Explorer
Joined
Jan 6, 2022
Messages
87
First of all thank you for your great support! :)

After all, the NVMe SSD needs to clear and reset the cells before writing new data in the same areas.
Does that mean you would recommend enabling trim support on the NVMe pool?

Is the intention here to use the TrueNAS VM to present NFS/iSCSI storage back to the ESXi host (the "all-in-one" approach)?
That's absolutely correct. And the NVMe pool should be the storage for the VMs (Windows, Linux, etc.).

16GB is very light on the RAM for serving that kind of workload from a read perspective. If your VMs have a light workload it may work just fine though.
OK, I will run some more tests tomorrow with 32 GB RAM. Maybe 16 GB is not enough.


Cheapest M.2 2280 option would be a pair of 16GB Optane M10 cards. Good for about 100MB/s sync writes (at 4K recordsize)
That's 130 MB/s. I mean, 10G in just under one and a half minutes sustained write ... I wonder what the expectation is, here?
Thank you both for sharing your experience and the test results from the Samsung SSDs. I think I had a completely wrong idea of the speeds I can expect here. I thought the 10 Gbit/s network to my other computer would be the limit here rather than the TrueNAS SSD mirror. ;)

These were my test values with a record size of 128 KiB:
[Screenshot: mb_sync.JPG (sync-write MB/s)]

[Screenshot: iops_sync.JPG (sync-write IOPS)]


So the values are not as bad as I thought, then (and please be completely honest)?
 
Joined
Oct 22, 2019
Messages
3,641
Does that mean you would recommend enabling trim support on the NVMe pool?
I'm assuming this means enabling "Auto Trim". If that's the case, then me personally? No. I wouldn't use that.

I would instead have a Cron Task run that runs zpool trim once a week.
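The weekly job itself would just be something like this (the pool name is only an example):
Code:
zpool trim pool3   # start a manual TRIM of all devices in the pool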

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That's absolutely correct. And the NVMe pool should be the storage for the VMs (Windows, Linux, etc.).
Are you currently using NFS? I assume so, as your sync-write performance seems like it's in the right ballpark. Obviously a pair of 16G Optane M10 devices isn't enough to hold any actual VMs - those would be SLOG devices. The alternative, if you're willing to accept the risk, is to disable sync writes (since you're in an all-in-one situation, and any hardware failure is likely to take down both the guest VM and your TrueNAS VM simultaneously) - but this runs the obvious risk of "your VMs could break themselves on power loss or ESXi crash." You can mitigate this risk by backing up your VMs (either at the ZFS or VM level) and storing a copy of them on the RAIDZ2 dataset.
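For reference, that's just a dataset property (sketch only; "pool3/vmstore" is a made-up dataset name):
Code:
zfs set sync=disabled pool3/vmstore   # accept the risk: sync requests are treated as async writes
zfs set sync=standard pool3/vmstore   # revert to the default behaviour later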

OK, I will make some more tests tomorrow with 32 GB RAM. Maybe the 16 GB are not enough.
Your read speeds look fine for now, but bear in mind they'll only be valid for the 16GB or so of the hottest data. Falling back to NVMe read speeds should be fine though. If you're not trying to run VMs off of the RAIDZ2 spinning disks you can probably get away with it, but manage your expectations appropriately.
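If you want to see how much of that RAM is actually working as ARC inside the VM, a quick sketch using the FreeBSD sysctl counters:
Code:
sysctl kstat.zfs.misc.arcstats.size                                  # current ARC size in bytes
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses   # cache hit/miss counters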

So the values are not as bad as I thought, then (and please be completely honest)?
They're exactly what I'd expect from consumer SSDs trying to handle a sync write workload.

Here's a few HDDs with an Optane SLOG:

[Screenshot: crystaldiskmark.png (CrystalDiskMark results)]
 

saveZFS

Explorer
Joined
Jan 6, 2022
Messages
87
Are you currently using NFS? I assume so, as your sync-write performance seems like it's in the right ballpark. Obviously a pair of 16G Optane M10 devices isn't enough to hold any actual VMs - those would be SLOG devices.
Yes, that's right.

The alternative, if you're willing to accept the risk, is to disable sync writes (since you're in an all-in-one situation, and any hardware failure is likely to take down both the guest VM and your TrueNAS VM simultaneously) - but this runs the obvious risk of "your VMs could break themselves on power loss or ESXi crash."
I started building this new system as an all-in-one system, and data protection comes first. So disabling sync writes is not an option, but thank you for the suggestion.

They're exactly what I'd expect from consumer SSDs trying to handle a sync write workload.
So, in theory, SSDs with PLP should be much faster at sync writes.
Can they also use their DRAM for sync writes, because it is safe thanks to the PLP?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Some of my results. The iSCSI network is 10 Gb, and all pools have sync=enabled.
This is a Windows 10 VM on ESXi, on a pool of 6 Intel DC S3610 SSDs (3 mirrored vdevs), with a mirrored Optane 900p SLOG:
[Screenshot: 1649717516305.png (benchmark results)]

Same machine on a pool of 8 Seagate Exos drives (4 mirrored vdevs) with the same SLOG:
[Screenshot: 1649716950353.png (benchmark results)]

Same machine on a slightly unbalanced NVMe pool with the same SLOG. Consumer drives; see my signature for details. The drives sit in an AliExpress 4-way NVMe-to-PCIe adapter card with bifurcation set to 4x4x4x4:
[Screenshot: 1649717100990.png (benchmark results)]


Now a single test on an NFS share, sync=enabled. The NFS network is 10 Gb.
Same machine on the previous pool of 8 Seagate Exos drives (4 mirrored vdevs) with the same SLOG, this time on an NFS share:
[Screenshot: 1649717298752.png (benchmark results)]
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, that's right.

I started building this new system as an all-in-one system, and data protection comes first. So disabling sync writes is not an option, but thank you for the suggestion.
If that's the case, you'll want to either add another pair of Optane SLOG devices on an add-in-card, to accelerate sync writes for the existing Gigabyte SSDs, or you'll need to replace the Gigabyte ones entirely. But NVMe M.2 2280 SSDs with power-loss-protection aren't likely to be cheap.
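If you go the replacement route, the Gigabyte drives can be swapped out one mirror member at a time without recreating the pool, roughly like this (a sketch only; you'd normally do it from the TrueNAS UI, and the device names are placeholders):
Code:
zpool replace pool3 gptid/<old-disk-guid> nvd2   # swap the first mirror member for the new drive
zpool status pool3                               # wait for the resilver to finish
# then repeat for the second mirror member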

So, in theory, SSDs with PLP should be much faster at sync writes.
Can they also use their DRAM for sync writes, because it is safe thanks to the PLP?
Correct, however they're expensive. Having the PLP-capable SLOG in front eliminates the requirement for PLP on your vdevs while keeping performance up.
 

saveZFS

Explorer
Joined
Jan 6, 2022
Messages
87
Some of my results. The iSCSI network is 10 Gb, and all pools have sync=enabled.
Wow, those are very impressive results. :)
I am most amazed by the performance of the conventional HDD pool with the SLOG.
Are the IOPS with a good SLOG similar in an HDD pool and a consumer SSD pool, or are the SSDs much faster here?

If that's the case, you'll want to either add another pair of Optane SLOG devices on an add-in-card, to accelerate sync writes for the existing Gigabyte SSDs, or you'll need to replace the Gigabyte ones entirely. But NVMe M.2 2280 SSDs with power-loss-protection aren't likely to be cheap.

Yes, my goal was to have a system that protects my data all the way from the OS (e.g. a Win10 VM) to the storage via ZFS and ECC. So I will replace the Gigabyte drives as soon as possible. Since 2280 SSDs with PLP are so expensive, I may have to look at U.2 SSDs and U.2-to-M.2 adapters, because unfortunately I don't have a free PCIe slot anymore! :(

Correct, however they're expensive. Having the PLP-capable SLOG in front eliminates the requirement for PLP on your vdevs while keeping performance up.
When I started the project and thought it through, I also considered a SLOG.
But I came to the conclusion that it's too risky for me as a beginner, because I can't fully assess the implications.
I've also read in several places that if the SLOG fails, the pool can be destroyed or has to be rebuilt.


Another idea:
Given the price, could I just buy one good PLP SSD and run it as a single-drive stripe,
but back up the pool daily to a Z2 pool via replication?
Then I would have almost the full ZFS protection, apart from automatic bit-rot correction?
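What I have in mind is roughly this (only a sketch; the pool and dataset names are just examples):
Code:
# first run: full copy of the VM dataset to the RAID-Z2 pool
zfs snapshot ssdpool/vms@daily-1
zfs send ssdpool/vms@daily-1 | zfs recv pool1/vm-backup
# following days: only send the changes since the previous snapshot
zfs snapshot ssdpool/vms@daily-2
zfs send -i @daily-1 ssdpool/vms@daily-2 | zfs recv pool1/vm-backup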
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Buy a couple of the U.2 900Ps from China - not too expensive - and a multi-U.2-to-PCIe adapter (as long as the motherboard supports bifurcation) from eBay/AliExpress/somewhere.

The figures I gave you there were, I guess, the equivalent of the sequential 1M write from @HoneyBadger. You can ignore the read value as it's the ARC.
But, just in case, this is the same machine on the DC S3610 pool with the SLOG:
[Screenshot: 1649755389295.png (benchmark results)]


The drives I use are: ebay Link. They are mixed-mode drives rated at 5 DWPD.
 

saveZFS

Explorer
Joined
Jan 6, 2022
Messages
87
Buy a couple of the U.2 900Ps from China - not too expensive - and a multi-U.2-to-PCIe adapter (as long as the motherboard supports bifurcation) from eBay/AliExpress/somewhere.
Thank you for your recommendation.
My problem is that I don't have a free PCIe slot. :( So I would have to use an M.2-to-U.2 adapter. I hope that can work.
Do you maybe have a link to where you bought your SSDs in China?
And would it also work with just one SSD plus replication to a Z2 pool, or is that not a good idea to save costs?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Optane 900p Link: Ebay Link

Yes, it would work with one Optane. A SLOG is not pool-critical; if it fails, ZFS falls back to the normal on-disk ZIL. (I think I am correct here.)

BTW - I noticed that you posted a partial kit list many posts up. However, you didn't mention the motherboard or what you are using the PCIe lanes for. I consider the CPU you are using to be a workstation CPU rather than a server CPU (very limited PCIe lanes, iGPU, etc.), so you will have to make compromises.
 

saveZFS

Explorer
Joined
Jan 6, 2022
Messages
87
Yes, it would work with one Optane. A SLOG is not pool-critical; if it fails, ZFS falls back to the normal on-disk ZIL. (I think I am correct here.)
OK, then I think I will try it next time with one U.2 SSD.
If I only use one SSD and not a mirror, should I scrub the pool very often to detect bit rot early?
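I.e. something like this on a regular schedule (only a sketch; the pool name is just an example):
Code:
zpool scrub ssdpool    # on a single-drive pool this detects, but cannot repair, bad data blocks
zpool status ssdpool   # shows scrub progress and any checksum errors found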

BTW - I noticed that you posted a partial kit list many posts up. However, you didn't mention the motherboard or what you are using the PCIe lanes for. I consider the CPU you are using to be a workstation CPU rather than a server CPU (very limited PCIe lanes, iGPU, etc.), so you will have to make compromises.
Yes :(
I have an Asus P10S-E-4L board,
but the CPU only supports 16 lanes :(
PCIe slot 1: Quadro K2000 (passthrough to an actively used VM)
PCIe slot 2: HBA IBM M1215
PCIe slot 3: USB3 card (passthrough to an actively used VM)
So there is no slot left and the lanes are already more than fully used! :(
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Just a thought - you could get rid of the Quadro and use QuickSync on the CPU for transcoding, could you not? That would free up a slot. Or replace the USB card with a PCI USB card - StarTech do one: Link to PCI USB Card
 

saveZFS

Explorer
Joined
Jan 6, 2022
Messages
87
Just a thought - you could get rid of the Quadro and use QuickSync on the CPU for transcoding, could you not? That would free up a slot.
Unfortunately no, the board does not support the iGPU. I have already contacted Asus support, and unfortunately they confirmed it! :(
Or replace the USB card with a PCI USB card - StarTech do one: Link to PCI USB Card
OK, that could be an option! :)
 

saveZFS

Explorer
Joined
Jan 6, 2022
Messages
87
Another idea.
The Intel SSD DC S3610 Series 200GB (which has PLP) is my local ESXi datastore. I currently share 20 GB of it with the TrueNAS VM.
Could I pass another virtual partition of it to the TrueNAS VM and use that as a SLOG?
Or is that a really bad idea, because TrueNAS then cannot address the hardware directly?
 