NVMe drives (RAIDZ1) - LBA Size vs. Record Size vs. ashift

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Hey there fellow TrueNAS users,

Recently, after a few traditional HDD failures (well... not complete failures, but multiple drives with steadily increasing bad sector counts), I've migrated my NAS to 4x 4TB NVMe drives instead. The NAS specs are as follows:

CPU: 10700K
Mobo: ASRock Z490M-ITX/AC
RAM: 32 GB DDR4-3600
Storage: 2x Crucial P3 Plus 4TB; 2x Lexar NM790 4TB
Boot: 2x 120 GB SATA SSDs (Intel & SanDisk)
Network: Mellanox ConnectX-3 (MCX311A-XCAT) 10 GbE, using a DAC directly into the switch. Systems on the network are either 2.5 GbE or 10 GbE.

Basically, the NAS is mainly used for general data storage (photos, media, etc.) along with desktop system backups - nothing too intensive like databases (yet!).
All drives are running at PCIe 3.0 x4 (i.e. even though they're Gen4 drives, I'll be looking at around 3000-3500 MB/s read/write per drive). Obviously I won't achieve anywhere near this even with simultaneous transfers, as I'll be capped by the network interface (roughly 1 GB/s over the 10 GbE link).
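
For reference, the rough line-rate arithmetic I'm working from (ballpark figures only; actual throughput depends on protocol overheads):

Code:
10 GbE  : 10 Gbit/s / 8  = 1.25 GB/s raw  -> roughly 1.1-1.2 GB/s usable
2.5 GbE : 2.5 Gbit/s / 8 = ~312 MB/s raw  -> roughly 280-295 MB/s usable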

What I did want to confirm, however, is whether there are any tangible benefits from tuning LBA size, record size, and ashift.

Currently, I've confirmed that the Crucial P3 Plus 4TB drives support a 4096-byte LBA format by running: sudo smartctl -c /dev/nvme2n1


Code:
Supported LBA Sizes (NSID 0x1)
Id Fmt   Data  Metadt  Rel_Perf
0  +      512       0         1
1  -     4096       0         0



Similarly, running: sudo nvme id-ns -H /dev/nvme2n1 | grep 'Relative Performance'


Code:
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x1 Better (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best



Repeating the above shows that the Lexar NM790 4TB drives appear to only expose a 512-byte LBA format (i.e. 512e, where 512-byte sectors are emulated on top of larger native flash pages):


Code:
Supported LBA Sizes (NSID 0x1)
Id Fmt   Data  Metadt  Rel_Perf
0  +      512       0         0

LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)



Based on the above, it seems I could reformat the P3 Plus drives to 4096-byte LBAs, but I'm not sure whether mixing them with the NM790s at 512 bytes would cause any sort of incompatibility in the RAIDZ1 configuration. Also, is the potential speed benefit even worth it? From what I could find online, people's benchmark results seemed a bit ambiguous.
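
If I do go down that path, my understanding is that the reformat itself would be something along these lines with nvme-cli (a sketch only - it wipes the namespace, and I'd double-check the LBA format index against the id-ns output above first):

Code:
# Switch the namespace to the 4096-byte LBA format (index 1 in the table above).
# WARNING: this destroys all data on the namespace.
sudo nvme format /dev/nvme2n1 --lbaf=1

# Confirm the new format is now marked "in use"
sudo nvme id-ns -H /dev/nvme2n1 | grep 'Relative Performance'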

That leads me to the next question regarding "ashift" (which, from what I can gather, sets ZFS's minimum block allocation size as a power of two). To confirm the existing pool's (NVME-16TB-Z1) ashift value, I ran: sudo zpool get ashift NVME-16TB-Z1


Code:
NAME          PROPERTY   VALUE  SOURCE
NVME-16TB-Z1  ashift     12     local



Based on the above, I can see it's at 12 (corresponding to 2^12, or 4096 bytes). Many posts refer to ashift=12 as a minimum, but some people run ashift=13 (8192 bytes) for things like Samsung NVMes. Is this worth changing for an all-NVMe setup, or is any performance gain device-specific at this point (i.e. only for particular Samsung NVMes)?
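
For what it's worth, my understanding is that ashift is fixed per vdev at pool creation, so testing 13 would mean destroying and recreating the pool - roughly along these lines (device names are just examples, and on TrueNAS you'd normally do this through the GUI, so treat it as a sketch):

Code:
# Only after the data is safely elsewhere - this destroys the pool!
sudo zpool destroy NVME-16TB-Z1

# Recreate the RAIDZ1 vdev with ashift=13 (8K minimum allocation size)
sudo zpool create -o ashift=13 NVME-16TB-Z1 raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# Verify
sudo zpool get ashift NVME-16TB-Z1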

Finally, I've read quite a few posts regarding record size, with many suggesting that if the average file is large, you should run a record size of 512K to 1M instead of the standard 128K. I guess this is because, as a "home user", I won't be writing a whole heap of small changes to something like a database? Rather, my data will largely be copied over in chunks, sit on the server, and mostly be read back. Has anyone noticed any performance difference in this kind of home use (system backups, media files, etc.)?

Would I be better off setting the record size for a specific dataset (e.g. Media) to 1M, while keeping the record size for other datasets (e.g. Docker containers, VMs, desktop backups) at something smaller, like the usual 128K?
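
Assuming the per-dataset route is the way to go, I gather it would just be something like this (dataset names are mine, and from what I've read recordsize only affects newly written data):

Code:
# Large, mostly-sequential media files
sudo zfs set recordsize=1M NVME-16TB-Z1/Media

# Mixed/general-purpose datasets stay at the default
sudo zfs set recordsize=128K NVME-16TB-Z1/Backups

# Check the result
sudo zfs get recordsize NVME-16TB-Z1/Media NVME-16TB-Z1/Backups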

I know it's a fair bit to sift through, but I'm sure there are some power users here who have been through this and can give me some pretty straightforward answers.
Hoping to get this sorted, so I can start migrating all my data over to the new drive setup!

Cheers,

Christian
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
It is not clear to me whether or not you are already reaching the limit imposed by your network. You seem to be implying that this is the case. If so, any other change on the NAS itself will make zero difference, since it does not attack the bottleneck.

If you are willing to spend the time (probably a couple of weeks), the next step is to understand your workload in a lot of detail and take it from there.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Based on the above, I can see it's at 12 (corresponding to 2^12, or 4096 bytes). Many posts refer to ashift=12 as a minimum, but some people run ashift=13 (8192 bytes) for things like Samsung NVMes. Is this worth changing for an all-NVMe setup, or is any performance gain device-specific at this point (i.e. only for particular Samsung NVMes)?
ashift=13 would be highly desirable for 8k-native SSDs, to avoid write amplification.
Since this does not appear to be your case, and changing ashift would mean recreating the pool (wiping the drives), I suggest leaving it at 12.

Finally, I've read quite a few posts regarding record size, with many suggesting that if the average file is large, you should run a record size of 512K to 1M instead of the standard 128K.
This is a possible optimisation on a per-dataset basis. For instance, if you have a dataset with only large media files (videos?) and no small sidecar files, a larger record size might be useful, though probably more so for HDDs and raidz than for SSDs and mirrors.

Short of switching to a platform which supports ECC RAM, the most obvious improvement to your system would be to replace the Mellanox NIC with one of the preferred brands: Chelsio, Intel or Solarflare.
 

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Hey Chris and Etorix, thanks for the replies.

It's not so much that I'm concerned about already maxing out the network capacity; it's more that I prefer to plan for the future (e.g. 40 Gbps links, once the hardware is cheaper).
I'd rather *NOT* have to move *ALL* the data around again, since I'd then need to buy enough storage to temporarily house the NAS's existing data (i.e. a spindle drive or two) before being able to make the changes.

Basically, I'd rather do the job right the first time than try to fix it in a couple of years.

Interestingly, I've spent more time copying to/from the machines with 2.5 GbE, as that's where most of the existing data has been temporarily backed up (spread across multiple machines). I have noticed some writes (from desktop to the NAS) dropping to around 230-240 MB/s, which seems to imply there's some sort of bottleneck. Most of the time they're pegged at just shy of 290 MB/s, which is more or less the theoretical limit of 2.5 GbE once overheads are taken into account. Those drops to 230-240 MB/s are what made me look into this, as I would have assumed that 4x Gen4 NVMe drives (even running at Gen3 speeds) in a RAIDZ1 setup would still significantly outpace a 2.5 GbE NIC - I was expecting even the 2.5 GbE systems to stay pegged at ~290 MB/s.

The only things I can think of, based on the above, are the following (there's also a quick raw-network check sketched after this list):

1) The NICs are getting hot and throttling/reducing throughput (unlikely; the cases have decent airflow and they're "premium" motherboards).
2) The switch is getting hot and throttling... again unlikely, as the switch is designed to handle 10 Gbps of throughput, so a single 2.5 Gbps stream should be a walk in the park.
3) The desktop's NVMe drives (2TB Samsung 980 Pros or 2TB WD SN850Xs, running at full speed) are somehow read-limiting the transfer... this is definitely a possibility, but it would surprise me given they're both higher-end Gen4 drives, particularly the SN850X. One way to help eliminate this would be to transfer from two different drives over the network at the same time (i.e. even if a single NVMe drive is under very random access, where read speeds may drop to the 100-200 MB/s range for some files, two drives copying simultaneously should still exceed 2.5 GbE capacity).
4) The NAS's NVMe drives are throttling due to sustained writes/temps, and/or a limitation of the DRAM-less design (again, highly unlikely, as I haven't seen the temps get too high; in the RAIDZ1 array the drives are each barely breaking a sweat). Similarly, most reviews show the NM790 outperforming the Samsung 980 Pro even in sustained writes, as it has HMB support despite its DRAM-less design.
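
To help separate the network path from the disks (points 1 and 2 above), I figure a raw iperf3 run between the desktop and the NAS is the easy first check - something like this, assuming iperf3 is available on both ends (the hostname below is just a placeholder):

Code:
# On the NAS:
iperf3 -s

# On the desktop (substitute the NAS's IP/hostname):
iperf3 -c nas.local -t 30

# If this holds ~2.3-2.4 Gbit/s on the 2.5 GbE clients, the network path is fine
# and the dips are coming from the disks or the file mix instead.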

With regards to ashift=13 (vs. 12), I guess the only way to tell would be to actually benchmark the scenario. I've only copied a small amount of data back to the NAS (about 1.5 TB), so reformatting isn't an issue yet. As far as 8K-native goes, the real issue seems to be working out what the drives actually are. From what I've read, modern Crucial NVMe drives supposedly use 16K+ native page sizes these days, and I have no doubt the NM790 is in a similar range... the 8K-native advice was for the older generation of Samsung drives (e.g. the 970). Any ideas on how to best benchmark these drives in TrueNAS SCALE? Are there any integrated tools for read/write performance testing?

Finally, with the NIC... I do remember reading quite some time back about the preferred NIC options for TrueNAS CORE (i.e. FreeBSD), with Chelsio being the preferred choice of iXsystems themselves due to compatibility. Looking online, I'm a bit hesitant to jump on the Intel bandwagon (e.g. X520), despite it looking like a decent card, as there are supposedly a significant number of counterfeits floating around these days. Similarly, the Solarflare cards look to be good options, but since AMD/Xilinx bought them out, their website appears to have been decimated, making it incredibly hard to find any info on the older ex-server cards floating around, let alone any firmware/config tools (unless someone here can point me in the right direction!). From what I've read, it's sometimes necessary to flash the firmware back to a "stock" variant, or to reset the configuration options.

From what I was able to see, Mellanox is pretty well supported in Linux, but I know there can be hassles for end users around card configuration/setup (e.g. InfiniBand vs. Ethernet mode). The card is detected properly in SCALE, including the PCIe interface (PCIe 3.0 x4), link speed, etc. It's already configured for Ethernet instead of InfiniBand, but I'll take the necessary precautions and update it to the latest firmware. Just to be certain, I'll also put a fan blowing on the ConnectX-3, to make sure it's not throttling due to heat.
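
For the record, this is roughly how I've been sanity-checking the link and PCIe status (the interface name and PCI address below are placeholders - substitute whatever SCALE assigns to the ConnectX-3):

Code:
# Negotiated link speed on the ConnectX-3 port
sudo ethtool enp1s0 | grep -i speed

# PCIe link speed/width as seen by the kernel
sudo lspci -vv -s 01:00.0 | grep -i lnksta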
 

Z0eff

Dabbler
Joined
Oct 21, 2023
Messages
17
Did you end up finding a way to figure out if an SSD is 8k native or not?
And did you figure out a way to benchmark that fit your needs?

Regarding your file transfers sometimes dipping to 230-240 MB/s, perhaps that's when it was copying many small files instead of large files?
I'm in a similar situation and was wondering if you had learned anything since the last post. :smile:
 

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Did you end up finding a way to figure out if an SSD is 8k native or not?
And did you figure out a way to benchmark that fit your needs?

Regarding your file transfers sometimes dipping to 230-240 MB/s, perhaps that's when it was copying many small files instead of large files?
I'm in a similar situation and was wondering if you had learned anything since the last post. :smile:
Hey mate, sorry for the late reply!

Haven't entirely worked it out yet, but I did find a way to benchmark IO performance natively on TrueNAS SCALE, using "fio".

Jump to my other thread here, method/results near the bottom:

Once my Thunderbolt 3 SFP+ network adapter arrives, I plan on doing some "real" tests at 10 gig, from my main system to the NAS. I'll then mess with the Crucial drives' LBA size (512 vs. 4096), as well as varying ashift values (i.e. currently on 12, but I'll try 13/14/15, etc.).
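
For anyone else reading along, the sort of fio run I've been using looks roughly like this (paths, sizes and job counts are illustrative only - point it at a scratch dataset, not live data, and note that ZFS's ARC caching can inflate the read numbers):

Code:
# Sequential write test against a scratch dataset on the pool
sudo fio --name=seqwrite --directory=/mnt/NVME-16TB-Z1/fio-test \
  --rw=write --bs=1M --size=10G --numjobs=4 --ioengine=libaio \
  --iodepth=16 --end_fsync=1 --group_reporting

# Sequential read test (fio lays out its own files first)
sudo fio --name=seqread --directory=/mnt/NVME-16TB-Z1/fio-test \
  --rw=read --bs=1M --size=10G --numjobs=4 --ioengine=libaio \
  --iodepth=16 --group_reporting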

I'll update the thread with results once I'm done, so that anyone with a similar issue in the future will hopefully have some answers.
 