Slow scrub performance (RAIDZ1, 4x nvme drives)

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Hey there fellow TrueNAS SCALE users,

Basically, I shifted to a RAIDZ1 array a while back, consisting of 4x 4TB NVMe M.2 drives. This gives a usable capacity of about 10.4TB, of which I've only used about 2.5TB.
Anyhow, tonight while trying to scrub my old spindle disk array (one drive is in rough shape), in preparation for dumping its data onto the NVMe array, the system tripped off unexpectedly. I'm not exactly sure why, but I suspect a fault with one of the spindle drives, as they were unplugged until today and the system had been running fine. I unplugged the spindle drives again and was able to reboot the system without issue.

As a precaution, I decided to scrub the NVMe array, and noticed it was estimating about 4 hours to complete. Isn't this a bit slow for NVMe drives in a RAIDZ1 array? I'm not entirely clued in on the inner workings of scrub, but I'm assuming it only needs to scrub the actual data on the drives (ie. approx. 2.5TB of the 10.4TB total size)?
Is this a "read"-based process, which only writes changes as it detects errors?
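(Side note: scrub progress and the per-disk read rates can be watched from the shell with the standard zpool tools; a minimal example using my pool name:)

    zpool status NVME-16TB-Z1          # scrub progress, speed estimate, and any repaired errors
    zpool iostat -v NVME-16TB-Z1 5     # per-vdev / per-disk read & write bandwidth, refreshed every 5 seconds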

The drives are all Gen4 drives running at Gen3 speeds... so for reads we should theoretically be talking 3000-3500 MB/s per drive. The slowest drives in the array are Crucial P3 Pluses (rated 5000 MB/s read / 4200 MB/s write), while the other two are Lexar NM790s (7400 read / 6500 write). Even if you assumed they only managed 1/5 of their Gen3 read speed on average (ie. 3000 MB/s x 20%, across a variety of file sizes/types), you're still talking 600+ MB/s per drive, or roughly 2400 MB/s for the array. Looking at the stats for each drive, I can see they're actually reading at roughly 50 MB/s, with spikes up to 400 MB/s (peaks around 450 MB/s) for a short while, then dropping back to roughly 50 MB/s again, and so on. Basically, averaging 95 MB/s per drive... which seems awfully slow to me?

The drives are *not* overheating; they're averaging about 45-49 Celsius each during the scrub, which is nowhere near the throttling point for NVMe drives (typically 70+). The highest temp I've seen is 54 Celsius.
Similarly, CPU load is averaging maybe 15%, with a peak up to 25% every now and then. Barely breaking a sweat at about 40 Celsius.
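(In case anyone wants to check the same on their own box: the temperatures can be read with smartctl from the SCALE shell; a quick sketch, assuming the four drives show up as /dev/nvme0 through /dev/nvme3:)

    # per-drive model and temperature readout (device names are an assumption)
    for d in /dev/nvme0 /dev/nvme1 /dev/nvme2 /dev/nvme3; do
        smartctl -a $d | grep -i -E 'model number|temperature'
    done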

System specs are as follows:

CPU: 10700K
Mobo: ASRock Z490M-ITX/AC
RAM: 32GB DDR4-3600
Storage: 2x Crucial P3 Plus 4TB; 2x Lexar NM790 4TB.
Boot: 2x 120GB SATA SSDs (Intel & SanDisk)
Network: Mellanox ConnectX-3 (MCX311A-XCAT) 10GbE; using a DAC directly into the switch. Systems on the network are either 2.5GbE or 10GbE.

Any help is *greatly* appreciated!
 

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Update: the scrub took about 1h 48m in the end... which is still terribly slow.

Taking the amount of data it scrubbed (2.5TB) into account, that works out to about 390-400 MB/s combined (2.5TB over roughly 6,500 seconds), ie. each disk is doing about 95-100 MB/s, as above.

Looking into slow write speeds... I've seen posts from people in similar scenarios with lacklustre NVMe write performance that appeared to come down to sync writes and the lack of a dedicated SLOG. Although I'm not sure how much that would affect a scrub, unless significant repairs were having to be made?
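(For what it's worth, my understanding is that a scrub is almost entirely read traffic, with writes only issued to repair blocks that fail their checksums, so sync settings and a SLOG shouldn't really come into it. To rule it out, the relevant properties can be checked with something like:)

    zfs get sync,logbias NVME-16TB-Z1    # 'standard' sync with no heavy sync writers means a SLOG wouldn't help much anyway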
 

PhilD13

Patron
Joined
Sep 18, 2020
Messages
203
Something hardware-wise must be wrong. My 800GB mirrored SSD boot pool finishes in 40 seconds, which I would think would be slower than M.2s, and my main pool finishes in a little over 4 hours, and that's spinning rust consisting of 2 vdevs of 8 x 8TB drives each in RAIDZ2.
Boot pool: 800GB mirrored SSDs
scan: scrub repaired 0B in 00:00:40 with 0 errors

Pool1: Capacity 82.63 TiB, Used 24.24 TiB
scan: scrub repaired 0B in 04:14:05 with 0 errors
I have no additional cache (L2ARC) or SLOG/ZIL, etc. I use 80% (82GB-ish) of 128GB of memory for cache. System is in my sig.
 
Joined
Sep 22, 2020
Messages
3
400MB/s is about right. I returned these very same SSDs because their native flash is horrendously slow.

See their review here for more info

Relevant paragraphs:

"The good news is, the P3 Plus can absorb up to about 550GB of writes within its pSLC cache. This indicates that all of the QLC is capable of acting in single-bit mode for a total cache capacity that’s one-fourth of the flash. This cache will diminish in size proportionately as the drive is filled based on how much space is left free. A large, dynamic cache is a good way to hide weak native performance. The P3 Plus’s cache is ample to handle typical, bursty workloads.

The bad news is that the native performance is extremely poor. Speeds drop down from 4.4 GBps to 100 MBps. We know that this flash is about as fast as Intel’s 144-layer QLC, that is up to 40 MBps per die, but the use of a massive cache forces the drive to bottleneck as it must free up capacity by moving data over to QLC. The 670p has some static cache and also DRAM, so it doesn’t suffer quite as much. Within the compared drives we do see weak post-cache performance from the SN770, Rocket NVMe 4.0, and SN770 as well."

So yeah, after 550GB they're slower than spinning rust. I recommend the Lexar NM790s as a better replacement. Much higher native speed and rated for more TB written (TBW).
 
Joined
Sep 22, 2020
Messages
3
Just noticed you have some NM790s! I haven't run into such slow speeds with them, but perhaps they're being dragged down to match the P3 Pluses in the array?
 

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Just noticed you have some NM790s! I haven't run into such slow speeds with them, but perhaps they're being dragged down to match the P3 Pluses in the array?
Yes, that's what I was thinking could possibly be the problem... I'd seen that review previously, although the SLC/QLC cache issue usually only applies to sustained sequential writes... ie. a proportion of the flash runs in SLC mode, allowing a maximum of about 550GB to be written sequentially before the speed drops...

Once the drive is relatively inactive/idle for a period, the controller shifts the data in the dynamic SLC cells over to QLC cells... basically "offline" and unnoticeable to the end user, thereby freeing up most of the dynamic SLC cache. As the drive fills up, this dynamic SLC cache shrinks to accommodate file writes (ie. there's less free space left to use as SLC). From the review: "On the bright side, the P3 Plus is very good at recovering its pSLC cache given sufficient idle time. Leaving some data in the cache can be beneficial for future reads on newly-written data."

While I don't fully understand the exact low-level workings of ZFS, I've used 2.4 TiB of 10.4 TiB usable (total of 3.64 TiB x 4, or 14.56 TiB raw, in RAIDZ1). This leaves 8.04 TiB free...
So, based on the above, each drive has used about 1.72 TiB (of its 3.64 TiB)... which is 47% full as a worst-case scenario. Now, this includes the overhead/reserved space that ZFS uses for its file system (1.04 TiB per drive?), so it may not actually have been written to the drives yet, given that they're not full. This is the part I'm not 100% sure on, but I'd assume the logical side of TrueNAS/the file system just reserves the space and doesn't write empty bits/data.
Even if we said the drives are 50% full and the dynamic cache is halved (eg. ~275GB), I should still see decent write performance for that amount (or at least 100+GB)... yet I'm seeing these poor speeds consistently...
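(If anyone wants to see the post-cache behaviour directly, a sustained sequential write bigger than the claimed ~550GB pSLC cache should show the drop-off; a rough sketch only, with the size and log name just examples, and the test file deleted afterwards:)

    cd /mnt/NVME-16TB-Z1/Vault/jel/fiotest
    fio --name=pslc-fill --ioengine=posixaio --rw=write --bs=1M --size=600g \
        --numjobs=1 --iodepth=8 --end_fsync=1 --write_bw_log=pslc-fill
    # the pslc-fill_bw.*.log file records bandwidth over time, so the point where it falls off a cliff is obvious
    rm -f pslc-fill.0.0 pslc-fill_bw.*.log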

The only thing someone mentioned I can do is run fio to test speeds natively on the NAS, to see if it's a hardware issue...
Now the question is: if I use fio to do a drive test, will it use RAM as a temporary cache first? The system only has 32GB, so I figured I could just write 50GB+ of test files, as that would saturate it if that's the case...
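(One way I've seen suggested to keep ZFS's RAM cache (the ARC) from inflating read numbers is to make a throw-away dataset for the test and have ZFS cache only metadata on it; the dataset name below is just an example:)

    zfs create NVME-16TB-Z1/fiotest
    zfs set primarycache=metadata NVME-16TB-Z1/fiotest   # reads must come from the drives, not from RAM
    # ... run the fio tests inside /mnt/NVME-16TB-Z1/fiotest, then:
    zfs destroy NVME-16TB-Z1/fiotest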

I'll get back to you guys with the numbers :)
 

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
So guys, some interesting results with IO testing using "fio" in shell...

Basically, I ran sequential read/write and random read/write tests across a range of block sizes... dramatic increases in performance appear at 128k+ block sizes, leading me to believe it's down to the default ZFS record size; or the native block size of the flash may be around that number (ie. between 64k and 256k), so ashift may need adjusting; or it's related to controller performance. Or I'm wrong about all of this.
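(If it helps anyone working through the same question, the pool's ashift and the dataset's record size can be read straight off with the usual tools; the dataset name here is taken from my path below:)

    zpool get ashift NVME-16TB-Z1                      # 12 = 4K sectors, 13 = 8K
    zfs get recordsize,compression NVME-16TB-Z1/Vault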

Found a great fio guide at: https://arstechnica.com/gadgets/202...-disks-find-out-the-open-source-way-with-fio/

For anyone reading this in future, here's what I did to test IO performance of the pool/drives in SCALE (a consolidated sketch that loops over block sizes follows after the list):
  • Open the shell and change directory to a folder you wish to use for testing, inside your dataset (eg. cd /mnt/POOL/DATASET/USER/folder... for me it was: cd /mnt/NVME-16TB-Z1/Vault/jel/fiotest). This is where the test files fio creates will be written/read (ie. the active directory, called fiotest in my case). You want to be able to delete these easily afterwards, hence placing them somewhere you're meant to be reading/writing.
  • You're then going to run fio, using a command like this:
    • fio --name=seq-write128k --ioengine=posixaio --rw=write --bs=128k --size=4g --numjobs=1 --iodepth=1 --runtime=30 --time_based --end_fsync=1
      • --name= This is the name of the job (and of the test file fio creates), so call it whatever you like... if you're doing sequential reads, you may want something like seq-read128k (ie. referring to the 128k block size).
      • --ioengine= This is the mode fio uses to submit I/O to the filesystem. AIO stands for Asynchronous Input/Output (ie. multiple operations can be queued up, and the OS chooses how to carry them out). posixaio works on Linux, BSD and macOS, so it's the one to use on SCALE.
      • --rw= This is the type of operation to perform. Options are as follows: read (sequential read), write (sequential write), randread (random read), randwrite (random write). There are a few others like rw (sequential read/write mix) and randrw (random read/write mix), but we'll just use separate sequential and random reads/writes.
      • --bs= This is your block size. Typical options: 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M, etc.
      • --size= This is the size of the test file you wish to make. The example above is 4GB in size.
      • --numjobs= This is the number of jobs (workers) to run at a time. In the example above we're running a single job on a single file, so it's set to 1. If you wanted multiple parallel workers (eg. 8), you'd put 8, which creates 8 separate files; see how it interacts with iodepth below.
      • --iodepth= This is how many I/O requests each job keeps in flight at once. With 1, each operation must complete before the next is issued. As we're only dealing with a single file in this example, we use 1. If you wanted 8 outstanding operations at a time, you'd put 8.
      • --runtime= This (together with --time_based) makes the run time-based, in seconds. Even if fio finishes writing the 4GB before the 30 seconds are up, it will start over and keep re-writing data until the time has elapsed. The file will still only be the size you specified in --size=.
      • --end_fsync= This tells fio to issue an fsync at the end of the job, so the clock keeps running until the OS reports that everything has actually been written to disk.
    • fio will then run, and finally spit out the results, with something like this:
      • read: IOPS=7340, BW=7341MiB/s (7697MB/s)(430GiB/60001msec)
          slat (nsec): min=256, max=206834, avg=689.87, stdev=859.61
          clat (usec): min=60, max=3436, avg=135.04, stdev=27.91
        ...
        READ: bw=7341MiB/s (7697MB/s), 7341MiB/s-7341MiB/s (7697MB/s-7697MB/s), io=430GiB
        The above shows the IOPS (in this case 7,340) and the average bandwidth (7341 MiB/s). If you run multiple jobs, you'll also get a min-max range as well.
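  • Putting it all together, here's the consolidated sketch mentioned above, looping over block sizes and test types (the block sizes, file size and runtime are just examples; fio leaves its test files behind as <jobname>.0.0, so delete them afterwards):
      cd /mnt/NVME-16TB-Z1/Vault/jel/fiotest
      for bs in 4k 16k 64k 128k 256k 1M; do
        for rw in read write randread randwrite; do
          fio --name=${rw}-${bs} --ioengine=posixaio --rw=${rw} --bs=${bs} \
              --size=4g --numjobs=1 --iodepth=1 --runtime=30 --time_based \
              --end_fsync=1 | grep -E 'IOPS|bw='
        done
      done
      rm -f ./*.0.0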
 

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Now, here's my results:

[Attached image 1701774772018.png: table of fio benchmark results across block sizes and test types]

Notes on the benchmarking:
  • File size = 4GB
  • Number of Jobs = 1 (single stream/operation)
  • IO Depth = 1 (ie. a single outstanding operation)
  • Time = 30s
  • This should be a "worst case" scenario benchmark... 4k random read/write increases dramatically with more jobs/higher IO depth (eg. 1000-1500 MiB/s for 4k random write with, say, iodepth = 16 and numjobs = 16)
Based on the above, you can see that 4k performance is horrendously bad (both sequential and random) on my setup... but I would think most of the data on my NAS wouldn't be affected by this (photos, movies, documents, etc.... most things are significantly larger than 4k!). Things only really improve around the 128k block size mark... which is even more confusing, as that's the default ZFS record size.

I would have thought a scrub would be mostly 128k sequential-ish reads (given the record size), along with the odd write where a repair is needed? It seems strange that the scrub was so slow...
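(One thing that might be worth checking next time a scrub runs: zpool iostat can show a histogram of the actual request sizes being issued, which would confirm whether the scrub really is doing large reads or getting chopped into small I/Os:)

    zpool iostat -r NVME-16TB-Z1 5    # per-vdev request-size histograms, refreshed every 5 seconds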

In a few days my Thunderbolt 3 SFP+ adapter will be here, so I'll turn on jumbo frames and try copying from my main system to the NAS... it should be a 10Gb connection, so we'll see the "real world" throughput of copying/reading data from the NAS.
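(Before blaming the pool for any "real world" numbers, it's probably worth separating network from disk; iperf3 is available on SCALE as far as I know, so something like the below would show the raw line rate. The IP is just a placeholder for the NAS address:)

    iperf3 -s                        # on the NAS
    iperf3 -c 192.168.1.100 -P 4     # on the desktop, 4 parallel streams towards the NAS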
 