SOLVED Sanity check of performance, RaidZ2 server 50% of RaidZ1 server

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
Hello @Kuro Houou

I am also investigating in the same direction as you:

... The only thing I haven't tested was striping two 3 Disk RaidZ1 Vdevs. ...

 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
So I took the plunge, I cleared my pool on my main server, threw in another 14TB and built a RaidZ1 7 disk pool. Results are not what I expected :(

I am still seeing very low speeds; it seems like about half of what my backup pool is capable of... These are all 7200rpm disks, so I don't know what is going on here...

Backup Server:
Run status group 0 (all jobs):
READ: bw=735MiB/s (771MB/s), 735MiB/s-735MiB/s (771MB/s-771MB/s), io=128GiB (137GB), run=178353-178353msec

Main Server:

Run status group 0 (all jobs):
READ: bw=259MiB/s (272MB/s), 259MiB/s-259MiB/s (272MB/s-272MB/s), io=128GiB (137GB), run=505611-505611msec
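
(The exact fio job isn't quoted again here; a single-client sequential read along these lines produces output in that format. The dataset path and parameters below are illustrative, not the original command.)

Code:
fio --name=seqread --directory=/mnt/tank/temp --rw=read --bs=1M --size=128G \
    --ioengine=posixaio --iodepth=16 --numjobs=1 --group_reporting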
 
Last edited:

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Remember that ZFS was written with enterprise use in mind: Serving a large number of clients rather than serving one client at maximal throughput.
Raidz# is devised for storage efficiency, not for performance.
Having more vdevs will give you more performance, especially in IOPS—which is of limited use with a single client—, but the typical configuration for performance is a large stripe of mirrors. With 8 TB and, even worse, 14 TB drives, you are looking at three-way mirrors if you want resiliency on top of performance.
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
Remember that ZFS was written with enterprise use in mind: Serving a large number of clients rather than serving one client at maximal throughput.
Raidz# is devised for storage efficiency, not for performance.
Having more vdevs will give you more performance, especially in IOPS—which is of limited use with a single client—, but the typical configuration for performance is a large stripe of mirrors. With 8 TB and, even worse, 14 TB drives, you are looking at three-way mirrors if you want resiliency on top of performance.

So just because a drive is large, performance tanks for single read/write operations? I mean, my 8TB drives are great in RaidZ1; right now the same exact config with 14TB drives cuts that performance in half? I didn't know that just having larger drives kills performance like that.
 
Last edited:

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
I think I am starting to figure something out... below are some results from some simple SMB file transfers. Simple test I know, but also real world.

Pool is 7 Disk RaidZ2

Dataset - temp1 - recordsize = 1M:
Read = 625MB/s
Write = 750MB/s

Dataset - temp128 - recordsize = 128K:
Read = 260MB/s
Write = 230MB/s

I checked my backup server; its datasets' record sizes are 128K, so it's still odd that it performs better at 128K. The other thing I thought was weird was that the dataset on my backup server had ashift = 0, while my main server is ashift = 12. I think 12 is the default... not sure why the backup is reporting 0. What does that mean?
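
(For anyone recreating this comparison, the two datasets can be set up with per-dataset recordsize properties; the pool name "tank" below is a placeholder.)

Code:
zfs create -o recordsize=1M tank/temp1
zfs create -o recordsize=128K tank/temp128
zfs get recordsize tank/temp1 tank/temp128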
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
a(lignment)shift=0 means ZFS doesn't properly align writes to sector boundaries of your storage devices. ashift=12 means alignment with (2^12=) 4096B = 4k sectors.
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
a(lignment)shift=0 means ZFS doesn't properly align writes to sector boundaries of your storage devices. ashift=12 means alignment with (2^12=) 4096B = 4k sectors.

Can this be why it performs faster by chance? Is it a bad thing that I should look into fixing?
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
No. IIRC

Code:
ashift=0


means "autodetect". Usually this meant (if anything at all) inferior performance, if the auto detection failed and auto-values were too small. (IIRC ZFS felt back to 9, while most modern disks were / are 4k -> ashift=12).

But how did you evaluate the current settings?

Code:
zpool get all


shows ashift=0 over here as well (=setting), but

Code:
zdb -U /data/zfs/zpool.cache


shows ashift=12 (=actual value).
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
But how did you evaluate the current settings?

I ran

Code:
zpool get all | grep ashift


but then I ran the other command you said, and it shows ashift = 12... So I guess it's all good.
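
(On current OpenZFS the pool property can also be queried directly, and the on-disk value confirmed with zdb. The pool name is a placeholder, and the property may still read 0 when it was left on auto.)

Code:
zpool get ashift tank
zdb -U /data/zfs/zpool.cache | grep ashift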
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
So just because a drive is large, performance tanks for single read/write operations? I mean, my 8TB drives are great in RaidZ1; right now the same exact config with 14TB drives cuts that performance in half? I didn't know that just having larger drives kills performance like that.
It's not a matter of performance, it's a matter of resiliency. HDDs occasionally throw an "Unrecoverable Read Error" (URE), at a rate of about 1 per 1e14 or 1e15 bits read. 12 TB is 0.96e14 bits… That doesn't mean there WILL be a URE when reading a whole drive, but the probability is not insignificant, even for Ultrastar DC drives, which should be rated at 1 in 1e15 bits. Basically, as drives keep getting larger, the maths behind this article on raid5 now applies to mirrors as well; a 2-way mirror of large HDDs would, upon the loss of one drive, be at non-negligible risk of an unrecoverable error. Essentially, we lose one level of redundancy to UREs: Raid5/raidz1/2-way mirror cannot safely sustain the loss of any drive; raid6/raidz2/3-way mirror can safely lose one drive; raidz3/4-way mirror can safely lose two drives.
Now, ZFS is smarter than some RAID controllers and would not lose the entire pool on a URE; it would flag the affected file as irrecoverable. Depending on your service requirements and backup strategy, losing some data or having to restore the file manually may be acceptable. Just be warned that 2-way mirrors of large HDDs have a non-negligible probability of NOT handling the loss of one drive as gracefully as one would expect.
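
As a rough back-of-the-envelope illustration, assuming the quoted URE ratings and treating bit errors as independent:

Code:
Full read of a 14 TB drive: 14e12 bytes x 8 = 1.12e14 bits

At 1 URE per 1e14 bits: P(full read with no URE) ~ (1 - 1e-14)^1.12e14 ~ e^-1.12  ~ 33%
At 1 URE per 1e15 bits: P(full read with no URE) ~ e^-0.112                       ~ 89%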

If you want performance, don't use raidz, stripe mirrors—lots of mirrors. Full stop.
But if you want redundancy/resiliency (which would be expected if you go for ZFS) AND high capacity ON TOP of high performance, the costs will be staggering.
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
So I am leaning towards a RaidZ2 config again, but with a record size of 512K. 1M would work too and be a little more performant, but I see 512K as a good balance as well.

In my tests I am showing reads over 500MB/s and writes over 650MB/s. So I think it's looking like, at least with these drives, the best way to increase performance is to increase the record size.

I guess the HGST 8TB drives must have insane IOPS and work just as well as my 14TB drives with a 128K record size. Still doing more tests, so I will report back a little later.
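
(One caveat, in case the recordsize is changed on a dataset that already holds data: the property only applies to blocks written after the change, so existing files keep their old record size until they are rewritten. The dataset name below is a placeholder.)

Code:
zfs set recordsize=1M tank/media
zfs get recordsize tank/media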
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Just remember that you're testing with a "single client", so real-world performance where there may be more users/clients of your pool could produce better results with 6 than 5.
I would like to reiterate how important the "real world performance" part of this is. With parallel access (= many clients) a number of things kick in that change the performance characteristics. Spawning separate threads, in addition to their startup time etc., will also affect things like context switching and cache hit/miss ratios on the CPU. In the case of storage that is probably less of a problem than in, e.g., an application server scenario, since I/O is still the bottleneck.

In either scenario it is far from trivial to model a test such that it reflects the critical aspects of the real world. In this case I am not aware of any mention of the access protocol (NFS, iSCSI, SMB, etc.) in the thread. So whatever the requirement is, that is something that needs to be factored in. I will not disagree that the actual disk performance is the foundation and therefore deserves to be looked at in isolation. But if the workload is anything other than copying around large files, likely none of the findings will apply anyway.

My main job is roughly in the application server camp, not on the storage side. I have taken the liberty of "translating" some of the performance tuning lessons I learned there onto the storage side. The details will be different, of course, but I doubt that conceptually things will be worlds apart. I hope that helps, although I have to concede that it is not directly actionable. Happy to discuss further!
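
A rough way to approximate several concurrent clients at the fio level is to run multiple jobs in parallel; this is only an approximation of real SMB/NFS clients, and the path and parameters below are illustrative:

Code:
fio --name=multiclient --directory=/mnt/tank/temp --rw=read --bs=128k \
    --size=8G --numjobs=8 --ioengine=posixaio --iodepth=8 --group_reporting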
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
I appreciate all the help throughout this thread. I guess I should explain my workload a bit more. It is mostly media based: media creation, editing, streaming, etc. This pool will be used over SMB shares for things like photo/video editing and content movement to and from my desktops, with probably only 1, maybe 2, people using it at the same time. I have another pool on this NAS that hosts jails/applications, and I plan on leaving those datasets at the 128k recordsize. But with this pool, most files will be over 1MB, which means a 1MB recordsize makes the most sense from what I have read. I am posting the data I have collected on my main NAS server below so you can see what kind of results I got.

The tests were a synthetic test, as described in my second post of this thread with the fio command, as well as an SMB share read/write test, which is a real-world test. So basically I am testing both local and network read/write speeds.

The results are pretty clear and straightforward. Having a 512k or 1M recordsize is the only way to reach my goal of 500MB/s read/write speeds; 128k is CRIPPLING! It seems that having 7 disks provides a little more speed, at least at the local level, while SMB R/W seems the same regardless of a 6- or 7-disk RaidZ2, which means I might need to tune my network settings some. I have decided to go with the 1M recordsize. While 512k and 1M are pretty much exactly the same when it comes to SMB speeds, I am hoping that with some network tuning I can get slightly better performance to match the local disk/synthetic speeds more closely and put my 10GbE network to good use ;)

You can see two tests at the bottom as well, showing a 7-disk stripe config, which kind of shows the theoretical max, and a striped-mirror setup where I had 3 mirrors of two disks each, striped together into a 6-disk array. I didn't get a chance to test them with a 1M recordsize, but I can only assume they would have done better, although I believe the stripe synthetic write test is maxing out the disks at their stated speed of 250MB/s.

Lastly, I still have no idea how my backup server with its 7 disks in RaidZ1 performs so well at the 128K recordsize... maybe those disks just have insane IOPS. I do love the old HGST disks ;)

7-Disk RaidZ2 (speeds in MB/s):
Record size | Synthetic Read Test | Read SMB (aio 0) | Synthetic Write Test | Write SMB
1M          | 880                 | 550              | 1180                 | 740
512k        | 686                 | 500              | 905                  | 770
256k        | 373                 | 270              | 465                  | 430
128K        | 204                 | 200              | 233                  | 215

6-Disk RaidZ2 (speeds in MB/s):
Record size | Read Test | Read SMB (aio 0) | Write Test | Write SMB
1M          | 830       | 580              | 974        | 800
512k        | 620       | 500              | 864        | 780
256k        | 415       | 350              | 461        | 420
128K        | 249       | 230              | 284        | 245

Other configurations (speeds in MB/s):
Configuration        | # of Disks | Record Size | Syn Write | SMB Read
Stripe               | 7          | 128K        | 1700      | 637
Mirror (2 disks x 3) | 6          | 128K        | 733       | 330
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
It is mostly media based, media creation, editing, streaming, etc. This pool will be used over SMB shares to allow things like photo/video editing and content movement to and from my desktops and probably only 1 maybe 2 people using it at the same time
If you plan to also edit directly on the NAS there is a chance that your tests, at least as I understand them, may be irrelevant. My understanding is that you test for linear transfer of large files only. Editing and browsing the timeline will likely be a different access pattern, where IOPS play a much bigger role. So I would recommend adding those things to your tests asap.
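
A quick way to approximate that access pattern is a small-block random read job, which stresses IOPS rather than streaming bandwidth; the path and parameters below are illustrative, not a calibrated editing workload:

Code:
fio --name=randread --directory=/mnt/tank/temp --rw=randread --bs=16k \
    --size=8G --numjobs=4 --ioengine=posixaio --iodepth=16 --group_reporting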
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
If you plan to also edit directly on the NAS there is a chance that your tests, at least as I understand them, may be irrelevant. My understanding is that you test for linear transfer of large files only. Editing and browsing the timeline will likely be a different access pattern, where IOPS play a much bigger role. So I would recommend adding those things to your tests asap.

I did some small file transfer tests too, with lots of files under 1MB... not exactly the same as an editing test, but the speeds were still better than with the 128k record size. Not apples to apples for editing, like you say, so I'll have to test that as well.
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
The other interesting thing I am noticing, which I expected but is still interesting to see: small files tend to be about 1k bigger on disk than their actual size, while big files end up much smaller on disk, by about 6%, or 60MB for every 1GB.
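
One way to look at the same effect at the dataset level is to compare the logical size with the space actually charged to the dataset; the difference reflects compression plus per-record parity/padding and metadata overhead. The dataset name below is a placeholder:

Code:
zfs get used,logicalused,compression,compressratio tank/media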
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
Got one 1TB dataset transferred, mostly large video files. It looks like it's using up 965GiB on my backup server with the 128k recordsize, but on my main server with the 1M recordsize it's only using 897GiB. That's almost 70GB in saved file space! :)

The other issue I am finding is that the rsync job I created to move files from a backup dataset to a new one is painfully slow for some reason. I think I'll just do a restore through the replication task, then do an rsync locally from the restored dataset to the final dataset with the 1M recordsize. I wish the replication task would reset the record size on transfer; it's making my life a pain right now!
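
For the local copy step, a plain archive-mode rsync between the two datasets rewrites every file and therefore picks up the destination's 1M recordsize; the paths below are placeholders:

Code:
rsync -aH --info=progress2 /mnt/tank/restored/ /mnt/tank/media/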
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
Well, just closing out this thread, I guess, at least for now. I finished my restore from backup! Everything is working great so far. Adobe Premiere is happy, Lightroom as well... everything is working perfectly with the new recordsize of 1M. It's only been a day or so, but so far I'm very happy with the results! I mean, transferring files between computers is now 2-3X faster, so I am happy! Oh, and an added benefit: I now have 3TB of extra free space!!! Yeah, that's how much space I am saving with 1M record sizes! From 34TB of used space to 31TB!! Win Win!!!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Everything is working great so far. Adobe Premiere is happy, Lightroom as well... everything is working perfectly with the new recordsize of 1M. It's only been a day or so, but so far I'm very happy with the results! I mean, transferring files between computers is now 2-3X faster, so I am happy! Oh, and an added benefit: I now have 3TB of extra free space!!! Yeah, that's how much space I am saving with 1M record sizes! From 34TB of used space to 31TB!! Win Win!!!
Thanks for sharing the feedback and all your work along the way. I really commend your patience and perseverance in getting to a result you're happy with.
 