SOLVED Sanity check of performance, RaidZ2 server 50% of RaidZ1 server

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
Hello @Kuro Houou

I am also investigating in the same direction as you:

... The only thing I haven't tested was striping two 3 Disk RaidZ1 Vdevs. ...

 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
So I took the plunge, I cleared my pool on my main server, threw in another 14TB and built a RaidZ1 7 disk pool. Results are not what I expected :(

I am still seeing very low speeds; it seems like about half of what my backup pool is capable of... These are all 7200rpm disks, so I don't know what is going on here...

Backup Server:
Run status group 0 (all jobs):
READ: bw=735MiB/s (771MB/s), 735MiB/s-735MiB/s (771MB/s-771MB/s), io=128GiB (137GB), run=178353-178353msec

Main Server:

Run status group 0 (all jobs):
READ: bw=259MiB/s (272MB/s), 259MiB/s-259MiB/s (272MB/s-272MB/s), io=128GiB (137GB), run=505611-505611msec
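
(The exact fio job isn't quoted again here; a single-client sequential read along these lines produces output in that format. The dataset path and parameters below are illustrative, not the original command.)

Code:
fio --name=seqread --directory=/mnt/tank/temp --rw=read --bs=1M --size=128G \
    --ioengine=posixaio --iodepth=16 --numjobs=1 --group_reporting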
 
Last edited:

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Remember that ZFS was written with enterprise use in mind: Serving a large number of clients rather than serving one client at maximal throughput.
Raidz# is devised for storage efficiency, not for performance.
Having more vdevs will give you more performance, especially in IOPS—which is of limited use with a single client—, but the typical configuration for performance is a large stripe of mirrors. With 8 TB and, even worse, 14 TB drives, you are looking at three-way mirrors if you want resiliency on top of performance.
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
Remember that ZFS was written with enterprise use in mind: Serving a large number of clients rather than serving one client at maximal throughput.
Raidz# is devised for storage efficiency, not for performance.
Having more vdevs will give you more performance, especially in IOPS—which is of limited use with a single client—, but the typical configuration for performance is a large stripe of mirrors. With 8 TB and, even worse, 14 TB drives, you are looking at three-way mirrors if you want resiliency on top of performance.

So just because a drive is large, performance tanks for single read/write operations? I mean, my 8TB drives are great in RaidZ1; right now the same exact config with 14TB drives cuts that performance in half? I didn't know that just having larger drives kills performance like that.
 
Last edited:

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
I think I am starting to figure something out... below are some results from some simple SMB file transfers. Simple test I know, but also real world.

Pool is 7 Disk RaidZ2

Dataset - temp1 - recordsize = 1M:
Read = 625MB/s
Write = 750MB/s

Dataset - temp128 - recordsize = 128K:
Read = 260MB/s
Write = 230MB/s

I checked my backup server; its datasets' record sizes are 128K, so it's still odd that it performs better at 128K. The other thing I thought was weird was that the dataset on my backup server had ashift = 0, while my main server is ashift = 12. I think 12 is the default... not sure why the backup is reporting 0. What does that mean?
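
(For anyone recreating this comparison, the two datasets can be set up with per-dataset recordsize properties; the pool name "tank" below is a placeholder.)

Code:
zfs create -o recordsize=1M tank/temp1
zfs create -o recordsize=128K tank/temp128
zfs get recordsize tank/temp1 tank/temp128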
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
a(lignment)shift=0 means ZFS doesn't properly align writes to sector boundaries of your storage devices. ashift=12 means alignment with (2^12=) 4096B = 4k sectors.
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
a(lignment)shift=0 means ZFS doesn't properly align writes to sector boundaries of your storage devices. ashift=12 means alignment with (2^12=) 4096B = 4k sectors.

Can this be why it performs faster by chance? Is it a bad thing that I should look into fixing?
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
No. IIRC

Code:
ashift=0


means "autodetect". Usually this meant (if anything at all) inferior performance, if the auto detection failed and auto-values were too small. (IIRC ZFS felt back to 9, while most modern disks were / are 4k -> ashift=12).

But how did you evaluate the current settings?

Code:
zpool get all


shows ashift=0 over here as well (=setting), but

Code:
zdb -U /data/zfs/zpool.cache


shows ashift=12 (=actual value).
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
But how did you evaluate the current settings?

I ran

Code:
zpool get all | grep ashift


but then I ran the other command you said, and it shows ashift = 12... So I guess it's all good.
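
(On current OpenZFS the pool property can also be queried directly, and the on-disk value confirmed with zdb. The pool name is a placeholder, and the property may still read 0 when it was left on auto.)

Code:
zpool get ashift tank
zdb -U /data/zfs/zpool.cache | grep ashift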
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
So just because a drive is large, performance tanks for single read/write operations? I mean, my 8TB drives are great in RaidZ1; right now the same exact config with 14TB drives cuts that performance in half? I didn't know that just having larger drives kills performance like that.
It's not a matter of performance, it's a matter of resiliency. HDDs occasionally throw an "Unrecoverable Read Error" (URE), at a rate of about 1 per 1e14 or 1e15 bits read. 12 TB is 0.96e14 bits… That doesn't mean there WILL be a URE when reading a whole drive, but the probability is not insignificant, even for Ultrastar DC drives, which should be rated at 1 in 1e15 bits. Basically, as drives keep getting larger, the maths behind this article on raid5 now applies to mirrors as well; a 2-way mirror of large HDDs would, upon the loss of one drive, be at non-negligible risk of an unrecoverable error. Essentially, we lose one level of redundancy to UREs: Raid5/raidz1/2-way mirror cannot safely sustain the loss of any drive; raid6/raidz2/3-way mirror can safely lose one drive; raidz3/4-way mirror can safely lose two drives.
Now, ZFS is smarter than some RAID controllers and would not lose the entire pool on a URE; it would flag the affected file as irrecoverable. Depending on your service requirements and backup strategy, losing some data or having to restore the file manually may be acceptable. Just be warned that 2-way mirrors of large HDDs have a non-negligible probability of NOT handling the loss of one drive as gracefully as one would expect.
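
As a rough back-of-the-envelope illustration, assuming the quoted URE ratings and treating bit errors as independent:

Code:
Full read of a 14 TB drive: 14e12 bytes x 8 = 1.12e14 bits

At 1 URE per 1e14 bits: P(full read with no URE) ~ (1 - 1e-14)^1.12e14 ~ e^-1.12  ~ 33%
At 1 URE per 1e15 bits: P(full read with no URE) ~ e^-0.112                       ~ 89%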

If you want performance, don't use raidz, stripe mirrors—lots of mirrors. Full stop.
But if you want redundancy/resiliency (which would be expected if you go for ZFS) AND high capacity ON TOP of high performance, the costs will be staggering.
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
So I am leaning towards a RaidZ2 config again, but with a record size of 512K. 1M would work too and be a little more performant, but I see 512K as a good balance as well.

In my tests I am showing reads over 500MB/s and writes over 650MB/s. So I think it's looking like, at least with these drives, the best way to increase performance is to increase the record size.

I guess the HGST 8TB drives must have insane IOPS and work just as well as my 14TB drives with a 128K record size. Still doing more tests, so I will report back a little later.
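
(One caveat, in case the recordsize is changed on a dataset that already holds data: the property only applies to blocks written after the change, so existing files keep their old record size until they are rewritten. The dataset name below is a placeholder.)

Code:
zfs set recordsize=1M tank/media
zfs get recordsize tank/media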
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Just remember that you're testing with a "single client", so real-world performance where there may be more users/clients of your pool could produce better results with 6 than 5.
I would like to reiterate how important the "real world performance" part of this is. With parallel access (= many clients) a number of things kick in that change the performance characteristics. Spawning separate threads, in addition to their startup time etc., will also affect things like context switching and cache hit/miss ratios on the CPU. In the case of storage that is probably less of a problem than in, e.g., an application server scenario, since I/O is still the bottleneck.

In either scenario it is far from trivial to model a test such that it reflects the critical aspects of the real world. In this case I am not aware of any mention of the access protocol (NFS, iSCSI, SMB, etc.) in the thread. So whatever the requirement is, that is something that needs to be factored in. I will not disagree that the actual disk performance is the foundation and therefore deserves to be looked at in isolation. But if the workload is anything other than copying around large files, likely none of the findings will apply anyway.

My main job is roughly in the application server camp, not on the storage side. I have taken the liberty of "translating" some of the performance tuning lessons I learned there onto the storage side. The details will be different, of course, but I doubt that conceptually things will be worlds apart. I hope that helps, although I have to concede that it is not directly actionable. Happy to discuss further!
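
A rough way to approximate several concurrent clients at the fio level is to run multiple jobs in parallel; this is only an approximation of real SMB/NFS clients, and the path and parameters below are illustrative:

Code:
fio --name=multiclient --directory=/mnt/tank/temp --rw=read --bs=128k \
    --size=8G --numjobs=8 --ioengine=posixaio --iodepth=8 --group_reporting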
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
I appreciate all the help throughout this thread. I guess I should explain my workload a bit more. It is mostly media based: media creation, editing, streaming, etc. This pool will be used over SMB shares for things like photo/video editing and content movement to and from my desktops, with probably only 1, maybe 2, people using it at the same time. I have another pool on this NAS that hosts jails/applications, and I plan on leaving those datasets at the 128k recordsize. But with this pool, most files will be over 1MB, which means a 1MB recordsize makes the most sense from what I have read. I am posting the data I have collected on my main NAS server below so you can see what kind of results I got.

The tests were a synthetic test, as described in my second post of this thread with the fio command, as well as an SMB share read/write test, which is a real-world test. So basically I am testing both local and network read/write speeds.

The results are pretty clear and straightforward. Having a 512k or 1M recordsize is the only way to reach my goal of 500MB/s read/write speeds; 128k is CRIPPLING! It seems that having 7 disks provides a little more speed, at least at the local level, while SMB R/W seems the same regardless of a 6- or 7-disk RaidZ2, which means I might need to tune my network settings some. I have decided to go with the 1M recordsize. While 512k and 1M are pretty much exactly the same when it comes to SMB speeds, I am hoping that with some network tuning I can get slightly better performance to match the local disk/synthetic speeds more closely and put my 10GbE network to good use ;)

You can see two tests at the bottom as well, showing a 7-disk stripe config, which kind of shows the theoretical max, and a striped-mirror setup where I had 3 mirrors of two disks each, striped together into a 6-disk array. I didn't get a chance to test them with a 1M recordsize, but I can only assume they would have done better, although I believe the stripe synthetic write test is maxing out the disks at their stated speed of 250MB/s.

Lastly, I still have no idea how my backup server with its 7 disks in RaidZ1 performs so well at the 128K recordsize... maybe those disks just have insane IOPS. I do love the old HGST disks ;)

7-Disk RaidZ2 (speeds in MB/s):
Record size | Synthetic Read Test | Read SMB (aio 0) | Synthetic Write Test | Write SMB
1M          | 880                 | 550              | 1180                 | 740
512k        | 686                 | 500              | 905                  | 770
256k        | 373                 | 270              | 465                  | 430
128K        | 204                 | 200              | 233                  | 215

6-Disk RaidZ2 (speeds in MB/s):
Record size | Read Test | Read SMB (aio 0) | Write Test | Write SMB
1M          | 830       | 580              | 974        | 800
512k        | 620       | 500              | 864        | 780
256k        | 415       | 350              | 461        | 420
128K        | 249       | 230              | 284        | 245

Other configurations (speeds in MB/s):
Configuration        | # of Disks | Record Size | Syn Write | SMB Read
Stripe               | 7          | 128K        | 1700      | 637
Mirror (2 disks x 3) | 6          | 128K        | 733       | 330
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
It is mostly media based, media creation, editing, streaming, etc. This pool will be used over SMB shares to allow things like photo/video editing and content movement to and from my desktops and probably only 1 maybe 2 people using it at the same time
If you plan to also edit directly on the NAS there is a chance that your tests, at least as I understand them, may be irrelevant. My understanding is that you test for linear transfer of large files only. Editing and browsing the timeline will likely be a different access pattern, where IOPS play a much bigger role. So I would recommend adding those things to your tests asap.
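
A quick way to approximate that access pattern is a small-block random read job, which stresses IOPS rather than streaming bandwidth; the path and parameters below are illustrative, not a calibrated editing workload:

Code:
fio --name=randread --directory=/mnt/tank/temp --rw=randread --bs=16k \
    --size=8G --numjobs=4 --ioengine=posixaio --iodepth=16 --group_reporting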
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
If you plan to also edit directly on the NAS there is a chance that your tests, at least as I understand them, may be irrelevant. My understanding is that you test for linear transfer of large files only. Editing and browsing the timeline will likely be a different access pattern, where IOPS play a much bigger role. So I would recommend adding those things to your tests asap.

I did some small file transfer tests too, with lots of files under 1MB... not exactly the same as an editing test, but the speeds were still better than with the 128k record size. Not apples to apples for editing, like you say, so I'll have to test that as well.
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
The other interesting thing I am noticing, which I expected but is still interesting to see: small files tend to be about 1k bigger on disk than their actual size, while big files end up much smaller on disk, by about 6%, or 60MB for every 1GB.
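
One way to look at the same effect at the dataset level is to compare the logical size with the space actually charged to the dataset; the difference reflects compression plus per-record parity/padding and metadata overhead. The dataset name below is a placeholder:

Code:
zfs get used,logicalused,compression,compressratio tank/media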
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
Got one 1TB dataset transferred, mostly large video files. It looks like it's using up 965GiB on my backup server with the 128k recordsize, but on my main server with the 1M recordsize it's only using 897GiB. That's almost 70GB in saved file space! :)

The other issue I am finding is that the rsync job I created to move files from a backup dataset to a new one is painfully slow for some reason. I think I'll just do a restore through the replication task, then do an rsync locally from the restored dataset to the final dataset with the 1M recordsize. I wish the replication task would reset the record size on transfer; it's making my life a pain right now!
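
For the local copy step, a plain archive-mode rsync between the two datasets rewrites every file and therefore picks up the destination's 1M recordsize; the paths below are placeholders:

Code:
rsync -aH --info=progress2 /mnt/tank/restored/ /mnt/tank/media/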
 

Kuro Houou

Contributor
Joined
Jun 17, 2014
Messages
193
Well, just closing out this thread, I guess, at least for now. I finished my restore from backup! Everything is working great so far. Adobe Premiere is happy, Lightroom as well... everything is working perfectly with the new recordsize of 1M. It's only been a day or so, but so far I'm very happy with the results! I mean, transferring files between computers is now 2-3X faster, so I am happy! Oh, and an added benefit: I now have 3TB of extra free space!!! Yeah, that's how much space I am saving with 1M record sizes! From 34TB of used space to 31TB!! Win Win!!!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Everything is working great so far. Adobe Premiere is happy, Lightroom as well... everything is working perfectly with the new recordsize of 1M. It's only been a day or so, but so far I'm very happy with the results! I mean, transferring files between computers is now 2-3X faster, so I am happy! Oh, and an added benefit: I now have 3TB of extra free space!!! Yeah, that's how much space I am saving with 1M record sizes! From 34TB of used space to 31TB!! Win Win!!!
Thanks for sharing the feedback and all your work along the way. I really commend your patience and perseverance in getting to a result you're happy with.
 