stripe vs. rZ2 cpu load and performance examples

Status
Not open for further replies.

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
Hi All,

I could use some help understanding what's going on. I am not sure if:

1) the metrics page (Reporting > Disk) has a GUI bug and is not showing disk reads
2) the common knowledge that "striping = less CPU than RAIDZ2" is not so true
3) the common knowledge that "striping = a lot more performance than RAIDZ2" is not so true

So, the kit:

* FreeNAS-9.2.1.7-RELEASE-x64
* head unit: Dell R710 (2u rack)
* cpu: dual socket each with a 4 core L5520 @ 2.27GHz
* 48GB ECC
* perc was removed
* m105 reflashed to IT mode to act as a 9200-8i, for A/B SAS 8087 to the Dell backplane (BIOS kept so it can serve as a boot HBA)
* 8x 2.5" drive chassis
* 1x 2.5" 160GB Intel s3500 for OS/boot
* 1x 2.5" 160GB Intel s3500 for SLOG (not always used, see below)
* 6x 2.5" 500GB Crucial M4 for L2ARC (not always used, see below)
* onboard quad GigE Broadcom
* DRAC Enterprise
* dual power supplies
* Chelsio dual port 10Gbps SFP+ SR nic
* LSI 9200e with dual 8088 out the back, reflashed to IT mode, no boot bios
* dual SAS cables to a 4U Supermicro chassis, the 45-drive one (24 bays in the front and 21 in the back), SAS2 internals, dual power supplies
* 25x 3TB Hitachi SAS drives

My goal is to build an NFS filer that can receive backups at a target throughput of 10Gbps. The Chelsio may be connected active/passive to 2 different 10G switches (no LAG/LACP possible, as they are not in a stack). I'm still working on the VLAN setup, but I got the new Supermicro JBOD racked this weekend and wanted to play around with vols and configs. Here is what I found:

A) the difference in CPU load on reads or writes between a 25-disk stripe (one huge RAID0 @ 66.88TB) and four 6-drive RAIDZ2 vdevs (i.e. 4 data drives + 2 parity drives each, x4 = 24 drives @ 42.8TB) was negligible

B) the read and write performance of those same two layouts was surprisingly close

So I'm not sure if the size of the "who would build a stripe that big" stripe made it unrealistic, or what. And on the first item, I'm thinking the GUI must simply not be reporting reads: I did a dd read against a 5TB file at the end and saw no disk hits on reads (on this or any test), only writes, and since 5TB can't fit in a ~45GB ARC, it must be a GUI bug...
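(For reference, one way I could double-check whether reads are really hitting the disks, vs. an ARC/reporting artifact, would be to watch the pool from the shell while the dd read runs. This is just a sketch; burnin-r0 is the pool name from the prompts below, and the sysctl is the stock FreeBSD ARC size counter.)

# per-vdev / per-disk I/O every 5 seconds while the read test runs
zpool iostat -v burnin-r0 5

# current ARC size in bytes, to compare against the test file size
sysctl kstat.zfs.misc.arcstats.size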

Anyway, here are some results. Between each set of tests I destroyed the volumes and recreated them with the drive configs noted below, and all tests used a sparse 10TB folder rebuilt each time:

First Set of Tests - no SSDs used other than the boot SSD, all 25 Hitachis in a big stripe:

* no extra slog
* no l2arc ssds
* 25x 3TB Hitachi
* 25 drive stripe

***write speed:***


[root@freenas] /mnt/burnin-r0/testset# dd if=/dev/zero of=temp.dat bs=1024k count=500k

512000+0 records in
512000+0 records out
536870912000 bytes transferred in 223.346941 secs (2403753146 bytes/sec)
[root@freenas] /mnt/burnin-r0/testset#

2,403,753,146 / 1024 (KB) / 1024 (MB) / 1024 (GB) * 8 (Gbps) = 17.9 Gbps

***read speed:***

dd if=temp.dat of=/dev/null bs=1048576

[root@freenas] /mnt/burnin-r0/testset# dd if=temp.dat of=/dev/null bs=1048576
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 102.841927 secs (5220350565 bytes/sec)
[root@freenas] /mnt/burnin-r0/testset#

5,220,350,565 / 1024 (KB) / 1024 (MB) / 1024 (GB) * 8 (Gbps) = 38.894 Gbps

Notes: CPU at or below 30% on writes, below 10% on reads

OK, so that's fast. But it's a huge stripe, so it should be fast. Now the same test with the volume rebuilt using the s3500 for SLOG and the 6x M4s for L2ARC, but still a 25-drive stripe (a sketch of the commands for attaching these devices follows the list below):

* 1x s3500 SSD SLOG
* 6x M4 SSD L2ARC
* 25x 3TB Hitachi
* 25 drive stripe
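(For reference, attaching the log and cache devices from the shell would look roughly like this; the da numbers are placeholders, and on FreeNAS this is normally done through the GUI volume manager.)

# attach the SLOG device to the pool
zpool add burnin2-r0 log da26

# attach the six L2ARC cache devices
zpool add burnin2-r0 cache da27 da28 da29 da30 da31 da32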

***write speed:***

# dd if=/dev/zero of=temp.dat bs=1024k count=500k

[root@freenas] /mnt/burnin2-r0/testset# dd if=/dev/zero of=temp.dat bs=1024k count=500k
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 235.378452 secs (2280883859 bytes/sec)

2,280,883,859 = 16.99Gbps

^^ cpu @ 35%


***read speed:***

dd if=temp.dat of=/dev/null bs=1048576


[root@freenas] /mnt/burnin2-r0/testset# dd if=temp.dat of=/dev/null bs=1048576
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 103.162683 secs (5204119322 bytes/sec)
[root@freenas] /mnt/burnin2-r0/testset#

5,204,119,322 = 38.77Gbps

^^ < 10% cpu

OK, so slightly less CPU and a touch more speed without the SSD SLOG and SSD L2ARC. Is the dd workload so basic that it doesn't exercise the SLOG (i.e. does the SLOG require sync writes?), and maybe my file was too small to hit the L2ARC?
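(A hedged side note on exercising the SLOG: the log device is only used for synchronous writes, and a local dd is an async workload, so something like the following, run against the test dataset before the dd, would force it into play. This assumes testset is a dataset; otherwise set it on the pool root.)

# force all writes on the test dataset to be synchronous (hits the ZIL/SLOG)
zfs set sync=always burnin2-r0/testset

# revert to the default behavior afterwards
zfs set sync=standard burnin2-r0/testset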

Now on to the RAIDZ2 setup using 24 (not 25) drives: four vdevs of 6 drives each (6x4=24), each vdev with the equivalent of 4 data drives and 2 parity drives (a zpool sketch of this layout follows the config list below). Again, the first pass is without any SSD for SLOG or L2ARC:

* 24x 3TB Hitachi
* 6 drive RaidZ2 x 4
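(For reference, the equivalent zpool layout from the shell would be something like the sketch below; da0 through da23 are placeholder device names, and the FreeNAS GUI volume manager builds the same thing.)

# one pool made of four 6-drive RAIDZ2 vdevs
zpool create rZ2testvol \
  raidz2 da0  da1  da2  da3  da4  da5  \
  raidz2 da6  da7  da8  da9  da10 da11 \
  raidz2 da12 da13 da14 da15 da16 da17 \
  raidz2 da18 da19 da20 da21 da22 da23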

***write speed:***

# dd if=/dev/zero of=temp.dat bs=1024k count=500k

[root@freenas] /mnt/rZ2testvol/testdata# dd if=/dev/zero of=temp.dat bs=1024k count=500k
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 234.426093 secs (2290149981 bytes/sec)
[root@freenas] /mnt/rZ2testvol/testdata#

2,290,149,981 = 17.06Gbps

^^ CPU closer to 37%

***read speed:***

dd if=temp.dat of=/dev/null bs=1048576

[root@freenas] /mnt/rZ2testvol/testdata# dd if=temp.dat of=/dev/null bs=1048576
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 102.800031 secs (5222478117 bytes/sec)
[root@freenas] /mnt/rZ2testvol/testdata#

5,222,478,117 = 38.91Gbps

Hmm, I thought it was going to be much worse.

And now the 4x 6-drive RAIDZ2, but with SSDs for SLOG and L2ARC:


* 1x s3500 SSD SLOG
* 6x M4 SSD L2ARC
* 24x 3TB Hitachi
* 6 drive RaidZ2 x 4

***write speed:***

# dd if=/dev/zero of=temp.dat bs=1024k count=500k

[root@freenas] /mnt/rZ2testvol2/testdata# dd if=/dev/zero of=temp.dat bs=1024k count=500k
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 231.970939 secs (2314388667 bytes/sec)
[root@freenas] /mnt/rZ2testvol2/testdata#

2,314,388,667 = 17.24Gbps

***read speed:***

dd if=temp.dat of=/dev/null bs=1048576


[root@freenas] /mnt/rZ2testvol2/testdata# dd if=temp.dat of=/dev/null bs=1048576
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 102.828751 secs (5221019475 bytes/sec)
[root@freenas] /mnt/rZ2testvol2/testdata#

5,221,019,475 = 38.90Gbps

Here the performance was equivalent (not lower, unlike with the stripe) when adding the SSD SLOG and L2ARC.

Interestingly, the most significant delta I observed between the 25-drive stripe and the four 6-drive RAIDZ2 vdevs is this:

individual drive "disk i/o (da_)"
* about 125KB/s when the 25-disk stripe/RAID0 was used (regardless of SSDs or not)
* about 350KB/s when the 4x 6-drive RaidZ2 vdevs were used (regardless of ssds or not)

So the actual Hitachis seem to be doing a lot more work when the vdevs are RAIDZ2 vs. striped. Does RAIDZ2, like RAID5/6, need to read before writing as well?

Attached are pics showing the SLOG's utilization during a write test like the ones above but with a 5.3TB file, and a shot of the graph showing CPU side by side over time: RAID0/stripe on the left, RAIDZ2 on the right, both during writes (read activity in the middle).

Is dd single-threaded, and am I core-bound, so unable to show the limits of RAIDZ2 vs. stripe with 25/24 disks? Why is RAIDZ2 so close? (Not really complaining, just surprised, to be honest.)



Attachments: freenas post1.jpg, slog during 5TB write.jpg
 

enemy85

Guru
Joined
Jun 10, 2011
Messages
757
Have you disabled compression?
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
No, it's on by default so I left it on. Dedupe is not selected, though. I had time to rebuild the volume once; here are the specs and results:

(without compression: rZ2testvol3)

*** write speed on a 24 drive (4x 6-drive RAIDZ2) setup ***
[root@freenas] /mnt/rZ2testvol3/testdata# dd if=/dev/zero of=temp.dat bs=1024k count=500k
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 707.812707 secs (758492899 bytes/sec)

758,492,899 = 723.355MB/s = 5.65 Gbps

^^ cpu usage @ 60%
^^ 3TB disks now at 55MB/s on average
^^ slog ssd now at 96MB/s on avg
^^ l2arc ssd reads now at 6MB/s on avg

Looks much more realistic without compression!
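(For anyone repeating this, checking and disabling compression per dataset looks something like the commands below; the dataset name here is taken from the prompt above.)

# see what compression is currently set to
zfs get compression rZ2testvol3/testdata

# turn it off for the test dataset
zfs set compression=off rZ2testvol3/testdata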
 

enemy85

Guru
Joined
Jun 10, 2011
Messages
757
That's why I asked you about it! :)
I'm just curious to see the results of the stripe with compression disabled...
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
I am as well (curious about the stripe results without compression). I had time for only 1 test earlier:

(without compression: r0testvol4)

*** write speed on a 25 drive stripe setup***

dd if=/dev/zero of=temp.dat bs=1024k count=500k
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 514.835418 secs (1042801045 bytes/sec)
[root@freenas] /mnt/r0testvol4/testdata#

1,042,801,045 = 994.49MB/s = 7.77Gbps

** 3TB disks at 60MB/s on avg
** cpu @ 80% at first then down to 33%
** slog @ 152MB/s on avg
** l2arc @ 6MB/s


*** read speed***

I ran: dd if=temp.dat of=/dev/null bs=1024k

Then I went and did other things. When I came back, the GUI was unresponsive and ssh hung up. I used the DRAC card for remote console and was surprised to see it had rebooted (no power issues, this is in a datacenter). So I checked the screen, and after the POST/NIC/other BIOS output there is a single line:

"This is a NAS data disk and can not boot system. system halted."

So I checked the 2 LSI cards:
1) the 8i has a higher boot priority than the 8e (it is set to boot first)
2) the 8i sees the s3500 SSD boot drive without issue (all 8 drives are recognized fine)

So I rebooted again; same result.
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
Oddly, I could install OI, SmartOS, or Nexenta on this drive, but not FreeNAS; same message as above.

So I'm trying to install FreeNAS on the other 160GB s3500. So far so good; I will retry the last test round.
 
J

jpaetzel

Guest
You are not hitting GUI bugs. ZFS has special handling for writing zeros. The disks aren't actually doing any I/O.

As a general rule of thumb ZFS can write at about 2.5GB/sec max and read at 4GB/sec max...and that's with an all flash pool. Spinning disk is slower. If you're seeing results much above that you are testing memory performance. (Also, bits/sec numbers are what networking people use. Storage types tend to find bit/sec numbers hard to deal with!)
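(If you do stick with dd, one way around the zero handling is to test with data that can't be compressed or short-circuited, for example by building an incompressible source file first and reusing it for the timed runs. A rough sketch, with the path as a placeholder; note that /dev/random itself can easily become the bottleneck, which is why you only build the file once:)

# build a ~32GB incompressible file once, then copy/read it for the timed tests
dd if=/dev/random of=/mnt/yourpool/random.src bs=1024k count=32k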

I wrote a fairly comprehensive post about testing ZFS with iozone (a tool included in FreeNAS). dd can hit all sorts of bottlenecks that have nothing to do with your storage subsystem.
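(A hedged example of the sort of iozone run that sidesteps those bottlenecks: several threads and a working set well past the 48GB of RAM. These are iozone's standard flags, not a quote from that post:)

# throughput mode, 4 threads, 64GB file per thread, 128k records, sequential write (-i 0) and read (-i 1)
iozone -t 4 -s 64g -r 128k -i 0 -i 1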

As far as the boot issues...the message you are seeing is due to the system trying to boot off one of the data drives, not the drive that has the FreeNAS install on it. The fact that you can boot other operating systems leads me to believe that the drive is good, although if the drive is good why you are having boot order issues is a bit of a mystery. I'm not familiar with the Dell platform you are using, but in general trying to boot off HBAs with large numbers of drives on them can be somewhat tricky. It's generally easier to just boot off a USB device.

As far as the spontaneous reboot, check the DRAC to see if it logged any hardware issues. Otherwise check in /data/crash to see if FreeNAS kernel paniced. If there's a crash file there go ahead and attach it to this thread. There are a lot of reasons why an x86 box will pull the ripcord and reset the system, some hardware, some software.
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
Thank you for the reply.

about: "but in general trying to boot off HBAs with large numbers of drives on them can be somewhat tricky. " the HBA with the boot drive has only 8 drives on it (9200-8i) but understood. right the question is why can other flavors boot off that drive but not freenas. i can boot of the neighboring identical 160GB intell s3500 without issue.

Interestingly, using this new install I did another stripe write test (to build a file to read) and the test hangs almost immediately (but at least the host doesn't reboot). I am using the suspect s3500 as my SLOG device; not sure if that's a factor. More to follow; we're testing the drive to see whether it looks good or not. It is still seen as a log device in my device/drives listing, so it's not offline.
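(A couple of hedged sanity checks on the suspect SSD from the shell, with the pool name assumed from the earlier tests and the device node a placeholder:)

# pool/device state and error counters, including the log device
zpool status -v r0testvol4

# SMART health of the SSD itself (adjust the device node)
smartctl -a /dev/da1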
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
OK, following up on:

"i am as well (curious about stripe results without compression). I had time for only 1 test earlier:
((without compression: r0testvol4)"

To recap, this is:
(internal to the 2U server head)
boot on 160GB s3500 ssd
slog on a different 160GB s3500 ssd
l2arc on 6x 500GB m4 mlc ssd (crucial)
(external via 9200-8e in supermicro jbod)
25 drive stripe (hitachi 3TB sas drives)

*** write speed on a 25 drive stripe setup***

[root] /mnt/r0testvol4/testdata# dd if=/dev/zero of=temp.dat bs=1024k count=500k
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 436.468935 secs (1230032355 bytes/sec)
[root] /mnt/r0testvol4/testdata#

1,230,032,355 / 1024 (KB) / 1024 (MB) = 1173.05 MB/s = 9.16 Gbps


** 3TB disks write at 55-60MB/s on avg
** cpu @ 35%
** slog write @ 80MB-130MB/s on avg
** l2arc read @ 6MB/s


*** read speed***

[root] /mnt/r0testvol4/testdata# dd if=temp.dat of=/dev/null bs=1024k
512000+0 records in
512000+0 records out
536870912000 bytes transferred in 471.120457 secs (1139561876 bytes/sec)
[root] /mnt/r0testvol4/testdata#

1,139,561,876 / 1024 (KB) / 1024 (MB) = 1086.77 MB/s = 8.49 Gbps


** 3TB disks read at 40-45MB/s on avg
** cpu @ 15-20%
** slog write @ 0MB/s on avg
** l2arc read @ 6-7MB/s

So: writing faster than reading?

...and no hangups. Will try the other testing methods next.
 

doesnotcompute

Dabbler
Joined
Jul 28, 2014
Messages
18
...

As far as the boot issues...the message you are seeing is due to the system trying to boot off one of the data drives, not the drive that has the FreeNAS install on it. The fact that you can boot other operating systems leads me to believe that the drive is good, although if the drive is good why you are having boot order issues is a bit of a mystery. I'm not familiar with the Dell platform you are using, but in general trying to boot off HBAs with large numbers of drives on them can be somewhat tricky. It's generally easier to just boot off a USB device.

As far as the spontaneous reboot, check the DRAC to see if it logged any hardware issues. Otherwise check in /data/crash to see if FreeNAS kernel paniced. If there's a crash file there go ahead and attach it to this thread. There are a lot of reasons why an x86 box will pull the ripcord and reset the system, some hardware, some software.

So it happened again, with a different SSD, after it had been working for a day, even though it's an 8i with only 8 drives on it.

So I reinstalled the OS on a USB drive and made it my first boot device. Now I'm running several app-level replication jobs against the 2 NFS shares... and I notice a gap in the network/disk Reporting graphs. Checking uptime, it reports about 7 hours (it's now 16:45 in NYC), so it rebooted around 10am. There is no 'crash' directory in the /data directory. I'm running FreeNAS-9.2.1.7-RELEASE-x64.
 
J

jpaetzel

Guest
Check the logs in the DRAC. Spontaneous reboots are often a hardware issue.
 