New NAS build ZIL drive 100% used poor performance

Status
Not open for further replies.

robertmehrer

Dabbler
Joined
Sep 25, 2018
Messages
35
I just built a new NAS for seeding large amounts of data quickly.

Specs are:

FreeNAS-11.2-BETA3
Dell R510
98251MB of RAM
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz 16 Cores
12 x 12TB Ironwolf Pro Drives 250MBps https://www.seagate.com/internal-hard-drives/hdd/ironwolf/
Evo 960 1TB NVME ZIL
Evo 960 1TB NVME L2ARC
(I know this is a no-no) PERC H700 with 12 RAID 0 disks presented to FreeNAS and set up as a stripe in FreeNAS
10GB Network Connection
No Compression
No Dedupe

iSCSI connection to a Windows machine as a mount point for Commvault seeding.

For whatever reason I cannot get the drives to write at more than 40MBps. The ZIL drive nvd0 is showing 100% busy all the time, the L2ARC is not being used at all, and the network traffic is erratic. I can burst to around 6Gbps on the 10GB NIC, but I can't get a constant write rate to this NAS. Running IOMeter, I can get around 1033Mbps to the NAS, but not sustained.

I have another NAS with 36 Constellation ES drives that are 150MBps and two 500GB VelociRaptors set up as an L2ARC, and I can get a constant 3GBps to that NAS...

I can't figure out why this new NAS with newer and faster hardware can't sustain writes and is erratic... or why the ZIL drive is 100% busy with so much pending I/O...
 

Attachments

  • disk operations.PNG
  • L2ARC-ZIL.PNG
  • latency.PNG
  • Spinning.PNG
Last edited:

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
First, toss the POS H700 in the trash. Second, if you absolutely insist on using it, disable any and all read-ahead and caching.
ZIL (in your case, because it's a device separate from the pool, it's a SLOG): this is only used for sync writes and will never give you more throughput than the backing pool. Meaning if the pool is only used for writing and "seeding" large files, it's almost pointless. Also, a 1TB drive for the SLOG is like using a dump truck to move a handful of rocks... You will NEVER use more than 32GB, and likely never more than 12.5GB (max ingest rate per second times 5 seconds, and I'm guessing you have a dual-port 10GbE card).
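A back-of-the-envelope sketch of that sizing, assuming a dual-port 10GbE card and roughly 5 seconds of dirty data in flight:

Code:
# Rough upper bound on SLOG usage (assumptions: 2 x 10GbE ports, ~5s of in-flight writes)
# 10Gbps is roughly 1.25GB/s per port
echo "2 * 1.25 * 5" | bc    # ~12.5 GB worst case; the rest of a 1TB drive is never touched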
L2ARC: this again may do little to nothing for large sequential reads. Its impact on performance also depends on the record size of your datasets.
Compression may be advantageous depending on the data being stored.
Commvault looks like backup software....
When you say a stripe, do you mean like RAID0?!?! I hope you're not storing backups on a pure stripe pool! That's just insane!
can't figure out why this new NAS with newer and faster hardware can't sustain writes and is erratic...
Perc H700...
You literally have one of the least safe setups of this size that I have EVER seen.

Take a look at my sig: I can sustain 1.5GB per second reads and easily sustain 180MB/s writes, all with no SLOG or L2ARC.
 
Joined
Feb 2, 2016
Messages
574
I don't know where to start.

* IOPS: you don't have any.

* Perc H700: ugh. For under $100, you can replace that with a ZFS-recommended HBA.

* SSD SLOG may be helpful for bursty writes. If you're streaming large writes or regularly exceeding the SLOG's cache limit (hint: it'll never use anywhere close to the 1TB you've presented), you're still write-limited by the underlying conventional drives. Long story short, in your use case, SLOG isn't helping and may be hurting.

* No compression is often worse than lz4 compression for performance. The less data moving to and from the drives, the faster. Even when...

* The E5620 was released in 2009/2010? I'm a fan of retro hardware and generally suggest lower CPU power for FreeNAS but you may be the exception.

I don't have any experience with Commvault but my guess is that it does a lot of random reads and random writes. That requires IOPS and you have, probably, 200 of them at the most. Which ain't a lot.

Cheers,
Matt
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm going to be blunt, you've made a number of errors, some trivial, some critical.

No Compression

Unless you're 100% positive that your data is completely incompressible, use LZ4. This can be set after the fact but will only impact new writes. zfs set compression=lz4 PoolName
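For example, assuming the pool is named Seed (as it is in the zpool status posted later in this thread), that would be:

Code:
# Set LZ4 on the pool and check the resulting ratio as new data lands
zfs set compression=lz4 Seed
zfs get compression,compressratio Seed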

Evo 960 1TB NVME ZIL

Poor SLOG choice. No power loss prevention, low-endurance TLC NAND, 1TB is way too large unless you've created a small partition to allow for significant wear-leveling, and what do you need an SLOG for if you're focused on read/seed performance? edit: OP might be talking about seeding a cloud backup solution - http://documentation.commvault.com/commvault/v11/article?p=9274.htm - in this case, the array is ingesting a huge amount of sequential writes and will then physically be shipped to a datacenter

(I know this is a no-no) PERC H700 with 12 RAID 0 disks presented to FreeNAS and set up as a stripe in FreeNAS

This is very likely your problem and needs to be addressed right away. The PERC is probably trying to cache writes to all of the individual RAID0s at the controller level while having disabled the drive cache, and it's just getting absolutely hammered.
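If you want to watch it happen while a transfer runs, the FreeNAS shell has the usual FreeBSD tools; something like the following will show per-device busy% and per-vdev throughput (the pool name Seed comes from the zpool status posted later; the 5-second interval is arbitrary):

Code:
# Per-device busy% and latency (physical providers only)
gstat -p
# Per-vdev throughput for the pool, refreshed every 5 seconds
zpool iostat -v Seed 5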

I'll wager that your performance skyrockets if you replace this with a proper HBA; with 2x NVMe SSD and a 10Gbps NIC I imagine you don't have a free slot for it now (the "storage slot" won't accept an HBA with non-Dell firmware) - edit: that's the R710, not the R510, my mistake

I'm betting you could ditch the SLOG as you likely don't need sync writes to this dataset since if I'm reading correctly you've set up a twelve drive stripe.

Russian Roulette is supposed to be played with one chamber full, not one chamber empty.
 
Last edited:

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
(the "storage slot" won't accept an HBA with non-Dell firmware)
It does on my NX3100 (AKA R510). I'm currently using an IBM LSI 2008-based card.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Can we feature this on the "how not to FreeNAS" page? :D
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It's backup software. LONG writes and lots of random reads (depending on the restores).
I think the OP is talking about the process of "seeding" the data into a cloud backup solution via Commvault:

http://documentation.commvault.com/commvault/v11/article?p=9274.htm

If you're going to go through the trouble of filling an appliance with data and physically shipping it off somewhere, you'd better use a solution with at least a modicum of redundancy.

@robertmehrer - can you do a zpool status and post the result inside of CODE tags here so we can verify this?

It does on my NX3100 (AKA R510). I'm currently using an IBM LSI 2008-based card.

Right, R710 complains - R510 is okay with it.
 

robertmehrer

Dabbler
Joined
Sep 25, 2018
Messages
35
I'm going to be blunt, you've made a number of errors, some trivial, some critical.



Unless you're 100% positive that your data is completely incompressible, use LZ4. This can be set after the fact but will only impact new writes. zfs set compression=lz4 PoolName



Poor SLOG choice. No power loss prevention, low-endurance TLC NAND, 1TB is way too large unless you've created a small partition to allow for significant wear-leveling, and what do you need an SLOG for if you're focused on read/seed performance?



This is very likely your problem and needs to be addressed right away. The PERC is probably trying to cache writes to all of the individual RAID0s at the controller level while having disabled the drive cache, and it's just getting absolutely hammered.

I'll wager that your performance skyrockets if you replace this with a proper HBA; with 2x NVMe SSD and a 10Gbps NIC I imagine you don't have a free slot for it now (the "storage slot" won't accept an HBA with non-Dell firmware) - but I'm betting you could ditch the SLOG as you likely don't need sync writes to this dataset since if I'm reading correctly you've set up a twelve drive stripe.

Russian Roulette is supposed to be played with one chamber full, not one chamber empty.

We needed to be able to write 60+ TB to the device fast and offload that much just as quickly. The issue is we seed large amounts of government data, and the change rate is very high with body-cam footage, etc. So the faster we can move the data, the smaller the deltas.

We write to the device from a large 3PAR array and write to a very large 3PAR array...

Yeah, it's mostly audio and video body-cam footage and interview data.

Which drive should we use for the ZIL/SLOG? Reviews I read were saying the Evo 960 was good for that.

I'm currently trying to find the easiest drop-in HBA controller, and it's looking like the H200 is a better replacement for the H700? Can you confirm this?

ANY help would be greatly appreciated.
 

robertmehrer

Dabbler
Joined
Sep 25, 2018
Messages
35
I think the OP is talking about the process of "seeding" the data into a cloud backup solution via Commvault:

http://documentation.commvault.com/commvault/v11/article?p=9274.htm

If you're going to go through the trouble of filling an appliance with data and physically shipping it off somewhere, you'd better use a solution with at least a modicum of redundancy.

@robertmehrer - can you do a zpool status and post the result inside of CODE tags here so we can verify this?



Right, R710 complains - R510 is okay with it.


We are not seeding to the cloud, we are the cloud. We are a large offsite storage provider. The clients' Metro-E connections can't sustain the delta rates for transferring the data over the wire for initial seeds, so we use smaller NASes to move the data physically and then kick off deltas over the wire. It just happens that this server is 60+ TB with mount points. Commvault just seeds the backup job to a disk in Windows, and we move that disk to a media agent on our side and let Commvault ingest it into our SANs.

Code:
config:

        NAME                                          STATE     READ WRITE CKSUM
        Seed                                          ONLINE       0     0     0
          gptid/077f808a-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/081c32e9-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/08b2c59b-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/09564426-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0a11fb47-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0ac4d74b-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0b80c2ae-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0c4d8b1b-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0d093309-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0ddbe011-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0ea3a2ab-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0f8810a4-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
        logs
          gptid/d44f9ab6-bdaf-11e8-9b9b-842b2b597411  ONLINE       0     0     0
        cache
          gptid/e53ca929-bdaf-11e8-9b9b-842b2b597411  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mfid0p2     ONLINE       0     0     0

errors: No known data errors
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I think the OP is talking about the process of "seeding" the data into a cloud backup solution via Commvault:
So he's going to ship the whole server? I don't know if the H700 stores any metadata on the disks that would cause issues in another server.
We write to the device from a large 3PAR array and write to a very large 3PAR array...
I'm sorry to hear that.
Which Drive should we use for the ZIL/SLOG ? Reviews i read were saying the Evo 960 was good for that.
None. The L2ARC cache is reset on reboot; if the server is shut down after writing or before reading, it won't do anything.
I'm currently trying to find the easiest drop in HBA controller and its looking the H200 is a better replacement for the H700? Can you confirm this?
Only if you crossflash it with the LSI firmware. Search for the guides.
If you're running the pool as a giant stripe and using a non-PLP (power-loss protected) SLOG, there's no point in having a SLOG. Just set the dataset to sync=disabled and watch your write speed skyrocket.
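A minimal sketch of that, assuming the dataset or zvol backing the iSCSI extent lives under the Seed pool (the name below is a placeholder):

Code:
# "yourzvol" is a placeholder - use the zvol/dataset backing your iSCSI extent
zfs set sync=disabled Seed/yourzvol
zfs get sync Seed/yourzvol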
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
We needed to be able to write 60+ TB to the device fast and offload that much just as quickly. The issue is we seed large amounts of government data, and the change rate is very high with body-cam footage, etc. So the faster we can move the data, the smaller the deltas.

We write to the device from a large 3PAR array and write to a very large 3PAR array...

Huge sequential transfers mean you want to gun for a RAIDZ2 or similar setup; this will use two drives for parity. Edit: you are the cloud. Neat. Definitely use RAIDZ2, even if it's an "intermediary storage point", because the last thing you want is to corrupt the data and have to physically grab it from the source again.
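From the command line, the rebuild would look roughly like the sketch below; this is illustrative only: the da0..da11 device names are assumptions, the existing pool has to be destroyed first, and the FreeNAS GUI is the supported way to actually do it.

Code:
# Illustrative only - da0..da11 are assumed device names behind a proper HBA
zpool create Seed raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11
zpool status Seed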

Yea its mostly Audio and Video Body cam footage and interview data.

If you're storing that kind of data, you need to be using some kind of parity. Use RAIDZ2 inside of FreeNAS. Compression might not actually be needed here unless your capture hardware (body cameras, etc.) uses a particularly lazy codec; if they're H.264 or similar, then the footage is already compressed enough.

Which Drive should we use for the ZIL/SLOG ? Reviews i read were saying the Evo 960 was good for that.

Given your use case, you're loading the array with huge sequential reads and writes, so the benefits of both SLOG and L2ARC are pretty much negated entirely here - you're not doing small random writes that need to be responded to quickly, and you're never reading the same data more than once.

Set recordsize=1M, remove the SLOG and L2ARC devices entirely, do sync=disabled if necessary (but iSCSI's defaults for Windows should be fine) and watch it fly.
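A sketch of those steps against the pool as posted above (the log/cache gptids are the ones from that zpool status; recordsize applies to filesystem datasets, while a zvol's volblocksize is fixed at creation):

Code:
# Remove the SLOG (log) and L2ARC (cache) devices shown in the zpool status
zpool remove Seed gptid/d44f9ab6-bdaf-11e8-9b9b-842b2b597411
zpool remove Seed gptid/e53ca929-bdaf-11e8-9b9b-842b2b597411
# For a filesystem dataset; zvols use volblocksize, set at creation
zfs set recordsize=1M Seed
# Only if needed - the iSCSI defaults from Windows should be fine
zfs set sync=disabled Seed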

I'm currently trying to find the easiest drop-in HBA controller, and it's looking like the H200 is a better replacement for the H700? Can you confirm this?

Confirmed, but as mentioned you will want to crossflash to the official LSI firmware, there are guides available on this forum for the steps necessary. Changing the controller will necessitate destruction and recreation of the pool.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
We are not seeding to the cloud, we are the cloud. We are a large offsite storage provider. The clients' Metro-E connections can't sustain the delta rates for transferring the data over the wire for initial seeds, so we use smaller NASes to move the data physically and then kick off deltas over the wire. It just happens that this server is 60+ TB with mount points. Commvault just seeds the backup job to a disk in Windows, and we move that disk to a media agent on our side and let Commvault ingest it into our SANs.

Code:
config:

        NAME                                          STATE     READ WRITE CKSUM
        Seed                                          ONLINE       0     0     0
          gptid/077f808a-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/081c32e9-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/08b2c59b-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/09564426-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0a11fb47-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0ac4d74b-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0b80c2ae-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0c4d8b1b-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0d093309-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0ddbe011-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0ea3a2ab-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
          gptid/0f8810a4-bda9-11e8-a473-842b2b597411  ONLINE       0     0     0
        logs
          gptid/d44f9ab6-bdaf-11e8-9b9b-842b2b597411  ONLINE       0     0     0
        cache
          gptid/e53ca929-bdaf-11e8-9b9b-842b2b597411  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mfid0p2     ONLINE       0     0     0

errors: No known data errors
Yep. You just confirmed that you have no use for the L2ARC. Also, if you care about the clients' data at all (lawsuit, anyone?), you will want at the MINIMUM (and this is still stupid) a RAIDZ1. After all, it doesn't matter how fast you can move corrupted data. As for the SLOG, again, this write-once-and-ship process suggests you would be better off without it and setting the dataset to sync=disabled.
 

robertmehrer

Dabbler
Joined
Sep 25, 2018
Messages
35
I don't know where to start.

* IOPS: you don't have any.

* Perc H700: ugh. For under $100, you can replace that with a ZFS-recommended HBA.

* SSD SLOG may be helpful for bursty writes. If you're streaming large writes or regularly exceeding the SLOG's cache limit (hint: it'll never use anywhere close to the 1TB you've presented), you're still write-limited by the underlying conventional drives. Long story short, in your use case, SLOG isn't helping and may be hurting.

* No compression is often worse than lz4 compression for performance. The less data moving to and from the drives, the faster. Even when...

* The E5620 was released in 2009/2010? I'm a fan of retro hardware and generally suggest lower CPU power for FreeNAS but you may be the exception.

I don't have any experience with Commvault but my guess is that it does a lot of random reads and random writes. That requires IOPS and you have, probably, 200 of them at the most. Which ain't a lot.

Cheers,
Matt


Why don't you give solutions as well as badgering...

This was thrown together with gear we had lying around to try to solve a problem quickly. Commvault isn't actually writing the backup data; it's just copying the backup to a second location as a seed. Like Veeam does, it's just an aux copy of the already-created backup.

Which HBA do you recommend then?

The underlying drives have a max of 250MBps each, and with 12 drives in a stripe... we should be seeing something in the thousands of MBps range...

From the attached screenshot, you can see the NAS is capable of performing; it just won't sustain these speeds.
 

Attachments

  • seedTest.png

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'd also suggest that if file-level storage (SMB) rather than block-level (iSCSI) is an option, you might get better results from ZFS prefetching on your reads, as well as being able to take advantage of larger record sizes (the FreeNAS default for zvols is 16K).
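If you stay on iSCSI, the block size is baked in when the zvol is created, so going bigger means making a new zvol; a minimal sketch (the name, size, and 128K value below are all assumptions, and -s makes it sparse):

Code:
# volblocksize is fixed at zvol creation time; name, size, and value here are placeholders
zfs create -s -V 50T -o volblocksize=128K Seed/seedvol
zfs get volblocksize Seed/seedvol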
 

robertmehrer

Dabbler
Joined
Sep 25, 2018
Messages
35
Huge sequential transfers mean you want to gun for a RAIDZ2 or similar setup; this will use two drives for parity. Edit: you are the cloud. Neat. Definitely use RAIDZ2, even if it's an "intermediary storage point", because the last thing you want is to corrupt the data and have to physically grab it from the source again.



If you're storing that kind of data, you need to be using some kind of parity. Use RAIDZ2 inside of FreeNAS. Compression might not actually be needed here unless your capture hardware (body cameras, etc.) uses a particularly lazy codec; if they're H.264 or similar, then the footage is already compressed enough.



Given your use case, you're loading the array with huge sequential reads and writes, so the benefits of both SLOG and L2ARC are pretty much negated entirely here - you're not doing small random writes that need to be responded to quickly, and you're never reading the same data more than once.

Set recordsize=1M, remove the SLOG and L2ARC devices entirely, do sync=disabled if necessary (but iSCSI's defaults for Windows should be fine) and watch it fly.



Confirmed, but as mentioned you will want to crossflash to the official LSI firmware, there are guides available on this forum for the steps necessary. Changing the controller will necessitate destruction and recreation of the pool.


AWESOME info, thank you!

Keep in mind this server will be used for all kinds of seeding, so should we ditch the caches altogether?

How do I set those flags?
Set recordsize=1M, remove the SLOG and L2ARC devices entirely, do sync=disabled if necessary (but iSCSI's defaults for Windows should be fine) and watch it fly.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Evo 960 1TB NVME ZIL
Evo 960 1TB NVME L2ARC
Just box those up and send them to me, you don't need them... Instead of wasting money like this, you should have come and asked for advice first.
(I know this is a no-no)
Then why did you do it?
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Part of the problem. Probably memory bandwidth. That is an old / slow processor, LGA 1366, for trying to do 10Gb networking.
10GB Network Connection
What is the network card?
why the ZIL drive is 100% busy with so much pending I/O...
I/O is pending because the system can't access the disks properly because you decided to use a hardware RAID controller. Bad system design.
 
Last edited by a moderator:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Why don't you give solutions as well as badgering...

Indeed. Badgering is my job, folks, it's in my username. ;)

To be fair, OP, you did yourself admit to one of the biggest no-no's of ZFS (using a HW RAID controller) so some flak should be expected - but people don't need to be rude about it.

AWESOME info, thank you!

Keep in mind this server will be used for all kinds of seeding, so should we ditch the caches altogether?

If it's all the same workload of "huge sequential write, physically move the box, huge sequential reads to send it to the final resting place" then yes, ditch the caches altogether. You don't read the same data multiple times (L2ARC benefit) and you aren't writing small blocks and demanding a quick ACK back that it was completed (SLOG benefit) so pull them (and take them home for your gaming machine!)

How do I set those flags?
Set recordsize=1M, remove the SLOG and L2ARC devices entirely, do sync=disabled if necessary (but iSCSI's defaults for Windows should be fine) and watch it fly.

In your case, if you're sharing ZVOLs via iSCSI, the recordsize value (actually volblocksize) can't be changed after it's initially set. You'd have to make a new zvol/extent/etc and check the "Advanced" options when you make the ZVOL. But see below instead.

Is connecting this server via SMB an option? It's probably better overall as far as performance/tunability.

You'd do zfs set recordsize=1M Poolname/Dataset as well as zfs set sync=disabled Poolname/Dataset to get the two of those, although SMB by default does async writes under Windows so the latter shouldn't be necessary.
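For instance, a fresh dataset for the SMB share could be created with those properties up front (the Seed/seeding name is just a placeholder):

Code:
# Hypothetical dataset for the SMB share - substitute your own name
zfs create -o recordsize=1M -o compression=lz4 Seed/seeding
zfs get recordsize,compression,sync Seed/seeding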

All of this advice is, of course, assuming that you take that H700 out and replace it with an H200, flashed with the appropriate LSI IT firmware:

https://techmattr.wordpress.com/201...-flashing-to-it-mode-dell-perc-h200-and-h310/
 