Large performance differences writing the same data to (nearly) identical machines.

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Not only are some writes slower to the (supposedly better?) machine ... but when they do run slower:
They don't show the prototypical TrueNAS speed shuffle that's typical of performance constrained by hard-drive limits.

TrueNAS Configs:

Ver: TrueNAS Core 12.0 u7
Vol: RAIDz2 (8 devs)
Cfg: Dedupe (never enabled)
Cfg: Dataset type: SMB
I believe all settings are identical


Common to Both TN Servers:

Dell PowerEdge T320 Server
CPU: (1) E5-2430v2 6c, (HT) - 2.5GHz | 3.3GHz (Turbo)
RAM: (6) 8GB PC3-10600R | 48GB 1333MHz
HDD: (8) HGST UltraStar SAS 7,200rpm (not shingled)
NIC: (1) SFP+ 10Gb Network card
Testing Client:

16" MBPr i9, 32GB, 4TB PCIe SSD
OS : MacOS Monterey
NIC: Sonnet - SFP+ (10GbE) to TB3
(unlikely bottleneck - Sonnet SFP+ has been working excellently)


(I believe every difference between the two systems is an upgrade on the 80TB unit's side.)

                               80TB Unit (Slower Writes + Faster Reads)   | 32TB Unit (Slower Reads + Faster Writes)
Installed RAM:                 48 GB 1333MHz ECC                          | 48 GB 1333MHz ECC
SLOG / Intent Log:             Radian RMS-200 (8GB)                       | None
HDDs (8x HGST 7.2K UltraStar): 10TB IBM (IBM-ESXS HUH721010AL4200)        | 4TB HGST (HUS726040AL4214)
SAS Controller:                Dell H200 HBA (SAS 2)                      | LSI 9200i SAS card

Some transfers (on both) drop down to the KB/sec range (which is even less than 6 drives' worth of minimum IOPS would suggest -- is that still an IOPS limit?).
(That test was the only instance in which I'm speaking of files that were apps, small files, etc.)



Attempts to Remedy Performance Disparities:

Recreated the pool with a 1M recordsize -- (no help).
Ensured all settings were identical when recreating Pool.


All tests Performed:
• Via same network
• From same source
• Of same data
• To (near) identical machines:

Writing same Data to both:
• to 32TB ( –SLOG) zVol: 350 - 550 MB/sec
• to 80TB ( +SLOG) zVol: 171 - 181 MB/sec

Reading same Data from both:
• from 32TB ( –SLOG) zVol: 400 - 600 MB/sec
• from 80TB ( +SLOG) zVol: 550 - 800 MB/sec


Are these HBAs (Dell H200 vs LSI 9208i) roughly equivalent performers?


Testing, Day 2:

Writes of all large video files to the 32 TB hover at ~95 MB/s (±5 MB/s)
Writes of all large video files to the 80 TB hover at ~165 MB/s (±5 MB/s)
(both are from the same source, but the transfers were run one at a time, not concurrently)
Reports Dashboard:
Not working on either system
(though the Dashboard still provides some info on both?)
Additional Steps / Tests:
Install TrueNAS Scale (via SATADOM, to test differences between Core and Scale)
All datasets were configured (or reconfigured) with the SMB share type and shared over SMB...
This choice was made because I'd been told by a mod that "SMB is a superior / faster protocol" (perhaps specifically in TrueNAS / FreeNAS); is that still true?

Thanks, and I hope I've provided adequate info.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Did you by some chance explicitly disable the sync setting on either the pool or the SMB server settings on the 32TB machine?

macOS does use sync writes over SMB, so throughput would be limited by the SLOG device's speed.
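
If you'd rather check from the shell than the GUI, a quick sketch (pool/dataset names are placeholders for yours):

# Check the effective sync setting per dataset ("tank" is a placeholder pool name):
zfs get -r sync tank
# "standard" honours client sync requests (macOS over SMB will wait on the SLOG),
# "disabled" acknowledges writes from RAM only, "always" forces sync for everything.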
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Actually, today I'm trying that, and it has improved the average results drastically... with some transfers to the 32TB machine landing between

- 200 MB/sec and 600 MB/sec ... while others have been between
- 520 MB/sec and 750 MB/sec (not bad at all).

Interesting info re: macOS + SLOG interaction.
Is that because macOS split off from FreeBSD around 2004 (kernel-wise) and hasn't integrated FreeBSD kernel updates since?
(sorry if I butchered that question)

Any workarounds for the macOS + SLOG limitations? That was actually a question I was already thinking to ask, as, upon disabling sync:
- the 80TB initially had faster performance with the SLOG (though it was only 1 test of ~40GB) ... though:
- the 32TB unit's peaks were lower (as is generally the case), it had a faster average time, hitting fewer performance 'air bubbles.'
(I can re-test that a few more times -- of course with the exact same data to both -- to confirm it, if you feel it's worth it) ...

Though with Sync disabled, the performance has (thus far) been quite satisfying.
How long is Data "vulnerable" for when Sync is disabled..?
Is it a finite (known) period..?
Vary based on transfer size?
Is there some articulable policy I can give, like, "if the system is running normally for 2 min after a transfer, that data is safe." ..?

Also, I've read a SLOG introduces a "vulnerability" of up to 5 sec. Is that correct ..? Though, I know I have the capacitors on the radian SLOG.

Thanks

PS:

Before I create another post re: my questions about creating a video-centric ZFS machine...

Can I "trick" TrueNAS to keep data smaller than an L2ARC in the L2ARC (if not the ARC) by transferring it ~5x in a row or something?
Say an editor has a 200GB project and less than 200GB free (and won't be using proxies for some reason), thus, using ZFS as a scratch disk.
Would copying the project data from the array ~5-10x (and copying no other data) load that data to the (Optane) L2ARC?
I picked up a couple p5800x -- but may instead grab a couple Optane 900p for this ...
Even 4K video doesn't need more than 400MB/s ... which technically the spinning Pool would do, but the latency would be a big issue.
This of course is using the SFP+ interface via Thunderbolt 3 ... which makes me also wonder if a DAC is okay, or if that'd warrant actual fiber.

Alternate ideas involved creating a small + riskier ( single parity / z1 ) NVMe array that was sync'd to a neighboring RAIDz2 array.
If not locally (PCIe limits), physically next to each other, limiting risk to the time required to get the NVMe partition synced with the z2 array.
In case the L2ARC were too temperamental, using the NVMe array as a "scratch Dataset" or as a Tiered Pool to make ingest limited by NVMe:
Which then syncs to the z2 array. I just need to find a sync protocol which syncs [data] but not [deletes] (incremental, not differential).

Use the NVMe array as a project repo or a (temp) staging spot which is synchronized with the main array while minimizing attendance of ingest.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Interesting info re: macOS + SLOG interaction.
Is that because macOS split off from FreeBSD around 2004 (kernel-wise) and hasn't integrated FreeBSD kernel updates since?
(sorry if I butchered that question)

Any workarounds for the macOS + SLOG limitations?

Not entirely sure why it is that MacOS defaults to leveraging the SMB sync behaviour. To disable it without disabling for the pool, go to SMB auxiliary settings (Services -> SMB -> Little Wrench Icon) and set strict sync = no

@anodos may have some additional information to provide re: SMB behavior and performance. The output of testparm -s in a CODE block comment may be helpful here as well.
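
For reference, a minimal sketch of that auxiliary parameter and a way to confirm it took effect (the grep filter is just for readability):

# Auxiliary parameter to add under Services -> SMB (the little wrench icon):
strict sync = no

# Then, from the TrueNAS shell, confirm the running config picked it up:
testparm -s | grep -i "strict sync"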

Though with Sync disabled, the performance has (thus far) been quite satisfying.
How long is Data "vulnerable" for when Sync is disabled..?
Is it a finite (known) period..?
Vary based on transfer size?
Is there some articulable policy I can give, like, "if the system is running normally for 2 min after a transfer, that data is safe." ..?

Also, I've read a SLOG introduces a "vulnerability" of up to 5 sec. Is that correct ..? Though, I know I have the capacitors on the radian SLOG.

Any asynchronous transfer introduces that potential "vulnerable period" where data is written only to RAM, not stable storage. Enabling sync writing means that the data is written to stable storage - but this often comes at a significant performance cost. Using a fast SLOG device allows you to regain some of (but not all) of that speed. The default tunables in TrueNAS give you a window of about five seconds of "dirty data" - with that said, if you're transferring a file that's dozens of GB in size, you'd want the entire thing to be safely written, and then five or ten additional seconds after the data finishes transfer. But generally speaking if your server crashes mid-transfer, you ought to be checking on any files you transferred recently and/or comparing file hashes, because it's likely the client system (eg: Windows/MacOS) will abort the transfer (resulting in a half-copied and broken file) before the server-side comes up again.
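
As a rough example of that spot check (file and dataset paths below are placeholders), comparing the source copy on the Mac against what landed on the pool:

# On the Mac (source copy):
shasum -a 256 ~/Projects/clip001.mov

# On TrueNAS, over SSH, against the copy that landed on the pool:
sha256 /mnt/tank/media/Projects/clip001.mov

# Matching digests mean that file survived the interruption intact.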

PS:

Before I create another post re: my questions about creating a video-centric ZFS machine...

Can I "trick" TrueNAS to keep data smaller than an L2ARC in the L2ARC (if not the ARC) by transferring it ~5x in a row or something?
Say an editor has a 200GB project and less than 200GB free (and won't be using proxies for some reason), thus, using ZFS as a scratch disk.
Would copying the project data from the array ~5-10x (and copying no other data) load that data to the (Optane) L2ARC?

Generally speaking, trying to fool ARC/L2ARC in this manner doesn't work well unless the cache you're trying to fool is empty. If it's completely empty, it will try to pick the "best candidate" out of your rigged options ("wow, these video files have a big MFU/MRU score, better take them"), but anything else will likely fail the test of "is this an actual usage pattern?"

I picked up a couple p5800x -- but may instead grab a couple Optane 900p for this ...
Even 4K video doesn't need more than 400MB/s ... which technically the spinning Pool would do, but the latency would be a big issue.
This of course is using the SFP+ interface via Thunderbolt 3 ... which makes me also wonder if a DAC is okay, or if that'd warrant actual fiber.

Alternate ideas involved creating a small + riskier ( single parity / z1 ) NVMe array that was sync'd to a neighboring RAIDz2 array.
If not locally (PCIe limits), physically next to each other, limiting risk to the time required to get the NVMe partition synced with the z2 array.
In case the L2ARC were too temperamental, using the NVMe array as a "scratch Dataset" or as a Tiered Pool to make ingest limited by NVMe:
Which then syncs to the z2 array. I just need to find a sync protocol which syncs [data] but not [deletes] (incremental, not differential).

Use the NVMe array as a project repo or a (temp) staging spot which is synchronized with the main array while minimizing attendance of ingest.
P5800X is hugely overkill for this kind of workload but definitely won't be short on performance. Almost a shame to limit it to 10Gbps networking rather than making it local to each station.

Using two separate pools with the "active data" and a "backup pool" would work. You'd want to do a scheduled file-level (rsync/scp) sync task between them though as the ZFS send/recv will replicate, which I understand you're trying to avoid.
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Using a fast SLOG device allows you to regain some of (but not all) of that speed.
AH, so a SLOG can only mitigate the performance cost of 'SYNC' ..? Great (succinct) info.
Default TrueNAS Tunables give you a window of about five seconds of "dirty data."
I've not dived into this yet, as I think it'd take me a while to feel like I had a handle on it.

If your server crashes mid-transfer, you ought to be checking any recently transferred files and/or comparing file hashes.
More great info in that, I realized I don't know which hash command ZFS uses in order to audit data.
I can look it up if you don't know it by heart.
Ah - I looked it up and now remember it said it checksums the blocks (which makes sense ... I guess it uses a B-Tree and nodes, etc.

I'm guessing therefore I can use any hash (MD5, SHA, etc) as any will test for consistency and it doesn't matter what I use.

Generally speaking, trying to fool ARC/L2ARC in this manner doesn't work well unless the cache you're trying to fool is empty. If it's completely empty, it will try to pick the "best candidate" out of your rigged options ("wow, these video files have a big MFU/MRU score, better take them"), but anything else will likely fail the test of "is this an actual usage pattern?"
Great point and info. Thanks. Kinda like the way fake boobs don't fool the human brain. It's well tuned to recognize "authenticity."

P5800X is hugely overkill for this kind of workload but definitely won't be short on performance. Almost a shame to limit it to 10Gbps networking rather than making it local to each station.
Local to each station? Go on please... (Okay, I'll look at the 280GB 900p)

I actually have a pair of SFP28 NICs (PCIe so, Windows only).
Options for laptops (specifically, Macs via Thunderbolt 3 to SFP28) get expensive (though I'm always looking for a deal).

I doubt even QSFP+ products would work (except on a Hackintosh, or a PC via Linux or Windows), as it's unlikely they'd have Mojave 'kexts' (drivers).
I have an SFP28/QSFP28 switch (Dell S5148F-ON) -- but it needs the switch fans (Z9100-ON / Dell P/N: 03CH15), which I bought cheaply.


Using two separate pools with the "active data" and a "backup pool" would work.
You'd want to do a scheduled file-level (rsync/scp) sync task between them though as the ZFS send/recv will replicate, which I understand you're trying to avoid.
Ah, replication doesn't have a differential operating mode I take it (though I can't quite recall if rsync does either).

Man, I really appreciate all your comments and will be researching all the points you made whether or not you're able to reply; if you aren't, I'll post the answers I find so anyone else can see them.

If nothing else, maybe a small suggestion of what you'd like me to look into for the p5800x

Thank you so very much for your time helping me with all of these questions. I'm really grateful.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
AH, so a SLOG can only mitigate the performance cost of 'SYNC' ..? Great (succinct) info.
Yep. Nothing will ever be faster than async, but it's not safe - so you make it safe with sync, then add an SLOG fast enough to make it not painful.

I've not dived into this yet, as I think it'd take me a while to feel like I had a handle on it.
Defaults are fine for most, but I'll try my hand at explaining it succinctly again:
ZFS starts to flush a transaction group when it hits either 64MB of "pending data" or 5 seconds have passed
It will allow a maximum of 10% of your RAM or 4GB (whichever is smaller) in "dirty" or "pending" writes
Once you hit 60% of that maximum, it's going to start artificially delaying things in order to avoid getting swamped. (The matching sysctls are shown below.)
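
On Core those knobs surface as FreeBSD sysctls; a quick way to read the current values:

sysctl vfs.zfs.txg.timeout              # transaction group timeout (the "5 seconds")
sysctl vfs.zfs.dirty_data_max           # absolute dirty-data cap, in bytes (the "4GB")
sysctl vfs.zfs.dirty_data_max_percent   # dirty-data cap as a % of RAM (the "10%")
sysctl vfs.zfs.delay_min_dirty_percent  # point where ZFS starts delaying writes (the "60%")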

More great info in that, I realized I don't know which hash command ZFS uses in order to audit data.
I can look it up if you don't know it by heart.
Ah - I looked it up and now remember it said it checksums the blocks (which makes sense ... I guess it uses a B-Tree and nodes, etc.

I'm guessing therefore I can use any hash (MD5, SHA, etc) as any will test for consistency and it doesn't matter what I use.
Correct again. ZFS does checksums at the record level; you want to know if the entire file is okay, hence do your checking at the file level with whatever algorithm and program you want.
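
If you're curious which record-level checksum a given dataset is using (the dataset name is a placeholder), and then any file-level tool for the end-to-end check:

zfs get checksum tank/media      # shows e.g. "on" (fletcher4 by default) or sha256/sha512 if set
shasum -a 256 bigfile.mov        # file-level check on the Mac -- any algorithm/tool is fine here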

Great point and info. Thanks. Kinda like the way fake boobs don't fool the human brain. It's well tuned to recognize "authenticity."
ZFS isn't fooled by "bags of sand" ;)

Local to each station? Go on please... (Okay, I'll look at the 280GB 900p)
I just mean that the P5800X is capable of multiple gigabytes per second ... it seems a shame to use several of them and then limit them to something measured in gigabits. Put them directly into the PCs doing the editing and watch it be stupidly fast, then periodically have them rsync back to the RAIDZ2 spinning-rust pool over the network.

I actually have a pair of SFP28 NICs (PCIe so, Windows only).
Options for laptops (specifically, Macs via Thunderbolt 3 to SFP28) get expensive (though I'm always looking for a deal).

I doubt even QSFP+ products would work (except on a Hackintosh, or a PC via Linux or Windows), as it's unlikely they'd have Mojave 'kexts' (drivers).
I have an SFP28/QSFP28 switch (Dell S5148F-ON) -- but it needs the switch fans (Z9100-ON / Dell P/N: 03CH15), which I bought cheaply.
Not too versed in the MacOS side (and even less the Hackintosh stuff) - if I'm gonna run not-Windows it's going to be a Linux or a BSD.

Ah, replication doesn't have a differential operating mode I take it (though I can't quite recall if rsync does either).
You can use snapshots if you want to add a "lag time" between the deletes and being able to recover them, but I believe you're still best off here working at the file level. And rsync by default only copies and overwrites; it won't delete on the destination if the file is missing from the source (because it won't think to look for it).
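
A minimal sketch of that file-level sync (paths are placeholders for your NVMe and RAIDZ2 pools):

# Copy new/changed files only; nothing is deleted on the destination:
rsync -avh /mnt/nvme/projects/ /mnt/tank/projects/

# Adding --delete would turn it into a true mirror and propagate deletions too.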

Man, I really appreciate all your comments and will be researching all the points you made whether or not you're able to reply; if you aren't, I'll post the answers I find so anyone else can see them.

If nothing else, maybe a small suggestion of what you'd like me to look into for the p5800x

Thank you so very much for your time helping me with all of these questions. I'm really grateful.
Happy to help.
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Dude, your answers have been awesome and beyond helpful.
While I won't let my ego keep me from learning from knowledgeable turds;
Your help has been doubly appreciated in that you've not been even remotely condescending, for which I truly thank you.


So, SLOGs allow a max of 10% of RAM or 4GB (whichever's less) in "dirty" / "pending" writes.
If 4GB is the max that can be used in a SLOG ... why do people EVER talk about adding a larger ZIL?
(Sure, overprovisioning, etc., but e.g. the Radian RMS-200 (no write-endurance limit & has power-loss protection) ... 4GB is all it'll use?)

And it seems like something other than (or in addition to) IOPS causes some slow transfers, when arrays capable of ~800 MB/s drop below 1 MB/s.
Do you think that's true..? (again, we're talking without dedupe) ...

Any reason not to make another boot volume with TrueNAS Scale to try it out..? (as in, it won't affect the pool, right..?)

While I (personally) do not get bolt-ons (not fooled) ...

Apparently, transfers using NFS cause the data to be loaded into the L2ARC..?
(you'll see in that thread he says "obviously", with that image at the top of page 2)
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
So, SLOGs allow a max of 10% of RAM or 4GB (whichever's less) in "dirty" / "pending" writes.
If 4GB is the max that can be used in a SLOG ... why do people EVER talk about adding a larger ZIL?
(Sure, overprovisioning, etc., but e.g. the Radian RMS-200 (no write-endurance limit & has power-loss protection) ... 4GB is all it'll use?)
This is tuneable; HoneyBadger is referencing default behaviour.

This thread covers a fair amount about it.

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
This is tuneable; HoneyBadger is referencing default behaviour.

This thread covers a fair amount about it.


Correct. All of the values of "4GB" "10%" "60%" are the defaults, and most of them can be changed; although they're only changeable at a system-wide granularity. I'd like to see them be pool-specific, so that you can have one pool tuned to be more responsive, and the other one set up for bulk ingest with throughput being more important than latency.
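
As an illustration only (and assuming this particular sysctl accepts runtime changes on your build), a system-wide change would look something like this, with a System -> Tunables entry of type "sysctl" used to make it persist across reboots:

sysctl vfs.zfs.dirty_data_max=8589934592   # raise the dirty-data cap to 8 GiB, for every pool at once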

I also wrote a fair bit in this thread as well, probably retreading some ground that you're already familiar with @TrumanHW but it might be worth a read for the summary again.


Apparently, transfers using NFS cause the data to be loaded into the L2ARC..?
(you'll see in that thread he says "obviously", with that image at the top of page 2)

He's specifically talking about ARC in that instance, and writes do live "in the ARC" so to speak. However, the process by which they get populated in L2ARC is a bit more complicated - check the post by @jgreco and continue scrolling down for some more details on the nitty-gritty, but in essence "your L2ARC is only as good as your ARC can be" - if ARC is too small and/or very fresh, ZFS can't really make informed decisions about what's valuable to put in L2ARC.


L2ARC is also a "dumb" ring buffer - it has no concept of MRU/MFU once data is in there. It's just FIFO, with the additional asterisk that a read hit on a record evicted to L2ARC will pull data back into ARC ... which could then later be ejected from ARC and written back to L2ARC at the "head" of the queue (... it's the circle of writes, and it moves us all ...)
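
If you want a rough read on how ARC and L2ARC are actually behaving on your box, a few of the arcstats counters are enough:

sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses        # ARC hit/miss counters
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses  # L2ARC hit/miss counters
sysctl kstat.zfs.misc.arcstats.l2_size                                    # bytes currently in L2ARC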
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
"your L2ARC is only as good as your ARC can be"

This is exactly it. What you WANT is for your L2ARC to get populated by the least interesting bits that you would LIKE to keep around if only you had the space.

Think about it like this:

Identifying which bits you would LIKE to keep around doesn't happen correctly if 98% of ARC blocks only have one hit; what you want is to evict the ARC entries that have TWO or THREE hits, because you can be reasonably sure that they weren't JUST the result of a one-time file access.

Obviously it is more complex in practice, but the point is that when the ARC is too small, data is discarded too quickly for a pattern of access to become evident. In this model, it's really rather difficult to have "too much" RAM. Only "entire pool in RAM" is really too much. :smile:
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Correct. All of the values of "4GB" "10%" "60%" are the defaults and most can be changed; although only system-wide.
I'd prefer they were pool-specific, so you can have pool-based tuning set for bulk ingest where throughput is more important than latency.

I also wrote a fair bit in this thread as well, probably retreading some ground you're familiar with but may be worth a re-read.

Thanks; I'll def. check that out.
Weirdly, I got lucky and caught quite a few files with discrepancies (they don't work when played back) after copying them.
(I've had no power loss, random reboots, etc., and thus it shouldn't be a sync issue, I'd presume..?)

These files were transferred & verified yesterday via macOS (using Forklift and/or Finder), and all of them seemed to have "copied" successfully.
But today I re-audited (after another transfer) and found discrepancies in data copied yesterday.

I'm not talking about .DS_Store files or minute differences ... but differences in many media (video) files.
Though I've since fixed the issues I'd def. like to know where the errors might have emerged from.

This occurred with data written to BOTH volumes and thus, doesn't seem like it could possibly be a:
- bad SAS controller
- bad cable
- and of course I'm using ECC ram.


As to why I'm not using rsync or a TrueNAS (FreeNAS) replication task:


My initial use of FN was by importing data from QNAP (I think I SSH'd in and used rsync push?)
This retained the permissions from QNAP which restricted manipulating the data ... precluding
- moving [some] folders
- renaming [some] files & folders

(After a 2-YEAR quest to identify why it was god-awful slow, in which I finally learned dedupe was the culprit)...
I'd hoped a TrueNAS replication task might mitigate those issues (nope).
In trying to fix the duplications, etc., things improved after using chmod via the FreeBSD shell.

Finally, manually copying to & from my Mac & back to TrueNAS has fixed these issues, for which I’ve used

Forklift ... and would generally choose from Forklift, CCC (Carbon Copy Cloner) or in some cases, ditto

On the off chance anyone ever reads this and could benefit from my reasoning / rationale (or better, someone has superior suggestions):


I use Forklift when sync’ing folders (to a folder)
  • I.e., when it’s a selective transfer (partial).
  • especially when it’s a lot of data and almost completed (missing just a small amount) as it’s a HUGE time savings.
  • it provides a side-by-side GUI interface.
  • allows comparing the data by modification date or size
  • allows PAUSING a transfer so another transfer can run at full speed (the first isn't stopped, just paused).
  • Forklift provides sec-by-sec updated transfer speeds (diagnostically great).

sudo ditto is also great as it'll copy data without stopping, even where you'd otherwise have needed a password OR the data was corrupt.
  • sudo ditto [source] [destination] ...
  • often skipped Library folders or items (esp. things like iCloud documents).
  • often skipped Pictures (not always, but often enough to where I had to manually audit).

While CCC (Carbon Copy Cloner) will also do this, there are just times in which I prefer Forklift.

  • CCC copies system folders & elements in your Home Folder ( [User Folder] for Mac people):
  • even if they're in use (locked)
  • even if it has special permissions
  • CCC can create bootable backups
  • and can create a single IMAGE like: a sparsebundle, DMG, etc
  • supports scheduling / multiple tasks.

But I'd never seen references to people using anything like CCC to perform network tasks ...
Backing up TBs of 'difficult' client data which kept failing over 1GbE was sufficient impetus to connect the dots.
I discovered CCC for this, which wound up being a breath of fresh air (before I knew of Forklift).
I now always use both methods ... & hope it's either useful for someone else, or that I get a better recommendation!

:)
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
PS ...
Anyone know a command I can use in macOS (it's also "based on" FreeBSD) which will tell me how many:
4K files
8K files
etc. ... so I can try to work out why a folder is moving slowly? (rough sketch of what I mean below)
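
Something along these lines is what I'm imagining (sketch with the stock BSD find; the path and size buckets are placeholders):

cd /Volumes/SomeShare/SomeFolder
find . -type f -size -4k | wc -l              # files under ~4 KB
find . -type f -size +4k -size -8k | wc -l    # files between ~4 KB and ~8 KB
find . -type f -size +128k | wc -l            # larger files, and so on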
When a sparsebundle transfers -- should it transfer as though it's a single file ..?
Or does the FS treat it like it's the 8MB bands of which it's comprised?

It's also faster to transfer data from one TN server TO my laptop and then to the other TN server ... than
transferring it (via my system) directly from one TN server to the other.

I get that it's not the case that "my system is telling it to go from TN-1 to TN-2" ...
but that my computer essentially has to act like a hub (or proxy) ... still, it's much less than half as fast.

Instead of downloading at ~500 - 600 MB/sec ... then uploading at 400 - 500 MB/sec ....
It'll copy the data from Server 1 to Server 2 at 80 MB/sec ...
I don't doubt that it's normal. I'm just observing that it sucks. :)
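
(One thing I should probably try: having one server pull straight from the other so my laptop isn't in the middle -- a rough sketch over SSH, with hostnames and paths as placeholders:)

# Run from the destination TrueNAS shell; pulls directly from the source server:
rsync -avh -e ssh root@truenas1:/mnt/tank/share/ /mnt/tank2/share/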
 