10GbE - 8Gbps with iperf, 1.3Mbps with NFS

Joined
Apr 26, 2015
Messages
320
Sure, I'll work on trying to get that 10G card working on my other TN server in a while.

I added the specs for the new server I've been working on with you to my sig. Doesn't it show, or wasn't that enough?
This is the info I put in there which I thought was fairly detailed.
TN 12.0-U6 running on a Dell R620 with 2x E5-2660 (2.20GHz) and 90GB RAM.
Ten HGST Zeus 800GB S842E800M2 SSDs. SLOG on a 32GB Optane. OS on a 64GB M.2.

No network card?
This new server is a Dell R620 with four built-in 1G NICs and one 10GBase-SR QLogic card. Not sure of the model at the moment.
The other system is an IBM x3550, I believe, with 32G of memory and external storage. It has two 146G SAS drives used in a mirror for the OS and has the same QLogic 10G card installed. This is the system I want to get the 10G card working on so we can run this test, but I've not had any luck getting that card an IP. The card is connected to the switch at 10G and TN sees it as a valid interface; I just can't assign it an IP. Once I do that, I can test between two 10G systems.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
In your signature spec you don't mention HBAs (if any) or NICs (which you do mention in the post above), which is the point I was making. Please complete the spec fully.

In addition - trawling the forum about QLogic and TrueNAS Core does bring up some interesting posts (which may or may not be relevant) and implies that they are less than ideal depending on the model number... for example:
 
Joined
Apr 26, 2015
Messages
320
I have no way of knowing unless I pull the systems out of the rack. The cards would be QLE8142 or QLE8152.
As we saw in the iperf tests, the one installed in the system we are trying to build in this thread was able to get into the 9+Gbps transfer rates.

On the other system, I searched the logs and don't see any errors but I can't seem to assign an IP to the card.
Seems I have to configure different subnets to allow more than two interfaces. Still looking around.
 
Joined
Apr 26, 2015
Messages
320
I have an FTP transfer that's been running for two days on the old TN server and its NFS share that I don't dare stop. It's TBs in size and can restart, but only if the IP stays the same. I'll change the TN server to use the 10G once the transfer is completed, so I can run your test suggestion. No idea how long that will take.

Is my sig complete now?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Signature
How are the disks attached to the motherboard: SATA or via an HBA? If an HBA, which one, and I assume it's flashed to IT mode (which is, I think, how this thread started).

You can't have two cards / interfaces on the same subnet, which may be the problem you are facing. Actually you can, but you will see inconsistent and strange behaviour - so don't do it.
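As a rough sketch of the idea (interface name and addresses here are made up - check the real interface name with ifconfig and make the setting permanent in the GUI):

Code:
# hypothetical interface name / addresses, purely for illustration
#   em0 (1G, management) -> 192.168.1.10/24
#   ql0 (10G, storage)   -> 192.168.10.10/24
ifconfig ql0 inet 192.168.10.10 netmask 255.255.255.0 up   # quick test only, lost on reboot
ping -c 3 192.168.10.20                                    # another 10G host on that subnet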

We shall just have to wait for the FTP to finish
 
Joined
Apr 26, 2015
Messages
320
Ok, done. Any better or did I miss something else?
Yes, it's a subnet issue which I can't play with until this transfer is done.
 
Last edited:
Joined
Apr 26, 2015
Messages
320
So I got my replacement drive today but it still showed the same error in the logs.
>truenas.local smartd 3877 - - Device: /dev/da2, SMART Failure: WARNING: ascq=0x5

I decided to throw everything away and create one pool of two 5-drive RAIDZ2 vdevs plus the SLOG.
The odd thing is that since re-creating the pool, I've not seen the above error in the log so far. Maybe it will show up later.

Code:
# zpool status -v pool01
  pool: pool01
 state: ONLINE
config:
        NAME                                            STATE     READ WRITE CKSUM
        pool01                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/ccc709c7-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
            gptid/cd0a69ed-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
            gptid/cd0f6010-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
            gptid/cd589c56-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
            gptid/cd5dbf16-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/cca75602-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
            gptid/cce4621f-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
            gptid/ccd96e66-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
            gptid/cd608916-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
            gptid/cd69a141-6f6b-11ec-af66-90b11c1dd891  ONLINE       0     0     0
        logs
          gptid/cc6f0ef9-6f6b-11ec-af66-90b11c1dd891    ONLINE       0     0     0
errors: No known data errors

However, I ran the test as we did before and the result is quite different. 7.9Gbps.
Code:
# fio --bs=128k --direct=1 --directory=/mnt/pool01/io --gtod_reduce=1 --ioengine=posixaio --iodepth=1 --group_reporting --name=randrw --numjobs=12 --ramp_time=10 --runtime=60 --rw=randrw --size=256M --time_based
randrw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=posixaio, iodepth=1
...
Run status group 0 (all jobs):
   READ: bw=948MiB/s (994MB/s), 948MiB/s-948MiB/s (994MB/s-994MB/s), io=55.6GiB (59.7GB), run=60067-60067msec
  WRITE: bw=945MiB/s (991MB/s), 945MiB/s-945MiB/s (991MB/s-991MB/s), io=55.5GiB (59.6GB), run=60067-60067msec


I'll test again over ESX tomorrow and see what I get before moving on.
 
Last edited:

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Now try it with sync=always, and you will see (hopefully) the difference that the SLOG makes to speeds. Then we can go back down that rabbit hole.
 
Joined
Apr 26, 2015
Messages
320
That was with sync always. Can I disable sync without removing the SLOG?
Also, is it faster because it's only a two-vdev pool?

And, the da2 drive error is showing up again. Kind of odd that it's the same slot, same drive. I wonder if flashing the controller to IT mode didn't take on that port, or if there is a hardware problem with the server on that port. I'll have to look at that too now.

UPDATE: Same test with sync disabled on the io dataset, 8.8Gbps.

Run status group 0 (all jobs):
READ: bw=1009MiB/s (1058MB/s), 1009MiB/s-1009MiB/s (1058MB/s-1058MB/s), io=59.2GiB (63.5GB), run=60031-60031msec
WRITE: bw=1009MiB/s (1058MB/s), 1009MiB/s-1009MiB/s (1058MB/s-1058MB/s), io=59.2GiB (63.5GB), run=60031-60031msec
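For reference, since the question came up: sync can be toggled per dataset without touching the SLOG. A minimal sketch, using the dataset name from this thread:

Code:
# sync is a per-dataset property; the SLOG stays attached to the pool either way
zfs get sync pool01/io            # show the current setting
zfs set sync=disabled pool01/io   # acknowledge writes from RAM (unsafe on power loss)
zfs set sync=always pool01/io     # force every write through the ZIL/SLOG
zfs set sync=standard pool01/io   # default: honour whatever the client requests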
 
Last edited:

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Well, that's a decent speed - now you have to get access to that speed from outside the NAS.

Try moving a disk from a known good position to the "bad" position and the "bad" disk to the known good position. ZFS won't care as it recognises the disks by GPTID. If the fault moves, it's the drive. If the fault stays, it's the slot / cable / hardware etc.
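If it helps, a minimal sketch of how to see which gptid lives on which da device from the TN shell, so a swap can be verified:

Code:
# list the gptid labels ZFS tracks and the daN partition each one sits on,
# so the gptids shown in zpool status can be matched to physical da devices
glabel status | grep gptid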
 
Joined
Apr 26, 2015
Messages
320
Sure, I can try that. Does TN handle hot swap, since it handles the disks itself in IT mode, or do I have to shut it down?
The info I found online is a little ambiguous; it says it depends on whether the hardware supports it. The card was a RAID card before it was converted to IT mode, so I'm not sure what that means here. I would guess yes.

Right now, I'm testing from ESX again.

What's interesting is that, looking at the disk reporting, that disk is showing the same numbers as the others in terms of transfer speed; all seem low at around 125MiB/s. The specs on these are over 500MB/s.

I also notice the system is only using 12GB of memory, so why do I need so much memory in this? I had 32GB to begin with, now have 90GB, and it's not being used so far. This system won't be under constant pressure.

Mounted the NFS 'backups' share with sync.
ESX shows a top speed of around 51Mbps, and TN reporting for NFS shows 7+MiB/s max.

Mounted the 'io' NFS share with sync disabled.
ESX shows a top speed of around 60Mbps, and TN reporting for NFS shows 8.8MiB/s.

So, the dataset with sync disabled is faster, for some reason.

Then I mounted the NFS shares on a VM on the same ESX host.

Mounted the NFS 'backups' share with sync.
I got 5Gbps.
Mounted the NFS 'io' share without sync.
I got over 900MiB/s; it even hit 1.2GiB/s, or about 10.3Gbps, which doesn't make sense since the NIC is a 10G NIC?
 
Last edited:

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Hot swap is a function of the hardware bays and the OS. The OS is fine - dunno about the bays - but if the bays are hot-swap bays you are good to go. However, yanking a disk out whilst live will probably cause a scrub. I suggest a shutdown, disk swap and boot - ZFS shouldn't then care.

All the spare memory is used as ARC, aka file cache. You actually have to be using the NAS to serve data for the ARC to be used. This is mine after it's been up and serving files for, err, 18 days:
[Attached screenshot: ARC usage report after 18 days of uptime]
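To see how much of that RAM the ARC is actually using, something like this from the TN shell works (a sketch; exact fields vary by version):

Code:
# current ARC size and ceiling (FreeBSD / TrueNAS CORE sysctls)
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
# or, if the summary tool is present, a fuller breakdown
arc_summary | head -n 30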


Sync disabled will always be faster. It's the fastest it can possibly go; writes are acknowledged immediately, after being written to RAM. However, if you have a power event / crash then data will be at risk, as it was only in RAM.
Sync enabled is safe, in that writes are not acknowledged until they have been written to permanent storage - which is slow (relatively).

However, if you add a high-speed (I don't want to use the word cache, but) "cache" to the process, then with sync enabled the writes are written to a low-latency, high-performance "cache" first, and then they can be acknowledged. This is called a SLOG and needs to be very high speed, low latency at low queue depths, long endurance, and must not lose data on loss of power. In steady state it is only ever written to and never read; the only time it is read is at system boot, to see if any writes were lost in the power-down event. Thus the perfect SLOG devices are (in order [my opinion]):
NVDIMM ( I guess)
Radian Memory Systems RMS devices
Optane 4800X (you may need to be sitting down)
Optane 900p (I use these)
Other Optanes (better for HDD SLOGs)
STEC ZeusRAM (old)
Some other enterprise SSDs (kinda a catchall)

Annoyingly, Intel seems to have EOL'd proper consumer Optane products for some reason. Honestly the best out there are the Radian Memory Systems devices: almost infinite endurance, RAM speeds, but rare as rocking horse shit and not sold retail for some reason unless you want several hundred.

A SLOG (by default) only needs to hold 5 seconds of cache (actually the ZFS Intent Log (ZIL)), after which it is flushed to disk, so it doesn't need to be big - which is why the RMS devices are perfect. Optanes are good, but they do wear out eventually. The M10s should be treated as disposable, as the endurance isn't great and they aren't that quick (they are OK). If you don't have a SLOG, then sync writes are written to the ZIL on permanent storage first (in a temporary location), acknowledged, and then flushed from there to a permanent location as part of a transaction batch.
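In case it helps, a minimal sketch of how a log vdev is attached to or removed from an existing pool from the shell (normally the TN GUI does this; the gptid below is just the one already shown in the zpool status earlier in the thread):

Code:
# attach a log vdev to an existing pool
zpool add pool01 log gptid/cc6f0ef9-6f6b-11ec-af66-90b11c1dd891
# a log vdev can later be removed without destroying the pool
zpool remove pool01 gptid/cc6f0ef9-6f6b-11ec-af66-90b11c1dd891
zpool status pool01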

My takeaway from your numbers is that it's working. Direct from ESX it's crap, but from a VM those numbers look good, although I don't know how you got them (i.e. what test). It makes sense that sync=enabled is slower, but it is safer. I suspect you will need a faster SLOG to improve that (much more expensive), but that can be added / substituted later.

If it helps: IX on SLOG/ZIL
 
Last edited:
Joined
Apr 26, 2015
Messages
320
Hmm, interesting. I just left TN running and swapped the drives. Sure enough, the error keeps showing up with /dev/da2.

However, now I'm not convinced I'm even changing the right drive. It looks like the da numbers and the physical bays don't match up.
I pulled the drive just below what I think is da2 (physical bay 3), but as we can see below, the log shows da5 and da9 being affected, and then the da2 error shows up again (see the mapping sketch after the log).
Code:
Jan  7 11:31:09 truenas mps0: mpssas_prepare_remove: Sending reset for target ID 24
Jan  7 11:31:09 truenas mps0: mpssas_prepare_remove: Sending reset for target ID 29
Jan  7 11:31:09 truenas da5 at mps0 bus 0 scbus0 target 24 lun 0
Jan  7 11:31:09 truenas da5: <STEC S841E800M2 E4R9>  s/n STM0001A8C80 detached
Jan  7 11:31:09 truenas da9 at mps0 bus 0 scbus0 target 29 lun 0
Jan  7 11:31:09 truenas da9: <STEC S842E800M2 E4T1>  s/n STM00019F574 detached
Jan  7 11:31:09 truenas (da9:mps0:0:29:0): Periph destroyed
Jan  7 11:31:09 truenas (da5:mps0:0:24:0): Periph destroyed
Jan  7 11:31:09 truenas mps0: No pending commands: starting remove_device
Jan  7 11:31:09 truenas mps0[16069]: Last message 'No pending commands:' repeated 1 times, suppressed by syslog-ng on truenas.local
Jan  7 11:31:09 truenas mps0: Unfreezing devq for target ID 24
Jan  7 11:31:09 truenas mps0: Unfreezing devq for target ID 29
Jan  7 11:31:13 truenas 1 2022-01-07T11:31:13.458163-08:00 truenas.local smartd 17774 - - Device: /dev/da2, SMART Failure: WARNING: ascq=0x5
Jan  7 11:31:16 truenas 1 2022-01-07T11:31:16.422592-08:00 truenas.local savecore 17785 - - /dev/ada0p3: Operation not permitted
Jan  7 11:31:16 truenas 1 2022-01-07T11:31:16.714065-08:00 truenas.local savecore 17787 - - /dev/ada0p3: Operation not permitted
Jan  7 11:31:58 truenas 1 2022-01-07T11:31:58.106645-08:00 truenas.local smartd 17887 - - Device: /dev/da2, SMART Failure: WARNING: ascq=0x5
Jan  7 11:31:58 truenas da5 at mps0 bus 0 scbus0 target 29 lun 0
Jan  7 11:31:58 truenas da5: <STEC S842E800M2 E4T1> Fixed Direct Access SPC-4 SCSI device
Jan  7 11:31:58 truenas da5: Serial Number STM00019F574
Jan  7 11:31:58 truenas da5: 600.000MB/s transfers
Jan  7 11:31:58 truenas da5: Command Queueing enabled
Jan  7 11:31:58 truenas da5: 763097MB (1562824368 512 byte sectors)
Jan  7 11:31:58 truenas da5: quirks=0x20<NO_UNMAP>
Jan  7 11:32:07 truenas 1 2022-01-07T11:32:07.828180-08:00 truenas.local smartd 17991 - - Device: /dev/da2, SMART Failure: WARNING: ascq=0x5
Jan  7 11:32:07 truenas da9 at mps0 bus 0 scbus0 target 24 lun 0
Jan  7 11:32:07 truenas da9: <STEC S841E800M2 E4R9> Fixed Direct Access SPC-4 SCSI device
Jan  7 11:32:07 truenas da9: Serial Number STM0001A8C80
Jan  7 11:32:07 truenas da9: 600.000MB/s transfers
Jan  7 11:32:07 truenas da9: Command Queueing enabled
Jan  7 11:32:07 truenas da9: 763097MB (1562824368 512 byte sectors)
Jan  7 11:32:07 truenas da9: quirks=0x20<NO_UNMAP>

Then I swapped them back, and yup, da5 and da9 again.

Code:
Jan  7 11:36:17 truenas mps0: mpssas_prepare_remove: Sending reset for target ID 29
Jan  7 11:36:17 truenas mps0: mpssas_prepare_remove: Sending reset for target ID 24
Jan  7 11:36:17 truenas da5 at mps0 bus 0 scbus0 target 29 lun 0
Jan  7 11:36:17 truenas da5: <STEC S842E800M2 E4T1>  s/n STM00019F574 detached
Jan  7 11:36:17 truenas da9 at mps0 bus 0 scbus0 target 24 lun 0
Jan  7 11:36:17 truenas da9: <STEC S841E800M2 E4R9>  s/n STM0001A8C80 detached
Jan  7 11:36:17 truenas (da9:mps0:0:24:0): Periph destroyed
Jan  7 11:36:17 truenas (da5:mps0:0:29:0): Periph destroyed
Jan  7 11:36:17 truenas mps0: No pending commands: starting remove_device
Jan  7 11:36:17 truenas mps0[16069]: Last message 'No pending commands:' repeated 1 times, suppressed by syslog-ng on truenas.local
Jan  7 11:36:17 truenas mps0: Unfreezing devq for target ID 29
Jan  7 11:36:17 truenas mps0: Unfreezing devq for target ID 24
Jan  7 11:36:21 truenas 1 2022-01-07T11:36:21.353937-08:00 truenas.local smartd 18135 - - Device: /dev/da2, SMART Failure: WARNING: ascq=0x5
Jan  7 11:36:24 truenas 1 2022-01-07T11:36:24.101150-08:00 truenas.local savecore 18153 - - /dev/ada0p3: Operation not permitted
Jan  7 11:36:24 truenas 1 2022-01-07T11:36:24.334113-08:00 truenas.local savecore 18155 - - /dev/ada0p3: Operation not permitted
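For anyone following along, a rough sketch of how to map da numbers to physical drives (serial numbers come from smartctl; the LED trick assumes the backplane supports SES):

Code:
# print each da device's serial number to match against the bay / caddy labels
for d in /dev/da?; do echo -n "$d: "; smartctl -i $d | grep -i serial; done
# if the backplane supports SES, blink the bay LED for the suspect drive
sesutil locate da2 on
sesutil locate da2 off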



 
Joined
Apr 26, 2015
Messages
320
I think we posted at the same time so you might want to look at my comment just before yours as I updated it.
That's a lot of good info. I wish this thread wasn't so long, but I hope people can find it useful once it's all said and done.

The server will be in a DC which has redundant power. I bought the suggested Optane device as comments seemed to imply it was the best out there, and the main difference between the options, price-wise, was size.

Now it seems I don't even need it. I'm not going to run VMs off of this storage; the only 'live' data will be web pages shared between load-balanced web/app servers, plus some centralized backup storage that all the ESX servers can get at.

I'm a little unsure of where to go from here. I need some safety in terms of data, rebuild speeds, that sort of thing but mostly just reliable and fast shared storage. Maybe I can test using 9000 MTU again at some point but right now, I'd like to clean everything up and come up with a final pool.

To me, this pool of two 5-drive RAIDZ2 vdevs gives me a little over 4TB using the ten 800GB drives installed. That's reasonable, but maybe there is another config that can give me the same speed and storage space and be even a little safer without giving anything up.

Also, I wonder now if I should remove the 64GB M.2 card I'm using for the OS and put the OS on the Optane card if I'm not going to end up using it.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
SLOG only affects sync writes, so SMB-type traffic is not affected. It's mostly NFS and iSCSI, and only writes. If your traffic is mostly reads then don't worry about it. Note - do NOT use sync=disabled on a live pool - your data is at risk in the event of a power outage / system crash / kernel panic. Redundant power does help here, but isn't 100% effective; what happens if the TN box itself throws a wobbly?

If it's fast enough without the SLOG then don't use one; it's not as if the SSDs are slow.

Why are you backing up to SSD? Surely a bunch of big HDDs would be a better backup target (and could use the Optane SLOG you have), as (within reason) why care about backup speed. As for serving web pages - if you have enough memory (ARC) then the web pages will be served from ARC, so again you could potentially be using HDDs. You need to know the size of your working set.

Welcome to the world of ZFS (TrueNAS) pool design

Mirrors = high IOPS, quick resilver; dunno whether sequential access is slower or faster than Zn
RAIDZn = low IOPS, slow resilver, quite good at sequential access.
If that wasn't complex enough, now add:
L2ARC, or L2ARC metadata only
Special vdevs
Others that probably aren't relevant

In your current pool you have 2 vdevs of RAIDZ2 - so the IOPS of 2 disks (SSDs have high IOPS anyway) and fair sequential access. If you lose a disk then the vdev will take a long time to resilver (relatively speaking; SSDs resilver a lot quicker than HDDs because they are faster and smaller), but you can lose 2 disks in a vdev and still keep the pool.
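For illustration, here's roughly what the two layouts look like at pool-creation time (device names are placeholders, not commands to run against the live pool):

Code:
# striped mirrors: 5 vdevs of 2 -> IOPS of 5 vdevs, ~50% usable space
zpool create fastpool mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7 mirror da8 da9
# two 5-wide RAIDZ2 vdevs (the current pool01 layout) -> IOPS of 2 vdevs, ~60% usable, any 2 disks per vdev can fail
zpool create bulkpool raidz2 da0 da1 da2 da3 da4 raidz2 da5 da6 da7 da8 da9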

Based on your comments I suspect you have got caught up in numbers or what you want, rather than what you need.
 
Joined
Apr 26, 2015
Messages
320
Yes, you're right, I've gotten caught up in this and lost sight of the original intent.
From this thread, I ended up thinking I needed more speed and more safety. Somehow, based on some of the input, I felt that I had to follow suggestions and upgrade this and that.

However, this thread, while costing me delays in completing my build, also turned into learning more about TrueNAS and how I can better utilize what I have or will have.

The thread has left me with some unknowns but more clarity in terms of how ZFS works, at least when using a controller. I now know to use a controller in IT mode to allow ZFS to handle all of the redundancy. I never bothered with any of this since I was using (still using) external storage with my other TN server. I'll likely start a new post on that since it's too much for this one :).

The goal is to have a central storage to move big files around when I need to. Sometimes, it's just easier to have something centralized rather than copying from one machine to another repeatedly. As mentioned, I also need to serve data to web/app servers.

I do have long term storage as well. For example, all ESX servers have 4 to 6 slots. I use two slots in RAID1 for ESX OS and the other slots for backup drives.

The one thing that is the most important is reliability. We currently use a Linux box with an NFS share to share the web/app data. Linux is reliable but it's not a purpose built box like TN is. I wanted to put something purpose built into the mix and I knew that my FN and TN machines have always stayed up until the hardware itself failed.

So that's it. 1, reliability. 2, live pages being served. 3, reasonable amount of storage space for the number of drives I've invested into this system.
Kinda not sure if I should be looking at iSCSI storage instead of NFS, as I think ESX can use it.

Not sure how I feel about the SSD drives now, but I can use the HDDs as backup drives, which makes it OK.

Basically, I just need to decide on the best pool etc at this point and it's all done.
 
Last edited:

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
So what drives do you have?
You just added iSCSI, which means you need IOPS (assuming you want to run live machines from this server) = mirrors on SSDs for the best responsiveness.
So two pools:
1. HDDs, RAIDZ2/3 for long-term reliability and bulk storage, SMB / NFS storage, with the Optane M10 SLOG
2. SSDs, mirrors for IOPS, iSCSI

Warning - the iSCSI storage is gonna be expensive (in storage terms). Mirrors = 50% utilisation, and then the zvol you create (for best performance) shouldn't be filled past 50%, so roughly 25% raw storage utilisation = ouch (10 HGST 800GB SSDs = 8000GB raw, mirrored = 4000GB, 50% zvol = about 2TB of useable iSCSI space).
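As a sketch of that sizing (pool and zvol names are hypothetical; in practice the zvol and iSCSI share would be created through the TN GUI wizard):

Code:
# sparse zvol kept at ~50% of a mirrored SSD pool, for an ESX iSCSI datastore
zfs create -s -V 2T -o volblocksize=16k fastpool/esx-iscsi
zfs get volsize,used,volblocksize fastpool/esx-iscsi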

You can consider a better Optane SLOG for the SSDs - but that's for the future.

Re ESX - I run a couple of ESXi servers - they both use central storage via iSCSI, but I also have a couple of TB of SSDs in each (consumer SSDs in RAID 5) so I can copy the VMs to the ESXi boxes in case I need to work on the TN server. I call it my swing storage.
 
Joined
Apr 26, 2015
Messages
320
>What drives do you have?
I'm not sure what the question is; I listed them in my sig :). I changed out all of the HDDs for SSDs based on the suggestions in this thread.
The HDDs would have been 1TB SAS drives.

NFS is pretty much all I need. I only mentioned iSCSI because TN supports it so to have it configured and potentially available on TN would have been a bonus if I ever use it.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I just looked up your server. I can see the issue - it's not really a NAS given it's only 10 x 2.5" bays - well, it's not a bulk NAS. I don't know Dells well, but I guess you might want to consider an external shelf - you appear to have 3 PCIe slots available (2 x16, 1 x8), but I can't remember (over 10 pages) how many you have used (1 for the NIC, maybe one for an HBA).
 
Joined
Apr 26, 2015
Messages
320
Two of the three PCIe slots are in use: one for the 10G card, the other for the M.2 adapter.
Not sure what you mean by external shelf. If you mean additional storage, I don't need huge amounts for this install :).

I think I have everything I need in this server, just need to come up with a final config, test and done.
Also need to decide if I'm going to use the Optane or not.
Everyone seems to have left this long thread LOL.
 