Reads slow to a crawl when writes hit the array

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Hello,

I recently started a new gig and inherited a ZFS filesystem running TrueNAS CORE as the host OS. It is used as an archive / second copy of production data. The person who set it up put all the disks into a single vdev. I want to rebuild the vdev layout, but first I need to get the archive data on there onto two sets of tape before destroying the filesystem. To keep things short and not get into details: we write to this array all day long for days at a time, with brief interruptions.

Anytime we are writing to the array, read performance falls through the floor. For example, when we are not writing to the array, we see about 700MB/s to the tape library over NFS (everything is NFS here). When we start writing to the array, that drops to 30MB/s or lower. While the writes are heavier than the reads, I didn't expect them to cause as heavy a read-performance hit as they do. See the graph below for the trend: green is into the array, blue is out to the backup server.

[Attached graph: throughput over time - green is writes into the array, blue is reads out to the backup server]


Is this expected behaviour / does this seem normal? It's been a while since I administered a ZFS filesystem, and I don't recall this being a thing. Our other in-house file servers (an Isilon and a Qumulo) can read and write without this drastic a hit in either direction.



MB: SSG-640SP-E1CR60 X12 single-node 60-bay Supermicro storage server
CPU: Intel Xeon Gold 6326
RAM: 128GB
HDDs: 58 x 18TB in one RAIDZ3 vdev, plus 2 hot spares; 2 x 1.6TB NVMe for cache, 1 x same for log (SLOG)
System drives: 2 x Micron 5300 PRO 240GB (mirrored)
NIC: Intel X710 dual 10Gbit
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You have 58 drives in one RAIDZ3 vdev? Holy s***balls. You should have several vdevs of no more than 12 drives in each vdev. You will be in significant trouble if you ever need to replace a disk.
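Purely for illustration, with hypothetical device names - the kind of shape I mean is several narrower RAIDZ2/RAIDZ3 groups plus spares rather than one 58-wide vdev, something like:

zpool create tank \
  raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 \
  raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19 \
  raidz2 da20 da21 da22 da23 da24 da25 da26 da27 da28 da29 \
  spare da58 da59
# ...and so on, repeating 10-wide raidz2 groups until the remaining bays are used;
# pick the parity level and width that suit your capacity and rebuild-time needs.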
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I was about to say the same thing as our resident Grinch... but he beat me by mere seconds!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
over NFS (everything is NFS here).
Does that include the incoming writes to the TrueNAS array? If so, can you provide some specifics about the client(s) that are writing to this TrueNAS system?

58 drives in a single RAIDZ3 will definitely cause some "concerns" as well.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Also, if you're using NFS (which implies sync writes most of the time), you'll need to think about the IOPS load that creates for your pool and whether something like 300-900 IOPS is enough to cover your periods of writing... Having a SLOG may be of some small benefit in your current case, but with only 1 RAIDZ vdev, or even if you change to 3, you're likely to be significantly short of the IOPS needed to cope with sustained sync writing. (You'll potentially have some advantage if you're doing massive sequential writing, but in that case you should also tune the recordsize up to the maximum to reduce the IOPS.)
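To see what's actually in play while the tape job and the writes are both running, something along these lines (pool/dataset names are placeholders - substitute your real ones):

zfs get sync,recordsize,logbias tank/archive   # is sync forced or standard, and what recordsize is set?
zpool iostat -v tank 5                         # watch per-vdev IOPS and throughput every 5 seconds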
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
You have 58 drives in one RAIDZ3 vdev? Holy s***balls. You should have several vdevs of no more than 12 drives in each vdev. You will be in significant trouble if you ever need to replace a disk.
As stated, I inherited it and am trying to evac the data to rebuild it. Fully aware it is far less than optimal. Takes a while to run a petabyte of data to tape...
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Does that include the incoming writes to the TrueNAS array? If so, can you provide some specifics about the client(s) that are writing to this TrueNAS system?

58 drives in a single RAIDZ3 will definitely cause some "concerns" as well.
Agreed, the 58-drive vdev is keeping me up at night. There is data on there that isn't on tape or anywhere else yet. When I started here, one of the guys on my team was showing me around the network, and when we came to this I said, "that guy should be sued."

There are only two clients accessing the filesystem now, both over NFSv3. One is the backup server with the tape library doing the reads; the other server is doing all the writes. The backup server is a 64-physical-core machine, and the client doing the writes is a 26-physical-core system. Both are connected to the network at 2x25Gbit; the TrueNAS is dual 10Gbit, MLAG'd to a pair of switches. Both are Linux (Rocky 9.x). The writing system is doing so via 5 threads/streams, and the tape library is at 4 streams. atime is turned off, rsize is 131072, and wsize is 8192.


I'm working with a vendor to either borrow, lease, or buy another petabyte system - if that goes through, I'm going to import the volume and copy it all locally to the new array (and back again if we are not keeping it). But until then I'm keen to get that data onto tape, and I would like to rate-limit the writes or give priority to the reads, so I thought I'd check in here to see if anyone has any ideas, as it's concerning. At the rate the backups are going, it'll be a month or so before they are done. One of the options I was considering is setting the switch ports to 1Gbit on the writer side, which I'll end up doing if there isn't a more elegant solution to throttle the writes or give priority to reads.
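One softer option than dropping the switch ports might be traffic shaping on the writing client itself - a minimal sketch, assuming its interface is eth0 (the real interface name and the rate would need adjusting):

# cap outbound traffic from the writing client at roughly 1Gbit/s
tc qdisc add dev eth0 root tbf rate 1gbit burst 4mb latency 50ms
tc qdisc show dev eth0        # verify the shaper is in place
tc qdisc del dev eth0 root    # remove the cap once the backups are done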

take care,
-g
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
As stated, I inherited it and am trying to evac the data to rebuild it. Fully aware it is far less than optimal. Takes a while to run a petabyte of data to tape...

Not blaming you, sir. I am simply in awe and terror. I would not wish that setup on anybody, and I will burn some vendor swag as an offering to the IT gods on your behalf.
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Also, if you're using NFS (which implies sync writes most of the time), you'll need to think about the IOPS load that creates for your pool and whether something like 300-900 IOPS is enough to cover your periods of writing... Having a SLOG may be of some small benefit in your current case, but with only 1 RAIDZ vdev, or even if you change to 3, you're likely to be significantly short of the IOPS needed to cope with sustained sync writing. (You'll potentially have some advantage if you're doing massive sequential writing, but in that case you should also tune the recordsize up to the maximum to reduce the IOPS.)
Thanks, sretalla. It's currently set to 128K, which was another bad choice given the highly sequential IO profile it deals with, imo... It's a dumpster fire...
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Not blaming you, sir. I am simply in awe and terror. I would not wish that setup on anybody, and I will burn some vendor swag as an offering to the IT gods on your behalf.
Understood, and thanks. It's just stressful as I "own it" now... lol @ the swag. I went to a storage vendor's event the other night and they were giving out monogrammed swag with my name on it. The other week another one wanted to fly us to Mexico for a trapped-and-cornered sales event. No thanks! I was thinking, "so this is where our money goes"... Be well, jgreco, and thanks.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
There are only two clients accessing the filesystem now, both over NFSv3. One is the backup server with the tape library doing the reads; the other server is doing all the writes. The backup server is a 64-physical-core machine, and the client doing the writes is a 26-physical-core system. Both are connected to the network at 2x25Gbit; the TrueNAS is dual 10Gbit, MLAG'd to a pair of switches. Both are Linux (Rocky 9.x). The writing system is doing so via 5 threads/streams, and the tape library is at 4 streams. atime is turned off, rsize is 131072, and wsize is 8192.


I'm working with a vendor to either borrow, lease, or buy another petabyte system - if that goes through, I'm going to import the volume and copy it all locally to the new array (and back again if we are not keeping it). But until then I'm keen to get that data onto tape, and I would like to rate-limit the writes or give priority to the reads, so I thought I'd check in here to see if anyone has any ideas, as it's concerning. At the rate the backups are going, it'll be a month or so before they are done. One of the options I was considering is setting the switch ports to 1Gbit on the writer side, which I'll end up doing if there isn't a more elegant solution to throttle the writes or give priority to reads.

take care,
-g

Since it's a sequential workload, the smaller wsize of 8K on the NFS client side might be hurting things, depending on what model the NVMe log devices are (assuming they're in the write stream here). The dataset recordsize can also be adjusted up to 1M fairly easily, although that will only impact new writes.
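If you want to test the wsize theory, a quick experiment on the writing client would be to remount with a larger wsize - a rough sketch with placeholder export/mount names, assuming the server and client will negotiate it:

umount /mnt/archive
mount -t nfs -o vers=3,rsize=1048576,wsize=1048576 truenas:/mnt/tank/archive /mnt/archive
nfsstat -m    # confirm the rsize/wsize that were actually negotiated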

The 58-wide Z3 is likely the biggest contributor to the poor performance, as the laws of physics get in the way when you try to coordinate that many HDD arms at once to handle simultaneous reads and writes. We could try pushing out the transaction sizing as well, but that's only likely to delay the onset of the slow performance.

Regarding working with a vendor on additional storage, feel free to shoot me a DM - I promise I won't try to fly you anywhere.
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Since it's a sequential workload, the smaller wsize of 8K on the NFS client side might be hurting things, depending on what model the NVMe log devices are (assuming they're in the write stream here). The dataset recordsize can also be adjusted up to 1M fairly easily, although that will only impact new writes.


Regarding working with a vendor on additional storage, feel free to shoot me a DM - I promise I won't try to fly you anywhere.

I didn't think you could change the recordsize on a ZFS filesystem once created. Am I wrong there, or is this a new feature? I'll do some reading.
thanks HoneyBadger.
-g
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I didn't think you could change the recordsize on a ZFS filesystem once created. Am I wrong there, or is this a new feature? I'll do some reading.
thanks HoneyBadger.
-g
A zvol's block size can't be changed, but a filesystem/dataset's recordsize can. It only impacts newly written data (net new, or overwrites of existing data) though, so anything already written at 128K won't be rewritten at 1M granularity, for example.
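For example, something like this on the dataset (the dataset name here is a placeholder):

zfs get recordsize tank/archive      # check the current value
zfs set recordsize=1M tank/archive   # only data written after this point uses 1M records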
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703