Reads slow to a crawl when writes hit the array

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Hello,

I recently started a new gig and inherited a ZFS filesystem running TrueNAS CORE as the host OS. It is used as an archive / second copy of production data. The person who set it up put all the disks into a single vdev. I want to rebuild the vdev layout, but first I need to get the archive data on there onto two sets of tape before destroying the filesystem. To keep things short and not get into details: we write to this array all day long for days at a time, with brief interruptions.

Anytime we are writing to the array, read performance falls through the floor. For example, when we are not writing to the array, we see about 700MB/s to the tape library over NFS (everything is NFS here). When we start writing to the array, that drops to 30MB/s or lower. While the writes are heavier than the reads, I didn't expect them to cause as heavy a read-performance hit as they do. See the graph below for the trend: green is into the array, blue is out to the backup server.

[Attached graph: throughput over time - green is writes into the array, blue is reads out to the backup server]


Is this expected behaviour / does this seem normal? It's been a while since I administered a ZFS filesystem, and I don't recall this being a thing. Our other in-house file servers (an Isilon and a Qumulo) can read and write without this drastic a hit in either direction.



MB: SSG-640SP-E1CR60 X12 single-node 60-bay Supermicro storage server
CPU: Intel Xeon Gold 6326
RAM: 128GB
HDDs: 58 x 18TB in one RAIDZ3 vdev, plus 2 hot spares; 2 x 1.6TB NVMe for cache, 1 x same for log (SLOG)
System drives: 2 x Micron 5300 PRO 240GB (mirrored)
NIC: Intel X710 dual 10Gbit
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You have 58 drives in one RAIDZ3 vdev? Holy s***balls. You should have several vdevs of no more than 12 drives in each vdev. You will be in significant trouble if you ever need to replace a disk.
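Purely for illustration, with hypothetical device names - the kind of shape I mean is several narrower RAIDZ2/RAIDZ3 groups plus spares rather than one 58-wide vdev, something like:

zpool create tank \
  raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 \
  raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19 \
  raidz2 da20 da21 da22 da23 da24 da25 da26 da27 da28 da29 \
  spare da58 da59
# ...and so on, repeating 10-wide raidz2 groups until the remaining bays are used;
# pick the parity level and width that suit your capacity and rebuild-time needs.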
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I was about to say the same thing as our resident Grinch... but he beat me by mere seconds!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
over NFS (everything is NFS here).
Does that include the incoming writes to the TrueNAS array? If so, can you provide some specifics about the client(s) that are writing to this TrueNAS system?

58 drives in a single RAIDZ3 will definitely cause some "concerns" as well.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Also, if you're using NFS (which implies sync writes most of the time), you'll need to think about the IOPS load that creates for your pool and whether something like 300-900 IOPS is enough to cover your periods of writing... Having a SLOG may be of some small benefit in your current case, but with only 1 RAIDZ vdev, or even if you change to 3, you're likely to be significantly short of the IOPS needed to cope with sustained sync writing. (You'll potentially have some advantage if you're doing massive sequential writing, but in that case you should also tune the recordsize up to the maximum to reduce the IOPS.)
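To see what's actually in play while the tape job and the writes are both running, something along these lines (pool/dataset names are placeholders - substitute your real ones):

zfs get sync,recordsize,logbias tank/archive   # is sync forced or standard, and what recordsize is set?
zpool iostat -v tank 5                         # watch per-vdev IOPS and throughput every 5 seconds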
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
You have 58 drives in one RAIDZ3 vdev? Holy s***balls. You should have several vdevs of no more than 12 drives in each vdev. You will be in significant trouble if you ever need to replace a disk.
As stated, I inherited it and am trying to evac the data to rebuild it. Fully aware it is far less than optimal. Takes a while to run a petabyte of data to tape...
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Does that include the incoming writes to the TrueNAS array? If so, can you provide some specifics about the client(s) that are writing to this TrueNAS system?

58 drives in a single RAIDZ3 will definitely cause some "concerns" as well.
Agreed, the 58-drive vdev is keeping me up at night. There is data on there that isn't on tape or anywhere else yet. When I started here, one of the guys on my team was showing me around the network, and when we came to this I said, "that guy should be sued."

There are only two clients accessing the filesystem now, both over NFSv3. One is the backup server with the tape library doing the reads; the other server is doing all the writes. The backup server is a 64-physical-core machine, and the client doing the writes is a 26-physical-core system. Both are connected to the network at 2x25Gbit; the TrueNAS is dual 10Gbit, MLAG'd to a pair of switches. Both are Linux (Rocky 9.x). The writing system is doing so via 5 threads/streams, and the tape library is at 4 streams. atime is turned off, rsize is 131072, and wsize is 8192.


I'm working with a vendor to either borrow, lease, or buy another petabyte system - if that goes through, I'm going to import the volume and copy it all locally to the new array (and back again if we are not keeping it). But until then I'm keen to get that data onto tape, and I would like to rate-limit the writes or give priority to the reads, so I thought I'd check in here to see if anyone has any ideas, as it's concerning. At the rate the backups are going, it'll be a month or so before they are done. One of the options I was considering is setting the switch ports to 1Gbit on the writer side, which I'll end up doing if there isn't a more elegant solution to throttle the writes or give priority to reads.
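One softer option than dropping the switch ports might be traffic shaping on the writing client itself - a minimal sketch, assuming its interface is eth0 (the real interface name and the rate would need adjusting):

# cap outbound traffic from the writing client at roughly 1Gbit/s
tc qdisc add dev eth0 root tbf rate 1gbit burst 4mb latency 50ms
tc qdisc show dev eth0        # verify the shaper is in place
tc qdisc del dev eth0 root    # remove the cap once the backups are done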

take care,
-g
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
As stated, I inherited it and am trying to evac the data to rebuild it. Fully aware it is far less than optimal. Takes a while to run a petabyte of data to tape...

Not blaming you, sir. I am simply in awe and terror. I would not wish that setup on anybody, and I will burn some vendor swag as an offering to the IT gods on your behalf.
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Also, if you're using NFS (which implies sync writes most of the time), you'll need to think about the IOPS load that creates for your pool and whether something like 300-900 IOPS is enough to cover your periods of writing... Having a SLOG may be of some small benefit in your current case, but with only 1 RAIDZ vdev, or even if you change to 3, you're likely to be significantly short of the IOPS needed to cope with sustained sync writing. (You'll potentially have some advantage if you're doing massive sequential writing, but in that case you should also tune the recordsize up to the maximum to reduce the IOPS.)
Thanks, sretalla. It's currently set to 128K, which was another bad choice given the highly sequential IO profile it deals with, imo... It's a dumpster fire...
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Not blaming you, sir. I am simply in awe and terror. I would not wish that setup on anybody, and I will burn some vendor swag as an offering to the IT gods on your behalf.
Understood, and thanks. It's just stressful as I "own it" now... lol @ the swag. I went to a storage vendor's event the other night and they were giving out monogrammed swag with my name on it. The other week another one wanted to fly us to Mexico for a trapped-and-cornered sales event. No thanks! I was thinking, "so this is where our money goes"... Be well, jgreco, and thanks.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
There are only two clients accessing the filesystem now, both over NFSv3. One is the backup server with the tape library doing the reads; the other server is doing all the writes. The backup server is a 64-physical-core machine, and the client doing the writes is a 26-physical-core system. Both are connected to the network at 2x25Gbit; the TrueNAS is dual 10Gbit, MLAG'd to a pair of switches. Both are Linux (Rocky 9.x). The writing system is doing so via 5 threads/streams, and the tape library is at 4 streams. atime is turned off, rsize is 131072, and wsize is 8192.


I'm working with a vendor to either borrow, lease, or buy another petabyte system - if that goes through, I'm going to import the volume and copy it all locally to the new array (and back again if we are not keeping it). But until then I'm keen to get that data onto tape, and I would like to rate-limit the writes or give priority to the reads, so I thought I'd check in here to see if anyone has any ideas, as it's concerning. At the rate the backups are going, it'll be a month or so before they are done. One of the options I was considering is setting the switch ports to 1Gbit on the writer side, which I'll end up doing if there isn't a more elegant solution to throttle the writes or give priority to reads.

take care,
-g

Since it's a sequential workload, the smaller wsize of 8K on the NFS client side might be hurting things, depending on what model the NVMe log devices are (assuming they're in the write stream here). The dataset recordsize can also be adjusted up to 1M fairly easily, although that will only impact new writes.
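If you want to test the wsize theory, a quick experiment on the writing client would be to remount with a larger wsize - a rough sketch with placeholder export/mount names, assuming the server and client will negotiate it:

umount /mnt/archive
mount -t nfs -o vers=3,rsize=1048576,wsize=1048576 truenas:/mnt/tank/archive /mnt/archive
nfsstat -m    # confirm the rsize/wsize that were actually negotiated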

The 58-wide Z3 is likely the biggest contributor to the poor performance, as the laws of physics get in the way when you try to coordinate that many HDD arms at once to handle simultaneous reads and writes. We could try pushing out the transaction sizing as well, but that's only likely to delay the onset of the slow performance.

Regarding working with a vendor on additional storage, feel free to shoot me a DM - I promise I won't try to fly you anywhere.
 

greg_905

Dabbler
Joined
Jun 1, 2023
Messages
17
Since it's a sequential workload, the smaller wsize of 8K on the NFS client side might be hurting things, depending on what model the NVMe log devices are (assuming they're in the write stream here). The dataset recordsize can also be adjusted up to 1M fairly easily, although that will only impact new writes.


Regarding working with a vendor on additional storage, feel free to shoot me a DM - I promise I won't try to fly you anywhere.

I didn't think you could change the recordsize on a ZFS filesystem once created. Am I wrong there, or is this a new feature? I'll do some reading.
thanks HoneyBadger.
-g
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I didn't think you could change the recordsize on a ZFS filesystem once created. Am I wrong there, or is this a new feature? I'll do some reading.
thanks HoneyBadger.
-g
A zvol's block size can't be changed, but a filesystem/dataset's recordsize can. It only impacts newly written data (net new, or overwrites of existing data) though, so anything already written at 128K won't be rewritten at 1M granularity, for example.
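For example, something like this on the dataset (the dataset name here is a placeholder):

zfs get recordsize tank/archive      # check the current value
zfs set recordsize=1M tank/archive   # only data written after this point uses 1M records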
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703