Hardware recommendation for 700 concurrent FTP users

geek0000

Dabbler
Joined
Feb 19, 2014
Messages
13
Currently I have TrueNAS-13.0-U5.3 on an Intel(R) Atom(TM) CPU C3558 @ 2.20GHz with 32GB ECC RAM and 4x 4TB WD RED HDDs configured in RAID 1+0.

The organization uses a garbage ERP software that dumps files to the storage using FTP (a large number of small files, around 500KB-2MB per file).
There are about 700 concurrent users, which maxes out the CPU at 100%. Despite my protesting that this device is not capable of handling so many concurrent connections, management had me change "simultaneous clients" to 700. The result is not surprising.

However, I wanted to get expert recommendations here before I blindly go out shopping.

1) Is the 100% CPU usage indeed due to the high FTP load? (seems evident it is from the TOP command - screenshot attached)
note: system was restarted 5 minutes before taking the screenshots
2) What kind of CPU/RAM/Storage should I get for this purpose?
3) Could the slow mechanical HDDs be contributing to the maxing out of CPU due to I/O wait?

Anything else I should look for before I decide on switching out hardware?

Thanks!
 

Attachments

  • Truenas-CPU-usage-graph.png (27.2 KB)
  • Truenas-CPU-usage.png (52.9 KB)
Joined
Jun 15, 2022
Messages
674
First you should read the TrueNAS resources, because ZFS on RAID will typically lose data permanently, eventually. You have bigger things to worry about that will impact the system far more substantially.

Second you should consider getting a new job as management is likely to run the location you're at into the ground eventually, and it appears good employees are likely to seek better employment elsewhere and be compensated accordingly. If you're a "good" employee you should consider the benefits of working with people you can learn from and grow with as a team and become a "great" employee.

Third, I don't see a problem with CPU usage as long as you have adequate cooling, which is unlikely though possible. Many tasks (such as video editing) use 100% of several resources; it translates into the user(s) waiting on the computer instead of the computer waiting for something to do, which is and has been a normal situation in IT since the invention of the computer. People may have to wait, so what? That's not your problem; that's a resource allocation decision that falls back on management. Until management decides there's a problem and how much it's costing them (called "opportunity cost"), how can a viable budget be determined for solving the problem?

Without constraints for solving a problem, the problem is not understood. For example, if I tell you I need a car to go to work and back and want you to determine what it's going to cost, my question is flawed: it makes the assumption I need a car, but public (or other) transportation may meet my needs better. Notice we haven't addressed what class of vehicle is desired yet, because the problem isn't fully defined and the constraints therefore aren't known. Shopping for any transportation is destined to net a poor outcome without knowing all the requirements, or at least the important ones.

Regardless, the scope of the system required hasn't been defined. Management is asking you to Band-Aid the situation and will likely hold you accountable for any failures or sub-optimal results and not credit you for any successes. There appear to be several high-risk points of failure in the system described, and replacing it with a faster system won't mitigate that. We want you to be successful in the big picture. What do you want?
 

geek0000

Dabbler
Joined
Feb 19, 2014
Messages
13
First you should read the TrueNAS resources, because ZFS on RAID will typically lose data permanently, eventually.
I am sorry. I think I misspoke. What I meant was that I have a ZFS striped mirror (which I'm guessing is the approximate equivalent of RAID 1+0).
People may have to wait, so what, that's not your problem, that's a resource allocation decision that falls back on management. Until management decides there's a problem and how much it's costing them (called "opportunity cost"), how can a viable budget be determined for solving the problem?
Well, they don't want people to wait at all. They're complaining that everything is slow. When I checked the TrueNAS box, it was itself very slow to respond, and on seeing constant 100% usage I figured the hardware is not suited for so many concurrent connections and thought replacing this box with a higher-spec one would solve the problem.
 
Joined
Jun 2, 2019
Messages
591
Insecure FTP? Welcome to the 1970’s!

I’m guessing there is no backup system, disaster recovery, or incident response plan.
 

geek0000

Dabbler
Joined
Feb 19, 2014
Messages
13
Insecure FTP? Welcome to the 1970’s!
Yes! Hence I started out by saying "garbage ERP". They send user-uploaded files to the FTP server (instead of using object storage or something modern).
They won't even switch to SFTP because "our software does not support SFTP"
I’m guessing there is no backup system, disaster recovery, or incident response plan.

Nope. Zilch, zip, nada. I've been begging for a secondary TrueNAS box for replication for over 2 years, but I was shut down.
 
Joined
Jun 2, 2019
Messages
591
Time to find a new job
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
3) Could the slow mechanical HDDs be contributing to the maxing out of CPU due to I/O wait?
It looks like the big time sink is just FTP. 96% CPU in userland... How did a C3558 end up doing this horror show of a job? That said, I wouldn't dream of implementing something like this with spinning rust; it's just not worth it to "save" a few bucks (at the cost of endless hours waiting or dealing with issues).
2) What kind of CPU/RAM/Storage I should get for this purpose?
Good question, 700 users is a lot, even if the basic workload is simple. C3000 goes up to 16 cores. Xeon-D is beefier and might also work.
Any downtime you can use to throw a synthetic load on the thing (i.e. a known number of representative clients) and ramp it up to figure out what the limit is with a C3558? If not, you're kinda stuck going big or risking the problem remaining...
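A rough sketch of what such a ramp test could look like in Python, using only the standard library. The host name, credentials, file count, and client levels below are all placeholders made up for illustration (none of them come from this thread), and one thread per simulated client is only an approximation of the real ERP traffic. Treat it as a starting point, not a proper benchmark.

```python
import io
import os
import time
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP

# Placeholder values for illustration only; replace with a test account on your box.
HOST = "truenas.example.lan"
USER = "loadtest"
PASSWORD = "change-me"


def ftp_client(client_id, files=20, size=1_000_000):
    """Simulate one client: log in, upload `files` small files, log out."""
    payload = os.urandom(size)  # ~1 MB, inside the 500KB-2MB range described above
    with FTP(HOST, timeout=30) as ftp:
        ftp.login(USER, PASSWORD)
        for i in range(files):
            name = f"loadtest_{client_id}_{i}.bin"
            ftp.storbinary(f"STOR {name}", io.BytesIO(payload))
    return files * size  # bytes uploaded by this client


def run_level(clients, client_fn=ftp_client):
    """Run `clients` simulated clients at once and return aggregate MB/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        total_bytes = sum(pool.map(client_fn, range(clients)))
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6


def ramp(levels=(25, 50, 100, 200, 400, 700), client_fn=ftp_client):
    """Step the client count up and record throughput at each level."""
    return [(n, run_level(n, client_fn)) for n in levels]


# Example use (needs a reachable FTP server and a writable test directory):
#   for clients, mb_per_s in ramp():
#       print(f"{clients:4d} clients: {mb_per_s:8.1f} MB/s")
```

Watch the CPU graph on the NAS while this runs. The client count at which throughput stops rising (or falls) while CPU sits at 100% is a reasonable estimate of what the C3558 can actually sustain.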
Nope. Zilch, zip, nada. I've been begging for a secondary TrueNAS box for replication for over 2 years, but I was shut down.
Are you actually going to get any budget for even the primary?
 

geek0000

Dabbler
Joined
Feb 19, 2014
Messages
13
It looks like the big time sink is just FTP. 96% CPU in userland... How did a C3558 end up doing this horror show of a job?
It was NOT. It was deployed for just SQL dumps of a main database and a little bit of CIFS shares. One day they say "Hey, this has FTP right? Imma use the hell out of it. :tongue:"
C3000 goes up to 16 cores. Xeon-D is beefier and might also work.
Basically, lots of cores + good clock speed. Duly noted.
Are you actually going to get any budget for even the primary?
Gonna tell them, here's what we need (will make a huge spec system), take it or live with your problems. I believe they will allot it. They only allot budget to something in IT when shit hits the fan.

What about the disks? Do you think I need all-flash storage, or will a striped mirror of 4 WD Red spinning disks suffice?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
One day they say "Hey, this has FTP right? Imma use the hell out of it. :tongue:"
Run away from anyone excited about FTP servers, it's never a good sign.
 

Tony-1971

Contributor
Joined
Oct 1, 2016
Messages
147
Hello,
With a load average of 96, you probably want an Epyc CPU with a lot of cores.
But the next bottleneck will probably be having only two vdevs of HDDs.
Best Regards,
Antonio
 

Tony-1971

Contributor
Joined
Oct 1, 2016
Messages
147
A 1-minute load average of 96 means that, on average, about 96 processes wanted to run on a 4-core system (so roughly 92 of them were waiting for a free core). The other values are the 5-minute average (128) and the 15-minute average (212).
That is a very high load average: if this is normal, then even 16 cores are not enough (but better than having only 4).
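To put numbers on that, here is a small Python sketch that divides the load averages quoted above by the core count. The 16-core case is the C3000 upper limit mentioned earlier in the thread. Treating "load divided by cores" as a queue depth is a simplification: it ignores how long each process actually runs.

```python
# Load averages quoted in this post (1, 5 and 15 minutes).
LOAD_AVERAGES = {"1 min": 96, "5 min": 128, "15 min": 212}


def load_per_core(load, cores):
    """Average number of runnable processes competing for each core."""
    return load / cores


for cores in (4, 16):
    for window, load in LOAD_AVERAGES.items():
        print(f"{cores:2d} cores, {window:>6}: {load_per_core(load, cores):.2f} per core")
```

A value near 1.0 per core means the CPU is just about keeping up. Even on 16 cores, these numbers stay well above that.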

With a striped mirror of NVMe SSDs you have a lot more IOPS than with HDDs, so it is better.

Best Regards,
Antonio
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
What's the network card in this? Can you post your full existing hardware, please, as per forum rules?
Are these users local (LAN), or are (some of) them remote (WAN-based)? I am thinking of bandwidth into the box, packet loss and retries just clogging things up.
 

geek0000

Dabbler
Joined
Feb 19, 2014
Messages
13
What's the network card in this? Can you post your full existing hardware, please, as per forum rules?
Intel(R) Atom(TM) CPU C3558 @ 2.20GHz, Quad LAN with Intel® C3000 SoC, 1GbE, 32GB RAM
Are these users local (LAN), or are (some of) them remote
All local users. They connect to one server that is in the same LAN as the TrueNAS box.

I can look at the network graph under full load and see how much it's being used. But last I checked, it didn't even max out 1G.
 

Tony-1971

Contributor
Joined
Oct 1, 2016
Messages
147
With 700 concurrent users, if you want a network speed of 1 MByte/s for each user, you need about 700 MByte/s (roughly 5.6 Gbit/s) in total, so you must have a 10Gb network.
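The arithmetic behind that, as a short Python sketch. It assumes decimal units (1 Gbit/s = 1000 Mbit/s) and ignores protocol overhead, so real-world numbers will be somewhat worse.

```python
def required_gbit(users, mbyte_per_user):
    """Aggregate link speed in Gbit/s needed to give every user `mbyte_per_user` MB/s."""
    return users * mbyte_per_user * 8 / 1000


def per_user_mbyte(link_gbit, users):
    """MB/s each user gets if `users` share a link of `link_gbit` Gbit/s equally."""
    return link_gbit * 1000 / 8 / users


print(f"700 users at 1 MB/s each need {required_gbit(700, 1.0):.1f} Gbit/s")
print(f"On 1GbE, 700 users get {per_user_mbyte(1, 700):.3f} MB/s each")
print(f"On 10GbE, 700 users get {per_user_mbyte(10, 700):.3f} MB/s each")
```

So a single 1GbE link leaves each of 700 simultaneous users with well under 0.2 MB/s, while 10GbE gives each about 1.8 MB/s, assuming the disks and CPU can keep up.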
Best regards,
Antonio
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I would be looking at the following:
1. Much better CPU, more and faster cores. More cores is more important than faster cores
2. More memory - probably least important issue. Make sure whatever you get has the capability of more memory - as memory is easy to add
3. Faster disks, either smaller HDD's in lots of vdevs OR largeish SATA SSD's probably in a couple of vdevs. NVMe would be better than SATA SSD, but even SATA SSD should be a major improvement. Note that a much faster disk subsystem will need multiple 1Gb links or 10Gb+ to feed it
4. 10Gb NIC - of course this depends on the overall network infrastructure. Perhaps combining multiple 1Gb into a LAGG interface might be worth it.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
3. Faster disks, either smaller HDD's in lots of vdevs OR largeish SATA SSD's probably in a couple of vdevs
Nah, SSDs all the way for this one. For such a small total capacity, there's just no point in using HDDs.
The only catch is that SATA SSDs these days are either hilariously expensive or ridiculously dodgy:
  1. Samsung 870 Evo have had major reliability issues
  2. WD Blue have seen sneaky component replacements that may hinder performance (possibly even moved from DRAM to DRAM-less architecture)
  3. MX500 have had major reliability issues and sneaky component replacements (though mostly maintaining performance)
That leaves enterprise SATA, which gets expensive quickly. If you can get away with a larger platform (e.g. something 1U with full 2.5" U.2 front bays, à la Dell R6515 or R650 or Supermicro A+ line), it's the better option overall for more cores, more memory, and more SSD options.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I was thinking a lot of vdevs for HDD's - but yeah - SSD's would be a much better idea
 

geek0000

Dabbler
Joined
Feb 19, 2014
Messages
13
Thanks a lot for your recommendations
I'll go with 4x PCIe NVMe SSDs in that case.
Does any specific popular model come to mind?
 