Update Nvidia driver

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
Is it theoretically possible to update the NVIDIA drivers to the current version on SCALE?
Yes, I know: it's not supported, not recommended, won't survive an update, etc. But even in the Cobia RC the current generation of GPUs is not supported, and my workload requires a compute upgrade, so it's either update the driver or migrate to Proxmox.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Write up the entire problem.

Which GPU?
Which driver?
 

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
None of the latest Ada-generation NVIDIA GPUs are supported except for the L4 and L40. The whole RTX series and the L40S are missing.

Specifically, we are replacing RTX A4000 cards with the RTX 4000 Ada.
It is supported as of driver version 535.113.01.

Cobia comes with 535.54.03-2
Bluefin with 515.65.01
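
To make the gap concrete, here is a minimal Python sketch that compares the driver versions quoted above. The helper names (`parse_version`, `meets_minimum`, `same_branch`) are made up for illustration, and the required version is simply the number from this post, not something checked against NVIDIA's release notes.

```python
# Hypothetical helpers for illustration; the version numbers are the ones quoted in this thread.

def parse_version(text):
    """Turn '535.54.03-2' into (535, 54, 3), dropping any '-N' package revision."""
    upstream = text.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def meets_minimum(installed, required):
    """True if the installed driver is at least the required version."""
    return parse_version(installed) >= parse_version(required)


def same_branch(a, b):
    """True if both versions share the same major number (e.g. 535)."""
    return parse_version(a)[0] == parse_version(b)[0]


REQUIRED = "535.113.01"  # quoted above as the version that supports the RTX 4000 Ada
SHIPPED = {"Cobia": "535.54.03-2", "Bluefin": "515.65.01"}

for release, version in SHIPPED.items():
    print(release, version,
          "new enough" if meets_minimum(version, REQUIRED) else "too old",
          "same branch" if same_branch(version, REQUIRED) else "different branch")
```

The versions are compared as integer tuples rather than strings, because a plain string comparison would rank "535.54" above "535.113". With the numbers from this thread, both releases come out as too old, and only Cobia is on the same 535 branch as the required driver.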
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Hmm... I don't know if it's an easy update, and it would require testing.
How many systems?
Sounds like you have a team... any developer skills?
 

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
Three systems. The team has some limited developer skills. I guess we might hire someone; please just advise where to dig.
I take it that it's not just a matter of enabling apt and adding the right repository if we need apps to keep working?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
You can report a bug... it may get fixed in Dragonfish, or sooner if it's simple. Worthwhile to do anyway.

If you have a developer who can work with us on the Discord channel and do a special software build, it might work easily and just require you to test the special image.

If you are/were an iX customer, you'd contact your sales rep and we'd see what we can offer.
 

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
Thanks, I have created a suggestion ticket on Jira.
If anyone reads this thread and is interested, please vote for it to get iX's attention.

But I must say that, as it stands, Cobia exclusively supports NVIDIA GPUs that were already discontinued when RC1 was released. I am pretty sure that unless you update the NVIDIA drivers to the current version, by the time Cobia is released the only way to get a GPU it supports will be used hardware from eBay. It's kind of bad PR to release a system that only works with outdated, used hardware.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It is a difficult problem. The driver we used was the latest as of June 2023. We then spent 4 months testing the software to get the quality right. There will be a way to get an early version of the next release, but it won't have the same quality.

Nvidia should update their drivers for new hardware much earlier. The driver you wanted was released in September this year.
 

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
I understand, but they also refreshed their whole RTX and DC GPU lines in August. Also, the new driver has the same major version as the one in Cobia, so it shouldn't break anything. Please at least try it; it might just work with no changes.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Unfortunately, "just working" is not good enough for it to become part of a release. Each release needs about 4 weeks of internal testing and 1,000 community deployments to find any corner cases and even approach the quality required. If someone wants to develop a special build, we are OK with that as a test vehicle. All the tools are available.
 

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
What good is a thoroughly tested release that works exclusively with old equipment, totally outclassed, with 50% of the performance at 200% of the power draw compared to the current generation? Your own GPU server line will not work with Cobia. A large part of your development work on the new App system will be wasted, because it becomes unusable for any workload that relies on a GPU. I doubt any enterprise will scour eBay for weak, inefficient, but compatible equipment.

I understand it's a major inconvenience, but even software giants can't ignore NVIDIA's hardware update cycles. Please just pass my points on to your colleagues; support for the latest generation of NVIDIA hardware is really crucial.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
A large part of your development work on the new App system will be wasted, because it becomes unusable for any workload that relies on a GPU.
Could you give an example of such workloads? I wonder what on earth I could use on a NAS that would require a GPU ... I read about "transcoding", but don't use it. Is there anything else?

iX needs a clear business case to justify a huge effort for that task.
 

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
Transcoding is not just running Plex for your pirated movies. For instance, we ingest RAW video footage (2Gb/sec per camera, and even an interview has at least three of them running simultaneously) onto the flash array. It then goes through temporal noise reduction (a very GPU-intensive process), gets transcoded to the ProRes codec for storage, and goes to a multi-PB HDD array. When editors request it, it gets transcoded to x265 (a space-efficient lossy format) to be transferred over the internet to their remote workstations for editing, and the results get archived back to the HDD array in ProRes.

Before SCALE, this took a storage server, a transcoding server, 100Gb/s NICs, a 100Gb/s switch, fiber, and all the iSCSI shenanigans. Now it's all hyperconverged in a dual-EPYC Supermicro with 6 GPUs and two 9400-16e HBAs with 8 JBODs. Storage and data integrity are well managed by TrueNAS, while all workloads are containerised as simple Docker images with direct access to the storage at its full speed. Thanks to direct storage access (no iSCSI, no network), the containers are simple and the whole setup is almost stateless. The complexity and the qualifications required to maintain it are dramatically lower.

In our case specifically, x265 is being replaced industry-wide by the AV1 codec, so we must replace our GPUs, because the Ampere generation simply does not support it.
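
For readers who don't do this kind of work, the proxy step described above can be sketched roughly as follows. This is not our actual pipeline, just a minimal Python illustration that builds (and does not run) an ffmpeg command line. It assumes an ffmpeg build with NVENC enabled, which exposes the `hevc_nvenc` and `av1_nvenc` encoders; the function name, file names and bitrate are placeholders.

```python
import shlex

# Encoder names exposed by ffmpeg builds with NVENC enabled
# (assumption: your ffmpeg build and GPU actually provide them).
ENCODERS = {"hevc": "hevc_nvenc", "av1": "av1_nvenc"}


def proxy_command(src, dst, codec="av1", bitrate="8M"):
    """Build, but do not run, an ffmpeg command that transcodes a master file to a proxy."""
    if codec not in ENCODERS:
        raise ValueError(f"unsupported codec: {codec}")
    return ["ffmpeg", "-i", src,
            "-c:v", ENCODERS[codec],  # GPU video encoder
            "-b:v", bitrate,          # placeholder target bitrate
            "-c:a", "copy",           # pass the audio through untouched
            dst]


# Hypothetical file names, for illustration only.
print(shlex.join(proxy_command("master_prores.mov", "proxy_av1.mkv")))
```

The only thing that changes between the HEVC and the AV1 proxy is the encoder name, and that one name is exactly what ties the workflow to the GPU generation and therefore to the driver.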

I know VFX and 3D specialists also work with huge multi-PB data arrays with GPU workloads, and converging everything into one system reduces networking costs dramatically and simplifies the setup software-wise. I am not sure how AI and science people set up their workloads, but pretty much anything that can be handled by one GPU-filled system working with huge data arrays benefits from converging into one server: you just skip the whole high-speed network stack.

Also, one of our employees bought our old decommissioned 6PB storage server, fitted it with 2 GPUs, and as far as I know it is now a Chia crypto farm, which is also a good market, I guess.

Overall, the whole app and containerisation support in TrueNAS is really wasted without GPU support, and not supporting the whole professional line from NVIDIA, of all companies, is just not wise. By the way, starting with the Ada generation NVIDIA forbade its AIB partners from making any consumer "blower-style" GPUs, meaning all "gamer" cards are now oversized and will not fit in a rack server. And the whole latest-generation professional NVIDIA line, released even before TrueNAS RC1, is not supported.

I hope this justifies the effort, which in my opinion should not be that large, since it is a minor version update within the same long-term-support major version, meaning nothing should change in theory.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@mm0nst3r While your use case does want a later NVIDIA driver, this problem of constant upgrade is going to exist forever. If it's not a new Linux kernel to support some new CPU, Ethernet chip or fan controller, it is as you say, the GPU driver.

This shows the difference between stock Linux distros that a user can upgrade, and an appliance firmware for a specific purpose, (aka TrueNAS SCALE). iXsystems is attempting to produce stable, feature rich and open source products, that they can sell to the Enterprise with matching hardware. Too many support calls from Enterprise customers reduces both profit and the ability to develop the software further. (Because the developers are having to fix customer experienced bugs at priority over new features or software updates.)

iXsystems is not a first tier hardware builder that gets notified by chip vendors and would get pre-release software or support information. iX ends up being reliant on both the Linux distro they based SCALE on, as well as some other projects release schedules, (like OpenZFS & Samba).

The only real solutions to this are:
  • Wait for new SCALE release
  • Use SCALE nightlies, (which also might need to wait a bit)
Neither might be pleasant. But, I'd rather have a well tested and reliable release than a constantly moving target that some Linux distros have become.

I know it is not what you want to hear. In my opinion, push for the SCALE Nightlies to include the new NVIDIA driver. Then move any affected SCALE NAS to that Nightly train. Once you have the NVIDIA driver, there is no need to update further, unless to fix a specific bug. Well, until a production release of Cobia (or Dragonfish) that includes your NVIDIA driver is released.
 

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
@mm0nst3r While your use case does want a later NVIDIA driver, this problem of constant upgrade is going to exist forever. If it's not a new Linux kernel to support some new CPU, Ethernet chip or fan controller, it is as you say, the GPU driver.

I am not talking about constant driver updates at all. NVIDIA updates its professional GPU line once every 3-4 years. The driver used in Cobia RC1 only supports the RTX GPU line released in April 2021 and earlier, all of it discontinued at this point; the 2023 line is not supported. It means that in a couple of months, when leftover stock is sold, you will not be able to buy any new GPU compatible with Cobia (apart from an H100 for 30k), even if you need neither more compute power nor newer features. Updating to the current NVIDIA driver version will make Cobia useful for another 3 years without any further driver updates.

Well, until a production release of Cobia, (or DragonFish), that includes your NVIDIA driver is released.

I started this thread in the hope that iX will include the new driver in the release, because apparently the issue is not obvious to people who don't work with GPU workloads. I obviously do not plan to use an RC in production, but that doesn't mean we don't test as soon as it is available.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
@mm0nst3r Thanks for your detailed explanation. For the very first time, I see an absolutely convincing advantage of SCALE over CORE. If the driver issue can be fixed, that is.

Great you took the time - you might want to add that info to the Jira issue.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I am not talking about constant driver updates at all.
...
Understood.

But I AM talking about constant driver updates. Well, perhaps 5 or so forum posts per year have been about unsupported hardware. Like Intel 12th or 13th generation CPUs with the P & E cores. Or 2.5Gbps Ethernet chip support. Or new AMD CPU thermal sensor reads.

Later kernels will have such support baked in, but people want to use their hardware immediately. And they are annoyed that such supposedly trivial things are not already supported.

...
Nvidia updates its PRO GPU line once in 3-4 years. The driver used in Cobia RC1 only supports RTX GPU line released in April 2021 and earlier - all discontinued at this moment, the 2023 line is not supported. It means you will not be able to buy any new GPU compatible with Cobia (apart from H100 for 30k) in a couple of months when stock leftovers are sold. Even if you don't need neither compute power nor newer features. The update to the current NVIDIA driver version will make Cobia useful for another 3 years without any further driver updates.

I started this thread in hope that IX will include the new driver in release, because apparently the issue is not obvious to people not working with GPU workloads. I obviously do not plan to use RC in production - but it doesn't mean we don't test as soon as it is available.
Now your explanation makes more sense.

I do find it hard to believe that NVIDIA does not have some recent graphics cards that are supported by the old driver. Perhaps your reference is directly to the PRO GPU line... and not some of the consumer cards.

But, I don't follow NVIDIA GPUs, other than they exist and have NVIDIA supplied drivers. So NVIDIA could very well have "moved on".
 

mm0nst3r

Dabbler
Joined
Sep 5, 2021
Messages
33
Though the driver in RC1 does support the current generation of NVIDIA consumer cards, since 2022 NVIDIA has forbidden its AIB partners from making "blower-style" consumer GPUs. They have all been oversized since then, made specifically so that no NVIDIA consumer card will fit into any rack system or case. If the current driver stays in the release, it means no NVIDIA GPUs in TrueNAS for any serious enterprise setup until the Dragonfish release.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
@mm0nst3r
We've decided to look at whether this is possible for 23.10.1.
It will depend on whether the driver works well with the current Linux version.
If it gets into this version, it would also be available as a Nightly.
@Kris Moore - this is the GPU driver thread.
 