ESXi ECC detection in TrueNAS

tess

Cadet
Joined
Jul 2, 2023
Messages
4
I am currently testing a few hypervisors for my server build. I have tested Proxmox and XCP-ng; both pass ECC data through to the VM, and TrueNAS detects it.

However, ESXi, the hypervisor recommended by iXsystems, does not appear to do this. As far as I know, this means that with uncorrectable errors, TrueNAS will just happily keep writing data, which would obviously be less than ideal. Is it possible to enable ECC data passthrough? Or is my worry unfounded?
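For anyone who wants to compare what each hypervisor exposes to the guest, here is a minimal sketch that looks for the "Error Correction Type" field in the output of `dmidecode -t 16` (the SMBIOS Physical Memory Array record). It assumes the guest reports ECC through SMBIOS; whether TrueNAS itself relies on this field is an assumption on my part, and the function names and sample text below are only illustrative.

```python
import re

def ecc_types(dmidecode_text: str) -> list[str]:
    """Return the 'Error Correction Type' value of every memory array found."""
    pattern = r"^\s*Error Correction Type:\s*(.+)$"
    return [m.group(1).strip()
            for m in re.finditer(pattern, dmidecode_text, re.MULTILINE)]

def reports_ecc(dmidecode_text: str) -> bool:
    """True if at least one memory array advertises some form of ECC."""
    non_ecc = {"none", "unknown", "other"}
    return any(t.lower() not in non_ecc for t in ecc_types(dmidecode_text))

# Illustrative sample only, not captured from a real machine.
SAMPLE = """\
Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 64 GB
"""

if __name__ == "__main__":
    print(ecc_types(SAMPLE), reports_ecc(SAMPLE))
```

To try it on a real guest, feed it the text produced by running `dmidecode -t 16` as root inside the VM under each hypervisor and compare the results.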
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Moderator note: moved to a more appropriate forum

With uncorrectable errors, all of Proxmox, XCP-ng, and ESXi do nothing useful; TrueNAS is unable to "react" to ECC errors, and your sole recourse is to halt the machine and replace your memory with properly functioning memory. The ECC memory event reporting offered by ESXi exists merely because a standalone bare metal host might not have other mechanisms to catch and report it. In a hypervisor environment, VMware expects that system event log errors will be reported upstream to vSphere, or will be detected and reported by the platform IPMI/BMC/DRAC/iLO in whatever manner you want.

If Proxmox and XCP-ng are implementing their own reporting capability for this, good on them, because obviously lots of hobbyists are running hypervisors on platforms where that should never be done. This is probably not really an advantage of those hypervisor platforms. You're better off having a platform controller if you're running a hypervisor, as it can handle a variety of issues including memory error reporting.
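As a rough illustration of letting the platform controller do the reporting, the sketch below filters memory ECC entries out of text in the pipe-separated layout that `ipmitool sel list` typically prints. The exact sensor and event wording varies by BMC vendor, so the field positions, the `memory_ecc_events` helper, and the sample log are all assumptions for illustration.

```python
def memory_ecc_events(sel_text: str) -> list[dict]:
    """Pick memory ECC entries out of pipe-separated SEL text."""
    events = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not a SEL record line
        record_id, date, time, sensor, event = fields[:5]
        if not sensor.lower().startswith("memory"):
            continue
        low = event.lower()
        # Check "uncorrectable" first: it contains the word "correctable".
        if "uncorrectable" in low:
            kind = "uncorrectable"
        elif "correctable" in low:
            kind = "correctable"
        else:
            continue
        events.append({"id": record_id, "when": f"{date} {time}",
                       "sensor": sensor, "kind": kind})
    return events

# Illustrative sample only; real BMCs word these entries differently.
SAMPLE_SEL = """\
   1 | 07/02/2023 | 10:15:02 | Memory #0x01 | Correctable ECC | Asserted
   2 | 07/02/2023 | 10:20:41 | Fan #0x30 | Lower Critical going low | Asserted
   3 | 07/03/2023 | 02:11:09 | Memory #0x01 | Uncorrectable ECC | Asserted
"""

if __name__ == "__main__":
    for ev in memory_ecc_events(SAMPLE_SEL):
        print(ev["when"], ev["sensor"], ev["kind"])
```

In practice most BMCs can send email or SNMP alerts on their own, so a script like this is only useful if you want to fold the event log into some other monitoring setup.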

I may be able to ask some contacts in VMware's Office of the CTO about this; I like playing "stump 'em" but this feels like the kind of thing where they won't be anxiously running to the development team to suggest it as a new feature for ESXi.
 

tess

Cadet
Joined
Jul 2, 2023
Messages
4
Sorry I completely overlooked this forum section! Thanks for moving it.

I see. My system does have IPMI/BMC, which will notify me of ECC errors, and I did verify that ESXi detects the ECC memory correctly.

I do realize iXsystems recommends ESXi, so it will probably still be the better solution. It does feel "wrong" though. I do get that the reporting is more about telling the user they should replace/tune the RAM etc. But TrueNAS has good facilities to deal with ECC and therefore I think letting a hypervisor notify a VM is a good idea.

Now I'm no expert on hypervisors so maybe that's not a good idea at all. But if you do want to ask them, hey I would not mind hearing what they think ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Sorry I completely overlooked this forum section! Thanks for moving it.

No worries. You were actually posting in the old FreeNAS section, where your post wasn't going to get much attention.

It does feel "wrong" though. I do get that the reporting is more about telling the user they should replace/tune the RAM etc.

That's the basic bit right there.

But TrueNAS has good facilities to deal with ECC and therefore I think letting a hypervisor notify a VM is a good idea.

Possibly, but it is also dependent on operating systems doing something rational with the information. VMware is still largely Windows focused, even if it is also the best hypervisor for supporting lots of other more obscure operating systems. It would be a much more compelling argument if the VM could do something meaningful with the data, but since correctable errors are corrected, and uncorrectable errors lead to a situation where there isn't really a good resolution, it seems like a bit of a "so-what".
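To make the "so-what" concrete, here is a minimal sketch of about all a Linux-based OS can do with the information: read the kernel's EDAC counters and tell the operator what to do. It assumes a Linux system that exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc`; a VM will often expose no memory controllers there at all, and the `verdict` wording is purely illustrative.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; may be empty or absent inside a VM.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_edac_counts(root: Path = EDAC_ROOT) -> dict[str, tuple[int, int]]:
    """Map each memory controller to (correctable, uncorrectable) counts."""
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable, skip this controller
        counts[mc.name] = (ce, ue)
    return counts

def verdict(counts: dict[str, tuple[int, int]]) -> str:
    """Summarize the counters as advice for the operator."""
    if not counts:
        return "no EDAC data"
    if any(ue > 0 for _, ue in counts.values()):
        return "uncorrectable errors: halt and replace memory"
    if any(ce > 0 for ce, _ in counts.values()):
        return "correctable errors: already corrected, plan a replacement"
    return "clean"

if __name__ == "__main__":
    print(verdict(read_edac_counts()))
```

Note that both non-clean outcomes end in "a human replaces the RAM", which is the same result a BMC alert gets you.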

Now I'm no expert on hypervisors so maybe that's not a good idea at all. But if you do want to ask them, hey I would not mind hearing what they think ;)

I pained a bunch of Equinix folks in the old days by asking them tough questions about their "superior" building construction. It used to be that they converted standard industrial park buildings into data centers. The site that eventually became "DC3" started out in 21711 Filigree Court as DC1, with a parking lot and a walk-in street level entryway. One of my first questions for them was to go into depth when they boasted about their superior building construction. When pushed, I pointed out that I had video of a tornado (taken a mile away at AOL HQ), and wanted to know how the facility had been hardened. They eventually admitted that the building was "up to local building codes" which came as little shock to me as it was a single-story tilt-up steel concrete slab on grade building. That means, among other things, "potentially vulnerable to flooding". It is also directly in the glide path to Dulles, which means there are planes overhead constantly.

I probably was a great annoyance to another salesman who was trying to move space at DC3/21715 when I asked about fire suppression capabilities in the battery bank in front of a customer they were trying to sell to.

I like asking hard questions. :smile:
 

tess

Cadet
Joined
Jul 2, 2023
Messages
4
I like that mentality! I do tend to do the same. If you ever ask and get an answer, let me know! For now I'll keep testing some things. I really do like ESXi, so that'll probably be it. But I'll be honest, both Proxmox and XCP-ng worked nicely too. XCP-ng had some trouble with SR-IOV; Proxmox mostly "just" worked.

But ESXi just makes it all so easy. And they have the seal of approval of iXsystems.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
But ESXi just makes it all so easy. And they have the seal of approval of iXsystems.

Well, actually, that may or may not be true. The push for virtualization really started here in the forums at a time when iXsystems wasn't really all that interested in it. ESXi has been very solid for virtualization on certain platforms for at least a decade and I've been working to keep that option and the proper way to do it clear for all that time now.
 