TrueNAS crash/reboot: is it a HW problem or a corrupted OS?

BakaPhoenix

Cadet
Joined
Mar 18, 2023
Messages
6
Hello,
This week I found my TrueNAS unreachable, so I plugged in a monitor and saw that it was stuck during boot on an i915 VGA error. I fixed that by adding "i915.modeset=0" to the GRUB boot options and it booted, but then it went into a kernel panic related to RAM.
I then checked the RAM and found that some of it might be defective. I removed the problematic RAM and now the system boots, but the NAS keeps rebooting after some hours of activity. I thought it might be hitting RAM limits, so I tried adding 2 more mismatched sticks to see if that was the problem, but it still reboots. Sometimes I also get an error message about Python, and nothing works apart from the web UI until I reboot again.

Code:
Core files for the following executables were found: /usr/bin/python3.9 (Sat May 6 07:47:24 2023). Please create a ticket at https://ixsystems.atlassian.net/ and attach the relevant core files along with a system debug. Once the core files have been archived and attached to the ticket, they may be removed by running the following command in shell: 'rm /var/db/system/cores/*'.
2023-05-06 08:02:21 (Europe/Rome) 


I went to /var/log/messages and grabbed the logs from the last restart (nothing is logged about why it restarted, I only see the new boot logs) and attached them here.
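For anyone who wants to skim a log like this quickly, here is a minimal Python sketch that pulls out lines that often hint at hardware trouble. The keyword list and the helper names (`find_suspect_lines`, `scan_file`) are my own assumptions for illustration, not an official or complete set of failure signatures, and a clean result does not prove the hardware is fine.

```python
import re
from typing import Iterable, List, Tuple

# Assumed patterns: kernel messages that commonly accompany RAM/CPU/GPU faults.
# This list is illustrative only; adjust it to whatever your own logs contain.
SUSPECT = re.compile(
    r"kernel panic|machine check|mce:|hardware error|segfault|"
    r"general protection|out of memory|i915",
    re.IGNORECASE,
)


def find_suspect_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line matching a suspect pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPECT.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path: str) -> List[Tuple[int, str]]:
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

Calling `scan_file("/var/log/messages")` returns the matching line numbers and text, which makes it easier to see what was logged just before each new boot sequence starts.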


HW Specs are:

Intel i7 3770k
Asus Maximus V Gene
Nvidia GTX 1650 Super
4x8 GB DDR3 1866 MHz RAM (in the logs it was running with 2x8 + 2x4 with the frequency lowered to 1300 to check whether there was some instability at higher RAM frequencies)
64 GB Samsung 830 as boot Drive
4x4TB Raidz1 Pool
1x2TB Time machine pool
Corsair RM550x PSU

Can someone confirm that the problem is HW? I tried lowering the RAM speed and raising VCCDDR to 0.85v to give the RAM some more stability, but it didn't help.
I do believe the problem with the iGPU is somehow related to everything, as the CPU is slowly dying, but before I go and buy new parts I want to be sure the problem is HW. If I replace stuff I'm going to upgrade some of it as well, so the monetary investment is going to be higher than expected.
 

Attachments

  • Untitled.txt
    81.6 KB · Views: 158

indivision

Guru
Joined
Jan 4, 2013
Messages
806
How long was it running for before going down?

If it's a new setup, I would say that suggests that it was never stable.

How do you know that the "CPU is slowly dying"? That sounds pretty grim and unlikely to result in a stable system. Maybe there is something even more basic that is damaging multiple components like your RAM (bad power source, bad PSU, MB issues).

My hunch is that, yes, you're trying to run software on a broken computer.
 

BakaPhoenix

Cadet
Joined
Mar 18, 2023
Messages
6
I set it up at the start of February, and it ran until it went down and wouldn't boot anymore because it got stuck during boot with something like "vfio-pci vgaarb: changed vga decodes".

Now I've tested another set of RAM and it has been running without problems for almost 24 hours. Maybe the RAM was defective (and since the iGPU uses RAM as VRAM, it was failing too).

The RAM I replaced also failed a memtest, so most likely the problem was the RAM.
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
If you've been running with bad RAM for a long time, there's no telling what errors have been written to disk after being stored in that same bad RAM.

I would consider the state of the OS to be undefined and any data stored to your drives to be suspect.
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
If you've been running with bad RAM for a long time, there's no telling what errors have been written to disk after being stored in that same bad RAM.

I would consider the state of the OS to be undefined and any data stored to your drives to be suspect.

This.

Especially with the RAM not being ECC?!
 