12.0-U8 Randomly locks up | Threadripper - ASrock Mobo - 256G

lightingman117

Dabbler
Joined
Mar 15, 2022
Messages
16
TR-3970x, ASRock TRX40D8-2N2T, 256GB-SS32GBDDR4ECC3200 @ 3200 (slight OC), 512TB 970Pro (mirror), 1000w Redundant Plat, SAS3008, X710, 12x 16TB EX16

I recently upgraded to U8 to see if some of the AMD-specific bug fixes would help.

I have run HCI MemTest up to 300%.
I have run Windows 10 on it with TM5 (Test Mem, extreme config).
The latest BIOS/IPMI available from ASRock is installed.
Usage is minimal, ~100Mib/s at most.

I have no crashes unless running FreeBSD/TrueNAS, so to me, the hardware seems in order.

---

I added tunables:
watchdogd_enable|no|RC|disable IPMI watchdog
kern.corefile|/var/coredumps/%U/%N.core|SYSCTL|store logs in debug file

But neither seems to do anything.
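One way to tell whether those tunables actually took effect is to read the values back from a shell. The sketch below is only a suggestion, not something from the thread: `show_sysctl` and `show_daemon` are hypothetical helper names, and it assumes `sysctl` and `pgrep` are on the PATH. As far as I know, `kern.corefile` only sets the naming pattern for core files of crashed user processes, so it would not be expected to record a kernel panic or a hard lockup.

```shell
#!/bin/sh
# Read a sysctl back; fall back to a note if the OID or the sysctl
# binary is missing (for example on a non-FreeBSD machine).
show_sysctl() {
    value=$(sysctl -n "$1" 2>/dev/null) || value="(not available)"
    echo "$1 = $value"
}

# Report whether a daemon with exactly this name is running.
# pgrep exits 0 on a match and 1 on no match; anything else
# (including pgrep being absent) is reported as unknown.
show_daemon() {
    pgrep -x "$1" >/dev/null 2>&1 && rc=0 || rc=$?
    case $rc in
        0) echo "$1: running" ;;
        1) echo "$1: not running" ;;
        *) echo "$1: unknown" ;;
    esac
}

show_sysctl kern.corefile   # should echo the pattern from the tunable
show_daemon watchdogd       # "not running" would suggest the RC tunable applied
```

If `kern.corefile` still shows the default or `watchdogd` is still running after a reboot, the tunables were probably not applied at all, which is a different problem from them having no effect on the lockups.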

---

I've pored over the logs and don't see anything unusual.
When it locks up, it locks up so hard that nothing works and a hard reboot is required.
It seems to happen every 1 to 4 weeks.

What logs are safe to upload (no user data) and beneficial?
 

Attachments

  • TrueNAS2_boot log.txt
    57.5 KB · Views: 221

1337Hacker

Dabbler
Joined
Oct 22, 2017
Messages
27
Jesus that's a system, especially that 512TB 970Pro :wink:

I take it you hot swapped a drive on the 5th?
When did you hard reset the box?

Pull up the terminal and post what you see
Code:
cd /data/crash
ls -al
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
Did you make the BIOS changes for Ryzen CPUs?
Disable AMD Cool&Quiet, ErP Ready, and Global C-States.
 

lightingman117

1337Hacker said:
Jesus that's a system, especially that 512TB 970Pro :wink:

I take it you hot swapped a drive on the 5th?
When did you hard reset the box?

Pull up the terminal and post what you see
Code:
cd /data/crash
ls -al

Yeaaa, *sheepishly grins* New to TrueNAS, a bit overkill I guess. 128GB drives probably woulda been 'fine' for the boot-pool/system drive.

Disk6: probably. I had an unused 2nd vdev/tank that was waiting on new drives to bring it up to a full 6 drives. I don't remember exactly when that occurred, though.
Crash: Apr 6 08:31:24 is when it came back up. So sometime between the config load at midnight and ~8am it crashed.

I ran the code. I don't see how I didn't find this, as I poked in all the folders in shell (timidly tho, as I'm new to FreeBSD/Linux CLI stuff).
I compressed the folder and am trying to figure out how to move it onto my computer for viewing instead of in MC. So far no luck with copying to the /var/log directory and downloading debug logs, copying to the /mnt/storage repo, or copying to a remote host.

Edit: Copying to /var/log and downloading the debug logs worked (I was looking in the wrong folder).
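For anyone following along, the "compress the folder and drop it somewhere downloadable" step can be sketched roughly like this. `bundle_crash_dir` is a hypothetical helper name and the output filename is made up; only `/data/crash` and `/var/log` come from the thread.

```shell
#!/bin/sh
# Pack a crash directory into one .tar.gz so it can be copied off the box.
# Usage: bundle_crash_dir <source-dir> <output-archive>
# Example (output name is an assumption):
#   bundle_crash_dir /data/crash /var/log/crash-files.tar.gz
bundle_crash_dir() {
    src=$1
    out=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    # -C switches to the parent directory so the archive stores a
    # relative path ("crash/...") instead of an absolute one.
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
}
```

With the archive sitting in /var/log it should come along with the debug download, which matches what ended up working above.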

LarsR said:
Did you make the BIOS changes for Ryzen CPUs?
Disable AMD Cool&Quiet, ErP Ready, and Global C-States.

Shortly after I made this post I found a similar thread to this one discussing C-States:

I found Global C-States in the BIOS, but didn't think to do Cool & Quiet or ErP Ready.
I will do this now.

Been up 5 days, but that's normal.
I also moved the 2x 50TB vdevs over to another, older Intel-based TrueNAS build for the time being (they need to be in production).
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Given it's a server and not a games machine, I would remove the OC on the memory for even a fraction of increased stability.
 

lightingman117

NugentS said:
Given it's a server and not a games machine, I would remove the OC on the memory for even a fraction of increased stability.

I would typically agree, server stability > performance.
But thus far, no data indicates it is a memory issue.

And thus ensues a "what is technically an OC" conversation.
[Who says what OC is? AMD, ASRock QVL, Memory QVL?]

For completeness:
Mobo QVL says 128GB ECC is okay up to 3200, but 256GB should drop down to 2666.

BIOS reports PN: W724GU44J2320NA for the RAM; the sticks say SS32GBDDR4ECC3200 SST.
Neither is on the QVL, and it is what the MFG sent us, so there's not much I can do there.

I am running 3000/1500 (MCLK/IF).
Tested with various memory programs for many days to ensure stability.
I don't see any reason to change if there's no data to suggest this is the culprit.

For the record, I have a nearly identical system (minus the HBA) running Win Server @ 3200/1600 with perfect uptime. It is running a custom FFmpeg-based video server.
 

lightingman117

1337Hacker said:
Pull up the terminal and post what you see
Code:
cd /data/crash
ls -al


Here are the files moved/extracted to my machine
1650303209665.png

And the "textdump.tar.0.gz" file with some heft.
1650303276452.png
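If you'd rather not unpack the whole thing, something like the following should print just the panic message straight from the archive. This is a sketch: `show_panic` is a hypothetical helper name, and it assumes the archive holds a member whose name ends in `panic.txt`, as the extracted files suggest.

```shell
#!/bin/sh
# Print panic.txt from a gzip-compressed textdump archive to stdout.
# Usage: show_panic <archive>
# Example: show_panic /data/crash/textdump.tar.0.gz
show_panic() {
    archive=$1
    # Look the member up first so a leading "./" in the stored path
    # does not matter.
    member=$(tar -tzf "$archive" 2>/dev/null | grep 'panic\.txt$' | head -n 1)
    [ -n "$member" ] || { echo "no panic.txt in $archive" >&2; return 1; }
    # -O writes the extracted member to stdout instead of to disk.
    tar -xzOf "$archive" "$member"
}
```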


Panic.txt
Code:
VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
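The two numbers in that assertion can be reproduced with plain shell arithmetic, which at least shows what kind of value tripped it. This sketch assumes the usual OpenZFS constants (SPA_MAXBLOCKSHIFT = 24, i.e. a 16 MiB maximum block, and SPA_MINBLOCKSHIFT = 9), which are not quoted anywhere in the thread, and a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Assumed OpenZFS constants (not quoted in the thread).
SPA_MINBLOCKSHIFT=9
SPA_MAXBLOCKSHIFT=24

# Right-hand side of the assertion: SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT.
limit=$(( (1 << SPA_MAXBLOCKSHIFT) >> SPA_MINBLOCKSHIFT ))

# Left-hand side exactly as reported in panic.txt.
reported=36028797018963967

# An all-ones unsigned 64-bit value shifted right by 9 leaves 55 one-bits.
# Shell arithmetic is signed, so build that 55-bit value directly.
all_ones_shifted=$(( (1 << (64 - SPA_MINBLOCKSHIFT)) - 1 ))

echo "limit:               $limit"
echo "reported:            $reported"
echo "(2^64 - 1) >> 9:     $all_ones_shifted"
```

The limit comes out as 32768 and the reported value matches 2^55 - 1, which is what an unsigned 64-bit "-1" looks like after a right shift by 9. That would be consistent with a size of 0 having 1 subtracted from it before the shift, but that is a guess from the arithmetic alone; the line by itself doesn't say which caller passed the bad size.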


---

What I don't get is that all of these crash files are from January. We have had at least 3 crashes since then. Where are those logs?

LarsR said:
Did you make the BIOS changes for Ryzen CPUs?
Disable AMD Cool&Quiet, ErP Ready, and Global C-States.

I was able to adjust the following:

Code:
Advanced > AMD CBS > CPU Common Options > Global C-state Control [disabled]
Advanced > AMD CBS > CPU Common Options > Core Watchdog > Timer Enable [disabled]
Advanced > AMD CBS > CPU Common Options > Core Performance Boost [disabled]
Advanced > AMD CBS > CPU Common Options > Power Supply Idle Control [typical current idle]
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > DF Cstates > [Disable] (cool'n'quiet?)

Advanced > AMD Overclocking > Precision Boost Overdrive > [Disable]
Advanced > AMD Overclocking > SoC/Uncore OC Mode > [Enabled]
 

lightingman117

I have had 79 days of uptime.

I consider this resolved (fixed by the BIOS settings in my last post).

Thank you!


Edit: I revisited the BIOS settings on the other 3970x systems and came up with the following breakdown.
The TrueNAS 12.0-U8 box is still running the settings above, but I believe the ones below are slightly more stable due to further CPU restrictions.

Code:
Advanced > OC Tweaker > Voltage Mode > [Stable Mode]
Advanced > OC Tweaker > SRC Spread Spectrum > [Disabled]
Advanced > OC Tweaker > DRAM freq. > [3200]
Advanced > OC Tweaker > Infinity Fabric > [1600]

Advanced > CPU Configuration > SVM Mode > [Enabled]

Advanced > Chipset Configuration > Primary Graphics Adapter > [Onboard VGA]
Advanced > Chipset Configuration > Restore AC Power Loss > [Last State]

Advanced > PCI Subsystem Settings > SR-IOV > [Enabled]

Advanced > AMD Mem Configuration Settings > Socket [0/1] > DRAM ECC > [Enabled]

Advanced > AMD CBS > CPU Common Options > Core Watchdog > Timer Enable [disabled]
Advanced > AMD CBS > CPU Common Options > Core Performance Boost [disabled]
Advanced > AMD CBS > CPU Common Options > Global C-state Control [disabled]
Advanced > AMD CBS > CPU Common Options > Power Supply Idle Control [typical current idle]
Advanced > AMD CBS > NBIO Common Options > IOMMU > [Enabled]
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > DF Cstates > [Disable] (cool'n'quiet?)
Extended Frequency Range (XFR)
Advanced > AMD CBS > NBIO Common Options > XFR Enhancement > Accepted > Precision Boost Overdrive > [Disabled]
Advanced > AMD CBS > NBIO Common Options > XFR Enhancement > Accepted > Precision Boost Overdrive Scalar > [Disabled]
Advanced > AMD CBS > NBIO Common Options > XFR Enhancement > FCLK Frequency > [1600]
Advanced > AMD CBS > NBIO Common Options > XFR Enhancement > UCLK DIV1 MODE > [UCLK==MEMCLK]

Advanced > AMD Overclocking > Precision Boost Overdrive > [Disable]
Advanced > AMD Overclocking > SoC/Uncore OC Mode > [Enabled]
Advanced > AMD Overclocking > UCLK DIV1 MODE > [UCLK==MEMCLK]

---

Boot > AMD-RAID (Or UEFI SW Raid)
 