Boot pool degraded - no errors

artlessknave (Wizard, joined Oct 29, 2016, 1,506 messages)
chuck32 said: "I guess it's safe to assume that my actual data on the data pools is not at risk from a given SCALE release?"
Generally, yes. It's not impossible, but iXsystems is literally in the business of data storage; any release that mangles data is not going to reflect well on them, so they are highly incentivized to ensure that, if nothing else, data is safe.
chuck32 said: "Max. temperature was 69 °C."
That seems... possibly on the high side? Do you have good airflow? It doesn't always need to be much, but you usually need something moving air over the hotter components.

chuck32 (Guru, joined Jan 14, 2023, 623 messages)
artlessknave said: "Any release that mangles data is not going to reflect well on them, so they are highly incentivized to ensure that, if nothing else, data is safe."
The last and only instance I remember was the ZFS bug responsible for silent corruption.

artlessknave said: "That seems... possibly on the high side? Do you have good airflow? It doesn't always need to be much, but you usually need something moving air over the hotter components."
69 °C didn't concern me; we are talking about the system with the 4130. In the sensor readings tab, high CT is 100 °C. Tj max is 105 °C, I think. Intel ARK specifies 72 °C for Tcase. On top of that, prime produces unrealistically high temperatures (from my overclocking days I remember roughly +10 °C above anything you would ever see in real-world use).

What I did wonder about, though, is that the fans did not ramp up during mprime. Your comment got me to check a photo I had taken of the inside, and it seems the CPU fans are connected to FAN A, and two case fans to FAN3 (140 mm, or possibly 120 mm) and FAN4 (180 mm). Additionally, there are two intake fans in the front, but they are connected to a manual fan control board with adjustment knobs.

FAN A stayed at 500 RPM
FAN3 stayed at 1000 RPM
FAN4 stayed at 500 RPM

This is from current operation:
[Screenshots: 1709099473528.png, 1709099491843.png]

Fan mode is set to Optimal Speed. I did not ramp up the intake fans; they run rather slowly because I read that HDDs should stay above 25 °C and the room is not heated. When it was colder, I could turn them off completely to keep the drives in the high 20s.

[Photo: 1709099662230.png]

Ignore the cable management, but I would say this is plenty of airflow.

My takeaways from reconsidering the cooling concept:
  • At least for FAN3, I could adjust the thresholds so that it spins up.
  • Now that the external fan control board (not shown in the photo) has freed FAN1 and FAN2, I should connect the CPU fans to FAN3 and FAN1 instead of stacking them both on FAN A.
  • I could have set the fan mode to Heavy IO and manually ramped up the intake fans for the prime run, but as I mentioned, I thought sub-70 °C was fine.
This machine should also not usually see elevated load: it runs a pfSense VM* and the rsync app for remote replication, and I replicate to it over the LAN. Other than being a replication target, I don't work on it.

* I know the VPN should run on the router, but my FRITZ!Box cannot be fine-tuned in that regard. So I resorted to running it in a VM, which lets me limit VPN access to that server only, without exposing my home network completely in case the remote server my friend is using gets compromised. I may invest in a separate mini PC to replace my FRITZ!Box, but I haven't decided which one, or whether I really want to spend the money. I'd also need a separate modem then.

Thank you all for your input so far!

joeschmuck (Old Man, Moderator, joined May 28, 2011, 10,994 messages)
My personal experience is that 69 °C is not bad at all for a CPU stress test, and it shows the heatsink is working. This is a home environment, not a nice cool data center. That temperature could be lowered with better thermal compound, possibly a liquid product, but those are dangerous to use and I would never personally recommend them due to the huge risk. The heatsink looks very adequate. Room temperature is also a factor, and that data was not posted. My recommendation is that it is all good. You can look at the historical temperature data to see how TrueNAS is doing.

The purpose of the CPU stress test was to ensure it didn't crap out. It didn't. Good test!

artlessknave

Ahh, I misunderstood. I thought the 69 °C was for the SSD. For the CPU, that's fine.

chuck32

Update: last night I received four critical alerts:

Code:
Device: /dev/sdd [SAT], not capable of SMART self-check.

 /dev/sdd [SAT], failed to read SMART Attribute Data.

dev/sdd [SAT], Read SMART Self-Test Log Failed.

dev/sdd [SAT], Read SMART Error Log Failed.


The drive is faulted with 3 read and 142 write errors. Curiously, I didn't get an alert telling me my boot pool was degraded again.

I'll have to check again, but I thought that was supposed to be fixed in 23.10.2.
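Since the degraded boot pool did not raise an alert, a check that does not depend on TrueNAS alerting may be worth having. Below is a minimal Python sketch, assuming the usual `zpool status` device table (NAME, STATE, READ, WRITE, CKSUM columns) and `boot-pool` as the pool name (the TrueNAS SCALE default). The function names and the sample output, including the `sdd3`/`sde3` partition names, are hypothetical and only mirror the 3 read / 142 write errors described above.

```python
import subprocess

# Column header that starts the device table in `zpool status` output.
HEADER = ["NAME", "STATE", "READ", "WRITE", "CKSUM"]


def unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum) rows whose state is not ONLINE
    or whose error counters are not all zero (hypothetical helper)."""
    rows = []
    in_table = False
    for line in status_text.splitlines():
        tokens = line.split()
        if tokens[:5] == HEADER:
            in_table = True
            continue
        if not in_table:
            continue
        if not tokens or tokens[0] == "errors:":
            break  # a blank line or the "errors:" line ends the table
        if len(tokens) < 5:
            continue  # section labels such as "logs" or "spares" carry no counters
        name, state, read, write, cksum = tokens[:5]
        # Large counters may be abbreviated (e.g. "1.2K"), so compare to the literal "0".
        if state != "ONLINE" or any(c != "0" for c in (read, write, cksum)):
            rows.append((name, state, read, write, cksum))
    return rows


def check_pool(pool="boot-pool"):
    """Run `zpool status` for one pool; assumes the zpool CLI is on PATH."""
    result = subprocess.run(["zpool", "status", pool],
                            capture_output=True, text=True, check=True)
    return unhealthy_devices(result.stdout)


# Illustrative output only; device names are hypothetical.
SAMPLE = """  pool: boot-pool
 state: DEGRADED
config:

\tNAME        STATE     READ WRITE CKSUM
\tboot-pool   DEGRADED     0     0     0
\t  mirror-0  DEGRADED     0     0     0
\t    sdd3    FAULTED      3   142     0  too many errors
\t    sde3    ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    for row in unhealthy_devices(SAMPLE):
        print(row)
```

Any state other than ONLINE counts as unhealthy here, so the pool and mirror rows are reported alongside the faulted device; calling `check_pool()` from a scheduled task and acting on a non-empty result would be one way to get a notification independent of the built-in alerts.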

When I replace the drive I'll need to double-check, but I'm 99% sure I used a different port for this drive and put the other SSD model on the port where I previously received the errors.

I'll also add the SMART data later. At this point, I'm thinking that this specific SSD model is problematic.
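For when the SMART data is collected, here is a minimal Python sketch that condenses `smartctl --json -a` output (JSON output needs smartmontools 7.0 or later) to the overall health verdict and a few ATA attributes. The function names, the choice of attribute IDs, and the assumption that the JSON exposes `smart_status.passed` and `ata_smart_attributes.table[].raw.value` are assumptions to verify against the real output. A rising UDMA_CRC_Error_Count is often read as a cable or port problem rather than failing media, which bears on the port question above; if the drive no longer answers, as in the alerts above, `passed` comes back as `None`.

```python
import json
import subprocess

# ATA attribute IDs chosen for illustration; names as smartctl usually labels them.
WATCHED_IDS = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",  # often points at cabling/port rather than the media
}


def summarize_smart(report):
    """Condense a parsed `smartctl --json` report (hypothetical helper)."""
    summary = {
        # None when the health verdict is missing, e.g. the drive did not respond.
        "passed": report.get("smart_status", {}).get("passed"),
        "attributes": {},
    }
    for row in report.get("ata_smart_attributes", {}).get("table", []):
        if row.get("id") in WATCHED_IDS:
            summary["attributes"][WATCHED_IDS[row["id"]]] = row.get("raw", {}).get("value")
    return summary


def read_smart(device):
    """Run smartctl on one device; assumes smartctl is on PATH."""
    # smartctl's exit status is a bit mask, so a non-zero code does not always
    # mean the JSON is unusable; parse stdout regardless of the return code.
    proc = subprocess.run(["smartctl", "--json", "-a", device],
                          capture_output=True, text=True)
    return json.loads(proc.stdout)
```

Usage would be along the lines of `summarize_smart(read_smart("/dev/sdd"))`, comparing the attribute values before and after moving the SSD to another port.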