TN Core locked up hard

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
they DO say dual FC on them.. so maybe that's it? i thought i read that it could do both, fc and ethernet, just need to do some configuration..
when i plugged the modules into the front of my switch (an old sg300-20) they both lit up.. they don't in the intel ports however..

this is new to me, so..
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
Those should be the issue, I vaguely remember their cache being problematic. Can someone please confirm this?
you may have nailed this one with those drives having an issue..

when i do my test copy over the 10G line to either my nvme or to the spinners (in raidz1), all goes perfectly.. the nvme takes the stream at full line speed.. the spinners start off at 10Gb, then drop down to ~500MB/s for the remainder of the file copy and i can live with that.. (hey, it's almost 5x faster than over the 1G that i'm accustomed to)..
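For reference, the line-rate figures in the post above can be worked out with a few lines of arithmetic. This is only a sketch: it uses decimal units and ignores protocol overhead, and the helper name `gbit_to_mbyte_per_s` is made up for illustration.

```python
def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link rate in Gbit/s to MB/s (decimal units, no protocol overhead)."""
    return gbit_per_s * 1000 / 8


ten_gig = gbit_to_mbyte_per_s(10)  # theoretical ceiling of the 10G link
one_gig = gbit_to_mbyte_per_s(1)   # theoretical ceiling of the 1G link
observed = 500.0                   # MB/s, the sustained figure quoted for the raidz1 spinners

print(f"10G ceiling: {ten_gig:.0f} MB/s, 1G ceiling: {one_gig:.0f} MB/s")
print(f"speedup over the 1G ceiling: {observed / one_gig:.1f}x")
```

Real-world gigabit copies usually land somewhat below the 125 MB/s ceiling, so the speedup actually seen would be a little higher than the ratio printed here.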

i planned on putting the samsung ssd's on the 1Gb side.. they're for important file storage and speed isn't important.. so, on the one hand it doesn't matter, it's just odd behavior..

the one thing that IS important, is how this issue manages to take down the whole system.. that seems to be a problem with truenas.

unless it's a processing issue? i'm only running 2 cores and hyperthreading has been deactivated.. but all runs perfect with all the other drives, so.. 2 cores on this little thing seem to be more than adequate.

Not a day goes by that i'm not impressed with freebsd..
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
Those should be the issue, I vaguely remember their cache being problematic. Can someone please confirm this?
I spoke too soon...

WTF??

just happened again!

copying those files to the SSDs resulted in truenas giving me an error.. then it kicks the drives out..

and i can't get back into the dashboard.. this is stupid! A drive shouldn't take down truenas..
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
been using this hardware for 2 years in my pfsense box.. never a single problem.. now i switch its use for seemingly lighter duty and it's nothing but problems.

the drives get up to about 48c and it kicks them out.. problem is, it takes down the whole system.. i've had a drive fail in pfsense (raidz mirror), so i swapped the drive.. never a problem.. never had the system hang to where i had to do a hard reboot from within the ipmi because the power button on the gui wouldn't respond.. this is miserable. how does one make the most reliable OS around become unreliable?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
How are you connecting those drives to the motherboard? Just making sure there is no HBA overheating.
48C is hot for a SATA SSD.

Regarding the unresponsive system... I don't know, it feels like the drives hang. Monitor CPU and RAM when this happens, both are suspects.

Edit: I forgot this thread's page one ever existed.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I did find a link on Reddit where a user is reporting similar behavior under load for their 870 EVO drives - "heats up to 50C, goes non-responsive"


I can't speak from personal experience as my newest Samsung is an 860 EVO.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
There is also the following (unrelated?) issue apparently caused by bad NAND.


It appears I remembered right :D
 
Joined
Oct 22, 2019
Messages
3,641
I did find a link on Reddit where a user is reporting similar behavior under load for their 870 EVO drives - "heats up to 50C, goes non-responsive"

There's usually two sensors, and I'd wager that person may have hit temperatures (if measured by the other sensor) above 70 or 80 C.

I have two 970 EVOs. The one without a heatsink will exceed 70 C (and sometimes 75 C) under load. (Based on Sensor #2.)

It also happens to occasionally log errors. But I've never had the system outright "hang" on me; nor does the pool (of which it resides) ever suffer read, write, or checksum errors. (Even after monthly scrubs.)
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
How are you connecting those drives to the motherboard? Just making sure there is no HBA overheating.
48C is hot for a SATA SSD.

Regarding the unresponsive system... I don't know, it feels like the drives hang. Monitor CPU and RAM when this happens, both are suspects.

Edit: I forgot this thread's page one ever existed.
the ssd's are connected directly to the motherboard's sata ports.
when i run these large file copy tests, i have the TNC dashboard open and i'm watching cpu temps, memory usage and ssd temps as the file copy progresses.

when i send the files over the 1Gb link, cpu utilization floats between 15-30%. zfs cache fills to the brim (i have 32Gigs of ram).

if the nas is shut off so all cools, it'll perform the full file copy then kick both drives out. the drives stay around 34c-38c during the copy, then in the last 30% or so during the file copy, i notice the drives becoming physically warm despite the dashboard not relaying the correct temps. in the last couple of gigs or so the dashboard updates the temps showing 48c.. all while this is going on, i'm periodically refreshing the page to see if there are temperature updates.

these are the 870 evos, they're 2.5" sataIII's. they shouldn't get warm, not like that anyway.. but even then, 50c shouldn't cause problems, unless the drives are far warmer than that and truenas isn't reporting it correctly, but that wouldn't make sense since TNC pulls its info from s.m.a.r.t.

when this all started i had two lightly used drives.. one croaked so i ordered another.. installed it and it too croaked after about 45 minutes of use.

i pull the drives out and put them in a usb drive caddy, plug them into my windows workstation and departition them with windows diskpart.. i partition them and run disk tests and all comes out clean without issue..

i even swapped the drive backplane that houses the drives with another that i had lying around wondering if it was defective and throwing more volts at them than spec.. made no difference.
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
There is also the following (unrelated?) issue apparently caused by bad NAND.


It appears I remembered right :D
yeah i saw that.. but they specify the 2 and 4TB, mine are only 500gigs a piece. i was gonna call samsung about this although i doubt they'll do anything. drives DO all fall within warranty.. one is only a week old with a manufacturing date of this past february.. the others are from 2022.

there's not one aspect of any of this that makes any sense..

but it's eating my drives!
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
yeah i saw that.. but they specify the 2 and 4TB, mine are only 500gigs a piece.
They report issues with 1TB too.

You can easily check the smart data of those drives from the cli (better: use putty). Run a long smart test and then post the output in [CODE][/CODE] tags please.
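As a rough sketch of the check suggested above: with smartmontools, `smartctl -t long /dev/ada0` starts a long self-test and `smartctl -a /dev/ada0` prints the full report once it finishes. The device name `/dev/ada0` is an assumption, so substitute whatever name the drive actually has. The Python below is one way to pull the temperature attributes out of `smartctl -A` output. The helper names and the sample rows are hypothetical and are not output from the drives in this thread.

```python
import subprocess

# Attribute names smartctl commonly uses for drive temperature sensors.
TEMP_ATTRS = ("Airflow_Temperature_Cel", "Temperature_Celsius")


def parse_temperatures(smart_output: str) -> dict:
    """Return {attribute_name: raw value in C} from `smartctl -A` text."""
    temps = {}
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE ...
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in TEMP_ATTRS:
            try:
                temps[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value was not a plain integer; skip it
    return temps


def read_attributes(device: str) -> str:
    """Run `smartctl -A` on a device (needs root and smartmontools installed)."""
    # smartctl's exit status is a bitmask and can be non-zero on a good run,
    # so do not treat a non-zero status as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return result.stdout


# Hypothetical sample rows for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1200
190 Airflow_Temperature_Cel 0x0032   052   040   000    Old_age   Always       -       48
"""

if __name__ == "__main__":
    print(parse_temperatures(SAMPLE))
```

On the NAS itself, `parse_temperatures(read_attributes("/dev/ada0"))` would do the same against a live drive, again assuming that device name. Polling it during a file copy would show whether the drive's own reported temperature climbs faster than the dashboard suggests.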

If you haven't already tried, perform a clean installation and change the data cables/ports.
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
i gparted EVERYTHING and reinstalled 2 nights ago.

i've tried different sata ports (onboard) so this time i'm gonna plug them into the hba (lsi 9211-8i)..

i wouldn't be surprised if the drives are now fried anyway. i can send one back because i just received it on 6/15 (that was when i first reinstalled everything and started frying these ssds). i had NO problems up until that point even with drive transfers, although they were different drives.. before that i was using older samsungs that i had laying around. i couldn't use those because they were too small and this was supposed to be my final build before putting the nas into full-time service.. put it all together, all seemed to be going well.. created the datasets and permissions, started testing file copies and POOF!! all went to hell and has been a nightmare ever since.

all sorta points to the drives but.... 3 drives in a row? that were fine up until that moment? doesn't make any sense.. and i don't wanna go throwing drives at the thing without knowing the cause because, aside from wasting my time, i'm pissing away a lotta $$..
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I'd try taking those drives and importing the config on a completely different system (different MB, CPU, RAM) and run the transfer again.
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
i don't have any other systems.. maybe i can build something temporary at work. old stuff that's gonna get tossed in the dumpster anyway
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
i don't have any other systems.. maybe i can build something temporary at work. old stuff that's gonna get tossed in the dumpster anyway
Yup, that's what I'd do too in order to test.
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
welp, i tried 2 completely new, outta the box ssds (1tb crucial mx500's).. same thing.. i even put them on the lsi hba controller instead of the onboard sata... same thing..

i ran memtest64 on it for a long weekend and there weren't any errors..

does anyone know if memtest would show errors with other components? i always assumed it's only going to show issues between the cpu and the memory, but not between other devices.
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
kind of tough for me to believe this is a hardware issue when i've swapped ssds, swapped sata ports, swapped ssds to different sata ports on an lsi hba that has no problems whatsoever.. ran memtest, all came back clean.. i was always led to believe supermicro motherboards were good. not nearly as easy to work with as others, but just as stable as anything else.. i've used this motherboard and ram with pfsense without ever having an issue.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Then file a bug report; the button is at the top right.
Though pfsense doesn't move gigabytes of data to and from your devices.
 

jcizzo

Explorer
Joined
Jan 20, 2023
Messages
79
what do you think are the chances that the cpu isn't up to snuff?

it's just a 2-core i3-7100t..

when i run the file copy test to the nvme or the spinners (which are raidz1), the problem doesn't occur.. the cpu utilization stays pretty reasonable for those devices (what it is exactly i don't recall).

when i run the file copy for the 2 ssd's in raidz (mirror), it pokes at 70% and the zfs cache completely fills..

is a raidz mirror more resource intensive than raidz1 (raid5)?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
what do you think are the chances that the cpu isn't up to snuff?
I suggested to set up a different system with a different CPU and motherboard.

is a raidz mirror more resource intensive than raidz1 (raid5)?
Please read the following resource.

TL;DR: no, but it depends.
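To illustrate the difference in the work each layout does per write, here is a deliberately simplified model. It is not how ZFS is actually implemented, and the function names are invented for the example. A mirror writes identical copies with no parity math. A single-parity layout such as raidz1 also computes a parity block, modelled here as a plain XOR across the data blocks.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def mirror_write(block: bytes, copies: int = 2) -> list[bytes]:
    """Mirror: every disk gets an identical copy, no parity calculation."""
    return [block] * copies


def single_parity_write(data_blocks: list[bytes]) -> list[bytes]:
    """Single parity: the data blocks plus one XOR parity block."""
    return data_blocks + [xor_blocks(data_blocks)]


mirrored = mirror_write(b"\x01\x02")
stripe = single_parity_write([b"\x01\x02", b"\x04\x08"])

# If the first data block is lost, XOR of the survivors rebuilds it.
rebuilt = xor_blocks([stripe[1], stripe[2]])
print(mirrored, stripe, rebuilt)
```

In this toy model the mirror does strictly less computation per write than the parity layout, which fits the "no" half of the answer. It says nothing about the "it depends" half, such as the pool layout or the workload.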
 