Disk errors + hang/reboot during scrub - no issues in safe mode

ckrevel · May 24, 2023

I'm having a strange issue - hoping someone smarter than me can figure it out:

I've had TrueNAS running on the same hardware for the past 3 years (specs below). Recently, I've been having occasional issues where I'll get checksum errors. They all appear to be corrected without issue, and a subsequent scrub finds no faults. More recently, the scrub started to hang during execution. After that, the system would start to reboot during the scrub, with page faults in the dmesg log. When I start the system in safe mode, the scrub completes successfully, but as soon as I restart into normal mode, I get the same behavior: a hang during scrub followed by a reboot. I have taken the NAS down, run memtest86 for 48 hours without issue, run stress for 24 hours without issue, and am currently finishing up a set of tests on the drives (sequential and random access), so far without issue.

I guess I'd like to know:
* What could be causing this behavior
* What does BSD/TrueNAS do differently in Safe Mode vs Normal
* How can I stop / correct / prevent this behavior

My Hardware:
AMD Ryzen 5600
ASRock Rack X570D4U-2L2T
32 GB Micron ECC
8x Seagate IronWolf 8 TB 7200 rpm

I can provide logs or any other info on request. My current HD tests should be done by the end of 05/24/2023. Thanks for any and all help!

Best,
Chris

jgreco · May 24, 2023

That board has an AMD PCH SATA controller for 8 ports of some sort.

I'm not sure that's stable or reliable. We know that Intel PCH SCU or Intel PCH SATA ports are stable.

ckrevel said:
* What does BSD/TrueNAS do differently in Safe Mode vs Normal

Safe mode disables SMP (taking you down to one core), turns off DMA for ATA and ATAPI devices, turns off ATA write caching, sets kern.eventtimer.periodic to 1 which has been known to fix some weird ZFS/time issues, and some other irrelevant stuff.

ckrevel said:
* How can I stop / correct / prevent this behavior

Well, the first thing I'd try is subbing out the controller. Any time there's questionable activity regarding a disk controller, such as your reported "occasional" checksum errors, it is best to be skeptical of the controller. If you happen to have an LSI HBA available, please try that to see if it corrects the occasional error issue. Yes, I know it was working for you for several years, but sometimes things change in the drivers to fix a problem on Board Model A and it ends up breaking Board Model R. And once in a very nasty unfortunate while, they are even irreconcilably incompatible. It's nice to swap out something that is questionable for something that is expected to work.

ckrevel said:
and am currently finishing up a set of tests on the drives (sequential and random access), so far without issue.

What are you using for this? solnet-array-test-v3?

ckrevel · May 24, 2023

JGreco,

Thanks for the reply! I appreciate the information you provided regarding Safe Mode. I have an old RAID controller that I think I can drop into bypass mode and test using my old array disks. I had thought of doing that, but was hoping I could find and correct the issue without needing to go to that extent. I am not using solnet-array-test-v3; rather, I'm using a couple of different utilities found on UBCD and HBCD to perform concurrent read tests, but I will spool up a vanilla environment of TrueNAS and run solnet. Once I've finished the test protocol, I'll update.

Thanks,
Chris

joeschmuck · May 24, 2023

What TrueNAS software are you running? You have listed the hardware, and this hardware has been running reliably for the past 3 years, so why is it now failing? A piece of hardware could have failed, but my initial reaction is that you likely changed the software and that is causing the issue.

If you did change the software, can you roll back to the earlier version to see if the problems go away?

ckrevel · May 24, 2023

I'm running 13.0-U4. I did try booting into older environments (U2 and U3), with the same effect. I was concerned about going further back because of the ZFS feature upgrade that came out in 13, so I haven't gone back beyond 13.0-U2. I don't remember when I upgraded to U4, but I don't *think* the behavior lines up with the upgrade. (I don't remember if it started happening before U4 or after.)

Thanks,
Chris

joeschmuck · May 25, 2023

So long as you explored the software possibility. It would be terrible to buy new hardware, which might fix the problem, only to find out later that the software was the real issue.

ckrevel · May 26, 2023

Well, every new experimental result is a data point to help get to the root cause. But yeah, I'm going to exhaust all the tests I can that don't involve loading up the parts cannon first.

jgreco · May 26, 2023

ckrevel said:
loading up the parts cannon

There's a parts cannon?

ckrevel · May 26, 2023

"Cannon" might be a slight exaggeration, but it sounds better than "Parts pistol", or "parts slingshot", or "throwing alternating handfuls of my hair and hard objects at my server while screaming at it in tears while sitting naked on my office floor".

joeschmuck · May 26, 2023

ckrevel said:
"Cannon" might be a slight exaggeration, but it sounds better than "Parts pistol", or "parts slingshot", or "throwing alternating handfuls of my hair and hard objects at my server while screaming at it in tears while sitting naked on my office floor".

TMI

ckrevel · May 27, 2023

So after several days of testing, under both WinPE and BSD, my drives report no issues. I completed parallel read tests, both sequential and random, using a drive testing utility in WinPE, and successfully ran solnet-array-test-v3 without issue. The system seems to be solid until the pool is imported and mounted (I had no issues when importing with -N, until I started a scrub on the pool). But whenever I scrub the pool in normal mode, it halts and reboots. I'm curious if you think it's still an issue with the sata controller. I'd hate to buy parts that don't solve the problem.

Thanks,
Chris

joeschmuck · May 28, 2023

I don't think you really have a choice now. I feel you do need to use a known supported HBA and hope this solves the problem. @jgreco has great insight on HBAs and ZFS, and I trust his advice. Good luck in moving forward.

jgreco · May 28, 2023

ckrevel said:
I'm curious if you think it's still an issue with the sata controller.

I've got nothing solid for you. I've never seen the AMD PCH SATA, and so most of my opinion here has to be the result of secondhand interactions. And you're one of the first. I do want to know what you discover. It will impact what I tell users in the future...

ckrevel · May 28, 2023

I'm curious if either of you (or anyone else reading this thread) has any idea why the system works reliably in safe mode, but pukes every time it performs the same task (scrub) in normal mode. I don't have a problem replacing (or augmenting) the SATA controller with a third-party HBA, but I would think that the functions in safe mode would be approximately the same as normal.

Is there some documentation somewhere (either through iX/TrueNAS or BSD) where I can find out all the options being passed to the kernel at startup in safemode and try them one by one to see if I can isolate the offending setting? If I can do that, I might be able to correlate it to a setting in the BIOS, or, if it doesn't interfere with data integrity, just add it as a custom option on normal boot.

Thanks,
C

jgreco · May 28, 2023

ckrevel said:
but I would think that the functions in safe mode would be approximately the same as normal.

Depends on whether or not you have a reasonable definition of "approximately", "normal" and "safe mode". The point of safe mode is to disable certain known pain points that can cause system instability especially on dodgy hardware. So if you think the changes I listed above in post #2 qualify, then I guess. But it seems likely that one of those things is changing system behaviour in a way that "fixes" whatever is broken. I would warn you that

ckrevel said:
I can find out all the options being passed to the kernel at startup in safemode and try them one by one

Sure, man core.lua(8).
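
For anyone wanting to try the one-at-a-time approach, here is a minimal Python sketch that lists each trial. It assumes the safe-mode tunables are the seven below, which is what FreeBSD's Lua loader is understood to set; verify the list against /boot/lua/core.lua on your own system before relying on it. The names `SAFE_MODE_TUNABLES` and `one_at_a_time` are hypothetical and not part of TrueNAS or FreeBSD.

```python
# Assumed safe-mode tunables; confirm against /boot/lua/core.lua on your system.
SAFE_MODE_TUNABLES = {
    "kern.smp.disabled": "1",               # single core, SMP off
    "hw.ata.ata_dma": "0",                  # no DMA for ATA devices
    "hw.ata.atapi_dma": "0",                # no DMA for ATAPI devices
    "hw.ata.wc": "0",                       # ATA write caching off
    "hw.eisa_slots": "0",                   # skip EISA slot probing
    "kern.eventtimer.periodic": "1",        # periodic event timer
    "kern.geom.part.check_integrity": "0",  # relax partition table checks
}


def one_at_a_time(tunables):
    """Return (name, loader_command) pairs, one tunable per trial boot."""
    return [(name, f"set {name}={value}") for name, value in tunables.items()]


if __name__ == "__main__":
    for number, (name, command) in enumerate(one_at_a_time(SAFE_MODE_TUNABLES), 1):
        print(f"trial {number}: at the loader prompt run '{command}', then 'boot', then scrub")
```

Each trial changes a single setting for one boot, so if a scrub in that boot behaves the way it does in safe mode, that setting is the likely candidate.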

ckrevel · Jun 11, 2023

OK, after running a number of tests, I wasn't able to find any conclusive links. Furthermore, it seems that while safe mode does reduce the incidence of checksum errors and machine reboots / freezes, it doesn't eliminate them.

As a last-ditch effort, I bought a Broadcom HBA (and new cables), and reran the tests. As a result there was...no change. Regular operation still results in significant errors; safe mode results in more stable but not perfect operation.

As I now had a portable SAS/SATA controller, I next pulled my offsite backup machine back home, attached the HBA to it, and routed the cables from the backup machine with the HBA to the drives, which had been left in the backplane of my regular server. After importing the zpool and running two scrubs to clean up any residual checksum errors, I have not had any issues with stability or checksum errors. I have scrubbed the pool ~5 times at this point, and have run a deep analysis using 'zdb -U /data/zfs/zpool.cache -c -c {pool}' with no issues.
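
A rough sketch of how the check after each scrub could be automated: the Python below scans `zpool status` output for any device row with a nonzero READ, WRITE, or CKSUM counter. It assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout, and `devices_with_errors` is a hypothetical helper name, not something shipped with TrueNAS.

```python
import subprocess
import sys


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a nonzero counter."""
    bad = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True   # device rows follow this header
            continue
        if line.startswith("errors:"):
            in_table = False  # the summary line ends the device table
        if in_table and len(fields) >= 5:
            counters = tuple(fields[2:5])
            # Compare as strings: large counts may be abbreviated (e.g. "1.2K").
            if any(counter != "0" for counter in counters):
                bad[fields[0]] = counters
    return bad


# Only runs when a pool name is given, e.g.: python3 check_pool.py tank
if __name__ == "__main__" and len(sys.argv) > 1:
    output = subprocess.run(["zpool", "status", sys.argv[1]],
                            capture_output=True, text=True, check=True).stdout
    for name, (rd, wr, ck) in devices_with_errors(output).items():
        print(f"{name}: read={rd} write={wr} cksum={ck}")
```

An empty result after a completed scrub means no device row reported errors. It does not replace reading the full `zpool status` output.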

At this point, I have validated the operation of

* the drives
* the backplane
* the cables
* the HBA
* the PSU on the primary server (a separate test, but IPMI confirmed the output voltages prior to swapping it out)

by testing their operation successfully in another machine. Additionally, I have eliminated the onboard SATA controller from my primary server motherboard as a likely source of error. By process of elimination, the only components remaining on my primary server are:

* the CPU
* the memory
* the motherboard
* the boot pool

At this point, I'm not sure what I can do. I have tested the memory multiple times with memtest86, and have run burn-in tests on the CPU. I have run disk tests on the boot drive, and scrubbed the boot pool to check for data errors. I'm reluctant to throw in the towel on the mainboard (because it is fairly new and also fairly expensive), so I'm curious if anyone has any ideas for troubleshooting that I can run to narrow down or identify which component to replace.

Thanks,
Chris

joeschmuck · Jun 11, 2023

The RAM: is it one stick of 32 GB or two sticks of 16 GB? I ask because if you have two sticks, you could try removing one at a time to rule each out. Also keep in the back of your mind that the motherboard could have a bad RAM channel.

Something else you could try: run a different OS on the remaining components and see if there are any issues. Maybe run Ubuntu, for example, or Windoze for a few weeks. Unfortunately it would never be the same workload, but if it fails then you know it's not just a TrueNAS problem. CPU and RAM tests do not fully test a motherboard, and I don't know of any testing other than true factory testing that would be able to identify some of those hard-to-isolate motherboard problems.

Another thing you might try: slow down the motherboard a tiny bit (underclock). Maybe the system will become stable, maybe not. The CPU voltage can also be dropped slightly in an effort to promote CPU stability. There are a lot of little tricks out there, but many people would rather not try them out and expect the system to run as it was designed. I'm not saying you should just make it work and live with it. Use these steps to try to identify the failing component.

Remember that a new component can fail early; this is called infant mortality.

One last thing: if you can locate a cheap motherboard to test your CPU on, that would help. I wouldn't buy a motherboard unless I could use it in some project, or unless I really suspected the motherboard of being at fault, just like I wouldn't buy another CPU just to test the motherboard. But you could buy a less expensive CPU to test out the system as well.

These are just options you might consider when you are pulling out your hair and have no other place to go.

Best of luck to you.
