Failing Controller? Random IO errors and system freezing

Rendermandan

Dabbler
Joined
Sep 17, 2022
Messages
20
I'm still trying to get my system up and running, but I keep hitting errors and could use some advice. I'm using TrueNAS SCALE and I'm not very familiar with everything yet, but I'm learning, so please go easy on me. :)

I'm getting random disk read I/O errors on both the boot pool and the RAIDZ2 storage pool. They are random, but they usually lock the system up. A reboot seems to fix the issue for a few hours, but then the problems come back, usually not with the exact same error. I have been able to set up Samba shares and start copying my movie files to the pool, but that is when the errors start occurring. I've researched this as much as I can and tried different solutions, but I haven't had much luck.
Sometimes after a reboot, when I look at the disks, they show write errors but no checksum errors. Other times I see checksum errors, and after a reboot they disappear. Maybe I'm just not understanding something.

So far, here is what I've tried:
The original set of RAIDZ2 disks were SMR. They were old and some failed SMART tests, so at first I thought those were the problem and swapped them out.
Swapped the brand-new mirrored SSD boot disks for a single new SSD, then reinstalled SCALE and set up the pool again. Still random boot I/O errors, so I went back to the mirrored SSDs.
Purchased an entire new set of CMR 6TB Red drives for the array. Ran short, long, and conveyance SMART tests on each disk. All came back successful.
Ran SMART tests on the boot SSDs. All came back successful.
Checked all cable and drive connections.
Reseated the controller card.
Reseated the memory.
Cleared the controller card and re-flashed it to IT mode.
Deleted my pool and made a new pool with two RAIDZ2 vdevs instead of one large one.
Lastly, I restored the factory firmware on the controller and installed Ubuntu Server. Everything ran fine, but getting Samba sharing working on it is just too much work and a HUGE pain in the ass, so I'd really like to use TrueNAS SCALE.

The only thing I haven't tried yet is a new controller. Would that even help? I figure it couldn't hurt, but I really can't afford to keep throwing money at this!

I'm running Memtest86 as I type. So far there are no errors, but with 192GB of RAM it's taking a long time, so I can't pull any error logs right now. Sorry.

Here is my current system hardware:

Dell R720XD with dual processors.
192GB ECC RAM: (8) 8GB sticks and (8) 16GB sticks, running in optimized mode.
Mirrored Kingston SSD boot drives in the backplane.
12 brand-new 6TB CMR WD Red Plus hard drives.
H710P Mini monolithic controller in IT mode.
Dual 750W PSUs.
TrueNAS SCALE 22.02.4

I really appreciate any help! Thanks.
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
You checked off the cables, the drives, the pools... Do you get I/O errors in Ubuntu too? If so, the controller or the motherboard (or even the CPU, maybe some PCIe lanes?) could be toast. If not, it's likely a driver issue with that controller, the motherboard, or some Intel component... It would be helpful to post photos of the errors.
 

Rendermandan

No errors in Ubuntu. Unfortunately, I'm running Memtest right now, one day in. Once it's done, I'll post some error images. The errors scroll by pretty fast on the monitor. I just see what looks like a lot of I/O errors, and then the disk lights stop flashing. After that, if I try to open the web interface, it just won't connect.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
You did a lot of troubleshooting, which helps us.
I do have to point out that simply passing SMART tests doesn't mean a drive is OK. You have to check the attribute values.

I would try a different PSU, but you have two, so I think we can exclude that.
Are you sure you installed the right firmware on the HBA? They can be pretty picky.
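
To illustrate what "check the values" can mean in practice, here is a minimal Python sketch that scans the attribute table printed by smartctl -A for a few counters that are commonly watched. It assumes an ATA drive with the usual ten-column attribute table. The SAMPLE text is made-up illustrative data, not output from the drives in this thread, and the WATCHED list and the function name are my own choices rather than anything official.

```python
# Attributes whose raw value should normally stay at zero (an assumed shortlist).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# Hypothetical excerpt in the layout printed by `smartctl -A /dev/sdX`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def nonzero_watched(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

A nonzero raw value is not proof of a failing drive, but it is a reason to look closer. CRC errors, for example, often point at cabling or the interface rather than the disk itself.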
 

Rendermandan

Regarding the PSU: the server has redundant power supplies. According to iDRAC, they seem OK.
695FFAB9-212E-4871-9566-C4CC494D5871.png
 

homer27081990

No errors in Ubuntu (with all the drives mounted, of course) means it's likely not a hardware issue. @Davvo's idea about the PSU is a standard troubleshooting step for every weird problem, so you should do that too. My money's on the controller. I don't know how easy it is for you to find another one, though. Until you get back with some error photos (or a wizard here points out something we should have seen but didn't), we will sit tight.
 

Rendermandan

Maybe I should clarify: before I installed Ubuntu, I reverted the controller firmware back to factory, so it was running as a hardware RAID setup. So maybe it's an issue with the IT mode firmware? I don't know.
 

homer27081990

Those enterprise controllers are known for incompatibility problems, because they are meant to be used only for hardware RAID. Get an HBA after confirming no other problems exist.
 

Rendermandan

Yeah, I'm looking at one on eBay now. It's an LSI-style card already flashed to IT mode, not the mini monolithic style that's in there now. I'll have to get some different cables, though.
 

Davvo

I'm sorry, my previous post was cut in half (a misclick on my phone). I edited it immediately, but you were so fast! Check it out if you missed it.
My bet is on the HBA firmware, but it could very well be that the card just can't handle the load (which looks likely after a brief search).

Edit: I want to praise @Rendermandan for the great troubleshooting work done before opening the thread. It shows extensive research.
 

homer27081990

It's one of those awkward comedic moments. Happens to me all the time, so don't worry about it. As for the PSUs, the peak load shown is about 350W, and those PSUs are always spec'd for maximum sustained wattage, not a theoretical peak. I don't think a PSU fault is likely, especially considering that (a) there were no errors in Ubuntu and (b) a failing PSU is unlikely to always produce the same, very specific type of error.
 

Rendermandan

So the first pass of Memtest86 finished overnight with no errors reported, so that's a good sign.
I wanted to give a shout-out to "The Art of Server" on eBay. That's who I purchased the card from. He has been gracious enough to help me diagnose my problem, since it could be something other than the controller, such as the cables or the backplane/port expander. He provided a few links to some great troubleshooting videos too. They go into great detail on reading logs and help you work out where the problem is coming from.
The quest continues... and I'll keep you posted.

Thanks for everyone's help!
 

homer27081990

So this is proving to be a great learning experience. I can't wait to see what is in the logs.
 

Rendermandan

Ugh. I'm trying to create a pool after reinstalling TrueNAS SCALE, and it won't let me create the pool or wipe the drives. The error is Errno 16: device busy.

Found the solution: wipefs -af /dev/sd{a,b,c}

It let me create the pool, but as soon as I tried to create a dataset, the server rebooted.

I know what I did wrong. I ran that command on all of my drives, including the boot drives! Oops. Reinstalling now.
Sometimes you have to learn the hard way, I guess. Lol.
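
For anyone landing here later, this is a rough Python sketch of a pre-flight check that could help avoid wiping boot drives by accident. It works on the JSON that lsblk -J -o NAME,TYPE,FSTYPE,LABEL prints and separates disks that carry the boot pool from everything else. Assumptions: the boot pool's partitions are reported with the label boot-pool, the SAMPLE JSON is hypothetical and not taken from this server, and split_disks is just a name made up for this example.

```python
import json

# Hypothetical `lsblk -J -o NAME,TYPE,FSTYPE,LABEL` output for three disks.
SAMPLE = """
{
  "blockdevices": [
    {"name": "sda", "type": "disk", "fstype": null, "label": null,
     "children": [
       {"name": "sda1", "type": "part", "fstype": null, "label": null},
       {"name": "sda2", "type": "part", "fstype": "vfat", "label": "EFI"},
       {"name": "sda3", "type": "part", "fstype": "zfs_member", "label": "boot-pool"}
     ]},
    {"name": "sdb", "type": "disk", "fstype": null, "label": null},
    {"name": "sdc", "type": "disk", "fstype": null, "label": null,
     "children": [
       {"name": "sdc1", "type": "part", "fstype": "zfs_member", "label": "tank"}
     ]}
  ]
}
"""


def split_disks(lsblk, protected_label="boot-pool"):
    """Return (protected, candidates): disks holding the protected label vs. the rest."""
    protected, candidates = [], []
    for dev in lsblk["blockdevices"]:
        if dev.get("type") != "disk":
            continue  # skip loop devices, optical drives, and so on
        # Check the disk itself and every partition below it.
        nodes = [dev] + dev.get("children", [])
        if any(node.get("label") == protected_label for node in nodes):
            protected.append(dev["name"])
        else:
            candidates.append(dev["name"])
    return protected, candidates


if __name__ == "__main__":
    boot, others = split_disks(json.loads(SAMPLE))
    print("do not wipe:", boot)
    print("review before wiping:", others)
```

The second list is only "not the boot pool", not "safe to wipe", so it still needs a manual look before running wipefs on anything.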
 

homer27081990

Are you sure that the controller is properly set in IT mode? Creating a single drive hardware RAID for every drive is not what we want here.
Early morning, didn't read the last part.
 

Rendermandan

Hey everyone, I wanted to give an update on this issue. The error logs showed a bunch of PCIe bus errors. Basically, the system was detecting a PCIe error and resetting the controller, and since my boot drives are on that controller, the system would freeze. Here is an example of the error:

Oct 3 19:07:18 truenas kernel: pcieport 0000:00:02.2: AER: Uncorrected (Fatal) error received: 0000:00:02.2
Oct 3 19:07:18 truenas kernel: mpt2sas_cm0: PCI error: detected callback, state(2)!!
Oct 3 19:07:19 truenas kernel: pcieport 0000:00:02.2: AER: Root Port link has been reset (0)
Oct 3 19:07:19 truenas kernel: mpt2sas_cm0: PCI error: slot reset callback!!
Oct 3 19:07:20 truenas kernel: mpt2sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (198024780 kB)
Oct 3 19:07:20 truenas kernel: mpt2sas_cm0: sending diag reset !!
Oct 3 19:07:21 truenas kernel: mpt2sas_cm0: diag reset: SUCCESS
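
If the log is long, a small script can pull out just the relevant events. The Python sketch below counts fatal AER reports per PCIe port and diag resets per HBA. The two regular expressions are written only against the lines quoted above, so treat them as assumptions that may need adjusting for other kernels or drivers; summarize is a name made up for this example.

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted in this post (assumed wording).
AER_FATAL = re.compile(r"pcieport (\S+): AER: Uncorrected \(Fatal\) error received")
HBA_RESET = re.compile(r"(mpt\dsas_cm\d+): sending diag reset")

# The log excerpt from this post.
SAMPLE = """\
Oct 3 19:07:18 truenas kernel: pcieport 0000:00:02.2: AER: Uncorrected (Fatal) error received: 0000:00:02.2
Oct 3 19:07:18 truenas kernel: mpt2sas_cm0: PCI error: detected callback, state(2)!!
Oct 3 19:07:19 truenas kernel: pcieport 0000:00:02.2: AER: Root Port link has been reset (0)
Oct 3 19:07:19 truenas kernel: mpt2sas_cm0: PCI error: slot reset callback!!
Oct 3 19:07:20 truenas kernel: mpt2sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (198024780 kB)
Oct 3 19:07:20 truenas kernel: mpt2sas_cm0: sending diag reset !!
Oct 3 19:07:21 truenas kernel: mpt2sas_cm0: diag reset: SUCCESS
"""


def summarize(log_text):
    """Count fatal AER reports per PCIe port and diag resets per HBA."""
    fatal_aer = Counter()
    hba_resets = Counter()
    for line in log_text.splitlines():
        match = AER_FATAL.search(line)
        if match:
            fatal_aer[match.group(1)] += 1
        match = HBA_RESET.search(line)
        if match:
            hba_resets[match.group(1)] += 1
    return {"fatal_aer": dict(fatal_aer), "hba_resets": dict(hba_resets)}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Seeing the same port address next to every controller reset is what ties the freezes to the PCIe link rather than to the disks.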

I want to send a big thanks to "The Art of Server" on eBay. He helped me diagnose this log.

I received the new controller from him the other day and installed it in one of the PCIe slots in the back. I had to replace the SAS cables as well. I pulled the H710P out of the motherboard just to rule out any issues. I did a clean install of TrueNAS SCALE, and I'm happy to report that after two days of uptime and heavy file copying, it has been running flawlessly! My only complaint now is how loud the PSU fans are. Ughhh. :)

Thanks again for everyone's help!!!
 

homer27081990

If you are hackish enough you can try replacing the fans on the PSUs with industrial ball-bearing models, or even external (bigger) fans with 3D printed adapter nozzles... If you are a normal person (:smile:) however, you can always buy Dell PSUs that are known for better acoustics, given the range of wattages and models available... Should be no more than 30~40 bucks apiece? Anyway, may you have fun with your setup!
 

homer27081990

Also, I suspect the noise is coming from the server fans and not the PSU fans, unless you removed any fans inside the server, which you should never do in a rack-mountable server.
 