I/O errors making TrueNAS think a drive is faulted

Eeyore

Cadet
Joined
Dec 11, 2023
Messages
6
This has been a semi-recurring issue for a while. I'm only posting now because I've tried about everything I could find by searching this forum, and I need help taking a more thorough approach. In particular, I'm pretty sure the drive itself is fine.

First, the system:
Version: TrueNAS CORE 13.0-U6
CPU: i7-7700
Motherboard: B250
RAM: 24GB DDR4
Drives: 10x 4TB IronWolf (3 are IronWolf Pro), 1 NVMe boot drive and 1 SATA SSD
HBA: SAS9207-8i
NIC: Intel dual 10GbE (I don't remember the exact model; I picked it from the recommended hardware list)

The problem: I/O errors causing a drive to fault. At least, that is what the TrueNAS alert tells me. Checking the SMART data for the drive shows it working fine. Shuffling cables and a reset usually calm it down for a little while, but then a day or a week or a month later, it does it again.

Observations: This seems to happen almost exclusively to drives plugged into the HBA. It may be truly exclusive, as I can't remember it ever happening to a drive plugged into the mobo, though it may have a while back.

Things I've tried: Different cables, though I haven't tried *every* cable combination, so this may still be the solution.
Different HBA. I have another HBA, a SAS9220-8i.
Upgrading the PSU.
All of the above work for a time. But then the problem comes back.

I'd like some help identifying what the problem actually is. Maybe it has been the cables this whole time. Maybe it's the SAS card. Maybe there's something about my combination of hardware (10 drives may be more than this hardware can handle). Either way, I'd like your help figuring it out, so I'm not flailing in the dark.

One more note. I'm not particularly good with console commands, so if you're asking me to provide you with info, best to assume I don't know the command and provide that as well.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @Eeyore

What case are you using, and do you have a significant amount of airflow going across your LSI HBA? They do tend to get very warm under workload, and this would track with only the drives plugged into the HBA reporting faults.
 

Eeyore

It's a Fractal Define, one of the older ones without a basement. It could be getting too warm. I could try to strap a small Noctua fan to its heatsink. But IIRC, this problem still happened when I had a fan strapped to it in the past.
 

Eeyore

Oh, also: I can downgrade to 8 drives (I have my 10 drives in 2 pools of 5 drives, so I could shuffle things around and drop down to 2 pools of 4 drives). Would it be better to have 3 drives from each pool on the mobo and 1 drive from each on the HBA? Or 1 full pool on the mobo and 1 pool entirely on the HBA?
 

HoneyBadger

I don't think that downsizing your pool to 4-wide will change anything - if all drives connected to the HBA start faulting, that's where I'd focus my troubleshooting efforts.

The Define line of cases is pretty large, but they're all very much a gaming/tower chassis that won't naturally have airflow over the PCIe slots. Applying some additional airflow to that area in particular would be my first step, especially if the 10GbE card is cozied up next to it.
 

Eeyore

Thanks. I hope that's all. I increased the speed of the fans, swapped to the other HBA (it has a more secure heatsink) and put a 60mm fan on it. The airflow isn't ideal, but it should be better.

If something like this happens again, is there some way to find out from the HBA whether it had any heat-related issues?
 

HoneyBadger

A bad thermal join between the chip and the heatsink itself could also do it, especially if it's an old card with possibly dried-out thermal interface material. Paste or pads can work here - I don't think an HBA demands a liquid metal treatment. :wink:

Unfortunately the older LSI cards don't have any manner of temperature sensor on them - to be honest, I'm not certain if their newer ones do either.

Here's hoping this sorts it out - if you apply a scrub or other intensive workload it will hopefully be a bit of a proving grounds for it.
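
For checking the result once the scrub finishes, here is a minimal sketch (not an official TrueNAS tool) that scans the text printed by `zpool status` and lists every row that is not ONLINE or shows a non-zero error counter. The function name `find_problem_devices`, the sample pool name, and the device names are all hypothetical; the only thing assumed about the real output is the usual NAME / STATE / READ / WRITE / CKSUM table.

```python
# Sketch only: the sample below is made-up output in the usual zpool status layout.
KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

    NAME          STATE     READ WRITE CKSUM
    tank          DEGRADED     0     0     0
      raidz1-0    DEGRADED     0     0     0
        da0p2     ONLINE       0     0     0
        da1p2     FAULTED     14     3     0  too many errors
        da2p2     ONLINE       0     0     0
"""


def find_problem_devices(status_text):
    """Return (name, state, read, write, cksum) tuples for rows that are
    not ONLINE or that have any non-zero error counter."""
    problems = []
    for line in status_text.splitlines():
        parts = line.split()
        # Config rows look like: NAME STATE READ WRITE CKSUM [note...]
        if len(parts) < 5 or parts[1] not in KNOWN_STATES:
            continue
        name, state = parts[0], parts[1]
        # Counters stay as strings, since large counts may be abbreviated.
        counters = parts[2:5]
        has_errors = any(c != "0" for c in counters)
        if state != "ONLINE" or has_errors:
            problems.append((name, state, *counters))
    return problems


for row in find_problem_devices(SAMPLE):
    print(row)
```

To use it on a real system, paste the output of `zpool status` in place of `SAMPLE`. Note that the pool and vdev rows are reported alongside the faulted disk, because they inherit the DEGRADED state.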
 

Eeyore

I had actually re-pasted the heatsink on that HBA before. But the mounting mechanism isn't very good: just 2 plastic pins on opposite corners. Since the IC is so much smaller than the heatsink and the heatsink isn't stabilized, it could easily struggle to transfer heat to the heatsink.

The other HBA seems to have a much more solid mounting mechanism, so hopefully it works out better.

I'll scrub the pool that's on the HBA, and see if it re-occurs. If it does, maybe I'll just pick up a third HBA...
 

Eeyore

It happened once more (some time later, obviously). And I think you're right, it's the HBA and heat issues. I've moved it to a different slot with more airflow.

But I noticed one thing. In addition to only ever happening to drives on the HBA, it also seems to happen mostly (exclusively?) to the IronWolf Pro drives on the HBA. So I moved the Pro drives to the motherboard's SATA ports and now have just the regular IronWolf drives on the HBA.
 