Large TrueNAS setup help

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
Good Morning!

My team and I are trying to troubleshoot a setup that we have. We have two 60-bay units, each using 4x LSI 9305-16i HBAs. The bays are populated with 14TB Seagate SAS drives, a mix of X14 and X16. Each unit is powered by an X11SPL motherboard with a Xeon Gold 6230R and 256 GB of RAM, and has a fiber card with two ports: one connects to the main network, and the other connects the two units directly via a TwinAx cable, which serves as the replication link.

We've had nothing but issues with this since getting it installed: constant alerts saying the SMART status could not be read, which I think eventually leads to drives being kicked from the pool and our spares being pulled in. I am babysitting these units and doing restarts at least twice a week.

Any suggestions appreciated.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Constant alerts saying the SMART status could not be read
Well, that's an indication of a problem.
which I think eventually leads to drives being kicked from the pool and our spares being pulled in
No, that does not happen because of SMART data in any way. What it says is that something is very wrong with either your disks or their communication channels to the host.

Your description is too vague to point at likely culprits, so let's take this a step at a time:
  1. TrueNAS Scale or Core? And which specific version?
  2. Is the problem restricted to only some disks or does it affect all disks?
  3. What are these "60 Drive Bay Units"? Brand and model are rather relevant here.
  4. Why are you using four rather expensive 16-lane HBAs instead of SAS expander backplanes?
  5. What is the cabling situation on these things? Pictures would be nice to help us understand what is going on.
  6. Is the firmware on all eight SAS controllers up-to-date (P16.00.12.00 or P16.00.10.00)?
  7. Please post the output of smartctl -x /dev/daX for some of the offending disks.
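For item 7, one way to sweep several suspect disks at once is a small loop that keeps only the SAS link and error counters from the smartctl -x output. A sketch — the device list here is a placeholder; substitute the disks named in your alerts:

```shell
# Sketch: run smartctl -x over a set of suspect disks and keep only the
# lines relevant to SAS link health (the device list is an example).
for disk in da0 da1 da2; do
    echo "=== /dev/$disk ==="
    smartctl -x "/dev/$disk" | \
        grep -Ei 'invalid dword|disparity|loss of dword|phy reset|power cycle'
done
```

Nonzero counts on the invalid-DWORD, disparity, or loss-of-sync lines would point at a marginal link rather than a failing disk.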
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
1.) TrueNAS Core 13 U3.1
2.) Seems random? In the log it references da29/27/23/17/15/18 currently.
3.) These were units built by 45 Drives. Storinator XL's.
4.) Came preconfigured this way.
5.) Our corporate environment prevents us from taking pictures unfortunately. All 4 ports on the HBA Card have cables plugged into them which go to the backplane in the bottom of the unit. Not sure how better to describe it.
6.) Firmware has been updated, recently even. It's on P16.00.12.00.
7.) For the same reason as #5 I can't copy and paste, but I can respond as best I can with specific information. The most troubling thing to me looking at this is that the bottom section is filled with "Background Short Aborted (device reset ?)"

I have run smartctl -a /dev/da?? on drives immediately after they've shown up in the log, and it doesn't get a response back. It gives me the information up top in the informational section, like the vendor and serial, but then says the SMART request failed to get a response. 10-20 seconds later I get normal responses from the drive, at least the full list that I expect.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Can you post (or run) zpool status? Are all drives showing up as they should?
How did you acquire your HBAs? I'm guessing you bought the entire system (turbo option) from 45Drives, which rules out fake HBAs and, in theory, hardware compatibility issues.
How are the temps?
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
Yes, all the drives are there in zpool status. I have 4 vdevs of 14 drives each, plus 4 hot spares, using RAIDZ2.

One spare is currently in use in one of the units. During a test pull from the production unit that everyone accesses, our recently rebuilt backup unit had an immediate fault on one drive for reasons unknown.
 
Joined
Jul 3, 2015
Messages
926
2.) Seems random? In the log it references da29/27/23/17/15/18 currently.
Might be worth checking the physical location of these drives. Just because the da numbers are spaced out doesn't mean the drives are not next to one another which could indicate a cable issue.
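On Core (FreeBSD), you can check that mapping without opening the chassis using sesutil, which pairs enclosure slots with device names. A sketch — the "Description:" and "Device Names:" field layout is the usual one, but it can vary with enclosure firmware:

```shell
# Sketch: print "slot -> device" pairs from the enclosure services data,
# then check whether the flapping da numbers cluster in adjacent slots.
sesutil map | awk '
    /Description:/  { slot = $2 " " $3 }      # e.g. "Slot 01"
    /Device Names:/ { print slot " -> " $3 }  # e.g. "Slot 01 -> da0,pass2"
'
```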

Are you having these issues across both systems?
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
Might be worth checking the physical location of these drives. Just because the da numbers are spaced out doesn't mean the drives are not next to one another which could indicate a cable issue.

Are you having these issues across both systems?
Surprisingly, they are in the exact locations I expect physically. And yes, this issue is cropping up on both systems.
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
Davvo, temp-wise the drives are fine. What I'm not sure about is the temps of the HBAs.
 
Joined
Jul 3, 2015
Messages
926
Surprisingly, they are in the exact locations I expect physically. And yes, this issue is cropping up on both systems.
So are they next to each other in the chassis?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
3.) These were units built by 45 Drives. Storinator XL's.
We don't see many of those, but I have to say they are not at all convincing. Their shtick is that they can supposedly cram in more drives for less money... But a Supermicro SuperServer 640SP-E1CR60 is cheaper, holds as many disks, has slightly better hardware and actually has a reputation for working.

Anyway, overall this sounds a lot like the SAS links are crapping out.
7.) For the same reason as #5 I can't copy and paste, but I can respond as best I can with specific information. The most troubling thing to me looking at this is that the bottom section is filled with "Background Short Aborted (device reset ?)"
smartctl -x includes a bunch of low-level logs, at least for SATA drives. I'm not sure what the specific log looks like on a SAS model, but there are a few things to look out for:
  • PHY resets, errors, or the like -> Indicative of a bad SAS link
  • Link CRC errors -> Indicative of a marginal SAS link that's throwing errors without outright failing (yet)
  • Abnormally high power cycle counts -> Indicative that the drive is losing power
Davvo, temp-wise the drives are fine. What I'm not sure about is the temps of the HBAs.
There's a way of checking that on LSI SAS3 cards, but I'm having a hard time finding it.
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
We don't see many of those, but I have to say they are not at all convincing. Their shtick is that they can supposedly cram in more drives for less money... But a Supermicro SuperServer 640SP-E1CR60 is cheaper, holds as many disks, has slightly better hardware and actually has a reputation for working.

Anyway, overall this sounds a lot like the SAS links are crapping out.

smartctl -x includes a bunch of low-level logs, at least for SATA drives. I'm not sure what the specific log looks like on a SAS model, but there are a few things to look out for:
  • PHY resets, errors, or the like -> Indicative of a bad SAS link
  • Link CRC errors -> Indicative of a marginal SAS link that's throwing errors without outright failing (yet)
  • Abnormally high power cycle counts -> Indicative that the drive is losing power

There's a way of checking that on LSI SAS3 cards, but I'm having a hard time finding it.
If you can find that, it would be appreciated; I am going to keep looking. I am starting to suspect the HBA cards are overheating. They sit in the back of the unit, all next to each other in a row. Thinking about it some more, I've realized that the slots reporting failures are never beyond the first two sections of the NAS; none of the drives in the last two rows have failed in either unit.

I also ran that command on one of the problematic drives, and most of the results are zero, but it does say "Loss of DWORD Synchronization = 1". I do not see a line for power loss.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Found it:
mprutil -u 0 show cfgpage page 7 | awk '{ if ($1 == "0010") { printf "%d ", "0x" $4 $5; if ($3 == "00") printf "N/A"; else if ($3 == "01") printf "F"; else if ($3 == "02") printf "C"; print "" } }'
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
Found it:
mprutil -u 0 show cfgpage page 7 | awk '{ if ($1 == "0010") { printf "%d ", "0x" $4 $5; if ($3 == "00") printf "N/A"; else if ($3 == "01") printf "F"; else if ($3 == "02") printf "C"; print "" } }'
This gives me one read out of 69 C. Is there something I can modify to get the read out on the other 3 cards?

Thanks!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yeah, they should be -u 1 / -u 2 / -u 3.
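For convenience, the four reads can go in one loop. A sketch, assuming the controllers enumerate as mprutil units 0 through 3, reusing the awk filter from the earlier post unchanged:

```shell
# Sketch: read the temperature config page from each of the four HBAs,
# assuming they show up as mprutil units 0-3.
for u in 0 1 2 3; do
    printf 'HBA %s: ' "$u"
    mprutil -u "$u" show cfgpage page 7 | awk '{ if ($1 == "0010") { printf "%d ", "0x" $4 $5; if ($3 == "00") printf "N/A"; else if ($3 == "01") printf "F"; else if ($3 == "02") printf "C"; print "" } }'
done
```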
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
Yeah, they should be -u 1 / -u 2 / -u 3.
Thanks for your help, Ericloewe. Of the 4 cards, 3 are reporting high 60s and the one furthest out is in the low 50s. This is identical in each unit, and these are idle temps.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That's not terrible, but there's potential there for it to get substantially worse under load. You can try ramping up the fans and seeing if the situation improves.

With sas3ircu, you should also be able to view the topology, specifically which disks are connected to which controller. Maybe there's a pattern there?
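A sketch of that check, assuming sas3ircu enumerates the four controllers as 0-3; the grep keeps just the fields needed to match disks to controllers and slots:

```shell
# Sketch: dump enclosure/slot/serial for every disk attached to each
# controller, so failing serials can be matched to a controller and row.
for c in 0 1 2 3; do
    echo "=== Controller $c ==="
    sas3ircu "$c" display | grep -E 'Enclosure #|Slot #|Serial No'
done
```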
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
That's not terrible, but there's potential there for it to get substantially worse under load. You can try ramping up the fans and seeing if the situation improves.

With sas3ircu, you should also be able to view the topology, specifically which disks are connected to which controller. Maybe there's a pattern there?
Fortunately the topology and layout are identical: controller 0 is row 1, controller 1 is row 2, etc. And yes, there is a pattern: rows 1 and 2 are the culprits for 'failing' drives in both units. TBD if it's a temp issue at this point. I have another fan coming that will be mounted to blow air directly onto the controllers.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Do the cables have any weird bend for those rows? Cable length shouldn't be a big deal because the disks are SAS, but reflections or breaks in the cables can't be fixed with higher transmit power.
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
Do the cables have any weird bend for those rows? Cable length shouldn't be a big deal because the disks are SAS, but reflections or breaks in the cables can't be fixed with higher transmit power.
It's a tight fit with everything close together, but no super-hard bends or anything of that nature. 16 connectors to the 4 cards. I may check and see if I can space them out a bit.
 

Canis

Dabbler
Joined
Jan 26, 2023
Messages
15
Update!

We've installed a fan mounted over the 4 HBA cards. It has brought idle temps down from the high 60s to the low 50s; one card is even in the mid 40s. We're in wait-and-see mode to find out whether this has fixed the issue.
 