HELP! Not sure how to proceed. Scrub shows degraded and faulted drive(s)

Demonlinx

Explorer
Joined
Apr 11, 2022
Messages
53
I noticed that my storage pool was unhealthy, so I decided to run a scrub. During the scrub, it shows several drives as degraded and one of the drives as faulted. I'm not sure what to do at this point.

1661901444570.png


Any help would be greatly appreciated!!
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Do you have a replacement drive ready?

Based on another case: https://www.truenas.com/community/threads/disks-statuses-faulted-and-degraded.90334/

This seems pretty odd to get errors across that many drives all at once. Maybe a controller is going out? Did you make any HW changes recently? Disrupt cables, etc.?

Demonlinx

Yes, I do. Should I wait for the scrub to finish before replacing the faulted drive? I also did a massive write of about 4TB to the drives. Would this cause issues like this? This is the first time we've really used this system, and I'm concerned about the stability.
 

Demonlinx

This is also not the first time I've seen behavior like this with this system. I had initially thought it was an issue with the SFF cables between the HBA card and the backplane for the HDDs. I'm not really sure how to diagnose this any further. I'm running this on a DL180 G6.
 

Demonlinx

indivision said:
"Do you have a replacement drive ready?

"Based on another case: https://www.truenas.com/community/threads/disks-statuses-faulted-and-degraded.90334/

"This seems pretty odd to get errors across that many drives all at once. Maybe a controller is going out? Did you make any HW changes recently? Disrupt cables, etc.?"

No, nothing has changed in the system hardware. The server is secondhand, and the HBA is just off Amazon. The SFF cables were replaced with new cables.
 

indivision

I would wait for the scrub to finish to better understand the entire situation.

Which HBA is it?
 

indivision

Also what drives?

You say this is the first time you used it? How long has it been on-line? If this happened right away, it suggests something wrong above the drives, imo.

(Side note: with that number of drives in the pool I would recommend Z2.)

Are you able to access the data where it is now and is it backed up or non-critical?
 

Demonlinx

indivision said:
"Also what drives?

"You say this is the first time you used it? How long has it been on-line? If this happened right away, it suggests something wrong above the drives, imo.

"(Side note: with that number of drives in the pool I would recommend Z2.)

"Are you able to access the data where it is now and is it backed up or non-critical?"

They are 2TB Hitachi drives. We were thinking about running Z2 but figured a single drive failure was not the end of the world and 2 drive failures happening was highly unlikely; alas, here we are.

Data is backed up offsite; however, the 4TB of data that was moved was offloaded from an additional server to get that server ready for an upgrade to TrueNAS as well. So the data is somewhat critical, but ultimately could be recovered.
 

indivision


That should be ok. I have the LSI 9211 and never had trouble with it. But, did you flash it for IT passthrough mode?

Demonlinx said: "They are 2TB Hitachi drives. We were thinking about running Z2 but figured a single drive failure was not the end of the world and 2 drive failures happening was highly unlikely; alas, here we are."

The problem is that one of the most likely times of a second drive failure happening is during resilver of another one. Plus, with that many drives, the cost of Z2 over Z1 isn't much.
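To put a rough number on that, here is a back-of-the-envelope sketch. It assumes all 11 of the 2TB drives sit in a single RAIDZ vdev (the pool layout is not confirmed in this thread) and it ignores ZFS metadata and padding overhead. The function name `raidz_usable_tb` is made up for illustration.

```shell
#!/bin/sh
# Rough usable capacity of ONE RAIDZ vdev, ignoring ZFS overhead.
# usage: raidz_usable_tb <drive_count> <drive_size_tb> <parity_drives>
# Assumption: every drive is the same size and is in the same vdev.
raidz_usable_tb() {
    echo $(( ($1 - $3) * $2 ))
}

raidz_usable_tb 11 2 1   # RAIDZ1: prints 20
raidz_usable_tb 11 2 2   # RAIDZ2: prints 18
```

Under those assumptions, Z2 gives up about 2TB out of 20TB, roughly 10% of usable space, in exchange for surviving a second failure during a resilver.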

Demonlinx said: "Data is backed up offsite; however, the 4TB of data that was moved was offloaded from an additional server to get that server ready for an upgrade to TrueNAS as well. So the data is somewhat critical, but ultimately could be recovered."

Ok. For now I would report back here with final results when the scrub is finished. I have a lot of practical experience. But, I think you may need much more technical know-how to figure this one out.

I've never seen a bunch of drives show problems all at once like that. So, it feels like something systemic rather than drive failures. I've had to replace a couple single drives from time to time and it has always just been one with the errors.

Could there be a heat issue? Have you checked the drive temps during the scrub?
 

indivision

The HBA you have covers 8 drives. But you are showing 11 in the pool.

Is there a pattern to which ones show degraded vs online? In other words, are the ones not on the HBA showing online?
 

Demonlinx

indivision said:
"That should be ok. I have the LSI 9211 and never had trouble with it. But, did you flash it for IT passthrough mode?

"The problem is that one of the most likely times of a second drive failure happening is during resilver of another one. Plus, with that many drives, the cost of Z2 over Z1 isn't much.

"Ok. For now I would report back here with final results when the scrub is finished. I have a lot of practical experience. But, I think you may need much more technical know-how to figure this one out.

"I've never seen a bunch of drives show problems all at once like that. So, it feels like something systemic rather than drive failures. I've had to replace a couple single drives from time to time and it has always just been one with the errors.

"Could there be a heat issue? Checked the drive temps during the scrub?"

I did not flash either of them for IT passthrough mode. Could this be the cause of the issue? If so, how do I go about flashing them?

We'll look at going to Z2 after we get all of this figured out.

I'll wait for the report of the scrub and then post those results here. Thank you for all of your help!

I don't know that there could be a heat issue. We have the room relatively well monitored and ventilated and have not had any other heat issues. How do I check the temp during the scrub?
 

Demonlinx

indivision said:
"The HBA you have covers 8 drives. But you are showing 11 in the pool.

"Is there a pattern to which ones show degraded vs online? In other words, are the ones not on the HBA showing online?"

How can I see which drives are connected to the HBA vs which are not? I don't see any correlation to the drives having issues being on the HBA vs not...
 

Demonlinx

indivision said: "That should be ok. I have the LSI 9211 and never had trouble with it. But, did you flash it for IT passthrough mode?"

Comments that I saw on the purchase page indicated that it was already flashed for IT passthrough mode. Is there any way for me to tell this from the CLI?
 

indivision

Demonlinx said: "I did not flash either of them for IT passthrough mode. Could this be the cause of the issue? If so, how do I go about flashing them?"

From this thread it looks like your model should already be in that mode: https://www.truenas.com/community/threads/flashing-it-mode-9207.69510/

But, I would double-check that. Make sure it's in that mode and using the latest firmware.

I'm not certain what the consequences are of not being in that mode. But, it's a possible cause, imo.

Demonlinx said: "I'll wait for the report of the scrub and then post those results here. Thank you for all of your help!"

Any time! I feel for your situation. Sorry I don't have a more concrete solution. There should be some others here with some ways to troubleshoot further.

Demonlinx said: "I don't know that there could be a heat issue. We have the room relatively well monitored and ventilated and have not had any other heat issues. How do I check the temp during the scrub?"

Never know. Could be fan issues or that server just needs a cooler room than other HW. Best to check the drives directly. You can do this in TrueNAS > Reporting > Disk > Disk Temperature. Select all disks.
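If you'd rather check from the shell than the Reporting page, something along these lines may work. This is only a sketch: it assumes the drives report temperature as SMART attribute 194 with the raw value in the tenth column of `smartctl -A` output, and the helper name `smart_temp` and the device names in the usage comment are placeholders, not anything taken from this system.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 194 (Temperature_Celsius)
# from `smartctl -A` output supplied on stdin.
# Assumption: standard ATA attribute table, raw value in column 10.
smart_temp() {
    awk '$1 == 194 { print $10 }'
}

# Usage (device names are placeholders; adjust to your system):
#   for d in /dev/da0 /dev/da1; do
#       printf '%s: ' "$d"; smartctl -A "$d" | smart_temp
#   done
```

Drives that do not expose attribute 194 will simply print nothing, so an empty result does not necessarily mean a problem.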

Demonlinx said: "How can I see which drives are connected to the HBA vs which are not? I don't see any correlation to the drives having issues being on the HBA vs not..."

You would need some kind of label system to know. In my case I've put the last 4 digits from each serial number on the back of the drive. So, I can quickly find it when TrueNAS references it that way.

Otherwise, you'll have to take things apart to be able to see the serial numbers.
 

Demonlinx

indivision said:
"From this thread it looks like your model should already be in that mode: https://www.truenas.com/community/threads/flashing-it-mode-9207.69510/

"But, I would double-check that. Make sure it's in that mode and using the latest firmware.

"I'm not certain what the consequences are of not being in that mode. But, it's a possible cause, imo."

I'll check this and see what else I can figure out.

indivision said:
"Any time! I feel for your situation. Sorry I don't have a more concrete solution. There should be some others here with some ways to troubleshoot further.

"Never know. Could be fan issues or that server just needs a cooler room than other HW. Best to check the drives directly. You can do this in TrueNAS > Reporting > Disk > Disk Temperature. Select all disks."

Mean temperature for the drives seems to be around 32 C so I don't think they are getting too hot...

indivision said:
"You would need some kind of label system to know. In my case I've put the last 4 digits from each serial number on the back of the drive. So, I can quickly find it when TrueNAS references it that way.

"Otherwise, you'll have to take things apart to be able to see the serial numbers."

This is something I will look into.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
sas2flash -list will get you the adapter info and show IT or IR mode:

1661910103212.png
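If you only want the mode and not the whole listing, a small filter like the one below might do. It is a sketch that assumes the listing contains a line of the form `Firmware Product ID : 0x2213 (IT)`; the exact layout can differ between sas2flash versions, and `hba_mode` is just a name picked for this example.

```shell
#!/bin/sh
# Extract the firmware mode (IT or IR) from `sas2flash -list`
# output supplied on stdin.
# Assumption: the mode appears in parentheses on the
# "Firmware Product ID" line.
hba_mode() {
    sed -n 's/.*Firmware Product ID.*(\(I[TR]\)).*/\1/p'
}

# Usage:
#   sas2flash -list | hba_mode
```

If nothing is printed, read the full `sas2flash -list` output by eye rather than trusting the filter.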
 

Alex_K

Explorer
Joined
Sep 4, 2016
Messages
64
The DL180 G6 has an expander on the backplane.
If it's the 12x3.5" model, it should work OK with a SAS2008 in IT mode. I have 4 of these servers that have been running solid for years.

You can use
sas2ircu 0 display
to show which drive resides in which bay (0 is the controller number; you'll figure it out)
and
sas2ircu locate
to light the LED on a drive caddy of your choice.


It would also be helpful if you posted the entire system config, and the output of
smartctl -a /dev/sd*
for every drive that's faulty
mpsutil show all
too
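
One practical note on the smartctl request: smartctl takes a single device per invocation, so a glob like /dev/sd* has to be run one drive at a time. A minimal sketch of such a loop follows. The function name `smart_report_all` and the `SMARTCTL` override (handy for a dry run) are made up for this example, and the device names in the usage comment are placeholders.

```shell
#!/bin/sh
# Run `smartctl -a` once per device passed as an argument.
# SMARTCTL may be set to another command (e.g. echo) for a dry run;
# that override is a convenience of this sketch, not a smartctl feature.
smart_report_all() {
    for dev in "$@"; do
        echo "=== $dev ==="
        ${SMARTCTL:-smartctl} -a "$dev"
    done
}

# Usage (device names are placeholders):
#   smart_report_all /dev/da0 /dev/da1 > smart_report.txt
```

Redirecting to a file makes it easier to paste the output for the faulted and degraded drives into the thread.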
 