HELP! Not sure how to proceed. Scrub shows degraded and faulted drive(s)

Demonlinx · Aug 30, 2022

I noticed that my storage pool was unhealthy. So I decided to do a scrub. During the scrub it shows various drives as degraded and one of the drives as faulted. I'm not sure what to do at this point..

Any help would be greatly appreciated!!

indivision · Aug 30, 2022

Do you have a replacement drive ready?

Based on another case: https://www.truenas.com/community/threads/disks-statuses-faulted-and-degraded.90334/

This seems pretty odd to get errors across that many drives all at once. Maybe a controller is going out? Did you make any HW changes recently? Disrupt cables, etc.?

Demonlinx · Aug 30, 2022

Yes, I do. Should I wait for the scrub to finish before replacing the faulted drive? I also did a massive write of about 4TB to the drives. Would this cause issues like this? This is the first time we've really used this system, and I'm concerned about the stability.

Demonlinx · Aug 30, 2022

This is also not the first time I've seen behavior like this with this system. I had initially thought the problem was the SFF cables between the HBA card and the backplane for the HDDs. I'm not really sure how to diagnose this any further. I'm running this on a DL180 G6.

Demonlinx · Aug 30, 2022

indivision said:
Do you have a replacement drive ready?

Based on another case: https://www.truenas.com/community/threads/disks-statuses-faulted-and-degraded.90334/

This seems pretty odd to get errors across that many drives all at once. Maybe a controller is going out? Did you make any HW changes recently? Disrupt cables, etc.?

No, nothing has changed in the system hardware. The server is secondhand, and the HBA is just off Amazon. The SFF cables were replaced with new cables.

indivision · Aug 30, 2022

I would wait for the scrub to finish to better understand the entire situation.

Which HBA is it?

indivision · Aug 30, 2022

Also what drives?

You say this is the first time you used it? How long has it been on-line? If this happened right away, it suggests something wrong above the drives, imo.

(Side note: with that number of drives in the pool I would recommend Z2.)

Are you able to access the data where it is now and is it backed up or non-critical?

Demonlinx · Aug 30, 2022

indivision said:
I would wait for the scrub to finish to better understand the entire situation.

Which HBA is it?

Amazon.com

Demonlinx · Aug 30, 2022

indivision said:
Also what drives?

You say this is the first time you used it? How long has it been on-line? If this happened right away, it suggests something wrong above the drives, imo.

(Side note: with that number of drives in the pool I would recommend Z2.)

Are you able to access the data where it is now and is it backed up or non-critical?

They are 2TB Hitachi drives. We're thinking about running Z2 but figured a single drive failure was not the end of the world and 2 drive failures happening was highly unlikely; alas here we are.

Data is backed up offsite. However, the 4TB of data that was moved was offloaded from an additional server to get that server ready for an upgrade to TrueNAS as well. So the data is somewhat critical, but ultimately could be recovered.

indivision · Aug 30, 2022

Demonlinx said:
Amazon.com

That should be ok. I have the LSI 9211 and never had trouble with it. But, did you flash it for IT passthrough mode?

Demonlinx said:
They are 2TB Hitachi drives. We're thinking about running Z2 but figured a single drive failure was not the end of the world and 2 drive failures happening was highly unlikely; alas here we are.

The problem is that one of the most likely times of a second drive failure happening is during resilver of another one. Plus, with that many drives, the cost of Z2 over Z1 isn't much.

Demonlinx said:
Data is backed up offsite. However, the 4TB of data that was moved was offloaded from an additional server to get that server ready for an upgrade to TrueNAS as well. So the data is somewhat critical, but ultimately could be recovered.

Ok. For now I would report back here with final results when the scrub is finished. I have a lot of practical experience. But, I think you may need much more technical know-how to figure this one out.

I've never seen a bunch of drives show problems all at once like that. So, it feels like something systemic rather than drive failures. I've had to replace a couple single drives from time to time and it has always just been one with the errors.

Could there be a heat issue? Checked the drive temps during the scrub?

indivision · Aug 30, 2022

The HBA you have covers 8 drives. But you are showing 11 in the pool.

Is there a pattern to which ones show degraded vs online? In other words, are the ones not on the HBA showing online?
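If it helps to spot a pattern, here is a rough shell sketch (not something anyone in this thread ran) that lists only the pool members that are not ONLINE, along with their error counters. The function name pool_problem_devices is made up for illustration, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM table that zpool status prints.

```shell
# Print every entry in the zpool config table that is not ONLINE,
# together with its READ/WRITE/CKSUM error counters.
# Reads `zpool status` text on stdin, so it also works on saved output.
pool_problem_devices() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # A blank line ends the table.
        in_config && NF == 0 { in_config = 0 }
        # Rows with counters have at least five fields; skip healthy ones.
        in_config && NF >= 5 && $2 != "ONLINE" {
            printf "%s %s read=%s write=%s cksum=%s\n", $1, $2, $3, $4, $5
        }
    '
}

# Typical use on the NAS itself:
#   zpool status | pool_problem_devices
```

The pool and vdev rows are printed too when they are DEGRADED, which makes it easy to see which vdev the affected disks belong to.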

Demonlinx · Aug 30, 2022

indivision said:
That should be ok. I have the LSI 9211 and never had trouble with it. But, did you flash it for IT passthrough mode?

The problem is that one of the most likely times of a second drive failure happening is during resilver of another one. Plus, with that many drives, the cost of Z2 over Z1 isn't much.

Ok. For now I would report back here with final results when the scrub is finished. I have a lot of practical experience. But, I think you may need much more technical know-how to figure this one out.

I've never seen a bunch of drives show problems all at once like that. So, it feels like something systemic rather than drive failures. I've had to replace a couple single drives from time to time and it has always just been one with the errors.

Could there be a heat issue? Checked the drive temps during the scrub?

I did not flash either of them for IT passthrough mode. Could this be the cause of the issue? If so, how do I go about flashing them?

We'll look at going to Z2 after we get all of this figured out.

I'll wait for the report of the scrub and then post those results here. Thank you for all of your help!

I don't know that there could be a heat issue. We have the room relatively well monitored and ventilated and have not had any other heat issues. How do I check the temp during the scrub?

Demonlinx · Aug 30, 2022

indivision said:
The HBA you have covers 8 drives. But you are showing 11 in the pool.

Is there a pattern to which ones show degraded vs online? In other words, are the ones not on the HBA showing online?

How can I see which drives are connected to the HBA vs which are not? I don't see any correlation to the drives having issues being on the HBA vs not...

Demonlinx · Aug 30, 2022

indivision said:
That should be ok. I have the LSI 9211 and never had trouble with it. But, did you flash it for IT passthrough mode?

Comments that I saw on the purchase page indicated that it was already flashed for IT passthrough mode. Is there any way for me to tell this from the CLI?

indivision · Aug 30, 2022

Demonlinx said:
I did not flash either of them for IT passthrough mode. Could this be the cause of the issue? If so, how do I go about flashing them?

From this thread it looks like your model should already be in that mode: https://www.truenas.com/community/threads/flashing-it-mode-9207.69510/

But, I would double-check that. Make sure it's in that mode and using the latest firmware.

I'm not certain what the consequences are of not being in that mode. But, it's a possible cause, imo.

Demonlinx said:
I'll wait for the report of the scrub and then post those results here. Thank you for all of your help!

Any time! I feel for your situation. Sorry I don't have a more concrete solution. There should be some others here with some ways to troubleshoot further.

Demonlinx said:
I don't know that there could be a heat issue. We have the room relatively well monitored and ventilated and have not had any other heat issues. How do I check the temp during the scrub?

Never know. Could be fan issues or that server just needs a cooler room than other HW. Best to check the drives directly. You can do this in TrueNAS > Reporting > Disk > Disk Temperature. Select all disks.
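The same numbers can also be read from the shell. A minimal sketch, assuming SATA drives that report temperature as SMART attribute 194 in the smartctl -A attribute table (SAS drives print it differently). drive_temp_c is a hypothetical helper name, and the device names in the comment are placeholders.

```shell
# Extract the raw value of SMART attribute 194 (Temperature_Celsius)
# from `smartctl -A` text on stdin. ATA/SATA attribute table only.
drive_temp_c() {
    # Column 10 of the attribute table is RAW_VALUE.
    awk '$1 == 194 { print $10; exit }'
}

# Typical use (device names are placeholders; adjust to your system):
#   for d in /dev/da0 /dev/da1; do
#       printf '%s: %s C\n' "$d" "$(smartctl -A "$d" | drive_temp_c)"
#   done
```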

Demonlinx said:
How can I see which drives are connected to the HBA vs which are not? I don't see any correlation to the drives having issues being on the HBA vs not...

You would need some kind of label system to know. In my case I've put the last 4 digits from each serial number on the back of the drive. So, I can quickly find it when TrueNAS references it that way.

Otherwise, you'll have to take things apart to be able to see the serial numbers.

indivision · Aug 30, 2022

Demonlinx said:
Comments that I saw on the purchase page indicated that it was already flashed for IT passthrough mode. Is there any way for me to tell this from the CLI?

There could be a way via CLI. But, the way I could find was using a tool on a boot disk: https://kbhost.nl/knowledgebase/flash-lsi-sas-9207-8i-hba-to-it-mode/

Demonlinx · Aug 30, 2022

indivision said:
From this thread it looks like your model should already be in that mode: https://www.truenas.com/community/threads/flashing-it-mode-9207.69510/

But, I would double-check that. Make sure it's in that mode and using the latest firmware.

I'm not certain what the consequences are of not being in that mode. But, it's a possible cause, imo.

I'll check this and see what else I can figure out.

indivision said:
Any time! I feel for your situation. Sorry I don't have a more concrete solution. There should be some others here with some ways to troubleshoot further.

Never know. Could be fan issues or that server just needs a cooler room than other HW. Best to check the drives directly. You can do this in TrueNAS > Reporting > Disk > Disk Temperature. Select all disks.

Mean temperature for the drives seems to be around 32 C so I don't think they are getting too hot...

indivision said:
You would need some kind of label system to know. In my case I've put the last 4 digits from each serial number on the back of the drive. So, I can quickly find it when TrueNAS references it that way.

Otherwise, you'll have to take things apart to be able to see the serial numbers.

This is something I will look into.

Demonlinx · Aug 30, 2022

indivision said:
There could be a way via CLI. But, the way I could find was using a tool on a boot disk: https://kbhost.nl/knowledgebase/flash-lsi-sas-9207-8i-hba-to-it-mode/

I might try this tomorrow if it comes down to it. I just hope this is more soluble than it currently seems.

Redcoat · Aug 30, 2022

sas2flash -list will get you the adapter info and show whether it is in IT or IR mode.
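A small sketch for picking the mode out of that listing, assuming the output contains a "Firmware Product ID" line ending in (IT) or (IR). hba_firmware_mode is just an illustrative name, not part of any tool.

```shell
# Report the firmware personality (IT or IR) from `sas2flash -list`
# text on stdin. Assumes a "Firmware Product ID : 0x... (IT)" style line.
hba_firmware_mode() {
    awk -F: '/Firmware Product ID/ {
        if ($2 ~ /\(IT\)/)      print "IT"
        else if ($2 ~ /\(IR\)/) print "IR"
        else                    print "unknown"
    }'
}

# Typical use:
#   sas2flash -list | hba_firmware_mode
```

With more than one controller installed, sas2flash needs to be pointed at each one in turn, so run the check once per card.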

Alex_K · Aug 30, 2022

The DL180 G6 has an expander on the backplane.
If it's the 12x3.5" model, it should work OK with a SAS2008 in IT mode. I have 4 of these servers that have been running solidly for years.

You can use
sas2ircu 0 display
to show which drive resides in which bay (0 is the controller number; you'll figure it out)
and
sas2ircu locate
to light an LED on a drive caddy of your choice.

It would also be helpful if you posted your entire system config, and the output of
smartctl -a /dev/sd*
for every drive that's faulty, and
mpsutil show all
too.
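When going through the smartctl output, a few attributes are worth pulling out first. A rough sketch, assuming SATA drives and the standard attribute table from smartctl -A; smart_key_attrs is a hypothetical helper name. A climbing UDMA_CRC_Error_Count (199) usually points at the link (cables, backplane) rather than the disk itself, while reallocated or pending sectors (5, 197, 198) point at the disk.

```shell
# Print the SMART attributes most useful for telling a bad disk from a
# bad link, as "ID NAME=RAW_VALUE". Reads `smartctl -A` text on stdin.
smart_key_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 || $1 == 199 {
        printf "%s %s=%s\n", $1, $2, $10   # column 10 is RAW_VALUE
    }'
}

# Typical use (device name is a placeholder):
#   smartctl -A /dev/da3 | smart_key_attrs
```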
