I noticed that my storage pool was unhealthy, so I decided to do a scrub. During the scrub it showed various drives as degraded and one drive as faulted. I'm not sure what to do at this point.
This seems pretty odd to get errors across that many drives all at once. Maybe a controller is going out? Did you make any HW changes recently? Disrupt cables, etc.?
Yes, I did. Should I wait for the scrub to finish before replacing the faulted drive? I also did a massive write of about 4TB to the drives. Would this cause issues like this? This is the first time we've really used this system, and I'm concerned about its stability.
This is also not the first time I've seen behavior like this with this system. I initially thought it was an issue with the SFF cables between the HBA card and the backplane for the HDDs. I'm not really sure how to diagnose this further. I'm running this on a DL180 G6.
No, nothing has changed in the system hardware. The server is secondhand, and the HBA is just off Amazon. The SFF cables were replaced with new cables.
You say this is the first time you used it? How long has it been on-line? If this happened right away, it suggests something wrong above the drives, imo.
(Side note: with that number of drives in the pool I would recommend Z2.)
Are you able to access the data where it is now and is it backed up or non-critical?
They are 2TB Hitachi drives. We thought about running Z2 but figured a single drive failure was not the end of the world and two drives failing at once was highly unlikely; alas, here we are.
Data is backed up offsite; however, the 4TB that was moved was offloaded from another server to get that server ready for an upgrade to TrueNAS as well. So the data is somewhat critical, but ultimately could be recovered.
The problem is that one of the most likely times for a second drive to fail is during the resilver after the first failure. Plus, with that many drives, the extra cost of Z2 over Z1 isn't much.
OK. For now, I would report back here with the final results when the scrub is finished. I have a lot of practical experience, but I think you may need someone with more technical know-how to figure this one out.
I've never seen a bunch of drives show problems all at once like that, so it feels like something systemic rather than drive failures. I've had to replace the occasional single drive over the years, and it has always been just the one showing errors.
Could there be a heat issue? Have you checked the drive temps during the scrub?
That should be ok. I have the LSI 9211 and never had trouble with it. But, did you flash it for IT passthrough mode?
I did not flash either of them for IT passthrough mode. Could this be the cause of the issue? If so, how do I go about flashing them?
We'll look at going to Z2 after we get all of this figured out.
I'll wait for the report of the scrub and then post those results here. Thank you for all of your help!
I don't know that there could be a heat issue. We have the room relatively well monitored and ventilated and have not had any other heat issues. How do I check the temp during the scrub?
How can I see which drives are connected to the HBA and which are not? I don't see any correlation between the drives having issues and whether they're on the HBA...
Comments that I saw on the purchase page indicated that it was already flashed for IT passthrough mode. Is there any way for me to tell this from the CLI?
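For what it's worth, one way to check from the CLI - a sketch, assuming the card is an LSI SAS2 HBA and that Broadcom's sas2flash utility is installed (tool availability is an assumption about your setup):

```shell
# Sketch: report whether the LSI firmware is IT (passthrough) or IR (RAID).
# Assumes Broadcom's sas2flash tool is on the PATH; "-list" prints a
# controller summary whose "Firmware Product ID" line ends in (IT) or (IR).
if command -v sas2flash >/dev/null 2>&1; then
  fw_line=$(sas2flash -list | grep -i 'Firmware Product ID')
else
  fw_line="sas2flash not found; install Broadcom's sas2flash utility first"
fi
echo "$fw_line"
```

An `(IT)` suffix on that line means passthrough firmware; `(IR)` means the RAID firmware is still on the card.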
Any time! I feel for your situation. Sorry I don't have a more concrete solution. There should be some others here with some ways to troubleshoot further.
Never know. Could be fan issues or that server just needs a cooler room than other HW. Best to check the drives directly. You can do this in TrueNAS > Reporting > Disk > Disk Temperature. Select all disks.
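If you prefer the shell over the GUI, something like this works - a sketch that assumes TrueNAS CORE device names (`/dev/da*` for HBA disks, `/dev/ada*` for onboard SATA) and that smartctl is installed:

```shell
# Sketch: print the SMART temperature attribute for every disk.
# The device name patterns are an assumption (TrueNAS CORE / FreeBSD);
# on SCALE the disks would be /dev/sd* instead.
checked=0
for d in /dev/da[0-9]* /dev/ada[0-9]*; do
  [ -e "$d" ] || continue   # skip the literal pattern if no disks match
  echo "== $d =="
  smartctl -A "$d" | grep -i temperature
  checked=$((checked + 1))
done
echo "checked $checked disk(s)"
```

Run it once before the scrub and again mid-scrub to see how much the drives heat up under load.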
You would need some kind of label system to know. In my case I've put the last 4 digits from each serial number on the back of the drive. So, I can quickly find it when TrueNAS references it that way.
Otherwise, you'll have to take things apart to be able to see the serial numbers.
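On TrueNAS CORE you can also get a rough drive-to-controller mapping from the CLI without pulling drives - a sketch using FreeBSD's camcontrol (so it won't work on SCALE); disks hanging off the HBA show up on a different scbus than the onboard ports:

```shell
# Sketch: list every disk along with the bus/controller it sits on.
# camcontrol is FreeBSD-only (TrueNAS CORE); the scbus number groups
# drives by controller, which separates HBA disks from onboard ones.
if command -v camcontrol >/dev/null 2>&1; then
  topo=$(camcontrol devlist)
else
  topo="camcontrol not available here (FreeBSD/TrueNAS CORE only)"
fi
echo "$topo"
```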
The DL180 G6 has an expander on the backplane.
If it's the 12×3.5" model, it should work OK with a SAS2008 in IT mode. I have 4 of these servers that have been working solid for years.
You can use
sas2ircu 0 display
to show which drive resides in which bay (0 is the controller number - you'll figure it out)
and
sas2ircu locate
to light an LED on a drive caddy of your choice.
It would also be helpful if you posted your entire system config, and the output of
smartctl -a /dev/sd*
for every drive that's faulty, and
mpsutil show all
too.
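To gather all of that in one pass, a throwaway script along these lines would do it - a sketch where the device glob and the /tmp output path are assumptions, and each tool is simply skipped if it isn't installed:

```shell
# Sketch: collect SMART data for every disk plus the HBA topology into
# one report file. /dev/da* assumes TrueNAS CORE; use /dev/sd* on SCALE.
report=/tmp/hba_report.txt
: > "$report"                      # start with an empty file
for d in /dev/da[0-9]*; do
  [ -e "$d" ] || continue          # skip the literal glob if nothing matches
  { echo "===== $d ====="; smartctl -a "$d"; } >> "$report"
done
if command -v sas2ircu >/dev/null 2>&1; then
  sas2ircu 0 display >> "$report"  # bay-to-drive mapping
fi
if command -v mpsutil >/dev/null 2>&1; then
  mpsutil show all >> "$report"    # driver's view of the controller
fi
echo "wrote $report"
```

Then paste (or attach) the resulting file in your next post.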