3 bad drives, all from the same vdev? Please Help!

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
So I have a 48-drive server with 3 vdevs: 16x6TB, 16x6TB, and 16x10TB. I got errors indicating that one of the 10TB drives had failed, so I replaced it and rebuilt. Last night I decided to boot up my server and start a scrub, since the last time I ran one, the power momentarily went out while I was transferring files. This morning, while the scrub was still in progress, I woke up to see that another 10TB drive had failed.

20220110_075001.jpg


And just now I saw a bunch of text on the monitor that scrolled by faster than I could take a picture of it. BTW, I don't know how to scroll up, or whether all this info is available in the web GUI. Because I have never had two drives fail within a single day in the same array, I canceled the scrub so as not to put any more wear on these drives until I can figure out WTF is going on.

20220110_142557.jpg


This is what appeared after the second drive failed. What's with the encryption stuff? Is my array infected with a virus?

All 3 drives are 10TB WD drives shucked from Easystores and all purchased at the same time. All 3 drives were in the same vdev. All of my 6TB drives seem to have no issues.

fails.png


So my question is: are these legitimately failed drives, or could something else be going on? Is there any way to re-insert them into the array and run a SMART test on them?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
What's with the encryption stuff? Is my array infected with a virus?
That's not a virus... it refers to the swap partition that was apparently live on one or more of those bad disks(*), so the system switched to another one. Swap is always encrypted, so that's what you're seeing.

So my question is: are these legitimately failed drives, or could something else be going on?
* these drives may not be bad, but you'll need to do some checking to understand if that's the case...

It could be the cabling or backplane/ports which are going bad.

Is there any way to re-insert them into the array and run a SMART test on them?
I would reverse that... SMART testing first.

You may be able to identify the disks with glabel status, which gives you the disk identifiers for the gptids.

You could then run smartctl -a /dev/da33 (as an example... I'm guessing the first one might be da33).
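To illustrate the mapping step above, here is a minimal, self-contained sketch. The `glabel status` output, gptids, and device names below are invented for illustration; on the real server you would pipe the actual command instead of the hardcoded sample:

```shell
# Hypothetical sample of `glabel status` output; real gptids will differ.
glabel_sample='                                      Name  Status  Components
gptid/4c5c9f02-1111-2222-3333-444455556666     N/A  da33p2
gptid/8a1b2c3d-aaaa-bbbb-cccc-ddddeeeeffff     N/A  da34p2'

# Pick the label reported for the suspect disk in zpool status:
target='gptid/4c5c9f02-1111-2222-3333-444455556666'

# Find its line, take the Components column, strip the partition suffix:
dev=$(printf '%s\n' "$glabel_sample" \
      | awk -v t="$target" '$1 == t {print $3}' \
      | sed 's/p[0-9]*$//')
echo "device: $dev"

# On the real server you would then run (not executed here):
# smartctl -a /dev/$dev
```

On the live system, replace the hardcoded sample with the real `glabel status` output, then check attributes like Reallocated_Sector_Ct and Current_Pending_Sector in the smartctl report.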
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ouch, 16-wide vdev? That's not good for performance... Or reliability... Or anything, really.
That said, with RAIDZ3 you can tolerate three failures, so even your scenario is survivable.

As for the possibly bad drives, definitely try attaching them some other way (SATA, USB/SATA bridge, ...) to have a look at the SMART data and ideally run some tests.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
A hardware list would help, so we know what we are dealing with.
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
516
Also check the firmware of your HBA, especially if you use an LSI HBA.

Last year I had random errors on my fresh new JBOD, until I realized that my just-bought LSI 9300-8e had five-year-old firmware.

After upgrading to the latest version, all the errors disappeared.

 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
* these drives may not be bad, but you'll need to do some checking to understand if that's the case...

It could be the cabling or backplane/ports which are going bad.

Any idea how I go about testing the backplanes or ports if the drives' SMART stats come back OK?

I would reverse that... SMART testing first.

You may be able to identify the disks with glabel status, which gives you the disk identifiers for the gptids.

You could then run smartctl -a /dev/da33 (as an example... I'm guessing the first one might be da33).

I have of course since shut down my server. When I boot up my server, will those 2 drives be back in the array, or will they still have a "REMOVED" status? Should I even start it up again?

That said, with RAIDZ3 you can tolerate three failures, so even your scenario is survivable.

As for the possibly bad drives, definitely try attaching them some other way (SATA, USB/SATA bridge, ...) to have a look at the SMART data and ideally run some tests.

I've never had more than one drive failure at any one time, and I'm really scared. I don't have a SATA dock or anything. Should I just boot up the server and run the SMART tests in FreeNAS, or can I/should I put them into a Windows PC and run the tests there? How would I even run a SMART test in Windows? The same commands as in FreeNAS?

A hardware list would help, so we know what we are dealing with.

Chenbro.png


Also check the firmware of your HBA, especially if you use an LSI HBA.

Last year I had random errors on my fresh new JBOD, until I realized that my just-bought LSI 9300-8e had five-year-old firmware.

Oof... I'm assuming there are ways/commands in FreeNAS to do that? I wouldn't even know where to begin to check or update hardware firmware.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Any idea how I go about testing the backplanes or ports if the drives' SMART stats come back OK?
Depends on your options. With free ports available, move the disks to free ports and see the results. Without free ports, it's riskier, since you would need to move known-good disks out of the pool to swap with the non-working ones. As long as nothing is accessing the pool, that should be OK, and the pool will work itself out when you put the disks back.

Ideally, you would have an entirely new server to test it in. I guess not though.
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
516
To check the firmware version:

For an LSI 93xx, use
Code:
sas3flash -listall


For an LSI 92xx like yours, use
Code:
sas2flash -listall
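As a sketch of what to look for in that listing, the snippet below parses a hardcoded sample of `sas2flash -listall` output (the adapter and version numbers are illustrative, not from this server) and compares the firmware against the P20 release (20.00.07.00), the version commonly recommended for SAS2 HBAs on FreeNAS:

```shell
# Hypothetical `sas2flash -listall` output; all values are made up.
listall_sample='Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------
0     SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:05:00:00'

# Pull the FW Ver column for controller 0:
fw=$(printf '%s\n' "$listall_sample" | awk '$1 == "0" {print $3}')
echo "firmware: $fw"

# P20 (20.00.07.00) is the release generally recommended for SAS2 HBAs:
if [ "$fw" = "20.00.07.00" ]; then
    echo "OK: P20"
else
    echo "Consider updating to P20"
fi
```

On the live system, pipe the real `sas2flash -listall` output instead of the hardcoded sample.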
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Depends on your options. With free ports available, move the disks to free ports and see the results. Without free ports, it's riskier, since you would need to move known-good disks out of the pool to swap with the non-working ones. As long as nothing is accessing the pool, that should be OK, and the pool will work itself out when you put the disks back.

Ideally, you would have an entirely new server to test it in. I guess not though.

There are no free ports, and I am not comfortable moving known-good drives to what could be a bad port. I do also have other complete but empty servers I could test on, but they are a different brand with a different hardware config and different backplanes.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Do you have a backup?
Do you know which physical disks are in which vdev?
I would remove the ethernet cables to prevent anyone from trying to work on the box, then start up.
Can you do a zpool status "Pool Name" and post that in code tags, please?
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Do you have a backup?
Do you know which physical disks are in which vdev?
I would remove the ethernet cables to prevent anyone from trying to work on the box, then start up.
Can you do a zpool status "Pool Name" and post that in code tags, please?

No Backups

Yes I know.

Not an issue.

Ok but is it safe to boot it up? Will those "removed" drives no longer be in that status?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
It's as safe as it can be. If the pool isn't there, it will tell you, and frankly you are going to have to start up at some point anyway.

With ZFS you can move disks around. The /dev/??? name will change, but the GPTID won't, and ZFS works from the GPTID, so you can shuffle disks and (assuming no physical issues) the vdevs and pool will be fine. Of course, you do have an issue: you need to work out whether the fault is with the disks or with the hardware (backplane, power, etc.). Frankly, the best way to do this is to pull the suspect disks and swap them with known-good disks (one at a time, not all at the same time) and see whether the faults stay put or follow the disks. Failing that, pull one disk and put it into another FreeBSD/Linux machine and run SMART tests on it (I am assuming that "removed" means you can no longer see these disks, so you can't run SMART locally). Don't use Windows, which has a nasty habit of deciding you really wanted to format that disk it doesn't understand.

What sort of data is on this pool? Can it be recreated? Can it be backed up (might take a while)?
The hardware is old but corporate - is this home or business?
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Its as safe as it can be. If the pool isn't there it will tell you and frankly you are going to have to start up at some point anyway.

With ZFS you can move disks around. The /dev/??? name will change, but the GPTID won't, and ZFS works from the GPTID, so you can shuffle disks and (assuming no physical issues) the vdevs and pool will be fine. Of course, you do have an issue: you need to work out whether the fault is with the disks or with the hardware (backplane, power, etc.). Frankly, the best way to do this is to pull the suspect disks and swap them with known-good disks (one at a time, not all at the same time) and see whether the faults stay put or follow the disks. Failing that, pull one disk and put it into another FreeBSD/Linux machine and run SMART tests on it (I am assuming that "removed" means you can no longer see these disks, so you can't run SMART locally). Don't use Windows, which has a nasty habit of deciding you really wanted to format that disk it doesn't understand.

What sort of data is on this pool? Can it be recreated? Can it be backed up (might take a while)?
The hardware is old but corporate - is this home or business?

I didn't know disks can be moved around and still work. Can anyone else here confirm this???

If needed, I can install FreeNAS on an empty server and run tests. The data is a bit of everything, and I suppose with enough time it can be replaced. It's about 80TB, and I would need to populate a new server to back it up. This is a home machine for personal stuff.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I didn't know disks can be moved around and still work. Can anyone else here confirm this???
Yes. In fact, it's true of any vaguely usable RAID solution, from Intel fakeRAID to the most vendor-locked RAID controller. ZFS, being better than any of those at all things management-related, is no exception.
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Yes. In fact, it's true of any vaguely usable RAID solution, from Intel fakeRAID to the most vendor-locked RAID controller. ZFS, being better than any of those at all things management-related, is no exception.

In your opinion, what is the safest way forward? Should I take all three "failed" drives, put them in an empty FreeNAS machine, and test them there first to determine next steps? Putting them into another server isn't going to modify the ZFS pool data to the point where they will be rejected by the array they belong to, right?

@NugentS mentions that drives are identified by the GPTID, but what is that? Is it a unique ID generated by FreeNAS when the pool is created and written to the drive, or is it like a hash based on the drive's serial number?

I know I will eventually have to boot up the problem server but I am really worried about losing another drive.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I thought you said you'd replaced the drives?
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
I thought you said you'd replaced the drives?

No, the first drive that failed (the first time it happened) was replaced; then, within 24 hours, 2 more drives in the same vdev failed. I have since shut down the server and done nothing with it, out of caution, pending the advice of you experts here on this board.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
OK, in that case I'd suggest taking out all your current drives and putting back in the one you replaced and testing it, ideally in one of the slots not associated with failing drives. If it's suddenly good, that's an indication of something in the backplane/cables/whatever.
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
OK, in that case I'd suggest taking out all your current drives and putting back in the one you replaced and testing it, ideally in one of the slots not associated with failing drives. If it's suddenly good, that's an indication of something in the backplane/cables/whatever.

Damn... has it really been 9 months...

So I have not done anything with this server. It has just been sitting dead the entire time but I feel like it is finally time to deal with this issue. I have been backing up my data to MEGA as a temp solution but considering my low upload speeds, this cannot continue much longer.

So I have re-read this thread and your advice, and the plan is to pull all drives, reinsert the drive that has already been replaced into a known-good slot on a different backplane, and run tests, correct? Like just a SMART test, or is there some other kind of test I should run? What about the other two drives that were ejected from the array? Should I insert those into known-good slots and run SMART on those, and if they are good, how do I get FreeNAS to put them back into the array?

If these drives check out OK, then I guess it must be a backplane issue, in which case I'll have to take it apart and find a replacement. I may as well replace those cables for good measure as well.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Like just a SMART test or is there some other kind of test I should run?
A SMART test won’t do too much good in most backplane failure modes. There’s a read-only test script in the Resources section, by @jgreco.

Should I insert those into known good slots and run SMART on those and if they are good, how to I get freeNAS to put them back into the array?
Won’t hurt. You might need to run zpool clear to have ZFS take the disks back, if it doesn’t do so automatically.
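As an illustration of that last step, here is a self-contained sketch that scans a hardcoded sample of `zpool status` output (the pool name and gptids are invented) for members not reporting ONLINE; on the real server you would run `zpool status` itself, and then `zpool clear <pool>` once the disks check out:

```shell
# Hypothetical `zpool status` fragment; names and gptids are made up.
status_sample='        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz3-2                                      DEGRADED     0     0     0
            gptid/4c5c9f02-1111-2222-3333-444455556666  ONLINE       0     0     0
            gptid/8a1b2c3d-aaaa-bbbb-cccc-ddddeeeeffff  REMOVED      0     0     0'

# List pool members whose state is not ONLINE:
bad=$(printf '%s\n' "$status_sample" \
      | awk '$2 ~ /^(REMOVED|FAULTED|UNAVAIL)$/ {print $1}')
echo "not online: $bad"

# Once the disks test good and are reseated, clear the errors so ZFS
# takes them back (not executed here):
# zpool clear tank
```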
 