3 bad drives, all from the same vdev? Please Help!

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
So I have a 48-drive server with 3 vdevs: 16x6TB, 16x6TB, and 16x10TB. I got errors indicating that one of the 10TB drives had failed, so I replaced it and rebuilt. Last night I decided to boot up my server and start a scrub, since the last time I ran one, the power momentarily went out while I was transferring files. This morning, while the scrub was still in progress, I woke up to see that another 10TB drive had failed.

20220110_075001.jpg


And just now I saw a bunch of text on the monitor that scrolled by faster than I could take a picture of it. BTW, I don't know how to scroll up, or whether all this info is available in the web GUI. Because I have never had two drives fail within a single day in the same array, I canceled the scrub so as not to put any more wear on these drives until I can figure out WTF is going on.

20220110_142557.jpg


This is what appeared after the second drive failed. What's with the encryption stuff? Is my array infected with a virus?

All 3 drives are 10TB WD drives shucked from Easystores and all purchased at the same time. All 3 drives were in the same vdev. All of my 6TB drives seem to have no issues.

fails.png


So my question is: are these legitimately failed drives, or could something else be going on? Is there any way to re-insert them into the array and run a SMART test on them?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
What's with the encryption stuff? Is my array infected with a virus?
That's not a virus... it refers to the swap partition that was apparently live on one or more of those bad disks(*), so the system switched to another one. Swap is always encrypted, so that's what you're seeing.

So my question is: are these legitimately failed drives, or could something else be going on?
* these drives may not be bad, but you'll need to do some checking to understand if that's the case...

It could be the cabling or backplane/ports which are going bad.

Is there any way to re-insert them into the array and run a SMART test on them?
I would reverse that... SMART testing first.

You may be able to identify the disks with glabel status, which gives you the disk identifiers for the gptids.

You could then run smartctl -a /dev/da33 (as an example... I'm guessing the first one might be da33).
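To illustrate the mapping step above, here is a minimal, self-contained sketch. The `glabel status` output, gptids, and device names below are invented for illustration; on the real server you would pipe the actual command instead of the hardcoded sample:

```shell
# Hypothetical sample of `glabel status` output; real gptids will differ.
glabel_sample='                                      Name  Status  Components
gptid/4c5c9f02-1111-2222-3333-444455556666     N/A  da33p2
gptid/8a1b2c3d-aaaa-bbbb-cccc-ddddeeeeffff     N/A  da34p2'

# Pick the label reported for the suspect disk in zpool status:
target='gptid/4c5c9f02-1111-2222-3333-444455556666'

# Find its line, take the Components column, strip the partition suffix:
dev=$(printf '%s\n' "$glabel_sample" \
      | awk -v t="$target" '$1 == t {print $3}' \
      | sed 's/p[0-9]*$//')
echo "device: $dev"

# On the real server you would then run (not executed here):
# smartctl -a /dev/$dev
```

On the live system, replace the hardcoded sample with the real `glabel status` output, then check attributes like Reallocated_Sector_Ct and Current_Pending_Sector in the smartctl report.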
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ouch, 16-wide vdev? That's not good for performance... Or reliability... Or anything, really.
That said, with RAIDZ3 you can tolerate three failures, so even your scenario is survivable.

As for the possibly bad drives, definitely try attaching them some other way (SATA, USB/SATA bridge, ...) to have a look at the SMART data and ideally run some tests.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
A hardware list would help, so we know what we are dealing with.
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
516
Also check the firmware of your HBA, especially if you use an LSI HBA.

Last year I had random errors on my fresh new JBOD, until I realized that my just-bought LSI 9300-8e had five-year-old firmware.

After upgrading to the latest version, all the errors disappeared.

 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
* these drives may not be bad, but you'll need to do some checking to understand if that's the case...

It could be the cabling or backplane/ports which are going bad.

Any idea how I go about testing the backplanes or ports if the drives' SMART stats come back OK?

I would reverse that... SMART testing first.

You may be able to identify the disks with glabel status, which gives you the disk identifiers for the gptids.

You could then run smartctl -a /dev/da33 (as an example... I'm guessing the first one might be da33).

I have of course since shut down my server. When I boot up my server, will those 2 drives be back in the array, or will they still have a "REMOVED" status? Should I even start it up again?

That said, with RAIDZ3 you can tolerate three failures, so even your scenario is survivable.

As for the possibly bad drives, definitely try attaching them some other way (SATA, USB/SATA bridge, ...) to have a look at the SMART data and ideally run some tests.

I've never had more than one drive failure at any one time, and I'm really scared. I don't have a SATA dock or anything. Should I just boot up the server and run the SMART tests in FreeNAS, or can I/should I put them into a Windows PC and run the tests there? How would I even run a SMART test in Windows? The same commands as in FreeNAS?

A hardware list would help, so we know what we are dealing with.

Chenbro.png


Also check the firmware of your HBA, especially if you use an LSI HBA.

Last year I had random errors on my fresh new JBOD, until I realized that my just-bought LSI 9300-8e had five-year-old firmware.

Oof... I'm assuming there are ways/commands in FreeNAS to do that? I wouldn't even know where to begin to check or update hardware firmware.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Any idea how I go about testing the backplanes or ports if the drives' SMART stats come back OK?
Depends on your options. With free ports available, move the disks to free ports and see the results. Without free ports, it's riskier, since you would need to move known-good disks out of the pool to swap with the non-working ones. As long as nothing is accessing the pool, that should be OK, and the pool will work itself out when you put the disks back.

Ideally, you would have an entirely new server to test it in. I guess not though.
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
516
To check the firmware version:

For an LSI 93xx, use
Code:
sas3flash -listall


For an LSI 92xx like yours, use
Code:
sas2flash -listall
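As a sketch of what to look for in that listing, the snippet below parses a hardcoded sample of `sas2flash -listall` output (the adapter and version numbers are illustrative, not from this server) and compares the firmware against the P20 release (20.00.07.00), the version commonly recommended for SAS2 HBAs on FreeNAS:

```shell
# Hypothetical `sas2flash -listall` output; all values are made up.
listall_sample='Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------
0     SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:05:00:00'

# Pull the FW Ver column for controller 0:
fw=$(printf '%s\n' "$listall_sample" | awk '$1 == "0" {print $3}')
echo "firmware: $fw"

# P20 (20.00.07.00) is the release generally recommended for SAS2 HBAs:
if [ "$fw" = "20.00.07.00" ]; then
    echo "OK: P20"
else
    echo "Consider updating to P20"
fi
```

On the live system, pipe the real `sas2flash -listall` output instead of the hardcoded sample.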
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Depends on your options. With free ports available, move the disks to free ports and see the results. Without free ports, it's riskier, since you would need to move known-good disks out of the pool to swap with the non-working ones. As long as nothing is accessing the pool, that should be OK, and the pool will work itself out when you put the disks back.

Ideally, you would have an entirely new server to test it in. I guess not though.

There are no free ports, and I am not comfortable moving known-good drives to what could be a bad port. I do also have other complete but empty servers I could test on, but they are a different brand with a different hardware config and different backplanes.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Do you have a backup?
Do you know which physical disks are in which vdev?
I would remove the ethernet cables to prevent anyone from trying to work on the box, then start up.
Can you do a zpool status "Pool Name" and post that in code tags, please?
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Do you have a backup?
Do you know which physical disks are in which vdev?
I would remove the ethernet cables to prevent anyone from trying to work on the box, then start up.
Can you do a zpool status "Pool Name" and post that in code tags, please?

No Backups

Yes I know.

Not an issue.

Ok but is it safe to boot it up? Will those "removed" drives no longer be in that status?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
It's as safe as it can be. If the pool isn't there, it will tell you, and frankly you are going to have to start up at some point anyway.

With ZFS you can move disks around. The /dev/??? name will change, but the GPTID won't, and ZFS works from the GPTID, so you can shuffle disks and (assuming no physical issues) the vdevs and pool will be fine. Of course, you do have an issue: you need to work out whether the fault is with the disks or with the hardware (backplane, power, etc.). Frankly, the best way to do this is to pull the suspect disks and swap them with known-good disks (one at a time, not all at the same time) and see whether the faults stay put or follow the disks. Failing that, pull one disk and put it into another FreeBSD/Linux machine and run SMART tests on it (I am assuming that "removed" means you can no longer see these disks, so you can't run SMART locally). Don't use Windows, which has a nasty habit of deciding you really wanted to format that disk it doesn't understand.

What sort of data is on this pool? Can it be recreated? Can it be backed up (might take a while)?
The hardware is old but corporate - is this home or business?
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Its as safe as it can be. If the pool isn't there it will tell you and frankly you are going to have to start up at some point anyway.

With ZFS you can move disks around. The /dev/??? name will change, but the GPTID won't, and ZFS works from the GPTID, so you can shuffle disks and (assuming no physical issues) the vdevs and pool will be fine. Of course, you do have an issue: you need to work out whether the fault is with the disks or with the hardware (backplane, power, etc.). Frankly, the best way to do this is to pull the suspect disks and swap them with known-good disks (one at a time, not all at the same time) and see whether the faults stay put or follow the disks. Failing that, pull one disk and put it into another FreeBSD/Linux machine and run SMART tests on it (I am assuming that "removed" means you can no longer see these disks, so you can't run SMART locally). Don't use Windows, which has a nasty habit of deciding you really wanted to format that disk it doesn't understand.

What sort of data is on this pool? Can it be recreated? Can it be backed up (might take a while)?
The hardware is old but corporate - is this home or business?

I didn't know disks can be moved around and still work. Can anyone else here confirm this???

If needed, I can install FreeNAS on an empty server and run tests. The data is a bit of everything, and I suppose with enough time it can be replaced. It's about 80TB, and I would need to populate a new server to back it up. This is a home machine for personal stuff.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I didn't know disks can be moved around and still work. Can anyone else here confirm this???
Yes. In fact, it's true of any vaguely usable RAID solution, from Intel fakeRAID to the most vendor-locked RAID controller. ZFS, being better than any of those at all things management-related, is no exception.
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Yes. In fact, it's true of any vaguely usable RAID solution, from Intel fakeRAID to the most vendor-locked RAID controller. ZFS, being better than any of those at all things management-related, is no exception.

In your opinion, what is the safest way forward? Should I take all three "failed" drives, put them in an empty FreeNAS machine, and test them there first to determine next steps? Putting them into another server isn't going to modify the ZFS pool data to the point where they will be rejected by the array they belong to, right?

@NugentS mentions that drives are identified by the GPTID, but what is that? Is it a unique ID generated by FreeNAS when the pool is created and written to the drive, or is it like a hash based on the drive's serial number?

I know I will eventually have to boot up the problem server but I am really worried about losing another drive.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I thought you said you'd replaced the drives?
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
I thought you said you'd replaced the drives?

No, the first drive that failed (the first time it happened) was replaced; then, within 24 hours, 2 more drives in the same vdev failed. I have since shut down the server and done nothing with it, out of caution, pending the advice of you experts here on this board.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
OK, in that case I'd suggest taking out all your current drives and putting back in the one you replaced and testing it, ideally in one of the slots not associated with failing drives. If it's suddenly good, that's an indication of something in the backplane/cables/whatever.
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
OK, in that case I'd suggest taking out all your current drives and putting back in the one you replaced and testing it, ideally in one of the slots not associated with failing drives. If it's suddenly good, that's an indication of something in the backplane/cables/whatever.

Damn... has it really been 9 months...

So I have not done anything with this server. It has just been sitting dead the entire time but I feel like it is finally time to deal with this issue. I have been backing up my data to MEGA as a temp solution but considering my low upload speeds, this cannot continue much longer.

So I have re-read this thread and your advice, and the plan is to pull all drives, reinsert the drive that has already been replaced into a known-good slot on a different backplane, and run tests, correct? Like just a SMART test, or is there some other kind of test I should run? What about the other two drives that were ejected from the array? Should I insert those into known-good slots and run SMART on those, and if they are good, how do I get FreeNAS to put them back into the array?

If these drives check out OK, then I guess it must be a backplane issue, in which case I'll have to take it apart and find a replacement. I may as well replace those cables for good measure as well.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Like just a SMART test or is there some other kind of test I should run?
A SMART test won’t do too much good in most backplane failure modes. There’s a read-only test script in the Resources section, by @jgreco.

Should I insert those into known good slots and run SMART on those and if they are good, how to I get freeNAS to put them back into the array?
Won’t hurt. You might need to run zpool clear to have ZFS take the disks back, if it doesn’t do so automatically.
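As an illustration of that last step, here is a self-contained sketch that scans a hardcoded sample of `zpool status` output (the pool name and gptids are invented) for members not reporting ONLINE; on the real server you would run `zpool status` itself, and then `zpool clear <pool>` once the disks check out:

```shell
# Hypothetical `zpool status` fragment; names and gptids are made up.
status_sample='        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz3-2                                      DEGRADED     0     0     0
            gptid/4c5c9f02-1111-2222-3333-444455556666  ONLINE       0     0     0
            gptid/8a1b2c3d-aaaa-bbbb-cccc-ddddeeeeffff  REMOVED      0     0     0'

# List pool members whose state is not ONLINE:
bad=$(printf '%s\n' "$status_sample" \
      | awk '$2 ~ /^(REMOVED|FAULTED|UNAVAIL)$/ {print $1}')
echo "not online: $bad"

# Once the disks test good and are reseated, clear the errors so ZFS
# takes them back (not executed here):
# zpool clear tank
```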
 