Is my system fully dead?

thorgrim

Dabbler
Joined
Feb 12, 2012
Messages
23
Hi all

I'm not sure if I'm posting in the right place, but I have an old FreeNAS 9.2 install that seems to be failing badly. Don't ask why it hasn't been updated; it's mostly a case of "if it works, don't touch it". The update to the latest TrueNAS stable was planned, but it seems like it won't happen just right now...

Here are the system specs:
- MB: ASRock Rack C236 WSI Mini ITX Server
- CPU: Intel Pentium G4400 Skylake Dual-Core 3.3 GHz
- RAM: 16 GB ECC
- Drives: 5x4TB in RAIDZ2 (mix of Seagate IronWolf and WD Red)

A couple of days ago one of the WD drives failed, and I was in the process of replacing it and the SATA cable. But suddenly my whole system went AWOL and I don't really know why. My system is now unusable: I can't connect to it remotely because it is extremely slow. When connecting directly to the machine, here is the status I get for the pool:

IMG_20210622_082456.jpg IMG_20210622_082517.jpg

All of a sudden it looks like 2 drives are now dead and resilvering at the same time? But the resilvering is not doing anything; it has been stuck at 2.33% since yesterday. I don't really know what to do next or what is most likely failing. My guess is either the SATA cables or the ports on the MB, or maybe both, but I don't have another MB to test with and I don't have 5 cables lying around either.

What would you recommend I do? Can I just unplug the two failing drives and hope to have my NAS back up and running until I get replacement parts? Or should I scrap it all and build a fresh new system (I was looking at the Mini X series the other day...)? What hope do I have of getting my data back? I have recent backups and copies of the most important things, but if I can get it back that would be great, even if just to transfer it to another system.

Thanks for your input!
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Looks like you lost 2 disks from your RAIDZ2 array... yikes!

So you have two alternatives:
  • Replace both failed disks and resilver
  • Replace one failed disk; resilver; replace the other failed disk; resilver
My hunch is the first choice is the better bet. But in either case: pray!

I've never experienced 2 failed disks in a RAIDZ2 pool, so I can't speak to which alternative is better or safer. Perhaps someone with more experience will chime in.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,112
Your system looks fine, but not its drives. If you suspect a bad port, you can try to shuffle cables around since there are 8 ports for 5 drives. But most likely it's the drives themselves. Could it be the dreaded SMR WD Red? That would explain the very slow resilver.

Which drives are failing? What is the output of smartctl -a /dev/adaN (especially on the failed drives, if they still respond)?
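
For anyone who wants to gather that output for every drive in one go, here is a minimal Python sketch. It assumes the five disks show up as /dev/ada0 through /dev/ada4 (a guess; check the actual device names), that smartctl is on the PATH, and that a reasonably recent Python 3 is available on the box. The helper names are hypothetical.

```python
import subprocess


def device_paths(count, prefix="/dev/ada"):
    """Build device paths such as /dev/ada0 .. /dev/ada4 (assumed naming)."""
    return [f"{prefix}{i}" for i in range(count)]


def smartctl_command(device):
    """Argument list for a full SMART report (smartctl -a) on one device."""
    return ["smartctl", "-a", device]


def collect_reports(devices, timeout=60):
    """Run smartctl on each device and return {device: report text}.

    A drive that hangs, or a missing smartctl binary, is recorded as an
    error string instead of aborting the whole loop.
    """
    reports = {}
    for device in devices:
        try:
            result = subprocess.run(
                smartctl_command(device),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            reports[device] = result.stdout
        except (subprocess.TimeoutExpired, OSError) as exc:
            reports[device] = f"ERROR: {exc}"
    return reports
```

Calling collect_reports(device_paths(5)) as root would then return one report per drive. The timeout is there because a dying disk may never answer.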
 

thorgrim

I think it is a case of SMR drives: both failed drives are WD. When I run the smartctl command I get this result (sorry for the pics, I can't connect through SSH):
IMG_20210623_212540.jpg IMG_20210623_212551.jpg IMG_20210623_212600.jpg IMG_20210623_212607.jpg IMG_20210623_212617.jpg
The console also shows lots of messages like these:
IMG_20210623_212401.jpg

I have a spare drive I could throw in to replace one of the failed drives, using a different cable and port on the MB. I just hope this won't kill this brand-new drive. Is it OK if I remove both failed drives and only replace one? Will the resilvering still happen, or will ZFS wait until it has 5 drives again?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,367
Hmmm.

You have two drives missing, so no redundancy, and then checksum errors on one of your remaining disks. So those sectors have been corrupted.
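
The arithmetic behind "no redundancy" can be spelled out in a small Python sketch. The function names are hypothetical, and this is a simplification that only counts drives against the parity level of a single RAIDZ vdev.

```python
def remaining_redundancy(parity, unavailable):
    """How many more drive losses a RAIDZ vdev can absorb.

    parity is 1, 2 or 3 for RAIDZ1/2/3; unavailable is the number of
    member drives currently missing or failed. A negative result means
    more drives are gone than parity can cover.
    """
    return parity - unavailable


def describe(parity, unavailable):
    """Plain-language summary of the vdev's position."""
    left = remaining_redundancy(parity, unavailable)
    if left < 0:
        return "faulted: more drives lost than parity can cover"
    if left == 0:
        return "no redundancy: any further read error is unrecoverable"
    if unavailable == 0:
        return f"healthy: {left} drive(s) may fail"
    return f"degraded: {left} more drive(s) may fail"


# RAIDZ2 with two drives missing, as in this thread
print(describe(parity=2, unavailable=2))
```

With both parity drives' worth of redundancy used up, a checksum error on a third disk has nothing left to be repaired from, which is why corrupted sectors stay corrupted.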

Often you might've been lucky and just lost files, but it seems you have metadata corruption.

I suspect you're looking at recreating your pool.

You have a backup, right?

FWIW, I'd be trying to restart the box and see if the missing disks come back. If they do everything should come good for the moment.
 

Etorix

All SMR WD drives have to go; the sooner, the better. Are there more of those than the two failed drives?
There is no reason that it would kill the new drive, unless it's an SMR drive, in which case it should not be used with ZFS.
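
One rough way to check a drive before trusting it is to compare its model string with the known SMR models. The Python sketch below assumes the WD Red SMR models are the 2 to 6 TB "EFAX" ones, which is my recollection of WD's 2020 disclosure and should be verified against WD's own list. The helper names and the sample lines are illustrative only.

```python
# Assumption: WD Red models believed to be SMR (verify against WD's list).
ASSUMED_SMR_MODELS = ("WD20EFAX", "WD30EFAX", "WD40EFAX", "WD60EFAX")


def device_model(smartctl_info):
    """Extract the 'Device Model' field from `smartctl -i` text, or None."""
    for line in smartctl_info.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Device Model":
            return value.strip()
    return None


def looks_like_smr(model):
    """True if the model string contains one of the assumed SMR models."""
    if model is None:
        return False
    compact = model.replace(" ", "").upper()
    return any(name in compact for name in ASSUMED_SMR_MODELS)


# Illustrative input lines, not taken from the drives in this thread.
print(looks_like_smr(device_model("Device Model:     WDC WD40EFAX-68JH4N0")))
print(looks_like_smr(device_model("Device Model:     WDC WD40EFRX-68N32N0")))
```

A match only says the model is on the assumed list; a miss does not prove that a drive is CMR, so the manufacturer's datasheet remains the reference.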

What's strange is that the drive for which you have provided SMART results (you cannot even use the web console?) shows no errors. (But it has also had only a single SMART test in its life, one thousand hours ago.)
If you have a backup, restoring it is the quickest way out.
If you want to recover this pool, power off the box (possibly take this opportunity to plug in new CMR drives without removing the old ones) and restart, as suggested by @Stux. Replacing with the old drives still in place means that you do not lose redundancy, but also that resilvering will take many days.

ZFS can resilver to a vdev which lacks a drive, and will then resilver to a degraded state, waiting for a fifth drive to be added. But replacing all SMR drives in one go would be better.
 

thorgrim

Yes, I have a somewhat recent backup of the important things, so I can start again if needed. When I restarted the box all the drives came back up, but only for a moment before going back to "removed". Then it tried to resilver onto the removed disks, and now it is basically stuck...

I only had 2 WD SMR drives in the pool; the other ones are Seagate. I have another Seagate spare, and I just ordered 2 new WD CMR drives and new SATA cables.

My plan is as follows: replace both WD drives as soon as the new ones arrive and see if the resilvering proceeds. If all goes "well", I can then replace the Seagate drive with checksum errors if needed. If the resilvering does not go well, I'll just call it a day, do a fresh install of TrueNAS and recreate my pool. The only question I have here is: will I be able to use the same drives, or will I have to order new ones again?

Thanks for all the help!
 

Etorix

If you have to delete the pool, you can put back the known good drives into the new pool. For the Seagate drive with checksum errors it depends on what smartctl -a reports.
In any case, consider scheduling regular SMART tests on the drives.
 

thorgrim

Just a quick follow-up after a couple of days of playing with it all (but just 5 minutes a day, so it takes a looonnnnggggg time to achieve anything... :( )
I had suspicions that my USB stick was going bad, so I tested it. Doing so, it got fully erased and is now officially dead. I had a backup of my configuration so no big deal, but as I was already "okay" with losing it all, I went the opposite direction and decided to install a fresh copy of the latest TrueNAS image directly. After some fiddling I found an unused SSD to upgrade from the USB stick and went on with the install.
After doing some very basic setup (IP address, redirect to HTTPS and stuff like that), I looked into how to import my pool. After some reading and searching, multiple tries and some playing with my HDDs in the chassis, the zpool import finally went through. I got 4 drives back ONLINE, which gives me access to all my data and still leaves one drive of redundancy. Now I plan to go on with the replacement of the failed drive and of the other SMR drive I still have in my pool.
I also plan on testing all the drives, but I don't really know what to look for in the smartctl output. If you could give me some pointers on what is important to look for, that would help: I think I have 15 or 20 drives from previous capacity upgrades and failed drives that I could also test, and maybe take the good ones to create another backup/replication pool (like most of us I have "some" spare parts lying around...).
 

Etorix

Short answer: Look at SMART IDs 5 and 196-200.
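
As a rough illustration of the short answer, the Python sketch below scans the attribute table printed by smartctl -A (or the same section of smartctl -a) and reports any of those IDs with a non-zero raw value. The function name is hypothetical and the sample table contains made-up values for demonstration, not readings from the drives in this thread.

```python
import re

# IDs named in the short answer: 5 and 196 through 200.
WATCHED_IDS = {5, 196, 197, 198, 199, 200}


def flag_attributes(smartctl_output):
    """Return {id: (name, raw)} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        # RAW_VALUE is the 10th column; keep only its leading integer.
        match = re.match(r"\d+", fields[9])
        raw = int(match.group()) if match else 0
        if raw != 0:
            flagged[attr_id] = (fields[1], raw)
    return flagged


# Made-up sample in the usual smartctl table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28910
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
"""

print(flag_attributes(SAMPLE))
```

A non-zero count on 5, 196, 197 or 198 points at the disk surface, while a growing 199 (UDMA CRC errors) more often points at the cable or port. An empty result does not prove that a drive is healthy, especially if it has never run a long self-test.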

Long answer:
 

thorgrim

Thanks! Really impressive guide!
 