Unable to mount pool (5 devices failed to decrypt)

SRT4JRE

Dabbler
Joined
Sep 25, 2021
Messages
12
Hi,

I have a 12-disk Z2 pool made up of 6 Seagate IronWolf 4TB and 6 WD Red 4TB drives. I have a scheduled scrub every 2 weeks, along with frequent long and short SMART tests. No warning was ever given before this issue happened. I had noticed a scheduled scrub happening at the time a single drive went offline; no other drives dropped out at this time, but I had to abort the scrub shortly after by shutting down the system through the web GUI. Normally this system is up 24/7, but I wasn't going to have access to the system for the next few days and I didn't want to leave it running in the state it was in.

I have an encrypted pool that was set up when I created it, and it operated without issue until now. After this issue happened I am receiving the error [EFAULT] Pool could not be imported: 5 devices failed to decrypt. I have gone through my SMART logs and noticed some troubling errors on some disks, specifically RAW READ ERROR RATE and MULTI ZONE ERROR RATE. All 6 of the Seagate disks have the RAW READ ERROR RATE, and one of the WD disks has the MULTI ZONE ERROR RATE. These drives were all purchased new, from a few different retailers, and they were checked before being put into service using a burn-in tool found on these forums. The system gave no warning that anything was wrong before this incident. I have email alerts set up and verified working.

About the system: it is an IBM x3630 M3 with 48GB of ECC memory, dual power supplies, and two Xeon CPUs. It was originally running FreeNAS, before upgrading to TrueNAS-12.0-U4 a few months before this incident.

At this point, I do not know how to recover from this. I find it unusual and unlikely that all 6 Seagate IronWolf disks would go bad at the same time, but this is what I am seeing. Any suggestions on what to do, or additional information I can provide to help diagnose the issue, would be greatly appreciated.
 

Attachments

  • da0.txt (6.5 KB)
  • da1.txt (6.5 KB)
  • da2.txt (6.5 KB)
  • da3.txt (6.5 KB)
  • da7.txt (7.1 KB)
  • da6.txt (7.1 KB)
  • da5.txt (6.4 KB)
  • da4.txt (6.4 KB)
  • da8.txt (7.1 KB)
  • da9.txt (7.1 KB)

SRT4JRE

Dabbler
Joined
Sep 25, 2021
Messages
12
Hit the limit for attached files.
 

Attachments

  • da10.txt (7.1 KB)
  • da11.txt (7.1 KB)

SRT4JRE

Dabbler
Joined
Sep 25, 2021
Messages
12
Also, I want to mention all disks are connected through a Dell H310 flashed to IT mode.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
I had noticed a scheduled scrub happening at the time a single drive went offline

One man down... Another one and you will be without any protection; two more and you lose it all...

...

I had to abort the scrub shortly after by shutting down the system

I really do not see why you HAD to, but it is too late now... You did...

Normally this system is up 24/7, but I wasn't going to have access to the system for the next few days

Ouch! Letting drives cool down for a long time when they are normally hot all the time is a very good way to kill them. You may have demonstrated it here... If the goal was to make the server unavailable, it would have been better to stop the services or de-configure the network settings to prevent any access while keeping the drives hot...

I have an encrypted pool

Looks like it will turn to yet another self-inflicted ransomware...

All 6 of the Seagate disks have the RAW READ ERROR RATE

No worry here. The value is not to be interpreted directly. See this explanation of how Seagate reports its info.
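To illustrate why the raw number is misleading, here is a small Python sketch of the commonly cited Seagate convention: the 48-bit raw value packs the actual error count into the upper bits and the total operation count into the lower 32 bits. The bit layout is an assumption based on community documentation, not vendor-confirmed:

```python
def decode_seagate_raw(raw: int) -> tuple[int, int]:
    """Split a Seagate SMART raw value into (errors, operations).

    Assumption: Seagate packs the error count into the bits above 32
    and the total operation count into the lower 32 bits, so a huge
    raw number usually means "many reads", not "many errors".
    """
    errors = raw >> 32             # upper bits: actual error count
    operations = raw & 0xFFFFFFFF  # lower 32 bits: total operations
    return errors, operations


# A raw value of 120,000,000 looks alarming but decodes to zero errors.
print(decode_seagate_raw(120_000_000))  # -> (0, 120000000)
```

So a RAW READ ERROR RATE in the hundreds of millions on all six Seagate disks is consistent with healthy drives that have simply done a lot of reading.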

and one of the WD disks has the MULTI ZONE ERROR RATE

Because it is only a single one, that may be worse, but I do not know anything specific about WD other than that I would never use them.

At this point, I do not know how to recover from this

The best way will probably be to go back to your backups. Backups are the most effective way to recover from ransomware, whether malicious or self-inflicted. You do have backups, right? No server can be more than a single point of failure, no matter how robust it is.
 

SRT4JRE

Dabbler
Joined
Sep 25, 2021
Messages
12
Thanks for the reply.

The scrub, as you can imagine, takes a very long time on a 12 x 4TB disk Z2 array. It had many more hours until it was complete; however, I had to leave and would not have access to the system for at least 4 days. After seeing one disk drop out, I decided to shut the system down and deal with it when I had time to complete a scrub while I was present.

I had no idea about keeping drives hot, like a warmed up engine. I have honestly never read or heard that anywhere. Is this just a contributing factor, or the root cause of the failure, though?

Is the encryption the cause of the failure? I still have the password and keys, and had the disks not been encrypted would this issue still have happened? How does the encryption fit into this issue I have experienced?

Understood, it seems not to be a big deal with Seagate disks. I just had not noticed these SMART results before.

My research into the MULTI ZONE ERROR RATE suggests it's not a huge deal.

Backups aren't going to help me here if I don't know exactly what happened and how to prevent it from happening again. Is this a disk failure or something else?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
I decided to shut the system down and deal with it when I had time to complete a scrub while I was present

Scrubs are meant to happen unattended...

I had no idea about keeping drives hot, like a warmed up engine

You may now have learned it the hard way...

I have honestly never read or heard that anywhere.

Often, people look for ways to save energy by putting their drives to sleep. Every time, we recommend against it for that reason (keeping drives hot), plus the fact that frequent spin-up/spin-down is another fast track to failure.

Is this just a contributing factor, or the root cause of the failure, though?

The older the drives were, the higher the probability that this was the cause.

Is the encryption the cause of the failure?

Probably not the cause, but it is a damage amplifier. Because of encryption, the slightest error or problem on the drive can prevent the entire drive from being accessed. Had the pool not been encrypted and some corruption happened here or there, ZFS may have been able to fix it or, at worst, you would have ended up with some corrupted files instead of a locked-down and unusable pool.

I still have the password and keys

Good, and it is essential to keep them safe. If there is anything to be done, they will be a hard requirement for sure.

had the disks not been encrypted would this issue still have happened?

As explained, the damage has potentially been amplified by the encryption. The "initial" damage may have been enough to destroy even an unencrypted pool, but we will never know... Even an unencrypted pool can be destroyed when too many drives fail at once.

how to prevent it from happening again

Keeping the disks hot and avoiding ZFS encryption are already 2 things to put on your side once you have restored your backups. If you must encrypt the data, it is better to do it at the file level than at the ZFS level. For example, here, my TrueNAS server hosts my private cloud built with Nextcloud. It is Nextcloud that encrypts the content and saves the cryptograms on TrueNAS. As such, disk access will give nothing, even as root on the live system. Overall, it turns out much safer than ZFS encryption, with the same result that no clear-text data are saved on TrueNAS's drives.

TrueNAS has started to do per-dataset encryption. I do not know much about this option, but it already sounds safer to me because it means the fundamental ZFS pool structure will not be encrypted.

Is this a disk failure

Very possible, but unfortunately, after I did a few tests with ZFS encryption myself, I concluded that it was way too high a risk for way too low a benefit. As such, I do not have much experience with it, even less with recovering after such a failure.

So now that these drives are hot again, I would keep them as such: plugged in and spinning. As for what to do now to try to mount the pool, I remember that @PhiloEpisteme did a lot of work on this and helped many people around here. You may search for his posts and see if you can find anything to help your own case...

Good luck,
 

SRT4JRE

Dabbler
Joined
Sep 25, 2021
Messages
12
I understand scrubs can happen unattended, but that wasn't the case here for reasons stated.

Understood, always keep drives running; I had no idea they were so delicate.

I should mention these were brand new drives and they weren't running in a high demand environment. They were idle most of the time doing practically nothing.

Understood, next time I won't use ZFS encryption. The reason I decided to use it, against some advice, was that I don't want to be in a situation where a disk fails under warranty and I have to send it in for replacement with potentially readable data still on it.

Thanks for your help and for pointing me in the right direction. I will look into his posts.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
Understood, next time I won't use ZFS encryption. The reason I decided to use it, against some advice, was that I don't want to be in a situation where a disk fails under warranty and I have to send it in for replacement with potentially readable data still on it.

Were you using native ZFS encryption, or GELI disk-level encryption, though? Looks like a GELI configuration problem, not a ZFS problem.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I should mention these were brand new drives and they weren't running in a high demand environment. They were idle most of the time doing practically nothing.
Hard disks are relatively likely to fail when new, and again once they are a couple of years old. You should burn in new drives; the forum has quite a few threads on this.
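As a rough sketch of what a burn-in pass typically looks like (the device name is a placeholder, the exact tooling varies between forum threads, and the `badblocks -w` step destroys all data on the target disk):

```shell
#!/bin/sh
# Example burn-in pass for a new disk. /dev/da0 is a placeholder --
# double-check the device name, and NEVER run 'badblocks -w' on a
# disk that holds data.

# 1. Record baseline SMART data and start a long self-test.
smartctl -a /dev/da0
smartctl -t long /dev/da0

# 2. Destructive read/write surface test (takes many hours on 4TB).
badblocks -ws -b 4096 /dev/da0

# 3. Re-read SMART data afterwards and compare the error counters
#    (reallocated, pending, and uncorrectable sectors) to the baseline.
smartctl -a /dev/da0
```

The idea is to push the drive through its infant-mortality window before it holds pool data, then confirm the SMART counters did not move.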

Good luck!
 

chruk

Dabbler
Joined
Sep 4, 2021
Messages
27
I am 100% behind keeping drives powered up. For over 20 years I have kept my PC on 24/7, and the number of drive failures I have had this way is minimal: maybe 2-3 in 20 years, and even then they never failed completely, just some unreadable sectors.

However, when I upgrade my drives, I replace them and put the old ones on a shelf or in another PC (in full working condition when removed). Of the ones I put on a shelf, I have occasionally decided to use one again for another purpose, and the failure rate doing this has been very high; even when they work, they quite often suddenly get SMART errors. I think the only two drives I have that survived long power-offs and many power cycles are a very old 36 GB Raptor and my not-much-newer WD Black 640 GB (from the days when Blacks were not much worse quality than Raptors).
 

SRT4JRE

Dabbler
Joined
Sep 25, 2021
Messages
12
Hard disks are relatively likely to fail when new, and again once they are a couple of years old. You should burn in new drives; the forum has quite a few threads on this.

Good luck!

The drives were burnt in using a tool found on these forums. They were brand new and no issues were found at the time; they were all good drives. Short of the extremely unlikely event that all of the less-than-1-year-old Seagate drives failed at the same time, I suspect this problem is caused by something else.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
Yes, ZFS encryption.
The error message, however, is generated by GELI, not by ZFS native encryption. You need to investigate and sort out your encryption configuration. Something strange is afoot in there.

Err, not exactly: the message is not generated by GELI itself, but by the process of unlocking the GELI-encrypted pool.
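For anyone investigating this from the shell, the state of the GELI layer can be inspected directly. A minimal sketch; the gptid and key filename below are placeholders, assuming the conventional FreeNAS layout where pool keys live under /data/geli/:

```shell
#!/bin/sh
# Show which GELI providers are currently attached (i.e. decrypted).
geli status

# List the partitions/labels the pool members live on.
glabel status

# Ask ZFS whether it can see an importable pool at all.
zpool import

# Manually try to attach a single member. The gptid and key path are
# placeholders; -p skips the passphrase prompt (only valid if the pool
# was keyed without a passphrase).
geli attach -p -k /data/geli/pool_key.key /dev/gptid/xxxxxxxx.eli
```

If `geli attach` fails on individual members, that points at the key/passphrase combination or damaged GELI metadata rather than at ZFS itself.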
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
The drives were burnt in using a tool found on these forums. They were brand new and no issues were found at the time; they were all good drives. Short of the extremely unlikely event that all of the less-than-1-year-old Seagate drives failed at the same time, I suspect this problem is caused by something else.
Well, I have had 3 Seagate Exos 16 TB go south between April and August this year. All had been in service since October 2020, cooled to less than 35 C, and with less than 20 power-cycles. Statistically not relevant, of course. But certainly the worst HD experience in over 30 years with approx. 50 drives.
 

SRT4JRE

Dabbler
Joined
Sep 25, 2021
Messages
12
Well, I have had 3 Seagate Exos 16 TB go south between April and August this year. All had been in service since October 2020, cooled to less than 35 C, and with less than 20 power-cycles. Statistically not relevant, of course. But certainly the worst HD experience in over 30 years with approx. 50 drives.
So it's not likely a disk failure of 5 brand-new, different-batch, burn-in-tested drives. Anything else I can try? I still have the encryption and recovery key for the pool.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
If you don't have any pools to import, could it be that your pool eventually did import?

Please supply zpool status in code tags.
 

SRT4JRE

Dabbler
Joined
Sep 25, 2021
Messages
12
Thanks for the reply; yes, I believe the pool is still imported. I have also tried unlocking the pool using FreeNAS 11.3-U5 and still receive
Code:
[EFAULT] Pool could not be imported: 5 devices failed to decrypt.



Code:
root@freenas[~]# zpool status
  pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:08:06 with 0 errors on Wed Nov  9 03:53:06 2022
config:
       
        NAME            STATE        READ WRITE CKSUM
        freenas-boot    ONLINE         0     0     0
          da12p2        ONLINE         0     0     0
errors: No known data errors
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
No, with that output, it does not show any data pools. So it is not imported. Something bad happened.

Sorry, I have no further suggestions (and not much experience with TrueNAS encryption).
 