Linus Tech Tips deploying TrueNAS for their production server

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Otherwise, how would he have recovered to post the video afterwards?

You obviously have insufficient experience with end users. The only thing that is guaranteed to stop them from trying again is death. Some may be discouraged by negative experiences, but many return to try again, triggering "the definition of insanity" :smile: ;-)
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I cringed when they created their pool as RAIDZ1. For their workload (video editing), a wide stripe of mirrors would have much better IOPS.
Considering he's using Gen 4 PCIe SSDs, mirrored vdevs are not really needed. Recommended, yes, but not needed.
iX actually likes the video as well.
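For anyone following along at home, a minimal sketch of the two layouts being discussed (pool and device names are hypothetical):

# A stripe of mirrors: each two-disk mirror is its own vdev, so random IOPS scale with the number of vdevs
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf
# A single RAIDZ1 vdev over the same six disks: more usable space, but roughly the random IOPS of one disk
zpool create tank raidz1 sda sdb sdc sdd sde sdf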

 

ConKbot

Cadet
Joined
Oct 16, 2021
Messages
7
And in a move unsurprising to a lot of people on here...

TLDW: 2 CentOS archival servers ended up with zpool failure after multiple Z2 vdevs were caught with dual disk failures, the drives were replaced, and the resilver was started. But because these servers didn't have periodic scrubs enabled, there was enough bit rot on the remaining disks in those vdevs to throw errors and fail even more disks, killing the pools.
The two most basic things absolutely hammered home by the folks on here, "do periodic scrubs; do periodic SMART long tests," were not set up. Linus, pls...
Edit: Highlighting they were CentOS servers
 

imsocold

Dabbler
Joined
Dec 13, 2021
Messages
43
And in a move unsurprising to a lot of people on here...

TLDW: 2 archival servers ended up with zpool failure after multiple Z2 vdevs were caught with dual disk failures, the drives were replaced, and the resilver was started. But because these servers didn't have periodic scrubs enabled, there was enough bit rot on the remaining disks in those vdevs to throw errors and fail even more disks, killing the pools.
The two most basic things absolutely hammered home by the folks on here, "do periodic scrubs; do periodic SMART long tests," were not set up. Linus, pls...
This has to be some kind of record. I fully expected Linus to screw this up, but not this quickly.
 

ConKbot

Cadet
Joined
Oct 16, 2021
Messages
7
This has to be some kind of record. I fully expected Linus to screw this up, but not this quickly.
To be fair, this isn't the production server from the original post of this thread. And it does sound like some of the servers do have periodic scrubs enabled.
Maybe TrueNAS should create a default periodic scrub task for new/imported zpools. Yes, I understand that a professional shouldn't have ...issues... like this, but penetration into the home-lab space is increasing. And even for a professional, it's not like it's adding work; you'd just change the default task to the schedule of your preference.
 

ClassicGOD

Contributor
Joined
Jul 28, 2011
Messages
145
To be fair, this isn't the production server from the original post of this thread. And it does sound like some of the servers do have periodic scrubs enabled.
Maybe TrueNAS should create a default periodic scrub task for new/imported zpools. Yes, I understand that a professional shouldn't have ...issues... like this, but penetration into the home-lab space is increasing. And even for a professional, it's not like it's adding work; you'd just change the default task to the schedule of your preference.
I could've sworn that there are default scrub tasks for imported pools. The issue was that the servers were running CentOS with a basically manual storage setup. No failed-drive notifications, no periodic scrubs or S.M.A.R.T. tests, etc. It was only a matter of time before they started losing data.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
This has to be some kind of record. I fully expected Linus to screw this up, but not this quickly.
This wasn't instant; the vaults have been around for years. This is, unfortunately, common in many businesses...
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I could've sworn that there are default scrub tasks for imported pools. The issue was that the servers were running CentOS with a basically manual storage setup. No failed-drive notifications, no periodic scrubs or S.M.A.R.T. tests, etc. It was only a matter of time before they started losing data.
CentOS doesn't have the default scrub setting built into it, as ZFS wasn't truly native on Linux until recently.
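So on CentOS you have to schedule them yourself. A minimal sketch, assuming a pool named tank (the zpool path may differ per install):

# /etc/cron.d/zfs-scrub -- scrub the pool at 02:00 on the 1st of every month
0 2 1 * * root /usr/sbin/zpool scrub tank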
 

ConKbot

Cadet
Joined
Oct 16, 2021
Messages
7
I could've sworn that there are default scrub tasks for imported pools. The issue was that the servers were running CentOS with a basically manual storage setup. No failed-drive notifications, no periodic scrubs or S.M.A.R.T. tests, etc. It was only a matter of time before they started losing data.
I spaced out on that bit. You're right, they did say they were using CentOS. I just saw the TrueNAS alerts email they showed in the video, and them talking about ZFS had me thinking it was TrueNAS. I'm still not sure if there is a default scrub task in TrueNAS for new pools, though. I just set my home server up on 12.0-U7, and I can't for the life of me remember whether I had to create the scrub task or just modify an existing one.
 

ClassicGOD

Contributor
Joined
Jul 28, 2011
Messages
145
I spaced out on that bit. You're right, they did say they were using CentOS. I just saw the TrueNAS alerts email they showed in the video, and them talking about ZFS had me thinking it was TrueNAS. I'm still not sure if there is a default scrub task in TrueNAS for new pools, though. I just set my home server up on 12.0-U7, and I can't for the life of me remember whether I had to create the scrub task or just modify an existing one.
I recently moved my pool to new hardware with a fresh SCALE install, and I could've sworn it had a scrub task attached to it. I can't verify it now, as I've since removed all the default scrub tasks and created my own schedule.
CentOS doesn't have the default scrub setting built into it, as ZFS wasn't truly native on Linux until recently.
I agree; I was talking about TrueNAS having default scrubs. I should have made that clearer.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The two most basic things absolutely hammered home by the folks on here, "do periodic scrubs; do periodic SMART long tests," were not set up.

And people wonder why I have the paranoia-level SMART testing and scrubs.

Multiple shorts a day. Longs thrice a week. Scrubs every 14 days. Alerts and emails tested to work. Spare drives available for important pools.
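In raw cron terms, that cadence would look roughly like this per disk (device and pool names are hypothetical; on TrueNAS you'd configure these as S.M.A.R.T. Test and Scrub tasks in the UI instead):

# root crontab sketch -- repeat the smartctl lines for every disk in the pool
0 */8 * * *    /usr/sbin/smartctl -t short /dev/sda   # short self-test three times a day
0 3 * * 1,3,5  /usr/sbin/smartctl -t long /dev/sda    # long self-test Mon/Wed/Fri
0 4 1,15 * *   /usr/sbin/zpool scrub tank             # scrub every ~14 days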

It is NOT possible to prevent bad things from happening. However, I can tell you, fate comes picking on you if you can't be bothered with the little details. One of my "important" filers is an 11-disk RAIDZ3 with a spare, replicated off-site. This NAS has been through iterations of 4TB, 8TB, 12TB, and now 14TB HDDs. That's 48 hard drives. You would think that one of them would have failed at some point. Nope. Fate doesn't like the prepared.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
ZFS wasn't truly native on Linux until recently.
ZFS still isn't truly native in Linux, and some folks in Linux-land (including Torvalds, IIRC) have a pathological aversion to it. Some distros like it (Ubuntu being one); others really don't (e.g., anything downstream from RedHat/IBM). There are packages for CentOS/RHEL, but you're completely on your own to set up a sensible configuration. No surprise at all that Linus (of LTT, not Torvalds) didn't.
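For anyone stuck administering such a box, the bare minimum of a "sensible configuration" is pointing the ZFS Event Daemon's mail alerts at an address somebody actually reads; a sketch of /etc/zfs/zed.rc (the address is hypothetical, and upstream ships these settings commented out):

# /etc/zfs/zed.rc -- ZFS Event Daemon notification settings
ZED_EMAIL_ADDR="storage-alerts@example.com"  # where fault/degradation events get mailed
ZED_NOTIFY_VERBOSE=1                         # also notify on scrub/resilver finish, even when healthy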
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
Ubuntu's out-of-the-box support for ZFS is nice, especially since Canonical put themselves at risk of litigation from Oracle.
On the other hand, they knowingly released 21.10 with a custom patch that corrupted the filesystem.
Not sure what to make of that...
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Ubuntu's out-of-the-box support for ZFS is nice, especially since Canonical put themselves at risk of litigation from Oracle.
...
It was not so much a risk from Oracle, as that part was mostly or fully taken care of by Sun Microsystems' open-sourcing of the code.

The issue is that some Linux / free-software people were complaining that including non-GPL(1/2/3, whatever) code as kernel modules violated the GPL(1/2/3) license. So they felt that anyone who included ZFS with a Linux distro could or should be sued. Ubuntu put themselves at risk of being sued by people like the FSF (Free Software Foundation).



On the general thread, I like this quote (which I used earlier in this thread in mid-December 2021):
FreeNAS forums:

The probability of vdev failure with RAIDZ2 is tiny on a properly-administered
server. Most contributions to that probability would take any other pools along.

We're talking catastrophic PSU failure, massive undetected RAM failure,
physical destruction, alien high-intensity degaussing beams, extreme heat or
the presence of Linus from LinusTechTips in the same room as the server.

Ericloewe, Jan 6, 2016
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ubuntu's out-of-the-box support for ZFS is nice, especially since Canonical put themselves at risk of litigation from Oracle.
This has always been a rather weird argument. So, Oracle would sue because someone didn't respect this interpretation of the GPL... why? As the copyright holder of some ZFS code? That doesn't make any sense: the CDDL doesn't disallow such use, so they don't have standing to sue. As the copyright holder of some Linux code? That would require quite some mental gymnastics, even by Oracle's standards. And what damages could possibly be demanded in such a case?

And ultimately, Oracle hasn't sued yet, or even made noise about it. Oracle is not the sort of company that acts subtly or hesitates to sue on flimsy premises. High-priced, big-shot corporate lawyers have clearly determined that there's nothing to see here.

Ubuntu put themselves at risk of being sued by people like the FSF (Free Software Foundation).


Which would also be weird, but not beyond some elements of the FSF. Fortunately, they're too busy interjecting about GNU/Linux. Unfortunately, they're also too busy being creeps. Of course, they'd have to convince someone that using Open-Source Software together with other Open-Source Software is bad because the GPL said so. It's not the most convincing of theses. And it opens Pandora's box of litigation around the GPL, some parts of which could conceivably be declared unenforceable.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Pandora's box of litigation around the GPL, some parts of which could conceivably be declared unenforceable.

I will enjoy making popcorn for everyone the day that the so-called "Free Software Foundation" tries to enforce the GPL. The obvious interpretational issues of "free software doesn't mean freedom for the user, it means we want to keep the software free" and "we do that by imposing all sorts of restrictions on it" are a reality that the term "free software" doesn't really convey; most users who haven't actually read the GPL assume it refers to their rights.

Not like the good ol' BSD copyright.
 
Joined
Jan 29, 2022
Messages
1
But because these servers didn't have periodic scrubs enabled, there was enough bit rot on the remaining disks in those vdevs to throw errors and fail even more disks, killing the pools.

Not true. If any written stripe of data suffers, say, a combination of three errors (bad blocks plus hard disk failures) in a RAID-Z2 array, then that stripe obviously drops below minimum redundancy, and ZFS will give you the "file has been damaged - restore from backup" message. But more importantly, (1) the rest of the files in that ZFS filesystem will be unaffected and 100% OK, and (2) the resilver will continue until it completes. Also, if there are any other instances of below-minimum redundancy, then more files will be lost.

A scrub is not the only way to detect and repair errors on a redundant pool: merely accessing the damaged file(s) will cause ZFS to repair them on the fly. See page 16 of https://wiki.chipp.ch/twiki/pub/CmsTier3/NFSServerZFSBackupANDdCache/zfs_last_presentation.pdf, which describes the procedure for mirrors; the same happens with RAID-Z/Z2/Z3 stripes.
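In practice you can even force that self-heal by hand; a sketch with a hypothetical pool and dataset:

# Read everything back, which checksum-verifies each block and repairs from redundancy as it goes
find /mnt/tank/archive -type f -exec cat {} + > /dev/null
# Then see what was caught and repaired, and which files (if any) were unrecoverable
zpool status -v tank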
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Not true. If any written stripe of data suffers, say, a combination of three errors (bad blocks plus hard disk failures) in a RAID-Z2 array, then that stripe obviously drops below minimum redundancy, and ZFS will give you the "file has been damaged - restore from backup" message. But more importantly, (1) the rest of the files in that ZFS filesystem will be unaffected and 100% OK, and (2) the resilver will continue until it completes.
Depends on how much damage there is. At some point, the disk will be thrown out. That doesn't mean the whole pool is unrecoverable, just that it turns into a monstrous pain in the ass.
A scrub is not the only way to detect and repair errors on a redundant pool: merely accessing the damaged file(s) will cause ZFS to repair them on the fly. See page 16 of https://wiki.chipp.ch/twiki/pub/CmsTier3/NFSServerZFSBackupANDdCache/zfs_last_presentation.pdf, which describes the procedure for mirrors; the same happens with RAID-Z/Z2/Z3 stripes.
Sure, but Murphy will ensure that "live" data is mostly okay while older data is what rots. Besides, if they couldn't tell things were not going well until it was that late, they could easily have ignored or never set up any notifications, so ZFS doing its thing during non-scrub operations would only have delayed this situation.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Not true. If any written stripe of data suffers, say, a combination of three errors (bad blocks plus hard disk failures) in a RAID-Z2 array, then that stripe obviously drops below minimum redundancy, and ZFS will give you the "file has been damaged - restore from backup" message. But more importantly, (1) the rest of the files in that ZFS filesystem will be unaffected and 100% OK, and (2) the resilver will continue until it completes. Also, if there are any other instances of below-minimum redundancy, then more files will be lost.

A scrub is not the only way to detect and repair errors on a redundant pool: merely accessing the damaged file(s) will cause ZFS to repair them on the fly. See page 16 of https://wiki.chipp.ch/twiki/pub/CmsTier3/NFSServerZFSBackupANDdCache/zfs_last_presentation.pdf, which describes the procedure for mirrors; the same happens with RAID-Z/Z2/Z3 stripes.
The one problem LTT has here is that these NASes were for archival storage of videos. So unless they read every single video on a reasonably regular basis, then without a scrub: data loss.

But I do agree that normal access can find blocks that fail their checksum, and if redundancy is available, ZFS will automatically attempt to restore redundancy. (Which generally should succeed, but if there are no more spare blocks on the destination disk, then it is REALLY time to replace it...)

Last, there are problems with using normal access as a way to find bad blocks: if by chance you read the good copy, you may never know about the bad one. Thus, ZFS scrubs to the rescue.
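That's because a scrub, unlike a normal read, touches every allocated block and all of its redundant copies and parity. A minimal sketch (pool name hypothetical):

zpool scrub tank    # walk the entire pool, verifying data blocks, copies, and parity
zpool status tank   # the "scan:" line shows progress and how much was repaired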



ZFS checksums and scrubs are one of the biggest reasons I use ZFS. I want my old, archival data to be usable years, decades later.
 