Linus Tech Tips deploying TrueNAS for their production server

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Otherwise, how would he have recovered to post the video afterwards?

You obviously have insufficient experience with end users. The only thing that is guaranteed to stop them from trying again is death. Some may be discouraged by negative experiences, but many return to try again, triggering "the definition of insanity" :smile: ;-)
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I cringed when they created their pool as RAIDZ1. For their workload (video editing), a wide stripe of mirrors would have much better IOPS.
Considering he's using Gen 4 PCIe SSDs, mirrored vdevs are not really needed. Recommended, yes, but not needed.
iX actually likes the video as well.
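For anyone following along at home, a minimal sketch of the two layouts being discussed (pool and device names are hypothetical):

# A stripe of mirrors: each two-disk mirror is its own vdev, so random IOPS scale with the number of vdevs
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf
# A single RAIDZ1 vdev over the same six disks: more usable space, but roughly the random IOPS of one disk
zpool create tank raidz1 sda sdb sdc sdd sde sdf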

 

ConKbot

Cadet
Joined
Oct 16, 2021
Messages
7
And in a move unsurprising to a lot of people on here...

TLDW: 2 CentOS archival servers ended up with zpool failure after multiple Z2 vdevs were caught with dual disk failures, the drives were replaced, and the resilver was started. But because these servers didn't have periodic scrubs enabled, there was enough bit rot on the remaining disks in those vdevs to throw errors and fail even more disks, killing the pools.
The two most basic things absolutely hammered home by the folks on here, "do periodic scrubs; do periodic SMART long tests," were not set up. Linus, pls...
Edit: Highlighting they were CentOS servers
 

imsocold

Dabbler
Joined
Dec 13, 2021
Messages
43
And in a move unsurprising to a lot of people on here...

TLDW: 2 archival servers ended up with zpool failure after multiple Z2 vdevs were caught with dual disk failures, the drives were replaced, and the resilver was started. But because these servers didn't have periodic scrubs enabled, there was enough bit rot on the remaining disks in those vdevs to throw errors and fail even more disks, killing the pools.
The two most basic things absolutely hammered home by the folks on here, "do periodic scrubs; do periodic SMART long tests," were not set up. Linus, pls...
This has to be some kind of record. I fully expected Linus to screw this up, but not this quickly.
 

ConKbot

Cadet
Joined
Oct 16, 2021
Messages
7
This has to be some kind of record. I fully expected Linus to screw this up, but not this quickly.
To be fair, this isn't the production server from the original post of this thread. And it does sound like some of the servers do have periodic scrubs enabled.
Maybe TrueNAS should create a default periodic scrub task for new/imported zpools. Yes, I understand that a professional shouldn't have ...issues... like this, but penetration into the home-lab space is increasing. And even for a professional, it's not like it's adding work; you'd just change the default task to the schedule of your preference.
 

ClassicGOD

Contributor
Joined
Jul 28, 2011
Messages
145
To be fair, this isn't the production server from the original post of this thread. And it does sound like some of the servers do have periodic scrubs enabled.
Maybe TrueNAS should create a default periodic scrub task for new/imported zpools. Yes, I understand that a professional shouldn't have ...issues... like this, but penetration into the home-lab space is increasing. And even for a professional, it's not like it's adding work; you'd just change the default task to the schedule of your preference.
I could've sworn that there are default scrub tasks for imported pools. The issue was that the servers were running CentOS with a basically manual storage setup. No failed-drive notifications, no periodic scrubs or S.M.A.R.T. tests, etc. It was only a matter of time before they started losing data.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
This has to be some kind of record. I fully expected Linus to screw this up, but not this quickly.
This wasn't instant; the vaults have been around for years. This is, unfortunately, common in many businesses...
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I could've sworn that there are default scrub tasks for imported pools. The issue was that the servers were running CentOS with a basically manual storage setup. No failed-drive notifications, no periodic scrubs or S.M.A.R.T. tests, etc. It was only a matter of time before they started losing data.
CentOS doesn't have the default scrub setting built into it, as ZFS wasn't truly native on Linux until recently.
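So on CentOS you have to schedule them yourself. A minimal sketch, assuming a pool named tank (the zpool path may differ per install):

# /etc/cron.d/zfs-scrub -- scrub the pool at 02:00 on the 1st of every month
0 2 1 * * root /usr/sbin/zpool scrub tank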
 

ConKbot

Cadet
Joined
Oct 16, 2021
Messages
7
I could've sworn that there are default scrub tasks for imported pools. The issue was that the servers were running CentOS with a basically manual storage setup. No failed-drive notifications, no periodic scrubs or S.M.A.R.T. tests, etc. It was only a matter of time before they started losing data.
I spaced out on that bit. You're right, they did say they were using CentOS. I just saw the TrueNAS alerts email they showed in the video, and them talking about ZFS had me thinking it was TrueNAS. I'm still not sure if there is a default scrub task in TrueNAS for new pools, though. I just set my home server up on 12.0-U7, and I can't for the life of me remember whether I had to create the scrub task or just modify an existing one.
 

ClassicGOD

Contributor
Joined
Jul 28, 2011
Messages
145
I spaced out on that bit. You're right, they did say they were using CentOS. I just saw the TrueNAS alerts email they showed in the video, and them talking about ZFS had me thinking it was TrueNAS. I'm still not sure if there is a default scrub task in TrueNAS for new pools, though. I just set my home server up on 12.0-U7, and I can't for the life of me remember whether I had to create the scrub task or just modify an existing one.
I recently moved my pool to new hardware with a fresh SCALE install, and I could've sworn it had a scrub task attached to it. I can't verify it now, as I've since removed all the default scrub tasks and created my own schedule.
CentOS doesn't have the default scrub setting built into it, as ZFS wasn't truly native on Linux until recently.
I agree; I was talking about TrueNAS having default scrubs. I should have made that clearer.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The two most basic things absolutely hammered home by the folks on here, "do periodic scrubs; do periodic SMART long tests," were not set up.

And people wonder why I have the paranoia-level SMART testing and scrubs.

Multiple shorts a day. Longs thrice a week. Scrubs every 14 days. Alerts and emails tested to work. Spare drives available for important pools.
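In raw cron terms, that cadence would look roughly like this per disk (device and pool names are hypothetical; on TrueNAS you'd configure these as S.M.A.R.T. Test and Scrub tasks in the UI instead):

# root crontab sketch -- repeat the smartctl lines for every disk in the pool
0 */8 * * *    /usr/sbin/smartctl -t short /dev/sda   # short self-test three times a day
0 3 * * 1,3,5  /usr/sbin/smartctl -t long /dev/sda    # long self-test Mon/Wed/Fri
0 4 1,15 * *   /usr/sbin/zpool scrub tank             # scrub every ~14 days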

It is NOT possible to prevent bad things from happening. However, I can tell you, fate comes picking on you if you can't be bothered with the little details. One of my "important" filers is an 11-disk RAIDZ3 with a spare, replicated off-site. This NAS has been through iterations of 4TB, 8TB, 12TB, and now 14TB HDDs. That's 48 hard drives. You would think that one of them would have failed at some point. Nope. Fate doesn't like the prepared.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
ZFS wasn't truly native on Linux until recently.
ZFS still isn't truly native in Linux, and some folks in Linux-land (including Torvalds, IIRC) have a pathological aversion to it. Some distros like it (Ubuntu being one); others really don't (e.g., anything downstream from RedHat/IBM). There are packages for CentOS/RHEL, but you're completely on your own to set up a sensible configuration. No surprise at all that Linus (of LTT, not Torvalds) didn't.
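For anyone stuck administering such a box, the bare minimum of a "sensible configuration" is pointing the ZFS Event Daemon's mail alerts at an address somebody actually reads; a sketch of /etc/zfs/zed.rc (the address is hypothetical, and upstream ships these settings commented out):

# /etc/zfs/zed.rc -- ZFS Event Daemon notification settings
ZED_EMAIL_ADDR="storage-alerts@example.com"  # where fault/degradation events get mailed
ZED_NOTIFY_VERBOSE=1                         # also notify on scrub/resilver finish, even when healthy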
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
Ubuntu's out-of-the-box support for ZFS is nice, especially since Canonical put themselves at risk of litigation from Oracle.
On the other hand, they knowingly released 21.10 with a custom patch that corrupted the filesystem.
Not sure what to make of that...
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Ubuntu's out-of-the-box support for ZFS is nice, especially since Canonical put themselves at risk of litigation from Oracle.
...
It was not so much a risk from Oracle, as that part was mostly or fully taken care of by Sun Microsystems' open-sourcing of the code.

The issue is that some Linux / free-software people were complaining that including non-GPL(1/2/3, whatever) code as kernel modules violated the GPL(1/2/3) license. So they felt that anyone who included ZFS with a Linux distro could or should be sued. Ubuntu put themselves at risk of being sued by people like the FSF (Free Software Foundation).



On the general thread, I like this quote (which I used earlier in this thread in mid-December 2021):
FreeNAS forums:

The probability of vdev failure with RAIDZ2 is tiny on a properly-administered
server. Most contributions to that probability would take any other pools along.

We're talking catastrophic PSU failure, massive undetected RAM failure,
physical destruction, alien high-intensity degaussing beams, extreme heat or
the presence of Linus from LinusTechTips in the same room as the server.

Ericloewe, Jan 6, 2016
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ubuntu's out-of-the-box support for ZFS is nice, especially since Canonical put themselves at risk of litigation from Oracle.
This has always been a rather weird argument. So, Oracle would sue because someone didn't respect this interpretation of the GPL... why? As the copyright holder of some ZFS code? That doesn't make any sense: the CDDL doesn't disallow such use, so they don't have standing to sue. As the copyright holder of some Linux code? That would require quite some mental gymnastics, even by Oracle's standards. And what damages could possibly be demanded in such a case?

And ultimately, Oracle hasn't sued yet, or even made noise about it. Oracle is not the sort of company that acts subtly or hesitates to sue on flimsy premises. High-priced, big-shot corporate lawyers have clearly determined that there's nothing to see here.

Ubuntu put themselves at risk of being sued by people like the FSF (Free Software Foundation).


Which would also be weird, but not beyond some elements of the FSF. Fortunately, they're too busy interjecting about GNU/Linux. Unfortunately, they're also too busy being creeps. Of course, they'd have to convince someone that using Open-Source Software together with other Open-Source Software is bad because the GPL said so. It's not the most convincing of theses. And it opens Pandora's box of litigation around the GPL, some parts of which could conceivably be declared unenforceable.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Pandora's box of litigation around the GPL, some parts of which could conceivably be declared unenforceable.

I will enjoy making popcorn for everyone the day that the so-called "Free Software Foundation" tries to enforce the GPL. The obvious interpretational issues of "free software doesn't mean freedom for the user, it means we want to keep the software free" and "we do that by imposing all sorts of restrictions on it" are a reality that the term "free software" doesn't really convey; most users who haven't actually read the GPL assume it refers to their rights.

Not like the good ol' BSD copyright.
 
Joined
Jan 29, 2022
Messages
1
But because these servers didn't have periodic scrubs enabled, there was enough bit rot on the remaining disks in those vdevs to throw errors and fail even more disks, killing the pools.

Not true. If any written stripe of data suffers, say, a combination of three errors (bad blocks plus hard disk failures) in a RAID-Z2 array, then that stripe obviously drops below minimum redundancy, and ZFS will give you the "file has been damaged - restore from backup" message. But more importantly, (1) the rest of the files in that ZFS filesystem will be unaffected and 100% OK, and (2) the resilver will continue until it completes. Also, if there are any other instances of below-minimum redundancy, then more files will be lost.

A scrub is not the only way to detect and repair errors on a redundant pool: merely accessing the damaged file(s) will cause ZFS to repair them on the fly. See page 16 of https://wiki.chipp.ch/twiki/pub/CmsTier3/NFSServerZFSBackupANDdCache/zfs_last_presentation.pdf, which describes the procedure for mirrors; the same happens with RAID-Z/Z2/Z3 stripes.
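In practice you can even force that self-heal by hand; a sketch with a hypothetical pool and dataset:

# Read everything back, which checksum-verifies each block and repairs from redundancy as it goes
find /mnt/tank/archive -type f -exec cat {} + > /dev/null
# Then see what was caught and repaired, and which files (if any) were unrecoverable
zpool status -v tank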
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Not true. If any written stripe of data suffers, say, a combination of three errors (bad blocks plus hard disk failures) in a RAID-Z2 array, then that stripe obviously drops below minimum redundancy, and ZFS will give you the "file has been damaged - restore from backup" message. But more importantly, (1) the rest of the files in that ZFS filesystem will be unaffected and 100% OK, and (2) the resilver will continue until it completes.
Depends on how much damage there is. At some point, the disk will be thrown out. That doesn't mean the whole pool is unrecoverable, just that it turns into a monstrous pain in the ass.
A scrub is not the only way to detect and repair errors on a redundant pool: merely accessing the damaged file(s) will cause ZFS to repair them on the fly. See page 16 of https://wiki.chipp.ch/twiki/pub/CmsTier3/NFSServerZFSBackupANDdCache/zfs_last_presentation.pdf, which describes the procedure for mirrors; the same happens with RAID-Z/Z2/Z3 stripes.
Sure, but Murphy will ensure that "live" data is mostly okay while older data is what rots. Besides, if they couldn't tell things were not going well until it was that late, they could easily have ignored or never set up any notifications, so ZFS doing its thing during non-scrub operations would only have delayed this situation.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Not true. If any written stripe of data suffers, say, a combination of three errors (bad blocks plus hard disk failures) in a RAID-Z2 array, then that stripe obviously drops below minimum redundancy, and ZFS will give you the "file has been damaged - restore from backup" message. But more importantly, (1) the rest of the files in that ZFS filesystem will be unaffected and 100% OK, and (2) the resilver will continue until it completes. Also, if there are any other instances of below-minimum redundancy, then more files will be lost.

A scrub is not the only way to detect and repair errors on a redundant pool: merely accessing the damaged file(s) will cause ZFS to repair them on the fly. See page 16 of https://wiki.chipp.ch/twiki/pub/CmsTier3/NFSServerZFSBackupANDdCache/zfs_last_presentation.pdf, which describes the procedure for mirrors; the same happens with RAID-Z/Z2/Z3 stripes.
The one problem LTT has here is that these NASes were for archival storage of videos. So unless they read every single video on a reasonably regular basis, then without a scrub: data loss.

But I do agree that normal access can find blocks that fail their checksum, and if redundancy is available, ZFS will automatically attempt to restore redundancy. (Which generally should succeed, but if there are no more spare blocks on the destination disk, then it is REALLY time to replace it...)

Last, there are problems with using normal access as a way to find bad blocks: if by chance you read the good copy, you may never know about the bad one. Thus, ZFS scrubs to the rescue.
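That's because a scrub, unlike a normal read, touches every allocated block and all of its redundant copies and parity. A minimal sketch (pool name hypothetical):

zpool scrub tank    # walk the entire pool, verifying data blocks, copies, and parity
zpool status tank   # the "scan:" line shows progress and how much was repaired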



ZFS checksums and scrubs are one of the biggest reasons I use ZFS. I want my old, archival data to be usable years, decades later.
 