Server Degraded from Checksum Errors, what to do next.

PeanutPlays

Dabbler
Joined
Feb 20, 2023
Messages
13
Hello Everyone,

So I recently started dumping all of my DVDs and Blu-rays onto my server. By and large this had been going swimmingly. A few nights ago I was streaming via Jellyfin while more DVDs were being ripped, and the video quality and FPS started to tank. Wondering what was going on, I went over to my desktop and pulled up the GUI. The data VDEVs were listed as mixed capacity (They're identical 4 disc RZ1 VDEVs with a hot spare) and the pool status was Degraded with over 500 ZFS errors. When I looked at the Devices page at the time, the degraded disks were on my StarTech SATA expansion card. I stopped the two discs that were being ripped, pulled a couple of recent work files off the server onto my desktop, let it finish its in-progress resilver, and shut the server down.

I thought something was wrong with the StarTech card, so after some reading, I decided to replace it with a new LSI 9207-8i and a pair of 4-way breakout cables.

Card and cables arrived today, so I popped them in, plugged in the corresponding drives, and booted the server back up. It started resilvering, and I went into Devices to see what all the drives were up to.

Storage Dashboard.png


Pool status.png


Not much in the way of useful info there, so I went and ran
Code:
zpool status
in the shell.

zpool status.png

Now, since it happened late Saturday night and I didn't know about zpool status at the time, I didn't think to run it. But today it was showing 48 errors. I don't understand what prompts a checksum error, but I do know I hadn't written to or read from any drives while this was happening. This is also where I realized that my spare got called off the bench.
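If I'm reading the docs right, the verbose flag should also list any files with permanent errors, so I figure that's worth running too:
Code:
zpool status -v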

For some reason TrueNAS seems to be shuffling the sd letter assignments, so I've come to a couple of wrong conclusions. I originally thought it was the StarTech card. But when I booted it up today, it listed sda-sdd as the drives in trouble, and per the original configuration, a-d are on the mobo. So I thought maybe I just had too much happening for the mobo to keep up with. Now that 8 of the drives are on the LSI card, the lettering has roughly gone back to original, so my original thought to blame the StarTech card seems to stand up again.
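Since the letters keep moving around, I'm thinking I should start matching drives by serial number instead of sd letter, something like this (if I'm reading the lsblk man page right):
Code:
lsblk -o NAME,MODEL,SERIAL
That way I can tell which physical drive is which, no matter how the letters get shuffled between boots.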

Resilver is done:

2nd post resilver.png


ZPool Resilver.png


And here's where I get stuck. I did read the doc it told me to read, but I'm still not sure what to do. I can run
Code:
zpool clear
but will that take the drives out of degraded mode? And why did my spare get pulled into this? All the drives in raidz1-1 only have about 400 hours on them, and I'm not buying the idea that 3 brand-new drives are having an early death. It had to be the StarTech expansion card. That aside, I don't know what the 500+ errors were that piled up before I realized something was wrong. (I have tried to set up OAuth email to alert me of stuff like this, but it REFUSES to work.) As far as I can tell, none of the files that were being read or written when this all went sideways seem to be corrupted.

So what should I be doing here? Do I clear the pool and carry on? Is there a way to see all the previous errors? Why is my spare drive involved? If the card was the issue, how do I get the drive that the spare replaced back into the mix as the new spare? Anyone got any ideas on the email thing? This guide is what I followed and I got no love.
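For reference, here's roughly the sequence I think the doc is describing, with tank standing in for my actual pool name; please correct me if I've got this wrong:
Code:
# reset the error counters once the hardware problem is believed fixed
zpool clear tank
# then scrub so every block gets re-verified against its checksum
zpool scrub tank
# and keep an eye out for new errors while the scrub runs
zpool status -v tank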
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
First off, WD Blue drives may not be suitable for NAS use. There are two things that can cause issues:
  • Head parking, which can delay responses enough to make ZFS think the drive is failing
  • Long error recovery, where the drive spends far too much time trying to read a bad block
The latter can cause drives to drop out of RAID sets, because desktop drives (like WD Blues) will spend over a minute re-reading and applying error correction to a bad block. In a NAS with redundancy, we want TLER (Time Limited Error Recovery; Seagate calls it something else) of around 7 seconds. That lets ZFS detect the bad block and immediately use the available redundancy to repair it and restore full redundancy to the RAID set, instead of declaring the drive DEGRADED and pulling in a hot spare.
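If you want to see whether your drives even support it, smartctl can query the SCT error recovery setting and, on drives that allow it, set the timeout. Something like this, with /dev/sdX standing in for each data disk:
Code:
# query the current SCT Error Recovery Control (TLER / ERC) setting
smartctl -l scterc /dev/sdX
# try to set read and write recovery to 7 seconds (values are in tenths of a second)
smartctl -l scterc,70,70 /dev/sdX
Many desktop drives either report this as unsupported or forget the setting after a power cycle, so don't count on it sticking.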

You can also reduce head parking by increasing the time the drive waits before parking. Given the occasional reads a pool sees, that may keep the drives from ever parking, and thus avoid tripping ZFS's fault detection.
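If you're on the Linux-based SCALE (your sda/sdb naming suggests you are), the WD-specific idle3 timer can usually be read, raised, or disabled with idle3ctl from the idle3-tools package, if you can get that installed. Roughly like this, though it is aimed at WD Green / Red firmware, so no promises the Blues will accept it:
Code:
# show the current idle3 (head parking) timer
idle3ctl -g /dev/sdX
# or disable the aggressive parking entirely
idle3ctl -d /dev/sdX
The drives need a power cycle before a changed timer takes effect.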


Now, where you go from here is problematic. Just recovering all your redundancy may simply delay the overall problem, meaning it might repeat, since you've said the drives are only about 400 hours old (less than 17 days).

My first comment: do you have a full, recent backup?
You also don't list the amount of memory in your NAS specs. Not that it should make much difference, but how much?


I am not sure what the best path forward would be. Perhaps someone else can chime in.
 

PeanutPlays

Dabbler
Joined
Feb 20, 2023
Messages
13
@Arwen as for a full recent backup, no. I pulled all the really important work and personal files off the server and copied them to multiple places. The bulk of the data left on the array is gameplay recordings and the DVD/BR rips I had made. The discs are easy enough to re-rip, and if push comes to shove, I can use Recuva on the portable drives I used to transfer the data and get back any recordings that I don't already have copied elsewhere.

So I guess I technically do have a full backup, but it's just in pieces and scattered around.

There's 32GB of Vengeance DDR4 28...33? (can't remember the exact speed). I figured that should have been plenty.

Between your post and some answers I received over at Level1Techs, I'm going to rebuild the whole thing. I'm going to attempt to pull everything off the array, but if it becomes problematic, oh well. Going to get a box of IronWolf drives and an actual server chassis and do this whole thing over. Actually gonna look at relocating the server, my new workstation, and my gaming rig into rack chassis and putting them in a different room. Too much damn heat sitting next to me at my desk, lol.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Glad you have backups. And 32GB of memory should be good enough.

One reason we suggest server boards with NAS disks, in server chassis, is that a NAS is different from a desktop or gaming computer. The goal for NASes in general is zero data loss. TrueNAS with ZFS takes this a bit further than some other home / small office NASes, and it simply works better on server boards and the like.

In case you don't know, ZFS was written by Sun Microsystems for their own enterprise use. It was later made open source and became available for FreeBSD and Linux.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
(They're identical 4 disc RZ1 VDEVs with a hot spare
I assume that by RZ1 you mean RAIDZ1. Please use correct terminology to avoid any risk of misunderstanding. For more on this, please have a look here.


As to the VDEV setup, it does not make sense to have a single RAIDZ1 with a hot spare. You should rather go for RAIDZ2.
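To illustrate (pool and device names made up), the same five disks as a single RAIDZ2 VDEV would be created roughly like this:
Code:
zpool create tank raidz2 sda sdb sdc sdd sde
Any two of the five disks can then fail without data loss, whereas with RAIDZ1 plus a hot spare you are running without any redundancy while the resilver onto the spare is still going.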
 

PeanutPlays

Dabbler
Joined
Feb 20, 2023
Messages
13
As to the VDEV setup, it does not make sense to have a single RAIDZ1 with a hot spare. You should rather go for RAIDZ2.
When I initially set up the system, I only had 4 drives, hence the RAIDZ1. I then decided I wanted to double the space and have a hot spare in place to protect it while I was away on work trips. That's when I added the second RAIDZ1 to the pool and set the hot spare to cover the whole pool.
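If it helps to see it spelled out, what I did through the GUI was essentially the equivalent of this (pool and device names are just placeholders):
Code:
# extend the pool with a second 4-disk RAIDZ1 VDEV
zpool add tank raidz1 sde sdf sdg sdh
# and add one pool-wide hot spare
zpool add tank spare sdi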
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Ok, that makes sense :smile:
 