RAIDZ2 scrub getting stuck

SAK

Dabbler
Joined
Dec 9, 2022
Messages
20
Has anyone experienced their pool scrub getting stuck at a certain percentage? I've had this server and pool for over a year, and this has never happened before. Then a few days ago, during its regularly scheduled scrub, I noticed it got to around 91% and would not progress any further.

I stopped the scrub, did some research, and decided to run long SMART tests. All drives came back OK. After this, I also updated to the new version (13.0-U4) and started the scrub again. It is now stuck, again, at 91.69%. I'm not sure this is the exact percentage it reached before, but I believe it is close.
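For anyone wanting to confirm whether a scrub stalls at the same percentage each time, here is a minimal sketch that logs the percentage reported by `zpool status` and flags when it stops advancing. It assumes the "NN.NN% done" wording that recent OpenZFS releases print in the scan section; the function names, the polling interval, and the pool name `tank` are made up for illustration and are not part of any tool.

```python
import re
import subprocess
import time

# Matches the "NN.NN% done" fragment in the scan section of `zpool status`.
# The exact wording is an assumption about recent OpenZFS output.
_PERCENT_DONE = re.compile(r"(\d+(?:\.\d+)?)% done")


def parse_scrub_percent(status_text):
    """Return the scrub completion percentage, or None if none is reported."""
    match = _PERCENT_DONE.search(status_text)
    return float(match.group(1)) if match else None


def is_stalled(samples, min_progress=0.01):
    """True if the percentage rose by less than min_progress across samples."""
    values = [s for s in samples if s is not None]
    if len(values) < 2:
        return False
    return (values[-1] - values[0]) < min_progress


def watch(pool, interval_s=600, window=6):
    """Poll `zpool status` and print a warning when progress stops."""
    samples = []
    while True:
        out = subprocess.run(
            ["zpool", "status", pool],
            capture_output=True, text=True, check=True,
        ).stdout
        percent = parse_scrub_percent(out)
        if percent is None:
            print("no scrub progress reported; stopping")
            break
        samples = (samples + [percent])[-window:]
        print(time.strftime("%Y-%m-%d %H:%M:%S"), percent)
        if len(samples) == window and is_stalled(samples):
            print(f"scrub on {pool} made no measurable progress "
                  f"over the last {window} samples")
        time.sleep(interval_s)
```

Calling `watch("tank")` (hypothetical pool name) from a shell session on the NAS would leave a timestamped record of where the scrub stops, which makes it easier to compare one run with the next.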

They are spinning disks - Seagate EXOS X16 16TB - 8 of them in a RAIDZ2 config. I have never updated their firmware; perhaps I should do that. Just wondering if anyone else has had this experience before. Edit: I know the disks are being accessed. They are louder than heck and crunching away busily, but the scrub does not progress. Disk activity stops if I stop the scrub.

I am running TrueNAS virtualized on a Proxmox host, with the controller passed through. I've never had any issues before, except for some instability with certain numbers of CPU cores assigned. It seems to like 6 cores just fine now.


SCRUB

Status: SCANNING

Completed: 91.69%

Time Remaining: 4 hours, 46 minutes, 3 seconds

Errors: 0

Date: 2023-03-03 21:12:14
 

SAK

Dabbler
Joined
Dec 9, 2022
Messages
20
Update:

After a reboot, I decided to leave the datasets locked and leave the data offline. This time it finished the scrub just fine, and I noticed the scrub was very quiet too. The drives weren't clunking and clanking the whole time as they were before. So I suspect somehow disk access/use was interfering with the scrub and making it get stuck at certain points.

Has no one else ever encountered this?
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Just wanted to chime in to say that we had the same issue this month on a FreeBSD 14.0-RELEASE-p4 server. This is one of our newer servers, and it had been scrubbing successfully every month for many months. But this month it sat at 95%, with an estimated 4 or 5 hours to completion, for several days. The pool is 4 x 6 raidz2 vdevs. Drives are WD Ultrastar DC HC550 18TB. We run regular monthly SMART long and short tests. All drives appear to be healthy.

There were no logs or any other indication as to why it was stuck at 95%. I saw what appeared to be normal scrub activity on the drives, as if the scrub were stuck in a loop.

First I tried to pause/unpause the scrub. That didn't help. Then I tried stopping and restarting the scrub. It progressed, but at a very slow rate (around 23 MB/s). This is a backup server with almost no load, so scrubs are usually much faster than this.
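For reference, the pause, stop, and restart steps map onto `zpool scrub` flags: `-p` pauses, `-s` stops, and running it with no flag starts a scrub or resumes a paused one. Below is a rough sketch of a wrapper around those commands; the helper names and the pool name `tank` are hypothetical, and this is not necessarily how the steps above were actually run.

```python
import subprocess


def scrub_command(pool, action="start"):
    """Build the argv for a `zpool scrub` action on the given pool."""
    flags = {
        "start": [],      # starts a scrub, or resumes a paused one
        "pause": ["-p"],  # pauses; progress so far is kept
        "stop": ["-s"],   # cancels; the next scrub starts from the beginning
    }
    if action not in flags:
        raise ValueError(f"unknown action: {action!r}")
    return ["zpool", "scrub", *flags[action], pool]


def run_scrub_action(pool, action="start"):
    """Run the command; raises CalledProcessError if zpool exits non-zero."""
    subprocess.run(scrub_command(pool, action), check=True)
```

For example, `run_scrub_action("tank", "pause")` followed later by `run_scrub_action("tank")` would pause and then resume a scrub on a pool named `tank`.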

Finally I decided to upgrade and reboot, then I started the scrub again and it completed in 8ish hours. Looks like a reboot is the only fix for this.

The only thing that changed recently is that I started running rsyncs to a remote server out of the snapshot dirs. Maybe that's triggering some rare bug; I don't know.
 