Scrub hangs at circa 22%, samba shares not working at all

banancookie

Cadet
Joined
Apr 22, 2023
Messages
2
Hi everyone, this is my first post.

We have a problem with our TrueNAS SCALE server. There is an automated scrub task on our mirrored pool named "main". However, this scrub never finishes and hangs at circa 22% every time. The initial time estimate is about 3.5 h, but once the 22% mark is reached, the estimate just keeps increasing to hours, days, months or years, depending on how long we keep it running. The scrub can't be stopped via the web UI or the command line. When the scrub starts hanging, the disk I/O drops to zero and the following message appears in dmesg: "PANIC: fs: attempting to increase fill beyond max; probable double add in segment [2980f0ef000:298368e8000]" (see screenshots).
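
For anyone who wants to compare the values in that panic line across several attempts, they can be pulled apart with a few lines of Python. This is only a sketch: the function name `parse_panic_segment` is made up for illustration, and it assumes the two bracketed values are hexadecimal start and end offsets of a single range. The log does not say what units they are in, so the numbers are best treated as relative rather than as exact byte positions on the disks.

```python
import re

# Matches the "[start:end]" pair at the end of the panic line quoted above.
SEGMENT_RE = re.compile(r"segment \[([0-9a-fA-F]+):([0-9a-fA-F]+)\]")


def parse_panic_segment(line):
    """Return (start, end, length) parsed from a dmesg panic line, or None.

    Assumption: both values are hexadecimal and describe one range.
    """
    match = SEGMENT_RE.search(line)
    if match is None:
        return None
    start = int(match.group(1), 16)
    end = int(match.group(2), 16)
    return start, end, end - start


if __name__ == "__main__":
    sample = ("PANIC: fs: attempting to increase fill beyond max; "
              "probable double add in segment [2980f0ef000:298368e8000]")
    start, end, length = parse_panic_segment(sample)
    print(f"start={start:#x} end={end:#x} length={length:#x}")
```

If the same range shows up on every attempt, that would at least be consistent with the scrub always stopping at the same point.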

The pool consists of two mirrored HDDs (2x WD Red Plus, 4 TB each). When we shut down the system, disconnect one of the HDDs and run the scrub again, it hangs again at ~22%.

S.M.A.R.T. tests completed on both HDDs without any errors. Scrubs on other pools finish without errors.

We upgraded from TrueNAS Core to TrueNAS SCALE about a year ago. We thought maybe something went wrong during the upgrade, so we recently did a fresh installation of TrueNAS-SCALE-22.12.2. After importing the pool and starting a new scrub, this one also hangs at ~22%.

Recently, TrueNAS reported some errors (see attached images). To me, this sounds like a hardware error.

System information:
  • Motherboard make and model: ASRock Z87Pro4
  • CPU make and model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  • RAM quantity: 24 GB (2x8GB, 2x4GB), non-ECC
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives:
    • main pool: 2 WD Red Plus 4TB, WDC_WD40EFZX-68AWUN0, bought in Oct 2021
    • boot-pool: Samsung_SSD_840_EVO_250GB
  • Hard disk controllers: via mainboard
  • Network cards: Intel e82585 (e1000)
We were recently able to back up the most important files of the main pool, and we can read and write to the pool via the TrueNAS command line without problems. However, the Samba shares of the datasets don't work at all. When trying to read directories on the main pool, the system completely hangs and the Samba processes on TrueNAS eventually go into the D state (uninterruptible sleep). The processes can't be killed while they're in this state, and the only way to stop them is shutting down the system (which takes at least 15 min, probably because there are processes which can't be stopped). With each request, a new Samba process is spawned, which eventually ends up in the D state as well. Writing to the Samba share is also not possible at all.
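
To see which processes are stuck, the state can be read from `/proc` directly. The following is a minimal sketch, assuming a Linux system (as TrueNAS SCALE is) where `/proc/<pid>/stat` has the usual `pid (comm) state ...` layout. The function names are hypothetical and chosen only for this example.

```python
import os


def parse_stat(text):
    """Extract (pid, command name, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx].strip())
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while reading, or unexpected format
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

Running it once before and once after opening a share from a client would show whether new smbd entries keep piling up in the D state.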

During our scrub tests, no other services were running: no system services like Samba, dyndns, etc., and no VMs.

I don't know if these two "symptoms" (scrub hanging and Samba not working) are related at all. However, the Samba share problem in particular renders the system mostly useless.

Is there anything else we can try, or is this probably due to a hardware error in one of the WD Reds? They are not really old. They worked fine until around January of this year, which is when the scrub problem arose. So it's not likely that they were damaged during transport and broken from the start.

Thanks in advance.
 

Attachments

  • image-20230420-091857.jpeg
  • image-20230420-091942.jpeg
  • image-20230422-153624.jpeg
  • image-20230422-153634.jpeg

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Does the scrub hang regardless of which mirror disk is out of the pool? It sounds like there's damage to both disks. When you stood the system up, did you burn in the disks before putting data on them?
 

banancookie

Cadet
Joined
Apr 22, 2023
Messages
2
Yes, the scrub hangs regardless of which of the two mirror disks is currently inserted.
We didn't know burning in disks was a thing, but after having read into it, it sounds like a reasonable thing to do.

So you also think there is disk damage. We will try to get new disks, burn them in and then try again.

Thank you for your reply!
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
It could have been a memory issue that ECC would have detected and corrected: metadata was corrupted at some stage, and the corruption has been propagated to the pool and baked into it. You will still be vulnerable to this with your existing hardware. Seems like a good time to bring out the backups. If you are still looking for recent data, you might like to search for things like turning off some of the zfs/spl checks using kernel parameters, and also using zdb. I'm lucky that I have never had to go down that path, but if the backups are inadequate, it might be a direction. I'd recommend imaging the disks first and then snapshotting the images, so you can roll back to the snapshot and start again. You could even use a VM to do the investigation so that a kernel crash wouldn't take down the physical hardware, but again that is messy, time-consuming and requires more storage. It may also depend on how important the data is; if you need to recover it, you could even engage recovery people.
 