Panic Solaris zfs adding existent segment to range tree

nickdems

Dabbler
Joined
Jun 7, 2023
Messages
13
Hi all,

Wasn't sure where to post this thread; admins, please feel free to move it to a more relevant section.

Out of the blue a couple of weeks ago, my TrueNAS CORE (TrueNAS-13.0-U3.1) server crashed while transferring data and rebooted itself. Once it came back, it worked for a few days until it crashed again, and it now struggles to stay online after boot.

I hooked up a GPU and can see that the server boots just fine and gives the usual console options and the message "The web user interface is at: xx". However, a few seconds to a few minutes later, the server crashes with this error:

panic: Solaris(panic): zfs: adding existent segment to range tree

Attached image: 20231101_182511.jpg

I've also run a memtest overnight, which passed, leading me to assume memory/mobo/CPU are unlikely culprits.

Attached image: 20231102_100044.jpg

Here is my setup:
  • Motherboard: MSI B450 Tomahawk Max II
  • CPU: AMD Ryzen 5 3600
  • RAM: 4x 16 GB ECC Samsung
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives:
    • Boot Pool: 2x 256 GB Patriot 210, mirrored
    • SSD Pool: 2x 500 GB Crucial BX500, mirrored
    • Tank Pool: 4x 8 TB WD Red Pro RAIDZ1, 4x 16 TB Exos X16 RAIDZ1
  • Hard disk controllers: all the hard drives are connected to a Dell H200, and all the SSDs are connected to the motherboard directly
  • Network cards: onboard LAN



I have potentially narrowed the error down to my "Tank" pool: when I remove the HBA (with all HDDs attached), the system boots and is stable.
I tried removing the HDDs one at a time to see if it was a specific drive, but the error and subsequent reboot occurred every time.

I read a few similar issues on the site, but they all relate to the panic occurring during the import of the pool, which I am not sure is my case, as the system clearly boots and only falls over later on.

Any help would be appreciated.

Thanks,
Nicholas
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It could be a heat problem with the Dell H200 controller. As computers age, they collect dust, which reduces air flow.

Does it have adequate cooling?

Even if you think it does, try adding additional cooling to the card anyway.
 

nickdems

Thanks for your comment.
I have a 120 mm fan blowing air directly on it, so it should be fine (it's not hot to the touch). I read that these heatsinks can get hot, and I don't have an actual server chassis with front-to-back cooling.

Can I view the controller's temps from the server, perhaps?
 

Arwen

I don't know.

Do you have regular scrubs of your boot-pool?

That may seem odd because you are having trouble with your data pool. But if there is corruption on your boot device, that can translate into weird issues anywhere.

To be on the safe side, you can first check that a boot-pool scrub has run recently and, if not, start one.
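
A minimal sketch of that check from the shell, assuming the boot pool is named "boot-pool" (adjust POOL if yours differs). The scan_summary helper is a hypothetical name used only for this example; zpool status and zpool scrub are the standard commands.

```shell
#!/bin/sh
# Sketch only: see when the boot pool was last scrubbed.
# Assumption: the boot pool is called "boot-pool".
POOL="boot-pool"

# Hypothetical helper: keep only the "scan:" line of `zpool status`,
# which reports the most recent scrub and whether it found errors.
scan_summary() {
    grep 'scan:'
}

# Only touch the pool when the zpool command is actually available.
if command -v zpool >/dev/null 2>&1; then
    zpool status "$POOL" | scan_summary
    # If the last scrub is older than you would like, start a new one:
    # zpool scrub "$POOL"
fi
```

The scrub line is left commented out so that running the snippet only reads the pool state.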

Or, perhaps back up your configuration and re-install to the boot device.


Other than those, I don't have other thoughts. Perhaps someone else will.
 

nickdems

Yep, the SSD pools are scrubbed monthly and the HDD pool bi-monthly.
I reran the scrub, but it repaired nothing and the issue persists.

I'll give re-installing a go later on and report back.
 

Arwen

Great.

Even with a "clean" boot-pool scrub, there can still be corruption due to a temporary uncorrected memory error that got written to the boot-pool. That is one reason ECC RAM is suggested. With the tons and tons of data now stored on ZFS, lots of lower-end servers without ECC RAM may end up with odd things like this because of a bit flip.
 

nickdems

Using a fresh install, I am able to import the pool via the CLI (zpool import -f Tank); however, the pool does not show in the GUI, nor can I get any SMB/NFS shares going, so I can't really access the data.

I am, however, able to get a scrub going for the pool. I will report on this in the next few days when it completes.
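
A possible explanation for the missing pool, as far as I understand it: the TrueNAS web UI only tracks pools that were imported through it, so a pool imported with zpool import from the shell stays invisible to the GUI and to sharing. A sketch of handing the pool back, assuming it is named "Tank" and the scrub has finished. DRY_RUN and run are hypothetical helpers for this example, and by default the commands are only printed.

```shell
#!/bin/sh
# Sketch only: export a CLI-imported pool so the web UI can import it.
# Assumptions: pool name "Tank"; any running scrub has completed.
POOL="Tank"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Check pool health and scrub progress first.
run zpool status "$POOL"

# Export the pool so it is no longer imported outside the web UI.
run zpool export "$POOL"

# Afterwards, import the pool from the web UI's pool import dialog
# instead of using `zpool import` on the command line.
```

Set DRY_RUN=0 only once the printed commands look right for your system.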

Open to new avenues also :)
 

Arwen

I am out of ideas, sorry.
 

AlexDRL

Cadet
Joined
Feb 16, 2022
Messages
8
I am having the same issue with TrueNAS SCALE Cobia, with ZFS 2.2.0. It smells like something caused by the BRT feature. Is there some info I can post here? I can run my NAS "normally" with the ZFS recover tunable, but more info on this issue would be nice.
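
For anyone else landing here: the "recover" tunable is presumably OpenZFS's zfs_recover module parameter (an assumption, since the post does not name it). With it set, errors of this kind are reported as warnings instead of panicking the system. The OpenZFS documentation describes it as a last resort that can leak space or worse, so treat it as a way to get data off the pool, not as a fix. A sketch that prints the command for each platform; recover_cmd is a hypothetical helper, and nothing is changed on the system.

```shell
#!/bin/sh
# Sketch only: print the command that would enable the ZFS recover tunable.
# Nothing is executed against ZFS; the command is only echoed.

# Hypothetical helper: map the OS name (as printed by `uname -s`)
# to the matching command.
recover_cmd() {
    case "$1" in
        Linux)   echo 'echo 1 > /sys/module/zfs/parameters/zfs_recover' ;;  # e.g. TrueNAS SCALE
        FreeBSD) echo 'sysctl vfs.zfs.recover=1' ;;                         # e.g. TrueNAS CORE
        *)       echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# Show the command for the machine this runs on.
recover_cmd "$(uname -s)" || true
```

Set this way, the tunable does not survive a reboot. To keep it, it would have to be added through the system's own tunables settings.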
 