Kernel Panic

Joined
Dec 2, 2015
Messages
730
I'm trying to troubleshoot a kernel panic issue. The server had worked well for several years, but it KP'd overnight last night, and now panics within a minute or two after FreeNAS boots.

I've been able to look at /var/log in the brief periods before it panics, but haven't found anything yet that looks unusual. /data/crash has nothing new - it only has dumps from panics in 2016.

I recorded the console via IPMI, and a screen grab shows:

[Console screenshot: vlcsnap-2020-11-03-11h02m43s333.png]

Does anyone see any clues in the KP console output above?

I reseated all data and power cables. I reseated the RAM sticks and the M1015. Still panics.

I tried reverting to an older FreeNAS boot environment (FN 11.3-U3). Still panics.

I tried booting without the data disks installed. Still panics.

Now I'm testing RAM using MemTest86. I figure that should rule out RAM, and partially rule out the CPU. If the RAM looks OK after one full pass, I'll do a CPU stress test with mprime. If that passes, I'll temporarily remove the M1015. If it still panics, I'll try moving the OS to a USB stick (only for testing) to rule out the SSD.
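While I'm at it, I'll probably also kick off a SMART self-test on the boot SSD as part of ruling it out. Something like this should do it (the SSD sits behind the M1015, so the actual device will be one of the daN nodes; da0 below is just a placeholder):
Code:
smartctl -t long /dev/da0    # start a long self-test on the boot SSD (placeholder device name)
smartctl -a /dev/da0         # review health attributes and the self-test log once it finishes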

Any advice is appreciated.

Thanks,

Kevin

System Config:
FreeNAS 11.3-U5
Supermicro X10SRH-cF with E5-1650v4 CPU.
RAM: 96 GB (2 x Samsung 32GB M393A4K40BB0-CPB + 2 x Samsung 16GB M393A2G40DB0-CPB)
Boot drive: SanDisk Plus 120GB connected to M1015
HBA: M1015 flashed to IT mode, with firmware 20.00.07.00
PSU: 860W SeaSonic SS-860-XP2 Platinum
Data disks: 2 pools, each consisting of 8 x WD Red WD40EFRX 4TB in RAIDZ2
Chassis: Norco RPC-4224
 
Joined
Jan 7, 2015
Messages
1,150
I've had these issues with bad boot drives and/or RAM before. Is the system dataset on the boot drive or the Z2 pool? Try installing temporarily to a USB stick and see.

Seems you are on the right track.
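If you want a quick way to check, something like this from a shell should show which pool the .system dataset lives on (assuming the default dataset naming):
Code:
zfs list -o name,used | grep -i '\.system'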
 

belya

Cadet
Joined
Nov 4, 2020
Messages
5
I have the same situation. I have 2 zpools on TrueNAS 12 RELEASE:
- 1 x Samsung 970 EVO Plus 500 GB NVMe M.2 (MZ-V7S500BW) - just a simple volume (zpool VM)
- 3 x SATA 4.0 TB Seagate IronWolf Pro NAS 7200 rpm 256 MB (ST4000NE001) - RAIDZ (if I remember correctly) (zpool GENERAL)

At some point the server was restarted and one of the zvols on VM got corrupted.
Now I cannot import zpool VM; from debug mode I got the information that there is one "broken zvol", and all I can do is import the zpool in read-only mode.

Here is what I did:
1. Boot with option 3 (Escape to loader prompt) in the boot menu
2. Run the following commands at the loader prompt:
Code:
set vfs.zfs.recover=1
set vfs.zfs.debug=1
boot -s

3. After booting, run the following:
Code:
sh /etc/rc.initdiskless

4. Import the zpool to a temporary location
Code:
sysctl vfs.zfs.recover=1
sysctl vfs.zfs.debug=1
mkdir /tmp/vm
zpool import -o readonly=on -R /tmp/vm vm                # able to import in read-only mode ONLY
zpool import -fFn -R /tmp/vm -o failmode=continue vm     # trying to recover results in the kernel panic
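
Since only the read-only import works, my next step will probably be to copy whatever I can over to the other pool before rebuilding. Roughly like this, where the dataset, snapshot and zvol names are just examples from my layout, and assuming GENERAL is imported normally at /mnt/GENERAL:
Code:
# send an existing snapshot of a dataset from the read-only pool to the healthy pool
zfs send vm/somedataset@lastsnap | zfs receive GENERAL/vm-rescue
# or image a zvol straight from its device node (this may still panic if it hits the corrupt blocks)
mkdir -p /mnt/GENERAL/rescue
dd if=/dev/zvol/vm/somezvol of=/mnt/GENERAL/rescue/somezvol.img bs=1M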
 

belya

Cadet
Joined
Nov 4, 2020
Messages
5
Here is the picture for my post; I couldn't find a way to edit the original post :(
 

Attachments

  • nas-error-cutted.jpg

dak180

Patron
Joined
Nov 22, 2017
Messages
308
System Config:
FreeNAS 11.3-U5
Supermicro X10SRH-cF with E5-1650v4 CPU.
RAM: 96 GB (2 x Samsung 32GB M393A4K40BB0-CPB + 2 x Samsung 16GB M393A2G40DB0-CPB)
Boot drive: SanDisk Plus 120GB connected to M1015
HBA: M1015 flashed to IT mode, with firmware 20.00.07.00
PSU: 860W SeaSonic SS-860-XP2 Platinum
Data disks: 2 pools, each consisting of 8 x WD Red WD40EFRX 4TB in RAIDZ2
Chassis: Norco RPC-4224
Are there any replication jobs that normally run (local or otherwise)?
 
Joined
Dec 2, 2015
Messages
730
Are there any replication jobs that normally run (local or otherwise)?
Yes, there is a local replication from the main pool to the backup pool, and a remote task to push the main pool to the backup server. It is certainly possible that they would be trying to start up around the time the KPs occur, so they are a suspect. Thanks for the question.

Once I restart the server later this morning, I'll see if I can get into the web interface quickly enough to disable those tasks before it KPs.

The MemTest86 tests passed with no failures, which should rule out RAM, and probably the CPU as well. Next up is mprime, then I'll try booting FreeNAS again and disabling replication.
 

dak180

Patron
Joined
Nov 22, 2017
Messages
308
Yes, there is a local replication from the main pool to the backup pool
This sounds like damage to the pool as described by NAS-102541; you will need to rebuild the affected pools and restore from backups if this is the case.
 
Joined
Dec 2, 2015
Messages
730
This sounds like damage to the pool as described by NAS-102541; you will need to rebuild the affected pools and restore from backups if this is the case.
This is looking like a potential cause. I removed the boot SSD from the system and switched to a temporary USB flash drive for the OS. I get a repeatable KP, similar to the one in the original post, when importing the backup pool.

[Console screenshot: vlcsnap-2020-11-05-07h54m40s825.png]


I'll switch back to the SSD for the boot drive and boot without the backup pool disks inserted. If I can restore all functions except the replication to the backup pool, I'll wipe that pool and start over with a fresh backup pool.
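
Since the backup pool panics on import, I expect the wipe will have to happen without ever importing it, either by building a new pool over those disks in the GUI or by clearing the old labels from a shell first. A rough sketch, with a placeholder disk/partition name, and noting that FreeNAS partitions its data disks, so the GUI is the safer route:
Code:
zpool labelclear -f /dev/da0p2    # clear the stale ZFS label on each former member disk, then recreate the pool in the GUI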
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I'll switch back to the SSD for the boot drive and boot without the backup pool disks inserted. If I can restore all functions except the replication to the backup pool, I'll wipe that pool and start over with a fresh backup pool.
How did this go? Man, I would be totally freaking out if this happened. It seems like an OS design flaw if a corrupted pool can cause kernel panics. The good thing is that if it's your backup pool, that's the least traumatic fix. But you may want to test those drives thoroughly before rebuilding it.
 
Joined
Dec 2, 2015
Messages
730
It certainly looks like the corrupted backup pool was the cause of the KPs. After switching back to the boot SSD, I gradually restored services over a period of several days, waiting a few hours at each configuration to see if the latest change would trigger KPs again. Then I wiped the backup pool and made a new one. Replication has started to the backup pool, with no issues yet.

I didn't do any specific testing on the drives in the backup pool before putting them back in service, as the SMART data all looked good.
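
If I change my mind about exercising them after the fact, the plan would be long SMART self-tests on each member plus a scrub of the new pool once the initial replication lands. Roughly (pool and device names are placeholders):
Code:
for d in da1 da2 da3 da4 da5 da6 da7 da8; do smartctl -t long /dev/$d; done    # long self-test on each member
zpool scrub backup                                                             # re-read everything once replication completes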

I agree that it seems unacceptable that a corrupted pool should cause a KP, unless it was the boot pool. I would expect that any services that required the corrupted pool would be affected, or maybe crash, but I would hope that the OS would be unaffected.
 

styno

Patron
Joined
Apr 11, 2016
Messages
466
I struggled with that issue for the better part of a year until it got fixed around April.

I can't remember the exact FreeNAS version the commit got merged into. It was supposed to be 11.3-U3-something, but I remember it didn't make that deadline, so go for at least 11.3-U4. You will have to recreate the corrupt pool and resync from scratch, as every I/O that hits the corruption will trigger a reboot. (That also means a resilver will throw you into a boot loop, since a resilver operation can't be cancelled.)

PS: try grep -i panic /var/db/system/ixdiagnose/ixdiagnose/crash/info* if you want to see the panic strings that caused the panic/reboot. That sure beats trying to grab console screenshots ;)
 
Joined
Dec 2, 2015
Messages
730
PS: try grep -i panic /var/db/system/ixdiagnose/ixdiagnose/crash/info* if you want to see the panic strings that caused the panic/reboot. That sure beats trying to grab console screenshots ;)
Strangely enough, the hits from that grep are all from 2016, as are all the core dumps in /data/crash. Is it possible to panic hard enough that the system is unable to do a core dump?
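
Next time it panics I'll also check whether a dump device is actually configured and whether anything landed on it. Something along these lines, where the device name is just a guess at the swap partition on my boot drive:
Code:
sysctl kern.shutdown.dumpdevname        # shows which device, if any, is set up to receive kernel dumps
savecore -C /data/crash /dev/ada0p3     # check that device for an unsaved crash dump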
 

styno

Patron
Joined
Apr 11, 2016
Messages
466
Ah, I assume your system dataset points to the datapool (that has the corruption) and not to the boot pool. IIRC it writes the core dumps to the .system dataset.

Edit: D'oh, "System - Advanced - Save Debug" first, of course. My bad.
 
Joined
Dec 2, 2015
Messages
730
Ah, I assume your system dataset points to the datapool (that has the corruption) and not to the boot pool. IIRC it writes the core dumps to the .system dataset.

Edit: D'oh, "System - Advanced - Save Debug" first, of course. My bad.
The system dataset is on the boot pool, if the GUI can be believed.

Are you suggesting that core dumps are only saved if a debug kernel is enabled?
 

styno

Patron
Joined
Apr 11, 2016
Messages
466
Are you suggesting that core dumps are only saved if a debug kernel is enabled?
No, there is no need to activate the debug kernel. You can save a 'debug' package to upload to iXsystems for troubleshooting purposes. This data will be saved in the ixdiagnose directory and you should be able to extract the panic string with the above command.
 