Question about panic

Status
Not open for further replies.

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
Hi,

I am struggling with a FreeNAS system, and the newest problem is a panic during the night. Can anyone tell me what the problem is?

/data/crash/info.last:
Code:
Dump header from device: /dev/da10p1
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 1056256
  Blocksize: 512
  Dumptime: Tue Sep 18 06:38:16 2018
  Hostname: mitch.interdotnet.de
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 11.1-STABLE #0 r321665+9902d126c39(freenas/11.1-stable): Tue Aug 21 12:24:37 EDT 2018
	root@nemesis.tn.ixsystems.com:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/
  Panic String: solaris assert: !zilog_is_dirty(zilog), file: /freenas-11-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c, line: 1878
  Dump Parity: 838383930
  Bounds: 3
  Dump Status: good
 

dlavigne

Guest
Did the panic occur when booting up a previously working system? If so, the most likely culprit is a failed boot device. You can test that theory by reinstalling to a new USB stick. If it boots, restore your previously saved config.
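If you have shell access, a quick sanity check of the boot pool before reinstalling looks something like this (the device name is only an example; substitute whatever your boot SSDs show up as):

Code:
# health of the FreeNAS boot pool
zpool status -v freenas-boot

# SMART data for one of the mirrored boot SSDs (adjust the device name)
smartctl -a /dev/ada0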
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Please go read the forum rules. Link at the top of the page. Supply the hardware and usage details that the rules require. Then, someone might have a clue what the problem is.

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
  • Mainboard: Supermicro X11SSH-F
  • CPU: Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz (3792.13-MHz K8-class CPU)
  • RAM: 32 GB
Disks:
  • System: 2x Samsung SSD 850 PRO 256GB EXM04B6Q (mirror)
  • Pool1 (RAIDZ2): 7x Hitachi HDS5C302 AA10 and 1x ST32000645NS 0004 (2 TB disks); SLOG device: 1x Samsung SSD 860 1B6Q
  • Pool2 and Pool3 (mirrors): 2x ST8000NM0055-1RM SN04 each
  • Disk controller: Avago Technologies (LSI) SAS3224
Network cards:
  • Onboard 2x Intel(R) PRO/1000 Network Connection, version 2.5.3-k; combined in a LACP lagg
Pool1 is used for iSCSI targets and some NFS shares.
Pool2 is used for an NFS share.
Pool3 is currently not in use.

All iSCSI targets (for Windows servers) and NFS shares (for some Linux servers) are used for backups that run roughly once per day; they are not spread out over the whole day but start in the evening or during the night.

I see iSCSI errors, but they got better after I added a SLOG device and moved a huge backup to Pool2. Users haven't noticed any problems since those changes, but tonight I got an unexpected reboot with the file shown above. I'd like to fix whatever caused the reboot, although I can already see myself moving the iSCSI targets to the mirror pools, dissolving the RAIDZ2, and creating multiple mirrors instead.
 

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
@dlavigne The system was running and working before; the crash and subsequent reboot happened during the night.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Panic String: solaris assert: !zilog_is_dirty(zilog), file: /freenas-11-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c, line: 1878

This assert seems to be related to a ZIL race condition that's referenced in the comments of FreeBSD bug 200592 (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200592#c8)

Do you have scheduled snapshots running during the night, and do you have a kernel.full dump?
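If you're not sure what was saved, a quick look in the crash directory will tell you (a full dump shows up as vmcore.N, a minimal one as textdump.tar.N):

Code:
# see what savecore wrote out after the panic
ls -lh /data/crash/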

SLOG device: 1x Samsung SSD 860 1B6Q

Is that an 860 EVO? That's a poor SLOG choice, since it lacks PLP, and it could potentially be related to the panic given that your assert references a dirty ZIL.
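If you want to see how hard that SLOG is actually being hit, watching per-vdev I/O during a backup window is a quick check (pool name assumed from your description):

Code:
# per-vdev throughput, including the log device, refreshed every 5 seconds
zpool iostat -v Pool1 5

# overall per-disk latency and %busy
gstat -p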
 

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
No, I don't do snapshots on that machine because it is only a target for backups from other machines. I don't need to snapshot those.

Yes, that is an 860 EVO. Why should the panic be related to the SSD not having PLP? There was no power outage.

To be honest, I am not that worried about a power outage. The machine has two power supplies which are connected to two independent power circuits. If this machine goes down because of a power outage, I have much bigger things to worry about than a bit of lost data from a partial backup.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
No, I don't do snapshots on that machine because it is only a target for backups from other machines. I don't need to snapshot those.

Damn, there goes that grasp at straws.

Yes, that is an 860 EVO. Why should the panic be related to the SSD not having PLP? There was no power outage.

Power Loss Protection is an inherent part of what makes a good SLOG device - a sync write isn't considered complete until it's stored on non-volatile storage. An SSD with PLP can report the write as complete much sooner, since it only needs to accept it into RAM, which is backed by capacitors that will flush it to NAND in case of power loss.

To be honest, I am not that worried about a power outage. The machine has two power supplies which are connected to two independent power circuits. If this machine goes down because of a power outage, I have much bigger things to worry about than a bit of lost data from a partial backup.

PLP protects against internal faults like a backplane/HBA failure - redundant external power can't solve that.

I have a textdump.tar.gz. Is that enough?

Got some bits from that, thanks:

Code:
cpuid		= 6
dynamic pcpu = 0xfffffe04c7c2c400
curthread	= 0xfffff8007693e000: pid 6 "txg_thread_enter"
curpcb	   = 0xfffffe045dc30a80
fpcurthread  = none
idlethread   = 0xfffff8000837c000: tid 100009 "idle: cpu6"
curpmap	  = 0xffffffff821b1bb8
tssp		 = 0xffffffff821e5600
commontssp   = 0xffffffff821e5600
rsp0		 = 0xfffffe045dc30a80
gs32p		= 0xffffffff821ebe58
ldt		  = 0xffffffff821ebe98
tss		  = 0xffffffff821ebe88
curvnet	  = 0

--- snipped ---

Tracing command init pid 1 tid 100002 td 0xfffff8000837f5c0
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe04499532b0
vpanic() at vpanic+0x1a3/frame 0xfffffe0449953330
panic() at panic+0x43/frame 0xfffffe0449953390
assfail() at assfail+0x1a/frame 0xfffffe04499533a0
zil_close() at zil_close+0x186/frame 0xfffffe04499533e0
zvol_last_close() at zvol_last_close+0x15/frame 0xfffffe0449953400
zvol_d_close() at zvol_d_close+0x8c/frame 0xfffffe0449953430
devfs_close() at devfs_close+0x401/frame 0xfffffe04499534b0
VOP_CLOSE_APV() at VOP_CLOSE_APV+0x83/frame 0xfffffe04499534e0
vgonel() at vgonel+0xb3/frame 0xfffffe0449953560
vflush() at vflush+0x341/frame 0xfffffe04499536a0
devfs_unmount() at devfs_unmount+0x38/frame 0xfffffe04499536e0
dounmount() at dounmount+0x64c/frame 0xfffffe0449953760
vfs_unmountall() at vfs_unmountall+0xc4/frame 0xfffffe0449953790
bufshutdown() at bufshutdown+0x3dd/frame 0xfffffe04499537e0
kern_reboot() at kern_reboot+0x1aa/frame 0xfffffe0449953830
sys_reboot() at sys_reboot+0x458/frame 0xfffffe0449953880
amd64_syscall() at amd64_syscall+0xa4a/frame 0xfffffe04499539b0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe04499539b0
--- syscall (55, FreeBSD ELF64, sys_reboot), rip = 0x40ef7a, rsp = 0x7fffffffe798, rbp = 0x7fffffffe880 ---


Looks like this could be related to the txg race condition referenced.

Have you ruled out hardware failure (e.g., checked for memory faults in the IPMI event log)?
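For the memory side specifically, something like this will show whether the BMC or the kernel has logged anything (assuming ipmitool is available on your build):

Code:
# BMC system event log; ECC/memory events show up here
ipmitool sel elist

# machine-check or ECC messages the kernel may have printed
dmesg | grep -i -e mca -e ecc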
 

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
Yeah, OK, the backplane-failure scenario could be a reason to change the SSD to something different. I'll have to see whether I can swap it.

I cannot see anything in the IPMI about RAM problems, and dmesg doesn't show me anything either.
The machine is new except for the system SSDs and the 2 TB HDDs, which I took out of the machine we replaced.
I could swap the RAM with a twin system I have on hand that isn't in production use and has different RAM in it, but I'd rather wait and see whether this crash happens again.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
SSDs with proper PLP should also be faster at sync writes, since they can just take the write into RAM and not have to actually commit it to NAND, so there's a performance reason to go with it as well.
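A rough way to see how much of your workload actually rides on that sync path is to force it on a throwaway dataset and compare (pool/dataset names here are only examples):

Code:
# every write on this dataset goes through the ZIL/SLOG
zfs create Pool1/synctest
zfs set sync=always Pool1/synctest
dd if=/dev/zero of=/mnt/Pool1/synctest/testfile bs=128k count=8192

# same stream with the ZIL bypassed, for comparison only - never leave a
# production dataset at sync=disabled
zfs set sync=disabled Pool1/synctest
dd if=/dev/zero of=/mnt/Pool1/synctest/testfile bs=128k count=8192

zfs destroy Pool1/synctest

The gap between the two dd throughput figures is roughly what the SLOG's sync write latency is costing you.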

The only function in zil.c that can ASSERT !zilog_is_dirty(zilog) without also checking || spa_freeze_txg appears to be zil_close, but the previous bug seemed to hit a dead end beyond the user reporting that "yes, it happened again in a newer version" - I can't find a reopened one.
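If you have a source checkout matching the build in the panic string, the call sites are easy to confirm (path taken from the panic string; adjust it to wherever your tree lives):

Code:
grep -n 'zilog_is_dirty' sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c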

Perhaps it's time to file a bug report upstream at OpenZFS?
 