Question about panic

Status
Not open for further replies.

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
Hi,

I am struggling with a FreeNAS system, and the newest problem is a panic during the night. Can anyone tell me what the problem is?

/data/crash/info.last:
Code:
Dump header from device: /dev/da10p1
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 1056256
  Blocksize: 512
  Dumptime: Tue Sep 18 06:38:16 2018
  Hostname: mitch.interdotnet.de
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 11.1-STABLE #0 r321665+9902d126c39(freenas/11.1-stable): Tue Aug 21 12:24:37 EDT 2018
	root@nemesis.tn.ixsystems.com:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/
  Panic String: solaris assert: !zilog_is_dirty(zilog), file: /freenas-11-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c, line: 1878
  Dump Parity: 838383930
  Bounds: 3
  Dump Status: good
 

dlavigne

Guest
Did the panic occur when booting up a previously working system? If so, the most likely culprit is a failed boot device. You can test that theory by reinstalling to a new USB stick. If it boots, restore your previously saved config.
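If you have shell access, a quick sanity check of the boot pool before reinstalling looks something like this (the device name is only an example; substitute whatever your boot SSDs show up as):

Code:
# health of the FreeNAS boot pool
zpool status -v freenas-boot

# SMART data for one of the mirrored boot SSDs (adjust the device name)
smartctl -a /dev/ada0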
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Please go read the forum rules. Link at the top of the page. Supply the hardware and usage details that the rules require. Then, someone might have a clue what the problem is.

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
  • Mainboard: Supermicro X11SSH-F
  • CPU: Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz (3792.13-MHz K8-class CPU)
  • RAM: 32 GB
Disks:
  • System: 2x Samsung SSD 850 PRO 256GB EXM04B6Q (mirror)
  • Pool1 (RAIDZ2): 7x Hitachi HDS5C302 AA10 and 1x ST32000645NS 0004 (2 TB disks); SLOG device: 1x Samsung SSD 860 1B6Q
  • Pool2 and Pool3 (mirrors): 2x ST8000NM0055-1RM SN04 each
  • Disk controller: Avago Technologies (LSI) SAS3224
Network cards:
  • Onboard 2x Intel(R) PRO/1000 Network Connection, version 2.5.3-k; combined in a LACP lagg
Pool1 is used for iSCSI targets and some NFS shares.
Pool2 is used for an NFS share.
Pool3 is currently not in use.

All iSCSI targets (for Windows servers) and NFS shares (for some Linux servers) are used for backups that run roughly once per day; they are not spread out over the whole day but start in the evening or during the night.

I see iSCSI errors, but they got better after I added a SLOG device and moved a huge backup to Pool2. Users haven't noticed any problems since those changes, but tonight I got an unexpected reboot with the file shown above. I'd like to fix whatever caused the reboot, although I can already see myself moving the iSCSI targets to the mirror pools, dissolving the RAIDZ2, and creating multiple mirrors instead.
 

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
@dlavigne The system was running and working before; the crash and subsequent reboot happened during the night.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Panic String: solaris assert: !zilog_is_dirty(zilog), file: /freenas-11-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c, line: 1878

This assert seems to be related to a ZIL race condition that's referenced in the comments of FreeBSD bug 200592 (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200592#c8)

Do you have scheduled snapshots running during the night, and do you have a kernel.full dump?
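If you're not sure what was saved, a quick look in the crash directory will tell you (a full dump shows up as vmcore.N, a minimal one as textdump.tar.N):

Code:
# see what savecore wrote out after the panic
ls -lh /data/crash/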

SLOG device: 1x Samsung SSD 860 1B6Q

Is that an 860 EVO? That's a poor SLOG choice, since it lacks PLP, and it could potentially be related to the panic given that your assert references a dirty ZIL.
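If you want to see how hard that SLOG is actually being hit, watching per-vdev I/O during a backup window is a quick check (pool name assumed from your description):

Code:
# per-vdev throughput, including the log device, refreshed every 5 seconds
zpool iostat -v Pool1 5

# overall per-disk latency and %busy
gstat -p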
 

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
No, I don't do snapshots on that machine because it is only a target for backups from other machines. I don't need to snapshot those.

Yes, that is an 860 EVO. Why should the panic be related to the SSD not having PLP? There was no power outage.

To be honest, I am not that worried about a power outage. The machine has two power supplies which are connected to two independent power circuits. If this machine goes down because of a power outage, I have much bigger things to worry about than a bit of lost data from a partial backup.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
No, I don't do snapshots on that machine because it is only a target for backups from other machines. I don't need to snapshot those.

Damn, there goes that grasp at straws.

Yes, that is an 860 EVO. Why should the panic be related to the SSD not having PLP? There was no power outage.

Power Loss Protection is an inherent part of what makes a good SLOG device - a sync write isn't considered complete until it's stored on non-volatile storage. An SSD with PLP can report the write as complete much sooner, since it only needs to accept it into RAM, which is backed by capacitors that will flush it to NAND in case of power loss.

To be honest, I am not that worried about a power outage. The machine has two power supplies which are connected to two independent power circuits. If this machine goes down because of a power outage, I have much bigger things to worry about than a bit of lost data from a partial backup.

PLP protects against internal faults like a backplane/HBA failure - redundant external power can't solve that.

I have a textdump.tar.gz. Is that enough?

Got some bits from that, thanks:

Code:
cpuid		= 6
dynamic pcpu = 0xfffffe04c7c2c400
curthread	= 0xfffff8007693e000: pid 6 "txg_thread_enter"
curpcb	   = 0xfffffe045dc30a80
fpcurthread  = none
idlethread   = 0xfffff8000837c000: tid 100009 "idle: cpu6"
curpmap	  = 0xffffffff821b1bb8
tssp		 = 0xffffffff821e5600
commontssp   = 0xffffffff821e5600
rsp0		 = 0xfffffe045dc30a80
gs32p		= 0xffffffff821ebe58
ldt		  = 0xffffffff821ebe98
tss		  = 0xffffffff821ebe88
curvnet	  = 0

--- snipped ---

Tracing command init pid 1 tid 100002 td 0xfffff8000837f5c0
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe04499532b0
vpanic() at vpanic+0x1a3/frame 0xfffffe0449953330
panic() at panic+0x43/frame 0xfffffe0449953390
assfail() at assfail+0x1a/frame 0xfffffe04499533a0
zil_close() at zil_close+0x186/frame 0xfffffe04499533e0
zvol_last_close() at zvol_last_close+0x15/frame 0xfffffe0449953400
zvol_d_close() at zvol_d_close+0x8c/frame 0xfffffe0449953430
devfs_close() at devfs_close+0x401/frame 0xfffffe04499534b0
VOP_CLOSE_APV() at VOP_CLOSE_APV+0x83/frame 0xfffffe04499534e0
vgonel() at vgonel+0xb3/frame 0xfffffe0449953560
vflush() at vflush+0x341/frame 0xfffffe04499536a0
devfs_unmount() at devfs_unmount+0x38/frame 0xfffffe04499536e0
dounmount() at dounmount+0x64c/frame 0xfffffe0449953760
vfs_unmountall() at vfs_unmountall+0xc4/frame 0xfffffe0449953790
bufshutdown() at bufshutdown+0x3dd/frame 0xfffffe04499537e0
kern_reboot() at kern_reboot+0x1aa/frame 0xfffffe0449953830
sys_reboot() at sys_reboot+0x458/frame 0xfffffe0449953880
amd64_syscall() at amd64_syscall+0xa4a/frame 0xfffffe04499539b0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe04499539b0
--- syscall (55, FreeBSD ELF64, sys_reboot), rip = 0x40ef7a, rsp = 0x7fffffffe798, rbp = 0x7fffffffe880 ---


Looks like this could be related to the txg race condition referenced.

Have you ruled out hardware failure (e.g., checked for memory faults in the IPMI event log)?
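For the memory side specifically, something like this will show whether the BMC or the kernel has logged anything (assuming ipmitool is available on your build):

Code:
# BMC system event log; ECC/memory events show up here
ipmitool sel elist

# machine-check or ECC messages the kernel may have printed
dmesg | grep -i -e mca -e ecc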
 

nielsk

Cadet
Joined
Sep 18, 2018
Messages
7
Yeah, OK, the backplane-failure scenario could be a reason to change the SSD to something different. I'll have to see whether I can swap it.

I cannot see anything in the IPMI about RAM problems, and dmesg doesn't show me anything either.
The machine is new except for the system SSDs and the 2 TB HDDs, which I took out of the machine we replaced.
I could swap the RAM with a twin system I have on hand that isn't in production use and has different RAM in it, but I'd rather wait and see whether this crash happens again.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
SSDs with proper PLP should also be faster at sync writes, since they can just take the write into RAM and not have to actually commit it to NAND, so there's a performance reason to go with it as well.
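A rough way to see how much of your workload actually rides on that sync path is to force it on a throwaway dataset and compare (pool/dataset names here are only examples):

Code:
# every write on this dataset goes through the ZIL/SLOG
zfs create Pool1/synctest
zfs set sync=always Pool1/synctest
dd if=/dev/zero of=/mnt/Pool1/synctest/testfile bs=128k count=8192

# same stream with the ZIL bypassed, for comparison only - never leave a
# production dataset at sync=disabled
zfs set sync=disabled Pool1/synctest
dd if=/dev/zero of=/mnt/Pool1/synctest/testfile bs=128k count=8192

zfs destroy Pool1/synctest

The gap between the two dd throughput figures is roughly what the SLOG's sync write latency is costing you.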

The only function in zil.c that can ASSERT !zilog_is_dirty(zilog) without also checking || spa_freeze_txg appears to be zil_close, but the previous bug seemed to hit a dead end beyond the user reporting that "yes, it happened again in a newer version" - I can't find a reopened one.
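If you have a source checkout matching the build in the panic string, the call sites are easy to confirm (path taken from the panic string; adjust it to wherever your tree lives):

Code:
grep -n 'zilog_is_dirty' sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c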

Perhaps it's time to file a bug report upstream at OpenZFS?
 