ZFS crash

Status
Not open for further replies.

arthurr

Cadet
Joined
Feb 29, 2016
Messages
9
I'm having stability issues with FreeNAS-9.3-STABLE-201509022158; see the attached screenshot.
Code:
# zfs list tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank   251G   290G  44.5K  /mnt/tank

Code:
# zpool status tank
  pool: tank
state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
    still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
    pool will no longer be accessible on software that does not support feature
    flags.
  scan: scrub repaired 0 in 1h3m with 0 errors on Sun Mar  6 01:03:08 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/5a0a0323-7921-6964-8e75-d40c5d01776a  ONLINE       0     0     0
        gptid/34ba81da-c021-5acf-da0a-b99b02b81b6d  ONLINE       0     0     0

errors: No known data errors


I suspect tank was the zpool that caused the kernel crash as the other zpool should have been inactive.

Code:
# zfs list unitrends
NAME        USED  AVAIL  REFER  MOUNTPOINT
unitrends  6.22T  1.68T  52.4K  /mnt/unitrends
# zfs get compression unitrends
NAME       PROPERTY     VALUE  SOURCE
unitrends  compression  off    default
# zfs get dedup unitrends
NAME       PROPERTY  VALUE  SOURCE
unitrends  dedup     off    default

Code:
# zpool status unitrends
  pool: unitrends
state: ONLINE
status: One or more devices are configured to use a non-native block size.
   Expect reduced performance.
action: Replace affected devices with devices that support the
   configured block size, or migrate data to a properly configured
   pool.
  scan: scrub repaired 598K in 3h7m with 0 errors on Sun Mar  6 03:07:59 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    unitrends                                       ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/dcf88df4-59eb-b1c5-f893-cafbd1c7cad4  ONLINE       0     0     0
        gptid/b874ee52-e07c-8545-9ab4-9417e2e89593  ONLINE       0     0     0
        gptid/7964f343-cf69-0acb-d655-8c71bd326151  ONLINE       0     0     0
        gptid/c3e7c2fa-3d20-26c8-afd1-af0cc1486df7  ONLINE       0     0     0
      raidz1-1                                      ONLINE       0     0     0
        gptid/b03363ed-dc79-ab6f-d88b-f1414fd9fd3a  ONLINE       0     0     0
        gptid/f5c57bd1-3221-b9ee-ee9b-db74aef5b8c4  ONLINE       0     0     0
        gptid/0097375e-29e6-9de9-a6bf-d0cc9dc82701  ONLINE       0     0     0
        gptid/8354f83f-3444-7f49-e08c-a9f2db39b402  ONLINE       0     0     0
      raidz1-2                                      ONLINE       0     0     0
        gptid/c21e456e-e7d2-5d47-e4f5-8cf202781a30  ONLINE       0     0     0
        gptid/dda4e736-52e1-c267-f0e4-d97df6dff207  ONLINE       0     0     0
        gptid/69233fe6-4f27-e1ec-bf3a-b9bcdc829df5  ONLINE       0     0     0
        gptid/01dccfde-d2d5-03e1-b6cd-973255168e2b  ONLINE       0     0     0
      raidz1-3                                      ONLINE       0     0     0
        gptid/d58a3b0c-a848-ace0-af98-a082c98fe49b  ONLINE       0     0     0
        gptid/fed64fbe-6dfd-3bc6-b20d-cfc3f618e91f  ONLINE       0     0     0
        gptid/37e8f78b-f155-806b-f6cb-b8b785af9920  ONLINE       0     0     0
        gptid/2832fc4f-ddad-7649-b814-f80c8e1a7452  ONLINE       0     0     0
    logs
      gptid/8a62d106-7d45-03cd-dc93-ad039fc7dc09    ONLINE       0     0     0  block size: 512B configured, 4096B native
    cache
      gptid/c7857548-c70a-d141-c2de-9b91ff687fdf    ONLINE       0     0     0

errors: No known data errors


Code:
# lspci
00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07)
00:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
00:02.2 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2c (rev 07)
00:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3a in PCI Express Mode (rev 07)
00:03.2 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3c (rev 07)
00:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 0 (rev 07)
00:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 1 (rev 07)
00:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 2 (rev 07)
00:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 3 (rev 07)
00:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 4 (rev 07)
00:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 5 (rev 07)
00:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 6 (rev 07)
00:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 7 (rev 07)
00:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management (rev 07)
00:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control Status and Global Errors (rev 07)
00:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
00:11.0 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Virtual Root Port (rev 06)
00:16.0 Communication controller: Intel Corporation C600/X79 series chipset MEI Controller #1 (rev 05)
00:16.1 Communication controller: Intel Corporation C600/X79 series chipset MEI Controller #2 (rev 05)
00:1a.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #2 (rev 06)
00:1d.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #1 (rev 06)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a6)
00:1f.0 ISA bridge: Intel Corporation C600/X79 series chipset LPC Controller (rev 06)
00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)
00:1f.3 SMBus: Intel Corporation C600/X79 series chipset SMBus Host Controller (rev 06)
00:1f.6 Signal processing controller: Intel Corporation C600/X79 series chipset Thermal Management Controller (rev 06)
03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
05:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
05:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
05:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
08:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)


This is a 2U Supermicro with 16 GB of RAM.
 

Attachments

  • Capture.PNG (40.9 KB)

arthurr

Cadet
Joined
Feb 29, 2016
Messages
9
No iSCSI, just CIFS and NFS. This box was stable with 8 GB of memory with Nexenta (Illumos), but thinking the stability issues were memory related I already doubled it. What would you recommend for RAM? Is there something in the backtrace that leads you to suspect memory?
 

Attachments

  • memory.png (17.3 KB)
  • load.png (26 KB)
  • cpu.png (16.6 KB)

arthurr

Cadet
Joined
Feb 29, 2016
Messages
9
I've got zpool status outputs in code blocks in my first post. Did you notice the graphs I attached showing memory usage, load, etc? I've included graphs from the FreeNAS reporting now too.
 

Attachments

  • memory_freenas.png (16.2 KB)
  • arc_hit_ratio.png (9.2 KB)

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I've got zpool status outputs in code blocks in my first post. Did you notice the graphs I attached showing memory usage, load, etc? I've included graphs from the FreeNAS reporting now too.
No sir, your original post is very unclear, at least to me, and apparently hugovsky.

You mention a pool that was "faulted", but show no zpool status for it.

We just want a plain
Code:
zpool status
with ALL of the output.
 

arthurr

Cadet
Joined
Feb 29, 2016
Messages
9
I see the confusion: the zpool state for all pools is "ONLINE". My first post contains the full zpool status output for every pool on this device, wrapped in code blocks. As you may have inferred, both pools were imported from Nexenta, which explains why the output looks a bit strange. When I said "faulted" I simply meant that activity on that zpool, tank, likely triggered the kernel crash. This is possibly relevant given the size and configuration of tank. Also, tank is primarily accessed via CIFS, whereas the unitrends zpool is all NFS. I'm using Samba with the following config.

Unix Extensions: true
Use Default Domain: true
Idmap backend: ad
Winbind NSS Info: rfc2307
Enable: true
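For reference, those UI options map roughly to the following smb4.conf fragment. This is only a sketch of my understanding, not the file FreeNAS actually generates; EXAMPLE is a placeholder for the real AD workgroup name.

```ini
[global]
    ; Hypothetical mapping of the UI settings above; FreeNAS's generated
    ; config may differ, and EXAMPLE stands in for the actual domain.
    unix extensions = yes
    winbind use default domain = yes
    winbind nss info = rfc2307
    idmap config EXAMPLE : backend = ad
    idmap config EXAMPLE : schema_mode = rfc2307
```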

I was having a lot of stability issues before I rebuilt the shares on the tank zpool. My theory was that it was ACL related, but I couldn't isolate the problem better than that. Prior to rebuilding the shares and ACLs it was crashing multiple times hourly... almost continuously. I set the shares to read-only and rebuilt everything to resolve that.

This was my first crash since doing that (the rebuild was about a week ago), and the first time the backtrace was small enough to capture via a screenshot of the IPMI console. I checked for core dumps or anything else useful, but couldn't find anything.

I'm guessing it was activity on tank, given the previous instability and that the unitrends pool is only active at night for backups. The unitrends pool may occasionally see activity during the day, but very little, and its ACL setup is simple: just plain ownership and mask. The tank zpool makes heavy use of NFSv4 ACLs that are looked up using winbind.

I edited my first post for clarity. I apologize for my brevity and the resulting confusion.
 

arthurr

Cadet
Joined
Feb 29, 2016
Messages
9
I was able to isolate the issue further. It appears to be the same ACL issue; I missed a share when I rebuilt things.

It crashes when a Windows client accesses a folder called Scans in the share called public on the tank zpool. Accessing that folder modifies the Thumbs.db file, which is what actually crashes it. This is confirmed by setting the share to read-only (it doesn't crash in read-only mode). I assume this issue has something to do with how Nexenta (Illumos) handles ACLs compared to FreeBSD.
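Since the crash happens when Samba updates Thumbs.db, a direct extended-attribute write against that file from the shell might reproduce it without a Windows client. This is only a speculative sketch using FreeBSD's extattr tools; the attribute name is arbitrary and I haven't confirmed this is the exact trigger.

```shell
# Speculative: smbd stores DOS metadata in extended attributes, so a
# direct extattr write may exercise the same kernel path as smbd.
# The attribute name "testattr" is arbitrary.
setextattr user testattr testvalue /mnt/tank/public/Scans/Thumbs.db

# If the box survives, inspect and clean up:
lsextattr user /mnt/tank/public/Scans/Thumbs.db
rmextattr user testattr /mnt/tank/public/Scans/Thumbs.db
```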

Code:
# zfs get all tank/public
NAME         PROPERTY              VALUE                  SOURCE
tank/public  type                  filesystem             -
tank/public  creation              Thu Dec 11 11:26 2014  -
tank/public  used                  15.8G                  -
tank/public  available             289G                   -
tank/public  referenced            15.8G                  -
tank/public  compressratio         1.06x                  -
tank/public  mounted               yes                    -
tank/public  quota                 none                   default
tank/public  reservation           none                   default
tank/public  recordsize            128K                   default
tank/public  mountpoint            /mnt/tank/public       default
tank/public  sharenfs              off                    default
tank/public  checksum              on                     default
tank/public  compression           on                     inherited from tank
tank/public  atime                 on                     default
tank/public  devices               on                     default
tank/public  exec                  on                     default
tank/public  setuid                on                     default
tank/public  readonly              off                    default
tank/public  jailed                off                    default
tank/public  snapdir               hidden                 default
tank/public  aclmode               passthrough            inherited from tank
tank/public  aclinherit            passthrough            inherited from tank
tank/public  canmount              on                     default
tank/public  xattr                 off                    temporary
tank/public  copies                1                      default
tank/public  version               5                      -
tank/public  utf8only              off                    -
tank/public  normalization         none                   -
tank/public  casesensitivity       mixed                  -
tank/public  vscan                 off                    default
tank/public  nbmand                off                    default
tank/public  sharesmb              name=public            local
tank/public  refquota              none                   default
tank/public  refreservation        none                   default
tank/public  primarycache          all                    default
tank/public  secondarycache        all                    default
tank/public  usedbysnapshots       0                      -
tank/public  usedbydataset         15.8G                  -
tank/public  usedbychildren        0                      -
tank/public  usedbyrefreservation  0                      -
tank/public  logbias               latency                default
tank/public  dedup                 off                    default
tank/public  mlslabel              -                      -
tank/public  sync                  standard               default
tank/public  refcompressratio      1.06x                  -
tank/public  written               15.8G                  -
tank/public  logicalused           16.8G                  -
tank/public  logicalreferenced     16.8G                  -
tank/public  volmode               default                default
tank/public  filesystem_limit      none                   default
tank/public  snapshot_limit        none                   default
tank/public  filesystem_count      none                   default
tank/public  snapshot_count        none                   default
tank/public  redundant_metadata    all                    default
tank/public  nms:dedup-dirty       off                    local

# getfacl /mnt/tank/public
# file: /mnt/tank/public
# owner: root
# group: wheel
group:domain admins:rwxpDdaARWcCos:fd----:allow
  everyone@:rwxp-daARWc--s:fd----:allow

# getfacl /mnt/tank/public/Scans
# file: /mnt/tank/public/Scans
# owner: 1002
# group: 10
group:domain admins:rwxpDdaARWcCos:fd----:allow
group:domain admins:rwxpDdaARWcCos:fd----:allow
  everyone@:rwxp-daARWc--s:fd----:allow

# getfacl /mnt/tank/public/Scans/Thumbs.db
# file: /mnt/tank/public/Scans/Thumbs.db
# owner: nobody
# group: nobody
group:domain admins:rwxpDdaARWcCos:------:allow
group:domain admins:rwxpDdaARWcCos:------:allow
  everyone@:rwxp-daARWc--s:------:allow
 

arthurr

Cadet
Joined
Feb 29, 2016
Messages
9
Code:
# zpool get version
NAME          PROPERTY  VALUE  SOURCE
freenas-boot  version   -      default
tank          version   28     local
unitrends     version   28     local
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
@arthurr, would you know how to create a small test pool on FreeNAS 9.3.1 that is version 28? If yes, then you could try to replicate your crash scenario in a controlled environment.
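One way to do that (a sketch using a throwaway file-backed vdev; the paths and pool name are placeholders) is to pin the on-disk version at creation time:

```shell
# Back the test pool with a sparse file instead of a real disk
truncate -s 512m /tmp/testdisk.img

# Create the pool pinned to the legacy version-28 on-disk format
zpool create -o version=28 testpool /tmp/testdisk.img

# Verify it reports version 28 rather than feature flags
zpool get version testpool

# Tear down afterwards
zpool destroy testpool
rm /tmp/testdisk.img
```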
 

arthurr

Cadet
Joined
Feb 29, 2016
Messages
9
@solarisguy, yes, but I need to figure out how to get core dumps working for that to do any good. I could also zfs send / recv the share that reproduces the issue. FYI, the crashing stops when I rebuild the share, even with zpool version 28.

This was my process for rebuilding:

1. Set old dataset to readonly
Code:
zfs set readonly=on tank/public

2. Create new dataset via FreeNAS UI with compression: on, share type: Windows, case sensitivity: sensitive, enable atime: on and dedupe: off.
3. Set ACLs
Code:
chown nobody:nobody /mnt/tank/new
setfacl -x 'owner@:rwxpD-a-R-c---:------:allow' /mnt/tank/new
setfacl -x 'group@:rwxpD-a-R-c---:------:allow' /mnt/tank/new
setfacl -m 'group:domain admins:rwxpDdaARWcCo-:fd----:allow' /mnt/tank/new

4. Set up the CIFS share
5. Copy data from the old read-only dataset to the new dataset using a Windows CIFS client
6. Correct ACLs using the Windows CIFS client, applying them recursively
7. Rename the old dataset
8. Rename the new dataset to replace the old one
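Steps 7 and 8 as commands, for completeness. This is a sketch: `tank/new` is the dataset name from step 2 above, while the `-old` suffix is just an illustrative choice.

```shell
# 7. Rename the old dataset out of the way (suffix is arbitrary)
zfs rename tank/public tank/public-old

# 8. Move the new dataset into place
zfs rename tank/new tank/public
```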
 

arthurr

Cadet
Joined
Feb 29, 2016
Messages
9
I didn't have any bare metal available for the test, and I couldn't figure out how to install FreeNAS 9.10 nightly on XenServer 6.5. I was, however, able to reproduce the issue in a VM running FreeBSD 9.3 amd64 with zpool version 28 and samba41 (samba43 too). Once I upgraded the zpool on the FreeBSD VM, it stopped crashing. I then confirmed that a zpool upgrade fixed it on the original box as well. I had been afraid to run zpool upgrade because I didn't want to lose backwards compatibility while this issue was open, but this process let me validate that a zpool upgrade does in fact resolve the issue. I also got core dumps from the FreeBSD VM. Should I just report it upstream to FreeBSD, then?
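For anyone following along, the upgrade itself is a one-liner. Note that it is one-way: once upgraded, the pool can no longer be imported by implementations that lack feature flags (such as the Nexenta box these pools came from).

```shell
# See what versions / feature flags an upgrade would enable
zpool upgrade -v

# Upgrade the pool from version 28 to the current format (irreversible)
zpool upgrade tank
```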

The backtrace from the VM seems to point more clearly at the ACL / extended-attribute path (the zfs_setextattr and extattr_set frames below).
Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x48
fault code     = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff81ac5970
stack pointer    = 0x28:0xffffff824f2301c0
frame pointer    = 0x28:0xffffff824f230250
code segment     = base 0x0, limit 0xfffff, type 0x1b
       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process     = 1223 (smbd)
trap number     = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80925736 at kdb_backtrace+0x66
#1 0xffffffff808eb2fe at panic+0x1ce
#2 0xffffffff80cd28e0 at trap_fatal+0x290
#3 0xffffffff80cd2c41 at trap_pfault+0x211
#4 0xffffffff80cd3243 at trap+0x363
#5 0xffffffff80cbc433 at calltrap+0x8
#6 0xffffffff81ad137c at zfs_freebsd_create+0x6ec
#7 0xffffffff80dd5492 at VOP_CREATE_APV+0x72
#8 0xffffffff8099a01c at vn_open_cred+0x4bc
#9 0xffffffff81ad25d8 at zfs_setextattr+0x1b8
#10 0xffffffff80dd4288 at VOP_SETEXTATTR_APV+0x78
#11 0xffffffff8097ad23 at extattr_set_vp+0x193
#12 0xffffffff8097b082 at sys_extattr_set_file+0x162
#13 0xffffffff80cd208a at amd64_syscall+0x5ea
#14 0xffffffff80cbc717 at Xfast_syscall+0xf7
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Generally we don't recommend taking pools from non-FreeNAS machines, mounting them, and trying to use them on FreeNAS. Doing so breaks a lot of assumptions that FreeNAS' middleware depends on. Also, ZFS isn't always the same everywhere (yes, I realize it is *supposed* to be the same, but anyone with years of ZFS experience across different platforms will tell you that things just go weird sometimes).

I think if you copy your data to a new zpool, created on FreeNAS, you'll find everything works just fine. :)
 