9.10 System crash with unexplained log

DeanNotDin · Apr 24, 2017

Hey everyone,

So after downgrading back to 9.10 from Corral I had about 2 weeks of blessed stability, until yesterday.

My server became unresponsive - file transfers stopped, jails crash, and the server was inaccessible from the web GUI.

I checked the on screen console and got the following disturbing log (picture attached)
Also, I got a volume degraded email and 3 hours after that the daily security emails. Until now, I did not reset the server and it is still offline.

Any ideas?

Sent from my ONEPLUS A3000 using Tapatalk

SweetAndLow · Apr 24, 2017

We you got an email. What did it say? Looks like a usb failure to me.

Sent from my Nexus 5X using Tapatalk

DeanNotDin · Apr 24, 2017

SweetAndLow said:
We you got an email. What did it say? Looks like a usb failure to me.

I got 3 emails, and by them it seems that it wasn't a USB failure. The humb drive is connected and active.
Also, the volume degraded alarm is something that I have been getting for a while, resolves after a reboot, and SMART status for all drives is fine. Could be a defected SATA port on the server's motherboard.

The emails that I got are:

Volume Degraded email

Code:

The volume Storage (ZFS) state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Daily security run output

Code:

freenas.local kernel log messages:
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 7c:dd:90:a5:a7:6a to 00:5a:20:31:a1:f0 on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 7c:dd:90:a5:a7:6a to 00:5a:20:31:a1:f0 on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 7c:dd:90:a5:a7:6a to 00:5a:20:31:a1:f0 on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 7c:dd:90:a5:a7:6a to 00:5a:20:31:a1:f0 on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 7c:dd:90:a5:a7:6a to 00:5a:20:31:a1:f0 on epair0b
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 524298, size: 4096
ahcich0: Timeout on slot 5 port 0
ahcich0: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd c0 serr 00000000 cmd 0004c517
(ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 524298, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 575437, size: 8192
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 524298, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 575437, size: 8192
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 524298, size: 4096
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD40EFRX-68WT0N0 80.00A80> s/n WD-WCC4E0701422 detached
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=209502208, length=8192)]
swap_pager: I/O error - pagein failed; blkno 575437,size 8192, error 6
vm_fault: pager read error, pid 1132 (devd)
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=36864, length=4096)]
swap_pager: I/O error - pagein failed; blkno 524298,size 4096, error 6
vm_fault: pager read error, pid 2524 (cnid_metad)
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=73830400, length=8192)]
swap_pager: I/O error - pagein failed; blkno 542314,size 8192, error 6
vm_fault: pager read error, pid 1132 (devd)
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=73830400, length=8192)]
swap_pager: I/O error - pagein failed; blkno 542314,size 8192, error 6
vm_fault: pager read error, pid 1132 (devd)
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=32493568, length=4096)]
swap_pager: I/O error - pagein failed; blkno 532222,size 4096, error 6
vm_fault: pager read error, pid 1132 (devd)
Failed to write core file for process devd (error 14)
pid 1132 (devd), uid 0: exited on signal 11
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=96215040, length=4096)]
swap_pager: I/O error - pagein failed; blkno 547779,size 4096, error 6
vm_fault: pager read error, pid 2524 (cnid_metad)
Failed to write core file for process cnid_metad (error 14)
pid 2524 (cnid_metad), uid 0: exited on signal 11
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=211886080, length=4096)]
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 79357 (afpd)
pid 79357 (afpd), uid 1001: exited on signal 6
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=115728384, length=8192)]
swap_pager: I/O error - pagein failed; blkno 552543,size 8192, error 6
vm_fault: pager read error, pid 8035 (zfsd)
GEOM_ELI: g_eli_read_done() failed (error=6) ada0p1.eli[READ(offset=33107968, length=8192)]
swap_pager: I/O error - pagein failed; blkno 532372,size 8192, error 6
vm_fault: pager read error, pid 2502 (netatalk)
GEOM_ELI: Device ada0p1.eli destroyed.
GEOM_ELI: Detached ada0p1.eli on last close.
swap_pager: I/O error - pagein failed; blkno 542723,size 4096, error 6
vm_fault: pager read error, pid 2502 (netatalk)
Failed to write core file for process netatalk (error 14)
pid 2502 (netatalk), uid 0: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 552569,size 4096, error 6
vm_fault: pager read error, pid 8035 (zfsd)
Failed to write core file for process zfsd (error 14)
pid 8035 (zfsd), uid 0: exited on signal 11
(ada0:ahcich0:0:0:0): Periph destroyed
swap_pager: I/O error - pagein failed; blkno 566473,size 20480, error 6
vm_fault: pager read error, pid 15601 (transmission-daemon)
pid 15601 (transmission-daemon), uid 921: exited on signal 11
(aprobe0:ahcich0:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: ATA Status Error
(aprobe0:ahcich0:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
(aprobe0:ahcich0:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
(aprobe0:ahcich0:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: ATA Status Error
(aprobe0:ahcich0:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
(aprobe0:ahcich0:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96534 (afpd)
pid 96534 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96560 (afpd)
pid 96560 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96561 (afpd)
pid 96561 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 551069,size 4096, error 6
vm_fault: pager read error, pid 2847 (collectd)
swap_pager: I/O error - pagein failed; blkno 539389,size 8192, error 6
vm_fault: pager read error, pid 2847 (collectd)
Failed to write core file for process collectd (error 14)
pid 2847 (collectd), uid 0: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96562 (afpd)
pid 96562 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 548186,size 8192, error 6
vm_fault: pager read error, pid 2545 (python2.7)
pid 2545 (python2.7), uid 0: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96579 (afpd)
pid 96579 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96580 (afpd)
pid 96580 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96581 (afpd)
pid 96581 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96582 (afpd)
pid 96582 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96583 (afpd)
pid 96583 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96584 (afpd)
pid 96584 (afpd), uid 1001: exited on signal 6
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
swap_pager: I/O error - pagein failed; blkno 542673,size 16384, error 6
vm_fault: pager read error, pid 2267 (smartd)
swap_pager: I/O error - pagein failed; blkno 542668,size 20480, error 6
vm_fault: pager read error, pid 2267 (smartd)
Failed to write core file for process smartd (error 14)
pid 2267 (smartd), uid 0: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96890 (afpd)
pid 96890 (afpd), uid 1001: exited on signal 6
swap_pager: I/O error - pagein failed; blkno 525281,size 40960, error 6
vm_fault: pager read error, pid 196 (python2.7)
swap_pager: I/O error - pagein failed; blkno 544436,size 32768, error 6
vm_fault: pager read error, pid 196 (python2.7)
Failed to write core file for process python2.7 (error 14)
pid 196 (python2.7), uid 0: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 542206,size 12288, error 6
vm_fault: pager read error, pid 194 (daemon)
swap_pager: I/O error - pagein failed; blkno 542206,size 12288, error 6
vm_fault: pager read error, pid 194 (daemon)
swap_pager: I/O error - pagein failed; blkno 542206,size 12288, error 6
vm_fault: pager read error, pid 194 (daemon)
swap_pager: I/O error - pagein failed; blkno 542206,size 12288, error 6
vm_fault: pager read error, pid 194 (daemon)
swap_pager: I/O error - pagein failed; blkno 542175,size 8192, error 6
vm_fault: pager read error, pid 194 (daemon)
Failed to write core file for process daemon (error 14)
pid 194 (daemon), uid 0: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 534273,size 40960, error 6
vm_fault: pager read error, pid 2568 (python2.7)
swap_pager: I/O error - pagein failed; blkno 542744,size 8192, error 6
vm_fault: pager read error, pid 2568 (python2.7)
Failed to write core file for process python2.7 (error 14)
pid 2568 (python2.7), uid 0: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 576153,size 4096, error 6
vm_fault: pager read error, pid 24518 (nginx)
pid 24518 (nginx), uid 80: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 567547,size 8192, error 6
vm_fault: pager read error, pid 2489 (nginx)
swap_pager: I/O error - pagein failed; blkno 567547,size 8192, error 6
vm_fault: pager read error, pid 2489 (nginx)
swap_pager: I/O error - pagein failed; blkno 567547,size 8192, error 6
vm_fault: pager read error, pid 2489 (nginx)
swap_pager: I/O error - pagein failed; blkno 574999,size 4096, error 6
vm_fault: pager read error, pid 2489 (nginx)
Failed to write core file for process nginx (error 14)
pid 2489 (nginx), uid 0: exited on signal 4
swap_pager: I/O error - pagein failed; blkno 524609,size 4096, error 6
vm_fault: pager read error, pid 15291 (Plex Media Server)
pid 15291 (Plex Media Server), uid 972: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 557708,size 32768, error 6
vm_fault: pager read error, pid 16255 (Plex Script Host)
pid 16255 (Plex Script Host), uid 972: exited on signal 11
swap_pager: I/O error - pagein failed; blkno 576019,size 4096, error 6
vm_fault: pager read error, pid 96914 (afpd)
pid 96914 (afpd), uid 1001: exited on signal 6
arp: 192.168.1.47 moved from 00:5a:20:31:a1:f0 to 7c:dd:90:a5:a7:6a on epair0b
arp: 192.168.1.89 moved from 7c:dd:90:a5:a7:6a to 00:5a:20:31:a1:f0 on epair0b

-- End of security output --

Daily run output

Code:

Checking status of zfs pools:
NAME  SIZE  ALLOC  FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
Storage  10.9T  3.87T  7.01T  -  19%  35%  1.00x  DEGRADED  /mnt
freenas-boot  29.5G  654M  28.9G  -  -  2%  1.00x  ONLINE  -

pool: Storage
state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
scan: resilvered 256G in 1h13m with 0 errors on Fri Apr 14 19:32:48 2017
config:

	NAME  STATE  READ WRITE CKSUM
	Storage  DEGRADED  0  0  0
	 raidz1-0  DEGRADED  0  0  0
	 gptid/d456cbba-20ff-11e7-b130-e06995776367  ONLINE  0  0  0
	 4195179654383725630  REMOVED  0  0  0  was /dev/gptid/47194161-2100-11e7-b130-e06995776367
	 gptid/48242829-2100-11e7-b130-e06995776367  ONLINE  0  0  0

errors: No known data errors

-- End of daily output --

William Grzybowski · Apr 24, 2017

Sounds like it was using swap in one of the disks that died.

DeanNotDin · Apr 24, 2017

William Grzybowski said:
Sounds like it was using swap in one of the disks that died.

I didn't know that FreeNAS stores swap on the raid drives, I mean - I have 6gb of RAM and a 32gb flash drive with only ~600MB used, it would be more efficient to store the swap on the usb drive and not on the RAID drives.

Ericloewe · Apr 24, 2017

What's the plan for reliable swapping in FreeNAS 11?

DeanNotDin · Apr 24, 2017

Oh wow, I had no idea it was that stupid.
So solving the disconnecting drive is my issue then.
Replaced ports and cables, and the disconnecting drive is almost random, could it be a PS issue?

Sent from my ONEPLUS A3000 using Tapatalk

SweetAndLow · Apr 24, 2017

Your system does not meet the minimum specs. Fix that before asking any more questions.

You had a drive fail and because you have an inadequate system it was using swap which should never happen on a healthy system. When the swap disappeared because of the disk failure system process are lost, causing the error message.

Sent from my Nexus 5X using Tapatalk

gpsguy · Apr 24, 2017

6GB is still below the 8GB minimum for FreeNAS.

Swapping to USB would really kill performance. FreeNAS puts a swap partition on each hard disk to facilitate swapping hard disks when the disk sizes aren't exactly the same size.

DeanNotDin said:
I didn't know that FreeNAS stores swap on the raid drives, I mean - I have 6gb of RAM and a 32gb flash drive with only ~600MB used, it would be more efficient to store the swap on the usb drive and not on the RAID drives.

danb35 · Apr 24, 2017

DeanNotDin said:
I didn't know that FreeNAS stores swap on the raid drives

Sounds like someone needs to RTFM.

DeanNotDin said:
I have 6gb of RAM

As already noted, this doesn't meet the minimum system requirements for FreeNAS.

DeanNotDin said:
it would be more efficient to store the swap on the usb drive and not on the RAID drives.

Try thinking about this some more. USB drives are notoriously unreliable, they're usually slower than spinning rust, and they're much smaller in capacity.

William Grzybowski · Apr 25, 2017

Ericloewe said:
What's the plan for reliable swapping in FreeNAS 11?

We will be mirroring the data disk swaps but that won't come until 11.1

Ericloewe · Apr 25, 2017

William Grzybowski said:
We will be mirroring the data disk swaps but that won't come until 11.1

Ok, cool.

DeanNotDin · Apr 25, 2017

gpsguy said:
6GB is still below the 8GB minimum for FreeNAS.

SweetAndLow said:

Your system does not meet the minimum specs. Fix that before asking any more questions.

Click to expand...

I remember that 9.10 suggested a 4gb of RAM as a minimum requirement.
Anyway, I installed another 2GB stick and now have a total of 8GB RAM, And the problem still persists.

Now I'm getting every couple of minutes an email with an error:

Code:

CRITICAL: April 25, 2017, 6:41 p.m. - The volume Storage (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

I replaced the SATA and MOLEX cables, and changed the disks SATA ports. The issue still persists.

Any suggestions?

Ericloewe · Apr 25, 2017

DeanNotDin said:
I remember that 9.10 suggested a 4gb of RAM as a minimum requirement.

That is not true and and never was. The absolute minimum has always been 8GB since 9.3.

DeanNotDin · Apr 25, 2017

Ericloewe said:
That is not true and and never was. The absolute minimum has always been 8GB since 9.3.

Sorry then, its my bad. Now I have 8gb of RAM and I erased 2 out of 4 jails.

Vito Reiter · Apr 25, 2017

DeanNotDin said:
Sorry then, its my bad. Now I have 8gb of RAM and I erased 2 out of 4 jails.

Your failure was likely CAUSED by the lack of RAM in past-tense. You can't just add 2GB and expect it to work, just like you can't take a bullet out of your foot and expect it to be as if you were never shot. Your best bet is to try and repair the boot volume with the 8GB of RAM, and remount the volume. Also, see if that drive connects with any other drive on the same port, if another drive works on the same port then that disk has failed. Worst case scenario: Reinstall, import volume, and if disk still doesn't work replace it with a known good one and see if it picks up.

SweetAndLow · Apr 25, 2017

Tell us the rest of your hardware. I have a feeling since you didn't have minimum ram requirements that you are probably lacking in all the other areas also.

I want to know:
* motherboard
* CPU
* memory amount and brand
* psu
* chassie
* how are disks attached
* what layout are you using, zpool status output
* are you running smart tests and output from all disks for smartctl -a /dev/adaX

This should have all been posted in the first post so we can actually help you.

danb35 · Apr 25, 2017

When you hear hoofbeats in the distance, at least in North America, you should assume it's a horse, not a zebra. Your system is reporting a failed drive. The obvious conclusion is that the drive has failed--and if you've replaced the power and data cables, and tried plugging the data cable into a different SATA port, that makes the obvious conclusion that much more likely. Replace the drive, following the instructions in the manual.

The reason that RAM was relevant is that you were using swap space that was stored on the fail[ed|ing] drive, which probably wouldn't have happened if you'd had enough RAM. The corruption/loss of swap crashed the system.

Vito Reiter said:
Your best bet is to try and repair the boot volume with the 8GB of RAM

What does the boot volume have to do with anything?

Important Announcement for the TrueNAS Community.

9.10 System crash with unexplained log

DeanNotDin

Dabbler

SweetAndLow

Sweet'NASty

DeanNotDin

Dabbler

William Grzybowski

Wizard

DeanNotDin

Dabbler

Ericloewe

Server Wrangler

DeanNotDin

Dabbler

SweetAndLow

Sweet'NASty

gpsguy

Active Member

danb35

Hall of Famer

William Grzybowski

Wizard

Ericloewe

Server Wrangler

DeanNotDin

Dabbler

Attachments

Ericloewe

Server Wrangler

DeanNotDin

Dabbler

Vito Reiter

Wise in the Ways of Science

SweetAndLow

Sweet'NASty

danb35

Hall of Famer

Similar threads

Important Announcement for the TrueNAS Community.

9.10 System crash with unexplained log

Dabbler

Sweet'NASty

Dabbler

Wizard

Dabbler

Server Wrangler

Dabbler

Sweet'NASty

Active Member

Hall of Famer

Wizard

Server Wrangler

Dabbler

Attachments

Server Wrangler

Dabbler

Wise in the Ways of Science

Sweet'NASty

Hall of Famer

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "9.10 System crash with unexplained log"

Similar threads