Server Crashing: not sure why

Bilco · Dec 21, 2013

Hi All,

Im hoping someone might be able to give me a hand. My server started crashing, I believe this is due to a bad disk. It seems the server goes into a kernel panic during a normal boot. If I pull the drive I think is bad, the server will boot up but then I can not mount the Volume on it. If I boot up with out the disk and place and try to mount the volume, I get another crash.

Would someone be able to help and help me shed some light on this? I've exhausted my search and not sure how to proceed anymore.

My setup:

I have 8 total which I split up into 2x4 disk raids. One volume is called Vol_1 (the problem) and the other is VOL_2.

[root@intersect] ~# cat /etc/version
FreeNAS-9.1.1-RELEASE-x64 (a752d35)

[root@intersect] ~# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
VOL_2 7.25T 5.01T 2.24T 69% 1.00x ONLINE /mnt

[root@intersect] ~# zpool status
pool: VOL_2
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub in progress since Sat Dec 21 09:19:54 2013
261G scanned out of 5.01T at 70.6M/s, 19h37m to go
0 repaired, 5.09% done
config:

NAME STATE READ WRITE CKSUM
VOL_2 ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
gptid/8d1f4eb1-54a6-11e3-a70b-5404a6497364 ONLINE 0 0 0
gptid/8ec1a62b-54a6-11e3-a70b-5404a6497364 ONLINE 0 0 0
gptid/9059a0dc-54a6-11e3-a70b-5404a6497364 ONLINE 0 0 0
gptid/912caa24-54a6-11e3-a70b-5404a6497364 ONLINE 0 0 0

errors: 2 data errors, use '-v' for a list

[root@intersect] ~# camcontrol devlist
<ATA WDC WD20EZRX-00D 0A80> at scbus0 target 0 lun 0 (da0,pass0)
<ATA WDC WD20EARS-00J 0A80> at scbus0 target 1 lun 0 (da1,pass1)
<ATA WDC WD20EZRX-00D 0A80> at scbus0 target 3 lun 0 (da2,pass2)
<ATA WDC WD30EZRX-00M 0A80> at scbus0 target 4 lun 0 (da3,pass3)
<ATA WDC WD30EZRX-00M 0A80> at scbus0 target 5 lun 0 (da4,pass4)
<ATA WDC WD30EZRX-00M 0A80> at scbus0 target 6 lun 0 (pass8,da8)
<ATA WDC WD30EZRX-00M 0A80> at scbus0 target 7 lun 0 (pass9,da9)
<ATA WDC WD20EFRX-68E 0A80> at scbus0 target 8 lun 0 (da5,pass5)
< > at scbus12 target 0 lun 0 (da6,pass6)
<Generic- USB3.0 CRW -1 1.00> at scbus12 target 0 lun 1 (da7,pass7)

[root@intersect] ~# glabel status
Name Status Components
gptid/8d1f4eb1-54a6-11e3-a70b-5404a6497364 N/A da0p2
gptid/8ec1a62b-54a6-11e3-a70b-5404a6497364 N/A da1p2
gptid/9059a0dc-54a6-11e3-a70b-5404a6497364 N/A da2p2
gptid/b374bcae-5bcb-11e2-b965-5404a6497364 N/A da3p2
gptid/b2ae06a0-5bcb-11e2-b965-5404a6497364 N/A da4p2
gptid/90ed4dcf-54a6-11e3-a70b-5404a6497364 N/A da5p1
gptid/912caa24-54a6-11e3-a70b-5404a6497364 N/A da5p2
ufs/FreeNASs3 N/A da7s3
ufs/FreeNASs4 N/A da7s4
ufs/FreeNASs1a N/A da7s1a
gptid/b1d4f55f-5bcb-11e2-b965-5404a6497364 N/A da8p1
gptid/b1ec9c06-5bcb-11e2-b965-5404a6497364 N/A da8p2
gptid/b110613b-5bcb-11e2-b965-5404a6497364 N/A da9p1
gptid/b1299abc-5bcb-11e2-b965-5404a6497364 N/A da9p2

[root@intersect] ~# gpart show
=> 34 3907029101 da0 GPT (1.8T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 3902834696 2 freebsd-zfs (1.8T)
3907029128 7 - free - (3.5k)

=> 34 3907029101 da1 GPT (1.8T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 3902834696 2 freebsd-zfs (1.8T)
3907029128 7 - free - (3.5k)

=> 34 3907029101 da2 GPT (1.8T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 3902834696 2 freebsd-zfs (1.8T)
3907029128 7 - free - (3.5k)

=> 34 5860533101 da3 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)

=> 34 5860533101 da4 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)

=> 34 3907029101 da5 GPT (1.8T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 3902834696 2 freebsd-zfs (1.8T)
3907029128 7 - free - (3.5k)

=> 63 15564737 da7 MBR (7.4G)
63 1930257 1 freebsd [active] (942M)
1930320 63 - free - (31k)
1930383 1930257 2 freebsd (942M)
3860640 3024 3 freebsd (1.5M)
3863664 41328 4 freebsd (20M)
3904992 11659808 - free - (5.6G)

=> 0 1930257 da7s1 BSD (942M)
0 16 - free - (8.0k)
16 1930241 1 !0 (942M)

=> 34 5860533101 da8 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)

=> 34 5860533101 da9 GPT (2.7T)
34 94 - free - (47k)
128 4194304 1 freebsd-swap (2.0G)
4194432 5856338703 2 freebsd-zfs (2.7T)

[root@intersect] ~# zpool import -f
pool: Vol_1
id: 162564732152805204
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

Vol_1 ONLINE
raidz1-0 ONLINE
gptid/b1299abc-5bcb-11e2-b965-5404a6497364 ONLINE
gptid/b1ec9c06-5bcb-11e2-b965-5404a6497364 ONLINE
gptid/b2ae06a0-5bcb-11e2-b965-5404a6497364 ONLINE
gptid/b374bcae-5bcb-11e2-b965-5404a6497364 ONLINE

**** Crash message below ****

Mounting local file systems:.
cannot import '162564732152805204': no such pool available
(da2:mps0:0:3:0):READ(10).CDB2800e8e0860000010000
(da2:mps0:0:3:0): CAM status: SCSI Status Error
(da2:mps0:0:3:0): SCSI status: Check Condition
(da2:mps0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da2:mps0:0:3:0): Info: 0xe8e08688
(da2:mps0:0:3:0): Error 5, Unretrable error

Fatal Trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address =0x40
fault code = supervisor read data, page not present
instruction pointer= 0x20:0xffffffff8167ddc1
stack pointer= 0x28:0xffffffff8919928f00
frame point= 0x28:0xffffffff89119s8f30
code segment= base 0x0 limit 0xfffff, type 0x1b
processor eflags= interrupt enabled, resume, IOPL = 0
current process= 197 (txg_thread_enter)
[ thread pid 197 tid 100638 ]
stopped atvdev_is_dead+0x1:cmpq$ox5,ox40(%rdi)

Any and all help is appreciated.

Thanks.

JakeHiltz · Dec 22, 2013

How much ram do you have ? ECC or non-ECC?

Bilco · Dec 22, 2013

32 Gigs Non ECC.

I have tried pulling some out in various configurations to check for bad ram.

It will boot fine as long as that one drive is removed.

DJ9 · Dec 22, 2013

Run memtest. http://www.memtest.org/

cyberjock · Dec 22, 2013

You are making a big mistake using non-ECC RAM.

jgreco · Dec 24, 2013

stopped at vdev_is_dead hints at unrecoverable pool damage. It may still be possible to recover the data from the pool if vital, there was someone with what looks to be a similar problem a year or two ago on freebsd-fs. But generally the fix is:

1) Use ECC RAM (suspect anytime there is pool corruption evident on non-ECC systems)
2) Strongly suggest RAIDZ2 when you rebuild your pool
3) Then restore the data from backup

Important Announcement for the TrueNAS Community.

Server Crashing: not sure why

Bilco

Cadet

JakeHiltz

Dabbler

Bilco

Cadet

DJ9

Contributor

cyberjock

Inactive Account

jgreco

Resident Grinch

Similar threads

Important Announcement for the TrueNAS Community.

Server Crashing: not sure why

Bilco

Cadet

JakeHiltz

Dabbler

Bilco

Cadet

DJ9

Contributor

cyberjock

Inactive Account

jgreco

Resident Grinch

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Server Crashing: not sure why"

Similar threads