ema-freenas
Cadet
- Joined
- Mar 12, 2013
- Messages
- 6
Hi,
After observing this on two different hosts (same hardware), I suspect a driver bug was introduced when arcmsr was updated in 8.3.0-Beta3.
Here are the details:
Controller: Areca-1160ML
Disks: 16x 2TB Seagate HDD
FS: raidz1
The machine ran rock-stable for two years on FN 8.x (x<3). After a sudden crash (while storing NFS data), we decided to use the downtime to upgrade to the latest stable version of FN and its updated ZFS stack.
Since then, the machine has crashed occasionally. After changing some details in the cabling, swapping controllers, and so on to rule out hardware problems, we can now reproduce the failure almost immediately. The crash data always looks like this:
Fatal trap 12: page fault while in kernel mode
cpuid = 3, apic id = 03
fault code = supervisor read data, page not found
..
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor flags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq30: arcmsr0)
[thread pid 12 tid 100045 ]
Stopped at arcmsr_drain_donequeue+0x1b: movq 0x208(%rbx),%rdi
As this function, "arcmsr_drain_donequeue", was changed when the driver was updated to the vendor's latest version in the last FN release, I assume the root cause lies somewhere in there.
When the Areca controller is running in JBOD mode, starting some I/O (e.g. a zfs scrub) triggers the crash immediately. After switching to RAID mode with all disks in "pass-through" mode, the crash takes longer to trigger, but it still happens after a few days.
So, where do I go with this? Should I talk to Areca?
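To compare saved crash output from both hosts and both controller modes, a small script along these lines could pull out the interrupt thread and the faulting location. This is only a sketch: the function name `summarize_panic` and the regular expressions are my own assumptions, written against the output format quoted above and not against any official panic format.

```python
import re

# Patterns assume the console output format quoted in this post.
PROCESS_RE = re.compile(r"current process\s*=\s*(?P<pid>\d+)\s*\((?P<name>[^)]*)\)")
STOPPED_RE = re.compile(
    r"Stopped at\s+(?P<symbol>\w+)\+(?P<offset>0x[0-9a-fA-F]+):\s*(?P<insn>.+)"
)


def summarize_panic(text):
    """Return the process and faulting location found in a panic text."""
    summary = {}
    for line in text.splitlines():
        match = PROCESS_RE.search(line)
        if match:
            summary["pid"] = int(match.group("pid"))
            summary["process"] = match.group("name")
        match = STOPPED_RE.search(line)
        if match:
            summary["symbol"] = match.group("symbol")
            summary["offset"] = int(match.group("offset"), 16)
            summary["instruction"] = match.group("insn").strip()
    return summary


# Lines copied from the crash output above.
sample = """Fatal trap 12: page fault while in kernel mode
current process = 12 (irq30: arcmsr0)
Stopped at arcmsr_drain_donequeue+0x1b: movq 0x208(%rbx),%rdi
"""

if __name__ == "__main__":
    print(summarize_panic(sample))
```

For the trace above, this reports process "irq30: arcmsr0" and the symbol arcmsr_drain_donequeue at offset 0x1b. Running it over each saved crash text would show whether every crash really stops at the same instruction.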
Kind Regards
Marko