deadlock (Cadet) · Joined: Oct 2, 2011 · Messages: 5
Hi all,
I just switched from FreeNAS 7 to FreeNAS 8, which also meant a switch from softraid5 to ZFS.
At the same time I bought a new server from SuperMicro.
My server has an Intel Core i7 (4 cores) at 2533 MHz.
I have 16 GB (4x4 GB) of ECC memory (Kingston 65525-008.A00LF).
I have 4x Seagate Barracuda Green 2 TB 5900 RPM SATA/600 64 MB drives, which together form a RAIDZ pool.
The dmesg output is attached to this message for details.
Now to the problem!
The first few hours the pool worked perfectly. I set up offsite replication via zfs send & zfs receive, stored around 1.5 TB of data, shared it over CIFS, AFP, and NFS, and booted some VMware virtual machines over NFS. Everything was working great!
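For reference, the replication is just the standard zfs send | zfs receive pipeline over SSH; the pool, dataset, and host names below are placeholders, not my actual setup:

```shell
# Take a snapshot and replicate it to the offsite box.
# "tank/data", "backuphost", and "backup/data" are placeholder names.
zfs snapshot tank/data@backup-2011-10-12
zfs send tank/data@backup-2011-10-12 | ssh backuphost zfs receive -F backup/data

# Later snapshots only need to send the delta (-i = incremental):
zfs snapshot tank/data@backup-2011-10-13
zfs send -i tank/data@backup-2011-10-12 tank/data@backup-2011-10-13 \
  | ssh backuphost zfs receive backup/data
```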
OK, NFS write times were a bit slow and VMware sometimes got timeouts over NFS (I know I should put the ZIL on a separate SSD to speed it up, since VMware issues synchronous writes over NFS, so this is solvable).
However, from dmesg I could see that I was also starting to get timeouts on disk #4, e.g.:
ahcich3: Timeout on slot 30 port 0
ahcich3: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr 00000000
ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich3: Timeout on slot 30 port 0
ahcich3: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd 80 serr 00000000
ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich3: Timeout on slot 30 port 0
ahcich3: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd 80 serr 00000000
Soon after that I lost disk #4:
(ada3:ahcich3:0:0:0): lost device
I rebooted the system, but the drive didn't even show up in the BIOS. After I shut the system down completely (pulling the power plug) and restarted, the disk reconnected again.
Running zpool status I could see that I now had some checksum errors, and with -v I realized that I had lost some files. How can this be? Shouldn't it be redundant? One disk failed, but I still had 3 disks left that should contain the correct data.
Anyway, I put the failing disk in a USB cradle and ran SeaTools on it. No errors were reported. I ran the bad sector scan in SeaTools once again. No errors! (about 16 hours runtime)
I bought a replacement disk anyway just to be safe. I put it in the server today and issued a zpool replace. It rebuilt most of the disk overnight. This morning it was stuck at 100%. I guess there was some finishing up to do after it reached 100%...
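For completeness, the replace went roughly like this (ada3 matches the device from my dmesg; the pool name "tank" and the new device "ada4" are placeholders):

```shell
# Show pool health; -v lists files affected by checksum errors.
zpool status -v tank

# Replace the suspect disk (ada3) with the new one (ada4 is a placeholder).
zpool replace tank ada3 ada4

# Re-run to watch resilver progress until it completes.
zpool status tank
```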
An hour later. KERNEL PANIC! :/
(See attached screenshot zfs panic 2011-10-12.png)
I rebooted the server, and once again after 5 minutes of runtime I got another kernel panic (same text). I now suspect that the metadata in the ZFS pool has been corrupted.
I have now rebooted the server once again, booted memtest86, and am checking the memory in ECC mode. ECC memory should never have any problems, right? So I don't know if this will do any good. (memtest86 is in progress as I write.)
I am out of ideas now. Completely stuck. Can anyone suggest anything for me to investigate further?
Kind regards
Jens