Freenas iSCSI errors / crash

cwilliams · Jul 23, 2014

Hi, we've been running freenas for some time now and on the whole it works well. However we are continually having problems with our VMware host which is using storage from freenas via iSCSI. The symptoms are always the same, guest vm's stop working, datastore offline (ESX5.5), freenas generally unresponsive. The freenas host machine specs are below

Freenas version: 9.2.1.5
Supermicro M/B
Opteron 6272
64GB ECC Ram
IBM1015 HBA crossflashed to LSI
6 x 3TB Seagate Constellation CS SAS
1 x 128gb Samsung 840 Pro L2 arc
8 x 1GB Intel NICs on various cards

ESXi and Freenas are connected using MPIO over 3 different paths all on seperate IP ranges and physical NICs

Here is the extract from /var/log/messages

Jul 23 01:33:49 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:33:49 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:33:50 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:52 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:52 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:33:52 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:33:53 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:55 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:55 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:33:55 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:33:56 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:58 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:58 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:33:58 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:33:59 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:34:01 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:34:01 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:34:01 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:34:02 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed

Any ideas?

cyberjock · Jul 23, 2014

Am I correct in assuming your zpool isn't composed of mirrored vdevs and is instead a RAIDZ1 or RAIDZ2?

cwilliams · Jul 23, 2014

Nope, mirror'd vdevs

pool: vol1
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0 in 25h39m with 0 errors on Mon Jul 21 01:39:39 2014
config:

NAME STATE READ WRITE CKSUM
vol1 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gptid/3b16b747-08a2-11e3-8f9c-0015177de7a2 ONLINE 0 0 0
gptid/3c8396d5-08a2-11e3-8f9c-0015177de7a2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
gptid/ae20a267-356f-11e3-8fb6-0015177de7a2 ONLINE 0 0 0
gptid/5bfa12f9-08a2-11e3-8f9c-0015177de7a2 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
gptid/67978593-08a2-11e3-8f9c-0015177de7a2 ONLINE 0 0 0
gptid/68aeeab7-08a2-11e3-8f9c-0015177de7a2 ONLINE 0 0 0
cache
gptid/90fe1209-4855-11e3-b316-0015177de7a2 ONLINE 0 0 0

errors: No known data errors

diehard · Jul 23, 2014

How full is your pool?

make sure there is no swap usage in FN.

cyberjock · Jul 23, 2014

Whoa, it took 25 hours to run your last scrub!? I'd have expected it to complete MUCH faster than that.

-How full is your pool?
-Do you have SMART monitoring and SMART testing setup on your disks? If so can you post the SMART data on all of your disks?

c32767a · Jul 23, 2014

cwilliams said:
Here is the extract from /var/log/messages

Jul 23 01:33:49 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:33:49 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:33:50 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:52 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:52 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:33:52 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:33:53 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:55 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:55 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:33:55 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:33:56 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:58 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:33:58 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:33:58 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:33:59 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:34:01 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed
Jul 23 01:34:01 shfstorage01 istgt[26343]: istgt_iscsi.c:6256:istgt_create_conn: ***ERROR*** pthread_create() failed
Jul 23 01:34:01 shfstorage01 istgt[26343]: istgt.c:2480:istgt_acceptor: ***ERROR*** istgt_create_conn() failed
Jul 23 01:34:02 shfstorage01 istgt[26343]: istgt_iscsi.c:5551:worker: ***ERROR*** pthread_create() failed

Any ideas?

Those errors point to the inability to create a new thread. Probably due to RAM or other resource starvation.
Can you get on the machine when it's happening? What's CPU, RAM and network usage like?

Perhaps this is relevant:

http://forums.freenas.org/index.php?threads/iscsi-istgt-pthread_create-error.7252/

cwilliams · Jul 24, 2014

cyberjock said:
Whoa, it took 25 hours to run your last scrub!? I'd have expected it to complete MUCH faster than that.

-How full is your pool?
-Do you have SMART monitoring and SMART testing setup on your disks? If so can you post the SMART data on all of your disks?

~# zpool get all
NAME PROPERTY VALUE SOURCE
vol1 size 8.16T -
vol1 capacity 77% -
vol1 altroot /mnt local
vol1 health ONLINE -
vol1 guid 11481643397605389947 default
vol1 version - default
vol1 bootfs - default
vol1 delegation on default
vol1 autoreplace off default
vol1 cachefile /data/zfs/zpool.cache local
vol1 failmode continue local
vol1 listsnapshots off default
vol1 autoexpand on local
vol1 dedupditto 0 default
vol1 dedupratio 1.00x -
vol1 free 1.83T -
vol1 allocated 6.32T -
vol1 readonly off -
vol1 comment - default
vol1 expandsize 0 -
vol1 freeing 0 default
vol1 feature@async_destroy enabled local
vol1 feature@empty_bpobj active local
vol1 feature@lz4_compress active local
vol1 feature@multi_vdev_crash_dump disabled local
vol1 feature@spacemap_histogram disabled local
vol1 feature@enabled_txg disabled local
vol1 feature@hole_birth disabled local
vol1 feature@extensible_dataset disabled local
vol1 feature@bookmarks disabled local

Smart attached, was wrong about the disks, we are running normal barracuda's, except for one which failed which was replaced with Constellation CS's, all SATA not SAS. We had planned to replace them all with Constellation's but other things got in the way

cwilliams · Jul 24, 2014

c32767a said:
Those errors point to the inability to create a new thread. Probably due to RAM or other resource starvation.
Can you get on the machine when it's happening? What's CPU, RAM and network usage like?

So occasionally SSH will work, mostly have to go to it physically and access console. When I have managed to get onto it I don't recall seeing anything weird from top but I'm not an expert with *nix OS's, will do some further analyse next time it happens, probably about 5-6 days from now

Thanks for all the responses so far

cyberjock · Jul 24, 2014

So you are confusing to me. The first disk has never had SMART tests done on it. The second, third, and fourth disk has SMART short tests every 2 hours (yikes!). The fifth disk has had one test done but it was last done more than a year ago. The sixth disk had a bunch of tests done early in the disk's life, but they again haven't been run in about a year.

Fortunately (or unfortunately depending on how you look at it) those are no indicator of the problem.

The CPU you are using, Opteron 6272, is that one of the AMD CPUs that had that bug that forced AMD to do a recall? I can't remember the name of the bug or I'd Googe it. :(

cwilliams · Jul 24, 2014

Sorry, didn't even realise that the results weren't consistent. According to the GUI smart should be running on all drives every 30minutes! Clearly that isn't the case.

As for the CPU, the only bug I know of affected 6100 series opterons and not newer 6200 or 6300's

cwilliams · Jul 28, 2014

So issue occured again this morning. Had to hard reset it, couldn't reboot from console.

/var/log/messages at time of issue looks like this

Jul 27 23:55:01 shfstorage01 mountd[3915]: mount request succeeded from x.x.x.x for /mnt/vol1/vvv
Jul 27 23:55:33 shfstorage01 mountd[3915]: umount request succeeded from x.x.x.x for /mnt/vol1/vvv
Jul 28 00:14:07 shfstorage01 mountd[3915]: umount request succeeded from x.x.x.x for /mnt/vol1/aaa
Jul 28 00:14:08 shfstorage01 mountd[3915]: mount request succeeded from x.x.x.x for /mnt/vol1/aaa
Jul 28 08:53:23 shfstorage01 syslogd: kernel boot file is /boot/kernel/kernel
Jul 28 08:53:23 shfstorage01 kernel: Copyright (c) 1992-2013 The FreeBSD Project.
Jul 28 08:53:23 shfstorage01 kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

Console was full of iscsi errors again, guessing they didn't get committed to the log. Any ideas anyone? The only common occurance both times recently is that its fairly shortly after several machines back up via NFS to freenas but these are daily jobs in cron, crashes are weekly Most are small <10mb backups but one is 40gb or so. I will start backing them up elsewhere just incase they are somehow contributing

cyberjock · Jul 28, 2014

I can't point you to the exact problem, but I can say this. Freezing is almost always a hardware failure. I tend to think it's not a bad disk because bad disks shouldn't cause this kind of problem. This sounds like a CPU, RAM or motherboard problem.

Unfortunately I can't help much more than that. I know AMD isn't providing code to FreeBSD like they used to, so this may be some anomaly that just hasn't been fixed. If the system ran fine for a while without an OS update or anything and then this suddenly started it's probably a hardware failure and not a code problem. If you have spare parts now might be a time to start swapping them and see if the problem persists.

Good luck!

cwilliams · Jul 28, 2014

Thanks for the reply, I also am at a loss. We have plenty of spare motherboards and CPU's so I will swap them out when I can get the downtime. I have a feeling that its something I/O related however. We had a couple of NFS backups, a couple of scheduled snapshots, and several of the VM's all backing up at the same time. We've staggered these to see if it helps. VMware is obviously very I/O intensive and since all the I/O is on the same array it might be having unforeseen circumstances in Freenas, I guess we shall see

cyberjock · Jul 28, 2014

I've seen pools that sit at >90% of their max for weeks on end. Some of iXsystem's customers buy the cheapest server they can and max out the server.

So I seriously doubt its a FreeNAS specific issue.

Is your firmware version on your M1015 the same version as the driver version? (P16 for 9.2.1.x)

cwilliams · Jul 30, 2014

So I don't know how to tell which card is which but I suspect the cross-flashed M1015 is the 2nd one (the 1st being a LSI 9212-4i)

mps0: Firmware: 17.00.01.00, Driver: 16.00.00.00-fbsd
mps1: Firmware: 15.00.00.00, Driver: 16.00.00.00-fbsd

So it would appear firmware is 1 version out, I will get it updated

cyberjock · Jul 30, 2014

I believe if you do sas2flash -listall will tell you which is which. ;)

But to be honest, both should be on v16 since that's the driver version.

Important Announcement for the TrueNAS Community.

Freenas iSCSI errors / crash

cwilliams

Dabbler

cyberjock

Inactive Account

cwilliams

Dabbler

diehard

Contributor

cyberjock

Inactive Account

c32767a

Patron

cwilliams

Dabbler

Attachments

cwilliams

Dabbler

cyberjock

Inactive Account

cwilliams

Dabbler

cwilliams

Dabbler

cyberjock

Inactive Account

cwilliams

Dabbler

cyberjock

Inactive Account

cwilliams

Dabbler

cyberjock

Inactive Account

Similar threads