Alan Johnson
Dabbler · Joined Jul 2, 2013 · Messages: 12
Let it be known that I have very little experience with FreeNAS and FreeBSD. I am very solid with Linux and know my way around iSCSI fairly well. I searched and read quite a bit on the forums and in the manual, but have not found anything yet. My apologies if I missed something I should have caught. On to the problem...
Last week, I put FreeNAS 8.3.1-p2 (64-bit) on a new Dell R515 (bb01: 6 AMD cores, 64 GB RAM, 12× 4 TB SATA disks + 2× 512 GB SSDs). (My first FreeBSD install in years.) It was working great for about 5 days until our oVirt hosts (7× CentOS 6 blades running the open-iscsi initiator service) all reported they could no longer communicate with the target, about 2 days after I set it up and pointed them all at it. This happened after 11 PM, so it is very unlikely anyone was in there mucking with it. I believe I am the only one who has logged in, and I am very confident that I am the only one making changes.
After quite a bit of troubleshooting, I found that the istgt service was hung on bb01. The first symptom: when I tried to stop the service in the WUI, it just sat there with the spinner spinning and never completed. If I closed the services control tab and went back in, it would show iSCSI as off. When I clicked it again, it just spun again, until I closed and reopened the tab to find the indicator set back to on.
When I first configured the service, I took my best guess at making the Target Global Configuration parameters more robust, based on the basic info in the manual. I could not find more useful documentation on these settings, particularly on what the corresponding defaults are in open-iscsi on the initiators (as the manual suggests checking). I can upload a screenshot of my settings if anyone thinks it's relevant. So, I set them all back to their defaults, hoping that might help the service start. I tried starting it first after turning off LUC, then again after setting the rest back to defaults, and I got the same behavior from the Control Services iSCSI switch both times.
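For reference, the WUI fields end up in istgt's config file, so the effective values can be read there directly. A sketch from memory of the relevant part of /usr/local/etc/istgt/istgt.conf — the values shown are just the stock sample values, not a recommendation; check your own file for the authoritative settings:

```
# /usr/local/etc/istgt/istgt.conf (excerpt, values from the stock sample)
[Global]
  NodeBase "iqn.2007-09.jp.ne.peach.istgt"
  MaxSessions 16        # upper bound on concurrent sessions across targets
  MaxConnections 4      # connections allowed per session
```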
After some more fumbling in the WUI, I found nothing relevant in the Reports graphs or the messages log, so I went to the command line to try to find other logs. I looked again at /var/log/messages and found very little, and nothing related other than... (Dang it. Apparently messages is cleared at reboot and no old copies are kept, so I can't share the exact line. I'll have to set up a log archiver. Anyway...) ...other than an indication that the Control Services iSCSI switch was not going to turn istgt off because it was already marked as disabled, or something like that. I grepped /var/log/* for iscsi and istgt but found nothing.
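On the log-archiver note: FreeNAS keeps /var/log on a memory disk, which is why messages vanishes at reboot. One way to keep history is to forward syslog to another machine. A minimal sketch, assuming a host named `loghost` (a placeholder) running a syslog daemon configured to accept remote messages:

```
# /etc/syslog.conf fragment — forward everything to a remote collector.
# "loghost" is a placeholder; substitute your log server's name or IP.
*.*    @loghost
```

If your build exposes a "Syslog server" field in the WUI's settings, that amounts to the same thing and survives config regeneration.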
Finally, I grepped `ps uax` output for iscsi and istgt. I found the "/usr/local/bin/istgt -c /usr/local/etc/istgt/istgt.conf" process running even though Control Services indicated iSCSI was off. I also found 8 pairs of processes that were associated with the WUI trying to forcestop istgt. I toggled the iSCSI switch in Control Services again, and it just added to the list of forcestop processes.
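As an aside, the character-class trick saves filtering grep out of its own results when hunting through ps output. A sketch, demonstrated on a throwaway sleep since istgt won't be running everywhere:

```shell
#!/bin/sh
# Start a disposable process to search for.
sleep 300 &
pid=$!

# "[s]leep 300" matches the sleep process but not the grep process itself:
# grep's own argv contains the brackets, which the pattern does not match.
ps auxww | grep '[s]leep 300'

# On the FreeNAS box the equivalent would be:  ps auxww | grep '[i]stgt'
kill "$pid"   # clean up the demo process
```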
So, I tried the implied command of those pairs, `service istgt forcestop`. Unfortunately I didn't capture the output, but it basically said something similar to the log entries mentioned above: "nah, it is already marked off". I tried `status` and `stop` as well, with similar output. I tried a plain `kill` on the istgt PID and it stayed alive. I tried `kill -9` on it and still it would not die.
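For completeness, `kill` takes a PID rather than a name (`pkill` matches by name), and the usual escalation is TERM first, then KILL. A sketch of the mechanics on a throwaway process; none of this explains why istgt ignored SIGKILL:

```shell
#!/bin/sh
# Stand-in for the stuck daemon.
sleep 300 &
pid=$!

kill "$pid"                # polite SIGTERM first (same as kill -15)
wait "$pid" 2>/dev/null
echo "exit status: $?"     # 128 + signal number, so 143 means TERM worked

# Had TERM been ignored, the next step would be SIGKILL:
#   kill -9 "$pid"         # or by name: pkill -9 istgt
```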
Finally, I bit the bullet and rebooted, only to find shutdown hung waiting for processes to die; I assume it was istgt. I gave it many minutes before I power-cycled the box. When it finished booting, iSCSI seemed fine. oVirt happily reactivated the iSCSI storage (all 7 initiators reconnected with no problem), and the virtual drives (Linux logical volumes (LVM) under the hood) that had been created there were still there and happy.
Through all this (except during the reboot, of course), everything else seemed to be working fine, including the NFS share I had set up prior to the iSCSI target, which runs on the same zpool. It has been running for almost a day since the reboot, and I will update if it crashes again.
In the meantime, I'd like to figure out better ways to troubleshoot this kind of problem. Has anyone seen istgt hang hard like this before? Where should I expect to find logs and error messages for iSCSI/istgt? How does one really-really-kill a process in FreeBSD when `kill -9` does not work? (BTW, I saw nothing indicating zombie status, but I'm not sure if that means the same thing as in Linux.) Where can I find good docs on safe values for those Target Global Configuration parameters, particularly max. sessions and max. connections? Any other tips and tricks for this stuff?
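On the kill -9 question specifically: a process stuck in an uninterruptible kernel sleep (state "D" in the ps STAT column, typically waiting on disk or a driver) cannot take delivery of any signal, SIGKILL included, until the kernel call returns — which would also explain the hung shutdown. On FreeBSD, `procstat -kk <pid>` dumps the process's kernel stack so you can see what it is waiting on. A small portable sketch of checking the state column (a D state can't be produced on demand, so this just shows where to look):

```shell
#!/bin/sh
# Show a process's state: a "D" in the STAT column means uninterruptible
# sleep, the classic case where even kill -9 has no visible effect.
sleep 300 &
pid=$!

ps -o pid,stat,comm -p "$pid"   # -o works on both FreeBSD and Linux ps
kill -9 "$pid"                  # clean up the demo process
```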
Thanks much in advance for any help. I will keep looking in the meantime.