I've been running ESXi at home for a while now, with my datastores set up on a Linux mdadm / VMFS / iSCSI system. One of the drives in a mirror was beginning to go bad, so I decided to give FreeNAS a try to take advantage of ZFS.
I set up a new FreeNAS box, created a new volume, and shared out some zvols over iSCSI. My intention was to move the guest machines to this box temporarily, then rebuild the 12-slot server I was moving them from and import the volumes there. I began to move some systems over using the vSphere console, and everything seemed to be working without issue.
After getting most of the smaller systems migrated over the course of a week, I began to copy my large file store (about 2 TB total) and let it run. At some point the copy failed, and I began to get messages on the console:
Nov 16 00:40:33 freenas istgt[7437]: istgt_lu_disk.c:3960:istgt_lu_disk_lbwrite: ***ERROR*** lu_disk_write() failed
Nov 16 00:40:33 freenas istgt[7437]: istgt_lu_disk.c:6051:istgt_lu_disk_execute: ***ERROR*** lu_disk_lbwrite() failed
Nov 16 00:40:33 freenas istgt[7437]: istgt_lu_disk.c:3960:istgt_lu_disk_lbwrite: ***ERROR*** lu_disk_write() failed
Nov 16 00:40:33 freenas istgt[7437]: istgt_lu_disk.c:6051:istgt_lu_disk_execute: ***ERROR*** lu_disk_lbwrite() failed
I was getting a few of these every minute, so I restarted the ESX host to reset the iSCSI connection and see if that would help. It had quite the opposite effect: I began to get several of these every second, and some of my guest machines became unavailable.
The VMware logs show that it is trying to play back a journal, but that playback is failing. After several reboots and disconnects/reconnects I was able to see all but 2 of the VM guests. I've copied those out to a USB drive, and I'm pushing them back onto my old Linux LVM box.
I found a page that suggested scrubbing my ZFS volume, and that completed without errors. S.M.A.R.T. tests also found nothing wrong with the drives. This doesn't seem to be a hardware issue with the disk subsystem, but rather something wrong with the istgt / VMware connection.
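For reference, the scrub and SMART checks I ran looked roughly like this (the pool name and FreeBSD device names below are examples, not my actual ones):

```shell
# Start a scrub of the pool (pool name is an example)
zpool scrub tank

# Check scrub progress and results; mine ended with
# "errors: No known data errors"
zpool status -v tank

# Run a long SMART self-test on each member disk
# (device names assumed; repeat for ada1, ada2, ada3)
smartctl -t long /dev/ada0

# Later, review the self-test log and attributes
smartctl -a /dev/ada0
```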
One of the guest machines that disappeared has data on it that I don't have a recent backup of. When I look at the disk usage pie chart in VMware, it shows much more space used than I can account for in the datastore browser. I think that if I can get this write error to go away, VMware may replay the journal and give me access to that machine long enough to copy it off.
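One way I've been comparing the pie chart against what's actually on disk is from the ESXi shell (SSH enabled); this can show files for an orphaned VM that the datastore browser hides. The datastore name below is a placeholder:

```shell
# List everything on the datastore, including per-VM folders
# that the datastore browser may not display
ls -lah /vmfs/volumes/datastore1/

# Show space used per VM folder to find where the "missing"
# space from the pie chart is going
du -sh /vmfs/volumes/datastore1/*
```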
My initial setup before the failure was:
- Install FreeNAS 8.0.2-RELEASE x86 - this was a temporary machine
- Create 4 x 2TB RAIDZ2 zpool
- Create 3 x 1TB zvolumes
- Configure iSCSI initiator / target / device extents (1 target, 3 LUNs)
- Attach ESXi initiator
- Use the vSphere console to move guest machines
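In ZFS terms, the pool and zvol steps above amount to roughly the following (I actually did this through the FreeNAS web UI, and the pool, zvol, and device names here are illustrative):

```shell
# 4 x 2TB drives in a RAIDZ2 pool (device names assumed)
zpool create tank raidz2 ada0 ada1 ada2 ada3

# 3 x 1TB zvols to back the iSCSI device extents
zfs create -V 1T tank/vmstore0
zfs create -V 1T tank/vmstore1
zfs create -V 1T tank/vmstore2
```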
After the failure, I tried:
- Rebooting the ESX server
- Shutting down the ESX and rebooting the NAS
- Scrubbing the zpool with ESX offline (15 hours, and no corruption found)
- Exporting the zpool, rebuilding using 8.0.2-RELEASE-amd64 and re-importing the volume
- Re-configuring the iSCSI settings to use 3 targets, each with only 1 LUN (VMware didn't recognize the datastore and I had to revert)
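The export/rebuild/import step above, for reference, was roughly the following (pool name is an example):

```shell
# On the old x86 install, before rebuilding:
zpool export tank

# After reinstalling with 8.0.2-RELEASE-amd64:
zpool import        # list pools available for import
zpool import tank   # import the pool; add -f only if it
                    # complains the pool was used by another system
```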
Any suggestion on how to fix this? I'm at the point of scrapping FreeNAS and going back to Linux LVM, but don't want to give up on the orphan machine until I have to.