I have FreeNAS (FreeNAS-9.1.1-RELEASE-x64) running on an HP ProLiant G7 with 12GB RAM, acting as an iSCSI target. A Mac Mini connects to it via the globalSAN iSCSI initiator.
Up until the early hours of this morning it had been running 24/7 for six months without any issues. Overnight, one of my backups failed, and when I started investigating I found the following errors in dmesg (there were many more than this):
Code:
(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
(ada1:ata2:0:1:0): Retrying command
(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 50 90 ab 40 3d 01 00 00 38 00
(ada1:ata2:0:1:0): CAM status: ATA Status Error
(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
(ada1:ata2:0:1:0): Retrying command
(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 50 90 ab 40 3d 01 00 00 38 00
(ada1:ata2:0:1:0): CAM status: ATA Status Error
(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
(ada1:ata2:0:1:0): Error 5, Retries exhausted
(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 68 71 9f 40 10 01 00 00 28 00
(ada1:ata2:0:1:0): CAM status: ATA Status Error
(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): RES: 51 40 78 71 9f 10 10 01 00 00 00
(ada1:ata2:0:1:0): Retrying command
I figured from the UNC errors that this meant a failed or failing hard drive, and indeed when I swapped the drive out, it began the resilvering process and so far all seems to be OK again.
However, two things concern me a little here:
1. FreeNAS didn't seem to notice that the drive was bad. The output from zpool status gave me a clean bill of health even as more and more UNC errors were showing up in dmesg. Can anyone suggest any reasons why this wasn't picked up? Hard drive failures are obviously just a fact of life, but I'm slightly worried that FreeNAS didn't seem to be aware of what was going on and continued behaving as though the pool was healthy.
2. When the drive failed, OS X became unable to write to the iSCSI extent, and any attempt to do so caused it to be forcibly unmounted. The pool is four 3TB drives in raidz2, so a single drive failure shouldn't have had any effect at all - it should have just kept going, albeit in a degraded state. I don't think I've lost any data, but I wouldn't have expected any interruption from the end-user's perspective.
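For what it's worth, this is roughly what I've been running by hand since the failure to cross-check drive health, since zpool status apparently only counts errors ZFS itself hits. smartctl comes with FreeNAS as part of smartmontools; the pool name "tank" is just a placeholder for mine, and /dev/ada1 is the drive from the dmesg output above:

```shell
# ZFS only reports errors it sees at the pool level; the read/write/cksum
# counters can stay at zero while the drive itself is throwing UNC errors.
zpool status -v tank        # 'tank' is a placeholder - substitute your pool name

# SMART queries the drive directly, so the pending/reallocated sector
# counts can show trouble before ZFS notices anything.
smartctl -a /dev/ada1 | egrep 'Reallocated|Pending|Offline_Uncorrectable'

# Kick off a short self-test as a quick sanity check ('-t long' for a full scan)
smartctl -t short /dev/ada1
```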
From what I've read, it seems like this may have something to do with the swap settings (I accepted the defaults when I was initially setting things up, and as far as I remember this means a swap partition is created on each of the data drives). Can anyone confirm if this is likely to be the case?
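In case it's relevant, this is what I used to look at the swap layout. Both gpart and swapinfo are standard FreeBSD tools, and ada1 is again the drive from my logs; I'm assuming the p1 partitions are where FreeNAS puts its per-drive swap by default:

```shell
# Show the partition layout - on my drives there's a freebsd-swap
# partition (p1) at the start, ahead of the ZFS partition (p2)
gpart show ada1

# List the swap devices actually in use; if swap was active on the
# failed drive, losing it mid-flight could explain the disruption
swapinfo -h
```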
Thanks for any insights, and say the word if you need any more info about the setup and so on.