Why didn't FreeNAS fail the drive?

Status
Not open for further replies.

mez

Cadet
Joined
Feb 11, 2014
Messages
3
I have FreeNAS (FreeNAS-9.1.1-RELEASE-x64) running on an HP ProLiant G7 with 12GB RAM, acting as an iSCSI target. A Mac Mini connects to it via the globalSAN iSCSI initiator.

Up until the early hours of this morning it had been running 24/7 for six months without any issues. Overnight, one of my backups failed, and when I started investigating I found the following errors in dmesg (there were many more than this):

Code:
+(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
+(ada1:ata2:0:1:0): Retrying command
+(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 50 90 ab 40 3d 01 00 00 38 00
+(ada1:ata2:0:1:0): CAM status: ATA Status Error
+(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
+(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
+(ada1:ata2:0:1:0): Retrying command
+(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 50 90 ab 40 3d 01 00 00 38 00
+(ada1:ata2:0:1:0): CAM status: ATA Status Error
+(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
+(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
+(ada1:ata2:0:1:0): Error 5, Retries exhausted
+(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 68 71 9f 40 10 01 00 00 28 00
+(ada1:ata2:0:1:0): CAM status: ATA Status Error
+(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
+(ada1:ata2:0:1:0): RES: 51 40 78 71 9f 10 10 01 00 00 00
+(ada1:ata2:0:1:0): Retrying command


I took the UNC errors to mean a failed or failing hard drive, and indeed when I swapped the drive out, the pool began resilvering and so far all seems to be OK again.

However, two things concern me a little here:

1. FreeNAS didn't seem to notice that the drive was bad. The output from zpool status gave me a clean bill of health even as more and more UNC errors were showing up in dmesg. Can anyone suggest any reasons why this wasn't picked up? Hard drive failures are obviously just a fact of life, but I'm slightly worried that FreeNAS didn't seem to be aware of what was going on and continued behaving as though the pool was healthy.

2. When the drive failed, OS X became unable to write to the iSCSI extent, and any attempt to do so caused it to be forcibly unmounted. The pool is four 3TB drives in raidz2, so a single drive failure shouldn't have had any effect at all - it should have just kept going, albeit in a degraded state. I don't think I've lost any data, but I wouldn't have expected any interruption from the end-user's perspective.
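As an aside for anyone hitting similar logs: the device name at the front of each CAM line tells you which disk is throwing the UNC errors, so a quick tally can confirm it's one drive rather than, say, a controller problem. A minimal sketch (the sample lines stand in for live `dmesg` output):

```shell
# Tally UNC (uncorrectable read) errors per device from dmesg-style CAM output.
# The sample text below stands in for the real output of `dmesg`.
dmesg_sample='(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): Retrying command
(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): Error 5, Retries exhausted'

# Pull the device name out of the leading "(ada1:..." tag and count UNC lines.
printf '%s\n' "$dmesg_sample" \
  | grep 'UNC' \
  | sed 's/^(\([a-z0-9]*\):.*/\1/' \
  | sort | uniq -c
```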

From what I've read, it seems like this may have something to do with the swap settings (I accepted the defaults when I was initially setting things up, and as far as I remember this means swap space is spread across all of the drives). Can anyone confirm whether this is likely to be the case?
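For what it's worth, the swap layout is easy to inspect: on FreeBSD, `swapinfo` lists one swap device per pool member when the installer defaults are used. A rough sketch of summing them up (sample output is used here since the real command needs a live box, and the 2 GiB-per-disk size is the assumed default, which may differ on your install):

```shell
# Sum up swap devices from swapinfo-style output. The sample stands in for
# real `swapinfo` output on FreeBSD; 2 GiB per data disk is an assumption.
swapinfo_sample='Device          1K-blocks     Used    Avail Capacity
/dev/ada0p1       2097152        0  2097152     0%
/dev/ada1p1       2097152        0  2097152     0%
/dev/ada2p1       2097152        0  2097152     0%
/dev/ada3p1       2097152        0  2097152     0%'

# Each pool member carries its own swap partition, so losing a disk also
# yanks a live swap device out from under the kernel.
printf '%s\n' "$swapinfo_sample" \
  | awk 'NR > 1 { n++; kb += $2 } END { print n " swap devices, " kb / 1048576 " GiB total" }'
```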

Thanks for any insights, and say the word if you need any more info about the setup and so on.
 

raidflex

Guru
Joined
Mar 14, 2012
Messages
531
Are you running regular scrubs?
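(A scrub reads every block in the pool and is usually what turns silent read errors into READ/CKSUM counts in `zpool status`. A rough sketch of pulling out the last scrub result; sample text is used here since the real command needs a live pool, and `tank` is a placeholder pool name:)

```shell
# Extract the scrub summary from zpool-status-style output.
# On a live system this would be: zpool status tank
status_sample='  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 5h02m with 0 errors on Sun Feb  9 08:02:33 2014
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0'

# Split on ": " so the timestamp colons stay intact, and print the scan field.
printf '%s\n' "$status_sample" | awk -F': ' '/scan:/ { print $2 }'
```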
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Did you have it configured to run SMART tests on each of the disks?


Sent from my phone
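(For context: FreeNAS drives this from the GUI, via the S.M.A.R.T. service plus per-disk test schedules, but underneath it's smartmontools, so the settings amount to smartd.conf directives along these lines. Illustrative only; the device path and email address are placeholders:)

```
# Illustrative smartd.conf directive (FreeNAS generates the real one from the GUI).
# Monitor ada1 (-a = all checks), run a short self-test daily at 02:00 and a
# long self-test every Saturday at 03:00, and mail root on failure.
/dev/ada1 -a -s (S/../.././02|L/../../6/03) -m root@localhost
```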
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I can confirm that the swap space is spread across all the drives (each data disk gets its own swap partition). I'm not a big fan of it, but it is what it is.

You raise good questions here: why FreeNAS didn't notify you of the errors or report the pool as degraded, and why a single drive failure in a RAIDZ2 configuration impacted the user at all. If you don't get a good answer, I'd submit a bug report just to raise awareness with the developers; they would actually appreciate it. I'm also not sure whether iSCSI is the reason things didn't work as expected.
 

mez

Cadet
Joined
Feb 11, 2014
Messages
3
gpsguy said:
Did you have it configured to run SMART tests on each of the disks?

Yes I did. If I'd had more time, I would have used something like smartctl to investigate before removing the drive. As it was, though, it was more important to get things working again as soon as possible.
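(For anyone curious, the attributes worth pulling out of `smartctl -a /dev/ada1` in a situation like this are the reallocated/pending/uncorrectable sector counts. A throwaway sketch against sample output, since the real command needs the disk still installed; the raw values here are made up:)

```shell
# Pick the sector-health attributes out of smartctl-style attribute output.
# The sample lines stand in for real `smartctl -a /dev/ada1` output.
smart_sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       8'

# Column 2 is the attribute name, column 10 the raw value.
printf '%s\n' "$smart_sample" | awk '{ print $2 " = " $10 }'
```

Non-zero raw values on any of those three attributes would have backed up the UNC story before pulling the disk.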

joeschmuck said:
I can confirm that the swap space is spread across all the drives (each data disk gets its own swap partition). I'm not a big fan of it, but it is what it is.

You raise good questions here: why FreeNAS didn't notify you of the errors or report the pool as degraded, and why a single drive failure in a RAIDZ2 configuration impacted the user at all. If you don't get a good answer, I'd submit a bug report just to raise awareness with the developers; they would actually appreciate it. I'm also not sure whether iSCSI is the reason things didn't work as expected.


Thanks for this. I'll definitely be making time to investigate further. I've used FreeNAS for some time and I think it's fantastic, but this has shaken my faith slightly. I'm still hoping that it's something easily explainable or something that I've done wrong rather than a fault with the OS itself.
 