I have FreeNAS (FreeNAS-9.1.1-RELEASE-x64) running on an HP ProLiant G7 with 12GB RAM, acting as an iSCSI target. A Mac Mini connects to it via the globalSAN iSCSI initiator.
Up until the early hours of this morning it had been running 24/7 for six months without any issues. Overnight, one of my backups failed, and when I started investigating I found the following errors in dmesg (there were many more than this):
Code:
(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
(ada1:ata2:0:1:0): Retrying command
(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 50 90 ab 40 3d 01 00 00 38 00
(ada1:ata2:0:1:0): CAM status: ATA Status Error
(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
(ada1:ata2:0:1:0): Retrying command
(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 50 90 ab 40 3d 01 00 00 38 00
(ada1:ata2:0:1:0): CAM status: ATA Status Error
(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): RES: 51 40 50 90 ab 3d 3d 01 00 00 00
(ada1:ata2:0:1:0): Error 5, Retries exhausted
(ada1:ata2:0:1:0): READ_DMA48. ACB: 25 00 68 71 9f 40 10 01 00 00 28 00
(ada1:ata2:0:1:0): CAM status: ATA Status Error
(ada1:ata2:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada1:ata2:0:1:0): RES: 51 40 78 71 9f 10 10 01 00 00 00
(ada1:ata2:0:1:0): Retrying command
I figured from the UNC errors that this meant a failed or failing hard drive, and indeed when I swapped the drive out, it began the resilvering process and so far all seems to be OK again.
However, two things concern me a little here:
1. FreeNAS didn't seem to notice that the drive was bad. The output from zpool status gave me a clean bill of health even as more and more UNC errors were showing up in dmesg. Can anyone suggest any reasons why this wasn't picked up? Hard drive failures are obviously just a fact of life, but I'm slightly worried that FreeNAS didn't seem to be aware of what was going on and continued behaving as though the pool was healthy.
2. When the drive failed, OS X became unable to write to the iSCSI extent, and any attempt to do so caused it to be forcibly unmounted. The pool is four 3TB drives in raidz2, so a single drive failure shouldn't have had any effect at all - it should have just kept going, albeit in a degraded state. I don't think I've lost any data, but I wouldn't have expected any interruption from the end-user's perspective.
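For what it's worth, this is roughly what I've been running by hand since the failure to cross-check drive health, since zpool status apparently only counts errors ZFS itself hits. smartctl comes with FreeNAS as part of smartmontools; the pool name "tank" is just a placeholder for mine, and /dev/ada1 is the drive from the dmesg output above:

```shell
# ZFS only reports errors it sees at the pool level; the read/write/cksum
# counters can stay at zero while the drive itself is throwing UNC errors.
zpool status -v tank        # 'tank' is a placeholder - substitute your pool name

# SMART queries the drive directly, so the pending/reallocated sector
# counts can show trouble before ZFS notices anything.
smartctl -a /dev/ada1 | egrep 'Reallocated|Pending|Offline_Uncorrectable'

# Kick off a short self-test as a quick sanity check ('-t long' for a full scan)
smartctl -t short /dev/ada1
```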
From what I've read, it seems like this may have something to do with the swap settings (I accepted the defaults when I was initially setting things up, and as far as I remember this means a swap partition is created on each of the data drives). Can anyone confirm if this is likely to be the case?
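In case it's relevant, this is what I used to look at the swap layout. Both gpart and swapinfo are standard FreeBSD tools, and ada1 is again the drive from my logs; I'm assuming the p1 partitions are where FreeNAS puts its per-drive swap by default:

```shell
# Show the partition layout - on my drives there's a freebsd-swap
# partition (p1) at the start, ahead of the ZFS partition (p2)
gpart show ada1

# List the swap devices actually in use; if swap was active on the
# failed drive, losing it mid-flight could explain the disruption
swapinfo -h
```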
Thanks for any insights, and say the word if you need any more info about the setup and so on.