Parity/CRC Errors When Accessing Drive During Scrub

Status
Not open for further replies.

bbddpp

Explorer
Hi folks,

I am running FreeNAS 9.1.1 on an old PC with 6 internal drives, plus one external enclosure that can hold up to 4 drives (I only have one drive in there right now), connected via a port multiplier card, which I know FreeNAS LOVES (sarcasm).

Before 9.1.1, the port-multiplier-connected box would not even survive a reboot without a hard power cycle of its own... Since 9.1.1 it has been behaving a lot better. However, last night it looks like I hit the wall again.

It appears that at midnight an automated scrub started on the drive in the enclosure (it resumes after a reboot and is currently at around 25%). It turns out I was trying to access a file on that drive at the same time (watching a movie). The movie would run for 30 minutes or so and then eventually die completely and take the NAS down with it, with this error:

ahcich1: Timeout on slot 24 port 0
ahcich1: (bunch of numbers)
(ada0:ahci1:0:0:0): READ_FPDMA_QUEUED. (bunch of numbers)
(ada0:ahci1:0:0:0): CAM status: Command timeout
(ada0:ahci1:0:0:0): Retrying command

The entire system then froze.

Each drive is currently its own zpool; I have no RAID at all. This is a JBOD setup full of media files. I have not yet upgraded any of the individual ZFS pools to the new ZFS version since the FreeNAS upgrade. Could that have anything to do with it, or is this more likely the enclosure/port multiplier card combination rearing its ugly head again?

Please let me know if I can provide any more detail, boot logs, etc. And thanks!!! I'd very much like to get this resolved so I can use my enclosure and continue to expand my media server.

B.
 

warri

Guru
The ZFS version is irrelevant here; I'd guess it's the port multiplier/cable.
To rule out the drive itself, look at the SMART output (smartctl -a -q noserial /dev/adaX) and the pool status (zpool status). You can also post them here if you want somebody else to look at them.
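Something like this from the shell would pull the interesting SMART counters for each drive plus the pool status in one go -- the ada0 through ada6 device names are only an example, adjust them to whatever your drives actually show up as:

for d in ada0 ada1 ada2 ada3 ada4 ada5 ada6; do
  echo "=== /dev/$d ==="
  # show reallocated/pending/CRC/uncorrectable counters from the full SMART report
  smartctl -a -q noserial /dev/$d | grep -iE "reallocated|pending|crc|uncorrect"
done
zpool status -v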

With your JBOD setup and external enclosure you should be aware that you might lose data unexpectedly, since you have a shaky connection and no redundancy.
 

bbddpp

Explorer
Thanks so much for the reply. Your suggestion to change out the port multiplier card was sound. I had a spare around, swapped it in, rebooted, and things seem much more stable now. Looking at the logs, the port multiplier card I had was unsupported by FreeNAS. Now, when booting, the system tells me this card IS supported. Good stuff. Definitely a lesson in making sure your hardware is supported.
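In case it helps anyone else, what I did was grep the boot messages for the controller's port multiplier line and for the timeout errors -- the exact driver and device names depend on your card, so treat these patterns only as examples of the idea:

dmesg | grep -i "port multiplier"
dmesg | grep -i ahcich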

As for the backups, thanks also for the concern -- I have been weighing my options as well. I currently have around 15 TB of data spread across the 7 drives and around 2 TB of free space. I'm trying to decide whether to spend the money on another 9 TB or so of space just to provide the redundancy and create a proper pool. As it stands now, I realize that if a drive fails I lose the entire contents of that drive, but only that drive, which is why I didn't set up one big pool (because I understand that without redundancy I'd basically lose it all if one drive failed). Still, losing even one drive isn't good; my hope is that, for now, FreeNAS can at least alert me if it sees one of the drives starting to fail.
 

warri

Guru
Make sure to configure SMART reporting via email and regular scrubs; those will notify you in case of problems.
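If you want to sanity-check a drive by hand in the meantime, you could kick off a short SMART self-test from the shell and read the result a few minutes later -- just an example, substitute your actual device name for ada0:

smartctl -t short /dev/ada0     # start a short self-test (takes a couple of minutes)
smartctl -l selftest /dev/ada0  # show the self-test log once it has finished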

Also, your understanding of a pool might not be complete: if you set up a pool with redundancy (mirror, RAID-Z), you are protected against one or multiple drive failures. Of course this reduces the amount of space available in the pool - with 7 disks people would usually use a RAID-Z2, so you get the capacity of 5 drives and protection against two concurrent drive failures.
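Just to illustrate what that looks like (don't run this against your current disks -- creating a pool wipes them, and the pool name and device names here are only placeholders):

zpool create tank raidz2 ada0 ada1 ada2 ada3 ada4 ada5 ada6
zpool status tank

On FreeNAS you would normally build the pool through the Volume Manager in the GUI rather than from the shell, but the resulting layout is the same.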
 

bbddpp

Explorer
Warri, you're right; I'm far from an expert on pools and ZFS.

When I built the FreeNAS box, the basic idea was just to have a centralized place to store all my media. I built it with so many leftover drives of different sizes and brands that I didn't think it was even possible to run a true RAID setup anyway, and I figured that if a drive was failing I'd at least get enough notice to move the files off of it before it died.

My ZFS scrubs all seem to be set at the default of the 1st Sunday of every month for each drive. Is that enough, or should I be staggering them or running them more often? I will say that the drives don't change much... once they're full, I pretty much move on to the next drive.

SMART tests also seem to be set at the default of quick self-tests every 3 days for each drive.
 

warri

Guru
Some people recommend scrubbing consumer drives every two weeks. I personally use the default of 30 days, though.
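Either way, you can always start a scrub by hand and check its progress from the shell (the pool name here is just a placeholder -- use one of yours):

zpool scrub mypool    # start a scrub of the pool
zpool status mypool   # shows scrub progress and any errors found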

Make sure your SMART reporting is working correctly (i.e. you actually get an email in case of a problem). You can test your email settings in the Settings tab. Also, I think SMART reports are automatically sent to the root user, so make sure the email address is set correctly on that account.
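A quick way to confirm that mail really leaves the box is to send a test message to root from the shell (this assumes the stock mail command and that root's email address is set):

echo "FreeNAS mail test" | mail -s "FreeNAS mail test" root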

If you have the chance to rebuild the system with redundancy in mind, I'd do it. There are a lot fewer things to worry about if it's set up properly, and you don't have to rush to save data from a failing drive (which might not always work anyway).
 

bbddpp

Explorer
Thanks again for all your help with this. I stuck with the 30 days for now. I had also forgotten to set up the email notifications, which are now being sent, so I'm golden.

I'd love to build redundancy in, but I'd really have to downsize my media collection. I wish I weren't such a pack rat, but I just like having all my purchased media digitally as well for some reason.
 