Offline & Replace, or Scrub?

Status
Not open for further replies.

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Hi FreeNAS Wizards,

I woke up to an e-mail alert telling me that one of my disks had been removed from my RAIDZ2 pool.

Took a look at the server and the disk's activity LED was stuck solid on. I reseated the drive, but the kernel didn't even notice. So I pulled the disk and ran a full diagnostic on it from a bench system, and it checked out just fine, including a full sector scan.

Thinking it must be a controller fluke, I just shut down the server, put the disk back in, and booted the server up again. I expected to see the disk in Offline status and have to remove and add it again with zpool, and let the array resilver.

However, zpool status shows the pool state as ONLINE, with the status message saying one or more devices has experienced an unrecoverable error. The troublesome disk shows 3 checksum errors, but that's it; everything else seems fine. There wasn't a lot of write activity to the array while it was degraded, but there was some, so I'm a little surprised nothing more is shown.
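For anyone following along, this is roughly how I'm checking it ("tank" is just a stand-in for my actual pool name):

Code:
# "tank" is a placeholder for my real pool name
zpool status -v tank
# the per-device READ/WRITE/CKSUM columns show the error counters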

So here I am wondering if I should just kick off a scrub and let ZFS work out the discrepancies that way, or if I should kick the disk out of the array and add it back, forcing a resilver.

What say ye Wizards? What's the best approach? Does it matter?

Thanks!
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
You're running on double redundancy, so I'd do a scrub and see how many errors ZFS finds. If there are still only 3 errors after the scrub, just reset the counter and observe the drive for a while to make sure it's OK. If errors start surfacing again, replace the drive.
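Something like this, assuming your pool is named tank:

Code:
zpool scrub tank      # kick off the scrub
zpool status tank     # check progress and the per-device CKSUM column
zpool clear tank      # once you're satisfied, reset the error counters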

Are the drive's SMART values OK?
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Yep, SMART says everything is just fine. :-/
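For the record, this is roughly what I checked (ada3 is a stand-in for the suspect disk):

Code:
smartctl -H /dev/ada3    # overall health self-assessment
smartctl -a /dev/ada3    # full attribute table, error log, and self-test results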

I kicked off the scrub. It usually takes around 14h to complete on this array. But something isn't right this time... I checked a few minutes later and it already says "scrub repaired 0 in 0h0m with 0 errors".

The checksum count on that suspect disk is steadily increasing. It started at 3 before the scrub and is now up to 45. Scratch that, now 47.

Should I kick the disk out, add it back, and resilver?

btw:
Code:
FreeBSD hostname 8.3-RELEASE-p10 FreeBSD 8.3-RELEASE-p10 #0 r255095M: Sat Aug 31 12:46:47 PDT 2013 jpaetzel@roadrash.ixsystems.com:/usr/home/jpaetzel/fn/freenas/os-base/amd64/usr/home/jpaetzel/fn/freenas/FreeBSD/src/sys/FREENAS.amd64 amd64
 

warri

Guru
Joined
Jun 6, 2011
Messages
1,193
Resilvering is technically just like running a scrub, so it won't help - I think you need a new disk here. To make sure it's not the controller or a cable, you could try switching the port or cable first.
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Well, I reseated both ends of the cable to see if that helps, but nothing around this server has moved in 6+ months. I'm not saying a cable, port, or controller couldn't spontaneously fail (gremlins!), it just seems less likely.

I offlined the disk before pulling the cable, then did a camcontrol reset on the bus, and the kernel sees the disk again. Unfortunately, no amount of fiddling would let me replace the disk with itself, because its gptid still matched the one recorded in the pool metadata.
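Roughly what that looked like (pool name, gptid, and bus number are placeholders here):

Code:
zpool offline tank gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
camcontrol reset 0    # reset the bus so the kernel re-probes the disk
camcontrol devlist    # confirm the disk is visible again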

I ended up zeroing the beginning of the disk (probably GBs' worth by the time I stopped it), then used the GUI to create a new zpool on it, because I wanted to preserve the FreeNAS partition layout (2 GB swap, etc.) and this was easier than recreating it myself.
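The zeroing step was just dd (ada3 is a placeholder device name; this destroys data, so triple-check the target):

Code:
# WARNING: wipes the target disk; ada3 stands in for the suspect drive
dd if=/dev/zero of=/dev/ada3 bs=1m
# I interrupted it with Ctrl-C after a while rather than letting it run to completion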

Finally, I detached that temporary zpool and then used -f on the CLI to force the new gptid partition to replace the old one.
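The forced replace was roughly this (both gptids are placeholders):

Code:
# force the freshly partitioned disk in over the old vdev entry
zpool replace -f tank gptid/old-placeholder gptid/new-placeholder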

It's resilvering now.

Next time I guess I should just shut the server down and replug or move the cable/port to avoid all this, but shutting down is a hassle in this case since so many VMs and applications run on top of this NAS.

The strangest part of all this is why the scrub seemed to abort, or at least reported completing in 0 seconds without showing any progress. I'm hoping the full resilver will go better. If I start seeing errors again, I'll try another port and/or cable.

I'll report back on behavior when the resilver completes in around 14h or so. (I wish I had enough budget to use all SSDs in the array; after using SSD in workstations for a couple years now I have no patience for HDDs, LOL.)
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
The resilver to the original disk completed successfully with no errors. Just for fun I also checked the reallocated sector count on all of the disks in this array:

Code:
[root@nas1] ~# bash
[root@nas1 ~]# for i in {0..4}; do smartctl -a /dev/ada$i | grep _Ct; done
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0


Everything looks good. I'm going to say this one was most likely a controller error, but it could also have been the cable, dust, oxidation, etc. In the end, the same cable and the same port on the controller are still in use. I'll just keep an eye on it.
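If it was the cable, SMART attribute 199 (UDMA_CRC_Error_Count) should eventually show it, so I'll watch that too; same loop as above:

Code:
# CRC errors usually implicate the cable/connection rather than the disk itself
for i in {0..4}; do smartctl -a /dev/ada$i | grep -i crc; done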
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think you are on the right track. Just watch it. If it drops again then look at troubleshooting further. Obviously a disk that randomly drops from your pool is not a reliable way to store your data.
 