Mobo died mid-resilvering - is my pool safe?

TheCowGod · Jan 16, 2016

OK, so I have a question about the hardiness of ZFS. To give a little background, my server, which was built about 8 months ago, is an ASRock Rack C2550D4i with 16GB Crucial ECC RAM in a SilverStone DS380B, populated with 8 drives arranged in 2 RAIDZ1 VDEVs of 4 drives each (I'm aware of the discouraged use of RAIDZ1 at this point, but that's neither here nor there). One VDEV is all 3 TB drives, and one had originally been all 2 TB drives, but I've been replacing them with 6 TB WD Red drives one by one as budget allows, with the intention of eventually ending up with all four 6 TB drives, thus unlocking that extra chunk of space. I've had one 6 TB drive in the array for months, and recently bought a second, but hadn't taken the time to install it yet.

However, a few weeks ago when I returned home from Christmas vacation, I found the server unresponsive, and on reboot it would refuse to POST (even after waiting a minute or two to get past the initial BMC sync, and even after disconnecting all peripherals etc). ASRock agreed I had done all the troubleshooting I could, and RMA'd the motherboard.

I received the replacement a few days ago, and the server happily booted right back up. After letting it run for 24 hours, everything seemed to be working fine, so I replaced one of the 2 TB drives in the second VDEV with the new 6 TB drive I had on hand. The server started working on resilvering the drive, and another 24 hours or so passed.

Then I awoke the next day to the server unresponsive again. All the same tests seem to point to the motherboard being bad. I know it seems oddly coincidental for the same part to fail twice back to back, but I'm pretty confident I've eliminated any other variables. The server is powered by a rackmount APC UPS, so it should be receiving nice clean power. To rule out any of my other components, I bought a new PSU and a stick of RAM to test with, and neither of those could get it to boot. So I'm fairly confident it's a bad motherboard again, as odd as it may seem.

ASRock is RMA'ing the motherboard again. As a side note, at this point, I've lost confidence in the reliability of ASRock, or at least this specific model of motherboard, so yesterday I ordered a SuperMicro X10SL7-F and Pentium G3258, which will be housed in a SuperMicro SC-846 as soon as I can find a good deal on one on eBay.

My concern, though, is with my data. I checked the status of the resilver before going to bed the night before, and it said it had about 8 hours left. I got an automated email at 3am saying it had about 4 hours left:

Code:

  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan 12 18:22:55 2016
        21.1T scanned out of 23.9T at 189M/s, 4h19m to go
        2.42T resilvered, 88.30% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/6edde696-e850-11e4-8cc5-d0509964c21c  ONLINE       0     0     0
            gptid/6fede701-e850-11e4-8cc5-d0509964c21c  ONLINE       0     0     0
            gptid/70ee95c4-e850-11e4-8cc5-d0509964c21c  ONLINE       0     0     0
            gptid/71eecd16-e850-11e4-8cc5-d0509964c21c  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            gptid/d5c41fa2-ed21-11e4-a6a6-d0509964c21c  ONLINE       0     0     0
            gptid/5f82fd5c-b983-11e5-b6ca-bc5ff4fda84e  ONLINE       0     0     0  (resilvering)
            gptid/d723824f-ed21-11e4-a6a6-d0509964c21c  ONLINE       0     0     0
            gptid/d7d59e65-ed21-11e4-a6a6-d0509964c21c  ONLINE       0     0     0

When I woke up and checked the server at around 9am, it was dead. So I don't know if the resilvering managed to complete before it failed, but I suspect it hadn't.
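
As a sanity check on the numbers in that email, here is a rough Python sketch that recomputes the progress and the time remaining from the "scanned out of" line. The function names are made up for illustration, and it assumes the size suffixes are binary units (T = 2^40 bytes, M = 2^20 bytes); it is not part of ZFS or FreeNAS.

```python
import re

# Assumption: zpool prints sizes in binary units with one-letter suffixes.
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def to_bytes(text):
    """Convert a size such as '21.1T' or '189M' to a number of bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def resilver_estimate(scan_line):
    """Return (percent_scanned, hours_remaining) for a resilver progress line."""
    match = re.search(
        r"([\d.]+[KMGT]) scanned out of ([\d.]+[KMGT]) at ([\d.]+[KMGT])/s",
        scan_line,
    )
    if match is None:
        raise ValueError("not a resilver progress line")
    scanned, total, rate = (to_bytes(group) for group in match.groups())
    percent = 100 * scanned / total
    hours = (total - scanned) / rate / 3600
    return percent, hours


# The progress line from the 3am status email.
line = "21.1T scanned out of 23.9T at 189M/s, 4h19m to go"
percent, hours = resilver_estimate(line)
print(f"{percent:.1f}% scanned, about {hours:.1f} hours left")
```

The result lands close to the 88.30% and 4h19m that ZFS reported, so if the scan rate had held, the resilver would have needed until roughly 7am to finish.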

If it hadn't finished resilvering, my question is this -- When I bring the server back up on its third motherboard, will ZFS be robust enough to just pick up on the resilvering where it left off? Or will the pool be corrupted?

Robert Trevellyan · Jan 16, 2016

TheCowGod said:
When I bring the server back up on its third motherboard, will ZFS be robust enough to just pick up on the resilvering where it left off?

It should do exactly that.

BigDave · Jan 16, 2016

It's my understanding that if there was a power interruption during a write to the drive being resilvered,
there may be corrupted data on that drive. But a scrub would then correct it once the pool status
is listed as Healthy.

edit: spelling
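
The order of operations described above (let the resilver finish, then scrub) could be sketched roughly as follows. This is only an illustration: `scan_line` and `next_step` are hypothetical helpers, the text of the "finished" scan line is an assumed example rather than output from this pool, and the script only suggests a command instead of running one. `zpool status` and `zpool scrub` are the real commands it refers to.

```python
def scan_line(status_text):
    """Return the 'scan:' line from `zpool status` output, or '' if absent."""
    for line in status_text.splitlines():
        if line.strip().startswith("scan:"):
            return line.strip()
    return ""


def next_step(status_text, pool="tank"):
    """Suggest a next action based on the scan line of `zpool status` output."""
    scan = scan_line(status_text)
    if "resilver in progress" in scan:
        # Still rebuilding the new disk: leave the pool alone until it is done.
        return "wait for the resilver to finish"
    if scan.startswith("scan: resilvered"):
        # Resilver finished: a scrub re-reads and verifies every block.
        return f"zpool scrub {pool}"
    # Anything else: look at the full status before deciding.
    return f"zpool status -v {pool}"


in_progress = "  scan: resilver in progress since Tue Jan 12 18:22:55 2016"
# Assumed wording for a completed resilver, for illustration only.
finished = "  scan: resilvered 2.74T in 28h15m with 0 errors"

print(next_step(in_progress))
print(next_step(finished))
```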

TheCowGod · Jan 16, 2016

Wonderful. Follow-up question -- would it do the same thing if I built the new server, installed a fresh copy of FreeNAS, and imported my pool? Or would that be more risky?

I'll have all the parts for the new X10SL7-F-based server in a few days, but I won't get my RMA'd motherboard back for a week or two. I'd love to be able to go ahead and build the new server and restore access to my data, but of course not if it risks losing that data.

As an alternative, would my existing FreeNAS install be upset/confused if I just booted it up in the new server?

Robert Trevellyan · Jan 16, 2016

TheCowGod said:
would my existing FreeNAS install be upset/confused if I just booted it up in the new server?

It should work fine. If you had manually configured any of the network settings, you might have to redo those, but everything else should work, including the resilver picking up where it left off.

TheCowGod · Jan 16, 2016

Robert Trevellyan said:
It should work fine. If you had manually configured any of the network settings, you might have to redo those, but everything else should work, including the resilver picking up where it left off.

Awesome. Truly an impressive piece of engineering. Thanks for the responses, everyone.
