OK, so I have a question about the hardiness of ZFS. To give a little background, my server, which was built about 8 months ago, is an ASRock Rack C2550D4i with 16GB Crucial ECC RAM in a SilverStone DS380B, populated with 8 drives arranged in 2 RAIDZ1 VDEVs of 4 drives each (I'm aware of the discouraged use of RAIDZ1 at this point, but that's neither here nor there). One VDEV is all 3 TB drives, and one had originally been all 2 TB drives, but I've been replacing them with 6 TB WD Red drives one by one as budget allows, with the intention of eventually ending up with all four 6 TB drives, thus unlocking that extra chunk of space. I've had one 6 TB drive in the array for months, and recently bought a second, but hadn't taken the time to install it yet.
However, a few weeks ago, when I returned home from Christmas vacation, I found the server unresponsive, and on reboot it refused to POST (even after waiting a minute or two to get past the initial BMC sync, and even after disconnecting all peripherals, etc.). ASRock agreed I had done all the troubleshooting I could, and RMA'd the motherboard.
I received the replacement a few days ago, and the server happily booted right back up. After letting it run for 24 hours, everything seemed to be working fine, so I replaced one of the 2 TB drives in the second VDEV with the new 6 TB drive I had on hand. The server started working on resilvering the drive, and another 24 hours or so passed.
Then I awoke the next day to the server unresponsive again. All the same tests seem to point to the motherboard being bad. I know it seems oddly coincidental for the same part to fail twice back to back, but I'm pretty confident I've eliminated any other variables. The server is powered by a rackmount APC UPS, so it should be receiving nice clean power. To rule out any of my other components, I bought a new PSU and a stick of RAM to test with, and neither of those could get it to boot. So I'm fairly confident it's a bad motherboard again, as odd as it may seem.
ASRock is RMA'ing the motherboard again. As a side note, at this point I've lost confidence in the reliability of ASRock, or at least this specific model of motherboard, so yesterday I ordered a SuperMicro X10SL7-F and Pentium G3258, which will be housed in a SuperMicro SC-846 as soon as I can find a good deal on one on eBay.
My concern, though, is with my data. I checked the status of the resilver before going to bed the night before, and it said it had about 8 hours left. I got an automated email at 3am saying it had about 4 hours left:
Code:
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan 12 18:22:55 2016
        21.1T scanned out of 23.9T at 189M/s, 4h19m to go
        2.42T resilvered, 88.30% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/6edde696-e850-11e4-8cc5-d0509964c21c  ONLINE       0     0     0
            gptid/6fede701-e850-11e4-8cc5-d0509964c21c  ONLINE       0     0     0
            gptid/70ee95c4-e850-11e4-8cc5-d0509964c21c  ONLINE       0     0     0
            gptid/71eecd16-e850-11e4-8cc5-d0509964c21c  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            gptid/d5c41fa2-ed21-11e4-a6a6-d0509964c21c  ONLINE       0     0     0
            gptid/5f82fd5c-b983-11e5-b6ca-bc5ff4fda84e  ONLINE       0     0     0  (resilvering)
            gptid/d723824f-ed21-11e4-a6a6-d0509964c21c  ONLINE       0     0     0
            gptid/d7d59e65-ed21-11e4-a6a6-d0509964c21c  ONLINE       0     0     0
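As a rough sanity check on that estimate, here is a minimal Python sketch that recomputes the time remaining from the scan line in the email. The names `to_bytes` and `remaining_seconds` are my own. It assumes the T and M suffixes are powers of 1024 and that the scan rate stays constant, neither of which I've verified.

```python
import re

# Scan line copied from the status email.
SCAN_LINE = "21.1T scanned out of 23.9T at 189M/s, 4h19m to go"

# Assumption: the size suffixes are binary multiples (powers of 1024).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(value: str, unit: str) -> float:
    """Convert a number and a one-letter suffix into bytes."""
    return float(value) * UNITS[unit]


def remaining_seconds(scan_line: str) -> float:
    """Estimate seconds left as (total - scanned) / rate."""
    m = re.search(
        r"([\d.]+)([KMGT]) scanned out of ([\d.]+)([KMGT]) at ([\d.]+)([KMGT])/s",
        scan_line,
    )
    if m is None:
        raise ValueError("unrecognized scan line")
    scanned = to_bytes(m.group(1), m.group(2))
    total = to_bytes(m.group(3), m.group(4))
    rate = to_bytes(m.group(5), m.group(6))
    return (total - scanned) / rate


if __name__ == "__main__":
    secs = remaining_seconds(SCAN_LINE)
    print(f"about {secs / 3600:.1f} hours left at the reported rate")
```

The result is a bit over four hours, which is consistent with the "4h19m to go" in the email. Counting from the 3am email, the resilver would have needed until a little after 7am to finish, assuming the rate held.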
When I woke up and checked the server at around 9am, it was dead. So I don't know if the resilvering managed to complete before it failed, but I suspect it hadn't.
If it hadn't finished resilvering, my question is this: when I bring the server back up on its third motherboard, will ZFS be robust enough to pick the resilver up where it left off? Or will the pool be corrupted?