ZFS RAID-Z1 Disk Replace

Status
Not open for further replies.

egrimisu

Dabbler
Joined
Aug 30, 2011
Messages
14
I have been running ZFS RAID-Z1 with 5 disks for 3 years now with no problems at all.
Unfortunately, the day of the failing disk has come. I lost a disk in the array: it simply went offline, and after a few days a second one started to throw errors as well. When the system detected checksum errors on the second disk (some bad sectors according to SMART), it started to resilver the array. By the time I got to the PC, the resilver was already at 40%, so to avoid a catastrophe I decided to shut down the server ASAP.

So basically my array looks almost like this, and somewhere in the output it is mentioned that data was lost:

NAME                                    STATE     READ WRITE CKSUM
Misu                                    DEGRADED     0     0     0
  raidz1-0                              ONLINE       0     0     0
    scsi-SATA_ST3000DM001-9YN_Z1F1587B  OFFLINE      0     0     0  (failed hdd)
    scsi-SATA_ST3000DM001-9YN_Z1F14J7V  ONLINE       0     0     0
    scsi-SATA_ST3000DM001-9YN_Z1F14JYL  ONLINE       0     0     0
    scsi-SATA_ST3000DM001-1CH_W1F1G04F  ONLINE       0     0     0
    scsi-SATA_ST3000DM001-1CH_W1F1G1H7  ONLINE     134     5   139  (failing hdd)

Since the resilver process takes some time, I'm quite afraid of replacing the first disk and just hoping that the second one, the one with checksum errors, will not fail in the meantime. So I have decided to replace the PCB on the first failed disk, since it had PCB problems and not mechanical problems.

So, if I manage to get the first disk running, what should I do next? How will ZFS know that the disk was not replaced, and detect it as the original member? (I'm not sure, but I believe that changing the PCB will change the serial number and other identifiers for that disk.)

Any other information that can help me not to make this worse?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Any other information that can help me not to make this worse?

Don't reboot again, as each time you reboot you stop the resilvering process and you put a bit more stress on the drives.

Changing the PCB will change the serial number, but as FreeNAS uses the GPTID to identify the drives this should work. However, there might be some problems because of the S/N and GPTID mismatch, but you should be able to mount the pool anyway (maybe using the CLI, but be careful, don't do anything unless some experienced member tells you what to do) ;)

NB: this is the perfect example of why I recommend RAID-Z2 (or 3) over RAID-Z1 :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Pretty sure you're not using FreeNAS 8 or FreeNAS 9. What ARE you using?
 

egrimisu

Dabbler
Joined
Aug 30, 2011
Messages
14
Pretty sure you're not using FreeNAS 8 or FreeNAS 9. What ARE you using?

Ubuntu. I hope that's not an issue; in the end, BSD ported ZFS from Solaris.

Don't reboot again, as each time you reboot you stop the resilvering process and you put a bit more stress on the drives.

Changing the PCB will change the serial number, but as FreeNAS uses the GPTID to identify the drives this should work. However, there might be some problems because of the S/N and GPTID mismatch, but you should be able to mount the pool anyway (maybe using the CLI, but be careful, don't do anything unless some experienced member tells you what to do) ;)

NB: this is the perfect example of why I recommend RAID-Z2 (or 3) over RAID-Z1 :)

We use a lot of storage systems at work, all of them RAID 5 with one hot spare. Guess what: drives fail once in a while, and none of the arrays has died while rebuilding. I really think these home drives are not reliable enough...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
ZFS on FreeBSD has the ability to resilver before removing the offending drive, helping to avoid degraded pools, if possible.

No idea about ZFS on Linux.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ubuntu. I hope that's not an issue; in the end, BSD ported ZFS from Solaris.

Well, this is an issue...

Forget the part on the PCB change in my previous post, it's irrelevant as you use something very different from FreeNAS.

Why do you ask on the FreeNAS forum if you're using Ubuntu?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, this is an issue... Forget the bit on the S/N and GPTID in my previous post, it's irrelevant as you use something very different from FreeNAS.

Why do you ask on the FreeNAS forum if you're using Ubuntu?

The last time I answered that question, I got a warning from jgreco that I was defaming the forum by saying it had a reputation for solving people's problems well enough to attract people from other OSes. :p
 

egrimisu

Dabbler
Joined
Aug 30, 2011
Messages
14
ZFS on FreeBSD has the ability to resilver before removing the offending drive, helping to avoid degraded pools, if possible.

No idea about ZFS on Linux.

I think I don't understand what that means. The behavior I saw is that nothing happened after the disk went offline; it started to resilver only after the second disk started to throw some checksum errors. I believe that was what triggered the resilver.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yeah, that's behavior we don't see around here. If a drive is down, it stays down. We do have hot spare support, though.

Honestly, replacing the disk's PCB can be a rather bad idea. It depends on how much information is stored on the disk proper and how much information is kept in flash/EEPROM.

Bottom line on any OS: Don't let problems grow unmanageable. To have a greater margin for maneuver, use RAIDZ2 or RAIDZ3.
 

egrimisu

Dabbler
Joined
Aug 30, 2011
Messages
14
Well, this is an issue...

Forget the part on the PCB change in my previous post, it's irrelevant as you use something very different from FreeNAS.

Why do you ask on the FreeNAS forum if you're using Ubuntu?

Because it's a larger community that uses ZFS, of course.

At the time I built my NAS, something pointed me to Ubuntu; FreeNAS was still at version 7. I do remember trying Solaris, OpenIndiana + napp-it, FreeNAS and Ubuntu. Unfortunately, I forgot why I chose Ubuntu.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok, but they're very different, we just can't answer because we don't know how ZFS is implemented on Ubuntu. I think you'll have better answers (or answers at all) if you ask on the Ubuntu forum ;)
 

egrimisu

Dabbler
Joined
Aug 30, 2011
Messages
14
Ok, but they're very different, we just can't answer because we don't know how ZFS is implemented on Ubuntu. I think you'll have better answers (or answers at all) if you ask on the Ubuntu forum ;)

Done that, still waiting for them to react. Anyway, it seems that there are not a lot of options available ;)

Thanks again!
Have a good evening
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ok, but they're very different, we just can't answer because we don't know how ZFS is implemented on Ubuntu. I think you'll have better answers (or answers at all) if you ask on the Ubuntu forum ;)
Forum usefulness is proportional to aggressiveness towards people who don't RTFM/are completely lost.
Let's just say that the people over at the Ubuntu forums are very friendly ;)
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Forum usefulness is proportional to aggressiveness towards people who don't RTFM/are completely lost.
Let's just say that the people over at the Ubuntu forums are very friendly ;)

Yep, saw that with the stackexchange forums, very high SNR :)

Then, if we assume that a Linux is a Linux, you can ask on other Linux forums (Debian, ...); maybe you'll get better answers.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The last time I answered that question, I got a warning from jgreco that I was defaming the forum by saying it had a reputation for solving people's problems well enough to attract people from other OSes. :p

I said *what*?
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
I'll take a guess. But take it with a grain of salt. You built your Ubuntu pool using /dev/disk/by-id/ so it remains consistent. It looks like Linux takes the serial number as part of that moniker. So when you change the PCB... your disk will NOT look the same to the system. Your zpool.cache will be wrong. However, you should be able to add the disk and just re-import the pool to rebuild it. It should resilver and be fine.
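To make the by-id point concrete, here is a minimal Python sketch. It assumes the `bus-TRANSPORT_MODEL_SERIAL` shape of the names in the pool listing earlier in the thread; `split_by_id`, `name_with_serial` and the placeholder serial `XXXXXXXX` are hypothetical names made up for illustration, not part of any ZFS or udev tool. It only shows why a different reported serial would produce a different device name, not how ZFS itself locates pool members.

```python
def split_by_id(name):
    """Split a by-id style device name into its parts.

    Assumes the 'bus-TRANSPORT_MODEL_SERIAL' shape seen in the pool listing
    in this thread; other schemes (for example wwn-...) are not handled.
    """
    bus, rest = name.split("-", 1)
    # The serial is whatever follows the last underscore.
    head, serial = rest.rsplit("_", 1)
    # The part before the first underscore is the transport tag (e.g. SATA).
    transport, model = head.split("_", 1)
    return {"bus": bus, "transport": transport, "model": model, "serial": serial}


def name_with_serial(name, new_serial):
    """Return the name the same disk would get if it reported new_serial."""
    p = split_by_id(name)
    return f"{p['bus']}-{p['transport']}_{p['model']}_{new_serial}"


old = "scsi-SATA_ST3000DM001-9YN_Z1F1587B"  # failed disk from the listing
new = name_with_serial(old, "XXXXXXXX")      # hypothetical serial after a PCB swap

print(split_by_id(old))
print(new != old)  # the by-id path recorded for the pool would no longer match
```

If the name does change, the re-import mentioned above is what lets the pool be found again; on ZFS on Linux, `zpool import -d /dev/disk/by-id` points the import at that directory, though nothing here has been tested against this particular setup.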

With a little luck it should go pretty smoothly. I could also be wrong, but I'd take my chances all day in your shoes. I'd likely skip changing the PCB, just grab a new drive and resilver, were it not for the additional errors in your remaining (non-redundant) system. There may be less risk using the original drive with a new PCB if all the intact data saves a full resilver. Then you dump the failing disk. And once resilvered, also dump the PCB-modded disk.

Truth is any z1 pool is scary as fsck when there is a failed device and another throwing errors. Hope you have a backup.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I said *what*?

Don't ask for the exact wording - only Watson could find that now, amid thousands of posts.

It was a joke about the outside view of this forum, so no need to panic.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
@mjws00 I guess the purpose of changing the PCB is to use the good data on this drive to resilver the other. I don't think it'll go smoothly with a failed drive and a second one throwing errors unfortunately...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Moderator note: evicting thread from FreeNAS support area to off-topic.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
@mjws00 I guess the purpose of changing the PCB is to use the good data on this drive to resilver the other. I don't think it'll go smoothly with a failed drive and a second one throwing errors unfortunately...

Yep. If changing the PCB results in a 100% working disk, that is magical. We'd have less risk than with a full resilver onto a new drive. There will be parity available for bad blocks on the failing drive. It might save the pool, or limit the damage to a loss of the files with errors.
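
For anyone wondering why a fully working first disk helps with bad blocks on the second, here is a simplified Python sketch of single-parity reconstruction. It models one stripe as four data blocks plus one XOR parity block; the block contents and the helper names `xor_blocks` and `rebuild` are made up for illustration, and real RAID-Z1 uses variable-width stripes and checksums that this toy model ignores.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, missing_index):
    """Recompute one missing block from all the other blocks in the stripe."""
    survivors = [b for i, b in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


# Hypothetical stripe across 5 disks: 4 data blocks + 1 parity block.
data = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
stripe = data + [xor_blocks(data)]

# Any single block can be rebuilt as long as the other four are readable.
for i in range(len(stripe)):
    assert rebuild(stripe, i) == stripe[i]
print("any one missing block per stripe is recoverable")
```

With one disk offline, every stripe is already missing a block, so a bad sector on the failing disk leaves two blocks missing and nothing to rebuild from. With all five disks readable again, a bad sector is the only missing block in its stripe and can be recomputed.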

Tough spot. But all you can do is try. 3TB Barracudas in Z1 are gross. I'd already be migrating to Z2 and restoring from backup. ;)
 