Replacing HDDs in Offsite Backup - Best Practice Question

How many drives would you replace at a time?

  • 1: 1 vote (25.0%)
  • 2: 2 votes (50.0%)
  • 3: 1 vote (25.0%)

Total voters: 4

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Hey gang,

I currently have an offsite server that does daily backups of my primary server. The offsite server is a six-drive raidz3:

Supermicro X10SDV-4C-TLN2F
32GB ECC RAM
6 x 4TB WD Red
Fractal Node 304

I recently purchased six 8TB HDDs and want to swap them into the backup server to get a larger pool. I know how to replace the drives, and I know how I could do it one at a time.
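
For reference, the one-at-a-time procedure I had in mind looks roughly like this; the gptid names are placeholders for the real disk labels, and I would likely drive it from the FreeNAS GUI rather than the shell:

Code:
# Take the old disk offline cleanly so it could be reinserted later.
zpool offline BrianNASbackup gptid/OLD-DISK

# Physically swap in the 8TB drive, then start the resilver.
zpool replace BrianNASbackup gptid/OLD-DISK gptid/NEW-DISK

# Watch the resilver.
zpool status BrianNASbackup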

As this box is six hours away and it would be a hassle to stay while the drives resilver, I wanted to ask the community how many drives you would feel comfortable replacing at the same time in this raidz3 setup. I am a little sheepish about the idea of offlining two drives at a time and resilvering both at once, but all the current drives are in great condition and could be swapped back in if a resilver failed.

How many drives would you replace at a time? Would you "offline" the drives so that they could be reinserted later on, rather than ending up with a terminally degraded pool?
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

How many drives would you replace at a time? Would you "offline" the drives so that they could be reinserted later on, rather than ending up with a terminally degraded pool?
You have a healthy pool, so the risk is low: the drives are not very old, they are reporting as healthy, and your source pool is healthy. Since this is just a backup and you have RAID-Z3, you can replace three drives at the same time. I have replaced two drives at once in my RAID-Z2, and it doesn't take much longer to resilver two drives than it does to resilver one. I wouldn't suggest that if this were your only copy of the data. It will still take quite a while depending on how much data you have, so you might have to spend the night and do the other three drives the next day. On the drives in one of my servers at work, where there is about 3TB of data on a 6TB drive, a resilver takes over 12 hours.
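If you go that route, you can kick off all three replacements back to back and ZFS will resilver them in a single pass. At the command line it would look roughly like this, with placeholder pool and disk names (the FreeNAS GUI's volume status screen does the same thing):

Code:
# Start all three replacements; they resilver together in one pass.
zpool replace tank gptid/OLD-1 gptid/NEW-1
zpool replace tank gptid/OLD-2 gptid/NEW-2
zpool replace tank gptid/OLD-3 gptid/NEW-3

# Check progress.
zpool status tank

One other thing: once all six are swapped, the extra space only shows up if autoexpand is on ('zpool set autoexpand=on tank'), though if I remember right, FreeNAS enables that by default on pools it creates.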
How much data do you have on the system?
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

The pool is 55% used, so each disk should be about half full, and the 8TB drives would end up about a quarter full.
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Thanks for the response. My source pool is a healthy raidz2. Both pools get scrubbed every week and have no errors or reallocated sectors.
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

The pool is 55% used, so each disk should be about half full, and the 8TB drives would end up about a quarter full.
It depends on some variables I can only guess about, but you should expect it to take 8 to 10 hours to resilver, so doing one drive at a time would take days.
The way you describe the pool, six drives total with three of them parity, you could do it in two sets of three and be done in about 24 hours.
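Back of the envelope, and only a rough guess: 55% of a 4TB disk is about 2.2TB to rebuild per drive. At a sustained real-world resilver rate somewhere in the 50-80MB/s range, that works out to roughly 8 to 12 hours per pass, which is where my estimate comes from. A drive's sequential maximum would finish sooner, but resilvers rarely run flat out.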
 
Joined: May 10, 2017 · Messages: 838

I've never used raidz3, but resilvering two disks at once on raidz2 took the same time as resilvering just one, so doing three at the same time should be much faster than doing them one by one.
 

Arwen
MVP · Joined: May 17, 2014 · Messages: 3,611

One thing I prefer with ZFS is free slots. If your backup server has free disk slots, you can fill them with the replacement disks (warm spares), then replace 1 or 2 of the existing backup pool disks at a time. If you have 1 free slot and replace 2 pool disks at the same time, that's half the work of upgrading 6 disks.

ZFS has a neat option to replace disks with other disks already available to the OS. This is not hot-sparing; it is manual replacement, what I call warm sparing, since the disks are spinning.

With free slots, the old 4TB disks may be used as spares (warm or hot) until you have replaced the last one and grown your pool.
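
With a free slot, each swap is a single replace command, and the old disk keeps providing redundancy until the copy finishes. Something like this sketch, with placeholder pool and disk names:

Code:
# New disk is already installed in the free slot.  The old disk
# stays online and in the pool until the resilver completes.
zpool replace tank gptid/OLD-DISK gptid/NEW-DISK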
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

One thing I prefer with ZFS is free slots. If your backup server has free disk slots, you can fill them with the replacement disks (warm spares), then replace 1 or 2 of the existing backup pool disks at a time. If you have 1 free slot and replace 2 pool disks at the same time, that's half the work of upgrading 6 disks.

ZFS has a neat option to replace disks with other disks already available to the OS. This is not hot-sparing; it is manual replacement, what I call warm sparing, since the disks are spinning.

With free slots, the old 4TB disks may be used as spares (warm or hot) until you have replaced the last one and grown your pool.
Unless they fixed it, online replacement is much slower than removing the old drive and resilvering onto the new one. I tested it a few years ago, so results may be different today.

 

Arwen
MVP · Joined: May 17, 2014 · Messages: 3,611

Unless they fixed it, online replacement is much slower than removing the old drive and resilvering onto the new one. I tested it a few years ago, so results may be different today.
Perhaps, but it's much SAFER. Since the online replacement is mirroring the existing disk, if a failure occurs on a different disk, we'd still have more redundancy. In the worst-case example, RAID-Z1, upgrading its disks by pull-and-replace means a loss of redundancy for the whole resilver.

That said, for the original poster, if he does have free disk slots, it can save him a trip or two to his backup server.
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Thanks for looking out, Arwen, but unfortunately I run it in a small mini-ITX case with only six drive slots, and my mobo only has six SATA ports.

I have been doing my normal HDD burn-in testing on my main rig, and it has really shown me the value of having extra SATA ports. Something to think about for my next rig.
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

So I am sold on doing at least two at a time. From some testing I have been doing here, it looks like if I offline the three disks and hold onto them before wiping them, I could resilver from the three I removed if anything went wrong during the resilver. Essentially, I would get two tries in the unlikely event of a failure during the 8-12 hours of each resilver job.

If so, I could reasonably drive over, start the resilver with three, find somewhere to sleep / take photos, then come back, replace the last three, and get on the road. I have a VPN and can watch / monitor from afar.
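
Roughly what I plan to run for each batch of three; the gptid names are placeholders for the real disk labels, and I would likely do the swaps themselves from the FreeNAS GUI:

Code:
# Take the three old disks offline cleanly so they stay reusable.
zpool offline BrianNASbackup gptid/OLD-1
zpool offline BrianNASbackup gptid/OLD-2
zpool offline BrianNASbackup gptid/OLD-3

# Pull them, insert the 8TB drives, then start the replacements.
zpool replace BrianNASbackup gptid/OLD-1 gptid/NEW-1
zpool replace BrianNASbackup gptid/OLD-2 gptid/NEW-2
zpool replace BrianNASbackup gptid/OLD-3 gptid/NEW-3

# Monitor the resilver remotely over the VPN.
zpool status BrianNASbackup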
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

Let us know how it goes.

 

Arwen
MVP · Joined: May 17, 2014 · Messages: 3,611

Thanks for looking out, Arwen, but unfortunately I run it in a small mini-ITX case with only six drive slots, and my mobo only has six SATA ports.

I have been doing my normal HDD burn-in testing on my main rig, and it has really shown me the value of having extra SATA ports. Something to think about for my next rig.
Not just free SATA or SAS ports; a free disk slot can help too. You can even use it for a single-disk pool of scratch / temp data until you need it for a replacement or backups.
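
For example, a throwaway single-disk pool on the spare bay takes seconds to set up. The device name below is a placeholder, and a single disk has no redundancy, so nothing important should live on it:

Code:
# One-disk scratch pool in the spare bay (no redundancy!).
zpool create scratch ada6

# When the bay is needed for a replacement, free it again.
zpool destroy scratch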
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Not just free SATA or SAS ports; a free disk slot can help too. You can even use it for a single-disk pool of scratch / temp data until you need it for a replacement or backups.

I completely agree. My next build will likely incorporate an extra disk as scratch that could be remotely swapped in, in the event of a disk failure.

Let us know how it goes.

Will do!

Thanks all!
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Hey all, I just finished replacing the drives. I replaced them three at a time, and I believe it went smoothly. However, I had to shut down the box in the middle of the second three-drive resilver, and I am worried that I may have created some errors, since when it came back up it had some checksum errors.
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

zpool status before shutting down:

Code:
  pool: BrianNASbackup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan  2 12:12:15 2018
        4.62T scanned at 602M/s, 1.63T issued at 212M/s, 12.0T total
        792G resilvered, 13.55% done, 0 days 14:17:17 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        BrianNASbackup                                  ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/d294ba3d-ef88-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/ef744f82-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     0  (resilvering)
            gptid/c3abd8dd-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     0  (resilvering)
            gptid/80c26e18-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     0  (resilvering)
            gptid/5d12e65c-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/8fa96007-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:08:30 with 0 errors on Tue Jan  2 03:53:31 2018
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/067a05fc-358f-11e6-a190-000000000000  ONLINE       0     0     0

errors: No known data errors


Output of zpool status after restarting:

Code:
  pool: BrianNASbackup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan  2 12:12:15 2018
        1.40T scanned at 1.45G/s, 720K issued at 2.74K/s, 12.0T total
        0 resilvered, 0.00% done, no estimated completion time
config:

        NAME                                            STATE     READ WRITE CKSUM
        BrianNASbackup                                  ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/d294ba3d-ef88-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/ef744f82-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     2
            gptid/c3abd8dd-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     1
            gptid/80c26e18-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     1
            gptid/5d12e65c-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/8fa96007-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:08:30 with 0 errors on Tue Jan  2 03:53:31 2018
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/067a05fc-358f-11e6-a190-000000000000  ONLINE       0     0     0

errors: No known data errors


and now:

Code:
  pool: BrianNASbackup
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 5.71T in 0 days 12:27:15 with 0 errors on Wed Jan  3 00:39:30 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        BrianNASbackup                                  ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/d294ba3d-ef88-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/ef744f82-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     2
            gptid/c3abd8dd-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     1
            gptid/80c26e18-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     1
            gptid/5d12e65c-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/8fa96007-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:08:30 with 0 errors on Tue Jan  2 03:53:31 2018
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/067a05fc-358f-11e6-a190-000000000000  ONLINE       0     0     0

errors: No known data errors
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

I got this message from FreeNAS once it finished resilvering:

Code:
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.

as above.

I checked the SMART output of each drive, and it is unchanged from before. (During my burn-in testing I introduced one UDMA CRC error on two of the drives via a bad cable and by pulling the cable in and out a few times while the drive was still on.) There are no new SMART errors.
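
For anyone curious, I'm pulling the relevant counters with smartmontools on each drive, along these lines (device names vary by slot):

Code:
smartctl -a /dev/ada0 | grep -E 'Reallocated|Pending|UDMA_CRC'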
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

Did it finish resilvering the pool after the reboot?

Did you try doing a scrub of the pool since the checksum error came up?
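
If you haven't, a scrub followed by clearing the counters is the usual next step; something like:

Code:
zpool scrub BrianNASbackup
zpool status BrianNASbackup    # wait for the scrub to finish

# If the scrub comes back clean, reset the error counters.
zpool clear BrianNASbackup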
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

I am scrubbing the pool again now, and I will follow up with long SMART tests of the drives. But is this just an error from shutting down during the resilver, one that I don't need to worry about since ZFS fixed the problem?
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Did it finish resilvering the pool after the reboot?

Did you try doing a scrub of the pool since the checksum error came up?


Yep, it resumed resilvering on its own when it started back up, without any intervention from me, and I am scrubbing the pool now.
 