Replacing HDDs in Offsite Backup - Best Practice Question

How many drives would you replace at a time?

  • 1: 1 vote (25.0%)
  • 2: 2 votes (50.0%)
  • 3: 1 vote (25.0%)

Total voters: 4

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Hey gang,

I currently have an offsite server that does daily backups of my primary server. The offsite server is a six-drive raidz3:

Supermicro X10SDV-4C-TLN2F
32GB ECC RAM
6 x 4TB WD Red
Fractal Node 304

I recently purchased six 8TB HDDs and want to swap them into the backup server to get a larger pool. I know how to replace the drives, and I know how I could do it one at a time.
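
For reference, the one-at-a-time procedure I had in mind looks roughly like this; the gptid names are placeholders for the real disk labels, and I would likely drive it from the FreeNAS GUI rather than the shell:

Code:
# Take the old disk offline cleanly so it could be reinserted later.
zpool offline BrianNASbackup gptid/OLD-DISK

# Physically swap in the 8TB drive, then start the resilver.
zpool replace BrianNASbackup gptid/OLD-DISK gptid/NEW-DISK

# Watch the resilver.
zpool status BrianNASbackup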

As this box is six hours away and it would be a hassle to stay while the drives resilver, I wanted to ask the community how many drives you would feel comfortable replacing at the same time in this raidz3 setup. I am a little sheepish about the idea of offlining two drives at a time and resilvering both at once, but all the current drives are in great condition and could be swapped back in if a resilver failed.

How many drives would you replace at a time? Would you "offline" the drives so that they could be reinserted later on, rather than ending up with a terminally degraded pool?
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

How many drives would you replace at a time? Would you "offline" the drives so that they could be reinserted later on, rather than ending up with a terminally degraded pool?
You have a healthy pool, so the risk is low: the drives are not very old, they are reporting as healthy, and your source pool is healthy. Since this is just a backup and you have RAID-Z3, you can replace three drives at the same time. I have replaced two drives at once in my RAID-Z2, and it doesn't take much longer to resilver two drives than it does to resilver one. I wouldn't suggest that if this were your only copy of the data. It will still take quite a while depending on how much data you have, so you might have to spend the night and do the other three drives the next day. On the drives in one of my servers at work, where there is about 3TB of data on a 6TB drive, a resilver takes over 12 hours.
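If you go that route, you can kick off all three replacements back to back and ZFS will resilver them in a single pass. At the command line it would look roughly like this, with placeholder pool and disk names (the FreeNAS GUI's volume status screen does the same thing):

Code:
# Start all three replacements; they resilver together in one pass.
zpool replace tank gptid/OLD-1 gptid/NEW-1
zpool replace tank gptid/OLD-2 gptid/NEW-2
zpool replace tank gptid/OLD-3 gptid/NEW-3

# Check progress.
zpool status tank

One other thing: once all six are swapped, the extra space only shows up if autoexpand is on ('zpool set autoexpand=on tank'), though if I remember right, FreeNAS enables that by default on pools it creates.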
How much data do you have on the system?
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

The pool is 55% used, so each disk should be about half full, and the 8TB drives would end up about a quarter full.
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Thanks for the response. My source pool is a healthy raidz2. Both pools get scrubbed every week and have no errors or reallocated sectors.
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

The pool is 55% used, so each disk should be about half full, and the 8TB drives would end up about a quarter full.
It depends on some variables I can only guess about, but you should expect it to take 8 to 10 hours to resilver, so doing one drive at a time would take days.
The way you describe the pool, six drives total with three of them parity, you could do it in two sets of three and be done in about 24 hours.
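Back of the envelope, and only a rough guess: 55% of a 4TB disk is about 2.2TB to rebuild per drive. At a sustained real-world resilver rate somewhere in the 50-80MB/s range, that works out to roughly 8 to 12 hours per pass, which is where my estimate comes from. A drive's sequential maximum would finish sooner, but resilvers rarely run flat out.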
 
Joined: May 10, 2017 · Messages: 838

I've never used raidz3, but resilvering two disks at once on raidz2 took the same time as resilvering just one, so doing three at the same time should be much faster than doing them one by one.
 

Arwen
MVP · Joined: May 17, 2014 · Messages: 3,611

One thing I prefer with ZFS is free slots. If your backup server has free disk slots, you can fill them with the replacement disks (warm spares), then replace 1 or 2 of the existing backup pool disks at a time. If you have 1 free slot and replace 2 pool disks at the same time, that's half the work of upgrading 6 disks.

ZFS has a neat option to replace disks with other disks already available to the OS. This is not hot-sparing; it is manual replacement, what I call warm sparing, since the disks are spinning.

With free slots, the old 4TB disks may be used as spares (warm or hot) until you have replaced the last one and grown your pool.
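
With a free slot, each swap is a single replace command, and the old disk keeps providing redundancy until the copy finishes. Something like this sketch, with placeholder pool and disk names:

Code:
# New disk is already installed in the free slot.  The old disk
# stays online and in the pool until the resilver completes.
zpool replace tank gptid/OLD-DISK gptid/NEW-DISK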
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

One thing I prefer with ZFS is free slots. If your backup server has free disk slots, you can fill them with the replacement disks (warm spares), then replace 1 or 2 of the existing backup pool disks at a time. If you have 1 free slot and replace 2 pool disks at the same time, that's half the work of upgrading 6 disks.

ZFS has a neat option to replace disks with other disks already available to the OS. This is not hot-sparing; it is manual replacement, what I call warm sparing, since the disks are spinning.

With free slots, the old 4TB disks may be used as spares (warm or hot) until you have replaced the last one and grown your pool.
Unless they fixed it, online replacement is much slower than removing the old drive and resilvering onto the new one. I tested it a few years ago, so results may be different today.

 

Arwen
MVP · Joined: May 17, 2014 · Messages: 3,611

Unless they fixed it, online replacement is much slower than removing the old drive and resilvering onto the new one. I tested it a few years ago, so results may be different today.
Perhaps, but it's much SAFER. Since the online replacement is mirroring the existing disk, if a failure occurs on a different disk, we'd still have more redundancy. In the worst-case example, RAID-Z1, upgrading its disks by pull-and-replace means a loss of redundancy for the whole resilver.

That said, for the original poster, if he does have free disk slots, it can save him a trip or two to his backup server.
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Thanks for looking out, Arwen, but unfortunately I run it in a small mini-ITX case with only six drive slots, and my mobo only has six SATA ports.

I have been doing my normal HDD burn-in testing on my main rig, and it has really shown me the value of having extra SATA ports. Something to think about for my next rig.
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

So I am sold on doing at least two at a time. From some testing I have been doing here, it looks like if I offline the three disks and hold onto them before wiping them, I could resilver from the three I removed if anything went wrong during the resilver. Essentially, I would get two tries in the unlikely event of a failure during the 8-12 hours of each resilver job.

If so, I could reasonably drive over, start the resilver with three, find somewhere to sleep / take photos, then come back, replace the last three, and get on the road. I have a VPN and can watch / monitor from afar.
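
Roughly what I plan to run for each batch of three; the gptid names are placeholders for the real disk labels, and I would likely do the swaps themselves from the FreeNAS GUI:

Code:
# Take the three old disks offline cleanly so they stay reusable.
zpool offline BrianNASbackup gptid/OLD-1
zpool offline BrianNASbackup gptid/OLD-2
zpool offline BrianNASbackup gptid/OLD-3

# Pull them, insert the 8TB drives, then start the replacements.
zpool replace BrianNASbackup gptid/OLD-1 gptid/NEW-1
zpool replace BrianNASbackup gptid/OLD-2 gptid/NEW-2
zpool replace BrianNASbackup gptid/OLD-3 gptid/NEW-3

# Monitor the resilver remotely over the VPN.
zpool status BrianNASbackup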
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

Let us know how it goes.

 

Arwen
MVP · Joined: May 17, 2014 · Messages: 3,611

Thanks for looking out, Arwen, but unfortunately I run it in a small mini-ITX case with only six drive slots, and my mobo only has six SATA ports.

I have been doing my normal HDD burn-in testing on my main rig, and it has really shown me the value of having extra SATA ports. Something to think about for my next rig.
Not just free SATA or SAS ports; a free disk slot can help too. You can even use it for a single-disk pool of scratch / temp data until you need it for a replacement or backups.
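
For example, a throwaway single-disk pool on the spare bay takes seconds to set up. The device name below is a placeholder, and a single disk has no redundancy, so nothing important should live on it:

Code:
# One-disk scratch pool in the spare bay (no redundancy!).
zpool create scratch ada6

# When the bay is needed for a replacement, free it again.
zpool destroy scratch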
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Not just free SATA or SAS ports; a free disk slot can help too. You can even use it for a single-disk pool of scratch / temp data until you need it for a replacement or backups.

I completely agree. My next build will likely incorporate an extra disk as scratch that could be remotely swapped in, in the event of a disk failure.

Let us know how it goes.

Will do!

Thanks all!
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Hey all, I just finished replacing the drives. I replaced them three at a time, and I believe it went smoothly. However, I had to shut down the box in the middle of the second three-drive resilver, and I am worried that I may have created some errors, since when it came back up it had some checksum errors.
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

zpool status before shutting down:

Code:
  pool: BrianNASbackup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan  2 12:12:15 2018
        4.62T scanned at 602M/s, 1.63T issued at 212M/s, 12.0T total
        792G resilvered, 13.55% done, 0 days 14:17:17 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        BrianNASbackup                                  ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/d294ba3d-ef88-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/ef744f82-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     0  (resilvering)
            gptid/c3abd8dd-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     0  (resilvering)
            gptid/80c26e18-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     0  (resilvering)
            gptid/5d12e65c-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/8fa96007-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:08:30 with 0 errors on Tue Jan  2 03:53:31 2018
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/067a05fc-358f-11e6-a190-000000000000  ONLINE       0     0     0

errors: No known data errors


Output of zpool status after restarting:

Code:
  pool: BrianNASbackup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan  2 12:12:15 2018
        1.40T scanned at 1.45G/s, 720K issued at 2.74K/s, 12.0T total
        0 resilvered, 0.00% done, no estimated completion time
config:

        NAME                                            STATE     READ WRITE CKSUM
        BrianNASbackup                                  ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/d294ba3d-ef88-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/ef744f82-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     2
            gptid/c3abd8dd-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     1
            gptid/80c26e18-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     1
            gptid/5d12e65c-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/8fa96007-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:08:30 with 0 errors on Tue Jan  2 03:53:31 2018
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/067a05fc-358f-11e6-a190-000000000000  ONLINE       0     0     0

errors: No known data errors


and now:

Code:
  pool: BrianNASbackup
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 5.71T in 0 days 12:27:15 with 0 errors on Wed Jan  3 00:39:30 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        BrianNASbackup                                  ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/d294ba3d-ef88-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/ef744f82-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     2
            gptid/c3abd8dd-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     1
            gptid/80c26e18-efdf-11e7-b01b-0cc47a7e3de0  ONLINE       0     0     1
            gptid/5d12e65c-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0
            gptid/8fa96007-ef89-11e7-86d7-0cc47a7e3de0  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:08:30 with 0 errors on Tue Jan  2 03:53:31 2018
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/067a05fc-358f-11e6-a190-000000000000  ONLINE       0     0     0

errors: No known data errors
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

I got this message from FreeNAS once it finished resilvering:

Code:
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.

as above.

I checked the SMART output of each drive, and it is unchanged from before. (During my burn-in testing I introduced one UDMA CRC error on two of the drives via a bad cable and by pulling the cable in and out a few times while the drive was still on.) There are no new SMART errors.
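
For anyone curious, I'm pulling the relevant counters with smartmontools on each drive, along these lines (device names vary by slot):

Code:
smartctl -a /dev/ada0 | grep -E 'Reallocated|Pending|UDMA_CRC'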
 

Chris Moore
Hall of Famer · Joined: May 2, 2015 · Messages: 10,079

Did it finish resilvering the pool after the reboot?

Did you try doing a scrub of the pool since the checksum error came up?
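
If you haven't, a scrub followed by clearing the counters is the usual next step; something like:

Code:
zpool scrub BrianNASbackup
zpool status BrianNASbackup    # wait for the scrub to finish

# If the scrub comes back clean, reset the error counters.
zpool clear BrianNASbackup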
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

I am scrubbing the pool again now, and I will follow up with long SMART tests of the drives. But is this just an error from shutting down during the resilver, one that I don't need to worry about since ZFS fixed the problem?
 

fricker_greg
Explorer · Joined: Jun 4, 2016 · Messages: 71

Did it finish resilvering the pool after the reboot?

Did you try doing a scrub of the pool since the checksum error came up?


Yep, it resumed resilvering on its own when it started back up, without any intervention from me, and I am scrubbing the pool now.
 