Pool Upgrade Procedure Review, Please

Redcoat

I’m hoping for review and comments on my situation:

BACKGROUND:
  • I have two TrueNAS 12.0-U4 installs – nas3 and nas4 – both created on “new” hardware with salvaged HDDs after a house fire a couple of years ago.
  • nas3 has 4 HDDs recovered from my old FreeNAS Mini with a Z2 pool Volume1
  • nas4 has 6 HDDs recovered from my old Dell C2100/FS12-TY with a Z2 pool tank
  • Both of the “new” SM X9SCM-F-based boxes have been running 18 months or so, with a couple of drive replacements – all is stable.
  • nas4 is a backup machine (replication target) and contains a full set of 6-hourly snapshots of Volume1 of nas3.
  • nas3 is shortly going to hit 80% pool capacity so I want to add 2 drives (all 6 will be 4TB CMR WD Reds) using my two burned-in spares (I have 2 4TB Red+’s on burn-in now to become my two future spares).
  • As soon as those 2 new spares are available, I’ll install my current spares in nas3 to give it 6 disks total, providing capability of a new 6-disk Z2 pool.

TASK FOR COMMENT/CHECK/SUGGESTIONS:

I know that I have to nuke the existing 4-HDD pool nas3 Volume1 and replicate the backup snapshots over from nas4 tank to the new 6-HDD pool Volume1, then import that new Volume1. I did a lot of reading. The old “the more I read, the less certain I became” syndrome kicked in, so to try to get my head straight on the procedure I spun up a VM and tested the above with two pools on one machine – works fine. I added another VM and tested between the two VMs – also works fine.

So here’s my final checklist – open for comment, correction, caveat – whatever. All contributions gratefully received.

  • Step 1. Export nas3 System Dataset from pool Volume1 to Boot Drive.
  • Step 2. Immediately after a scheduled total Volume1 replication from nas3 to nas4, shut down any scheduled tasks involving nas3 pending reestablishment of the new 6-disk pool.
  • Step 3. Save TN config of nas3 via GUI.
  • Step 4. Export pool Volume1 and destroy it to recover the disk space.
  • Step 5. Create new pool Volume1 from the 6 disks. (A rough CLI sketch of Steps 4 and 5 follows this list; I'll actually do both through the GUI.)
  • Step 6. Confirm 10G SSH connections between nas3 and nas4.
  • Step 7. Open a “pull” Replication Task on nas3, select the new pool nas3 “Volume1” as the destination and the nas4 backup directory containing the latest root snapshot of nas3 as the source, check “Recursive”, then “Run Once”; allow to finish.
  • Step 8. Reload saved nas3 configuration (which I guess is insensitive to pool size?).
  • Step 9. Import pool Volume1. (it seems upload of saved config in Step 8 may import the pool?)
  • Step 10. Restart and test all tasks shut down in Step 2.
  • Step 11. Breathe again…
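For context, here is roughly what Steps 4 and 5 amount to at the command line. This is only to illustrate the sequence - I'll do both through the GUI (Storage > Pools), which also takes care of partitioning - and the da0-da5 disk names are placeholders, not my actual devices:

    # Step 4 equivalent: destroy the old 4-disk pool (GUI: Export/Disconnect with "Destroy data" checked)
    zpool destroy Volume1
    # Step 5 equivalent: create the new 6-wide RAIDZ2 pool
    zpool create Volume1 raidz2 da0 da1 da2 da3 da4 da5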
 
Basil Hendroff

@Redcoat I had the same uneasy feeling when I had to restore a pool recently. I've documented my experience in this thread: Pool restoration journey TrueNAS 12. I think the feeling of uneasiness comes about because we don't restore pools often enough to feel comfortable with the process. That, coupled with the change in OS and GUI between restores, makes the process seem even more unfamiliar.

Having a look at your checklist:

Step 1: I'm curious. What's the purpose of this step? I don't think I had the opportunity to consider this when I restored my pool.
Step 3: Important! I went through considerable pain because I didn't do this.
Step 6: Change this to restore config. You'll find that the original configuration details for the SSH connection and replication tasks are restored from the saved config.
Step 7: TN 12 has a new feature that makes it very easy to create a pull replication from the original replication task. See https://www.truenas.com/community/threads/pool-restoration-journey-truenas-12.93581/post-647742.
Step 9: Not relevant? You're restoring Volume1 via the pull replication.

Any pre-existing jails?

If you rework your checklist, I'll look over it again.
 

Redcoat

Thanks @Basil Hendroff for the thoughtful response. I'll read your thread - thanks for the link. I don't know why I hadn't already seen it, but it sounds like it grew out of some of the same concerns at least.

Step 1 - today's default behavior puts the System Dataset on the first-created pool. I didn't want to risk nuking my current system. (Quick check below.)
Step 6 - thanks for the tip.
Step 7 - I'm indeed planning on using the new feature - actually that's part of the stimulus for writing this up. None of the past sets of notes I found used that procedure.
Step 9 - see my “?” there. I was surprised that the restore included the import, as I hadn't recognized that in any of my reading.
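On that Step 1 check: before I destroy anything I'll confirm where the System Dataset actually lives. The GUI setting is under System > System Dataset; from the shell, the .system child datasets show up under whichever pool currently hosts it (dataset names below are the TrueNAS defaults):

    # if this still returns Volume1/.system entries, the System Dataset hasn't moved yet
    zfs list -o name | grep -i '\.system'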

Pre-existing jails?? I have not given them "special consideration". Sounds like I've missed something. I have 3 - Plex, Unifi and Nextcloud. What's the poop on them?
 
Basil Hendroff

Pre-existing jails?? I have not given them "special consideration".
I'm afraid you're going to have to be the point man on this.

I didn't have to consider jails during my restore operation. My focus was around an Ubuntu VM serving Docker containers that I had set up on that server. Like my jails, container data is saved outside the container and stored in a pool dataset. All this came across nicely apart from one little glitch identified as Issue #3 in this post https://www.truenas.com/community/threads/pool-restoration-journey-truenas-12.93581/post-647746. This has been addressed and the fix will be available in 12.0-U5.

I've only ever backed up datasets and not whole pools, so I'm not sure whether jails will have to be recreated or whether they are saved as part of pool replication. I'm assuming your data for each of the jails is served from outside the jail. If you're confident that jails will be restored, then ignore my rambling. If not, you might want to export your jails and make sure the exports are replicated across to your backup system. I found this Lawrence Systems video Moving, Backing Up & Restoring FreeNAS IO Cage Jails particularly useful when considering jails.
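If you do decide to export them as insurance, the commands are roughly as below. I'm assuming the jail names match what you listed (adjust to your actual names), and that each jail is stopped before export; the resulting archives land under the pool's iocage/images dataset, so make sure they end up somewhere that gets replicated to nas4:

    iocage stop plex && iocage export plex
    iocage stop unifi && iocage export unifi
    iocage stop nextcloud && iocage export nextcloud
    # later, once iocage is activated on the rebuilt pool:
    iocage import plex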

Btw, one of your final steps should be to change the read-only state of the restored datasets (pool?) from true to false.

Step 9 - see my “?” there. I was surprised that the restore included the import, as I hadn't recognized that in any of my reading.

Hmm... you're not actually importing the pool, you're restoring it. My understanding of importing is that it is attaching an existing pool to a TrueNAS server, whereas restoring it is recreating the pool data by replicating it from a backup.
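In command terms the distinction looks roughly like this. It's only a sketch - the dataset and snapshot names are made up, and in your case the GUI replication task does the send/receive for you:

    # importing: attach a pool whose data already exists on the disks
    zpool import Volume1
    # restoring: recreate the data by replicating snapshots back from the backup server
    zfs send -R tank/nas3-backup@latest-snap | ssh nas3 zfs recv -F Volume1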
 

Redcoat

@Basil Hendroff, thanks again. New drive burn-in successfully completed. Time to take the plunge on the new pool creation and restore.

Thanks for prompting me to read up on jail backup/restore. I'm still not certain that they will restore from a full recursive snapshot backup, but I made a separate iocage export backup anyway. I'll report back when I do the restore...

Here's the current checklist:
  • Step 1. Export nas3 System Dataset from pool Volume1 to Boot Drive.
  • Step 2. Immediately after a scheduled total Volume1 replication from nas3 to nas4, shut down any scheduled tasks involving nas3 pending reestablishment of the new 6-disk pool.
  • Step 3. Save TN config of nas3 via GUI.
  • Step 4. Export pool Volume1 and destroy it to recover the disk space.
  • Step 5. Create new pool Volume1 from the 6 disks.
  • Step 6. Restore config. Confirm 10G SSH connections between nas3 and nas4.
  • Step 7. Open a TN12 “pull” Replication Task on nas3, select the new pool nas3 “Volume1” as the destination and the nas4 backup directory containing the latest root snapshot of nas3 as the source, check “Recursive”, then “Run Once”; allow it to finish. (Sanity check after this list.)
  • Step 8. Reload saved nas3 configuration (which I guess is insensitive to pool size?).
  • Step 9. Import pool Volume1. (it seems upload of saved config in Step 8 may import the pool?)
  • Step 10. Restart and test all tasks shut down in Step 2.
  • Step 11. Breathe again…
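The sanity check mentioned in Step 7: before touching anything else, I plan to confirm the replicated datasets and snapshots actually landed, along the lines of:

    # datasets that came across
    zfs list -r Volume1
    # and their snapshots
    zfs list -t snapshot -r Volume1 | head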
Expect to start this after midnight this eve.
 
Basil Hendroff

Somewhere between Steps 7 and 10, you'll need to change the read-only state of the restored datasets.

Looks good. I think you've walked through this in your mind several times now and have done all you can, within reason, to plan and streamline the process of rebuilding the system and minimise any surprises along the way.

Any last minute thoughts from other community members?

Good luck tonight. I'm keen to hear back on how you went and, with the benefit of hindsight, how you might further tweak the checklist.
 

Redcoat

Somewhere between Steps 7 and 10, you'll need to change the read-only state of the restored datasets.
Thx for the reminder, and the encouragement!
Later...
 

Redcoat

Good luck tonight. I'm keen to hear back on how you went and, with the benefit of hindsight, how you might further tweak the checklist.
Regret to report that, while some things went well, other major matters didn't.

Your alert to me on jails was a needed heads-up - I backed them up and restored them just fine. Thanks for that.

I struggled with setting up the pull replication until I realized that the automatic populating of the naming schema was missing the "-2w" lifetime information suffix. I messed with that for probably 20-30 minutes, including trying a push from the other side, before finally realizing what the error message was not conveying to me!
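For anyone who hits the same thing, the mismatch looked roughly like this (from memory, so treat the first line as my recollection of the pre-filled default rather than gospel):

    auto-%Y-%m-%d_%H-%M        (the schema the replication task pre-filled)
    auto-%Y-%m-%d_%H-%M-2w     (the schema my existing snapshots actually follow)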

As you said, I did have to reset the read-only permissions - time-consuming and tedious, but got it done.

On the bad side, I'm still assessing how much trouble I am in. Permissions went to hell and I've some research to do to work myself through it...

Though I thought I had it under control, it turned out that not all of the backup restored - I missed a folder that had a lot of subfolders under it. I can see the content in WinSCP but can't yet transfer it. I thought that the content was snapshots, but it appears not to be. I cannot see those folders under "pools" on nas4 - this I definitely don't understand. I wish I had made screenshots before I started - I certainly will in future.
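My next step is to compare what exists on each side so I can see exactly what is missing, something like:

    # on nas4 (backup side)
    zfs list -r -o name,used tank
    # on nas3 (restored side)
    zfs list -r -o name,used Volume1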

Another issue is that I failed to properly disable a 4-hourly snapshot task (I know I did so, but I now believe that the reload of my config re-enabled it). So, as I watched, at 04h00 a replication began, failed to find appropriate material in the target, and so deleted all the folders and contents along with the snapshots that had just been replicated!
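Lesson for next time: after reloading the config, verify that every snapshot and replication task really is disabled before walking away. The GUI Tasks pages show this; I believe the middleware can also be queried from the shell, though treat the namespaces below as unverified on my part:

    # eyeball the "enabled" field of each returned task
    midclt call pool.snapshottask.query
    midclt call replication.query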

Tasks to be done now:
  • Fix all the permissions issues. I can tell I have a lot to learn in this regard (not sure how I managed up to this point, but I had so little difficulty I never really focused on it). I'm actually not yet sure where to start.
  • Move the balance of the data that did not transfer from nas4 to nas3.
  • Clean up nas4 and set up a new replication task nas3 > nas4.

I'll be back at it after I have taken a nap.
 
Basil Hendroff

I struggled with setting up the pull replication until I realized that the automatic populating of the naming schema was missing the "-2w" lifetime information suffix. I messed with that for probably 20-30 minutes, including trying a push from the other side, before finally realizing what the error message was not conveying to me!
Curious. I wasn't presented with this issue. I recall there was a flurry of threads on the forum around the time the replication engine was changed in 11.3. Replication was patched through maintenance releases and it took some time for community members to become comfortable with the changes. I recognise the suffix from pre-11.3 schemas, but I don't remember using it post-11.3. I must have rebuilt my replication tasks somewhere along the way. I'm pretty sure I used the replication wizard for this. I've been using the default schema successfully under 12.0. Here's an example:

[Screenshots tn02.jpg and tn03.jpg: replication task settings showing the default naming schema in use under 12.0]


As you said, I did have to reset the read-only permissions - time-consuming and tedious, but got it done.
Unless you have a lot of datasets, that shouldn't have been the case. It should have just been a matter of changing the state of the read-only flag through the UI for each dataset (see this post). Note: If you set the flag to 'inherit', the read-only state of a dataset will be determined by the pool read-only state.

On the bad side, I'm still assessing how much trouble I am in. Permissions went to hell and I've some research to do to work myself through it...
That's unexpected. Ownership and permissions on files should have carried across in both the backup and restore operations.

Though I thought I had it under control, it turned out that not all of the backup restored - I missed a folder that had a lot of subfolders under it. I can see the content in WinSCP but can't yet transfer it. I thought that the content was snapshots, but it appears not to be. I cannot see those folders under "pools" on nas4 - this I definitely don't understand. I wish I had made screenshots before I started - I certainly will in future.
That's surprising given you're replicating the volume as opposed to individual datasets (which I do).

Another issue is that I failed to properly disable a 4-hourly snapshot task (I know I did so, but I now believe that the reload of my config re-enabled it). So, as I watched, at 04h00 a replication began, failed to find appropriate material in the target, and so deleted all the folders and contents along with the snapshots that had just been replicated!
Bummer!

Clean up nas4 and set up a new replication task nas3 > nas4.
Switch away from the LEGACY transport and use one of the newer transports, such as SSH or SSH+NETCAT, when you do this.
 

Redcoat

@Basil Hendroff, thanks again for the supportive comments.

I recognise the suffix from pre-11.3 schemas, but I don't remember using it post-11.3. I must have rebuilt my replication tasks somewhere along the way. I'm pretty sure I used the replication wizard for this. I've been using the default schema successfully under 12.0.
Yes - I chose to keep the -2w suffix as I found it convenient to be able to instantly identify lifetimes. I was surprised that the default schema was used to populate the field when a past replication task (that had the -2w) was selected as a starting point.

It should have just been a matter of changing the state of the read-only flag through the UI for each dataset
Ah - I did not recognize this - I ssh'd in and used "zfs set readonly=off <datasetname>". I'll try the GUI next time. At least I got a chance to brush up on my ssh command-line skills...
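In case it saves someone else the per-dataset tedium, a one-pass version of what I did by hand would be something like this (untested by me in this exact form):

    # flip readonly off for every dataset in the pool
    zfs list -H -o name -r Volume1 | xargs -n1 zfs set readonly=off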

That's unexpected. Ownership and permission on files should have carried across in both backup and restore operations.
I'm going to dive into this today.

That was my "highest potential for damage" error, I think. But everything that I considered really important seems to exist in the restored datasets so I think I dodged a bullet there.

Switch away from the LEGACY transport and use one of the newer transports, such as SSH or SSH+NETCAT, when you do this.
Yes, I did that way back when I installed my 10G cards for inter-server transfer.

Thanks again, have a good day.
 