Backup and Restore Habitat Depot

Hello,

I am working through a basic script to back up and restore a Habitat depot. The backup portion seems to be working (most of my code is roughly based on this GH issue thread: https://github.com/habitat-sh/on-prem-builder/issues/60).
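Roughly, that approach boils down to dumping the builder Postgres databases and archiving the Minio data directory. A simplified sketch of the idea (default on-prem paths, port, and database names assumed here - this is not my exact script):

# Simplified backup sketch - paths, port, and DB names are assumptions.
BACKUP_DIR=/tmp/depot-backup-$(date +%Y%m%d)
mkdir -p "${BACKUP_DIR}"

# Dump the builder databases using the datastore's generated password.
export PGPASSWORD=$(sudo cat /hab/svc/builder-datastore/config/pwfile)
pg_dump -h 127.0.0.1 -p 5432 -U hab builder_originsrv  > "${BACKUP_DIR}/builder_originsrv.sql"
pg_dump -h 127.0.0.1 -p 5432 -U hab builder_sessionsrv > "${BACKUP_DIR}/builder_sessionsrv.sql"

# Archive the Minio data directory (the package artifacts themselves).
sudo tar -czf "${BACKUP_DIR}/builder-minio-data.tar.gz" /hab/svc/builder-minio/data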

However, when I restore our production Habitat depot onto a new, clean-slate instance using the archive(s) created from the above GH issue, everything appears to restore correctly (Postgres tables, Minio datastore, etc.). But when I restart the services, the packages from the production depot are not present on the newly created and restored dev instance. The files are on disk, but when I log in to the Habitat UI and search for my packages, they don’t come up.

It feels like I need to trigger some other type of reindex or refresh of the backend metadata, so the UI can “catch up”.

Has anyone else run into this, or have pointers about where I should start looking?

Hi Kyle - it is possible that you will need to run the shard migration script on your new instance, as detailed in the on-prem README - https://github.com/habitat-sh/on-prem-builder

It is likely that when you created the new instance, you installed the latest versions of the Builder services, which require the shard migration. You may also need to do a Minio migration, since the package files now live in Minio rather than on the file system. Again, the migration information is in the README.

You can check the Builder service versions by running sudo hab svc status, and comparing the versions between your old instance and your new one.
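For example, something like this on each box gives you a file you can diff (just a convenience, not required):

sudo hab svc status | tee /tmp/svc-status-$(hostname).txt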

Please post the service versions, and any output from journalctl -fu hab-sup if the above does not work. Thanks!

All of the versions are the same on prod vs dev, with only a minor version change for builder-sessionsrv & builder-originsrv (prod = v7519, dev = v7582).

# Prod

package                                         type        desired  state  elapsed (s)  pid   group
habitat/builder-sessionsrv/7519/20180731190110  standalone  up       up     1646944      6050  builder-sessionsrv.default
habitat/builder-router/7519/20180731190111      standalone  up       up     1646945      5981  builder-router.default
habitat/builder-originsrv/7519/20180731190110   standalone  up       up     1646943      6248  builder-originsrv.default
habitat/builder-minio/0.1.0/20180612201128      standalone  up       up     96044        7208  builder-minio.default
habitat/builder-datastore/7311/20180426183913   standalone  up       up     1646945      5993  builder-datastore.default
habitat/builder-api-proxy/7519/20180731190110   standalone  up       up     1646944      6210  builder-api-proxy.default
habitat/builder-api/7554/20180808175204         standalone  up       up     1646944      6111  builder-api.default
# Dev

package                                         type        desired  state  elapsed (s)  pid    group
habitat/builder-sessionsrv/7582/20180822212645  standalone  up       up     378          19003  builder-sessionsrv.default
habitat/builder-router/7519/20180731190111      standalone  up       up     87833        9924   builder-router.default
habitat/builder-originsrv/7582/20180822212645   standalone  up       up     382          18903  builder-originsrv.default
habitat/builder-minio/0.1.0/20180612201128      standalone  up       up     527          18740  builder-minio.default
habitat/builder-datastore/7311/20180426183913   standalone  up       up     461          18799  builder-datastore.default
habitat/builder-api-proxy/7519/20180731190110   standalone  up       up     87813        10061  builder-api-proxy.default
habitat/builder-api/7554/20180808175204         standalone  up       up     87821        9945   builder-api.default

Attaching a log from journalctl -fu hab-sup. I stopped all services, started the journalctl command, then started all Habitat services. Once they were done (according to hab-sup.log), I refreshed the Builder UI (which asked me to authenticate again, which I did successfully). Once I was in, I searched for one of my packages, and it did not show up.
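(For reference, I dumped the log to a file with roughly the following, rather than copying it out of the terminal:)

# Capture the supervisor log from around the restart into a file (flags approximate).
sudo journalctl -u hab-sup --no-pager --since "30 minutes ago" > hab-sup.log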

EDIT: I can’t figure out how to attach files here; should I send the log somewhere?

Also, I did try running through the steps at https://github.com/habitat-sh/on-prem-builder#migration-1, but after I run ./uninstall.sh and ./install.sh, I don’t see a ./scripts/migrate.sh file, so I’m not sure whether I should continue.

Weird - how did you download the on-prem-depot repo?

We cloned it from the main GH repo, and are using our cloned repo to deploy.

Oh, I see the issue - the script is called merge-shards.sh.

Instructions are here: https://github.com/habitat-sh/on-prem-builder#merging-database-shards - but the script name is incorrect in the docs. Fixing that now.

Note that our repo doesn’t have a copy of that merge-shards.sh script, but I pulled a copy down from the current on-prem-depot repo and attempted to run it anyway.

drwxrwxr-x. 2 centos centos    111 Sep 24 20:10 .
drwxrwxr-x. 3 centos centos    119 Sep 24 20:10 ..
-rwxr-xr-x. 1 centos centos    779 Sep  7 19:43 hab-sup.service.sh
-rwxr-xr-x. 1 centos centos    315 Sep  7 19:43 install-hab.sh
-rw-r--r--. 1 centos centos   9755 Sep  7 19:43 on-prem-archive.sh
-rwxr-xr-x. 1 centos centos   7408 Sep  7 19:43 provision.sh
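(I grabbed the script with something along these lines - this assumes it still lives at scripts/merge-shards.sh on the upstream default branch:)

curl -fsSLo scripts/merge-shards.sh \
  https://raw.githubusercontent.com/habitat-sh/on-prem-builder/master/scripts/merge-shards.sh
chmod +x scripts/merge-shards.sh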

I got further this time, but it errored out when I attempted to run the script on our DEV instance (this being the instance we restored the pg backup to):

# PGPASSWORD=$(sudo cat /hab/svc/builder-datastore/config/pwfile) ./scripts/merge-shards.sh originsrv migrate
[ ... snip ... ]
current schema = shard_30
Count for shard_30.origins = 1
ERROR:  duplicate key value violates unique constraint "origins_name_key"
DETAIL:  Key (name)=(core) already exists.

Hi Kyle - in general, it is important not to grab things piecemeal from the repo - there may be configuration in provision.sh or other changes that could be impacting the migration and causing issues. The recommended path is to pull down the full repo (either directly, or by pulling all of the changes into your own repo), then do an uninstall (it is not destructive), and then do the install.
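In other words, roughly this on the instance (a sketch - keep your existing bldr.env or equivalent config handy before re-installing):

git clone https://github.com/habitat-sh/on-prem-builder.git
cd on-prem-builder
./uninstall.sh   # does not remove your data
./install.sh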

That said, we will look further at scoping down the root cause of the error you are seeing during migrate.

Hi Kyle,

As luck would have it, the two releases in question straddle a database migration that we did in the middle of August to merge all of the database schemas into the public schema. 7519 expects there to be 128 shards in the database and 7582 expects to see that data migrated in a very specific way.
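A quick way to see which layout a given database is in is to list its schemas - something like this (assuming the local datastore and the default hab superuser):

export PGPASSWORD=$(sudo cat /hab/svc/builder-datastore/config/pwfile)
psql -h 127.0.0.1 -U hab -d builder_originsrv -c \
  "SELECT schema_name FROM information_schema.schemata WHERE schema_name LIKE 'shard_%' ORDER BY schema_name;"
# Pre-migration you should see shard_0 through shard_127; after the merge, none.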

The error you’re getting suggests that data has already been inserted into the origins table of the public schema before merge-shards.sh is ever run, which shouldn’t happen until after the migration has run. If you open psql on your new dev instance, connect to the builder_originsrv database, and run SELECT * FROM origins;, I’m guessing you’ll see a record for the core origin.
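Something along these lines, for example (same assumptions as above about the local datastore):

export PGPASSWORD=$(sudo cat /hab/svc/builder-datastore/config/pwfile)
psql -h 127.0.0.1 -U hab -d builder_originsrv -c "SELECT * FROM public.origins;"
# A row for "core" here, before merge-shards.sh has run, is what triggers the duplicate key error.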

When you do your restore, it’s important that the merge-shards.sh script be run after the services have been started up (to make sure the migrations have been run), but before any clients connect to the database and start making requests. It’s difficult to know exactly what’s happening here without seeing the code for your backup/restore process.
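Concretely, the ordering during a restore should look roughly like this (a sketch - adapt the service management to however your instance runs the supervisor):

# 1. With Builder stopped, restore the Postgres dumps and the Minio data.
# 2. Start the services so their built-in schema migrations run.
sudo systemctl start hab-sup
sudo hab svc status    # wait until everything reports "up"
# 3. Merge the shards before the UI/API takes any traffic.
PGPASSWORD=$(sudo cat /hab/svc/builder-datastore/config/pwfile) ./scripts/merge-shards.sh originsrv migrate
PGPASSWORD=$(sudo cat /hab/svc/builder-datastore/config/pwfile) ./scripts/merge-shards.sh sessionsrv migrate   # if sessionsrv needs it too
# 4. Only then let clients hit the depot again.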

It’s also worth noting that you’re likely seeing this problem specifically because of the two different versions of the builder services that you’re running. If you were backing up and restoring the same versions of the Builder services, I don’t think you’d have this problem.

@salam, I understand that’s not the recommended approach (and it’s not necessarily how I would have approached it either), but I am not the owner of our repo, so I am trying to figure out why the backup and restore wasn’t working.

Based on what @raskchanky said, I think that’s probably what is going on. I’ll look at either getting our repo to the same version as our PROD instance, or wiping prod and dev and starting over so they are both consistent and have the same starting point.

Thank you both for your feedback and direction, it’s been greatly appreciated!

Cool, let us know how things go!