I am working through a basic script to back up and restore a Habitat depot. The backup portion seems to be working (most of my code is roughly based on this GH issue thread: https://github.com/habitat-sh/on-prem-builder/issues/60).
However, when I restore our production Habitat depot onto a new, clean-slate depot instance using the archive(s) created per the above GH issue, everything appears to restore correctly (Postgres tables, Minio datastore, etc.). But after I restart the services, the packages from the production depot are not present on the newly restored dev instance: the files are on disk, but when I log in to the Habitat UI and search for my packages, they don't come up.
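For context, the backup side is roughly along these lines (the connection user and data paths here are simplified stand-ins for illustration, not our exact script):

# dump the builder databases from the datastore service
sudo hab pkg exec core/postgresql pg_dumpall -U hab -h 127.0.0.1 > builder_pg_backup.sql
# archive the Minio data directory where the package artifacts live
sudo tar czf builder_minio_backup.tar.gz /hab/svc/builder-minio/data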
It feels like I need to trigger some other type of reindex or refresh of the backend metadata, so the UI can “catch up”.
Has anyone else run into this, or have pointers about where I should start looking?
Hi Kyle - it is possible that you will need to run the shard migration script on your new instance, as detailed in the on-prem README - https://github.com/habitat-sh/on-prem-builder
It is likely that when you created the new instance, you installed the latest versions of the Builder services, which require the shard migration. You may also need to do a Minio migration, as the package files are now stored in Minio rather than on the file system. Again, the migration information is in the README.
You can check the Builder service versions by running sudo hab svc status and comparing the versions between your old instance and your new instance.
Please post the service versions, and any output from journalctl -fu hab-sup if the above does not work. Thanks!
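For example, you can dump a window of the supervisor log to a file that is easy to share:

sudo journalctl -u hab-sup --no-pager --since "1 hour ago" > hab-sup.log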
All of the versions are the same between prod and dev, except for a minor version difference in builder-sessionsrv and builder-originsrv (prod = 7519, dev = 7582).
# Prod
package type desired state elapsed (s) pid group
habitat/builder-sessionsrv/7519/20180731190110 standalone up up 1646944 6050 builder-sessionsrv.default
habitat/builder-router/7519/20180731190111 standalone up up 1646945 5981 builder-router.default
habitat/builder-originsrv/7519/20180731190110 standalone up up 1646943 6248 builder-originsrv.default
habitat/builder-minio/0.1.0/20180612201128 standalone up up 96044 7208 builder-minio.default
habitat/builder-datastore/7311/20180426183913 standalone up up 1646945 5993 builder-datastore.default
habitat/builder-api-proxy/7519/20180731190110 standalone up up 1646944 6210 builder-api-proxy.default
habitat/builder-api/7554/20180808175204 standalone up up 1646944 6111 builder-api.default
# Dev
package type desired state elapsed (s) pid group
habitat/builder-sessionsrv/7582/20180822212645 standalone up up 378 19003 builder-sessionsrv.default
habitat/builder-router/7519/20180731190111 standalone up up 87833 9924 builder-router.default
habitat/builder-originsrv/7582/20180822212645 standalone up up 382 18903 builder-originsrv.default
habitat/builder-minio/0.1.0/20180612201128 standalone up up 527 18740 builder-minio.default
habitat/builder-datastore/7311/20180426183913 standalone up up 461 18799 builder-datastore.default
habitat/builder-api-proxy/7519/20180731190110 standalone up up 87813 10061 builder-api-proxy.default
habitat/builder-api/7554/20180808175204 standalone up up 87821 9945 builder-api.default
Attaching a log from journalctl -fu hab-sup. I stopped all services, started the journalctl command, then started all Habitat services. Once they were back up (according to the hab-sup.log), I refreshed the Builder UI (which asked me to authenticate again, which I did successfully). Once I was in, I searched for one of my packages, and it did not show up.
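For reference, this is roughly what I ran; the supervisor is managed as a systemd unit called hab-sup on our boxes, so the unit name is specific to our setup:

sudo systemctl stop hab-sup      # stops the supervisor and with it all Builder services
sudo journalctl -fu hab-sup      # left following in a second terminal to capture the log
sudo systemctl start hab-sup     # brings the supervisor and services back up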
EDIT: I can't figure out how to attach files; should I send the file somewhere?
Also, I did try running through the steps at https://github.com/habitat-sh/on-prem-builder#migration-1, but after I run ./uninstall.sh and ./install.sh, I don't see a ./scripts/migrate.sh file, so I'm not sure whether I should continue.
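For reference, this is the sequence I followed, run from the root of our on-prem-builder checkout:

./uninstall.sh
./install.sh
ls scripts/    # no migrate.sh shows up at this point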
Note that our repo doesn't have a copy of that merge-shards.sh script, but I pulled a copy down from the current on-prem-depot repo and attempted to run it anyway.
I got further this time, but it errored out when I ran the script on our DEV instance (the instance we restored the pg backup to):
Hi Kyle - in general, it is important not to grab things piecemeal from the repo; there may be configuration in provisioning or other changes that could be impacting the migration and causing issues. The recommended path is to pull down the full repo (either directly, or by pulling all changes into your own repo), then do an uninstall (it is not destructive), and then do the install.
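In other words, something along these lines (the clone location is just an example; bring your existing site configuration over before installing):

git clone https://github.com/habitat-sh/on-prem-builder.git
cd on-prem-builder
# copy your existing site config (e.g. your bldr.env) into place here
./uninstall.sh    # not destructive to your package or database data
./install.sh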
That said, we will look further into narrowing down the root cause of the error you are seeing during the migration.
As luck would have it, the two releases in question straddle a database migration that we did in the middle of August to merge all of the database schemas into the public schema. 7519 expects there to be 128 shards in the database and 7582 expects to see that data migrated in a very specific way.
The error you’re getting suggests that data has already been inserted into the origins table of the public schema before merge-shards.sh is ever run, which shouldn’t happen until after the migration has run. If you open psql on your new dev instance, connect to the builder_originsrv database, and run SELECT * FROM origins; I’m guessing you’ll see a record for the core origin.
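For example, something along these lines will run that check (the connection user and host are assumptions here; adjust them for your builder-datastore configuration):

sudo hab pkg exec core/postgresql psql -U hab -h 127.0.0.1 builder_originsrv -c 'SELECT * FROM origins;'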
When you do your restore, it’s important that the merge-shards.sh script be run after the services have been started up (to make sure the migrations have been run), but before any clients connect to the database and start making requests. It’s difficult to know exactly what’s happening here without seeing the code for your backup/restore process.
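As a rough sketch, the expected ordering on the restored instance looks like this (see the README for the exact merge-shards.sh location and invocation; the restore commands themselves depend on your backup format):

# 1. restore the Postgres dump and the Minio data onto the new instance
# 2. start the supervisor so the Builder services run their own schema migrations
sudo systemctl start hab-sup
# 3. run the shard merge before anything talks to the database
./scripts/merge-shards.sh
# 4. only after that, let clients and the UI connect to the depot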
It’s also worth noting that you’re likely seeing this problem specifically because of the two different versions of the builder services that you’re running. If you were backing up and restoring the same versions of the Builder services, I don’t think you’d have this problem.
@salam, I understand that’s not the recommended approach (and it’s not necessarily how I would have approached it either), but I am not the owner of our repo, and so I am trying to figure out why the backup and restore wasn’t working.
Based on what @raskchanky said, I think that's probably what is going on. I'll look at either getting our repo to the same version as our PROD instance, or wiping prod and dev and starting over so they are both consistent and have the same starting point.
Thank you both for your feedback and direction, it’s been greatly appreciated!