Hardening habitat against upstream changes / failures


#1

I’ve been using habitat for close to a year now to manage the build process for a couple of internal applications. The experience has been mostly good; however, we’ve had problems with breaking changes in package dependencies during rebuilds, and some problems with bldr being offline, etc. Of course, these frequently seem to happen when we’re trying to get a release out, so it’s become a high-visibility problem even though it’s only happened half a dozen times or so. As a result, I’m exploring ways to harden our build process against those issues. I haven’t found anything really in the way of docs that covers best practices around this though, so I’m starting this as a place to collect ideas. In essence, what I’m trying to accomplish is the ability to perform builds reliably and consistently, even if all the habitat infrastructure is unreachable, and to only use updated versions of packages when I explicitly ask for it.

Here’s what I’m thinking so far:

  • Pin package versions - I generally don’t like doing this as it tends to create maintenance issues, and in habitat’s case in particular, I don’t know how it resolves transitive dependencies. I’m wondering if in the normal case this would actually cause more problems than it solves.
  • Cache all dependencies locally - My hope is that if a build has everything it needs it won’t attempt to download deps at build time. In limited experimentation, making this happen is non-obvious. Even if I install dependencies when I create the container that builds happen in, they seem to still be re-downloaded during builds.
  • Run a local bldr - Sort of an extension of the above, but this introduces quite a bit of overhead, so I’d rather not go there if possible.
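For reference, here’s roughly what pinning looks like in a plan file, plus one way to pre-seed the local artifact cache before building. The origin, package names, and versions below are placeholders, not real releases:

```shell
# plan.sh (fragment) -- a sketch of dependency pinning.
# "myorigin" and the version shown are hypothetical.
pkg_origin=myorigin
pkg_name=myapp
pkg_version=1.0.0

# Unpinned: resolves to the latest stable release at build time.
# pkg_deps=(core/glibc)

# Pinned to a version (you can also fully qualify down to a
# specific release timestamp for an exact build):
pkg_deps=(
  core/glibc/2.27
)

# Separately, you can pre-seed the local artifact cache so a later
# build has less reason to reach out to bldr:
#   hab pkg install core/glibc/2.27
# Downloaded .hart files land in /hab/cache/artifacts.
```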

Thoughts? Am I forgetting anything? Is “offline habitat building” already documented somewhere and I missed it?

Thanks!


#2

So, nobody else is chewing on this?


#3

Oh dang @qhartman this definitely fell through the cracks, I am really sorry :frowning: .

There are a few things I think you could do to protect against builder downtime and breaking changes, but mostly it comes down to workflow, with only a splash of tooling.

The first thing you could do is check out https://github.com/habitat-sh/on-prem-builder, which would allow you to run an on-premises depot that can sync packages from the public core builder. Then you can point your pipelines/supervisors at this on-prem depot, which should mitigate any issues with public bldr being down.
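As a sketch, pointing builds and supervisors at an on-prem depot mostly comes down to overriding the builder URL; the hostname below is a placeholder:

```shell
# Hypothetical on-prem depot URL -- substitute your own.
export HAB_BLDR_URL=https://bldr.example.internal

# The studio and `hab pkg install` honor HAB_BLDR_URL, or you can
# pass the endpoint explicitly:
hab pkg install --url "$HAB_BLDR_URL" core/mysql

# The supervisor can likewise be told where to fetch packages from:
hab sup run --url "$HAB_BLDR_URL"
```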

The next thing, with regard to breaking package changes, is going to be more heavily tied to your workflow. Generally, one of the things I suggest people do is use a sort of configuration-package pattern for services you want to consume directly from core. With mysql, for example, what this looks like (right now at least; we really want to take this and make it more of a first-class experience) is creating a new plan file, something like <appname>-database, that has a single dependency on core/mysql, along with a copy of the hooks and config files. Then you don’t really need to pin any versions, because as core/mysql gets updated, your wrapper will get an unstable version created that can be tested before promotion, and you’ve effectively given yourself a nice little buffer against an inbound change.
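A minimal sketch of that wrapper pattern, assuming a hypothetical `myorigin` origin; the hooks and config files it carries would be copied from core/mysql and edited as needed:

```shell
# plan.sh for a hypothetical wrapper package, myorigin/myapp-database.
pkg_origin=myorigin
pkg_name=myapp-database
pkg_version=0.1.0

# The only dependency is the upstream package being wrapped.
pkg_deps=(core/mysql)

# No build or install steps: this package exists to carry our copies
# of the hooks/ and config/ directories alongside the dependency, so
# the supervisor runs our versions instead of upstream's.
do_build() {
  return 0
}

do_install() {
  return 0
}
```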

That being said, we are going to be doing more with core plans in relation to communication of breaking changes and the like in the future which should also help out!


#4

@eeyun I think this is the link you’re looking for: https://github.com/habitat-sh/on-prem-builder ? (your link 404s for me)

So to put it in “Chef” terms, make a “wrapper plan”? That… actually would make my life a whole lot easier since (for instance) I only want to change one line in the core/consul config file/run hook… then I don’t have to worry if upstream changes something… (it would be even cooler if I could just override specific hooks… like, the run hook is fine, but I want to change the health_monitor hook)


#5

That’s correct, thank you for fixing that link. Yeah, that’s totally the vision for that experience. Ideally: declare the shit you care about, don’t change anything you don’t, and don’t have to lug files between packages. Now, there are some other clever ways you can do this today by sourcing the files from the package on disk, but I’d like it to be even cleaner than that.
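One version of “sourcing the files from the package on disk” today is to have the wrapper’s run hook reach into the dependency via the `pkgPathFor` template helper, so only the config you care about lives in your package. A sketch, assuming a hypothetical wrapper around core/consul:

```shell
# hooks/run in a hypothetical wrapper package around core/consul.
# pkgPathFor resolves to the dependency's install path on disk, so we
# reuse upstream's binary while supplying our own rendered config.
exec {{pkgPathFor "core/consul"}}/bin/consul agent \
  -config-file={{pkg.svc_config_path}}/consul.json
```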


#6

No problem, I appreciate your reply. I’ve had to back-burner this stuff anyway, so the delay was no problem for me. What you said lines up well with what I’ve been thinking, so that’s reassuring.

Do you have any specific thoughts on the pinning of package versions? It’s not a practice I generally like, but with how hab manages its deps, especially transitive deps, it seems like it might be necessary. I’m thinking specifically of the cases where a top-level package requires a rebuild because it and another top-level package both depend on the same thing, but the peer package at the top level has had an update which caused that dependency to get bumped. Even if there is no actual version change, because the build changed, the dependencies as defined by the top-level packages conflict, causing builds to fail. The solution in the past has been to request a rebuild of the now out-of-date package, or to temporarily pin the updated package to the previous version where the deps matched. Have you guys updated how you’re managing rebuilds to make that less likely to happen? I haven’t seen this particular issue in a while, so maybe I’m concerned about it out of superstition at this point.

Given that the main solution sounds like it would be to run a local bldr, the real question at this point then is whether or not habitat is providing enough value to be worth the trouble, or if it makes more sense for us to migrate to something else. We’re really only using hab for build and package, and we’re pretty strongly married to docker containers at this point, so the flexibility to target other things isn’t something we need anymore. We’re not using the supervisor stuff at all. Now that I have Traefik deployed in my cluster I have better http routing options available to me than what ECS offers natively, so maybe I can get the supervisor piece working and providing value as well, and that would tip the balance.


#7

I personally would advise against version pinning in most cases. Pinning versions will more frequently lead to version mismatches in your reverse dep graph at build time, which can be a pain in the ass to unwind unless you pin all of the versions in your package, but then you could be doing dep management down to your libc version. BUT, I will say that adopting the above wrapper pattern, EVEN WITHOUT an on-prem depot, should definitely help alleviate some of the version-change pain. It at least puts you in a place where, if you’re building in a Jenkins pipeline, you could (even if only temporarily) pin a single dep version to a previous release in order to allow an internal code change to go out. We’re also looking at some more integrations and tooling for people that aren’t using builder’s services, though that doesn’t help much right now. Part of the benefit of having the on-prem depot is access to upstream continuous builds, the ability to use channels (even without using any of the update strategies or clustering on the supervisor), a locally accessible API to integrate into your CI pipeline, and package auditability (though we may have some other tools in the future for this as well).

The idea for on-prem (not 100% of this functionality exists today, though it’s all on the roadmap) is that when a change hits core, your local packages can get rebuilt and placed into a configurable channel. At that point, tying it into a CI pipeline means: a PR to core triggers an openssl rebuild. App FOO uses openssl, so it gets rebuilt and dropped into a new build channel. CI polling the API finds the change, spins up a scripted functional-test environment, validates state, and on success promotes the package into a release channel, at which point a notification that a new package is tested/available for promotion goes out. From there, either your CI pipeline or builder can add the appropriate labels to your freshly minted containers in your docker repo, and you’re ready to push-button deploy or use repo tags to auto-deploy.
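The polling step in that pipeline could be as simple as hitting the depot’s channel API. A sketch assuming the public builder’s API shape, with the depot URL, origin, channel, and package name all as placeholders:

```shell
# Poll a channel for the latest release of a package. The URL,
# origin ("myorigin"), channel ("build"), and package ("myapp")
# are hypothetical; the /latest endpoint returns JSON describing
# the newest release in that channel.
BLDR_URL=https://bldr.example.internal
curl -s "$BLDR_URL/v1/depot/channels/myorigin/build/pkgs/myapp/latest" \
  | jq -r '.ident | "\(.origin)/\(.name)/\(.version)/\(.release)"'
```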

Now, the question of whether or not Habitat without this behavior is a value to your organization can only be answered by you. I think for folks managing larger container fleets, simply having atomic, deterministically built, transparent containers is a pretty big win. But for folks that are building a handful of statically compiled golang binaries, or perhaps an org that isn’t running enough containers for fleet auditability to be a concern, the build tools might be the only value add. At that point it could be that pinning/building from a local cache is the right call, and it could simplify your life.

WRT traefik, we’ve probably got some internal examples of this we could scrape together if you end up having an interest as some of our software products use traefik with hab.


#8

Great. Thanks as always for the thoughtful reply. I’ll be chewing on this in the next couple of weeks. Despite the bumps we’ve had, I really like hab, so I’m going to see about getting the supervisors working. I would love to have a “build once run anywhere” setup, and that might provide a path forward for that. If so, it’s a clear win and would justify the effort of setting up a local bldr.