How to cancel 'stuck' Builder builds


If a build gets ‘stuck’, meaning it has not progressed for an extended period of time, it can be canceled with the following command:

hab bldr job cancel <GROUP_ID>

This command may require admin credentials (e.g., the WORKER privilege that comes with being in the core-maintainers group on GitHub) if the caller is not a member of the origin that owns the group.

The group ID can be found in a number of ways:

  • If the job was initiated from the CLI, the group ID is returned via STDOUT
  • From the Datadog dashboard
  • From Sumologic logs
  • From the DB (see steps below)

If you suspect a build group is stuck (symptom: you cannot kick off new builds of that package), you can check it via the DB as follows.

Log into the ‘builder_jobsrv’ Postgres database from the builder datastore node (which is where the Postgres instances run), e.g.:

$ sudo su hab
$ /hab/pkgs/core/postgresql/9.6.6/20180208190339/bin/psql builder_jobsrv
psql (9.6.3, server 9.6.6)
Type "help" for help.

builder_jobsrv=#

Then, you can issue more specific commands:

set search_path to shard_0;
select * from groups order by created_at desc;

You should now be able to spot which groups are in a ‘Dispatching’ state. If you see your package in the group list, note the group ID and issue the cancel command.

The steps below should ONLY be run if the ‘hab bldr job cancel’ command is unavailable or not working.
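On a busy system the group list can be long. As a minimal sketch, the query above can be run non-interactively and its output filtered for the ‘Dispatching’ state. The helper name dispatching_only is hypothetical, and the filter is a plain text match on the psql output, so it makes no assumption about column names:

```shell
# Keep only the rows that mention the Dispatching state.
# Reads psql output on stdin; prints nothing if no row matches.
dispatching_only() {
  grep -F 'Dispatching' || true
}

# Usage on the datastore node, as the hab user (not executed here):
#   psql -At -c "set search_path to shard_0; select * from groups order by created_at desc;" builder_jobsrv | dispatching_only
```

Because this is a text match, a row would also be kept if any other column happened to contain the word ‘Dispatching’, so treat the result as a shortlist to check by eye.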

select project_name, project_state, updated_at from group_projects where owner_id = '<owner id>';

This should list the projects and their states. A project in the ‘InProgress’ state that has not been updated in a while is a likely cause of the stall.

You can unstick the pipeline by issuing a DELETE on that group, e.g.:
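To narrow the list down to likely suspects, the same query can be restricted to ‘InProgress’ projects that have not been updated recently. The following is a sketch only: the helper name stale_projects_sql and the one-hour default threshold are assumptions, and it assumes the owner (group) ID is numeric. It uses only the columns that appear in the query above.

```shell
# Print a query for InProgress projects in one group whose updated_at is older
# than a threshold. The threshold default ("1 hour") is an arbitrary assumption.
stale_projects_sql() {
  owner_id="$1"
  stale_after="${2:-1 hour}"
  # The values are interpolated into SQL text, so validate them first.
  case "$owner_id" in
    ''|*[!0-9]*) echo "owner id must be numeric" >&2; return 1 ;;
  esac
  case "$stale_after" in
    *"'"*) echo "interval must not contain quotes" >&2; return 1 ;;
  esac
  printf "set search_path to shard_0; select project_name, project_state, updated_at from group_projects where owner_id = '%s' and project_state = 'InProgress' and updated_at < now() - interval '%s' order by updated_at;\n" "$owner_id" "$stale_after"
}

# Usage on the datastore node (not executed here):
#   stale_projects_sql <owner id> "30 minutes" | psql builder_jobsrv
```

The function only prints the SQL, so the statement can be reviewed before it is piped into psql.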

http DELETE https://bldr.habitat.sh/v1/depot/pkgs/schedule/<group id> Authorization:Bearer:${HAB_AUTH_TOKEN}

Note that the DELETE can only be issued if your auth token has the worker privilege.

Canceling mass builds

Occasionally, when we rebuild a plan that many other plans depend on, we kick off a very large number of builds. You can check how many builds are pending through this Datadog link (note: you will need a Datadog account and must be added to the Habitat organization on Datadog).

You can get a list of all current build groups for the core origin with:

$ hab bldr job status -s --origin core

Then cancel those builds individually with:

$ hab bldr job cancel <GROUP_ID>
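When there are more than a handful of groups, the individual cancels can be wrapped in a small loop. This is a sketch, not an official tool: the helper name cancel_groups and the DRY_RUN variable are hypothetical, and the group IDs are assumed to have been copied by hand from the status output. It defaults to a dry run that only prints the commands.

```shell
# Cancel each group ID passed as an argument.
# DRY_RUN defaults to 1, which only prints the commands for review.
cancel_groups() {
  for group_id in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "hab bldr job cancel $group_id"
    else
      # Keep going if one cancel fails, but report it on stderr.
      hab bldr job cancel "$group_id" || echo "failed to cancel $group_id" >&2
    fi
  done
}

# Review first, then run for real (not executed here):
#   cancel_groups <GROUP_ID> <GROUP_ID>
#   DRY_RUN=0 cancel_groups <GROUP_ID> <GROUP_ID>
```

The cancel command may ask for confirmation on each group, so expect to answer a prompt per ID when running with DRY_RUN=0.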

To cancel ALL dispatching builds, please refer to the article ‘Canceling all dispatching Builder builds’.