Troubleshooting Builder Services

This document serves as a guide to troubleshooting Builder services in the production environment.

It is assumed that the prerequisites outlined in the On Call Engineering Duties document have been met - e.g., access to the dashboards, the ability to SSH into the service nodes, familiarity with making SQL queries against the databases, etc.

It is also assumed that the Known Issues called out in the On Call Engineering Duties doc have already been checked and resolved - this document will not duplicate those issues.

Builder-related incident? Stay calm, and see the Common Issues section below.

Assessing System Health

Datadog dashboard

The Datadog dashboard Live Builder Job Status is generally the first thing to look at - if the system is running correctly, build events will flow through the dashboard, and you should see both Completed and Failed builds.

As a baseline, or to check a repro, you can kick off a test build (e.g., core/testapp) from the Habitat depot web site and then watch it get Dispatched and then Completed in the Datadog dashboard.
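If you prefer the command line, a similar test build can be kicked off with the hab CLI. This is a minimal sketch - it assumes you have a valid Builder auth token and that your hab client is pointed at the production Builder; the exact invocation may vary by hab version:

# a minimal sketch - assumes a valid Builder auth token in HAB_AUTH_TOKEN
export HAB_AUTH_TOKEN=<your token>
hab bldr job start core/testapp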

Sumologic logs

The Sumologic logs are essential for checking the health of the build workers and the status of jobs. Check the Live Workers query to get a view of the build status from the build workers. This will also tell you the IP address of the node each log entry is coming from, which is very useful for finding the specific worker to look at if there are issues (see the section below).

Assessing Job Server Health

The Job Server has two main internal sub-systems: the ‘scheduler’, which manages groups of builds, and the ‘worker manager’, which coordinates the dispatch of individual build jobs to build workers. It is also responsible for validating that package uploads do not introduce circular dependencies, and for providing package stats to the web UI. It is therefore a suspect if any of these activities are not functioning as expected.

Generally, the syslog (journalctl -fu hab-sup) is the first thing to look at. If it is showing repeated errors, it may be a candidate for a service restart.

You can also check the scheduler status with tail /tmp/builder-scheduler.log. This is the log file generated by the scheduler (and is, in fact, what the Datadog agent watches for its events).
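Put together, a quick first pass on the Job Server node might look like the following (the restart at the end is the same hab services restart described in the Common Issues section below):

# follow the supervisor output for the Builder services
journalctl -fu hab-sup

# check the scheduler's own log file (this is what the Datadog agent tails)
tail -f /tmp/builder-scheduler.log

# if the errors are persistent, restart the hab services
sudo systemctl restart hab-sup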

Assessing Worker Health

Assessing build worker health is a little trickier since there are a lot of workers, and occasionally the problem is with specific workers.

The best way to pick a worker to look at is to first identify its IP address via the Sumologic logs. For example, the snippet below shows that the worker IP address is 10.0.0.226:

2017-11-17 01:55:55,W,852428499479642112,852428499836542976,Dispatched,chefsalim/testapp,,,,,
Host: ip-10-0-0-226  Name: /mnt/hab/svc/builder-worker/logs/builder-worker.log  Category: live/worker 
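The Host field uses the EC2-style ip-a-b-c-d naming, so a quick, purely illustrative way to turn it back into a dotted IP address is:

# turn the Sumologic "Host:" value back into a dotted IP address
echo "ip-10-0-0-226" | sed -e 's/^ip-//' -e 's/-/./g'
# => 10.0.0.226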

Once the worker IP address is known, log into the Habitat AWS account (via Okta) and locate the EC2 instance with that IP address (make sure you are looking at instances in the live X-Environment). That will give you the friendly name (and public DNS name) of the node. For example, in this case the friendly name is builder-worker-2:

builder-worker-2 (builder / live)
  Instance ID:        i-04a20b2c3e1684b34
  Instance type:      c4.2xlarge
  Availability zone:  us-west-2a
  State:              running
  Private IP:         10.0.0.226
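If you have the AWS CLI configured for the Habitat account, a convenient alternative to clicking through the console is to look the instance up by its private IP. The filter and query below are standard AWS CLI usage rather than anything Builder-specific:

# find the instance with a given private IP and print its Name tag, instance id, and public DNS name
aws ec2 describe-instances \
  --filters "Name=private-ip-address,Values=10.0.0.226" \
  --query 'Reservations[].Instances[].[Tags[?Key==`Name`].Value | [0], InstanceId, PublicDnsName]' \
  --output text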

Once you have the friendly name, you can then SSH into that instance (assuming you have your ssh config set up as described in the On Call Engineering Duties doc):

ssh live-builder-worker-2

and then examine the syslog on the worker:

journalctl -fu hab-sup

Once on the worker, you can also look at the text log (this is the log file that is aggregated by Sumologic).

tail /hab/svc/builder-worker/logs/builder-worker.log

Let’s say you see something like the following in the log:

2017-11-16 16:40:55,W,852140975242477568,852145961955606528,Dispatched,core/yarn,,,,,
2017-11-16 16:41:22,Failed studio build for core/yarn, err=BuildFailure(2)
2017-11-16 16:41:22,W,852140975242477568,852145961955606528,Failed,core/yarn,,,2017-11-16 16:40:57,2017-11-16 16:41:22,code: BUILD msg: "wk:run:5"

Now you can also look at the build job log for the job above (the job id is the second number after the W in the last line above):

sudo tail /hab/svc/builder-worker/data/852145961955606528/log_pipe-852145961955606528.log
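If you are not sure which job id to look for, the job data directories on the worker are named after the job ids, so listing the most recently modified ones is a quick way to find the right log:

# list the most recently modified job directories on this worker
sudo ls -lt /hab/svc/builder-worker/data/ | head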

Tracking job state in the Job Server DB

The Job Server DB is the ultimate ‘source of truth’ regarding the state of the Builder.

In order to log into the DB, first SSH to the ‘live-builder-datastore-0’ node, then run the following:

sudo su - hab
$(hab pkg path core/postgresql)/bin/psql builder_jobsrv

Once in the psql console, you should see a welcome message and some useful helper commands:

Welcome to Builder

You are connected to database "builder_jobsrv" as user "hab" via socket in "/tmp" at port "5432".

Current time is : Tue May 29 16:20:20 UTC 2018

Useful shortcuts (for builder_jobsrv):
:au  -- Most recent audit
:gr  -- Most recent groups
:jo  -- Most recent jobs
:pr  -- Most recent projects

Setting search path to shard_0
SET
Time: 0.137 ms
psql (9.6.6)
Type "help" for help.

As noted in the text above, the psql login script automatically sets the shard to 0.

If for some reason the shard is not set correctly, you can set it manually:

set search_path to shard_0;

Now, from the psql console you can look at groups, group_projects, and jobs.

A group is the top-level construct that aggregates a set of group projects, which define the names of the packages to be built and their states. The individual builds are identified by jobs, which are dispatched to build workers.

In order to get a quick high level look at the state of groups, you can do the following:

:gr

or, you can issue the same command more verbosely:

select * from groups order by created_at desc;

This will give output that looks something like this:

        id         | group_state |             project_name              |          created_at           |          updated_at
--------------------+-------------+---------------------------------------+-------------------------------+-------------------------------
 853130726943727616 | Complete    | core/hab-pkg-dockerize                | 2017-11-18 01:08:18.850659+00 | 2017-11-18 01:08:18.895481+00
 853130456478220288 | Dispatching | core/hab-studio                       | 2017-11-18 01:07:46.609769+00 | 2017-11-18 01:07:46.922309+00
 853129731459211264 | Complete    | core/hab-plan-build                   | 2017-11-18 01:06:20.180895+00 | 2017-11-18 01:07:20.331054+00
 853129450340171776 | Dispatching | core/hab                              | 2017-11-18 01:05:46.667501+00 | 2017-11-18 01:07:47.127775+0

From this you can see the ‘root’ group projects (i.e., what triggered the initial build), which groups are in a Dispatching state (i.e., there are still things pending for that build group), and the group creation and last update times (which can give an indication of whether the group is making progress).
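If you are specifically hunting for stuck groups, a variant of the same query (using only the columns shown above) narrows the listing to Dispatching groups that have not been updated recently. This is a sketch, with the one-hour threshold chosen arbitrarily:

-- groups still Dispatching that have not been updated in over an hour
select id, project_name, created_at, updated_at
  from groups
 where group_state = 'Dispatching'
   and updated_at < now() - interval '1 hour'
 order by updated_at;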

If you want to look more closely at a specific group, you can do the following (where the owner_id is one of the group ids above):

select * from group_projects where owner_id = '<owner id>';

This will give more information about the state of the individual group projects. You can also drill down and look at what build jobs have been kicked off for the group by issuing the following command:

select * from jobs where owner_id = '<owner id>';

Each job has a job id and is assigned to a build worker. Currently, the worker assigned to a build job is not persisted anywhere other than the busy_workers table, so the best way to identify which worker was assigned to a build is through the Sumologic logs, as described in the section above. If you want to see which workers are currently assigned to build jobs, you can do the following:

select * from busy_workers;
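Putting this together, a quick way to gauge how far along a particular group is, without scanning the full listings, is to count its projects by state. This is a sketch - the project_state column name is an assumption, since the exact schema is not spelled out above:

-- count a group's projects by state (substitute a real group id for <group id>)
-- NOTE: assumes the group_projects state column is named project_state
select project_state, count(*)
  from group_projects
 where owner_id = '<group id>'
 group by project_state;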

Common Issues

Build does not get triggered or is stuck without any progress

If a build is not getting triggered, the most likely causes are:

  • A previous build is in progress (or stuck), and there is already a build queued. You can check this by looking at the groups via the DB as in the section above. If it looks like there is a stuck build, you can unstick it by following the instructions here: https://github.com/habitat-sh/habitat/wiki/Checking-for-stuck-Builder-builds-[Debugging]. Alternatively, you should now be able to use the hab bldr job cancel command to cancel the offending job group (see the sketch after this list).

  • The job server is not receiving or responding to build requests. If the job server is in a bad state, builds will not even get queued. You can check the job server state by checking the syslog (journalctl -fu hab-sup). If it looks like there are errors, you can restart the hab services (sudo systemctl restart hab-sup).

  • The router is in a bad state. SSH into the live-builder-router-0 node and check the syslog. If you see routing errors, the job server and/or the origin server may need to be restarted so that the shards can be re-registered.

  • There are no build workers available. This is unlikely, and the Datadog dashboard will show a lot of builds in progress if this happens. Either add more build workers, or wait for the build jobs to complete.
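For the stuck-group case in the first bullet, the cancel command takes the group id, which you can get from the groups table in the DB (or from the depot UI). A minimal sketch - the auth token requirement and exact flags may vary with your hab version:

# cancel the offending job group (requires an auth token with the appropriate permissions)
export HAB_AUTH_TOKEN=<your token>
hab bldr job cancel 853130456478220288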

Builds are failing at the worker with Airlock error

If you suspect an Airlock issue, you can use the airlock.sh script to check (and fix) things. This assumes you have set up your ssh config as per the Pro Tip in the On Call Engineering Duties doc.

Builds are stuck in ‘Pending’ state after new workers are provisioned

If builds are stuck in a ‘Pending’ state and workers have recently been provisioned, the issue might be that the workers are waiting for the required configuration (.pem files, etc.) to become available.

Switch over to the Datastore node, and run the apply_config.sh script to re-publish the configuration to the supervisors. Then re-check the workers after a minute or so.

