Troubleshooting Builder Services
This document serves as a guide to troubleshooting Builder services in the production environment.
It is assumed that the pre-requisites outlined in the On Call Engineering Duties document have been met - eg, access to the dashboards, ability to ssh into the services nodes, and familiarity with making SQL queries on the databases, etc.
It is also assumed that the Known Issues that are called out in the On Call Engineering Duties doc have been checked and resolved - this document will not duplicate those issues.
Builder-related incident? Stay Calm, See the Common Issues section below***
Assessing System Health
The current steady state for the API is around 20-40 reqs/sec. We run 2 API nodes, which gives us a capacity to handle upward of 4000 requests/sec, so there is generally plenty of headroom.
For the build jobs, all the main build events that will flow through the dashboard, with both Completed and Failed build status as well as build times available to be monitored. If there is any kind of build storm, the number of Waiting jobs will be the metric to look at.
As a baseline or to check a repro, you can kick off a test build (eg, core/testapp) from the Habitat depot web site and then watch it getting Dispatched and then Completed in the Datadog dashboard.
The Sumologic logs are essential to checking the health of the build workers and jobs status. Check the
Live Workers query to get a view of the build status from the build workers. This will also let you know the IP address of the node that the log is coming from, which will be very useful for finding the specific worker to look at if there are issues (see section below).
Assessing Job Server Health
The Job Server has two main internal sub-systems: it manages groups of builds (the ‘scheduler’), as well as co-ordination of individual build jobs to build workers (the ‘worker manager’). It is also responsible for validating the package uploads do not result in circular dependencies, and also providing package stats to the web UI. Therefore, it can be a suspect if any of these activities are not functioning as expected.
Generally, the syslog (
journalctl -fu hab-sup) is the first thing to look at. If it is showing repeated errors, it may be a candidate for a service restart.
You can also check the scheduler status by doing a
tail /tmp/builder-scheduler.log. This is the log file generated by the scheduler (and in fact, is what the Datadog agent looks at for it’s events).
Assessing Worker Health (Linux)
Assessing build worker health is a little trickier since there are a lot of workers, and occasionally the problem is with specific workers.
The best way to pick a worker to look at is to first identify the IP address of the worker - this can be done via the Sumologic logs - for example, the below snippet shows that the worker IP address is 10.0.0.226
2017-11-17 01:55:55,W,852428499479642112,852428499836542976,Dispatched,chefsalim/testapp,,,,, Host: ip-10-0-0-226 Name: /mnt/hab/svc/builder-worker/logs/builder-worker.log Category: live/worker
Once the worker IP address is known, log into the Habitat AWS (via Okta), and locate the EC2 instance with that IP address (make sure you are looking in the
live X-Environment). That will give you the friendly name (and public DNS name) of the node. For example, in this case the friendly name is
builder-worker-2 builder live i-04a20b2c3e1684b34 c4.2xlarge us-west-2a running 10.0.0.226
Once you have the friendly name, you can then SSH into that instance (assuming you have your ssh config set up as described in the On Call Engineering Duties) doc.
and then examine the syslog on the worker:
journalctl -fu hab-sup
Once on the worker, you can also look at the text log (this is the log file that is aggregated by Sumologic).
Let’s say you see something like the below in the log:
2017-11-16 16:40:55,W,852140975242477568,852145961955606528,Dispatched,core/yarn,,,,, 2017-11-16 16:41:22,Failed studio build for core/yarn, err=BuildFailure(2) 2017-11-16 16:41:22,W,852140975242477568,852145961955606528,Failed,core/yarn,,,2017-11-16 16:40:57,2017-11-16 16:41:22,code: BUILD msg: "wk:run:5"
Now you can also look at the build job log for the job above (the job id is the second number after the W in the last line above):
sudo tail /hab/svc/builder-worker/data/852145961955606528/log_pipe-852145961955606528.log
Assessing Worker Health (Windows)
Accessing and troubleshooting Window build workers is quite different from the Linux workers.
Currently, we don’t have SSH access set up for Windows workers.
In order to access the Windows workers, you will need to go the EC2 dashboard, select a worker, and click Connect. Download and run the RDP file to connect to the Windows instance. The RDP password is available in the team 1Password.
Once you have RDP’d in, you can open a new Powershell window to look at the logs, and restart services, etc
The worker log is in the following location: `C:\hab\svc\windows-service\logs\Habitat.log’
In order to restart the hab sup Windows service, you can use the following:
Stop-Service Habitat Start-Service Habitat
Note: there is a limit of only two simultaneous RDP sessions per node
Tracking job state in the Job Server DB
The Job Server DB is the ultimate ‘source of truth’ regarding the state of the Builder.
In order to log into the DB, first SSH to the ‘live-builder-datastore-0’ node.
Next, do the following steps:
sudo su - hab $(hab pkg path core/postgresql)/bin/psql builder
Now, from the psql console you can look at groups, group_projects, and jobs.
A group is the top level construct that aggregates a set of group projects, which define the names of packages to be built, and their states. The builds are specifically identified by jobs, which are sent to build workers.
In order to get a quick high level look at the state of groups, you can do the following:
select * from groups order by created_at desc;
This will give output that looks something like this:
id | group_state | project_name | created_at | updated_at --------------------+-------------+---------------------------------------+-------------------------------+------------------------------- 853130726943727616 | Complete | core/hab-pkg-dockerize | 2017-11-18 01:08:18.850659+00 | 2017-11-18 01:08:18.895481+00 853130456478220288 | Dispatching | core/hab-studio | 2017-11-18 01:07:46.609769+00 | 2017-11-18 01:07:46.922309+00 853129731459211264 | Complete | core/hab-plan-build | 2017-11-18 01:06:20.180895+00 | 2017-11-18 01:07:20.331054+00 853129450340171776 | Dispatching | core/hab | 2017-11-18 01:05:46.667501+00 | 2017-11-18 01:07:47.127775+0
From this you can see the ‘root’ group projects (ie, what triggered the initial build), what groups are in a
Dispatching state (ie, there are still things pending for that build group), and what the group creation and last update time was (which can give an indication of whether the group is making progress).
If you want to look more at a specific group, you can do the following (where owner_id is one of the group ids above)
select * from group_projects where owner_id = '<owner id>';
This will give more information about the state of the individual group projects. You can also drill down and look at what build jobs have been kicked off for the group by issuing the following command:
select * from jobs where owner_id = '<owner id>';
Each job has a job id, and is assigned to a build worker. Currently, the build worker assigned to build jobs is not persisted outside of a `busy_workers’ table. So the best way to identify what worker was assigned to a build is through the Sumologic logs as per the section above. If you want to look at what workers are currently assigned to build jobs, you can do the following:
select * from busy_workers;
Build does not get triggered or is stuck without any progress
If a build is not getting triggered, the most likely causes are:
A previous build is in progress (or stuck), and there is already a build queued. You can check this by looking at the groups via the DB as in the section above. If it looks like there is a stuck build, you can unstick it by following the instructions in the page here: https://github.com/habitat-sh/habitat/wiki/Checking-for-stuck-Builder-builds-[Debugging]. Alternately, you should now be able to use
hab bldr job cancelcommand to cancel the offending job group.
The job server is not getting or responding to build requests. If the job server is in a bad state, the builds will not even get queued. You can check the job server state by checking the syslog (
journalctl -fu hab-sup). If it looks like there are errors, you can restart the hab services (
sudo systemctl restart hab-sup).
The router is in a bad state. SSH into the
live-builder-router-0node and check the syslog. If you see routing errors, the job server AND/OR the origin server may need to be restarted so that the shards can be re-registered.
There are no build workers available. This is unlikely, and the Datadog dashboard will show a lot of builds in progress if this happens. Either add more build workers, or wait for the build jobs to complete.
Builds are failing at the worker with Airlock error
If you suspect an Airlock issue, you can use the airlock.sh script to check (and fix) things. This assumes you have set up your ssh config as per the Pro Tip in the On Call Engineering Duties doc.
Builds are stuck in ‘Pending’ state after new workers are provisioned
If builds are stuck in a ‘Pending’ state, and workers have been recently provisioned, the issue might be that the workers are waiting for the required configuration (.PEM files, etc) to become available.
Switch over to the Datastore node, and run the
apply_config.sh script to re-publish the configuration to the supervisors. Then re-check the workers after a minute or so.