2016-06-23 - Habitat app.habitat.sh depot upload/download errors
- This is a blameless Post Mortem.
- We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
- All follow-up action items will be assigned to a team or individual before the end of the meeting. If an item is not going to be a top priority after the meeting, don’t make it a follow-up item.
Incident Leader: Dave Parfitt
Description:
The Habitat team dealt with two issues during this incident:
- the Habitat depot was returning HTTP 503 and 504 errors on package downloads.
- the Habitat depot was returning an HTTP 503 after large package uploads.
Timeline (all times UTC):
- 5:02 PM: Adam Jacob declares the incident, Dave Parfitt is incident commander
  - new incident: disk space full on the depot:

```
ubuntu@ip-10-0-0-190:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/xvda1      7.8G  7.2G  175M  98% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           396M     0  396M   0% /run/user/1000
```
- 5:06 PM: Adam Jacob creates a new 1.5 TB EBS volume
- 5:09 PM: Adam Jacob attaches the new volume to the depot server
- 5:12 PM: brief outage announced in the Habitat Slack #general channel while the new volume was swapped in (a sketch of the swap follows this list):
  - nginx and the director stopped
  - files copied from the old volume to the new EBS volume
  - removed all files from the old volume
  - updated fstab to mount the new volume
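The notes above elide the exact commands and paths; what follows is a minimal sketch of the swap, assuming the AWS CLI, placeholder volume/instance IDs, the `/dev/xvdf` device, and the `/hab` mount point (device and mount point are taken from the `df` output later in the timeline):

```sh
# Create and attach the 1.5 TB EBS volume (IDs and AZ are placeholders).
aws ec2 create-volume --size 1500 --volume-type gp2 --availability-zone us-west-2a
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/xvdf

# On the depot host: format the new volume, copy data, and swap mounts.
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /mnt/new-hab
sudo mount /dev/xvdf /mnt/new-hab
sudo rsync -a /hab/ /mnt/new-hab/    # files copied to the new volume
sudo rm -rf /hab/*                   # removed all files from the old volume
sudo umount /mnt/new-hab
sudo mount /dev/xvdf /hab

# Persist the new mount across reboots.
echo '/dev/xvdf /hab ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
```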
- 5:15 PM: from an internal discussion with Jamie Winsor: “the aws instance resource in Terraform for the monolith doesn’t have the ebs_block_device stanza that the original gateway has”
- 5:18 PM: successful login to app.habitat.sh
- 5:21 PM: successful package install via `hab pkg install core/ruby`
- 5:21 PM: disk space incident resolved:

```
ubuntu@ip-10-0-0-190:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/xvda1      7.8G  3.0G  4.4G  41% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           396M     0  396M   0% /run/user/1000
/dev/xvdf       1.5T  4.9G  1.4T   1% /hab
```
- 5:22 PM: reports of upload errors on larger artifacts, unrelated to the disk space issue:

```
root@3f47518d40f6:/src/plans/results# hab pkg upload core-jruby-188.8.131.52-20160622160900-x86_64-linux.hart
» Uploading core-jruby-184.108.40.206-20160622160900-x86_64-linux.hart
→ Exists core/bash/4.3.42/20160612075613
→ Exists core/gcc-libs/5.2.0/20160612075020
→ Exists core/glibc/2.22/20160612063629
→ Exists core/jdk8/8u92/20160620143238
→ Exists core/linux-headers/4.3/20160612063537
→ Exists core/ncurses/6.0/20160612075116
→ Exists core/readline/6.3.8/20160612075601
↑ Uploading core-jruby-220.127.116.11-20160622160900-x86_64-linux.hart
83.00 MB / 83.00 MB \ [==========================================================================================================] 100.00 % 5.70 MB/s
Unexpected response from remote
✗✗✗
✗✗✗ 503 Service Unavailable
✗✗✗
```
- 5:30 PM: nginx/ELB seems healthy
- 5:50 PM: capturing and inspecting request/response data in Wireshark (a capture sketch follows)
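A capture along these lines would feed Wireshark; the interface, port, and filename here are assumptions, not the team’s actual command:

```sh
# Capture depot HTTP traffic to a pcap for inspection in Wireshark.
# -s 0 keeps full packets so request/response bodies are visible.
sudo tcpdump -i eth0 -s 0 -w depot-upload.pcap port 80
# Then open depot-upload.pcap in Wireshark and follow the HTTP streams.
```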
- 5:53 PM: uploaded files appear on disk even though a 503 was returned, suggesting the backend completes the upload and the 503 comes from a timeout in a proxy layer in front of it
- 5:55 PM: nginx `keepalive_timeout` is set to `20s`; bumped it up
- 5:58 PM: the ELB idle timeout is `60` seconds, setting it to `300`
- 5:58 PM: changing nginx `keepalive_timeout` to `300s` to match the ELB (a sketch of both changes follows)
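A sketch of both timeout changes, assuming the stock nginx config path and a placeholder ELB name (the actual names aren’t recorded in these notes):

```sh
# Bump nginx's keepalive_timeout from 20s to 300s and reload.
sudo sed -i 's/keepalive_timeout 20s;/keepalive_timeout 300s;/' /etc/nginx/nginx.conf
sudo nginx -s reload

# Raise the classic ELB idle timeout to 300 seconds to match.
aws elb modify-load-balancer-attributes \
    --load-balancer-name builder-api \
    --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":300}}'
```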
- 6:00 PM: nginx restarted; uploading a new jdk8 package to the depot still fails
- 6:08 PM: confirmed that our HTTP upload responses from Hyper are correct
- 6:12 PM: uploading directly to the ELB instead of going through Fastly works, pointing at Fastly as the source of the 503s:

```
[default:/src:0]# export HAB_DEPOT_URL="https://builder-api-690653005.us-west-2.elb.amazonaws.com/v1/depot"
[default:/src:0]# hab pkg upload ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
» Uploading ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
→ Exists core/glibc/2.22/20160612063629
→ Exists core/linux-headers/4.3/20160612063537
↑ Uploading ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
143.06 MB / 143.06 MB \ [========================================================================================================] 100.00 % 8.61 MB/s
✓ Uploaded metadave/jdk8/8u92/20160622180115
★ Upload of metadave/jdk8/8u92/20160622180115 complete.
```
- 6:45 PM: “between bytes” setting in Fastly set to 5 minutes
- 7:10 PM: ping Fastly support in IRC
- 7:13 PM: changing Fastly “connection time” = `300000` and “first byte” = `300000` (milliseconds; an API sketch follows)
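These settings correspond to Fastly’s per-backend `connect_timeout`, `first_byte_timeout`, and `between_bytes_timeout`, all in milliseconds. A sketch of applying them via the Fastly API; the service ID, version, and backend name are placeholders:

```sh
# Bump the backend timeouts on a draft service version, then activate it.
curl -X PUT "https://api.fastly.com/service/$SERVICE_ID/version/$VERSION/backend/origin" \
    -H "Fastly-Key: $FASTLY_API_KEY" \
    -d 'connect_timeout=300000' \
    -d 'first_byte_timeout=300000' \
    -d 'between_bytes_timeout=300000'

curl -X PUT "https://api.fastly.com/service/$SERVICE_ID/version/$VERSION/activate" \
    -H "Fastly-Key: $FASTLY_API_KEY"
```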
- 7:15 PM: successful upload with the new Fastly settings:

```
[default:/src:0]# time hab pkg upload ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
» Uploading ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
→ Exists core/glibc/2.22/20160612063629
→ Exists core/linux-headers/4.3/20160612063537
↑ Uploading ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
143.07 MB / 143.07 MB | [========================================================================================================================] 100.00 % 6.03 MB/s
✓ Uploaded metadave/jdk8/8u92/20160622190929
★ Upload of metadave/jdk8/8u92/20160622190929 complete.
```
- 7:17 PM: uploaded multiple large (140 MB) artifacts successfully
- 7:24 PM: uploaded several packages to determine exactly which setting fixed the issue
- 7:29 PM: upload success; the tweak to Fastly’s “time to first byte” resolves the issue
- 7:41 PM: second incident resolved, incident closed
Contributing factors:
- The depot filesystem had only 175 MB of disk space free. This prevented large file uploads and caused other miscellaneous errors in the builder-api service.
- Fastly wasn’t configured for large file uploads. Tweaking “time to first byte” in Fastly resolved the large file upload issues.
- The depot server doesn’t have 5xx monitoring.
Stabilization steps:
- Added 1.5 TB of disk space to the depot.
- Set “time to first byte” in Fastly to 300000 milliseconds.
Impact:
- Some uploads and downloads returned 503s/504s over roughly one hour.
- Replacing the disk with a new volume caused a depot outage of ~6 minutes.
- Uploading large artifacts would result in a 503, but retrying the upload would resolve the issue. This had been an issue since Habitat was released.
Corrective actions:
- Add 5xx monitoring to Fastly, or something that tests the full route through Fastly -> ELB -> EC2 (Dave Parfitt). A probe sketch follows this list.
- Update Terraform: the monolith doesn’t have the ebs_block_device stanza that the original gateway has (Joshua Timberman). A sketch of the stanza follows this list.
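A minimal 5xx probe through the public route could look like the following; the test URL and the alerting hook are assumptions:

```sh
#!/bin/sh
# Probe the depot through Fastly -> ELB -> EC2 and flag 5xx responses.
status=$(curl -s -o /dev/null -w '%{http_code}' \
    "https://app.habitat.sh/v1/depot/pkgs/core/ruby")
if [ "$status" -ge 500 ]; then
    echo "depot returned HTTP $status"   # wire into paging/alerting here
    exit 1
fi
```

And a sketch of the missing `ebs_block_device` stanza for the monolith’s `aws_instance` resource; the device name and values are assumptions based on the volume added during this incident:

```sh
# Stanza to merge into the monolith's existing aws_instance resource
# (shown via a heredoc for illustration only).
cat <<'EOF'
  ebs_block_device {
    device_name = "/dev/xvdf"
    volume_size = 1500
    volume_type = "gp2"
  }
EOF
```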