Postmortem for app.habitat.sh depot upload/download errors 2016-06-23

Tags: outage, depot, ops

#1

Meeting

  1. This is a blameless postmortem.
  2. We will not focus on past events in terms of “could’ve”, “should’ve”, etc.
  3. All follow-up action items will be assigned to a team or individual before the end of the meeting. If an item is not going to be a top priority leaving the meeting, don’t make it a follow-up item.

Incident Leader: Dave Parfitt

Description

The Habitat team dealt with two issues during this incident:

  • The Habitat depot was returning HTTP 503 and 504 errors on package download.
  • The Habitat depot was returning an HTTP 503 after large package uploads.

Timeline

All times UTC.

  • 5:02 PM: Adam Jacob declares the incident; Dave Parfitt is incident commander.
    New incident: disk space full on the depot:
ubuntu@ip-10-0-0-190:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/xvda1      7.8G  7.2G  175M  98% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           396M     0  396M   0% /run/user/1000
  • 5:06 PM: Adam Jacob creates a new EBS volume w/ 1.5 TB

  • 5:09 PM: Adam Jacob attaches new volume to depot server

  • 5:12 PM: brief outage announced in Habitat Slack #general channel.

    • nginx and the director stopped
    • files copied from /hab to /mnt/hab
    • removed all files from /hab
    • updated fstab
    • unmounted /mnt/hab
    • mounted /hab
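The volume swap above, sketched as shell commands. This is a reconstruction for illustration, not the operator's actual session: the device name (/dev/xvdf), filesystem type, and service names are assumptions.

```
# Hedged sketch of the volume swap; /dev/xvdf and the service names are assumptions.
sudo service nginx stop && sudo service hab-director stop
sudo mkfs -t ext4 /dev/xvdf                  # format the new 1.5 TB EBS volume
sudo mount /dev/xvdf /mnt/hab
sudo rsync -a /hab/ /mnt/hab/                # copy package data onto the new volume
sudo rm -rf /hab/*                           # free space on the root filesystem
echo '/dev/xvdf /hab ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
sudo umount /mnt/hab
sudo mount /hab                              # fstab entry mounts the new volume at /hab
sudo service hab-director start && sudo service nginx start
```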
  • 5:15 PM: from an internal discussion w/ Jamie Winsor “the aws instance resource in Terraform for the monolith doesn’t have the ebs_block_device stanza that the original gateway has”

  • 5:18 PM: successful login to app.habitat.sh

  • 5:21 PM: successful package install via hab pkg install core/ruby

  • 5:21 PM: disk space incident resolved.

ubuntu@ip-10-0-0-190:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/xvda1      7.8G  3.0G  4.4G  41% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           396M     0  396M   0% /run/user/1000
/dev/xvdf       1.5T  4.9G  1.4T   1% /hab
  • 5:22 PM: reports of upload errors on larger artifacts, unrelated to disk space issue:
root@3f47518d40f6:/src/plans/results# hab pkg upload core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
» Uploading core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
→ Exists core/bash/4.3.42/20160612075613
→ Exists core/gcc-libs/5.2.0/20160612075020
→ Exists core/glibc/2.22/20160612063629
→ Exists core/jdk8/8u92/20160620143238
→ Exists core/linux-headers/4.3/20160612063537
→ Exists core/ncurses/6.0/20160612075116
→ Exists core/readline/6.3.8/20160612075601
↑ Uploading core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
   83.00 MB / 83.00 MB \ [==========================================================================================================] 100.00 % 5.70 MB/s
Unexpected response from remote
✗✗✗
✗✗✗ 503 Service Unavailable
✗✗✗
  • 5:30 PM: nginx/ELB seems healthy
  • 5:50 PM: inspecting request/response data in Wireshark
  • 5:53 PM: uploaded files appear on disk even though a 503 was returned
  • 5:55 PM: nginx keepalive_timeout is set to 20s, bumped to 60s
  • 5:58 PM: ELB Idle timeout is 60 seconds, setting to 300 seconds
  • 5:58 PM: changing nginx keepalive_timeout to 300s to match ELB
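For reference, the keepalive change is a single nginx directive; a minimal fragment (the surrounding configuration is assumed, not copied from the depot server):

```
http {
    # Keep idle client connections open as long as the ELB does (300 s),
    # so nginx does not close a connection the load balancer still considers live.
    keepalive_timeout 300s;
}
```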
  • 6:00 PM: nginx restarted; an attempted upload of a new jdk8 package to the metadave origin failed
  • 6:08 PM: confirmed that our HTTP upload responses from Hyper are correct
  • 6:12 PM: Uploading directly to the ELB instead of through Fastly works:
[13][default:/src:0]# export HAB_DEPOT_URL="https://builder-api-690653005.us-west-2.elb.amazonaws.com/v1/depot"
[14][default:/src:0]# hab pkg upload ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
» Uploading ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
→ Exists core/glibc/2.22/20160612063629
→ Exists core/linux-headers/4.3/20160612063537
↑ Uploading ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
   143.06 MB / 143.06 MB \ [========================================================================================================] 100.00 % 8.61 MB/s  ✓ Uploaded metadave/jdk8/8u92/20160622180115
★ Upload of metadave/jdk8/8u92/20160622180115 complete.
  • 6:45 PM: “between bytes” setting in Fastly set to 5 minutes
  • 7:10 PM: ping Fastly support in IRC
  • 7:13 PM: changing Fastly “connection time” and “first byte” timeouts to 300000 ms (5 minutes)
  • 7:15 PM: successful upload with new Fastly settings:
[28][default:/src:0]# time hab pkg upload ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
» Uploading ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
→ Exists core/glibc/2.22/20160612063629
→ Exists core/linux-headers/4.3/20160612063537
↑ Uploading ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
   143.07 MB / 143.07 MB | [========================================================================================================================] 100.00 % 6.03 MB/s  ✓ Uploaded metadave/jdk8/8u92/20160622190929
★ Upload of metadave/jdk8/8u92/20160622190929 complete.
  • 7:17 PM: uploaded multiple large (140MB) artifacts successfully
  • 7:24 PM: uploaded several packages to try to determine exactly which setting fixed the issue
  • 7:29 PM: uploads succeeded; the tweak to Fastly’s time to first byte resolved the issue.
  • 7:41 PM: 2nd incident resolved, incident closed

Contributing Factor(s)

  • The depot filesystem had only 175 MB of free disk space. This blocked large file uploads and caused miscellaneous errors in the builder-api service.

  • Fastly wasn’t configured for large file uploads. Tweaking “time to first byte” in Fastly resolved large file upload issues.

  • The depot server doesn’t have 5xx monitoring.

Stabilization Steps

  • Added 1.5TB of disk space to the depot.
  • Set “time to first byte” in Fastly to 300000 milliseconds.

Impact

  • Some uploads and downloads returned 503/504 errors over a period of roughly one hour.
  • Replacing the disk with a new volume caused a depot outage of ~6 minutes.
  • Uploading large artifacts would result in a 503, but retrying the upload would succeed. This had been an issue since Habitat was released.

Corrective Actions

  • Add 5xx monitoring to Fastly, or something that tests the route through Fastly -> ELB -> EC2 (Dave Parfitt)
  • Update Terraform - the monolith doesn’t have the ebs_block_device stanza that the original gateway has (Joshua Timberman)
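A hedged sketch of what the missing ebs_block_device stanza might look like, assuming the 1.5 TB volume created during the incident; the device name, size, and volume type here are illustrative, not copied from the actual Terraform config:

```
resource "aws_instance" "monolith" {
  # ... existing instance configuration ...

  # Dedicated EBS volume for /hab so package storage does not share
  # the 8 GB root filesystem (the trigger for this incident).
  ebs_block_device {
    device_name = "/dev/xvdf"
    volume_size = 1536   # GB, ~1.5 TB
    volume_type = "gp2"
  }
}
```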