Postmortem for habitat.sh DNS issues on 2016-07-22

outage
ops
postmortem

#1

2016-07-22 - Habitat DNS issue

Start every PM stating the following

  1. This is a blameless Post Mortem.
  2. We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
  3. All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow up item.

Incident Leader: Dave Parfitt

Description

Habitat.sh DNS issues

Timeline

All times in UTC

Searching for core/hab-pkg-dockerize in remote https://willem.habitat.sh/v1/depot
» Installing core/hab-pkg-dockerize
✗✗✗
✗✗✗ failed to lookup address information: Try again
✗✗✗
  • 6:48 PM: Dave Parfitt checks https://cachecheck.opendns.com/; app.habitat.sh and willem.habitat.sh are returning SERVFAIL from around the world.
  • 6:49 PM: Jamie Winsor asks if DNS entries were entered manually since the last incident.
    • Route53 DNS entries WERE entered manually; it was determined at the previous postmortem that no action was needed to update Terraform.
  • 6:51 PM: Dave Parfitt declares the incident, starts a zoom session.
  • 6:53 PM: Jamie Winsor updates the Terraform DNS info via https://github.com/habitat-sh/cloud-environments/pull/13
  • 6:54 PM: PR has been applied via Terraform
  • 7:09 PM: Route53 NS records are correct
  • 7:13 PM: TTL is 172800 (2 days)
  • 7:28 PM: periodically checking DNS via https://cachecheck.opendns.com/
  • 7:28 PM: contacted Chef ops, including Ben Rockwood, Josh Brand, Mark Harrison
  • 7:30 PM: Mark Harrison suggests clicking the “Refresh Cache” button on the OpenDNS check page.
  • 7:32 PM: Josh Brand flushes the Google DNS cache
  • 7:33 PM: all OpenDNS checks return success
  • 7:36 PM: incident closed
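
The manual checks in the timeline can be approximated from a single machine with a short script. This is only a sketch: it asks the local resolver, not resolvers around the world as cachecheck.opendns.com does, and the port (443) and the status labels are assumptions made for illustration. The `EAI_AGAIN` code appears to be the condition behind the "Try again" message in the hab output above.

```python
import socket

# Hostnames taken from the timeline above.
HOSTS = ["app.habitat.sh", "willem.habitat.sh"]


def check_host(host, resolver=socket.getaddrinfo):
    """Classify one lookup as 'ok', 'temporary failure', or 'failed'."""
    try:
        # Port 443 is an assumption (the depot URL is https).
        resolver(host, 443)
        return "ok"
    except socket.gaierror as err:
        # EAI_AGAIN signals a temporary resolution failure ("Try again").
        if err.errno == socket.EAI_AGAIN:
            return "temporary failure"
        return "failed"


def check_hosts(hosts, resolver=socket.getaddrinfo):
    """Check every hostname and return a {host: status} mapping."""
    return {host: check_host(host, resolver) for host in hosts}


if __name__ == "__main__":
    for host, status in check_hosts(HOSTS).items():
        print(f"{host}: {status}")
```

The `resolver` parameter exists only so the lookup can be swapped out when testing; by default the script uses the system resolver.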

Contributing Factor(s)

Changes applied manually during the ChefConf Habitat DNS issue were not committed to the Terraform repo.

Stabilization Steps

Apply the correct DNS settings to the Habitat Terraform repo.

Impact

  • Some users couldn’t access the site. Terraform apply had been run 3 days prior, with a 2 day TTL.
  • Hooks from GitHub couldn’t reach the site.
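
To make the TTL figure concrete, here is a small worked example. It only does the arithmetic: 172800 seconds is 2 days, so a resolver that honors the TTL may keep serving a cached answer for up to 2 days after it fetched it. The function name and the example timestamp are hypothetical, chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

TTL_SECONDS = 172800  # TTL reported in the timeline


def latest_cache_expiry(cached_at, ttl_seconds=TTL_SECONDS):
    """Latest time a resolver that cached a record at `cached_at` may still serve it."""
    return cached_at + timedelta(seconds=ttl_seconds)


if __name__ == "__main__":
    print(timedelta(seconds=TTL_SECONDS).days)  # 2

    # Hypothetical example: a record cached when the incident was declared.
    cached_at = datetime(2016, 7, 22, 18, 51, tzinfo=timezone.utc)
    print(latest_cache_expiry(cached_at).isoformat())  # 2016-07-24T18:51:00+00:00
```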

Corrective Actions

  • process for updating Terraform pinned to the Habichat #operations room
  • clarify what should and shouldn’t be applied through Terraform for Habitat:
    • everything to do with AWS is applied through TF
    • Fastly’s TF provider doesn’t suit our needs
  • if there are changes that need to be made manually, escalate to the owner of the project.
    • there shouldn’t be manual changes.
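
One possible way to notice when live infrastructure and the Terraform repo disagree is to run `terraform plan -detailed-exitcode` on a schedule and look at the exit code (0: no changes, 1: error, 2: changes present). The sketch below assumes a checked-out working directory that has already been initialized; the function names and status labels are hypothetical. It only covers resources Terraform already manages, so a record created by hand outside Terraform would not show up.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {
    0: "in sync",          # plan succeeded, no changes
    1: "error",            # plan failed
    2: "changes pending",  # plan succeeded, live state differs from config
}


def interpret_plan_exit_code(code):
    """Map a plan exit code to a short status label (labels are illustrative)."""
    return PLAN_STATUS.get(code, "unknown")


def check_plan(workdir):
    """Run `terraform plan` in `workdir` (a hypothetical path) and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A "changes pending" result would be the cue to reconcile the repo with what is actually running, or to escalate to the project owner as described above.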

Link to meeting recording

Note: videos will soon move to YouTube; the following link will contain the latest:


#2

Hello -

I’ve attached a permanent YouTube link for the postmortem on 2016-07-22.

Cheers -
Dave