2016-07-22 - Habitat DNS issue
Start every PM stating the following
- This is a blameless Post Mortem.
- We will not focus on the past events as they pertain to "could've", "should've", etc.
- All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.
Incident Leader: Dave Parfitt
Habitat.sh DNS issues
All times in UTC
Searching for core/hab-pkg-dockerize in remote https://willem.habitat.sh/v1/depot
» Installing core/hab-pkg-dockerize
✗✗✗ failed to lookup address information: Try again
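The failure above can be checked from any affected machine with a small resolver probe. This is only a minimal sketch: the helper name `check_hosts` and its injectable `resolve` parameter are hypothetical, not part of any Habitat tooling, and treating the "Try again" message as a temporary resolver failure (`EAI_AGAIN`) is an assumption.

```python
import socket


def check_hosts(hosts, resolve=socket.getaddrinfo):
    """Return a {hostname: status} dict describing whether each name resolves.

    `resolve` defaults to socket.getaddrinfo; it is injectable (hypothetical
    design choice) so the logic can be exercised without network access.
    """
    results = {}
    for host in hosts:
        try:
            resolve(host, 443)
            results[host] = "ok"
        except socket.gaierror as err:
            # Assumption: EAI_AGAIN marks a temporary resolver failure,
            # which is what an upstream SERVFAIL would typically produce.
            temporary = err.errno == getattr(socket, "EAI_AGAIN", None)
            results[host] = "temporary failure" if temporary else "failed"
    return results
```

Calling `check_hosts(["app.habitat.sh", "willem.habitat.sh"])` would report the status for the two hostnames named in this incident.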
- 6:48 PM: Dave Parfitt checks https://cachecheck.opendns.com/; app.habitat.sh and willem.habitat.sh are returning SERVFAIL from around the world.
- 6:49 PM: Jamie Winsor asks if DNS entries were entered manually since the last incident.
- Route53 DNS entries WERE entered manually; it was determined at the previous postmortem that no actions were needed to update Terraform.
- 6:51 PM: Dave Parfitt declares the incident, starts a zoom session.
- 6:53 PM: Jamie Winsor updates the Terraform DNS info via https://github.com/habitat-sh/cloud-environments/pull/13
- 6:54 PM: the PR has been applied with Terraform
- 7:09 PM: Route53 NS records are correct
- 7:13 PM: TTL is 172800 seconds (2 days)
- 7:28 PM: periodically checking DNS via https://cachecheck.opendns.com/
- 7:28 PM: contacted Chef ops, including Ben Rockwood, Josh Brand, Mark Harrison
- 7:30 PM: Mark Harrison suggests clicking the "Refresh Cache" button on the OpenDNS check page.
- 7:32 PM: Josh Brand flushes the Google DNS cache
- 7:33 PM: all OpenDNS checks return success
- 7:36 PM: incident closed
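The 172800 TTL noted in the timeline explains why flushing resolver caches mattered after the fix was applied. As a rough sketch, assuming a resolver cached a stale record just before the 6:54 PM apply and honors the full TTL, the arithmetic looks like this (the function name `cache_expiry_bound` is hypothetical):

```python
from datetime import datetime, timedelta, timezone

TTL_SECONDS = 172800  # TTL value noted in the timeline


def cache_expiry_bound(cached_at, ttl_seconds=TTL_SECONDS):
    """Latest time a TTL-honoring resolver may keep a record cached at `cached_at`."""
    return cached_at + timedelta(seconds=ttl_seconds)


# Assumption: the stale record was cached right before the Terraform apply
# at 6:54 PM UTC on the incident date.
applied_at = datetime(2016, 7, 22, 18, 54, tzinfo=timezone.utc)

print(timedelta(seconds=TTL_SECONDS).days)         # 2
print(cache_expiry_bound(applied_at).isoformat())  # 2016-07-24T18:54:00+00:00
```

Under that assumption, a resolver that was never flushed could have kept serving old data for up to two more days, which is consistent with the checks only succeeding after the OpenDNS and Google caches were refreshed.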
Changes applied manually during the ChefConf Habitat DNS issue were not committed to the Terraform repo.
Apply the correct DNS settings to the Habitat Terraform repo.
- Some users couldn't access the site. Terraform apply had been run 3 days prior, with a 2 day TTL.
- hooks from GitHub couldn't reach the site
- process for updating Terraform pinned to the Habichat #operations room
- check the habitat-sh/cloud-environments README (Dave) DONE
- clarify what should and shouldn't be applied through Terraform for Habitat:
- everything to do with AWS is applied through TF
- Fastly's TF provider doesn't suit our needs
- if there are changes that need to be made manually, escalate to the owner of the project.
- there shouldn't be manual changes.
Link to meeting recording
Note: videos will soon move to YouTube; the following link will contain the latest: