2016-07-11 - habitat.sh DNS issues
- This is a blameless Post Mortem.
- We will not focus on the past events as they pertain to "could've", "should've", etc.
- All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.
Incident Leader: Dave Parfitt (DP)
habitat.sh DNS resolution issues, partial outage
All times in UTC
- 3:02 PM - sporadic reports that users can't reach
- possibly ChefConf hotel wifi related
- possibly Chef VPN related
- 3:49 PM - incident declared
- 3:52 PM - Joshua Timberman investigated Route53
- 3:53 PM - The issue seems to affect people using resolvers other than Google DNS, such as FreeDNS
- 3:58 PM - team decision: we'd like to resolve the issue as quickly as possible for ChefConf demos, possibly doing things manually for now. We'll circle back and automate what we need via Terraform, etc.
- 4:07 PM - DP updating @opscode_status + Tumblr
- 4:08 PM - Josh Brand, Steven Danna, Nathan Smith discuss removing the DEPRECATED hosted zone in Route53
- 4:34 PM - (Josh Brand) the nameserver records for Gandi are actually pointed at the Chef Secure zone, not the Habitat zones
- 4:34 PM - (Josh Brand) no, Gandi doesn't point to Chef Secure either
- 4:40 PM - DEPRECATED hosted zone has been removed (Joshua)
- 4:43 PM - deleting habitat.sh zone from chef-secure account (Josh Brand)
- 4:46 PM - ad hoc Pingdom DNS test still fails
- we later remove this test as Pingdom doesn't seem to cover our failure case
- 4:48 PM - team runs https://cachecheck.opendns.com/ to see failures from around the world.
- 4:50 PM - Ben Rockwood: Gandi looks good, but now the NS records on the zone don't match the real DNS servers
NS should be:
- 4:52 PM - Josh Brand - NS records updated, however they have a 172800s (48-hour) TTL
- 4:55 PM - OpenDNS check seems happy
- 5:15 PM - we think the root DNS issue has been resolved, but it may take a while for the fix to propagate.
- 5:15 PM - paging folks at ChefConf to check site availability
- Ben Rockwood checking wifi outside of ChefConf hotel (Starbucks)
- 5:33 PM - updating @opscode_status to declare the issue as resolved
- 5:36 PM - incident closed
- NS records in Route53 zone didn't match those of the real DNS servers
- There were multiple hosted zones in Route53, one of which was named DEPRECATED. While removing it may not have resolved the issue, it did help clarify the situation.
- removal of DEPRECATED Route53 hosted zone.
- update Habitat Route53 zone to match real DNS servers
- Unsure of the impact. DNS hasn't been touched since the initial release of Habitat. We had a few mentions of DNS issues since the launch, but none that affected more than one person.
- None at this time. The team has decided that the cost to enable monitoring for a situation as described in this PM would exceed the benefit gained.
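
For reference, the mismatch noted in the root cause (NS records in the zone not matching the real DNS servers) can be checked by hand by comparing two lists of nameserver names. The following is only a minimal sketch, not something the team ran during the incident: the function names `normalize_ns` and `compare_ns` are hypothetical, the hostnames are placeholders rather than the real habitat.sh nameservers, and the two lists are assumed to have been copied manually from the registrar and from the hosted zone.

```python
def normalize_ns(name):
    """Lowercase a nameserver hostname and drop whitespace and the trailing dot."""
    return name.strip().rstrip(".").lower()


def compare_ns(delegated, zone):
    """Compare the NS set at the registrar with the NS set in the hosted zone.

    Both arguments are lists of hostnames (hypothetical inputs, gathered by hand).
    Returns a dict with the names found on only one side and a 'match' flag.
    """
    delegated_set = {normalize_ns(n) for n in delegated}
    zone_set = {normalize_ns(n) for n in zone}
    return {
        "only_delegated": sorted(delegated_set - zone_set),
        "only_zone": sorted(zone_set - delegated_set),
        "match": delegated_set == zone_set,
    }


if __name__ == "__main__":
    # Placeholder names only; substitute the values from the registrar and the zone.
    registrar_ns = ["ns-1.example.net.", "ns-2.example.net."]
    zone_ns = ["ns-1.example.net", "ns-3.example.org"]
    print(compare_ns(registrar_ns, zone_ns))
```

A result with `match` set to `False` lists the names present on only one side, which is the condition that was corrected at 4:52 PM. Because of the 48-hour TTL on the NS records, resolvers may keep serving the old answer for some time after the two sides agree.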
Link to meeting recording
Link to #incident discussion