2016-07-11 - habitat.sh DNS issues
- This is a blameless Post Mortem.
- We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
- All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow up item.
Incident Leader: Dave Parfitt (DP)
habitat.sh DNS resolution issues, partial outage
All times in UTC
3:02 PM - sporadic reports that users can’t reach habitat.sh
- possibly ChefConf hotel wifi related
- possibly Chef VPN related
3:49 PM - incident declared
3:52 PM - Joshua Timberman investigated Route53
3:53 PM - The issue seems to affect people using resolvers other than Google DNS, such as FreeDNS
3:58 PM - team decision: we’d like to resolve the issue as quickly as possible for ChefConf demos, possibly doing things manually for now. We’ll circle back and automate what we need via Terraform etc
4:07 PM - DP updating @opscode_status + Tumblr
4:08 PM - Josh Brand, Steven Danna, Nathan Smith discuss removing the DEPRECATED hosted zone in Route53
4:34 PM - (Josh Brand) the nameserver records for Gandi are actually pointed at the Chef Secure zone, not the Habitat zones
4:34 PM - (Josh Brand) no, Gandi doesn’t point to Chef Secure either
4:40 PM - DEPRECATED hosted zone has been removed (Joshua)
4:43 PM - deleting habitat.sh zone from chef-secure account (Josh Brand)
4:46 PM - ad hoc Pingdom DNS test still fails
- we later remove this test as Pingdom doesn’t seem to cover our failure case
4:48 PM - team runs https://cachecheck.opendns.com/ to see failures from around the world.
4:50 PM - Ben Rockwood: Gandi looks good, but now the NS records on the zone don’t match the real DNS servers
NS should be: ns-580.awsdns-08.net ns-233.awsdns-29.com ns-1057.awsdns-04.org ns-1793.awsdns-32.co.uk
4:52 PM - Josh Brand - NS record updated, however it has a 172800s (48-hour) TTL
4:55 PM - OpenDNS check seems happy
5:15 PM - we think the root DNS issue has been resolved, but it may take a while for the fix to propagate.
5:15 PM - paging folks at ChefConf to check site availability
- Ben Rockwood checking wifi outside of ChefConf hotel (Starbucks)
5:33 PM - updating @opscode_status to declare the issue as resolved
5:36 PM - incident closed
- NS records in Route53 zone didn’t match those of the real DNS servers
- There were multiple hosted zones in Route53, one of which was named DEPRECATED. While removing this may not have resolved the issue, it did help clarify the issues.
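The NS mismatch described above can be illustrated with a minimal sketch. It assumes the two name server lists (the delegation at the registrar and the NS record inside the hosted zone) have already been collected by hand or by some other tool; it only compares them. The function names and the stale entry `ns-0000.awsdns-00.example.` are hypothetical and do not come from the incident.

```python
from typing import Dict, Iterable, List


def normalize(name: str) -> str:
    """Lowercase a host name and drop the trailing root dot, if any."""
    return name.strip().rstrip(".").lower()


def compare_ns_sets(delegated: Iterable[str], zone: Iterable[str]) -> Dict[str, List[str]]:
    """Report name servers that appear in only one of the two lists.

    `delegated` is the list published at the registrar, `zone` is the
    NS record stored in the hosted zone. Both are plain strings here;
    fetching them is out of scope for this sketch.
    """
    d = {normalize(n) for n in delegated}
    z = {normalize(n) for n in zone}
    return {
        "only_in_delegation": sorted(d - z),
        "only_in_zone": sorted(z - d),
    }


# Name servers recorded in the timeline as the correct set.
EXPECTED_NS = [
    "ns-580.awsdns-08.net",
    "ns-233.awsdns-29.com",
    "ns-1057.awsdns-04.org",
    "ns-1793.awsdns-32.co.uk",
]

# Hypothetical stale zone record: the last entry is an invented placeholder.
stale_zone_ns = EXPECTED_NS[:3] + ["ns-0000.awsdns-00.example."]

if __name__ == "__main__":
    print(compare_ns_sets(EXPECTED_NS, stale_zone_ns))
```

An empty result in both lists means the two sides agree. Any entry in either list is the kind of mismatch that caused resolution to fail for some resolvers during this incident.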
- removal of the DEPRECATED Route53 hosted zone.
- update Habitat Route53 zone to match real DNS servers
- Unsure of the impact. DNS hasn’t been touched since the initial release of Habitat. We had a few mentions of DNS issues since the launch, but nothing that affected more than one person.
- None at this time. The team has decided that the cost to enable monitoring for a situation as described in this PM would exceed the benefit gained.