Handling Incidents



Incident Response

When an outage or other incident occurs:

  • If you are not On-Call, notify whoever is currently On-Call
  • Discussion of incident response should occur in the public maintainer channel #core-dev
  • Log into StatusPage.io (if you do not have a login, contact a core team member)
  • Click on “Dashboard” in the left-hand menu (if it is not already selected)
  • Give the incident a name, and make sure the status is set to “Investigating”
  • Include some details in the Message
  • Click “Create Incident”
  • Creating the incident will auto-post to the #general channel of the Habitat Slack
  • For non-security incidents, keep the incident-related chatter in #core-dev
  • Make sure to also check the #operations channel when paged
  • Creating the incident will also auto-post to the @habitatsh Twitter account
  • Throughout the incident, update the incident status when the issue is identified, when work to correct it has started, when the fix is being monitored, and especially when the issue is resolved.
  • After the incident, update the incident log located at https://docs.google.com/document/d/1EKZmvBqBAxU2K9qc3gCfhq411mRLAeirS_QwPMQYkMc/edit?usp=sharing
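
The steps above are performed by hand in the StatusPage dashboard. Purely as an illustration, the same create-and-update flow could be sketched in Python. The sketch assumes the Statuspage REST API (v1), its OAuth-style Authorization header, and the status values "investigating", "identified", "monitoring", and "resolved". The page ID, API key, and all function names are hypothetical placeholders and are not part of this runbook. The request is only built, never sent.

```python
import json
import urllib.request

# Assumption: the Statuspage REST API v1 root. Not taken from this runbook.
API_ROOT = "https://api.statuspage.io/v1"

# Assumed status values, in the order an incident normally moves through them.
STATUSES = ["investigating", "identified", "monitoring", "resolved"]


def build_incident_payload(name, status, message):
    """Return the JSON body for creating or updating an incident."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return {"incident": {"name": name, "status": status, "body": message}}


def next_status(current):
    """Return the status that follows `current`, or None once resolved."""
    i = STATUSES.index(current)
    return STATUSES[i + 1] if i + 1 < len(STATUSES) else None


def create_incident_request(page_id, api_key, payload):
    """Build (but do not send) the HTTP request that would create the incident.

    `page_id` and `api_key` are hypothetical placeholders.
    """
    return urllib.request.Request(
        url=f"{API_ROOT}/pages/{page_id}/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    payload = build_incident_payload(
        "Builder API errors", "investigating", "We are looking into it."
    )
    request = create_incident_request("YOUR_PAGE_ID", "YOUR_API_KEY", payload)
    print(request.get_method(), request.full_url)
    print(json.dumps(payload, indent=2))
```

Passing the built request to urllib.request.urlopen would actually create the incident, so only do that with real credentials and during a real incident. The dashboard remains the documented procedure.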