Even the best designed systems can and will have outages. No matter how well you’ve hardened your infrastructure and put in place failover or self-healing automation, something you didn’t see coming will wreak havoc in your special snowflake of a system. In many cases a human is likely to be a contributing factor. In fact, Gartner has predicted that in 2015, 80% of outages will be caused by people and process issues.
Are you considering the Human element when revisiting incidents and outages with your infrastructure? If so, are you approaching it with a blameless mindset focused on removing the many forms of bias and searching for absolute truth. Do you believe that there is always a root cause to outages or is it more accurate to seek out additional aspects that may have contributed to the incident, especially with regard to the people and processes?
Regardless of your approach, the point of a postmortem is to accurately describe the "story" about what took place in as much detail as possible. The good, the bad, those involved, conversations had, actions taken, related timestamps, who was on-call, etc. You want to know absolutely everything that took place that was related in some degree so that you can review the data and learn from it.
How do we ensure that we are asking the right questions and seeking out relevant and important information that will help us understand what took place and ultimately how to become a better team, company, and product as a result?
The blameless culture (specifically blameless postmortems) is a topic of interest to many in the middle of a DevOps transformation within their organization. I'll outline important best practices for conducting effective postmortems and demonstrate methods to measure benefits from adopting postmortems especially those of a "blameless" nature.