Towards Operational Excellence

Once systems are designed, implemented, and tested, we come to what is arguably one of the hardest aspects in the lifecycle of a system: bringing it to life and sustaining it in operations. In this series of posts, I’ll discuss Operational Excellence, focusing on the three essential interconnecting elements that enable you to successfully operate the technology you’ve built — Culture, Tools, and Processes.

Adrian Hornsby

June 17, 2020

Transcript

  1. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved. Towards Operational Excellence Adrian Hornsby Principal Evangelist - Architecture Amazon Web Services @adhorn
  2. What is Operational Excellence? • Happy customers! • Consistently exceeding operational goals • Anticipating and addressing problems • Effectively responding to operational issues • Continuously improving …and doing all of this at significant scale.
  3. Culture: Amazon Leadership Principles 1. Customer Obsession 2. Ownership 3. Invent and Simplify 4. Are Right, A Lot 5. Hire and Develop the Best 6. Insist on the Highest Standards 7. Think Big 8. Bias for Action 9. Frugality 10. Learn and Be Curious 11. Earn Trust 12. Dive Deep 13. Have Backbone; Disagree and Commit 14. Deliver Results https://www.amazon.jobs/en/principles
  4. Culture: Amazon Leadership Principles 1. Customer Obsession 2. Ownership 3. Invent and Simplify 4. Are Right, A Lot 5. Hire and Develop the Best 6. Insist on the Highest Standards 7. Think Big 8. Bias for Action 9. Frugality 10. Learn and Be Curious 11. Earn Trust 12. Dive Deep 13. Have Backbone; Disagree and Commit 14. Deliver Results
  5. Culture: Amazon Leadership Principles 1. Customer Obsession 2. Ownership 3. Invent and Simplify 4. Are Right, A Lot 5. Hire and Develop the Best 6. Insist on the Highest Standards 7. Think Big 8. Bias for Action 9. Frugality 10. Learn and Be Curious 11. Earn Trust 12. Dive Deep 13. Have Backbone; Disagree and Commit 14. Deliver Results
  6. 2 Pizza Team Responsibilities Responsible for Their product Deployment tools CI/CD tools Monitoring tools Metrics tool Logging tools APM tools Infrastructure provisioning tools Security tools Database management tools Testing tools …. Not responsible for * *Unless their product belongs in the blue
  7. Tools to Operate the Cloud • Test Automation • Configuration Management • Software Deployment • Monitoring and Visualization • Reporting • Change Management • Incident Management • Trouble Ticketing • Security Auditing • Forecasting and Planning
  8. Calling Houston… Website Deployment team “website-push” perl script Command line tools Hand build Hand deploy to NFS % /opt/amazon/customer-service/bin/request-refund
  9. Conway’s law Architecture Organization THEIR PRODUCT Deployment tools CI/CD tools Monitoring tools Metrics tool Logging tools APM tools Infrastructure provisioning tools Security tools Database management tools Testing tools …. “Organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.” - M. Conway
  10. Culture: Amazon Leadership Principles 1. Customer Obsession 2. Ownership 3. Invent and Simplify 4. Are Right, A Lot 5. Hire and Develop the Best 6. Insist on the Highest Standards 7. Think Big 8. Bias for Action 9. Frugality 10. Learn and Be Curious 11. Earn Trust 12. Dive Deep 13. Have Backbone; Disagree and Commit 14. Deliver Results
  11. • Deployment service • No downtime deployments • Health checking • Versioned artifacts and rollbacks https://www.allthingsdistributed.com/2014/11/apollo-amazon-deployment-engine.html
  12. Pipelines • Path code takes from check-in to production • Where automation, testing, and approvals happen • Enabler of continuous deployment
  13. Example Pipeline and Stages [Screenshot of a pipeline view. Stages: Packages, VersionSet, Gamma, PDX-Prod, Prod - Rest. Labels shown: Revision history, Status, Approval status, Diff, Compliance verification, L1 approval, L2 approval, Deploy when ready, Cancel, Whitelisting, Approval Workflow, Approve]
  14. “Oh! Those tables always come back, and they’re always damaged. They’re not packaged right, so the surface of the table always gets scratched.”
  15. Toyota will not allow any defect that they know about to go down the manufacturing line.
  16. Jeff Bezos 2012 Shareholder Letter We noticed that you experienced poor video playback while watching the following rental on Amazon Video On Demand: Casablanca. We’re sorry for the inconvenience and have issued you a refund for the following amount: $2.99. We hope to see you again soon.
  17. Correction of Errors (COE) Mechanism to learn from our mistakes • technical flaws • process flaws • documentation flaws • organizational flaws • other flaws Mechanism to identify contributing factors to failures Mechanism to drive CONTINUOUS IMPROVEMENT
  18. Anatomy of a COE • What happened? • What data do you have to support this? • Metrics and graphs • What was the impact on customers and your business? • What are the contributing factors? • Don’t stop at operators. • What lessons did you learn? • What corrective actions are you taking? • Action items • Related items (trouble tickets etc.) https://www.youtube.com/watch?v=yQiRli2ZPxU
  19. Culture: Amazon Leadership Principles 1. Customer Obsession 2. Ownership 3. Invent and Simplify 4. Are Right, A Lot 5. Hire and Develop the Best 6. Insist on the Highest Standards 7. Think Big 8. Bias for Action 9. Frugality 10. Learn and Be Curious 11. Earn Trust 12. Dive Deep 13. Have Backbone; Disagree and Commit 14. Deliver Results
  20. Audit Weekly Operational Metrics Review • Continuous inspection mechanism • Maintains focus on operations • Foundation of a healthy operations program Typical Agenda (~15min) • Share successes and failings • Action items follow up • Review COEs • Review key service metrics • Identify new best practices https://aws.amazon.com/blogs/opensource/the-wheel/
  21. Policy Engine • Automated risk and opportunity analyzer • Identifies potential risks to availability, infrastructure, security and more • Both inherited and direct • Highlights potential opportunities to optimize resource utilization • Extensible and configurable • Provides single-pane-of-glass view into policy compliance • Allows acknowledgment • Reports roll up the organization hierarchy Mechanism to propagate local learnings globally
  22. In conclusion... Achieving operational excellence requires: • an operationally focused culture • a rich set of tools • the right processes • Good Intentions Don’t Work • Mechanisms Work
  23. “The world, thankfully, is full of many high-performing, highly distinctive corporate cultures. We never claim that our approach is the right one – just that it’s ours – and over the last two decades, we’ve collected a large group of like-minded people. Folks who find our approach energizing and meaningful.” Jeff Bezos - 2015 Amazon.com letter to shareholders
  24. Thank you! Adrian Hornsby @adhorn https://medium.com/@adhorn https://dev.to/adhorn