Towards Operational Excellence

Once systems are designed, implemented, and tested, we come to what is arguably one of the hardest aspects in the lifecycle of a system: bringing it to life and sustaining it in operations. In this series of posts, I’ll discuss Operational Excellence, focusing on the three essential interconnecting elements that enable you to successfully operate the technology you’ve built — Culture, Tools, and Processes.

Adrian Hornsby

June 17, 2020

Transcript

  2. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    Towards Operational Excellence
    Adrian Hornsby
    Principal Evangelist - Architecture
    Amazon Web Services
    @adhorn

  3. What is Operational Excellence?

  4. When your whole business is fundamentally
    dependent on technology,
    operational excellence is critical.

  5. 1995

  6. Internet Web Server
    customers
    Inventory
    Orders
    Database
    Customer Service
    Tools
    Fulfillment Center Tools

  9. What is Operational Excellence?

  11. What is Operational Excellence?
    • Happy customers!
    • Consistently exceeding operational goals
    • Anticipating and addressing problems
    • Effectively responding to operational issues
    • Continuously improving
    …and doing all of this at significant scale.

  12. How does a technology organization move
    toward OE?

  13. Achieving Operational Excellence
    Tools Processes
    Culture
    Technology

  14. Achieving Operational Excellence
    Culture

  15. Culture: Amazon Leadership Principles
    1. Customer Obsession
    2. Ownership
    3. Invent and Simplify
    4. Are Right, A Lot
    5. Hire and Develop the Best
    6. Insist on the Highest Standards
    7. Think Big
    8. Bias for Action
    9. Frugality
    10. Learn and Be Curious
    11. Earn Trust
    12. Dive Deep
    13. Have Backbone; Disagree and Commit
    14. Deliver Results
    https://www.amazon.jobs/en/principles

  16. Culture: Amazon Leadership Principles

  17. Amazon Flywheel

  18. Innovation
    Convenience
    Fast Delivery
    Reduce Customer’s Costs
    Wide Selection of Products

  19. “What would Low-Flying-Hawk say?”

  20. Culture: Amazon Leadership Principles

  21. 2 Pizza Team Responsibilities
    Responsible for
    Their product
    Deployment tools
    CI/CD tools
    Monitoring tools
    Metrics tool
    Logging tools
    APM tools
    Infrastructure provisioning tools
    Security tools
    Database management tools
    Testing tools
    …
    Not responsible for
    *
    *Unless their product belongs in the blue

  22. You build it; you ship it

  23. Achieving Operational Excellence
    Tools

  24. Tools to Operate the Cloud
    • Test Automation
    • Configuration Management
    • Software Deployment
    • Monitoring and Visualization
    • Reporting
    • Change Management
    • Incident Management
    • Trouble Ticketing
    • Security Auditing
    • Forecasting and Planning

  25. Calling Houston…
    Website
    Deployment team
    “website-push” Perl script

  26. Calling Houston…
    Website
    Deployment team
    “website-push” Perl script
    Command line tools
    Hand build
    Hand deploy to NFS
    % /opt/amazon/customer-service/bin/request-refund

  27. Breaking the monolith

  28. Breaking the monolith
    ✓ Small
    ✓ Focused
    ✓ Single-purpose
    ✓ Connected via HTTP API

  29. Conway’s law
    Architecture
    Organization
    THEIR PRODUCT
    Deployment tools
    CI/CD tools
    Monitoring tools
    Metrics tool
    Logging tools
    APM tools
    Infrastructure provisioning tools
    Security tools
    Database management tools
    Testing tools
    …
    “Organizations which design systems … are constrained to
    produce designs which are copies of the communication
    structures of these organizations.”
    — M. Conway

  30. You measure.
    You collect data.
    You listen to anecdotes.

  31. Culture: Amazon Leadership Principles

  32. Wait
    Write Code
    Wait
    Build Code
    Wait
    Deploy to Test
    Deploy to Prod

  33. Brazil
    • Centralized and hosted build system
    • Generating artifacts to deploy

  34. • Deployment service
    • No downtime deployments
    • Health checking
    • Versioned artifacts and rollbacks
    https://www.allthingsdistributed.com/2014/11/apollo-amazon-deployment-engine.html
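
The properties on this slide (versioned artifacts, health checking, rollback) can be illustrated with a small Python sketch. It is not how Amazon's deployment service is implemented; `Host`, `deploy`, and the `is_healthy` callback are hypothetical names, and a real health check would probe the running service rather than call a lambda.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Host:
    """Hypothetical host that remembers every artifact version installed on it."""
    name: str
    history: List[str] = field(default_factory=list)

    @property
    def version(self) -> str:
        return self.history[-1]

    def install(self, version: str) -> None:
        self.history.append(version)

    def rollback(self) -> None:
        # Drop the newest version and fall back to the previous one.
        if len(self.history) > 1:
            self.history.pop()


def deploy(hosts: List[Host], version: str,
           is_healthy: Callable[[Host], bool]) -> bool:
    """Update one host at a time so the remaining hosts keep serving traffic.

    If a freshly updated host fails its health check, every host updated
    so far is rolled back and the deployment is reported as failed.
    """
    updated: List[Host] = []
    for host in hosts:
        host.install(version)
        updated.append(host)
        if not is_healthy(host):
            for done in updated:
                done.rollback()
            return False
    return True


# Assumed example fleet: three hosts, all starting on version "v1".
fleet = [Host("a", ["v1"]), Host("b", ["v1"]), Host("c", ["v1"])]
ok = deploy(fleet, "v2", is_healthy=lambda h: True)
failed = deploy(fleet, "v3", is_healthy=lambda h: h.name != "b")
print(ok, failed, [h.version for h in fleet])
```

In this run the first deployment succeeds, while the second one fails on host "b" and leaves the whole fleet on "v2".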

  35. Pipelines
    • Path code takes from check-in to production
    • Where automation, testing, and approvals happen
    • Enabler of continuous deployment
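
A pipeline in this sense can be sketched as an ordered list of stages, each with automated checks and an optional approval gate. The sketch below is illustrative only: `Stage`, `promote`, the stage names, and the approval callback are assumptions, not the API of Amazon's internal Pipelines tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    checks: List[Callable[[str], bool]]  # automated tests run against a revision
    needs_approval: bool = False         # manual or policy-based gate


def promote(revision: str, stages: List[Stage],
            approved: Callable[[str, str], bool]) -> Tuple[bool, List[str]]:
    """Move a revision through the stages in order, stopping at the first failed gate.

    Returns whether the revision reached the end and which stages it passed.
    """
    passed: List[str] = []
    for stage in stages:
        if not all(check(revision) for check in stage.checks):
            return False, passed
        if stage.needs_approval and not approved(stage.name, revision):
            return False, passed
        passed.append(stage.name)
    return True, passed


def always_approve(stage: str, revision: str) -> bool:
    return True


def never_approve(stage: str, revision: str) -> bool:
    return False


# Hypothetical three-stage pipeline; the "-broken" suffix simulates a failing test.
stages = [
    Stage("build", [lambda rev: True]),
    Stage("test", [lambda rev: not rev.endswith("-broken")]),
    Stage("prod", [lambda rev: True], needs_approval=True),
]

good = promote("rev-1", stages, always_approve)
bad = promote("rev-2-broken", stages, always_approve)
held = promote("rev-3", stages, never_approve)
print(good, bad, held)
```

Because every gate is code, a revision that passes all checks and approvals can flow to production without manual steps, which is what makes continuous deployment possible.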

  36. Example Pipeline and Stages
    [Pipeline screenshot; labels recovered per stage]
    Packages: revision history
    VersionSet: revision history
    Gamma: revision history, status, approval status (diff)
    PDX-Prod: revision history, compliance verification, status, L1 approval, L2 approval, deploy when ready, cancel, approval workflow
    Prod - Rest: revision history, whitelisting, status, approval workflow

  37. Hundreds of millions of deployments a year, as of 2019

  38. https://aws.amazon.com/devops/

  39. Achieving Operational Excellence
    Processes

  40. “Oh! Those tables always come back, and they’re always damaged.
    They’re not packaged right, so the surface of the table always gets
    scratched.”

  41. People already have good intentions

  42. If good intentions don’t work, what does?

  43. Mechanisms

  44. 1902

  45. Toyota will not allow any defect that they know
    about to go down the manufacturing line.

  46. Image Source: https://www.autoguide.com/auto-news/2016/01/toyota-production-japan-may-stop-next-month-due-to-steel-shortage.html

  47. Image Source: https://www.autoguide.com/auto-news/2016/01/toyota-production-japan-may-stop-next-month-due-to-steel-shortage.html
    Andon Cord

  48. The Andon Cord

  49. Andon Customer Service

  51. Jeff Bezos 2012 Shareholder Letter
    We noticed that you experienced poor
    video playback while watching the
    following rental on Amazon Video On
    Demand: Casablanca. We’re sorry for
    the inconvenience and have issued you
    a refund for the following amount:
    $2.99. We hope to see you again soon.

  52. "Good intentions never work, you need
    good mechanisms to make anything
    happen."
    Jeff Bezos

  53. Good Mechanisms ≈ Complete Processes
    Tools
    Adoption
    Audit

  54. Correction of Errors (COE)
    Mechanism to learn from our mistakes
    • technical flaws
    • process flaws
    • documentation flaws
    • organizational flaws
    • other flaws
    Mechanism to identify contributing factors to failures
    Mechanism to drive CONTINUOUS IMPROVEMENT

  55. Anatomy of a COE
    • What happened?
    • What data do you have to support this?
      • Metrics and graphs
    • What was the impact on customers and your business?
    • What are the contributing factors?
      • Don’t stop at operators.
    • What lessons did you learn?
    • What corrective actions are you taking?
      • Action items
      • Related items (trouble tickets, etc.)
    https://www.youtube.com/watch?v=yQiRli2ZPxU
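
The questions above map naturally onto a structured record that can be checked for completeness before a review. The Python sketch below is one possible shape, not Amazon's COE template: the class name, field names, and the sample incident text are all made up for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class CorrectionOfErrors:
    """Hypothetical COE record with one field per question in the anatomy."""
    what_happened: str = ""
    supporting_data: List[str] = field(default_factory=list)      # metrics and graphs
    customer_impact: str = ""
    contributing_factors: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)   # action items

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Invented draft: three sections filled in, three still to write.
draft = CorrectionOfErrors(
    what_happened="Checkout latency exceeded the alarm threshold",
    supporting_data=["p99 latency graph"],
    contributing_factors=["Deployment skipped the test stage"],
)
print(draft.missing_sections())
```

A check like `missing_sections` is a small example of turning a good intention ("write complete COEs") into a mechanism that can be audited.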

  56. Culture: Amazon Leadership Principles

  57. Audit
    Weekly Operational Metrics Review
    • Continuous inspection mechanism
    • Maintains focus on operations
    • Foundation of a healthy operations program
    Typical Agenda (~15min)
    • Share successes and failings
    • Action items follow up
    • Review COEs
    • Review key service metrics
    • Identify new best practices
    https://aws.amazon.com/blogs/opensource/the-wheel/

  58. Continuous Improvement

  59. Policy Engine
    • Automated risk and opportunity analyzer
    • Identifies potential risks to availability, infrastructure, security, and more
      • Both inherited and direct
    • Highlights potential opportunities to optimize resource utilization
    • Extensible and configurable
    • Provides single-pane-of-glass view into policy compliance
    • Allows acknowledgment
    • Reports roll up the organization hierarchy
    Mechanism to propagate local learnings globally
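
A few of these properties (extensible checks, acknowledgment, reports that roll up the hierarchy) can be shown in a short Python sketch. This is an assumption-laden illustration, not the internal Policy Engine: the `policy` decorator, the two example policies, their thresholds, and the resource fields are all hypothetical.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

Resource = Dict[str, Any]
Policy = Callable[[Resource], Optional[str]]
Finding = Tuple[str, str, str]  # (owner, resource id, policy name)

POLICIES: Dict[str, Policy] = {}


def policy(name: str) -> Callable[[Policy], Policy]:
    """Register a check under a name; new policies plug in through this decorator."""
    def register(fn: Policy) -> Policy:
        POLICIES[name] = fn
        return fn
    return register


@policy("single-az")
def single_az(res: Resource) -> Optional[str]:
    # Availability risk: the resource runs in fewer than two zones.
    return "runs in one availability zone" if res["zones"] < 2 else None


@policy("idle-capacity")
def idle_capacity(res: Resource) -> Optional[str]:
    # Optimization opportunity: the 10% CPU threshold is an arbitrary example.
    return "average CPU below 10%" if res["cpu"] < 10 else None


def evaluate(resources: List[Resource],
             acknowledged: Set[Tuple[str, str]]) -> List[Finding]:
    """Run every registered policy, skipping findings the owner has acknowledged."""
    findings: List[Finding] = []
    for res in resources:
        for name, check in POLICIES.items():
            if check(res) and (res["id"], name) not in acknowledged:
                findings.append((res["owner"], res["id"], name))
    return findings


def roll_up(findings: List[Finding]) -> Counter:
    """Count findings for each team and for every organization above it."""
    totals: Counter = Counter()
    for owner, _, _ in findings:
        parts = owner.split("/")
        for depth in range(1, len(parts) + 1):
            totals["/".join(parts[:depth])] += 1
    return totals


# Invented inventory: owners are written as "organization/team".
resources = [
    {"id": "svc-a", "owner": "retail/cart", "zones": 1, "cpu": 40},
    {"id": "svc-b", "owner": "retail/search", "zones": 3, "cpu": 5},
    {"id": "svc-c", "owner": "retail/cart", "zones": 1, "cpu": 5},
]
acknowledged = {("svc-c", "idle-capacity")}

findings = evaluate(resources, acknowledged)
totals = roll_up(findings)
print(findings)
print(dict(totals))
```

Adding a policy once makes it apply to every team's resources, which is one way a lesson learned locally can propagate globally.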

  60. In conclusion...
    Achieving operational excellence
    requires:
    an operationally focused culture
    a rich set of tools
    the right processes
    • Good Intentions Don’t Work
    • Mechanisms Work

  61. “The world, thankfully, is full of many high-performing, highly
    distinctive corporate cultures.
    We never claim that our approach is the right one – just that
    it’s ours – and over the last two decades, we’ve collected a large
    group of like-minded people. Folks who find our approach
    energizing and meaningful.”
    Jeff Bezos - 2015 Amazon.com letter to shareholders

  62. Thank you!
    Adrian Hornsby
    @adhorn
    https://medium.com/@adhorn
    https://dev.to/adhorn
