Getting Your System to Production (and keeping it there)

Eoin Woods
December 01, 2015

It can be dispiriting when a well-designed, carefully implemented system runs into problems as soon as it hits production, but it does happen. This session explores why, and explains why good software development practice is important but ultimately not sufficient to create a reliable and effective enterprise system. We'll discuss what being "production ready" really means, so that we can understand the principles, patterns and practices we need to apply to get our systems into production safely and keep them there.

Transcript

  1. Getting a System to Production
    ... and keeping it there

  2. Who Am I?
    Eoin Woods - CTO at Endava
    2005 - 2014 in capital markets (UBS, BGI)
    2000 - 2004 in product engineering & consultancy (Bull, Sybase, InterTrust, independent)
    Author, editor, speaker, community-guy

  3. Who are Endava?
    Software Engineering & IT Services Firm
    2800+ people
    UK, US, Germany, Romania, Moldova, Serbia, Macedonia
    Agile and Digital Transformation
    Consulting, Architecture, Development, Testing
    Data and Analytics
    Application Management, Infrastructure, DevOps

  4. Content
    Introducing Production Systems
    What Goes Wrong in Production?
    Solutions for Production Systems
    Conclusions

  5. Production Systems

  6. What is a production system?
    Any system being used for real work

  7. Why is Productionisation Hard?
    No one teaches you about production
      who do you talk to?
      what do they want?
      what is the definition of “done”?
    Production is difficult for developers
      hard to access, interrogate, debug, change, ...

  8. A new cast of characters
    Development
    Developers, Users

  9. A new cast of characters
    Production
    Users, Developers, Auditors, Operations, Acquirers, Infrastructure, Business Management

  10. Production is constrained
    Highly controlled
    Content is all valuable
    Change can be difficult

  11. Production is unpredictable

  12. Production is highly visible!

  13. You don’t own production

  14. What goes wrong?

  15. Performance surprises
    Interactive load
    Batch time surprises
    System abusers!
      “all transactions this year”, “average since 1967”, ...

  16. Environment bombshells
    Constraints and contention
    Unexpected behaviour
    Integration points

  17. Failures happen
    Software defects
    Platform failures
    Environment failures

  18. Security tangles
    Security is simple in Development
    Much more complex in Production!

  19. Finding Solutions

  20. Key requirements for production
    Functionally correct
      does what the business process requires
    Stability
      behaves predictably in all situations
    Capacity
      can process the workload required (at all times)
    Security
      limits access to those who are authorised to have it

  25. Solution Framework
    Matrix of requirements (Correctness, Stability, Capacity, Security) against solution areas (Design Principles, Technology, Practices)
    Example entries: Simplicity, Resource Governor, Threat Modelling
    Our focus today

  26. General Principles
    One Team
    Automate
    Measure and Improve (feedback loops)
    Good Enough over Perfection
    Timeless principles … that led to CD and DevOps

  27. DevOps Principles
    Communication
    Automation
    Lean thinking
    Measurement
    Sharing
    CALMS - itrevolution.com/devops-culture-part-1

  28. Solutions: Achieving Stability

  29. Stability - design principles
    Fail quickly
      fail fast, timeouts
    Isolate problems
      flow control, circuit breakers, bulkheads, asynchronous integration
    Ensure steady state operation
      housekeeping, predictable resource allocation, governors, throttling
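
A minimal sketch of how fail fast, timeouts and bulkheads can combine around a single dependency. The class name `Bulkhead`, its parameters and the `BulkheadFull` exception are hypothetical names chosen for illustration; they do not come from the talk.

```python
import concurrent.futures
import threading


class BulkheadFull(Exception):
    """Raised immediately when no slot is free (fail fast)."""


class Bulkhead:
    """Caps concurrent calls to one dependency and bounds the caller's wait."""

    def __init__(self, max_concurrent: int, timeout_seconds: float) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._timeout = timeout_seconds
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrent)

    def call(self, func, *args):
        # Fail fast: refuse at once rather than queue behind a saturated dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slots")
        try:
            future = self._pool.submit(self._run, func, *args)
        except BaseException:
            self._slots.release()
            raise
        # Timeout: the caller waits at most self._timeout seconds.
        # The slot stays occupied until the slow call really finishes.
        return future.result(timeout=self._timeout)

    def _run(self, func, *args):
        try:
            return func(*args)
        finally:
            self._slots.release()
```

A caller that times out gets `concurrent.futures.TimeoutError`, and further callers are rejected with `BulkheadFull` until the slow call completes, so one slow integration point cannot absorb every thread in the system.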

  36. Stability - technology solutions
    Timeouts
    Circuit Breaker
    Fail fast
    Bulkhead
    Governor
    Housekeeping
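
One possible shape for a governor is a token bucket that refuses work once a rate limit is reached. This is a sketch under assumptions: the class name `Governor`, the `rate` and `burst` parameters and the injectable `clock` are illustrative and not taken from the slides.

```python
import time


class Governor:
    """Token-bucket governor: admits about `rate` requests per second on
    average, with bursts of up to `burst` requests. Not thread-safe."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic) -> None:
        self._rate = rate
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False  # the caller should reject or defer the work
```

Returning `False` instead of blocking keeps the decision with the caller, which can shed the request, queue it, or return a "try later" response.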

  37. Example - Circuit Breaker
    State diagram with states Clear, Checking and Tripped
    Transition labels: err_returned (twice); timeout; err_returned && err_count > 10
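
The transcript does not preserve which transition joins which states, so the sketch below assumes one plausible reading: errors are counted while Clear, more than `max_errors` errors trip the breaker, a timeout moves Tripped to Checking, and a failed trial call in Checking trips it again. Resetting the count on success is also an assumption, and the class and parameter names are illustrative.

```python
import time


class CircuitOpen(Exception):
    """Raised while the breaker is tripped and calls are being rejected."""


class CircuitBreaker:
    CLEAR, CHECKING, TRIPPED = "clear", "checking", "tripped"

    def __init__(self, max_errors=10, reset_timeout=30.0, clock=time.monotonic):
        self.state = self.CLEAR
        self._max_errors = max_errors
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._err_count = 0
        self._tripped_at = 0.0

    def call(self, func, *args):
        if self.state == self.TRIPPED:
            if self._clock() - self._tripped_at < self._reset_timeout:
                raise CircuitOpen("dependency is being rested")
            self.state = self.CHECKING  # timeout elapsed: allow a trial call
        try:
            result = func(*args)
        except Exception:
            self._on_error()
            raise
        # Success returns the breaker to Clear and resets the error count.
        self.state = self.CLEAR
        self._err_count = 0
        return result

    def _on_error(self):
        self._err_count += 1
        # A failed trial call, or too many errors while Clear, trips the breaker.
        if self.state == self.CHECKING or self._err_count > self._max_errors:
            self.state = self.TRIPPED
            self._tripped_at = self._clock()
```

While tripped, callers get `CircuitOpen` straight away instead of waiting on a dependency that is known to be failing.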

  38. Stability - practices
    Repeatability
      defined processes, practice scenarios, prelive environments
    Automation
      automate the routine, automate the difficult
      allow the human back in the loop on demand
    Transparency
      logging, monitoring, alerts, trends

  39. Stability - process automation
    Logging & Metrics
    Monitoring
    Automation

  43. Stability - environments
    Development
    UAT
    Prelive
    Production
    Labels: “Uncontrolled”, “Controlled”, The DevOps Zone

  44. Stability - production runbooks
    Security, Audit, Compliance, ...
    Production Operations
    Developers
    System design
    Experience
    Constraints
    • Overview
    • Install
    • Backout
    • Op Procs
    • Investigation
    • Recovery

  45. Solutions: Achieving Capacity

  46. Capacity - design principles
    Minimise workload
      efficiency is important
    Flatten the peaks
      move workload around
    Design for the large (scalability)
      understand where the time goes
      multiply by a million
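
To see where the time goes, one option is to accumulate elapsed time per named segment of a request. The sketch below is illustrative only: `timed`, `segment_seconds` and the toy `handle_request` workload are hypothetical names, not part of the talk.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Total wall-clock seconds spent in each named segment.
segment_seconds = defaultdict(float)


@contextmanager
def timed(segment: str):
    """Add the time spent inside the `with` block to the segment's total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        segment_seconds[segment] += time.perf_counter() - start


def handle_request() -> int:
    # Toy workload standing in for real parsing and computation.
    with timed("parse"):
        data = [str(i) for i in range(1000)]
    with timed("compute"):
        total = sum(int(x) for x in data)
    return total
```

Multiplying each segment's per-request time by the expected request volume gives a rough first estimate of which segment will dominate at scale.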

  47. Capacity - technology solutions
    Measure and minimise
      understand where the work is
    Caching and pre-computing
      reduce the work to be done
    Sharding and partitioning
      separate workload to allow scale
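
A sketch of two of these ideas, assuming a lookaside cache with a fixed time-to-live and hash-based partitioning. `LookasideCache`, `shard_for`, the TTL policy and the choice of SHA-256 are illustrative assumptions rather than anything prescribed by the talk.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class LookasideCache:
    """Check the cache first; on a miss or stale entry, load and remember."""

    def __init__(self, load: Callable[[str], object], ttl_seconds: float,
                 clock=time.monotonic) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]              # hit: no work for the backing store
        value = self._load(key)          # miss or expired: do the real work
        self._entries[key] = (now, value)
        return value


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

The built-in `hash()` is avoided for sharding because Python randomises string hashes per process, which would send the same key to different shards on different hosts.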

  54. Capacity - solutions
    Segment Timings
    Static cache
    Lookaside cache
    Precompute
    Result set caching
    Phased batch

  55. Moving Work Around
    [Two charts of Utilisation (0 to 100) against a 0 to 23 axis]

  56. Capacity - practices
    Model and estimate
    Test capacity on realistic environments
      allows model calibration
    Monitoring and trend analysis
      tests theory against reality
      spots impending storms before they hit
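
Trend analysis can be as simple as fitting a straight line to utilisation samples and projecting when it reaches capacity. The sketch assumes one sample per day and a linear trend; `days_until_full` is a hypothetical helper, and real workloads may need seasonal or non-linear models.

```python
from typing import Optional, Sequence


def days_until_full(daily_utilisation: Sequence[float],
                    capacity: float = 100.0) -> Optional[float]:
    """Least-squares line through the samples; returns the number of days
    after the last sample at which the line reaches `capacity`, or None
    if the trend is flat or falling."""
    n = len(daily_utilisation)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_utilisation) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, daily_utilisation))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (capacity - intercept) / slope
    return max(0.0, crossing_day - (n - 1))
```

For example, samples of 50, 52, 54, 56 and 58 percent grow by 2 points a day, so the projection is 21 days after the last sample.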

  57. Solutions: Achieving Security

  58. Security - key design principles
    What they don’t have won’t hurt you
      least privilege - grant the minimum needed
    Security needs simplicity
      what you can’t analyse you can’t be sure about
    Don’t put your eggs in one basket
      separate privileges to avoid total breaches
    Fail safely
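
Least privilege and failing safely can be sketched as a deny-by-default permission check. The role names, permission strings and function names below are hypothetical examples, not drawn from the talk.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical role table: each role gets only the permissions it needs.
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "viewer": frozenset({"report:read"}),
    "operator": frozenset({"report:read", "job:restart"}),
}


def is_allowed(role: str, permission: str) -> bool:
    """Deny by default: unknown roles and unlisted permissions get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())


def authorise(role: str, permission: str,
              policy: Callable[[str, str], bool] = is_allowed) -> bool:
    """Fail safely: if the policy check itself breaks, refuse access."""
    try:
        return bool(policy(role, permission))
    except Exception:
        return False
```

The design choice is that every failure mode, whether an unknown role, a missing permission or an error in the policy lookup, ends in refusal rather than access.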

  64. Security - solutions
    Authentication & Roles
    Least privilege / separation
    Privacy (TLS)
    Isolation (firewalls & zones)
    Trust (certs)

  65. Security - key practices
    Model threats to identify mitigation
    Define policy to know what to protect
    Apply mechanisms to mitigate threats
    Test security as well as functions

  66. Security - techniques
    Security Model
    Threat Model

  67. Summary

  68. Summary
    Production is just different
      it’s not yours and you need to respect that
    Production is demanding
      Correctness
      Stability
      Capacity
      Security

  69. Summary (ii)
    Identify solutions by requirement & area
      principles
      technologies
      practices

  70. Summary (iii)
    Production requirements and principles
      go back to the age of the mainframe
    CD and DevOps the latest incarnation
      welcome attention from developers
      new tech enabling new possibilities
      breaking down silos to make it happen

  71. Books
    Software Systems Architecture: Working with Stakeholders Using Viewpoints and Perspectives, Second Edition
    Nick Rozanski and Eoin Woods

  72. Eoin Woods

    [email protected]
    www.eoinwoods.info

    @eoinwoodz
    Thank you.
    Questions?
    Acknowledgements
    http://www.icons-land.com
    http://www.alamy.com/
    http://www.42u.com
