Are you ready for production?

JBD
December 02, 2019

Transcript

  1. @rakyll
    Are you ready
    for production?
    Jaana B. Dogan, Google
    jbd.dev/prod-readiness

  2. View Slide

  3. @rakyll
    Are we ready to push to production?
    What’s good or bad when operating services?
    How do we transfer knowledge?
    How do we learn from failure?

  4. @rakyll
    Production readiness
    ● Ensure a service meets accepted standards of operational readiness.
    ● Ensure the teams can benefit from organization-wide knowledge.
    Production readiness involves reviews (PRRs).

  5. @rakyll
    When?
    Various times in the lifetime of a service!
    ● Launching a new production service.
    ● Introducing prod readiness to an existing service.
    ● Handing off the operations.
    ● Preparing oncall support.

  6. @rakyll
    Checklist
    Design
    Development
    Configuration management
    Release management
    Observability
    Security
    Capacity planning

  7. @rakyll
    Why?
    Engineers don’t feel confident, which slows release velocity.
    Teams figure out practices by trial and error.
    A trial-and-error culture makes finger pointing common.

  8. @rakyll
    PRRs
    A checklist or a questionnaire.
    Check manually, automatically or do both.
    Self-served or assisted.
    Create checklist templates if needed.
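
A minimal sketch of a checklist that mixes automated and manual items, assuming a service is described by a plain dictionary. The `ChecklistItem` type, the `review` function, the template entries and the service fields are all hypothetical and only illustrate the idea on this slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChecklistItem:
    """One PRR item; `check` is None when the item needs a human reviewer."""
    area: str
    question: str
    check: Optional[Callable[[dict], bool]] = None


def review(items, service):
    """Run the automated checks and collect the items left for manual review."""
    passed, failed, manual = [], [], []
    for item in items:
        if item.check is None:
            manual.append(item.question)
        elif item.check(service):
            passed.append(item.question)
        else:
            failed.append(item.question)
    return {"passed": passed, "failed": failed, "manual": manual}


# Hypothetical checklist template and service description.
TEMPLATE = [
    ChecklistItem("Design", "Are SLOs defined?", lambda s: bool(s.get("slos"))),
    ChecklistItem("Release management", "Is the rollback process documented?",
                  lambda s: "rollback_doc" in s),
    ChecklistItem("Capacity planning", "Has scaling behavior been reviewed?"),
]

report = review(TEMPLATE, {"slos": ["99.9% availability"]})
```

Items without a `check` stay in the `manual` list, so the same template can serve a self-served or an assisted review.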

  9. @rakyll
    DISCLAIMER: THIS IS A REFERENCE.

  10. View Slide

  11. @rakyll
    Design and development
    Have reproducible builds.
    Define and set SLOs for your service at design time.
    Document the availability expectations of external dependencies.
    Avoid single points of failure by not depending on a single global
    resource.
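
A small worked example of the arithmetic behind an availability SLO and its dependencies. The numbers and function names are illustrative assumptions, and the second function assumes that dependency failures are independent and that every dependency must be up for a request to succeed.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def serial_availability(*dependencies: float) -> float:
    """Upper bound on availability when all dependencies are required,
    assuming their failures are independent."""
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result


# A 99.9% SLO leaves roughly 43.2 minutes of downtime per 30 days.
budget = error_budget_minutes(0.999)

# Three required dependencies cap the service at roughly 99.35%.
bound = serial_availability(0.999, 0.9995, 0.995)
```

If the bound is lower than the SLO under discussion, the SLO cannot be met without caching, fallbacks or removing a dependency from the critical path, which is why documenting dependency expectations at design time matters.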

  12. @rakyll
    Have reproducible builds.
    Rationale: Rollouts shouldn’t be affected by outages of external systems. Otherwise, the
    ability to roll out will always be capped by the least available external service, and it won’t be
    possible to roll out or roll back on demand.
    Deliverables:
    ● Use X to ensure source tree is hosted internally.
    ● Use Y for continuous integration.
    ● Don’t mandate the external test coverage service before merges.

  13. @rakyll
    Configuration
    Static, small and non-secret configuration can be flags.
    Use a configuration delivery service for everything else.
    Development configuration shouldn’t inherit from production.
    Document dynamic configuration capabilities.
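
A minimal sketch of the first point, using command-line flags for static, small and non-secret settings. The flag names and defaults are hypothetical. Secrets and anything that changes at runtime are deliberately left out, since the slide assigns those to a configuration delivery service.

```python
import argparse


def parse_flags(argv):
    """Parse static, small, non-secret settings from command-line flags."""
    parser = argparse.ArgumentParser(prog="service")
    parser.add_argument("--listen-addr", default="127.0.0.1:8080")
    parser.add_argument("--request-timeout-seconds", type=float, default=5.0)
    # Development is the default so a dev run never inherits production values.
    parser.add_argument("--env", choices=["dev", "prod"], default="dev")
    return parser.parse_args(argv)


flags = parse_flags(["--env", "prod", "--request-timeout-seconds", "2.5"])
```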

  14. @rakyll
    Release management
    Document your release process step by step.
    Document how releases affect metrics.
    Document your canary release process.
    Document how to revert canaries.
    Ensure that rollbacks use the same process that rollouts use.
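
One way a documented canary process could state its promote or revert rule is as a small function. This is a sketch under assumed thresholds, not a rule taken from the talk: `max_ratio`, `min_requests` and the absolute floor `min_rate` are all hypothetical.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=1.5, min_requests=100, min_rate=0.001):
    """Return "wait", "rollback" or "promote" from error counts."""
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge
    baseline_rate = (baseline_errors / baseline_requests
                     if baseline_requests else 0.0)
    canary_rate = canary_errors / canary_requests
    # The floor avoids reverting on noise when the baseline has almost no errors.
    threshold = max(baseline_rate * max_ratio, min_rate)
    return "rollback" if canary_rate > threshold else "promote"
```

Writing the rule down this explicitly also documents how a release is expected to affect metrics.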

  15. @rakyll
    Observability
    Ensure the collection of metrics that are required by your SLOs.
    Ensure client-side and server-side data can be differentiated.
    Include (cloud) platform metrics in your dashboards.
    Set up alerting for your external service dependencies.
    Always propagate the incoming distributed trace context header.
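
A minimal sketch of the last point, assuming the W3C Trace Context header names `traceparent` and `tracestate`; the slide does not say which header format is in use. The `outgoing_headers` helper is hypothetical. It only forwards the context, whereas a real tracing library would also record a new span for the outgoing call.

```python
TRACE_HEADERS = ("traceparent", "tracestate")  # assumed W3C Trace Context names


def outgoing_headers(incoming, extra=None):
    """Build headers for a downstream call, copying trace context through."""
    headers = dict(extra or {})
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in incoming.items()}
    for name in TRACE_HEADERS:
        if name in lowered:
            headers[name] = lowered[name]
    return headers


incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "Cookie": "session=abc",
}
downstream = outgoing_headers(incoming, {"accept": "application/json"})
```

Only the trace headers are copied; unrelated incoming headers such as cookies are not forwarded.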

  16. @rakyll
    Security (1/2)
    Ensure all external requests are encrypted.
    Ensure all production projects have proper IAM configuration.
    Use subnetworks to isolate. Use VPN to connect to remote networks.
    Document and monitor user data access.
    Ensure debugging endpoints are limited by ACL.

  17. @rakyll
    Security (2/2)
    Sanitize user input.
    Have payload size restrictions for user input.
    Ensure your service can block incoming traffic selectively per user.
    Avoid external endpoints that trigger a large number of internal fanouts.
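
A minimal sketch of two of these items, a payload size restriction and selective per-user blocking, as an admission check that runs before any real work. The limit, the blocked-user set and the `admit` function are hypothetical; in practice the blocklist would likely come from dynamic configuration.

```python
MAX_PAYLOAD_BYTES = 64 * 1024  # hypothetical limit
BLOCKED_USERS = {"user-123"}   # hypothetical; would be loaded dynamically


def admit(user_id, payload):
    """Return (status_code, reason) for an incoming request."""
    if user_id in BLOCKED_USERS:
        return 403, "user blocked"
    if len(payload) > MAX_PAYLOAD_BYTES:
        return 413, "payload too large"
    return 200, "ok"
```

Rejecting early keeps an abusive or oversized request from reaching internal services at all.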

  18. @rakyll
    Capacity planning
    Document how your service scales.
    Document resource requirements for your service.
    Document resource constraints: resource type, region, etc.
    Document quota restrictions to create new resources.
    Document load tests for performance regressions if possible.
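
The documented numbers can be turned into a quick check. This sketch assumes a service that scales horizontally and a per-instance throughput measured by a load test; the headroom, minimum instance count and example figures are hypothetical.

```python
import math


def instances_needed(peak_qps, qps_per_instance, headroom=0.3, min_instances=2):
    """Instances required to serve peak load with spare headroom."""
    required = math.ceil(peak_qps * (1 + headroom) / qps_per_instance)
    return max(required, min_instances)


def fits_quota(required, quota):
    """True when the required instance count stays within the resource quota."""
    return required <= quota


# 1200 QPS at peak, 150 QPS per instance, 30% headroom.
needed = instances_needed(1200, 150)
within_quota = fits_quota(needed, quota=10)
```

With these figures the service needs 11 instances, which exceeds a quota of 10, so the quota request has to happen before launch rather than during an incident.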

  19. @rakyll
    Where to start?
    ● Acknowledge the need as a team.
    ● Research practices that apply, consult domain experts.
    ● Start having production readiness discussions early.
    ● Learn from failure and share knowledge widely.
    ● Enforce production readiness practices.

  20. @rakyll
    Evaluate PRRs...
    Provide rationale for each item.
    Remove items as they become irrelevant.
    Review the reviews when needed.

  21. @rakyll
    Start early on...
    Discuss when you are initially designing a new service.
    PRRs are seen as a burden rather than a helpful utility when introduced late.
    Checklists provide insights about design tradeoffs.

  22. @rakyll
    The “process”...
    “We don’t want to deal with process
    and the overhead of committees!”
    Don’t. Reviews can increase the velocity.
    - Faster design decisions.
    - Faster troubleshooting.
    - Faster onboarding.

  23. @rakyll
    Production readiness can help teams be confident, reduce everyday
    mistakes and help onboard new people to projects.

  24. @rakyll
    You can do reviews manually, automatically or both.

  25. @rakyll
    Production readiness is not just a later addition. It helps teams make design
    decisions, so it should be discussed early and be part of the initial design.

  26. @rakyll
    Reviews have relatively low overhead compared to not having them. No reviews
    means repeated mistakes and failures, as well as burnout.
    [Chart with axes labeled "maturity" and "release velocity"]

  27. @rakyll
    Last but not least, it is more than making sure your boxes are checked.

  28. @rakyll
    Happy production :)
    Jaana B. Dogan, Google
    [email protected]
    jbd.dev/prod-readiness
