Are you ready for production?

JBD
December 02, 2019

Transcript

  1. @rakyll Are you ready for production? Jaana B. Dogan, Google

    jbd.dev/prod-readiness
  2. None
  3. @rakyll Are we ready to push to production? What’s good

    or bad when operating services? How do we transfer knowledge? How do we learn from failure?
  4. @rakyll Production readiness • Ensure a service meets accepted standards of operational

    readiness. • Ensure the teams can benefit from organization-wide knowledge. Production readiness involves reviews (PRRs).
  5. @rakyll When? Various times in the lifetime of a service!

    • Launching a new production service. • Introducing prod readiness to an existing service. • Handing off the operations. • Preparing oncall support.
  6. @rakyll Checklist Design Development Configuration management Release management Observability Security

    Capacity planning
  7. @rakyll Why? Engineers don’t feel confident, which affects the release

    velocity. Teams figure out practices by trial and error. A trial-and-error culture makes finger-pointing common.
  8. @rakyll PRRs A checklist or a questionnaire. Check manually, automatically

    or do both. Self-served or assisted. Create checklist templates if needed.
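
A checklist that mixes manual and automated checks could be represented roughly as below. This is a minimal sketch: `ChecklistItem`, `review`, and the sample items are hypothetical names for illustration, not something from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChecklistItem:
    """One PRR item (hypothetical structure)."""
    area: str
    question: str
    rationale: str
    # Automated check, or None when a human has to answer the question.
    check: Optional[Callable[[], bool]] = None


def review(items):
    """Run automated checks; return (passed, failed, needs_manual_review)."""
    passed, failed, manual = [], [], []
    for item in items:
        if item.check is None:
            manual.append(item)
        elif item.check():
            passed.append(item)
        else:
            failed.append(item)
    return passed, failed, manual


items = [
    ChecklistItem("Release management", "Is the release process documented?",
                  "Anyone on call should be able to roll out or roll back."),
    ChecklistItem("Observability", "Are SLO metrics collected?",
                  "SLOs cannot be evaluated without their metrics.",
                  check=lambda: True),  # placeholder for a real probe
]
passed, failed, manual = review(items)
```

Items without a `check` fall through to the manual list, so the same template can serve self-served and assisted reviews.
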
  9. @rakyll DISCLAIMER: THIS IS A REFERENCE.

  10. None
  11. @rakyll Design and development Have reproducible builds. Define and set

    SLOs for your service at design time. Document the availability expectations of external dependencies. Avoid single points of failure by not depending on a single global resource.
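
To make the SLO and dependency items concrete, here is a small arithmetic sketch. It assumes an availability SLO measured over a fixed window and dependencies that are all required per request and fail independently; both function names are made up for this example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def serial_availability(*dependencies: float) -> float:
    """Upper bound on availability when every dependency is needed for each
    request and failures are independent (a simplifying assumption)."""
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result


budget = error_budget_minutes(0.999)          # about 43.2 minutes in 30 days
bound = serial_availability(0.999, 0.9995)    # lower than either dependency alone
```

The second function shows why documenting dependency availability matters: a service cannot promise more than the product of what it hard-depends on.
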
  12. @rakyll Have reproducible builds. Rationale: Rollouts shouldn’t be affected by

    outages of external systems. Otherwise, the ability to roll out will always be capped by the least available external service, and it won’t be possible to roll out or roll back on demand. Deliverables: • Use X to ensure source tree is hosted internally. • Use Y for continuous integration. • Don’t mandate the external test coverage service before merges.
  13. @rakyll Configuration Static, small and non-secret configuration can be flags.

    Use a configuration delivery service for everything else. Development configuration shouldn’t inherit from production. Document dynamic configuration capabilities.
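
As an illustration of the first item, static, small, non-secret settings can be plain command-line flags. The flag names and defaults below are assumptions chosen for the example; anything secret or dynamic would come from a configuration delivery service instead.

```python
import argparse


def parse_flags(argv):
    """Static, small, non-secret settings exposed as command-line flags."""
    parser = argparse.ArgumentParser(prog="service")
    parser.add_argument("--port", type=int, default=8080)
    parser.add_argument("--request-timeout-seconds", type=float, default=5.0)
    # Development is the default so a dev run never inherits production settings.
    parser.add_argument("--env", choices=["development", "production"],
                        default="development")
    return parser.parse_args(argv)


flags = parse_flags(["--port", "9090"])
```
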
  14. @rakyll Release management Document your release process step by step.

    Document how releases affect metrics. Document your canary release process. Document how to revert canaries. Ensure that rollbacks use the same process that rollouts use.
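
One way to read the last item is that a rollback is simply a release of the previous version through the same code path. The sketch below assumes hypothetical `deploy` and `healthy` callables and made-up traffic stages.

```python
def release(version, deploy, healthy, stages=(1, 10, 50, 100)):
    """Shift traffic to `version` stage by stage; stop at the first unhealthy stage."""
    for percent in stages:
        deploy(version, percent)
        if not healthy(version):
            return False
    return True


def rollback(previous_version, deploy, healthy):
    # A rollback is a release of the previous version through the same path,
    # so it is exercised every time a rollout happens.
    return release(previous_version, deploy, healthy)
```

Because both directions share `release`, the rollback path does not rot between incidents.
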
  15. @rakyll Observability Ensure the collection of metrics that are required

    by your SLOs. Ensure client-side and server-side data can be differentiated. Include (cloud) platform metrics in your dashboards. Set up alerting for your external service dependencies. Always propagate the incoming distributed trace context header.
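
Propagating trace context can be as small as copying a header onto outgoing calls. The sketch assumes the W3C Trace Context header names `traceparent` and `tracestate`; the talk does not name a format, and `outgoing_headers` is a hypothetical helper. A tracing library would normally do this for you and also update the span ID.

```python
TRACE_HEADERS = ("traceparent", "tracestate")  # assumed: W3C Trace Context names


def outgoing_headers(incoming, extra=None):
    """Copy trace context headers from the incoming request to an outgoing one."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in incoming.items()}
    headers = dict(extra or {})
    for name in TRACE_HEADERS:
        if name in lowered:
            headers[name] = lowered[name]
    return headers


incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "User-Agent": "example",
}
out = outgoing_headers(incoming, {"accept": "application/json"})
```
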
  16. @rakyll Security (1/2) Ensure all external requests are encrypted. Ensure

    all production projects have proper IAM configuration. Use subnetworks to isolate. Use VPN to connect to remote networks. Document and monitor user data access. Ensure debugging endpoints are limited by ACL.
  17. @rakyll Security (2/2) Sanitize user input. Have payload size restrictions

    for user input. Ensure your service can block incoming traffic selectively per user. Avoid external endpoints that trigger a large number of internal fanouts.
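
Payload limits and per-user blocking can both sit in one admission check at the edge of the service. This is a sketch under assumptions: the 1 MiB limit, the in-memory `blocked_users` set, and the `admit` function are illustrative, and the return values are standard HTTP status codes.

```python
MAX_PAYLOAD_BYTES = 1024 * 1024  # hypothetical limit: 1 MiB
blocked_users = set()            # in a real service this would be shared state


def admit(user_id, payload: bytes) -> int:
    """Return an HTTP-style status code for an incoming request."""
    if user_id in blocked_users:
        return 403  # Forbidden: traffic from this user is blocked
    if len(payload) > MAX_PAYLOAD_BYTES:
        return 413  # payload exceeds the size restriction
    return 200
```
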
  18. @rakyll Capacity planning Document how your service scales. Document resource

    requirements for your service. Document resource constraints: resource type, region, etc. Document quota restrictions to create new resources. Document load tests for performance regressions if possible.
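
Documenting how a service scales often comes down to a formula someone can rerun. The one below is a back-of-the-envelope sketch, not a rule from the talk: the parameter names, the 60% target utilization, and the minimum of two replicas are all assumptions.

```python
import math


def replicas_needed(peak_qps, qps_per_replica, target_utilization=0.6, minimum=2):
    """Replicas required to serve peak traffic while leaving headroom."""
    usable_qps = qps_per_replica * target_utilization
    return max(minimum, math.ceil(peak_qps / usable_qps))


needed = replicas_needed(peak_qps=1000, qps_per_replica=100, target_utilization=0.5)
```

Comparing the result against documented quota and regional constraints shows early whether a launch needs a quota request.
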
  19. @rakyll Where to start? • Acknowledge the need as a

    team. • Research practices that apply, consult domain experts. • Start having production readiness discussions early. • Learn from failure and share knowledge widely. • Enforce production readiness practices.
  20. @rakyll Evaluate PRRs... Provide rationale for each item. Remove items

    as they become irrelevant. Review the reviews when needed.
  21. @rakyll Start early on... Discuss when you are initially designing

    a new service. PRRs are seen as a burden rather than a helpful utility when introduced late. Checklists provide insights about design tradeoffs.
  22. @rakyll The “process”... “We don’t want to deal with process

    and the overhead of committees!” Don’t. Reviews can increase the velocity. - Faster design decisions. - Faster troubleshooting. - Faster onboarding.
  23. @rakyll Production readiness can help teams to be confident, reduce

    everyday mistakes, and help onboard new people to the projects.
  24. @rakyll You can do the reviews manually, automatically, or both.

  25. @rakyll Production readiness is not just a later addition; it helps teams

    make design decisions, so it should be discussed early and be part of the initial design.
  26. @rakyll Relatively low overhead compared to not having them. No

    review means repeated mistakes and failure, as well as burnout. [Chart; axis labels: maturity, release velocity]
  27. @rakyll Last but not least, it is more than making

    sure your boxes are checked.
  28. @rakyll Happy production :) Jaana B. Dogan, Google jbd@google.com jbd.dev/prod-readiness