Are you ready for production?

@rakyll Are you ready for production? Jaana B. Dogan, Google
jbd.dev/prod-readiness

@rakyll Are we ready to push to production? What’s good
or bad when operating services? How do we transfer knowledge? How do we learn from failure?

@rakyll Production readiness • Ensure a service meets accepted standards of operational
readiness. • Ensure the teams can benefit from organization-wide knowledge. Production readiness involves reviews (PRRs).

@rakyll When? Various times in the lifetime of a service!
• Launching a new production service. • Introducing prod readiness to an existing service. • Handing off the operations. • Preparing oncall support.

@rakyll Checklist • Design • Development • Configuration management • Release management • Observability • Security
• Capacity planning

@rakyll Why? Engineers don’t feel confident, which affects release
velocity. Teams figure out practices by trial and error. A trial-and-error culture makes finger-pointing common.

@rakyll PRRs A checklist or a questionnaire. Check manually, automatically
or do both. Self-served or assisted. Create checklist templates if needed.
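
One way to picture a checklist that mixes manual and automated checks is the minimal sketch below. The `ChecklistItem` type, the `run_review` function, and the sample items are hypothetical names made up for illustration; the lambdas stand in for whatever real probes a team would write.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChecklistItem:
    name: str
    rationale: str
    # Automated check; None means a human has to answer this item.
    check: Optional[Callable[[], bool]] = None


def run_review(items):
    """Return a dict mapping each item name to 'pass', 'fail', or 'manual'."""
    results = {}
    for item in items:
        if item.check is None:
            results[item.name] = "manual"
        else:
            results[item.name] = "pass" if item.check() else "fail"
    return results


# Hypothetical items; the lambdas stand in for real probes of your systems.
items = [
    ChecklistItem("reproducible builds",
                  "rollouts must not depend on external outages",
                  lambda: True),
    ChecklistItem("canary revert documented",
                  "reverts must be quick",
                  lambda: False),
    ChecklistItem("SLOs defined",
                  "set expectations at design time"),
]
results = run_review(items)
```

Items without a `check` are reported as `manual`, so the same template can serve a self-served review and an assisted one.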

@rakyll DISCLAIMER: THIS IS A REFERENCE.

@rakyll Design and development Have reproducible builds. Define and set
SLOs for your service at design time. Document the availability expectations of external dependencies. Avoid single points of failure by not depending on a single global resource.
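
Setting an SLO at design time becomes more concrete once it is turned into an error budget. The sketch below assumes an availability SLO measured over a 30-day window; the function name `error_budget_minutes` and the 99.9% figure are illustrative assumptions, not values from the talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


# Hypothetical target: 99.9% availability over 30 days,
# which leaves roughly 43.2 minutes of budget.
budget = error_budget_minutes(0.999)
```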

@rakyll Have reproducible builds. Rationale: Rollouts shouldn’t be affected by
outages of external systems. Otherwise, the ability to roll out will always be capped by the least available external service, and it won’t be possible to roll out or roll back on demand. Deliverables: • Use X to ensure source tree is hosted internally. • Use Y for continuous integration. • Don’t mandate the external test coverage service before merges.
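
The cap described in the rationale can be illustrated with a little arithmetic. The sketch below assumes that a rollout needs every listed dependency to be up and that the dependencies fail independently; under those assumptions the combined availability is the product of the individual ones, which can never exceed the least available dependency. The dependency names and numbers are hypothetical.

```python
import math


def combined_availability(availabilities):
    """Availability of a step that needs every dependency to be up.

    Assumes independent failures, which is a simplification.
    """
    return math.prod(availabilities)


# Hypothetical availability figures for systems a rollout depends on.
deps = {"source hosting": 0.999, "CI": 0.995, "coverage service": 0.99}
combined = combined_availability(deps.values())
```

Dropping an external dependency from the merge path, as the last deliverable suggests, removes its factor from the product.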

@rakyll Configuration Static, small and non-secret configuration can be flags.
Use a configuration delivery service for everything else. Development configuration shouldn’t inherit from production. Document dynamic configuration capabilities.
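
As a minimal sketch of static, small, non-secret configuration passed as flags, the example below uses Python’s standard `argparse` module. The flag names `--port` and `--log-level` and their defaults are assumptions chosen for illustration; secrets and anything dynamic would come from a configuration delivery service instead.

```python
import argparse


def parse_flags(argv):
    """Parse static, non-secret settings from command-line flags."""
    parser = argparse.ArgumentParser(description="example service")
    parser.add_argument("--port", type=int, default=8080,
                        help="port to listen on")
    parser.add_argument("--log-level", default="info",
                        choices=["debug", "info", "warning", "error"],
                        help="logging verbosity")
    return parser.parse_args(argv)


# Example invocation with one flag overridden.
flags = parse_flags(["--port", "9090"])
```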

@rakyll Release management Document your release process step by step.
Document how releases affect metrics. Document your canary release process. Document how to revert canaries. Ensure that rollbacks use the same process that rollouts use.
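
One way to make rollbacks use the same process as rollouts is to treat a rollback as a deploy of the previous version. The sketch below is a toy model under that assumption; the `Releaser` class and its methods are hypothetical, and a real `deploy` would run the documented canary and rollout steps.

```python
class Releaser:
    """Toy release tracker where rollback reuses the deploy path."""

    def __init__(self):
        self.history = []  # versions in the order they were deployed

    def deploy(self, version):
        # A real implementation would canary, watch metrics, then roll out.
        self.history.append(version)
        return version

    def current(self):
        return self.history[-1] if self.history else None

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        previous = self.history[-2]
        # Same code path as a rollout, just with an older version.
        return self.deploy(previous)


releaser = Releaser()
releaser.deploy("v1")
releaser.deploy("v2")
releaser.rollback()
```

Because `rollback` goes through `deploy`, it is exercised every time a normal release happens, not only during an incident.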

@rakyll Observability Ensure the collection of metrics that are required
by your SLOs. Ensure client-side and server-side data can be differentiated. Include (cloud) platform metrics in your dashboards. Set up alerting for your external service dependencies. Always propagate the incoming distributed trace context header.
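
Propagating trace context can be sketched as copying the trace headers from the incoming request onto every outgoing request. The talk does not name a header format; the example below assumes the W3C Trace Context headers `traceparent` and `tracestate`, and the function name is hypothetical. A tracing library would typically also replace the parent ID with its own span ID, which this sketch does not do.

```python
# Assumption: W3C Trace Context header names.
TRACE_HEADERS = ("traceparent", "tracestate")


def propagate_trace_context(incoming_headers, outgoing_headers=None):
    """Copy trace context headers from an incoming request to an outgoing one."""
    outgoing = dict(outgoing_headers or {})
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    for name in TRACE_HEADERS:
        if name in lowered:
            outgoing[name] = lowered[name]
    return outgoing


incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "Accept": "*/*",
}
outgoing = propagate_trace_context(incoming, {"content-type": "application/json"})
```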

@rakyll Security (1/2) Ensure all external requests are encrypted. Ensure
all production projects have proper IAM configuration. Use subnetworks to isolate. Use a VPN to connect to remote networks. Document and monitor user data access. Ensure debugging endpoints are limited by ACL.

@rakyll Security (2/2) Sanitize user input. Have payload size restrictions
for user input. Ensure your service can block incoming traffic selectively per user. Avoid external endpoints that trigger a large number of internal fanouts.
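
Two of these items, payload size restrictions and selective per-user blocking, can be sketched as a single admission check run before a request is handled. The function name `admit`, the 1 MiB limit, and the choice of HTTP status codes are assumptions made for illustration.

```python
MAX_BODY_BYTES = 1024 * 1024  # hypothetical 1 MiB limit


def admit(user_id, body: bytes, blocked_users, max_bytes=MAX_BODY_BYTES):
    """Return an HTTP status code: 403 if blocked, 413 if too large, else 200."""
    if user_id in blocked_users:
        return 403  # traffic from this user is selectively blocked
    if len(body) > max_bytes:
        return 413  # payload exceeds the size restriction
    return 200


blocked = {"abusive-user"}
```

Checking the block list first means a blocked user is rejected without the service inspecting the payload at all.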

@rakyll Capacity planning Document how your service scales. Document resource
requirements for your service. Document resource constraints: resource type, region, etc. Document quota restrictions on creating new resources. Document load tests for performance regressions if possible.

@rakyll Where to start? • Acknowledge the need as a
team. • Research practices that apply, consult domain experts. • Start having production readiness discussions early. • Learn from failure and share knowledge widely. • Enforce production readiness practices.

@rakyll Evaluate PRRs... Provide rationale for each item. Remove items
as they become irrelevant. Review the reviews when needed.

@rakyll Start early on... Discuss when you are initially designing
a new service. PRRs are seen as a burden rather than a helpful tool when introduced late. Checklists provide insights about design tradeoffs.

@rakyll The “process”... “We don’t want to deal with process
and the overhead of committees!” Don’t. Reviews can increase the velocity. - Faster design decisions. - Faster troubleshooting. - Faster onboarding.

@rakyll Production readiness can help teams be confident, reduce
everyday mistakes, and help onboard new people to projects.

@rakyll You can do reviews manually, automatically, or both.

@rakyll Production readiness is not just a later addition; it helps teams
make design decisions, so it should be discussed early and be part of the initial design.

@rakyll Relatively low overhead compared to not having them. No
review means repeated mistakes and failure, as well as burnout.
[Figure labels: maturity, release velocity]

@rakyll Last but not least, it is more than making
sure your boxes check.

@rakyll Happy production :) Jaana B. Dogan, Google [email protected] jbd.dev/prod-readiness
