Slide 1

Slide 1 text

@rakyll Are you ready for production? Jaana B. Dogan, Google jbd.dev/prod-readiness

Slide 3

Slide 3 text

Are we ready to push to production?
What’s good or bad when operating services?
How do we transfer knowledge?
How do we learn from failure?

Slide 4

Slide 4 text

Production readiness
● Ensure a service meets accepted standards of operational readiness.
● Ensure the teams can benefit from organization-wide knowledge.
Production readiness involves reviews (PRRs).

Slide 5

Slide 5 text

When?
Various times in the lifetime of a service!
● Launching a new production service.
● Introducing prod readiness to an existing service.
● Handing off the operations.
● Preparing oncall support.

Slide 6

Slide 6 text

Checklist
Design
Development
Configuration management
Release management
Observability
Security
Capacity planning

Slide 7

Slide 7 text

Why?
Engineers don’t feel confident, which affects release velocity.
Teams figure out practices by trial and error.
A trial-and-error culture makes finger-pointing common.

Slide 8

Slide 8 text

PRRs
A checklist or a questionnaire.
Check manually, automatically or do both.
Self-served or assisted.
Create checklist templates if needed.
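
As a rough illustration of the "check manually, automatically or do both" idea, a checklist could be modeled as data in which some items carry an automated check and the rest wait for a reviewer's answer. This is only a sketch: `ChecklistItem`, `review`, and the example items are hypothetical names made up here, not something the talk prescribes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChecklistItem:
    """One PRR item. All field names are hypothetical."""
    description: str
    rationale: str
    # None means a reviewer has to confirm the item manually.
    check: Optional[Callable[[dict], bool]] = None


def review(items, service, manual_answers):
    """Return a mapping from item description to pass/fail."""
    results = {}
    for item in items:
        if item.check is not None:
            # Automated item: run the check against the service description.
            results[item.description] = item.check(service)
        else:
            # Manual item: an unanswered item counts as failing.
            results[item.description] = manual_answers.get(item.description, False)
    return results


items = [
    ChecklistItem(
        "SLOs are defined",
        "Alerting and dashboards depend on them",
        check=lambda svc: bool(svc.get("slos")),
    ),
    ChecklistItem(
        "Release process is documented",
        "Anyone on call can roll out or roll back",
    ),
]

service = {"name": "example-service", "slos": {"availability": 0.999}}
results = review(items, service, {"Release process is documented": True})
```

Keeping a rationale next to each item also makes it easier to drop items once they stop being relevant.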

Slide 9

Slide 9 text

DISCLAIMER: THIS IS A REFERENCE.

Slide 11

Slide 11 text

Design and development
Have reproducible builds.
Define and set SLOs for your service at design time.
Document the availability expectations of external dependencies.
Avoid single points of failure by not depending on a single global resource.
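
To make "define and set SLOs at design time" concrete, it can help to translate an availability target into the downtime it allows. The sketch below does that arithmetic; the function name, the 30-day window, and the 99.9% target are illustrative assumptions, not figures from the talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    `slo` is a fraction such as 0.999; `window_days` is the length
    of the evaluation window (assumed to be 30 days by default).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


# A 99.9% target over 30 days leaves roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)
```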

Slide 12

Slide 12 text

Have reproducible builds.
Rationale: Rollouts shouldn’t be affected by outages of external systems. Otherwise, the ability to roll out will always be capped by the least available external service, and it won’t be possible to roll out or roll back on demand.
Deliverables:
● Use X to ensure source tree is hosted internally.
● Use Y for continuous integration.
● Don’t mandate the external test coverage service before merges.
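
The rationale above can be checked with a little arithmetic. If a rollout needs every external system to be up, its availability can be no better than the least available dependency, and if failures are assumed to be independent, the estimate is the product of all of them. The function name and the sample numbers below are assumptions for illustration only.

```python
import math


def rollout_availability(dependency_availabilities):
    """Return (estimate, cap) for a rollout that needs every dependency.

    cap:      the least available dependency, an upper bound.
    estimate: the product of all availabilities, which assumes
              that the dependencies fail independently.
    """
    deps = list(dependency_availabilities)
    if not deps:
        # No external dependencies: nothing external can block a rollout.
        return 1.0, 1.0
    return math.prod(deps), min(deps)


# Hypothetical availabilities for a source host, a CI service,
# and a coverage service.
estimate, cap = rollout_availability([0.999, 0.995, 0.99])
```

With these made-up numbers the cap is 0.99 and the estimate is a little over 0.984, which is why removing an external service from the rollout path raises how often a rollout or rollback is possible.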

Slide 13

Slide 13 text

Configuration
Static, small and non-secret configuration can be flags.
Use a configuration delivery service for everything else.
Development configuration shouldn’t inherit from production.
Document dynamic configuration capabilities.
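
A minimal sketch of the first two points, using Python's standard `argparse`: static, small, non-secret values are flags, and anything else is fetched from a configuration delivery service whose address is itself just a flag. The flag names, defaults, and the service address are hypothetical.

```python
import argparse


def parse_flags(argv):
    """Parse static, non-secret settings from command-line flags."""
    parser = argparse.ArgumentParser(description="example-service (hypothetical)")
    parser.add_argument("--listen-port", type=int, default=8080)
    parser.add_argument(
        "--log-level",
        choices=["debug", "info", "warning", "error"],
        default="info",
    )
    # Secrets and dynamic values are not flags; only the address of the
    # (hypothetical) configuration delivery service is passed this way.
    parser.add_argument("--config-service-addr", default="localhost:9000")
    return parser.parse_args(argv)


flags = parse_flags(["--listen-port", "9090"])
```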

Slide 14

Slide 14 text

Release management
Document your release process step by step.
Document how releases affect metrics.
Document your canary release process.
Document how to revert canaries.
Ensure that rollbacks use the same process that rollouts use.
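
One way to read the last point is that a rollback should simply be a rollout of the previous version, so that it goes through the same, well-exercised steps. The toy `Deployer` below sketches that idea; the class, the step names, and the in-memory history are all hypothetical and stand in for a real deployment system.

```python
class Deployer:
    """Toy deployer in which rollback reuses the rollout path."""

    STEPS = ("canary", "verify metrics", "full rollout")

    def __init__(self):
        self.history = []  # versions in the order they were rolled out
        self.log = []      # (version, step) pairs, for auditing

    def rollout(self, version):
        for step in self.STEPS:
            # A real system would act here; this sketch only records the step.
            self.log.append((version, step))
        self.history.append(version)

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        previous = self.history[-2]
        # Same code path as any other release.
        self.rollout(previous)
        return previous


deployer = Deployer()
deployer.rollout("v1")
deployer.rollout("v2")
restored = deployer.rollback()
```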

Slide 15

Slide 15 text

Observability
Ensure the collection of metrics that are required by your SLOs.
Ensure client-side and server-side data can be differentiated.
Include (cloud) platform metrics in your dashboards.
Set up alerting for your external service dependencies.
Always propagate the incoming distributed trace context header.
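
The last point can be sketched as copying the trace headers from the incoming request onto every outgoing one. The example assumes the W3C Trace Context header names `traceparent` and `tracestate`; the talk does not say which header format is meant, and the function name is hypothetical. The headers are forwarded unchanged here, whereas a tracing library would normally also record its own span and update the parent ID.

```python
TRACE_HEADERS = ("traceparent", "tracestate")


def propagate_trace_context(incoming_headers, outgoing_headers=None):
    """Return outgoing headers with the incoming trace context copied over."""
    outgoing = dict(outgoing_headers or {})
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    for name in TRACE_HEADERS:
        if name in lowered:
            outgoing[name] = lowered[name]
    return outgoing


incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "Accept": "*/*",
}
outgoing = propagate_trace_context(incoming, {"content-type": "application/json"})
```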

Slide 16

Slide 16 text

Security (1/2)
Ensure all external requests are encrypted.
Ensure all production projects have proper IAM configuration.
Use subnetworks to isolate.
Use VPN to connect to remote networks.
Document and monitor user data access.
Ensure debugging endpoints are limited by ACL.

Slide 17

Slide 17 text

Security (2/2)
Sanitize user input.
Have payload size restrictions for user input.
Ensure your service can block incoming traffic selectively per user.
Avoid external endpoints that trigger a large number of internal fanouts.
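
Two of these items, payload size restrictions and selective per-user blocking, can be sketched as a small admission check that runs before any real work is done. The 64 KiB limit, the function name, and the user IDs are assumptions chosen for the example.

```python
MAX_PAYLOAD_BYTES = 64 * 1024  # hypothetical limit


def admit(user_id, payload, blocked_users=frozenset(), max_bytes=MAX_PAYLOAD_BYTES):
    """Decide whether to accept a request; returns (accepted, reason)."""
    if user_id in blocked_users:
        # Selective blocking: only this user's traffic is rejected.
        return False, "user blocked"
    if len(payload) > max_bytes:
        return False, "payload too large"
    return True, "ok"


blocked = {"abusive-user"}
ok = admit("alice", b"hello", blocked)
rejected_user = admit("abusive-user", b"hello", blocked)
rejected_size = admit("alice", b"x" * (MAX_PAYLOAD_BYTES + 1), blocked)
```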

Slide 18

Slide 18 text

Capacity planning
Document how your service scales.
Document resource requirements for your service.
Document resource constraints: resource type, region, etc.
Document quota restrictions to create new resources.
Document load tests for performance regressions if possible.
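
Documenting how a service scales often comes down to a simple calculation that can be written next to the numbers it depends on. The sketch below estimates an instance count from peak traffic; the formula, the 30% headroom, the single spare instance, and the sample figures are assumptions for illustration, not guidance from the talk.

```python
import math


def instances_needed(peak_qps, qps_per_instance, headroom=0.3, redundancy=1):
    """Estimate the instance count for a given peak load.

    headroom:   fraction of extra capacity kept free for spikes.
    redundancy: spare instances so that losing one does not cause overload.
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    base = math.ceil(peak_qps * (1 + headroom) / qps_per_instance)
    return base + redundancy


# 1000 QPS at peak, 150 QPS per instance (e.g. measured in a load test).
count = instances_needed(1000, 150)
```

With these numbers the result is 10 instances: 9 to serve the peak with headroom, plus one spare.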

Slide 19

Slide 19 text

Where to start?
● Acknowledge the need as a team.
● Research practices that apply, consult domain experts.
● Start having production readiness discussions early.
● Learn from failure and share knowledge widely.
● Enforce production readiness practices.

Slide 20

Slide 20 text

Evaluate PRRs...
Provide rationale for each item.
Remove items as they become irrelevant.
Review the reviews when needed.

Slide 21

Slide 21 text

Start early on...
Discuss when you are initially designing a new service.
PRRs are seen as a burden rather than a helpful utility when introduced late.
Checklists provide insights about design tradeoffs.

Slide 22

Slide 22 text

The “process”...
“We don’t want to deal with process and the overhead of committees!”
Don’t. Reviews can increase the velocity.
- Faster design decisions.
- Faster troubleshooting.
- Faster onboarding.

Slide 23

Slide 23 text

Production readiness can help teams be confident, reduce everyday mistakes, and onboard new people to projects.

Slide 24

Slide 24 text

You can do the reviews manually, automatically or both.

Slide 25

Slide 25 text

Production readiness is not just a later addition. It helps teams make design decisions, so it should be discussed early and be part of the initial design.

Slide 26

Slide 26 text

Relatively low overhead compared to not having them. No review means repeated mistakes and failure, as well as burnout.
[Chart: axes labeled “maturity” and “release velocity”]

Slide 27

Slide 27 text

Last but not least, it is more than making sure your boxes are checked.

Slide 28

Slide 28 text

Happy production :)
Jaana B. Dogan, Google
jbd@google.com
jbd.dev/prod-readiness