Path to SRE

You’ve heard all the buzz about SRE, but what does it actually take to do it if you are not Google? In this talk you will learn about our experience creating Auth0's SRE flavor and rolling it out.

Damian Schenkelman

July 17, 2019

Transcript

  1. The Path to SRE
    @dschenkelman
    Director of Engineering @auth0

  2. SRE

  3. Why?

  4. Reliability is the one feature
    every customer uses
    - an @auth0 SRE

  5. Auth0
    User
    Auth0
    Customer App

  6. Context

  7. Focused Investment
    Like Security but for Reliability

  8. Scale

  9. Research

  10. Companies

  11. Organizations

  12. Style

  13. Sponsors

  14. Who?

  15. Spectrum
    Systems Software

  16. The Usual Suspects

  17. Teachers

  18. Advocates

  19. Problem solvers

  20. Know the system

  21. Experience

  22. node.js

  23. Educate

  24. What we do
    SRE identifies, develops, refines,
    and disseminates the libraries,
    services, practices, and processes
    key to system reliability.

  25. SRE does not force itself
    on other teams

  26. SRE does not handle all
    incident response

  27. Involvement Spectrum
    SRE Run Service
    Embedding
    Consultancy
    Office Hours/
    Workshops

  28. Contacting SRE

  29. The brand

  30. Logo

  31. Office Hours

  32. Brown bags

  33. Investigations

  34. Flexibility

  35. Incidents

  36. Execute!

  37. You are selling TRUST

  38. SLOs

  39. R2

  43. Incident Response

  44. Distributed Traces

  45. Rate limiting

  46. CI/CD

  47. Complex Issues

  48. Today

  49. Org
    IAM DX
    Platform
    SRE

  50. Results
    • 5/11 teams doing R2s organically
    • > 5x more frequent deploys with < 10x
    duration
    • 80% critical services with tracing

  51. Results (2)
    • 5 complex issues solved
    • > 99.99% reliability for User
    Management API
    • ~8ms -> ~3ms 99th percentile latency
    for rate limits
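
To put the "> 99.99%" figure in perspective, here is a minimal sketch of the downtime such a target allows. It assumes the target is measured as time-based availability over a rolling 30-day window; the slide states neither the measurement method nor the window, and `allowed_downtime_minutes` is a hypothetical helper, not something from the talk.

```python
# Hypothetical helper (not from the talk): how much downtime a given
# availability target permits over a window, assuming the target is
# measured as the fraction of time the service is available.
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Return the minutes of unavailability permitted in the window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the interval (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - target) * total_minutes


if __name__ == "__main__":
    # Assumed 30-day window; the slide does not specify one.
    budget = allowed_downtime_minutes(0.9999)
    print(f"{budget:.2f} minutes of downtime allowed per 30 days")
```

Under these assumptions a 99.99% target leaves roughly four and a half minutes of downtime per month, which suggests why the deck pairs the target with incident response and tracing work.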

  52. Success

  53. Vision
    Subject to change :)
    IAM DX
    Platform
    SRE
    PR
    SRE
    AR
    SRE
    AR
    SRE
    OX

  54. Thanks
    @dschenkelman
