Achieving repeatable, extensible and self serve infrastructure

Tasdik Rahman
November 16, 2019

Transcript

  1. Achieving repeatable, extensible and self serve infrastructure

  2. tasdikrahman.me
    @tasdikrahman
    ● Product Engineer @ Gojek
    ● Contributor to oVirt
    ● Backpacker
    ● Weekend chef
    ● Chelsea FC!!

  3. What does Gojek do?

  4. Ref: gojek.io

  5. What am I gonna talk about?

  6. Ref: shutterstock.com

  7. Evolution of Infrastructure @ Gojek
    Ref: shutterstock.com

  8. Travelling back in time

  9. Rapid Demand

  10. How to deal with it?

  11. Central Infrastructure Team

  12. Abstract out Infrastructure For Product Teams

  13. Ad hoc requests

  14. “Measure what is measurable, and make measurable what is not so” - Galileo
    Credits: biography.com

  15. Service request tickets

  16. Example service request in our ticket system by a team (names redacted)

  17. Example service request to increase disk size (names redacted)

  18. Number of service requests kept increasing with scale and more product groups coming in

  19. Ref: gunshowcomic.com/648

  20. How does one keep up with service requests?

  21. Scale your team vertically and keep doing so

  22. Sustainable?

  23. Very hard to do, but mostly No

  24. Eventually, we noticed we were becoming the bottleneck

  25. Give access to someone from the product team?

  26. Chances of Security loopholes

  27. Ref: https://blog.codinghorror.com/the-broken-window-theory/

  28. What do we do then?

  29. Quick detour

  30. Where did systems administration start?

  31. Evolution of Automation at Gojek

  32. Evolution of Automation at Gojek
    ● Scripts
    ● Chef-cookbooks
    ● Rundeck
    ● Deployment scripts

  33. Problems with the earlier solutions
    ● Multiple ways of building and using automation
    ● Managing dependencies for the automation. E.g. people using gcloud/AWS

  34. Problems with the earlier solutions
    ● Lack of convention leading to meagre contributions to automation from devs.
    ● Ad hoc way of managing access to tools like terraform and knife, leading to stray accidents.
    ● No central platform for automation.

  35. Number of tickets getting created still not decreasing

  36. Clearing infrastructure debts

  37. Moving from maintenance to innovation mode

  38. Making infrastructure boring for product teams

  39. Proctor: Our automation orchestrator
    Ref: github.com/gojek/proctor

  40. Installation

  41. Helm all the way
    Reference value: stable/proctor-service/values.yaml

  42. Automation using proctor

  43. Sample proc to increase disk

  44. Sample proc to increase disk

  45. Scripts can be added by developers and they get added to proctor after our review

  46. Sample procs in our ecosystem

  47. Outcome of having proctor?

  48. Decrease in number of tickets which were mechanical in nature

  49. Having terraform inside CI

  50. But before that

  51. Creating the gcloud project

  52. Sample directory structure

  53. .gitlab-ci.yml for the gcloud project in gitlab

  54. Plan and apply
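The slide image for "Plan and apply" is not part of this transcript. As a rough sketch only, and not the pipeline shown in the talk, the two stages could be driven from a CI job as below, assuming the job simply shells out to the terraform CLI. The plan file name `tfplan` and the helper names are hypothetical; the CLI flags (`-input=false`, `-out=`) are the ones described in the HashiCorp "running terraform in automation" guide linked at the end of the deck.

```python
import subprocess
from typing import List

PLAN_FILE = "tfplan"  # hypothetical name for the saved plan


def plan_commands(plan_file: str = PLAN_FILE) -> List[List[str]]:
    """Plan stage: initialise the working directory, then save a plan to a file."""
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
    ]


def apply_commands(plan_file: str = PLAN_FILE) -> List[List[str]]:
    """Apply stage: re-initialise, then apply exactly the plan that was saved."""
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "apply", "-input=false", plan_file],
    ]


def run(commands: List[List[str]], workdir: str) -> None:
    """Run each command in workdir; stop at the first non-zero exit code."""
    for cmd in commands:
        subprocess.run(cmd, cwd=workdir, check=True)
```

Saving the plan to a file and applying that same file means the apply stage runs only what was reviewed in the plan stage, which is the usual reason for splitting the two steps in CI.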

  55. Private terraform registry consisting of 90+ modules

  56. Teams managing and provisioning their own infra with our best practices baked into terraform modules

  57. OSS alternatives?

  58. Reference: runatlantis.io/

  59. Ideal state?

  60. Ref: Google SRE book: Eliminating toil

  61. Known caveats?

  62. Deletion of infra

  63. Teams forget what they are using

  64. Lessons learnt?

  65. Avoid premature automation

  66. A high volume of service requests from product teams is a smell

  67. No Big bang changes

  68. Documentation should go hand in hand with the tooling; it affects productivity directly

  69. Reduce steps for onboarding to your tooling; the fewer the better

  70. Invisible infrastructure

  71. Product managers in Infrastructure teams

  72. Prioritizing on innovation

  73. Links and References
    ● https://github.com/gojek/proctor
    ● https://blog.gojekengineering.com/olympus-terraforming-repeatable-and-extensible-infrastructure-at-go-jek-42ad5b0a4f9a
    ● https://learn.hashicorp.com/terraform/development/running-terraform-in-automation
    ● https://lethain.com/product-management-infra-engineering/

  74. @tasdikrahman
    tasdikrahman.me