Achieving repeatable, extensible and self serve infrastructure

Tasdik Rahman
November 16, 2019

Transcript

  1. Achieving repeatable, extensible and self serve infrastructure

  2. tasdikrahman.me
    @tasdikrahman
    ● Product Engineer @ Gojek
    ● Contributor to oVirt
    ● Backpacker
    ● Weekend chef
    ● Chelsea FC!!

  3. What does Gojek do?

  4. Ref: gojek.io

  5. What am I gonna talk about?

  6. Ref: shutterstock.com

  7. Evolution of Infrastructure @ Gojek
    Ref: shutterstock.com

  8. Travelling back in time

  9. Rapid Demand

  10. How to deal with it?

  11. Central Infrastructure Team

  12. Intent?

  13. Abstract out Infrastructure For Product Teams

  14. Outcome?

  15. Ad hoc requests

  16. “Measure what is measurable, and make measurable what is not so”
    - Galileo
    Credits: biography.com

  17. Service request tickets

  18. Example service request in our ticket system by a team (names redacted)

  19. Example service request to increase disk size (names redacted)

  20. The number of service requests kept increasing with scale and as more product groups came in

  21. Ref: gunshowcomic.com/648

  22. How does one keep up with service requests?

  23. Scale your team vertically and keep doing so

  24. Sustainable?

  25. Very hard to do; mostly no

  26. Eventually, we noticed we were becoming the bottleneck

  27. Give access to someone from the product team?

  28. Chances of security loopholes

  29. Ref: https://blog.codinghorror.com/the-broken-window-theory/

  30. What do we do then?

  31. Quick detour

  32. Where did systems administration start?

  33. Evolution of Automation at Gojek

  34. Evolution of Automation at Gojek
    ● Scripts
    ● Chef cookbooks
    ● Rundeck
    ● Deployment scripts

  35. Problems with the earlier solutions
    ● Multiple ways of building and using automation
    ● Managing dependencies for the automation, e.g. people using gcloud/AWS

  36. Problems with the earlier solutions
    ● Lack of convention, leading to meagre contributions to automation from devs
    ● Ad hoc management of access to tools like Terraform and knife, leading to stray accidents
    ● No central platform for automation

  37. Number of tickets getting created still not decreasing

  38. Clearing infrastructure debt

  39. Moving from maintenance to innovation mode

  40. Making infrastructure boring for product teams

  41. Proctor: Our automation orchestrator
    Ref: github.com/gojek/proctor

  44. Installation

  45. Helm all the way
    Reference value: stable/proctor-service/values.yaml

  46. Automation using Proctor

  47. Sample proc to increase disk

  48. Sample proc to increase disk

  49. Scripts can be added by developers; they are added to Proctor after our review

  50. Sample procs in our ecosystem

  51. Demo

  52. Profit?

  53. Outcome of having Proctor?

  54. Decrease in the number of tickets which were mechanical in nature

  55. Having Terraform inside CI

  56. But before that

  57. Creating the gcloud project

  58. Sample directory structure

  59. .gitlab-ci.yml for the gcloud project in GitLab

  61. Plan and apply
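
The slide's actual pipeline is not captured in the transcript. As a rough sketch only, a plan-then-apply CI job might run the non-interactive sequence described in the HashiCorp "Running Terraform in Automation" guide listed under Links and References. The function names, the `tfplan` file name and the injectable `runner` parameter below are hypothetical and chosen for illustration; only the `terraform` subcommands and flags are real CLI options.

```python
import subprocess
from typing import Callable, List


def plan_and_apply_commands(plan_file: str = "tfplan") -> List[List[str]]:
    """Build the Terraform CLI calls for a non-interactive plan/apply run.

    The plan is written to a file so that apply executes exactly what was
    reviewed, rather than re-planning against possibly changed state.
    """
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
        ["terraform", "apply", "-input=false", plan_file],
    ]


def run_pipeline(
    workdir: str,
    runner: Callable[..., object] = subprocess.run,
    plan_file: str = "tfplan",
) -> None:
    """Run init, plan and apply in order inside workdir (hypothetical helper).

    With the default runner, check=True raises on the first failing step,
    so apply is never reached after a failed init or plan.
    """
    for cmd in plan_and_apply_commands(plan_file):
        runner(cmd, cwd=workdir, check=True)
```

In a real pipeline the plan and apply steps would more likely be separate CI stages with a review in between; they are combined here only to keep the sketch short.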

  62. Private Terraform registry consisting of 90+ modules

  63. Outcome?

  64. Teams managing and provisioning their own infra with our best practices baked into Terraform modules

  65. OSS alternatives?

  66. Reference: runatlantis.io/

  67. Ideal state?

  68. Ref: Google SRE book: Eliminating toil

  69. Known caveats?

  70. Deletion of infra

  71. Teams forget what they are using

  72. Lessons learnt?

  73. Avoid premature automation

  74. A high volume of service requests from product teams is a smell

  75. No big-bang changes

  76. Documentation should go hand in hand; it affects productivity directly

  77. Reduce the steps for onboarding to your tooling; the fewer the better

  78. Invisible infrastructure

  79. Product managers in Infrastructure teams

  80. Prioritizing innovation

  81. Links and References
    ● https://github.com/gojek/proctor
    ● https://blog.gojekengineering.com/olympus-terraforming-repeatable-and-extensible-infrastructure-at-go-jek-42ad5b0a4f9a
    ● https://learn.hashicorp.com/terraform/development/running-terraform-in-automation
    ● https://lethain.com/product-management-infra-engineering/

  82. @tasdikrahman
    tasdikrahman.me