Achieving repeatable, extensible and self serve infrastructure

Tasdik Rahman

November 16, 2019

Transcript

  1. Achieving repeatable, extensible and self serve infrastructure

  2. tasdikrahman.me @tasdikrahman • Product Engineer @ Gojek • Contributor to oVirt • Backpacker • Weekend chef • Chelsea FC!!

  3. What does Gojek do?

  4. Ref: gojek.io

  5. What am I gonna talk about?

  6. Ref: shutterstock.com

  7. Evolution of Infrastructure @ Gojek. Ref: shutterstock.com

  8. Travelling back in time

  9. Rapid Demand

  10. How to deal with it?

  11. Central Infrastructure Team

  12. Intent?

  13. Abstract out Infrastructure For Product Teams

  14. Outcome?

  15. Ad hoc requests

  16. “Measure what is measurable, and make measurable what is not so” - Galileo. Credits: biography.com

  17. Service request tickets

  18. Example service request in our ticket system by a team (names redacted)

  19. Example service request to increase disk size (names redacted)

  20. Number of service requests kept increasing with scale and more product groups coming in

  21. Ref: gunshowcomic.com/648

  22. How does one keep up with service requests?

  23. Scale your team vertically and keep doing so

  24. Sustainable?

  25. Very hard to do, but mostly no

  26. Eventually, we noticed we were becoming the bottleneck

  27. Give access to someone from the product team?

  28. Chances of security loopholes

  29. Ref: https://blog.codinghorror.com/the-broken-window-theory/

  30. What do we do then?

  31. Quick detour

  32. Where did systems administration start?

  33. Evolution of Automation at Gojek

  34. Evolution of Automation at Gojek
    • Scripts
    • Chef cookbooks
    • Rundeck
    • Deployment scripts

  35. Problems with the earlier solutions
    • Multiple ways around building and using automation
    • Managing dependencies for the automation. E.g. people using gcloud/AWS

  36. Problems with the earlier solutions
    • Lack of convention, leading to meagre contributions to automation from devs.
    • Ad hoc way of managing access to tools like Terraform and knife, leading to stray accidents.
    • No central platform for automation.

  37. Number of tickets being created still not decreasing

  38. Clearing infrastructure debts

  39. Moving from maintenance to innovation mode

  40. Making infrastructure boring for product teams

  41. Proctor: Our automation orchestrator. Ref: github.com/gojek/proctor

  42. (no transcribed text)

  43. (no transcribed text)

  44. Installation

  45. Helm all the way. Reference value: stable/proctor-service/values.yaml

  46. Automation using Proctor

  47. Sample proc to increase disk

  48. Sample proc to increase disk

  49. Scripts can be added by developers, and they get added to Proctor after our review

  50. Sample procs in our ecosystem

  51. Demo

  52. Profit?

  53. Outcome of having Proctor?

  54. Decrease in the number of tickets which were mechanical in nature

  55. Having Terraform inside CI

  56. But before that

  57. Creating the gcloud project

  58. Sample directory structure

  59. .gitlab-ci.yml for the gcloud project in GitLab

  60. (no transcribed text)

  61. Plan and apply

  62. Private Terraform registry consisting of 90+ modules

  63. Outcome?

  64. Teams managing and provisioning their own infra with our best practices baked into Terraform modules

  65. OSS alternatives?

  66. Reference: runatlantis.io/

  67. Ideal state?

  68. Ref: Google SRE book: Eliminating toil

  69. Known caveats?

  70. Deletion of infra

  71. Teams forget what they are using

  72. Lessons learnt?

  73. Avoid premature automation

  74. A high volume of service requests for product teams is a smell

  75. No big bang changes

  76. Documentation should go hand in hand; it affects productivity directly

  77. Reduce the steps for onboarding to your tooling; the fewer the better

  78. Invisible infrastructure

  79. Product managers in Infrastructure teams

  80. Prioritizing innovation

  81. Links and References
    • https://github.com/gojek/proctor
    • https://blog.gojekengineering.com/olympus-terraforming-repeatable-and-extensible-infrastructure-at-go-jek-42ad5b0a4f9a
    • https://learn.hashicorp.com/terraform/development/running-terraform-in-automation
    • https://lethain.com/product-management-infra-engineering/

  82. @tasdikrahman tasdikrahman.me
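
The "Having terraform inside CI" and "Plan and apply" slides describe a two-step flow, but the pipeline definition itself did not survive in the transcript. As a rough illustration only, the Python sketch below lists the non-interactive command sequence recommended by the HashiCorp "running Terraform in automation" guide from the references slide: init, then plan to a saved file, then apply that exact file. The function names `terraform_ci_commands` and `run_pipeline`, the `tfplan` file name, and the working directory are hypothetical and are not taken from Gojek's pipeline.

```python
import subprocess
from typing import List


def terraform_ci_commands(plan_file: str = "tfplan") -> List[List[str]]:
    """Return the Terraform commands a CI job might run, in order.

    `plan_file` is a hypothetical name for the saved plan; applying the
    saved file means the apply step executes exactly what was reviewed.
    """
    if not plan_file:
        raise ValueError("plan_file must be a non-empty file name")
    return [
        # -input=false makes Terraform fail instead of prompting in CI.
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
        ["terraform", "apply", "-input=false", plan_file],
    ]


def run_pipeline(workdir: str, runner=subprocess.run) -> None:
    """Run each command in `workdir`, stopping at the first failure."""
    for cmd in terraform_ci_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(cmd, cwd=workdir, check=True)
```

In a real CI setup the plan and apply steps would more likely be separate stages, with the plan file passed between them as an artifact and the apply stage gated on review; that split is an assumption here, not something the slides confirm.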