Lessons scaling operations to everyone @indix

Presented at Rootconf > Miniconf 2017, Chennai. Video available at https://www.youtube.com/watch?v=zUTz1eqwBkI

Ashwanth Kumar

November 25, 2017

Transcript

  1. Lessons scaling operations to everyone @indix
    Rootconf → Miniconf (Chennai) 2017
  2. About Me
    Ashwanth Kumar, Principal Engineer, Indix (@_ashwanthkumar)

  3. Later Stage* / Growth Stage / Early Stage

  4. Early Stage
    Goal - Working Infrastructure
    → 5 - 15 Developers
    → 1 - 2 Member Ops Team
    Responsibilities
    → Write Deployment Scripts for various systems (internal and open-source)
    → Centralized Control & Responsibility of Infrastructure on AWS
  5. Early Stage Lessons
    → Operations team couldn’t really contribute to our system design / architecture
      ◦ Always overloaded with ad-hoc requests
    → On-call support for our existing production systems without much context
  6. Early Stage Lessons
    → Developers wanted to try lots of new things in the fast-growing Big Data landscape, but ops couldn’t handle all these requests
      ◦ Ops started working with Devs so they could take on these experiments on their own
      ◦ Devs had a lot of say about the operational setup, scripts, etc.
  7. Growth Stage
    Goal - Decentralised access to infrastructure
    → 15 - 30 Engineers
    → 2 - 3 Ops Engineers
    Responsibilities
    → Educate developers on their infrastructure
    → Work on the overall process (or framework) for operations
  8. Growth Stage
    “If the development team is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.”
    From: On Designing and Deploying Internet-Scale Services ~ James Hamilton - LISA ’07
  9. Growth Stage Lessons
    → (+) Some developers loved to contribute to operations - oss.indix.com
    → (+) Individual teams took over their infra & on-call, resulting in faster & better systems
    → (-) With decentralised operations, cost control is very hard, but super important
    → (-) Backup is important if we have to provide CRUD access to everybody
  10. Later Stage
    Goal - Self Serve Infrastructure
    → 30 - 50 (approx.) Engineers
    → 2 - 3 Ops Engineers
    Responsibilities
    → Become enablers (via process / automation tools) for engineers to deliver e2e
    → Influence the design & architecture of all systems with focus on cost, security & HA
  11. Later Stage Lessons
    → (+) Using Resource schedulers helped provide a unified view of all underlying resources
    → (+) Operations is a first-class skill for Devs and “Development” is for Ops
    → (+) Operability Review before the first prod push helped reduce lots of surprises
    → (-) De-centralised infra access led to a lot of fragmentation in the deployment stack
  12. TechRadar to address Fragmentation

  13. http://oss.indix.com/indix-radar/DevOps/

  14. This view helps new and existing members see which tools we’ve tried in the past and decided to use or not, which ones we’ve decided to try, etc.
    http://oss.indix.com/indix-radar/DevOps/
  15. Operability Review

  16. “Unless we meet all the requirements mentioned below, the software won't be signed off from operability perspective and would not qualify as production ready.”
    “ … we should be able to identify possible issues before running our code in production. In order to bring stability to any production system, an Operability Review tries to identify such areas and take pro-active measure to minimize the overhead in a live system.”
  17. Items from the Operability Review
    → Benchmarking / Load Testing Results
    → Data store (& its setup)
    → Security
    → Scaling Policy
    → Deployment
    → Backup / Recovery
    → Monitoring / Alerts
    → Cost, etc.
  18. Credits
    Swathi Ravichandran (@swathrav)
    Thank you! @_ashwanthkumar
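
The Operability Review on slides 16 and 17 amounts to a sign-off gate: software is not production ready until every item on the list is met. A minimal sketch of that idea follows. The item names come from slide 17 (which ends in "etc.", so the real list is longer); the `OperabilityReview` class, its methods, and the service name are hypothetical and are not taken from the talk or from any Indix tooling.

```python
from dataclasses import dataclass, field

# Items named on slide 17. The slide ends with "etc.", so this list is
# illustrative, not exhaustive.
REVIEW_ITEMS = [
    "Benchmarking / Load Testing Results",
    "Data store (& its setup)",
    "Security",
    "Scaling Policy",
    "Deployment",
    "Backup / Recovery",
    "Monitoring / Alerts",
    "Cost",
]


@dataclass
class OperabilityReview:
    """Hypothetical tracker: one review per service, signed off item by item."""

    service: str
    signed_off: set = field(default_factory=set)

    def sign_off(self, item: str) -> None:
        # Reject items that are not part of the agreed checklist.
        if item not in REVIEW_ITEMS:
            raise ValueError(f"unknown review item: {item}")
        self.signed_off.add(item)

    def missing(self) -> list:
        # Items still open, in checklist order.
        return [item for item in REVIEW_ITEMS if item not in self.signed_off]

    def production_ready(self) -> bool:
        # Mirrors the rule on slide 16: all requirements must be met.
        return not self.missing()


if __name__ == "__main__":
    review = OperabilityReview(service="example-service")  # hypothetical name
    review.sign_off("Security")
    review.sign_off("Deployment")
    print("production ready:", review.production_ready())
    print("still open:", review.missing())
```

A check like this could run before the first prod push and block it while `missing()` is non-empty, which is one way to act on the "all the requirements" rule instead of leaving it to convention.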