Lessons scaling operations to everyone @indix

Rootconf → Miniconf (Chennai), 2017

About Me
Ashwanth Kumar
Principal Engineer, Indix
@_ashwanthkumar

1. Early Stage
2. Growth Stage
3. Later Stage*

Early Stage
Goal - Working Infrastructure
→ 5 - 15 Developers
→ 1 - 2 Member Ops Team
Responsibilities
→ Write Deployment Scripts for various systems (internal and open-source)
→ Centralized Control & Responsibility of Infrastructure on AWS

Early Stage Lessons
→ Operations team couldn’t really contribute to our system design / architecture
  ◦ Always overloaded with ad-hoc requests
→ On-call support for our existing production systems without much context
→ Developers wanted to try lots of new things in the fast-growing Big Data landscape, but ops couldn’t handle all these requests
  ◦ Ops started working with Devs so they could run these experiments on their own
  ◦ Devs had a lot of say in the operational setup, scripts, etc.

Growth Stage
Goal - Decentralised access to infrastructure
→ 15 - 30 Engineers
→ 2 - 3 Ops Engineers
Responsibilities
→ Educate developers on their infrastructure
→ Work on the overall process (or framework) for operations

Growth Stage
“If the development team is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.”
From: On Designing and Deploying Internet-Scale Services ~ James Hamilton - LISA ’07

Growth Stage Lessons
→ (+) Some developers loved to contribute to operations - oss.indix.com
→ (+) Individual teams took over their infra & on-call, resulting in faster & better systems
→ (-) With decentralised operations, cost control is very hard, but super important
→ (-) Backup is important if we have to provide CRUD access to everybody
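
One way to keep cost visible under decentralised operations is to attribute every resource to a team and compare spend against a budget. The following is a minimal sketch, assuming each resource carries a hypothetical team tag and a monthly cost figure; the record layout, team names and budget numbers are illustrative only and do not describe Indix's actual tooling.

```python
from collections import defaultdict

# Hypothetical cost records: (resource id, team tag, monthly cost in USD).
# An untagged resource has team=None, which is itself worth flagging.
RECORDS = [
    ("i-001", "crawler", 1200.0),
    ("i-002", "crawler", 800.0),
    ("i-003", "search", 500.0),
    ("i-004", None, 300.0),
]

# Hypothetical monthly budgets per team, in USD.
BUDGETS = {"crawler": 1500.0, "search": 1000.0}


def spend_by_team(records):
    """Sum monthly cost per team; untagged resources go under 'untagged'."""
    totals = defaultdict(float)
    for _resource_id, team, cost in records:
        totals[team or "untagged"] += cost
    return dict(totals)


def over_budget(totals, budgets):
    """Return teams whose spend exceeds their budget (no budget means 0)."""
    return sorted(t for t, spent in totals.items() if spent > budgets.get(t, 0.0))


totals = spend_by_team(RECORDS)
print(totals)
print(over_budget(totals, BUDGETS))
```

Treating a missing budget as zero means untagged spend always shows up in the report, so resources nobody owns cannot hide.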

Later Stage
Goal - Self Serve Infrastructure
→ 30 - 50 (approx.) Engineers
→ 2 - 3 Ops Engineers
Responsibilities
→ Become enablers (via process / automation tools) for engineers to deliver e2e
→ Influence the design & architecture of all systems with focus on cost, security & HA

Later Stage Lessons
→ (+) Using Resource schedulers helped provide a unified view of all underlying resources
→ (+) Operations is a first-class skill for Devs and “Development” is for Ops
→ (+) Operability Review before the first prod push helped reduce lots of surprises
→ (-) De-centralised infra access led to a lot of fragmentation in the deployment stack

TechRadar to address Fragmentation

http://oss.indix.com/indix-radar/DevOps/

This view gives new and existing members a picture of the tools we’ve tried in the past and decided to use or not, and the things we’ve decided to try.
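
The radar view described above can be thought of as a list of tools grouped by the decision taken on each. The following is a rough sketch under stated assumptions: the tool names are placeholders, and the three decision labels (`use`, `do-not-use`, `try`) are a hypothetical encoding of the wording on the slide, not the categories the published radar necessarily uses.

```python
from collections import defaultdict

# Hypothetical radar entries: (tool name, decision). The three decisions
# mirror the slide: tried and kept, tried and dropped, or queued to try.
ENTRIES = [
    ("tool-a", "use"),
    ("tool-b", "do-not-use"),
    ("tool-c", "try"),
    ("tool-d", "use"),
]


def group_by_decision(entries):
    """Group tool names under the decision recorded for them."""
    radar = defaultdict(list)
    for tool, decision in entries:
        radar[decision].append(tool)
    return dict(radar)


radar = group_by_decision(ENTRIES)
for decision in ("use", "try", "do-not-use"):
    print(decision, radar.get(decision, []))
```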

Operability Review

“Unless we meet all the requirements mentioned below, the software won't be signed off from operability perspective and would not qualify as production ready.”

“… we should be able to identify possible issues before running our code in production. In order to bring stability to any production system, an Operability Review tries to identify such areas and take pro-active measure to minimize the overhead in a live system.”

Items from the Operability Review
→ Benchmarking / Load Testing Results
→ Data store (& its setup)
→ Security
→ Scaling Policy
→ Deployment
→ Backup / Recovery
→ Monitoring / Alerts
→ Cost, etc.
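
The sign-off rule quoted earlier ("unless we meet all the requirements ... would not qualify as production ready") amounts to an all-or-nothing check over the items listed above. The following is a minimal sketch, assuming a review is recorded as a simple mapping from item to a met / not met flag; the function names and the example review are hypothetical and not Indix's actual review tooling.

```python
# Review items taken from the slide; the slide's trailing "etc." is omitted.
REVIEW_ITEMS = [
    "Benchmarking / Load Testing Results",
    "Data store (& its setup)",
    "Security",
    "Scaling Policy",
    "Deployment",
    "Backup / Recovery",
    "Monitoring / Alerts",
    "Cost",
]


def missing_items(review):
    """Return the review items that are absent or not marked as met."""
    return [item for item in REVIEW_ITEMS if not review.get(item, False)]


def is_production_ready(review):
    """Sign off only when every item on the list is met."""
    return not missing_items(review)


# Hypothetical review for an example service: everything met except backups.
review = {item: True for item in REVIEW_ITEMS}
review["Backup / Recovery"] = False

print(is_production_ready(review))
print(missing_items(review))
```

Returning the list of unmet items, rather than only a yes/no answer, tells the owning team exactly what is left before the first production push.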

Credits
Swathi Ravichandran @swathrav

Thank you
@_ashwanthkumar
