Lessons scaling operations to everyone @indix

Slide 1

Slide 1 text

Lessons scaling operations to everyone @indix Rootconf → Miniconf (Chennai) 2017

Slide 2

Slide 2 text

About Me
Ashwanth Kumar
Principal Engineer, Indix
@_ashwanthkumar

Slide 3

Slide 3 text

1 Early Stage
2 Growth Stage
3 Later Stage*

Slide 4

Slide 4 text

1 Early Stage
Goal - Working Infrastructure
→ 5 - 15 Developers
→ 1 - 2 Member Ops Team
Responsibilities
→ Write Deployment Scripts for various systems (internal and open-source)
→ Centralized Control & Responsibility of Infrastructure on AWS

Slide 5

Slide 5 text

1 Early Stage
Lessons
→ Operations team couldn’t really contribute to our system design / architecture
  ○ Always overloaded with ad-hoc requests
→ Ops provided on-call support for our existing production systems without much context

Slide 6

Slide 6 text

1 Early Stage
Lessons
→ Developers wanted to try lots of new things in the fast-growing Big Data landscape, but ops couldn’t handle all these requests
  ○ Ops started working with Devs so they could take on these experiments on their own
  ○ Devs had a lot of say in the operational setup, scripts, etc.

Slide 7

Slide 7 text

2 Growth Stage
Goal - Decentralised access to infrastructure
→ 15 - 30 Engineers
→ 2 - 3 Ops Engineers
Responsibilities
→ Educate developers on their infrastructure
→ Work on the overall process (or framework) for operations

Slide 8

Slide 8 text

2 Growth Stage
“If the development team is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.”
From: On Designing and Deploying Internet-Scale Services ~ James Hamilton - LISA ’07

Slide 9

Slide 9 text

2 Growth Stage
Lessons
→ (+) Some developers loved contributing to operations - oss.indix.com
→ (+) Individual teams took over their infra & on-call, resulting in faster & better systems
→ (-) With decentralised operations, cost control is very hard, but super important
→ (-) Backups are important if we have to provide CRUD access to everybody

Slide 10

Slide 10 text

3 Later Stage
Goal - Self Serve Infrastructure
→ 30 - 50 (approx.) Engineers
→ 2 - 3 Ops Engineers
Responsibilities
→ Become enablers (via process / automation tools) for engineers to deliver e2e
→ Influence the design & architecture of all systems with focus on cost, security & HA

Slide 11

Slide 11 text

3 Later Stage
Lessons
→ (+) Using resource schedulers helped provide a unified view of all underlying resources
→ (+) Operations is a first-class skill for Devs, and “Development” is one for Ops
→ (+) Operability Review before the first prod push helped reduce lots of surprises
→ (-) De-centralised infra access led to a lot of fragmentation in the deployment stack

Slide 12

Slide 12 text

TechRadar to address Fragmentation

Slide 13

Slide 13 text

http://oss.indix.com/indix-radar/DevOps/

Slide 14

Slide 14 text

This view helps new and existing members see which tools we’ve tried in the past, which of those we decided to use or not, and which we have decided to try next.
http://oss.indix.com/indix-radar/DevOps/
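As a rough illustration of the kind of view described above, a radar can be modelled as a list of entries grouped by ring. This is only a sketch: the ring names ("adopt", "trial", "assess", "hold") follow the common tech-radar convention and are an assumption, not something the slides state, and the `Blip` type, `group_by_ring` function and all tool names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumption: ring names follow the common tech-radar convention.
# The rings Indix actually used are not stated in the slides.
RINGS = ("adopt", "trial", "assess", "hold")


@dataclass(frozen=True)
class Blip:
    name: str      # tool or technique (hypothetical names below)
    quadrant: str  # e.g. "DevOps"
    ring: str      # one of RINGS


def group_by_ring(blips, quadrant):
    """Return {ring: sorted tool names} for one quadrant, in RINGS order."""
    grouped = defaultdict(list)
    for blip in blips:
        if blip.ring not in RINGS:
            raise ValueError(f"unknown ring: {blip.ring}")
        if blip.quadrant == quadrant:
            grouped[blip.ring].append(blip.name)
    # Skip empty rings so the view only lists decisions that were made.
    return {ring: sorted(grouped[ring]) for ring in RINGS if grouped[ring]}


# Hypothetical entries, for illustration only.
radar = [
    Blip("tool-a", "DevOps", "adopt"),
    Blip("tool-b", "DevOps", "hold"),
    Blip("tool-c", "DevOps", "trial"),
    Blip("lib-x", "Languages", "adopt"),
]

if __name__ == "__main__":
    for ring, names in group_by_ring(radar, "DevOps").items():
        print(f"{ring}: {', '.join(names)}")
```

Keeping "hold" entries in the data, rather than deleting them, is what lets new members see what was tried and rejected.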

Slide 15

Slide 15 text

Operability Review

Slide 16

Slide 16 text

“Unless we meet all the requirements mentioned below, the software won't be signed off from operability perspective and would not qualify as production ready.”

“ … we should be able to identify possible issues before running our code in production. In order to bring stability to any production system, an Operability Review tries to identify such areas and take pro-active measure to minimize the overhead in a live system.”

Slide 17

Slide 17 text

Items from the Operability Review
→ Benchmarking / Load Testing Results
→ Data store (& its setup)
→ Security
→ Scaling Policy
→ Deployment
→ Backup / Recovery
→ Monitoring / Alerts
→ Cost, etc.
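The "all requirements must be met before sign-off" rule could be sketched as a simple checklist gate. The item names below come from the slide (which ends in "etc.", so the real list is longer); the function names and the idea of recording a free-text answer per item are assumptions made for illustration, not a description of the actual Indix review process.

```python
# Items named on the slide; the slide's "etc." means this list is not exhaustive.
REVIEW_ITEMS = (
    "Benchmarking / Load Testing Results",
    "Data store (& its setup)",
    "Security",
    "Scaling Policy",
    "Deployment",
    "Backup / Recovery",
    "Monitoring / Alerts",
    "Cost",
)


def missing_items(answers):
    """Return checklist items that are absent or have a blank answer.

    `answers` maps an item name to a free-text note (hypothetical format).
    """
    return [item for item in REVIEW_ITEMS if not answers.get(item, "").strip()]


def is_production_ready(answers):
    """Sign off only when every checklist item has been addressed."""
    return not missing_items(answers)


if __name__ == "__main__":
    # Hypothetical, partially completed review.
    draft = {
        "Security": "IAM roles scoped per service",
        "Deployment": "scripted, repeatable",
    }
    print("ready:", is_production_ready(draft))
    for item in missing_items(draft):
        print("missing:", item)
```

Returning the list of missing items, rather than just a boolean, gives the team a concrete to-do list before the first production push.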

Slide 18

Slide 18 text

Credits
Swathi Ravichandran @swathrav

Thank you
@_ashwanthkumar