"How we scaled Instagram", by Mike Krieger

Opbeat
September 26, 2014

"How we scaled Instagram", by Mike Krieger

Video here: https://opbeat.com/blog/posts/how-we-scaled-instagram-mike-krieger/

Mike Krieger, co-founder of Instagram, talks about their tech setup and how ops and on-call evolved as they scaled Instagram from one random server in LA, to AWS, and ultimately to Facebook’s own infrastructure.

Transcript

  1. Why this talk? • Glimpse into a process that’s not usually made public • Lessons learned along the way • Why did it work at the time? What wasn’t working?
  2. Early Days • Ops experience = almost none • Running on a single server in LA • fabric tasks for everything (see the fabric sketch after the transcript)
  3. Early Days • Worked because it was the simple thing first • Ops was all in Python • What wasn’t working: we had no DR story at all
  4. Early Days • Launched & everything was on fire! • All for one, one for all (for all 2 of us) • Japan, 3 am
  5. Early Days • At least we had monitoring • Commitment to hop on and fix things, sense of urgency • P(Having Monitoring) at an early stage is heavily correlated with how easy it is to get going (e.g. Munin vs Nagios)
  6. Early Days • Munin; no way of silencing and having pre-planned downtime = the worst (at the movies? welcome to an interruption every 5 mins) • Relationships + sleep are suffering
  7. Scaling up • Both awake, but primarily me fixing • Don’t underestimate solidarity • Chicken Coop Ops
  8. Scaling up • Hired our first dev (iOS + infra) • Took my first trip abroad (hello Runbook, hello ParisOps) • Weddings
  9. Scaling up • RIP Pingability • We need phone calls, so… Pingdom + PagerDuty • One person on rotation, “Team. Ops Team”
  10. Scaling up • Having early cross-stack employees means knowledge + ownership over the systems likeliest to cause pages (no dedicated ops person) • Us-against-the-world led to a strong focus on accountability
  11. Scaling up • Texting & Skype for communication • Very easy to know what had rolled out when; rare “who broke it?” issues • Good test coverage, manual rollouts • Impending burnout all around
  12. Starting a team • Hired two infra engineers, finally • Engineering blog on our architecture • No rotation yet • Also, everything was on fire. All of the time.
  13. Starting a team • Perpetual Beluga Ops Thread (RIP) • Most problematic on weekends, which are our peak • Worst issues are the ones you’ve never fully fixed (for us: queuing at the HAProxy layer)
  14. Starting a team • Replaced Munin with Sensu & Ganglia; so much better (see the check-script sketch after the transcript) • Still using PagerDuty, mostly so everyone would get notified
  15. Starting a team • Everyone knew the expectations coming in • Launching Android meant 2x the user base in 6 months • Some relief because of shared responsibility
  16. Starting a team • No time for forward-facing ops & infra investments; slowly changing near the end of this period • Philosophy: can we get to the next weekend?
  17. Starting a team • Playing chicken with AWS instance upgrades • Hidden frustration: when everyone’s responsible, no one is. When no one’s responsible, everyone is.
  18. Post-FB • Arrived at FB with 2 infra engineers; tripled in 3 months • Nick, our second infra hire, kicks off the on-call process • Primary/Secondary/Tertiary
  19. Post-FB • Primary is expected to handle everything, secondary is the safety net • One big Messenger thread + IRC • A few longer outages due to not escalating, but overall pretty solid
  20. Post-FB • Challenge: operating an “all-inclusive” ops team inside FB, which has more tech-specific on-calls • Solution: continue to use PagerDuty, but with integration into FB’s on-call systems
  21. Post-FB • Challenge: gone from full-stack engineers to more specialized roles, which leads to dropped hand-offs at the client/server border • Solution: still a WIP. Client teams have on-call rotations now, too.
  22. Post-FB • Challenge: things are more stable (yay). But… primary is no longer very educational • New folks on the team aren’t experienced enough to jump in • Risk of burnout for those who have been on the team for a while
  23. Post-FB • v2: L1/L2/L3 • L1 primarily triages and responds when they can, but escalation is expected and encouraged • New members of the team can become L1 very quickly
  24. Post-FB • Hot off the presses :) • Exit surveys to hear about what’s going well • Main issue: L1s escalate to a broad group rather than to L2s
  25. Looking Forward • Which tech to offload to FB systems, and how to make sure we’re aligned?
  26. Q&A