"How we scaled Instagram", by Mike Krieger

Opbeat
September 26, 2014

"How we scaled Instagram", by Mike Krieger

Video here: https://opbeat.com/blog/posts/how-we-scaled-instagram-mike-krieger/

Mike Krieger, co-founder of Instagram, talks about their tech setup and how ops and on-call evolved as they scaled Instagram from one random server in LA, to AWS, and ultimately to Facebook’s own infrastructure.

Transcript

  1. Why this talk? • Glimpse into a process that’s not usually made public • Lessons learned along the way • Why did it work at the time? What wasn’t working?
  2. Early Days • Ops experience = almost none • Running on a single server in LA • fabric tasks for everything (see the fabric sketch after the transcript)
  3. Early Days • Worked because it was the simple thing first • Ops was all in Python • What wasn’t working: we had no DR story at all
  4. Early Days • Launched & everything was on fire! • All for one, one for all (for all 2 of us) • Japan, 3 am
  5. Early Days • At least we had monitoring • Commitment to hop on and fix things, sense of urgency • P(Having Monitoring) at an early stage is heavily correlated with how easy it is to get going (e.g. Munin vs Nagios)
  6. Early Days • Munin; no way of silencing and having pre-planned downtime = the worst (at the movies? welcome to an interruption every 5 mins) • Relationships + sleep are suffering
  7. Scaling up • Both awake, but primarily me fixing • Don’t underestimate solidarity • Chicken Coop Ops
  8. Scaling up • Hired our first dev (iOS + infra) • Took my first trip abroad (hello Runbook, hello ParisOps) • Weddings
  9. Scaling up • RIP Pingability • We need phone calls, so… Pingdom + PagerDuty • One person on rotation, “Team. Ops Team”
  10. Scaling up • Having early cross-stack employees means knowledge + ownership over the systems likeliest to cause pages (no dedicated ops person) • Us-against-the-world led to a strong focus on accountability
  11. Scaling up • Texting & Skype for communication • Very easy to know what had rolled out when; rare “who broke it?” issues • Good test coverage, manual rollouts • Impending burnout all around
  12. Starting a team • Hired two infra engineers, finally • Engineering blog on our architecture • No rotation yet • Also, everything was on fire. All of the time.
  13. Starting a team • Perpetual Beluga Ops Thread (RIP) • Most problematic on weekends, which are our peak • Worst issues are the ones you’ve never fully fixed (for us: queuing at the HAProxy layer)
  14. Starting a team • Replaced Munin with Sensu & Ganglia; so much better (see the check-script sketch after the transcript) • Still using PagerDuty, mostly so everyone would get notified
  15. Starting a team • Everyone knew the expectations coming in • Launching Android meant 2x the user base in 6 months • Some relief because of shared responsibility
  16. Starting a team • No time for forward-facing ops & infra investments; slowly changing near the end of this period • Philosophy: can we get to the next weekend?
  17. Starting a team • Playing chicken with AWS instance upgrades • Hidden frustration: when everyone’s responsible, no one is. When no one’s responsible, everyone is.
  18. Post-FB • Arrived at FB with 2 infra engineers; tripled in 3 months • Nick, our second infra hire, kicks off the on-call process • Primary/Secondary/Tertiary
  19. Post-FB • Primary is expected to handle everything, secondary is the safety net • One big Messenger thread + IRC • A few longer outages due to not escalating, but overall pretty solid
  20. Post-FB • Challenge: operating an “all-inclusive” ops team inside FB, which has more tech-specific on-calls • Solution: continue to use PagerDuty, but with integration into FB’s on-call systems
  21. Post-FB • Challenge: gone from full-stack engineers to more specialized roles, which leads to dropped hand-offs at the client/server border • Solution: still a WIP. Client teams have on-call rotations now, too.
  22. Post-FB • Challenge: things are more stable (yay). But… primary is no longer very educational • New folks on the team aren’t experienced enough to jump in • Risk of burnout for those who have been on the team for a while
  23. Post-FB • v2: L1/L2/L3 • L1 primarily triages and responds when they can, but escalation is expected and encouraged • New members of the team can become L1 very quickly
  24. Post-FB • Hot off the presses :) • Exit surveys to hear about what’s going well • Main issue: L1s escalate to a broad group rather than to L2s
  25. Looking Forward • Which tech to offload to FB systems, and how to make sure we’re aligned?
  26. Q&A