"How we scaled Instagram", by Mike Krieger

Opbeat
September 26, 2014

"How we scaled Instagram", by Mike Krieger

Video here: https://opbeat.com/blog/posts/how-we-scaled-instagram-mike-krieger/

Mike Krieger, co-founder of Instagram, talks about their tech setup and how ops and on-call duty evolved as they scaled Instagram from one random server in LA, to AWS, and ultimately to Facebook’s own infrastructure.


Transcript

  1. Why this talk? • Glimpse into process that’s not usually made public • Lessons learned along the way • Why did it work at the time? What wasn’t working?
  2. Early Days • Ops experience = almost none • Running on a single server in LA • Fabric tasks for everything (see the sketch below)
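
A minimal sketch of what “Fabric tasks for everything” might have looked like, assuming era-appropriate Fabric 1.x; the host, path, and service names here are illustrative placeholders, not Instagram’s actual fabfile:

```python
# Hypothetical Fabric 1.x fabfile.py -- every ops action as a small task.
from fabric.api import cd, env, run, sudo, task

env.hosts = ['app1.example.com']  # the single server in LA, hypothetically

@task
def deploy():
    """Pull the latest code and bounce the app server."""
    with cd('/srv/app'):
        run('git pull')
    sudo('service app restart')

@task
def tail_logs():
    """Watch the app log remotely."""
    run('tail -f /var/log/app.log')
```

Tasks like these run as e.g. `fab deploy`, which fits the “Ops was all in Python” point on the next slide.
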
  3. Early Days • Worked because it was the simple thing first • Ops was all in Python • What wasn’t working: we had no DR story at all
  4. Early Days • Launched & everything was on fire! • All for one, one for all (for all 2 of us) • Japan, 3 am
  5. Early Days • At least we had monitoring • Commitment to hop on and fix things, sense of urgency • P(Having Monitoring) at early stage is heavily correlated with how easy it is to get going (e.g. Munin vs Nagios)
  6. Early Days • Munin; no way of silencing and having pre-planned downtime = the worst (at the movies? welcome to an interruption every 5 mins) • Relationships + sleep are suffering
  7. Scaling up • Both awake, but primarily me fixing • Don’t underestimate solidarity • Chicken Coop Ops
  8. Scaling up • Hired our first dev (iOS+Infra) • Took my first trip abroad (hello Runbook, hello ParisOps) • Weddings
  9. Scaling up • RIP Pingability • We need phone calls, so… Pingdom + PagerDuty (sketch below) • One person on rotation, “Team. Ops Team”
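
An illustrative sketch of the Pingdom + PagerDuty pattern, assuming PagerDuty’s generic events API of that era: an external health check fails, an incident is triggered, and PagerDuty phones whoever is on rotation. The service key, URL, and message are placeholders, not anything from the talk:

```python
# Hypothetical uptime check that pages the on-call person via PagerDuty.
import requests

PAGERDUTY_URL = 'https://events.pagerduty.com/generic/2010-04-15/create_event.json'
SERVICE_KEY = 'YOUR_PAGERDUTY_SERVICE_KEY'  # placeholder integration key

def page_on_call(description):
    """Trigger a PagerDuty incident; PagerDuty handles the phone call."""
    requests.post(PAGERDUTY_URL, json={
        'service_key': SERVICE_KEY,
        'event_type': 'trigger',
        'description': description,
    })

try:
    healthy = requests.get('https://example.com/health', timeout=5).ok
except requests.RequestException:
    healthy = False

if not healthy:
    page_on_call('site health check failed')
```
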
  10. Scaling up • Having early cross-stack employees means knowledge + ownership over systems likeliest to cause pages (no dedicated Ops person) • Us-against-world led to strong focus on accountability
  11. Scaling up • Texting & Skype for communication • Very easy to know what had rolled out when; rare “who broke it?” issue • Good test coverage, manual rollouts • Impending burnout all around
  12. Starting a team • Hired two infra engineers, finally • Engineering blog on our architecture • No rotation yet • Also, everything was on fire. All of the time.
  13. Starting a team • Perpetual Beluga Ops Thread (RIP) • Most problematic on weekends, which are our peak • Worst issues are those you’ve never fully fixed (for us: queuing at the HAProxy layer; see the sketch below)
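
As an illustration (not from the talk), queuing at the HAProxy layer can be observed by reading the qcur column from HAProxy’s stats socket; this sketch assumes the stats socket is enabled in haproxy.cfg, and the socket path and alert threshold are made up:

```python
# Read per-backend queue depth (qcur) from HAProxy's "show stat" CSV.
import socket

def haproxy_queue_depths(sock_path='/var/run/haproxy.sock'):
    """Return {(proxy, server): queued_requests} from the stats socket."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    s.sendall(b'show stat\n')
    chunks = []
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    s.close()
    depths = {}
    for line in b''.join(chunks).decode().splitlines():
        if not line or line.startswith('#'):
            continue  # skip the CSV header
        cols = line.split(',')
        depths[(cols[0], cols[1])] = int(cols[2] or 0)  # qcur is column 3
    return depths

for (proxy, server), qcur in haproxy_queue_depths().items():
    if qcur > 50:  # made-up threshold
        print('requests queuing at %s/%s: %d' % (proxy, server, qcur))
```
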
  14. Starting a team • Replaced Munin with Sensu & Ganglia; so much better (check sketch below) • Still using PagerDuty, mostly so everyone would get notified
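
Sensu checks follow the Nagios plugin convention (exit 0 for OK, 1 for warning, 2 for critical, with a one-line message on stdout). A minimal illustrative check, with made-up load thresholds:

```python
#!/usr/bin/env python
# Minimal Sensu-style check: load average, Nagios exit-code convention.
import os
import sys

load1 = os.getloadavg()[0]  # 1-minute load average (Unix only)

if load1 > 8.0:  # made-up critical threshold
    print('CheckLoad CRITICAL: 1m load is %.2f' % load1)
    sys.exit(2)
if load1 > 4.0:  # made-up warning threshold
    print('CheckLoad WARNING: 1m load is %.2f' % load1)
    sys.exit(1)
print('CheckLoad OK: 1m load is %.2f' % load1)
sys.exit(0)
```
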
  15. Starting a team • Everyone knew expectations coming in • Launching Android meant 2x user base in 6 months • Some relief because of shared responsibility
  16. Starting a team • No time for forward-facing ops & infra investments; slowly changing near end of this period • Philosophy: can we get to the next weekend?
  17. Starting a team • Playing chicken with AWS instance upgrades • Hidden frustration: when everyone’s responsible, no one is. When no one’s responsible, everyone is.
  18. Post-FB • Arrived at FB with 2 infra; tripled in 3 months • Nick, our second infra hire, kicks off on-call process • Primary/Secondary/Tertiary
  19. Post-FB • Primary expected to handle everything, secondary is safety net • One big Messenger thread + IRC • A few longer outages due to not escalating, but overall pretty solid
  20. Post-FB • Challenge: operate “all inclusive” ops team in FB, which has more tech-specific on-calls • Solution: continue to use PagerDuty, but with integration into FB’s on-call systems
  21. Post-FB • Challenge: gone from full-stack engineers to more specialized roles; leads to dropped hand-offs at client/server border • Solution: still WIP. Client teams have on-call rotations now, too.
  22. Post-FB • Challenge: things are more stable (yay). But… primary no longer very educational • New folks on team not experienced enough to jump in • Risk of burnout for those on team for a while
  23. Post-FB • v2: L1/L2/L3 • L1 is primarily triage and responds when they can, but escalation is expected and encouraged • New members of team can become L1 very quickly
  24. Post-FB • Hot off the presses :) • Exit surveys to hear about what’s going well • Main issue: L1s escalate to a broad group rather than L2s
  25. Looking Forward • Which tech to offload to FB systems, and how to make sure we’re aligned?
  26. Q&A