Scaling Instagram at AirBnB OpenAir 2015

Mike Krieger

June 04, 2015

Transcript

  1. Mike Krieger
    INSTAGRAM
    SCALING INSTAGRAM
    AIRBNB OPEN AIR SUMMIT 2015

  2. co-founder, technical lead

  3. São Paulo, Brazil
    photo: Diego Torres Silvestre

  4. Stanford SymSys
    photo: Waqas Mustafeez

  5. LAST TIME I GAVE A TALK AT AIRBNB…

  6. it was April 2012

  7. we were 2 years old

  8. 2 product guys with no backend experience

  9. @goldenretrieverbailey

  10. we had been acquired the week before

  11. I had not slept much

  12. we had an engineering team of 4 people

  13. we had about 30 million monthly actives

  14. Taylor Swift CMA

  15. we're 5 years old

  16. sleeping (slightly) more

  17. hired better coders than me

  18. we have an eng team of 95 people

  19. we have over 300 million monthly actives

  20. Taylor Swift Grammy

  21. how is Instagram infra different in 2015?

  22. what guides our evolution?

  23. how we adapted to infra, team, and product changes

  24. ORIGINAL PHILOSOPHY

  25. do the simple thing first

  26. aka Use Boring Technology

  27. boring means operationally quiet, too

  28. nginx &
    redis &
    memcached &
    postgres &
    gearman &
    django

  29. 2015 EDITION

  30. nginx &
    redis &
    memcached &
    postgres &
    gearman &
    django

  31. nginx &
    cassandra &
    memcached &
    postgres &
    rabbitmq &
    django

  32. unicorn &
    proxygen &
    scribe &
    thrift
    nginx &
    cassandra &
    memcached &
    postgres &
    rabbitmq &
    django

  33. (1) do the simple thing first
    (2) until your {scale, team, product} changes

  34. (1) do the simple thing first
    (2) until your {scale, team, product} changes

  35. scaling = replacing all components of a car while driving at 100mph

  36. which components to replace & when

  37. Async Tasks (site scale)
    Code Deployment (team scale)
    Search (product scale)

  38. requests should take < 3s

  39. fan-out delivery to all your followers' feeds

  40. especially popular users

  41. post to external services (e.g. FB & Twitter)

  42. v1: Gearman

  43. async task broker

  44. 1 gearman broker
    4 app servers
    1 async worker box

  45. dead simple to set up

  46. memcached-like in simplicity

  47. got us through 1.5 years of growth

  48. photo: MAMJODH

  49. messy to add/deploy new workers

  50. single core, 60ms mean submission time

  51. 1s+ enqueue time under load

  52. 8 gearman brokers
    400 app servers
    12,000+ threads
    32 async worker boxes

  53. v2: “sharded” gearman

  54. BROKERS[node_index % len(BROKERS)]
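
A minimal sketch of the static sharding expression on slide 54, assuming each app server knows its own integer `node_index`. The broker addresses and the helpers `broker_for` and `assignments` are hypothetical; only the indexing expression comes from the talk.

```python
# Hypothetical broker addresses; the talk shows only the indexing expression.
BROKERS = ["gearman-1:4730", "gearman-2:4730", "gearman-3:4730"]

def broker_for(node_index):
    """Statically assign an app server to a broker by its index."""
    return BROKERS[node_index % len(BROKERS)]

def assignments(num_nodes):
    """Count how many app servers land on each broker."""
    counts = {broker: 0 for broker in BROKERS}
    for node_index in range(num_nodes):
        counts[broker_for(node_index)] += 1
    return counts
```

The mapping is fixed, so a broker that goes down strands every app server assigned to it, which is the "no graceful failover" problem raised on the next slide.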

  55. no graceful failover

  56. # of app servers growing quickly

  57. persistence was more dangerous than not persisting

  58. simple thing was waking us up & becoming operational burden

  59. operating at new scale

  60. time to move on

  61. please thank all your soon to be decommissioned infra pieces

  62. basically didn't think about Gearman until we had to

  63. “do the simple thing next”

  64. roll your own

  65. rewrite gearman

  66. v3: celery and rabbitmq

  67. celery
    for much simpler worker code

  68. rabbitmq
    low(ish) maintenance

  69. any dev can add async task with one @task decorator

  70. kick off with function.delay()

  71. replication + failover + persistence

  72. 5ms mean
    10ms P90

  73. opportunity to gain both operational & dev efficiency

  74. more details: http://bit.ly/igcelery

  75. the art of getting code to prod

  76. v1: fab and git pull

  77. fabric: Python remote scripting

  78. > fab djangos update_git
    > fab djangos restart_django

  79. great for 2 engineers

  80. past 12 machines = pain

  81. v2: fab parallel mode to the rescue

  82. > fab -z20 djangos update_git
    > fab -z20 djangos restart_django

  83. worked up to 70 machines

  84. the year of the GitHub DDOSs

  85. swear it wasn't us deploying

  86. v3: fab rollout

  87. > fab -z20 djangos rollout:server
    ...doing fresh git fetch
    ...zipping up origin/master
    ...uploading to S3
    ...pulling down zip
    ...unpacking zip
    ...mapping 'current' symlink
    ...restarting Django
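
The "unpacking zip" and "mapping 'current' symlink" steps can be sketched as below. This is an assumption about how such a step might look on a POSIX host, not the actual rollout script; the S3 transfer and the Django restart are left out, and `unpack_release` and `point_current_at` are hypothetical names.

```python
import os
import zipfile

def unpack_release(zip_path, releases_dir, name):
    """Unpack a release archive into its own directory and return that path."""
    target = os.path.join(releases_dir, name)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target

def point_current_at(release_dir, current_link):
    """Repoint the 'current' symlink by building a temp link and renaming over it."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # Renaming over the old link is atomic on POSIX, so readers see old or new, never neither.
    os.replace(tmp_link, current_link)
```

Keeping each release in its own directory means a rollback is just another symlink swap.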

  88. lasted us another 1.5 years

  89. IG infra 2 to 10 eng

  90. “hey, can I roll out?”
    “wait! I'm already rolling”

  91. v4: enter Sauron

  92. lasted us another 1.5 years

  93. v5: scaling institutional knowledge

  94. “did you remember to roll to a canary?”
    “don't roll to the workers with a -z of > 40!”
    “did you tail the error logs?”
    “did you catch that new tier we deployed?”

  95. > fab -z20 djangos rollout:server
    ...grabbing lock from Sauron
    ...doing fresh git fetch
    ...zipping up origin/master
    ...uploading to S3
    ...pulling down zip to canary 1
    ...unpacking zip on canary 1
    ...mapping 'current' symlink on canary 1
    ...restarting Django on canary 1

  96. ...tailing error logs on canary 1
    ...ok, 200 responses are even
    ...deploying to async worker 1
    ...measuring success rate on worker 1
    ...looks good, deploying widely
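
The control flow in the rollout log above (take the lock, deploy to canaries, check health, then go wide) might be sketched like this. Sauron's real interface is not shown in the talk, so `DeployLock`, `rollout`, and the `deploy` and `healthy` callables are hypothetical, and the lock is a single-process toy.

```python
class DeployLock:
    """Toy stand-in for the rollout lock: at most one holder at a time."""

    def __init__(self):
        self.holder = None

    def acquire(self, who):
        if self.holder is not None:
            return False
        self.holder = who
        return True

    def release(self):
        self.holder = None

def rollout(lock, who, canaries, fleet, deploy, healthy):
    """Deploy to canaries first; continue to the fleet only if every canary looks healthy."""
    if not lock.acquire(who):
        return "blocked"  # someone else is already rolling
    try:
        for host in canaries:
            deploy(host)
            if not healthy(host):
                return "aborted"
        for host in fleet:
            deploy(host)
        return "deployed"
    finally:
        lock.release()
```

Encoding the checklist in code is what turns the spoken reminders on slide 94 into steps nobody can forget.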

  97. “hold on, aren't you basically doing continuous deployment, but not?”

  98. backend committers++

  99. human lock contention

  100. v6: continuous deployment

  101. extended Sauron with Jenkins integration

  102. take human procedure, automate

  103. deeply understood every step of our deploy

  104. has scaled to 50+ committers on backend codebase

  105. v1: minimize moving parts

  106. SELECT id FROM users WHERE
    full_name LIKE ...

  107. postgres & search, sittin' in a b-tree

  108. prefix-only, plz
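
A small sketch of prefix-only `LIKE` search over an indexed name column. It uses SQLite from the standard library as a stand-in for Postgres so it runs anywhere; the schema, the index name, and the helpers `make_db` and `prefix_search` are assumptions for illustration.

```python
import sqlite3

def make_db(names):
    """Build an in-memory users table with an index on full_name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("CREATE INDEX users_full_name ON users (full_name)")
    conn.executemany("INSERT INTO users (full_name) VALUES (?)", [(n,) for n in names])
    return conn

def prefix_search(conn, prefix):
    """Return names starting with prefix; the wildcard goes only at the end."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = "SELECT full_name FROM users WHERE full_name LIKE ? ESCAPE '\\' ORDER BY full_name"
    return [row[0] for row in conn.execute(sql, (escaped + "%",))]
```

A query for `just` finds names that begin with it but not `dustin`, which is the "prefix-only" limitation the slide jokes about.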

  109. haystack was pretty small

  110. ok, but Bieber

  111. CELEBRITY_OVERRIDES = {
        'taylor swift': 19151555,
        'taylorswift': 19151555,
        'justinbieber': 6860189,
        'justin bieber': 6860189
    }
    ACTUAL CODE :(
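
One way a hard-coded override table like this could be consulted, purely as a guess: the talk shows the dictionary but not the code around it. `search_user_ids` and the `prefix_search` callable are hypothetical; only the IDs come from the slide.

```python
# IDs copied from the slide; everything else here is illustrative.
CELEBRITY_OVERRIDES = {
    'taylor swift': 19151555,
    'taylorswift': 19151555,
    'justinbieber': 6860189,
    'justin bieber': 6860189,
}

def search_user_ids(query, prefix_search):
    """Pin the override account first (if any), then append the normal results."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    results = list(prefix_search(key))
    pinned = CELEBRITY_OVERRIDES.get(key)
    if pinned is not None:
        results = [pinned] + [uid for uid in results if uid != pinned]
    return results
```

Every new celebrity means another code change, which is the scaling problem the next slides point at.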

  112. ok, but Selena & Taylor & Harry & Zayn & ...

  113. aka product needs have evolved

  114. Lucene-based
    HTTP/JSON interface
    great indexing options

  115. curl -XPUT 'http://solr/update/json' -d '{
        "add": {
            "doc": {
                "username": "justinbieber",
                "followed_by": 12345678
            }
        }
    }'

  116. - CELEBRITY_OVERRIDES = {
    -   'taylor swift': 19151555,
    -   'taylorswift': 19151555,
    -   'justin bieber': 6860189
    - }

  117. <1 month to transfer over

  118. launch Android

  119. 4x the queries

  120. no SolrCloud yet

  121. index twice?
    partition by prefix?

  122. scale had changed

  123. v3: ElasticSearch

  124. curl -XPUT 'http://es:9200/users/user/6860189' -d '{
        "username": "justinbieber",
        "followed_by": 12345678
    }'

  125. also Lucene based
    easy query API
    out-of-box cluster support

  126. very simple to set up

  127. in a steady state, worked beautifully

  128. but (at least in 2013) had high operational overhead

  129. split brain

  130. AWS autodiscovery

  131. had to keep queries simple

  132. not enough engineers to fully staff search team

  133. meanwhile, instagration

  134. v4: Unicorn

  135. FB's graph search system

  136. core idea: use social edges as part of the search

  137. // people who I follow named Justin
    (and (term justin*)
         (term followedby:4))
    // people followed by the people I follow, named Justin
    (and (term justin*)
         (apply followedby:(term followedby:4)))
    // people named Justin, prioritizing the people I follow
    (weak-and (term followedby:4 :optional-hits 2)
              (term justin*))
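
The first two queries on slide 137 can be modeled with plain Python sets, treating each term as a posting list of user ids. This is a toy model and not Unicorn's API: `TinyIndex` and its methods are hypothetical, the sample users are made up, and the `weak-and` ranking query is left out because its scoring is not described in the talk.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: each term maps to a set of user ids (a posting list)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.names = {}

    def add_user(self, user_id, name):
        self.names[user_id] = name.lower()

    def add_follow(self, follower, followee):
        # The posting list for followedby:<follower> holds everyone they follow.
        self.postings[f"followedby:{follower}"].add(followee)

    def term(self, term):
        return set(self.postings.get(term, ()))

    def name_prefix(self, prefix):
        return {uid for uid, name in self.names.items() if name.startswith(prefix)}

    def apply(self, prefix, inner_ids):
        # Run the inner query first, then union the posting lists keyed by its results.
        out = set()
        for uid in inner_ids:
            out |= self.term(f"{prefix}{uid}")
        return out

idx = TinyIndex()
for uid, name in [(10, "Justin T"), (11, "Maria"), (20, "Justin B"), (21, "Harry")]:
    idx.add_user(uid, name)
idx.add_follow(4, 10)
idx.add_follow(4, 11)
idx.add_follow(11, 20)
idx.add_follow(10, 21)

# (and (term justin*) (term followedby:4))
i_follow = idx.name_prefix("justin") & idx.term("followedby:4")
# (and (term justin*) (apply followedby:(term followedby:4)))
friends_of_friends = idx.name_prefix("justin") & idx.apply("followedby:", idx.term("followedby:4"))
```

Here `and` is set intersection and `apply` is a two-hop lookup, which is how the social graph becomes part of the search.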

  138. double-digit % increase in search clicks per daily active

  139. bonus: new Explore photos

  140. v1: most liked, globally

  141. trying to be everything to everyone

  142. v2: photos liked by people I follow

  143. let's get social

  144. // photos I haven't liked, but the people I follow liked
    (difference
        (or likedby:friendA likedby:friendB …)
        likedby:4
    )

  145. who I follow ≠ (not always) who has my taste

  146. // photos I haven't liked yet, liked by people whose photos I already liked
    (difference
        (apply liker:
            (extract owner: liker:4))
        liker:4)
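
Both Explore queries reduce to set operations, sketched here on made-up data. The dictionaries `likes`, `owner`, and `follows` and both function names are hypothetical; the real system evaluates these as Unicorn queries over posting lists rather than in-memory dicts.

```python
# Made-up sample data: who liked which photos, who owns each photo, who follows whom.
likes = {4: {"p1", "p2"}, 7: {"p1", "p3"}, 8: {"p5"}, 9: {"p2", "p6"}}
owner = {"p1": 7, "p2": 9, "p3": 8, "p5": 7, "p6": 8}
follows = {4: {7, 8}}

def liked_by_people_i_follow(me):
    """(difference (or likedby:friendA likedby:friendB ...) likedby:me)"""
    pool = set().union(*(likes.get(friend, set()) for friend in follows.get(me, ())))
    return pool - likes.get(me, set())

def liked_by_owners_of_photos_i_liked(me):
    """(difference (apply liker: (extract owner: liker:me)) liker:me)"""
    mine = likes.get(me, set())
    owners = {owner[photo] for photo in mine}           # extract owner:
    pool = set().union(*(likes.get(o, set()) for o in owners))  # apply liker:
    return pool - mine
```

The second function never looks at the follow graph, which is why it can surface photos from accounts the viewer does not follow.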

  147. 6x increase in taps into photos on Explore

  148. http://bit.ly/fbunicorn

  149. (1) do the simple thing first
    (2) until your {scale, team, product} changes

  150. ground your evolution in problem-solving

  151. then do the next simplest thing

  152. get in touch:
    [email protected]