Scaling Instagram at AirBnB OpenAir 2015

Mike Krieger

June 04, 2015

Transcript

  2. Mike Krieger
    INSTAGRAM
    SCALING INSTAGRAM
    AIRBNB OPEN AIR SUMMIT 2015

  3. co-founder, technical lead

  4. São Paulo, Brazil
    photo: Diego Torres Silvestre

  5. Stanford SymSys
    photo: Waqas Mustafeez

  6. @mikeyk
    [email protected]

  7. LAST TIME I GAVE A TALK AT AIRBNB…

  8. it was April 2012

  9. we were 2 years old

  10. 2 product guys with no backend experience

  11. @goldenretrieverbailey

  12. we had been acquired the week before

  13. I had not slept much

  14. we had an engineering team of 4 people

  15. we had about 30 million monthly actives

  16. Taylor Swift CMA

  17. TODAY...

  18. we're 5 years old

  19. sleeping (slightly) more

  20. hired better coders than me

  21. we have an eng team of 95 people

  22. we have over 300 million monthly actives

  23. Taylor Swift Grammy

  24. THIS TALK

  25. how is Instagram infra different in 2015?

  26. what guides our evolution?

  27. how we adapted to infra, team, and product changes

  28. ORIGINAL PHILOSOPHY

  29. do the simple thing first

  30. aka YAGNI

  31. aka Use Boring Technology

  32. boring means operationally quiet, too

  33. nginx &
    redis &
    memcached &
    postgres &
    gearman &
    django

  34. 2015 EDITION

  35. nginx &
    redis &
    memcached &
    postgres &
    gearman &
    django

  36. nginx &
    cassandra &
    memcached &
    postgres &
    rabbitmq &
    django

  37. unicorn &
    proxygen &
    scribe &
    thrift &
    nginx &
    cassandra &
    memcached &
    postgres &
    rabbitmq &
    django

  38. 1) do the simple thing first
    2) until your {scale, team, product} changes

  39. 1) do the simple thing first
    2) until your {scale, team, product} changes

  40. scaling = replacing all components of a car while driving at 100mph

  41. which components to replace & when

  42. DEEPER DIVE

  43. Async Tasks (site scale)
    Code Deployment (team scale)
    Search (product scale)

  44. ASYNC TASKS

  45. requests should take < 3s

  46. fan-out delivery to all your followers' feeds

  47. especially popular users

  48. post to external services (eg FB & Twitter)

  49. v1: Gearman

  50. async task broker

  51. 1 gearman broker
    4 app servers
    1 async worker box

  52. dead simple to set up

  53. memcached-like in simplicity

  54. got us through 1.5 years of growth

  55. photo: MAMJODH

  56. messy to add/deploy new workers

  57. single core, 60ms mean submission time

  58. 1s+ enqueue time under load

  59. 8 gearman brokers
    400 app servers
    12,000+ threads
    32 async worker boxes

  60. v2: “sharded” gearman

  61. BROKERS[node_index % len(BROKERS)]

  62. no graceful failover
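
The one-line sharding scheme on slide 61 can be sketched as below. This is a minimal illustration, not Instagram's code: the broker addresses and both function names are made up for the example. It also shows why slide 62 says there is no graceful failover: every node whose index maps to a dead broker stays stranded until the list is edited and redeployed.

```python
# Hypothetical broker addresses; the talk does not list real hostnames.
BROKERS = ["gearman-1:4730", "gearman-2:4730", "gearman-3:4730"]


def pick_broker(node_index, brokers=BROKERS):
    """Map an app server to a broker with plain modulo arithmetic."""
    return brokers[node_index % len(brokers)]


def stranded_nodes(node_count, dead_broker):
    """List the nodes that keep pointing at dead_broker.

    Static modulo sharding has no rerouting step, so these nodes cannot
    submit tasks until the broker list itself is changed.
    """
    return [n for n in range(node_count) if pick_broker(n) == dead_broker]


if __name__ == "__main__":
    for node in range(6):
        print(node, "->", pick_broker(node))
    print("stranded:", stranded_nodes(6, "gearman-2:4730"))
```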

  63. # of app servers growing quickly

  64. persistence was more dangerous than not persisting

  65. simple thing was waking us up & becoming operational burden

  66. operating at new scale

  67. time to move on

  69. your infra

  70. please thank all your soon to be decommissioned infra pieces

  71. basically didn't think about Gearman until we had to

  72. “do the simple thing next”

  73. roll your own

  74. rewrite gearman

  75. v3: celery and rabbitmq

  76. celery
    for much simpler worker code

  77. rabbitmq
    low(ish) maintenance

  78. any dev can add async task with one @task decorator

  79. kick off with function.delay()

  80. replication + failover + persistence

  81. 5ms mean
    10ms P90

  82. opportunity to gain both operational & dev efficiency

  83. more details: http://bit.ly/igcelery

  84. DEPLOYMENT

  85. the art of getting code to prod

  86. v1: fab and git pull

  87. fabric: Python remote scripting

  88. > fab djangos update_git
    > fab djangos restart_django

  89. great for 2 engineers

  90. past 12 machines = pain

  91. v2: fab parallel mode to the rescue

  92. > fab -z20 djangos update_git
    > fab -z20 djangos restart_django

  93. worked up to 70 machines

  94. the year of the GitHub DDOSs

  95. swear it wasn't us deploying

  96. v3: fab rollout

  97. > fab -z20 djangos rollout:server
    ...doing fresh git fetch
    ...zipping up origin/master
    ...uploading to S3
    ...pulling down zip
    ...unpacking zip
    ...mapping 'current' symlink
    ...restarting Django
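
The "unpacking zip" and "mapping 'current' symlink" steps of the rollout above can be sketched locally as follows. This is an illustration under assumptions: the directory layout and both function names are invented, and the S3 upload and Django restart are left out. The idea it shows is that each release unpacks into its own directory and the `current` link is swapped with a single rename, so the app never sees a half-unpacked tree.

```python
import os
import tempfile
import zipfile


def unpack_release(zip_path, releases_dir, release_name):
    """Unpack a release zip into its own directory under releases_dir."""
    target = os.path.join(releases_dir, release_name)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def point_current_at(release_dir, current_link):
    """Repoint the 'current' symlink at release_dir with one rename."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # On POSIX, rename replaces the old link atomically.
    os.replace(tmp_link, current_link)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    zip_path = os.path.join(root, "build.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("app/version.txt", "abc123")

    release = unpack_release(zip_path, os.path.join(root, "releases"), "abc123")
    current = os.path.join(root, "current")
    point_current_at(release, current)
    print(os.path.realpath(current))
```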

  98. lasted us another 1.5 years

  99. IG infra 2 to 10 eng

  100. “hey, can I roll out?”
    “wait! I'm already rolling”

  101. v4: enter Sauron

  104. lasted us another 1.5 years

  105. v5: scaling institutional knowledge

  106. “did you remember to roll to a canary?”
    “don't roll to the workers with a -z of > 40!”
    “did you tail the error logs?”
    “did you catch that new tier we deployed?”

  107. > fab -z20 djangos rollout:server
    ...grabbing lock from Sauron
    ...doing fresh git fetch
    ...zipping up origin/master
    ...uploading to S3
    ...pulling down zip to canary 1
    ...unpacking zip on canary 1
    ...mapping 'current' symlink on canary 1
    ...restarting Django on canary 1

  108. ...tailing error logs on canary 1
    ...ok, 200 responses are even
    ...deploying to async worker 1
    ...measuring success rate on worker 1
    ...looks good, deploying widely
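
The rollout log on slides 107 and 108 encodes a procedure: take a lock, deploy to a canary, check its health, and only then deploy widely. A minimal sketch of that control flow is below. It is not Sauron or the real fab task; the `rollout` function, its parameters, and the 1% error threshold are all assumptions, and `deploy` and `error_rate` are callables the caller supplies.

```python
import threading


class RolloutAborted(Exception):
    pass


def rollout(hosts, deploy, error_rate, lock, canary_count=1, max_error_rate=0.01):
    """Deploy to a few canaries, check their error rate, then go wide.

    The lock plays the role of the rollout lock, so only one rollout
    can be in flight at a time.
    """
    if not lock.acquire(blocking=False):
        raise RolloutAborted("another rollout is already in progress")
    try:
        canaries, rest = hosts[:canary_count], hosts[canary_count:]
        for host in canaries:
            deploy(host)
            if error_rate(host) > max_error_rate:
                raise RolloutAborted(f"canary {host} looks unhealthy")
        for host in rest:
            deploy(host)
        return canaries + rest
    finally:
        lock.release()


if __name__ == "__main__":
    deployed = []
    done = rollout(
        hosts=["canary1", "web1", "web2"],
        deploy=deployed.append,
        error_rate=lambda host: 0.0,
        lock=threading.Lock(),
    )
    print(done)
```

Writing the checklist as code is what turns the questions on slide 106 from things people must remember into steps that cannot be skipped.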

  109. “hold on, aren't you basically doing continuous deployment, but not?”

  110. backend committers++

  111. human lock contention

  112. v6: continuous deployment

  113. extended Sauron with Jenkins integration

  115. take human procedure, automate

  116. deeply understood every step of our deploy

  117. has scaled to 50+ committers on backend codebase

  118. SEARCH

  119. v1: minimize moving parts

  120. SELECT id FROM users WHERE full_name LIKE ...

  121. postgres & search, sittin' in a b-tree

  122. prefix-only, plz

  123. haystack was pretty small
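
The v1 search on slides 120 to 122 is a prefix-only `LIKE` query. The sketch below runs the same shape of query against an in-memory SQLite database so it is self-contained; SQLite stands in for Postgres here, and the table contents and the `prefix_search` helper are invented for the example. In Postgres, a b-tree index can serve `LIKE 'prefix%'` only when the pattern is anchored at the start, and depending on the database locale it may need an index built with an operator class such as `text_pattern_ops`.

```python
import sqlite3


def prefix_search(conn, prefix, limit=10):
    """Return ids of users whose full_name starts with prefix."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT id FROM users WHERE full_name LIKE ? ESCAPE '\\' "
        "ORDER BY id LIMIT ?",
        (escaped + "%", limit),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("CREATE INDEX users_full_name ON users (full_name)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "justin example"), (2, "justine sample"), (3, "taylor example")],
    )
    print(prefix_search(conn, "justin"))
```

A prefix match has no notion of popularity or ranking, which is the gap the next slides run into.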

  124. ok, but Bieber

  125. CELEBRITY_OVERRIDES = {
        'taylor swift': 19151555,
        'taylorswift': 19151555,
        'justinbieber': 6860189,
        'justin bieber': 6860189
    }
    ACTUAL CODE :(

  126. ok, but Selena & Taylor & Harry & Zayn & ...

  127. aka product needs have evolved

  128. v2: Solr

  129. Lucene-based
    HTTP/JSON interface
    great indexing options

  130. curl -XPUT 'http://solr/update/json' -d '{
      "add": {
        "doc": {
          "username": "justinbieber",
          "followed_by": 12345678
        }
      }
    }'

  131. - CELEBRITY_OVERRIDES = {
    -   'taylor swift': 19151555,
    -   'taylorswift': 19151555,
    -   'justin bieber': 6860189
    - }

  132. <1 month to transfer over

  133. launch Android

  134. 4x the queries

  135. no SolrCloud yet

  136. index twice?
    partition by prefix?

  137. scale had changed

  138. v3: Elasticsearch

  139. curl -XPUT 'http://es:9200/users/user/6860189' -d '{
      "username": "justinbieber",
      "followed_by": 12345678
    }'

  140. also Lucene based
    easy query API
    out-of-box cluster support

  141. very simple to set up

  142. in a steady state, worked beautifully

  143. but (at least in 2013) had high operational overhead

  144. split brain

  145. AWS autodiscovery

  146. had to keep queries simple

  147. not enough engineers to fully staff search team

  148. meanwhile, instagration

  149. v4: Unicorn

  150. FB's graph search system

  151. core idea: use social edges as part of the search

  152. // people who I follow named Justin
    (and (term justin*)
         (term followedby:4))
    // people followed by the people I follow, named Justin
    (and (term justin*)
         (apply followedby:(term followedby:4)))
    // people named Justin, prioritizing the people I follow
    (weak-and (term followedby:4 :optional-hits 2)
              (term justin*))
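
The three queries on slide 152 can be read as set operations over posting lists. The toy model below mimics them with Python sets. It is not Unicorn: the index contents are invented, `name:justin` stands in for the prefix term `justin*`, and `weak_and` is a simplified reading of `:optional-hits` (results that match the optional term come first, and at most N results may miss it).

```python
# Toy inverted index: term -> set of user ids. All data is invented.
INDEX = {
    "name:justin": {10, 11, 12, 13},
    "followedby:4": {10, 20},    # users that user 4 follows
    "followedby:10": {11},       # users that user 10 follows
    "followedby:20": {12, 30},   # users that user 20 follows
}


def term(key):
    """Look up one posting list."""
    return INDEX.get(key, set())


def and_(*result_sets):
    """Keep only ids present in every input set."""
    return set.intersection(*result_sets)


def apply_(prefix, inner):
    """Union the posting lists prefix<id> for each id in inner."""
    out = set()
    for user_id in inner:
        out |= term(f"{prefix}{user_id}")
    return out


def weak_and(required, optional, optional_hits):
    """Rank hits matching both first; allow optional_hits that miss `optional`."""
    both = sorted(required & optional)
    only_required = sorted(required - optional)[:optional_hits]
    return both + only_required


justins = term("name:justin")
i_follow = and_(justins, term("followedby:4"))
friends_of_friends = and_(justins, apply_("followedby:", term("followedby:4")))
ranked = weak_and(justins, term("followedby:4"), optional_hits=2)

if __name__ == "__main__":
    print(i_follow, friends_of_friends, ranked)
```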

  153. double-digit % increase in search clicks per daily active

  154. bonus: new Explore photos

  155. v1: most liked, globally

  157. trying to be everything to everyone

  158. v2: photos liked by people I follow

  159. let's get social

  160. // photos I haven't liked, but the people I follow liked
    (difference
        (or likedby:friendA likedby:friendB …)
        likedby:4
    )

  162. who I follow is not always who has my taste

  163. // photos I haven't liked yet, liked by people whose photos I already liked
    (difference
        (apply liker:
            (extract owner: liker:4))
        liker:4)
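
The two Explore queries above (slides 160 and 163) differ only in whose likes they start from. A small sketch with Python sets makes the difference visible. The sample data and the names `explore_v2` and `explore_v3` are assumptions for illustration; the real queries run inside Unicorn, not in application code.

```python
# Invented sample data: who liked which photo, and who owns each photo.
LIKES = {                # user id -> set of photo ids they liked
    4: {"p1"},
    7: {"p1", "p2"},
    8: {"p3"},
    9: {"p4", "p5"},
}
OWNER = {"p1": 9, "p2": 8, "p3": 7, "p4": 9, "p5": 8}
FOLLOWS = {4: {7, 8}}    # user id -> set of user ids they follow


def liked_by(users):
    """Union of the photos liked by any of the given users."""
    photos = set()
    for user in users:
        photos |= LIKES.get(user, set())
    return photos


def explore_v2(me):
    """Photos liked by people I follow, minus photos I already liked."""
    return liked_by(FOLLOWS.get(me, set())) - LIKES.get(me, set())


def explore_v3(me):
    """Photos liked by the owners of photos I liked, minus my own likes."""
    owners = {OWNER[photo] for photo in LIKES.get(me, set())}
    return liked_by(owners) - LIKES.get(me, set())


if __name__ == "__main__":
    print(sorted(explore_v2(4)))
    print(sorted(explore_v3(4)))
```

In this sample, user 4 follows users 7 and 8 but has only liked a photo owned by user 9, so the two versions recommend entirely different photos.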

  164. 6x increase in taps into photos on Explore

  165. http://bit.ly/fbunicorn

  166. TAKEAWAYS

  167. 1) do the simple thing first
    2) until your {scale, team, product} changes

  168. ground your evolution in problem-solving

  169. then do the next simplest thing

  170. get in touch:
    [email protected]