Scaling Instagram at AirBnB OpenAir 2015

Mike Krieger

June 04, 2015

Transcript

  1. None
  2. Mike Krieger INSTAGRAM SCALING INSTAGRAM AIRBNB OPEN AIR SUMMIT 2015

  3. co-founder, technical lead

  4. São Paulo, Brazil photo: Diego Torres Silvestre

  5. Stanford SymSys photo: Waqas Mustafeez

  6. @mikeyk
 mike@instagram.com

  7. LAST TIME I GAVE 
 A TALK AT AIRBNB…

  8. it was April 2012

  9. we were 2 years old

  10. 2 product guys with 
 no backend experience

  11. @goldenretrieverbailey

  12. we had been acquired
 the week before

  13. I had not slept much

  14. we had an engineering team of 4 people

  15. we had about
 30 million monthly actives

  16. Taylor Swift CMA

  17. TODAY...

  18. we're 5 years old

  19. sleeping (slightly) more

  20. hired better coders than me

  21. we have an eng 
 team of 95 people

  22. we have over 
 300 million monthly actives

  23. Taylor Swift Grammy

  24. THIS TALK

  25. how is Instagram infra 
 different in 2015?

  26. what guides our evolution?

  27. how we adapted to infra, team, and product changes

  28. ORIGINAL PHILOSOPHY

  29. do the simple 
 thing first

  30. aka YAGNI

  31. aka Use Boring Technology

  32. boring means 
 operationally quiet, too

  33. nginx & redis & memcached & postgres & gearman & django

  34. 2015 EDITION

  35. nginx & redis & memcached & postgres & gearman & django

  36. nginx & cassandra & memcached & postgres & rabbitmq & django

  37. unicorn & proxygen & scribe & thrift
 nginx & cassandra & memcached & postgres & rabbitmq & django

  38. do the simple
 thing first (1) until your {scale, team, product} changes (2)

  39. do the simple
 thing first (1) until your {scale, team, product} changes (2)

  40. scaling = replacing all components of a car while
 driving at 100mph

  41. which components to 
 replace & when

  42. DEEPER DIVE

  43. Async Tasks (site scale)
 Code Deployment (team scale)
 Search (product scale)

  44. ASYNC TASKS

  45. requests should take < 3s

  46. fan-out delivery to all your followers' feeds

  47. especially popular users

  48. post to external services (eg FB & Twitter)

  49. v1: Gearman

  50. async task broker

  51. 1 gearman broker, 4 app servers, 1 async worker box

  52. dead simple to set up

  53. memcached-like in simplicity

  54. got us through 1.5 years of growth

  55. photo: MAMJODH

  56. messy to add/deploy new workers

  57. single core, 60ms mean submission time

  58. 1s+ enqueue time under load

  59. 8 gearman brokers, 400 app servers, 12,000+ threads, 32 async worker boxes

  60. v2: “sharded” gearman

  61. BROKERS[node_index % len(BROKERS)]

  62. no graceful failover
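
The modulo lookup on slide 61 can be sketched as follows. This is a minimal illustration, not Instagram's code: the broker hostnames and port and the `pick_broker` and `remapped_after_loss` helpers are assumptions added for the example; only the indexing expression comes from the slide. It shows one likely reason static sharding has no graceful failover: dropping a dead broker from the list changes the modulus, so app servers get remapped well beyond the ones that were pinned to the dead broker.

```python
# Hypothetical broker list; the slide only shows the indexing expression.
BROKERS = ["gearman-%d:4730" % i for i in range(8)]


def pick_broker(node_index, brokers=BROKERS):
    """Statically map an app server to one broker by modulo."""
    return brokers[node_index % len(brokers)]


def remapped_after_loss(num_nodes, brokers, dead):
    """Count app servers whose broker changes if `dead` is removed from the list."""
    survivors = [b for b in brokers if b != dead]
    moved = 0
    for node in range(num_nodes):
        if pick_broker(node, brokers) != pick_broker(node, survivors):
            moved += 1
    return moved


if __name__ == "__main__":
    # 400 app servers and 8 brokers, as on slide 59.
    print(remapped_after_loss(400, BROKERS, "gearman-7:4730"))
```

With 8 brokers, about 50 of the 400 app servers are pinned to any one broker, yet removing a broker from the list remaps far more nodes than that, which is why a list edit is not a clean failover.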

  63. # of app servers growing quickly

  64. persistence was more dangerous than not persisting

  65. simple thing was waking us up & becoming operational burden

  66. operating at new scale

  67. time to move on

  68. None
  69. your infra

  70. please thank all your soon to be decommissioned infra pieces

  71. basically didn't think about Gearman until we had to

  72. “do the simple thing next”

  73. roll your own

  74. rewrite gearman

  75. v3: celery and rabbitmq

  76. celery
 for much simpler worker code

  77. rabbitmq
 low(ish) maintenance

  78. any dev can add async task with one @task decorator

  79. kick off with function.delay()

  80. replication + failover
 + persistence

  81. 5ms mean
 10ms P90

  82. opportunity to gain both operational & dev efficiency

  83. more details:
 http://bit.ly/igcelery

  84. DEPLOYMENT

  85. the art of getting code to prod

  86. v1: fab and git pull

  87. fabric: Python remote scripting

  88. > fab djangos update_git
      > fab djangos restart_django

  89. great for 2 engineers

  90. past 12 machines = pain

  91. v2: fab parallel mode
 to the rescue

  92. > fab -z20 djangos update_git
      > fab -z20 djangos restart_django

  93. worked up to 70 machines

  94. the year of the GitHub DDOSs

  95. swear it wasn't us deploying

  96. v3: fab rollout

  97. > fab -z20 djangos rollout:server
      ...doing fresh git fetch
      ...zipping up origin/master
      ...uploading to S3
      ...pulling down zip
      ...unpacking zip
      ...mapping 'current' symlink
      ...restarting Django

  98. lasted us another 1.5 years

  99. IG infra 2 to 10 eng

  100. “hey, can I roll out?”
       “wait! I'm already rolling”

  101. v4: enter Sauron

  102. None
  103. None
  104. lasted us another 1.5 years

  105. v5: scaling institutional knowledge

  106. “did you remember to roll to a canary?”
       “don't roll to the workers with a -z of > 40!”
       “did you tail the error logs?”
       “did you catch that new tier we deployed?”

  107. > fab -z20 djangos rollout:server
       ...grabbing lock from Sauron
       ...doing fresh git fetch
       ...zipping up origin/master
       ...uploading to S3
       ...pulling down zip to canary 1
       ...unpacking zip on canary 1
       ...mapping 'current' symlink on canary 1
       ...restarting Django on canary 1

  108. ...tailing error logs on canary 1
       ...ok, 200 responses are even
       ...deploying to async worker 1
       ...measuring success rate on worker 1
       ...looks good, deploying widely

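
The rollout steps on slides 107 and 108 (grab a lock, deploy to a canary, check health, then deploy widely) can be sketched as below. This is a minimal sketch under stated assumptions, not the actual Sauron or fab code: `DeployLock`, `rollout`, and the `deploy` and `healthy` callables are hypothetical, and the lock lives in-process here, whereas the slides describe it as held by Sauron.

```python
import threading


class DeployLock:
    """Stand-in for the lock a rollout grabs from Sauron (hypothetical interface)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def acquire(self, who):
        # Fail fast instead of queueing, mirroring "wait! I'm already rolling".
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("already rolling: %s" % self.holder)
        self.holder = who

    def release(self):
        self.holder = None
        self._lock.release()


def rollout(who, lock, canary, fleet, deploy, healthy):
    """Deploy to the canary first; continue to the fleet only if it looks healthy."""
    lock.acquire(who)
    try:
        deploy(canary)
        if not healthy(canary):
            return "aborted at canary"
        for host in fleet:
            deploy(host)
        return "deployed widely"
    finally:
        lock.release()


if __name__ == "__main__":
    deployed = []
    result = rollout("mike", DeployLock(), "canary1", ["web1", "web2"],
                     deployed.append, lambda host: True)
    print(result, deployed)
```

Encoding the checklist in the tool means the questions on slide 106 no longer depend on each engineer remembering them.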
  109. “hold on, aren't you basically doing continuous deployment, but not?”

  110. backend committers++

  111. human lock contention

  112. v5: continuous deployment

  113. extended Sauron with Jenkins integration

  114. None

  115. take human procedure, automate

  116. deeply understood every step of our deploy

  117. has scaled to 50+ committers on backend codebase

  118. SEARCH

  119. v1: minimize moving parts

  120. SELECT id FROM users WHERE full_name LIKE ...

  121. postgres & search, sittin' in
 a b-tree

  122. prefix-only, plz
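
Why a b-tree favors prefix-only search can be shown with a sorted list and binary search, which is roughly how an ordered index serves an anchored-prefix pattern. This is a standard-library analogy, not the Postgres implementation: `prefix_search` and the sample names are made up for illustration, and the `"\uffff"` sentinel assumes names contain no characters at or above that code point.

```python
import bisect


def prefix_search(sorted_names, prefix):
    """Return all names starting with `prefix` using two binary searches."""
    lo = bisect.bisect_left(sorted_names, prefix)
    # Everything with this prefix sorts before prefix + a very high character.
    hi = bisect.bisect_left(sorted_names, prefix + "\uffff")
    return sorted_names[lo:hi]


if __name__ == "__main__":
    names = sorted(["taylor swift", "justin bieber", "harry styles", "justin timberlake"])
    print(prefix_search(names, "justin"))
    # A mid-string match such as "bieber" is not contiguous in sorted order,
    # so the ordered index cannot narrow it down the same way.
    print(prefix_search(names, "bieber"))
```

Matches for a prefix sit next to each other in sorted order, so the lookup touches only that range; a substring match would need a scan.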

  123. haystack was pretty small

  124. ok, but Bieber

  125. CELEBRITY_OVERRIDES = {
           'taylor swift': 19151555,
           'taylorswift': 19151555,
           'justinbieber': 6860189,
           'justin bieber': 6860189
       }
       ACTUAL CODE :(

  126. ok, but Selena & Taylor & Harry & Zayn & ...

  127. aka product needs have evolved

  128. v2: Solr

  129. Lucene-based
 HTTP/JSON interface
 great indexing options

  130. curl -XPUT 'http://solr/update/json' -d '{
           "add": {
               "doc": {
                   "username": "justinbieber",
                   "followed_by": 12345678
               }
           }
       }'

  131. - CELEBRITY_OVERRIDES = {
       -   'taylor swift': 19151555,
       -   'taylorswift': 19151555,
       -   'justin bieber': 6860189
       - }

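
The slides do not say how the Solr index ranked results. One plausible reading of slides 130 and 131 is that indexing a popularity field such as `followed_by` let the hard-coded overrides be deleted. The sketch below illustrates that idea in plain Python rather than Solr: `search_users` and the sample documents (including the follower counts) are invented for the example; only the field names `username` and `followed_by` come from the slide.

```python
# Made-up documents; only the field names come from the Solr slide.
DOCS = [
    {"username": "justinbieber", "followed_by": 3000},
    {"username": "justin_fan_01", "followed_by": 12},
    {"username": "justintime", "followed_by": 450},
    {"username": "taylorswift", "followed_by": 2800},
]


def search_users(query, docs=DOCS, limit=10):
    """Prefix-match usernames, then rank matches by follower count."""
    q = query.lower().replace(" ", "")
    hits = [d for d in docs if d["username"].startswith(q)]
    hits.sort(key=lambda d: d["followed_by"], reverse=True)
    return [d["username"] for d in hits[:limit]]


if __name__ == "__main__":
    print(search_users("justin"))
    print(search_users("Taylor Swift"))
```

Under this assumption the most-followed match rises to the top without a per-celebrity table.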
  132. <1 month to transfer over

  133. launch Android

  134. 4x the queries

  135. no SolrCloud yet

  136. index twice?
 partition by prefix?

  137. scale had changed

  138. v3: ElasticSearch

  139. curl -XPUT 'http://es:9200/users/user/6860189' -d '{
           "username": "justinbieber",
           "followed_by": 12345678
       }'

  140. also Lucene based
 easy query API
 out-of-box cluster support

  141. very simple to set up

  142. in a steady state, worked beautifully

  143. but (at least in 2013) had high operational overhead

  144. split brain

  145. AWS autodiscovery

  146. had to keep queries simple

  147. not enough engineers to fully staff search team

  148. meanwhile, instagration

  149. v4: Unicorn

  150. FB's graph search system

  151. core idea: use social edges as part of the search

  152. // people who I follow named Justin
       (and (term justin*)
            (term followedby:4))
       // people followed by the people I follow, named Justin
       (and (term justin*)
            (apply followedby:(term followedby:4)))
       // people named Justin, prioritizing the people I follow
       (weak-and (term followedby:4 :optional-hits 2)
                 (term justin*))

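
The three queries on slide 152 can be approximated with Python sets to show what each operator contributes. This is a toy model, not Unicorn: the `FOLLOWS` and `NAMES` data are made up, the helper names are hypothetical, and `weak_and` is a simplified reading of `:optional-hits` in which hits lacking the optional term are admitted only until a small budget runs out.

```python
FOLLOWS = {4: {10, 11}, 10: {12, 13}, 11: {13}}  # user -> users they follow (made up)
NAMES = {10: "justin a", 11: "harry", 12: "justin b", 13: "selena", 14: "justin c"}


def term_name_prefix(prefix):
    """(term justin*): users whose name starts with the prefix."""
    return {uid for uid, name in NAMES.items() if name.startswith(prefix)}


def term_followedby(uid):
    """(term followedby:N): users followed by user N."""
    return set(FOLLOWS.get(uid, ()))


def apply_followedby(uids):
    """(apply followedby: ...): union of followedby:u for every u in the inner result."""
    out = set()
    for uid in uids:
        out |= term_followedby(uid)
    return out


def weak_and(hits, optional_term, optional_hits):
    """Keep hits matching optional_term; admit at most optional_hits hits that do not."""
    kept, budget = [], optional_hits
    for uid in sorted(hits):
        if uid in optional_term:
            kept.append(uid)
        elif budget > 0:
            kept.append(uid)
            budget -= 1
    return kept


if __name__ == "__main__":
    justins = term_name_prefix("justin")
    print(justins & term_followedby(4))                    # people I follow named Justin
    print(justins & apply_followedby(term_followedby(4)))  # followed by people I follow
    print(weak_and(justins, term_followedby(4), 2))        # Justins, follows preferred
```

The `apply` operator is what makes the query social: the inner result feeds a second lookup instead of being returned directly.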
  153. double-digit % increase in search clicks per daily active

  154. bonus: new Explore photos

  155. v1: most liked, globally

  156. None
  157. trying to be everything to everyone

  158. v2: photos liked by
 people I follow

  159. let's get social

  160. // photos I haven't liked, but the people I follow liked
       (difference
           (or likedby:friendA likedby:friendB …)
           likedby:4
       )

  161. None

  162. who I follow is (not always) who has my taste

  163. // photos I haven't liked yet, liked by people whose photos I already liked
       (difference
           (apply liker:
               (extract owner: liker:4))
           liker:4)

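
The two Explore queries on slides 160 and 163 reduce to set operations, which the sketch below spells out. It is a toy model with invented data, not Unicorn: `LIKES`, `OWNER`, `FOLLOWS`, and the function names are assumptions for illustration, and user 4 plays the role of "me" as in the slides.

```python
LIKES = {  # user -> photos that user liked (made-up ids)
    4: {"p1", "p2"},
    20: {"p2", "p3"},
    21: {"p4"},
    30: {"p1", "p5"},
}
OWNER = {"p1": 30, "p2": 31, "p3": 20, "p4": 21, "p5": 32}  # photo -> owner
FOLLOWS = {4: {20, 21}}  # user -> users they follow


def liked_by(user):
    """Photos liked by `user`."""
    return set(LIKES.get(user, ()))


def explore_v2(me):
    """Photos liked by people I follow, minus photos I already liked."""
    friends_liked = set()
    for friend in FOLLOWS.get(me, ()):  # (or likedby:friendA likedby:friendB ...)
        friends_liked |= liked_by(friend)
    return friends_liked - liked_by(me)  # (difference ... likedby:4)


def explore_v3(me):
    """Photos liked by owners of photos I liked, minus photos I already liked."""
    mine = liked_by(me)
    owners = {OWNER[photo] for photo in mine}  # (extract owner: liker:4)
    candidates = set()
    for owner in owners:                       # (apply liker: ...)
        candidates |= liked_by(owner)
    return candidates - mine                   # (difference ... liker:4)


if __name__ == "__main__":
    print(sorted(explore_v2(4)))
    print(sorted(explore_v3(4)))
```

The second version swaps the follow graph for a taste signal: the candidate set comes from people whose photos I liked, whether or not I follow them.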
  164. 6x increase in taps into photos on Explore

  165. http://bit.ly/fbunicorn

  166. TAKEAWAYS

  167. do the simple
 thing first (1) until your {scale, team, product} changes (2)

  168. ground your evolution in
 problem-solving

  169. then do the next simplest thing

  170. get in touch:
 mike@instagram.com

  171. None