Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Mike Krieger, INSTAGRAM
 SCALING INSTAGRAM
 AIRBNB OPEN AIR SUMMIT 2015

Slide 3

Slide 3 text

co-founder, technical lead

Slide 4

Slide 4 text

São Paulo, Brazil photo: Diego Torres Silvestre

Slide 5

Slide 5 text

Stanford SymSys photo: Waqas Mustafeez

Slide 6

Slide 6 text

@mikeyk
 [email protected]

Slide 7

Slide 7 text

LAST TIME I GAVE 
 A TALK AT AIRBNB…

Slide 8

Slide 8 text

it was April 2012

Slide 9

Slide 9 text

we were 2 years old

Slide 10

Slide 10 text

2 product guys with 
 no backend experience

Slide 11

Slide 11 text

@goldenretrieverbailey

Slide 12

Slide 12 text

we had been acquired
 the week before

Slide 13

Slide 13 text

I had not slept much

Slide 14

Slide 14 text

we had an engineering team of 4 people

Slide 15

Slide 15 text

we had about
 30 million monthly actives

Slide 16

Slide 16 text

Taylor Swift CMA

Slide 17

Slide 17 text

TODAY...

Slide 18

Slide 18 text

we're 5 years old

Slide 19

Slide 19 text

sleeping (slightly) more

Slide 20

Slide 20 text

hired better coders than me

Slide 21

Slide 21 text

we have an eng 
 team of 95 people

Slide 22

Slide 22 text

we have over 
 300 million monthly actives

Slide 23

Slide 23 text

Taylor Swift Grammy

Slide 24

Slide 24 text

THIS TALK

Slide 25

Slide 25 text

how is Instagram infra 
 different in 2015?

Slide 26

Slide 26 text

what guides our evolution?

Slide 27

Slide 27 text

how we adapted to infra, team, and product changes

Slide 28

Slide 28 text

ORIGINAL PHILOSOPHY

Slide 29

Slide 29 text

do the simple 
 thing first

Slide 30

Slide 30 text

aka YAGNI

Slide 31

Slide 31 text

aka Use Boring Technology

Slide 32

Slide 32 text

boring means 
 operationally quiet, too

Slide 33

Slide 33 text

nginx & redis & memcached & postgres & gearman & django

Slide 34

Slide 34 text

2015 EDITION

Slide 35

Slide 35 text

nginx & redis & memcached & postgres & gearman & django

Slide 36

Slide 36 text

nginx & cassandra & memcached & postgres & rabbitmq & django

Slide 37

Slide 37 text

unicorn & proxygen & scribe & thrift
 nginx & cassandra & memcached & postgres & rabbitmq & django

Slide 38

Slide 38 text

1. do the simple thing first
 2. until your {scale, team, product} changes

Slide 39

Slide 39 text

1. do the simple thing first
 2. until your {scale, team, product} changes

Slide 40

Slide 40 text

scaling = replacing all components of a car while
 driving at 100mph

Slide 41

Slide 41 text

which components to 
 replace & when

Slide 42

Slide 42 text

DEEPER DIVE

Slide 43

Slide 43 text

Async Tasks (site scale)
 Code Deployment (team scale)
 Search (product scale)

Slide 44

Slide 44 text

ASYNC TASKS

Slide 45

Slide 45 text

requests should take < 3s

Slide 46

Slide 46 text

fan-out delivery to all your followers' feeds

Slide 47

Slide 47 text

especially popular users

Slide 48

Slide 48 text

post to external services (eg FB & Twitter)

Slide 49

Slide 49 text

v1: Gearman

Slide 50

Slide 50 text

async task broker

Slide 51

Slide 51 text

1 gearman broker
 4 app servers
 1 async worker box

Slide 52

Slide 52 text

dead simple to set up

Slide 53

Slide 53 text

memcached-like in simplicity

Slide 54

Slide 54 text

got us through 1.5 years of growth

Slide 55

Slide 55 text

photo: MAMJODH

Slide 56

Slide 56 text

messy to add/deploy new workers

Slide 57

Slide 57 text

single core, 60ms mean submission time

Slide 58

Slide 58 text

1s+ enqueue time under load

Slide 59

Slide 59 text

8 gearman brokers
 400 app servers
 12,000+ threads
 32 async worker boxes

Slide 60

Slide 60 text

v2: “sharded” gearman

Slide 61

Slide 61 text

BROKERS[node_index % len(BROKERS)]
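
The lookup above is plain modulo sharding. A minimal sketch of how it might have been wired up, assuming hypothetical broker hostnames and assuming `node_index` is an integer assigned to each app server (the slides do not say how it was derived):

```python
# Hypothetical broker addresses; the deck only says there were 8 brokers.
BROKERS = [f"gearman{i}.example:4730" for i in range(8)]


def broker_for(node_index: int) -> str:
    """Pin an app server to one broker by taking its index modulo the broker count."""
    return BROKERS[node_index % len(BROKERS)]


# Each app server always talks to the same broker. Load spreads evenly
# while every broker is up, but nothing here reroutes traffic when one dies.
assignments = [broker_for(n) for n in range(400)]
```

With 400 app servers and 8 brokers, each broker ends up with 50 clients. The mapping is static, which fits the next slide's complaint about failover.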

Slide 62

Slide 62 text

no graceful failover

Slide 63

Slide 63 text

# of app servers growing quickly

Slide 64

Slide 64 text

persistence was more dangerous than not persisting

Slide 65

Slide 65 text

simple thing was waking us up & becoming operational burden

Slide 66

Slide 66 text

operating at new scale

Slide 67

Slide 67 text

time to move on

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

your infra

Slide 70

Slide 70 text

please thank all your soon to be decommissioned infra pieces

Slide 71

Slide 71 text

basically didn't think about Gearman until we had to

Slide 72

Slide 72 text

“do the simple thing next”

Slide 73

Slide 73 text

roll your own

Slide 74

Slide 74 text

rewrite gearman

Slide 75

Slide 75 text

v3: celery and rabbitmq

Slide 76

Slide 76 text

celery
 for much simpler worker code

Slide 77

Slide 77 text

rabbitmq
 low(ish) maintenance

Slide 78

Slide 78 text

any dev can add async task with one @task decorator

Slide 79

Slide 79 text

kick off with function.delay()

Slide 80

Slide 80 text

replication + failover
 + persistence

Slide 81

Slide 81 text

5ms mean
 10ms P90

Slide 82

Slide 82 text

opportunity to gain both operational & dev efficiency

Slide 83

Slide 83 text

more details:
 http://bit.ly/igcelery

Slide 84

Slide 84 text

DEPLOYMENT

Slide 85

Slide 85 text

the art of getting code to prod

Slide 86

Slide 86 text

v1: fab and git pull

Slide 87

Slide 87 text

fabric: Python remote scripting

Slide 88

Slide 88 text

> fab djangos update_git
 > fab djangos restart_django

Slide 89

Slide 89 text

great for 2 engineers

Slide 90

Slide 90 text

past 12 machines = pain

Slide 91

Slide 91 text

v2: fab parallel mode
 to the rescue

Slide 92

Slide 92 text

> fab -z20 djangos update_git
 > fab -z20 djangos restart_django
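
The `-z20` flag above caps how many hosts are worked on at once. The same bounded-parallelism idea can be sketched with the standard library alone; this swaps in a thread pool for Fabric's own mechanism, and the host names and the `update_git` stub are hypothetical (no SSH happens here).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical host list standing in for the "djangos" role.
DJANGOS = [f"django{i:02d}" for i in range(70)]


def update_git(host: str) -> str:
    # Placeholder for the real remote step (SSH in, pull new code).
    return f"{host}: updated"


def run_parallel(step, hosts, pool_size=20):
    """Run `step` on every host, with at most `pool_size` running at a time."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() returns results in the same order as `hosts`.
        return list(pool.map(step, hosts))


results = run_parallel(update_git, DJANGOS)
```

The pool size trades deploy time against load on whatever the hosts pull from, which is where the next slides go.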

Slide 93

Slide 93 text

worked up to 70 machines

Slide 94

Slide 94 text

the year of the GitHub DDOSs

Slide 95

Slide 95 text

swear it wasn't us deploying

Slide 96

Slide 96 text

v3: fab rollout

Slide 97

Slide 97 text

> fab -z20 djangos rollout:server
 ...doing fresh git fetch
 ...zipping up origin/master
 ...uploading to S3
 ...pulling down zip
 ...unpacking zip
 ...mapping 'current' symlink
 ...restarting Django

Slide 98

Slide 98 text

lasted us another 1.5 years

Slide 99

Slide 99 text

IG infra 2 to 10 eng

Slide 100

Slide 100 text

“hey, can I roll out?”
 “wait! I'm already rolling”

Slide 101

Slide 101 text

v4: enter Sauron

Slide 102

Slide 102 text

No content

Slide 103

Slide 103 text

No content

Slide 104

Slide 104 text

lasted us another 1.5 years

Slide 105

Slide 105 text

v5: scaling institutional knowledge

Slide 106

Slide 106 text

“did you remember to roll to a canary?”
 “don't roll to the workers with a -z of > 40!”
 “did you tail the error logs?”
 “did you catch that new tier we deployed?”

Slide 107

Slide 107 text

> fab -z20 djangos rollout:server
 ...grabbing lock from Sauron
 ...doing fresh git fetch
 ...zipping up origin/master
 ...uploading to S3
 ...pulling down zip to canary 1
 ...unpacking zip on canary 1
 ...mapping 'current' symlink on canary 1
 ...restarting Django on canary 1

Slide 108

Slide 108 text

...tailing error logs on canary 1
 ...ok, 200 responses are even
 ...deploying to async worker 1
 ...measuring success rate on worker 1
 ...looks good, deploying widely
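
The canary steps in the rollout log above can be sketched as a small gate. This is an illustration under stated assumptions, not Instagram's script: `rollout`, its parameters, and the 1% error threshold are all hypothetical, and the deck's actual check compares rates of 200 responses rather than a fixed threshold.

```python
def rollout(build, canary, fleet, deploy, error_rate, max_error_rate=0.01):
    """Deploy to one canary, check its health, then deploy to the rest."""
    deploy(canary, build)
    rate = error_rate(canary)
    if rate > max_error_rate:
        # Stop before the bad build reaches the rest of the fleet.
        raise RuntimeError(f"canary {canary} error rate {rate:.2%}; aborting")
    for host in fleet:
        deploy(host, build)


# Usage with in-memory fakes (hypothetical hosts and metrics).
deployed = []
rollout(
    build="abc123",
    canary="canary1",
    fleet=["django01", "django02"],
    deploy=lambda host, build: deployed.append(host),
    error_rate=lambda host: 0.001,
)
```

Encoding the check in the script is what turns the quoted reminders into something no one has to remember.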

Slide 109

Slide 109 text

“hold on, aren't you basically doing continuous deployment, but not?”

Slide 110

Slide 110 text

backend committers++

Slide 111

Slide 111 text

human lock contention

Slide 112

Slide 112 text

v5: continuous deployment

Slide 113

Slide 113 text

extended Sauron with Jenkins integration

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

take human procedure, automate

Slide 116

Slide 116 text

deeply understood every step of our deploy

Slide 117

Slide 117 text

has scaled to 50+ committers on backend codebase

Slide 118

Slide 118 text

SEARCH

Slide 119

Slide 119 text

v1: minimize moving parts

Slide 120

Slide 120 text

SELECT id FROM users WHERE full_name LIKE ...

Slide 121

Slide 121 text

postgres & search, sittin' in
 a b-tree

Slide 122

Slide 122 text

prefix-only, plz
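
Why prefix-only: a b-tree keeps keys in sorted order, so every name sharing a prefix sits in one contiguous range, while a substring match gets no help from the ordering. A minimal sketch of that range scan, using a sorted Python list as a stand-in for the index (sample names are made up, and the upper bound assumes ordinary text):

```python
import bisect

# Sorted list standing in for a b-tree index on full_name (sample data).
NAMES = sorted(["justin bieber", "justin timberlake", "taylor swift", "zayn malik"])


def prefix_search(prefix: str) -> list[str]:
    """Return names starting with `prefix` via a range scan over sorted keys."""
    lo = bisect.bisect_left(NAMES, prefix)
    # Names with this prefix sort before prefix + the largest code point.
    hi = bisect.bisect_right(NAMES, prefix + "\U0010ffff")
    return NAMES[lo:hi]


matches = prefix_search("justin")
```

Both bounds are found by binary search, so the cost depends on the number of matches rather than the size of the table.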

Slide 123

Slide 123 text

haystack was pretty small

Slide 124

Slide 124 text

ok, but Bieber

Slide 125

Slide 125 text

CELEBRITY_OVERRIDES = {
     'taylor swift': 19151555,
     'taylorswift': 19151555,
     'justinbieber': 6860189,
     'justin bieber': 6860189
 }
 ACTUAL CODE :(

Slide 126

Slide 126 text

ok, but Selena & Taylor & Harry & Zayn & ...

Slide 127

Slide 127 text

aka product needs have evolved

Slide 128

Slide 128 text

v2: Solr

Slide 129

Slide 129 text

Lucene-based
 HTTP/JSON interface
 great indexing options

Slide 130

Slide 130 text

curl -XPUT 'http://solr/update/json' -d '{
     "add": {
         "doc": {
             "username": "justinbieber",
             "followed_by": 12345678
         }
     }
 }'

Slide 131

Slide 131 text

- CELEBRITY_OVERRIDES = {
 -   'taylor swift': 19151555,
 -   'taylorswift': 19151555,
 -   'justin bieber': 6860189
 - }

Slide 132

Slide 132 text

<1 month to transfer over

Slide 133

Slide 133 text

launch Android

Slide 134

Slide 134 text

4x the queries

Slide 135

Slide 135 text

no SolrCloud yet

Slide 136

Slide 136 text

index twice?
 partition by prefix?

Slide 137

Slide 137 text

scale had changed

Slide 138

Slide 138 text

v3: ElasticSearch

Slide 139

Slide 139 text

curl -XPUT 'http://es:9200/users/user/6860189' -d '{
     "username": "justinbieber",
     "followed_by": 12345678
 }'
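
The same indexing call could be issued from Python. A minimal sketch that builds the request without sending it; the URL and field names come from the slide, while `build_index_request` and the `Content-Type` header are assumptions added for illustration.

```python
import json
import urllib.request


def build_index_request(user_id: int, username: str, followed_by: int):
    """Build (but do not send) the PUT that indexes one user document."""
    body = json.dumps({"username": username, "followed_by": followed_by})
    return urllib.request.Request(
        url=f"http://es:9200/users/user/{user_id}",  # host as shown on the slide
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not on the slide
        method="PUT",
    )


req = build_index_request(6860189, "justinbieber", 12345678)
# urllib.request.urlopen(req) would send it; omitted because "es" is a placeholder host.
```

Putting the user ID in the URL means re-sending the document overwrites the old copy, which suits a follower count that keeps changing.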

Slide 140

Slide 140 text

also Lucene based
 easy query API
 out-of-box cluster support

Slide 141

Slide 141 text

very simple to set up

Slide 142

Slide 142 text

in a steady state, worked beautifully

Slide 143

Slide 143 text

but (at least in 2013) had high operational overhead

Slide 144

Slide 144 text

split brain

Slide 145

Slide 145 text

AWS autodiscovery

Slide 146

Slide 146 text

had to keep queries simple

Slide 147

Slide 147 text

not enough engineers to fully staff search team

Slide 148

Slide 148 text

meanwhile, instagration

Slide 149

Slide 149 text

v4: Unicorn

Slide 150

Slide 150 text

FB's graph search system

Slide 151

Slide 151 text

core idea: use social edges as part of the search

Slide 152

Slide 152 text

// people who I follow named Justin
 (and (term justin*)
      (term followedby:4))
 // people followed by the people I follow, named Justin
 (and (term justin*)
      (apply followedby:(term followedby:4)))
 // people named Justin, prioritizing the people I follow
 (weak-and (term followedby:4 :optional-hits 2)
           (term justin*))
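
The first two queries above can be read as set operations. A rough sketch with in-memory sets, assuming `followedby:4` means "users that user 4 follows"; the sample graph, names, and helper functions are hypothetical, and this ignores ranking, sharding, and `weak-and` entirely.

```python
# Hypothetical sample graph: user id -> ids that user follows.
FOLLOWS = {4: {10, 11}, 10: {20, 21}, 11: {21, 22}}
NAMES = {10: "justin a", 11: "maria", 20: "justin b",
         21: "justine", 22: "zoe", 30: "justin c"}


def name_prefix(prefix):
    """Stand-in for (term justin*): ids whose name starts with the prefix."""
    return {uid for uid, name in NAMES.items() if name.startswith(prefix)}


def followed_by(uid):
    """Stand-in for (term followedby:uid): ids that `uid` follows."""
    return FOLLOWS.get(uid, set())


# (and (term justin*) (term followedby:4))
direct = name_prefix("justin") & followed_by(4)

# (apply followedby: (term followedby:4)): run the inner query, then
# union followedby:<id> over every id it returned.
second_degree = set().union(*(followed_by(u) for u in followed_by(4)))
friends_of_friends = name_prefix("justin") & second_degree
```

User 30 matches the name but has no social edge to user 4, so it appears in neither result. That is the case `weak-and` exists to handle.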

Slide 153

Slide 153 text

double-digit % increase in search clicks per daily active

Slide 154

Slide 154 text

bonus: new Explore photos

Slide 155

Slide 155 text

v1: most liked, globally

Slide 156

Slide 156 text

No content

Slide 157

Slide 157 text

trying to be everything to everyone

Slide 158

Slide 158 text

v2: photos liked by
 people I follow

Slide 159

Slide 159 text

let's get social

Slide 160

Slide 160 text

// photos I haven't liked, but the people I follow liked
 (difference
     (or likedby:friendA likedby:friendB …)
     likedby:4
 )
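
In set terms, the query above is a union followed by a difference. A minimal sketch with made-up data; `LIKES`, `FOLLOWS`, and `explore_v2` are hypothetical names, not part of Unicorn.

```python
# Hypothetical sample data.
LIKES = {4: {"p1"}, 10: {"p1", "p2"}, 11: {"p2", "p3"}}  # user -> liked photos
FOLLOWS = {4: {10, 11}}                                  # user -> followed users


def explore_v2(me):
    """Photos liked by people `me` follows, minus photos `me` already liked."""
    liked_by_friends = set().union(
        *(LIKES.get(friend, set()) for friend in FOLLOWS.get(me, set()))
    )
    return liked_by_friends - LIKES.get(me, set())


candidates = explore_v2(4)
```

Here `p1` drops out because user 4 already liked it.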

Slide 161

Slide 161 text

No content

Slide 162

Slide 162 text

who I follow (not always) who has my taste

Slide 163

Slide 163 text

// photos I haven't liked yet, liked by people whose photos I already liked
 (difference
     (apply liker:
         (extract owner: liker:4))
     liker:4)
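
Read inside-out, the query above takes my likes, extracts the owners of those photos, gathers what those owners liked, and removes what I already liked. A minimal sketch of that pipeline with made-up data; `OWNER`, `LIKES`, and `explore_v3` are hypothetical names.

```python
# Hypothetical sample data.
OWNER = {"p1": 10, "p2": 11, "p3": 12, "p4": 13}         # photo -> owner
LIKES = {4: {"p1", "p2"}, 10: {"p2", "p3"}, 11: {"p4"}}  # user -> liked photos


def explore_v3(me):
    """Photos liked by the owners of photos `me` liked, minus `me`'s own likes."""
    mine = LIKES.get(me, set())                # liker:4
    owners = {OWNER[photo] for photo in mine}  # extract owner:
    liked_by_owners = set().union(             # apply liker:
        *(LIKES.get(owner, set()) for owner in owners)
    )
    return liked_by_owners - mine              # difference


candidates = explore_v3(4)
```

The follow graph never appears here. Taste is inferred from likes alone, which is the point of the preceding slide.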

Slide 164

Slide 164 text

6x increase in taps into photos on Explore

Slide 165

Slide 165 text

http://bit.ly/fbunicorn

Slide 166

Slide 166 text

TAKEAWAYS

Slide 167

Slide 167 text

1. do the simple thing first
 2. until your {scale, team, product} changes

Slide 168

Slide 168 text

ground your evolution in
 problem-solving

Slide 169

Slide 169 text

then do the next simplest thing

Slide 170

Slide 170 text

get in touch:
 [email protected]

Slide 171

Slide 171 text

No content