
Sam Kitajima-Kimbrel - Bowerbirds of Technology: Architecture and Teams at Less-than-Google Scale

Facebook, Google, Uber, LinkedIn, and friends are the rarefied heights of software engineering. They encounter and solve problems at scales shared by few others, and as a result, their priorities in production engineering and architecture are just a bit different from those of the rest of us down here in the other 99% of services. By deconstructing a few blog posts from these giants, we'll evaluate just what it is they're thinking about when they build systems, and whether any of their choices are relevant to those of us operating at high scale yet still something less than millions of requests per second.

This talk will go into depth on how to make technological decisions to meet your customers’ requirements without requiring a small army of engineers to answer 2 AM pages, and how to set realistic goals for your team around operations, uptime, communications, and disaster recovery.

With these guidelines in mind, you should be better equipped to say no (or yes!) the next time your team’s software hipster proposes moving everything to the Next Big Thing.


PyCon 2018

May 11, 2018


  1. @skimbrel BOWERBIRDS OF TECHNOLOGY
    PYCON 2018 / SAM KITAJIMA-KIMBREL @SKIMBREL
  2. These are bowerbirds! They build these structures called bowers out of sticks and colorful objects they find in their environment in an effort to attract mates. More on them later; for now, enjoy the nice photos of birds that I found on Flickr.
  3. @skimbrel CAL HENDERSON, DJANGOCON US 2008
    "Most websites aren't in the top 100 web sites." Speaking of Flickr, Cal Henderson used to work there. "It turns out all but 100 of them are not in the top 100."
  4. @skimbrel ZIPF DISTRIBUTION
    Zipfian distributions: # links / traffic / users / whatever metric you want is inversely proportional to rank among all sites (this graph happens to be Wikipedia articles, and the axes are log/log). Many empirical studies and measurements of the web have shown this holds true.
  5. @skimbrel YOU ARE NOT GOOGLE, AND THAT'S OKAY!
    Or: most of us, excepting the ones who work at Google, are not Google. And that's okay! Tons of products do just great at not-Google scale.
  6. @skimbrel I AM ALSO NOT GOOGLE
    I am also not Google. And that is me with a slightly different hair color, if you're in the cheap seats. Currently at Nuna (healthcare data; you haven't heard of it), previously at Twilio: 8 years on large-and-fast-growing-but-not-Facebook-scale web services.
  7. @skimbrel WHAT ARE GOOGLE'S PROBLEMS?
    What do Facebook, Amazon, and Google worry about?
    - Absurdly high throughput and data storage requirements
    - Tens of thousands of servers in dozens to hundreds of datacenters worldwide
    - Thousands of engineers and the ability to specialize them into very, very tiny niches
    - Virtually limitless resources and the patience to train people on their systems
    How does this manifest? Let's look at examples.
  8. @skimbrel ABSURDLY HIGH THROUGHPUT AND STORAGE DEMANDS
  9. @skimbrel TENS OF THOUSANDS OF SERVERS
  10. @skimbrel THOUSANDS OF DEVS
  11. @skimbrel (NEAR-)UNLIMITED RESOURCES
  12. @skimbrel CASE STUDIES
    Let's do a few brief case studies.
  13. @skimbrel UBER SCHEMALESS*
    *INVOKING UBER AS A TECHNOLOGICAL EXAMPLE DOES NOT CONSTITUTE AN ENDORSEMENT OF UBER'S BUSINESS STRATEGY, ETHICS, OR CULTURE
    (Uber has given us plenty of bad examples in non-technological things, and this is emphatically not an endorsement of any of their behavior towards human beings.) Case study: Uber hit scaling issues with Postgres and made themselves a new datastore. What were they looking for?
  14. @skimbrel "LINEARLY ADD CAPACITY BY ADDING MORE SERVERS"
  15. @skimbrel "FAVOR WRITE AVAILABILITY OVER READ-YOUR-WRITE SEMANTICS"
  16. @skimbrel EVENT NOTIFICATIONS (TRIGGERS)
    Side note: "we had an asynchronous event system built on Kafka 0.7 and we couldn't get it to run lossless". Have you tried upgrading?
  17. @skimbrel WHAT IS IT, THEN?
    So what did they build?
  18. @skimbrel "APPEND-ONLY SPARSE THREE-DIMENSIONAL PERSISTENT HASH MAP, VERY SIMILAR TO GOOGLE'S BIGTABLE"
    To which my only reply is this comic.
  19. This comic is probably famous enough now, but here it is again.
  20. @skimbrel "So, how do I query the database?" "It's not a database. It's a key-value store!"
  21. @skimbrel "Ok, it's not a database. How do I query it?" "You write a distributed map reduce function in Erlang!"
  22. @skimbrel "Did you just tell me to go **** myself?" "I believe I did, Bob."
  23. @skimbrel THIS HAS A COST
    Point is: this has a cost.
  24. @skimbrel NEW ABSTRACTIONS
    The boundary between app and database changes — the app has to know and enforce schemas and persistence strategies.
  25. @skimbrel EVENTUAL CONSISTENCY
    You can't read things you just wrote, and neither can other processes.
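    A toy sketch of why read-your-writes fails under asynchronous replication (this is not Schemaless itself — the class and lag value here are invented purely for illustration): a replica only sees a write after the replication lag elapses, so a read issued right after the write returns nothing.

    ```python
    import time

    class LaggyReplica:
        """Toy model of an asynchronously replicated store (hypothetical)."""
        def __init__(self, lag_seconds):
            self.lag = lag_seconds
            self.pending = []   # (apply_at, key, value) not yet visible
            self.data = {}

        def replicate(self, key, value):
            # The write becomes visible only after the replication lag.
            self.pending.append((time.monotonic() + self.lag, key, value))

        def read(self, key):
            now = time.monotonic()
            still_pending = []
            for apply_at, k, v in self.pending:
                if apply_at <= now:
                    self.data[k] = v          # lag elapsed: apply the write
                else:
                    still_pending.append((apply_at, k, v))
            self.pending = still_pending
            return self.data.get(key)

    replica = LaggyReplica(lag_seconds=0.05)
    replica.replicate("greeting", "hello")
    print(replica.read("greeting"))  # None — our own write isn't visible yet
    time.sleep(0.06)
    print(replica.read("greeting"))  # 'hello' — visible once the lag passes
    ```

    The app code, not the datastore, has to cope with that window — which is exactly the new abstraction cost from the previous slide.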
  26. @skimbrel FLEXIBLE QUERIES
    - Have to know query patterns ahead of time
    - Mandatory sharding — can't read globally w/o extra work
    - No joins
  27. @skimbrel DEVELOPER FAMILIARITY
    - Can't hire fast, or can't ramp fast
    - People aren't gonna walk in knowing this
    - Good luck w/ contractors
  28. @skimbrel AMAZON AND SERVICE ARCHITECTURE
    Steve Yegge quit Amazon and went to Google, then accidentally made public a long rant about how AWS was going to eat Google's lunch. Major focus: how Amazon got its service-oriented architecture. In 2002ish, Bezos gave the following orders.
  29. @skimbrel "All teams will henceforth expose their data and functionality through service interfaces."
  30. @skimbrel "Teams must communicate with each other through these interfaces."
  31. @skimbrel "There will be no other form of interprocess communication allowed […] the only communication allowed is via service interface calls over the network."
  32. @skimbrel "It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn't matter."
  33. @skimbrel "All service interfaces, without exception, must be designed from the ground up to be externalizable."
    Externalizable here means "exposed to the outside world", i.e. to customers, and sold as a product.
  34. @skimbrel "Anyone who doesn't do this will be fired."
    So yeah. Amazon's way of making systems and developer teams scale. They were and are serious about this.
  35. @skimbrel AMAZON LEARNED SOME THINGS
    This also had a cost. Steve, on what Amazon learned:
  36. @skimbrel "pager escalation gets way harder, because a ticket might bounce through 20 service calls before the real owner is identified"
  37. @skimbrel "every single one of your peer teams suddenly becomes a potential DOS attacker"
  38. @skimbrel "monitoring and QA are the same thing…"
  39. @skimbrel "…the only thing still functioning in the server is the little component that knows how to say 'I'm fine, roger roger, over and out' in a cheery droid voice"
  40. @skimbrel "you won't be able to find any of them without a service-discovery mechanism [...] which is itself another service"
    Which requires a service *registry*, which…
  41. @skimbrel MASSIVELY SCALABLE INFRASTRUCTURE COSTS DEVELOPER TIME
    Again: the crux is that massively scalable infra costs dev time, because the mental model is so much more complicated. And you don't have a lot of that time.
  42. @skimbrel BUT I WANT TO BE GOOGLE!
    "But I *want* to be Google!", you may cry.
  43. @skimbrel GOOGLE WASN'T GOOGLE OVERNIGHT…
  44. @skimbrel BEN GOMES (http://readwrite.com/2012/02/29/interview_changing_engines_mid-flight_qa_with_goog/)
    "When I joined Google, it would take us about a month to crawl and build an index of about 50 million pages."
    Google in 1999: 1 month to crawl and index 50MM pages; 10k queries per day.
    Google in 2006: 10k queries per second.
    Google in 2012: 1 minute to index 50MM pages.
  45. @skimbrel AND SOMETIMES STILL IS NOT GOOGLE.
    On good authority: many things internal to Google still run on vanilla MySQL. Really! Even Google doesn't solve problems they don't have. Which goes to show…
  46. @skimbrel "BORING" TECH CAN GO REALLY FAR
    PyCon US 2017: Instagram is still a Django monolith! And doesn't even seem to be using asyncio! Horizontally sharded RDBMSes — 15-year-old tech that goes 20k MPS at Twilio and still gives us full ACID.
  47. @skimbrel EXPONENTIAL GROWTH FEELS SLOW AT FIRST
    If you and your product are lucky enough to experience the joys of irrational exuberance and exponential growth…
    - The low part of the curve is gentle enough to give you warning
    - There is no single point where your system will keel over and die instantly
  48. @skimbrel ITERATE ITERATE ITERATE
    Find the Most On Fire thing. Evolve or replace it. Repeat.
  49. @skimbrel OK, I'M NOT GOOGLE (YET)… SO WHAT?
    At this point in the talk I hope you're starting to think "OK, I'm not Google (yet). So what? What does this mean I should worry about instead?"
  50. @skimbrel USER TRUST ABOVE ALL ELSE
    Maintain your users' trust. Meet their needs.
  51. @skimbrel FAST, SAFE ITERATION
    Move fast *without* breaking things. Your team's time is one of your most precious resources; make the most of it.
  52. @skimbrel HEALTHY TEAMS
    To do that, our team needs to be healthy — I'll focus on on-call in a bit, but there are plenty of other things to consider.
  53. @skimbrel INCLUSIVE TEAMS
    Beyond on-call being manageable, having an inclusive team will help you. More here later too.
  54. @skimbrel LET'S BE BOWERBIRDS!
    You don't have to reinvent the wheel. Build bowers instead. Bowerbirds build structures from found materials to attract mates; the modern software ecosystem is our found environment. We want healthy relationships with our users and our team — find what we need and combine it. First: technical decisions within this framework, and then how to run a team and business.
  55. @skimbrel BE A PICKY BIRD
    First, let's talk about picking technologies. OK, so we need a… bottle cap, it seems. Database? Browser framework? Web server? Who knows. What do we want to think about?
  56. @skimbrel PROJECT MATURITY
    Not brand-spanking new. Not in the Apache Attic.
  57. @skimbrel MAINTAINERSHIP
    Not the originating company. Apache is the standard here if the project isn't big enough for, say, a DSF. Release velocity?
  58. @skimbrel SECURITY
    Search for CVEs. How many? Were they resolved? How quickly? How hard is your deployment going to be?
  59. @skimbrel STABILITY
    Two types! API stability (is v2.0 gonna come out and break all the things?) and system stability (does the database, well, database?).
  60. @skimbrel PROJECT ECOSYSTEM
    Library support for your language(s). Developer awareness and familiarity — can you hire people fast enough, and how fast do they ramp up? Will consultants know it? Picking tech that everyone knows means you won't have to wait three months for your new developers to be productive on your stack.
  61. @skimbrel "OUT OF THE BOX"
    "Out of the box"-iness, a.k.a. friction. What's the first 30 minutes like? Are there Dockerfiles? Chef cookbooks?
  62. @skimbrel DOCUMENTATION
    Existent? Up-to-date? Comprehensive? Searchable? Discoverable?
  63. @skimbrel SUPPORT AND CONSULTANTS
    Can you get a support contract from *someone*? When your main DB dies at 1 AM and your backup turns out to be corrupt… you will want help.
  64. @skimbrel LICENSING LANDMINES
    GPL software can't go in Apple's App Store! Or, in the news recently: Apache declared Facebook's license-plus-patent-grant model no good. Panic ensued; Facebook ended up re-licensing with MIT.
  65. @skimbrel BUY VS BUILD
    Open source and DIY obviously aren't our only choices. We can pay money for things! How do we decide?
  66. @skimbrel WHAT WOULD IT COST TO BUILD IT?
  67. @skimbrel HOW LONG WOULD IT TAKE?
    And what would you lose in the meantime to not having it tomorrow? Not only do you not get The Shiny tomorrow, you have to choose something else *not* to build, because you're using up some dev time for this.
  68. @skimbrel HOW HARD IS IT TO REPLACE?
    What happens if the vendor goes down? Goes out of business?
  69. @skimbrel RELATIONSHIPS
    Last: how do we run services, projects, and businesses? What should our relationships with our customers (whom we care about deeply, because we want to acquire a billion of them) look like? What does a healthy team look like?
  70. @skimbrel TEAMS
    So, about healthy teams. I'm going to talk about a few things here: on-call, psychological safety, and inclusivity.
  71. @skimbrel SUSTAINABLE ON-CALL
    First up, on-call. We have a problem with on-call and pager rotations. Who does on-call right? Hospitals, nuclear power plants, firefighters…
  72. @skimbrel 168 HOURS ÷ 40-HOUR WORK WEEK = 4.2 PEOPLE
    Let's do some math.
  73. @skimbrel 168 HOURS ÷ 40-HOUR WORK WEEK = 4.2 → 5 PEOPLE
    Oops, we can't have 0.2 of a person. Five.
  74. @skimbrel 168 HOURS ÷ 40-HOUR WORK WEEK = 4.2 → 5 → 6 PEOPLE
    Are we okay with only 32 hours per week of slack for PTO and sick time? Better make it six. Show of hands, please. Right. So how do we make on-call less awful?
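    The arithmetic on these three slides can be sketched as a back-of-the-envelope helper (not anything from the talk itself; the function name is mine):

    ```python
    import math

    HOURS_PER_WEEK = 168   # a pager is live around the clock
    WORK_WEEK = 40         # hours one person can reasonably cover

    def oncall_staffing(coverage_hours=HOURS_PER_WEEK, per_person=WORK_WEEK):
        """Minimum rotation size to cover a 24/7 pager with 40-hour weeks."""
        exact = coverage_hours / per_person            # 4.2 people on paper
        minimum = math.ceil(exact)                     # 5 — can't hire 0.2 of a person
        slack = minimum * per_person - coverage_hours  # hours/week left for PTO, sick time
        return exact, minimum, slack

    exact, minimum, slack = oncall_staffing()
    print(f"{exact:.1f} people on paper, {minimum} in practice, "
          f"{slack} spare hours/week across the whole rotation")
    ```

    Five people leave only 32 spare hours a week for the entire rotation's vacations and sick days — hence the slide's push to six.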
  75. @skimbrel EMPOWERMENT AND AUTOMATION
    One of the things that's come out of devops culture is that employing humans to be robots is bad. So don't do it. On-call's job: get paged at 2 AM *maybe once a week, ideally once a month*, find the thing that broke, and make it *never do that again*. Give them the time and space to do this.
  76. @skimbrel APPROPRIATE AVAILABILITY AND SCALABILITY
    As shown earlier: you won't go from 10k/day to 10k/sec overnight, or even in a year. Have an obvious path to 10x and a line of sight to 100x. Also think about how available and reliable you need to be — telecom vs., say, a doctor's-office appointment system.
  77. @skimbrel 99% UPTIME?
    MATH TIME AGAIN! 3.65 days/year. 7.3 hours/month. 1.68 hours/week. 14.4 minutes/day.
  78. @skimbrel THREE NINES
    8.76 hours/year. 43.8 minutes/month. 10.1 minutes/week. 1.44 minutes/day.
  79. @skimbrel *FOUR* NINES?
    52.56 minutes/year. 1.01 minutes/week. 8.64 seconds/day.
  80. @skimbrel FIVE?!
    5.26 minutes/year.
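    The nines on the last few slides fall straight out of this arithmetic; a quick sketch (using a 365.25-day year and an average-length month, so the numbers differ slightly from the slides' rounding):

    ```python
    def downtime_budget(availability):
        """Allowed downtime, in seconds, for a given availability fraction."""
        down = 1.0 - availability
        per_year = down * 365.25 * 24 * 3600
        return {
            "per_year":  per_year,
            "per_month": per_year / 12,
            "per_week":  down * 7 * 24 * 3600,
            "per_day":   down * 24 * 3600,
        }

    for label, avail in [("99%", 0.99), ("99.9%", 0.999),
                         ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        b = downtime_budget(avail)
        print(f"{label:>7}: {b['per_year'] / 3600:8.2f} h/year, "
              f"{b['per_day'] / 60:7.2f} min/day")
    ```

    Running the table before you sign an SLA makes the next slide's question concrete: do you actually need a budget measured in seconds per day?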
  81. @skimbrel DO YOU NEED THAT SLA?
    Don't overcommit yourself. Office appointment app? Odds are your users *won't even notice* an offline DB migration over the weekend.
  82. @skimbrel HEALTHY TEAMS
    A humane on-call schedule, consciously chosen SLAs, and sensible alerting create a culture where people who might otherwise not have joined can show up — people with children, disabled people, etc. Which goes hand in hand with building a safe and inclusive work environment.
  83. @skimbrel PSYCHOLOGICAL SAFETY
    Google's Project Aristotle study found psychological safety to be the strongest predictor of effective teams.
  84. @skimbrel INCLUSIVITY, NOT JUST DIVERSITY
    This is a *start* towards building an inclusive team. Diversity isn't enough — people of differing backgrounds need to be comfortable being themselves.
  85. @skimbrel SET GROUND RULES
    Set some ground rules with your teams — make a charter. Some things that might come up:
    - Code reviews
    - Meeting etiquette (no interruptions; the 3-in-1 / 1-in-3 rule; give credit)
    - Space for learning (don't feign surprise; no RTFM)
    - Handling conflict
  86. @skimbrel ADDRESS BIAS
    Be aware of, and take steps to address, conscious and unconscious bias.
  87. @skimbrel RETAIN AND PROMOTE
    It's not enough just to hire women/POC/queer people/disabled people/… — ensure they have equal access to growth opportunities.
  88. @skimbrel GET PROFESSIONAL ASSISTANCE
    Final note: there are consulting firms helping with this. Engage one and pay them — don't just make the few URMs you do have do this work as an unpaid side gig!
  89. @skimbrel SUMMARY: HUMANS FIRST
    So: make your on-call reasonable, and make your teams inclusive and safe for *everyone* working with you, because at the end of the day… you work with humans first.
  90. @skimbrel HAPPY USERS
    Closing with users — how do we keep *them* happy?
  91. @skimbrel CUSTOMER EMPATHY
    Have empathy.
  92. @skimbrel KNOW YOUR USER TECH BASE
    First, this means knowing who your users are. Trade-offs, as always, but make sure you have the data and the stories.
  93. @skimbrel KNOW YOUR IMPACT
    Empathy also means knowing the impact on your users when you make changes, or when you go down. Especially when you go down.
  94. @skimbrel SET EXPECTATIONS
    And using that empathy, manage your users' expectations *ahead of time*. "Underpromise but overdeliver" is *always* a good strategy.
  95. @skimbrel DEGRADE GRACEFULLY
    Netflix has default recommendations in case the personalization engine is down when you open the app.
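    A minimal sketch of that Netflix-style fallback. Every name here is hypothetical — `fetch_personalized` stands in for whatever recommendation service your app calls, and the outage is simulated — but the shape is the point: catch the failure and serve something static instead of failing the whole page.

    ```python
    import logging

    DEFAULT_ROW = ["Popular Title A", "Popular Title B", "Popular Title C"]

    def fetch_personalized(user_id):
        # Stand-in for a real service call; simulate the engine being down.
        raise TimeoutError("personalization engine unreachable")

    def recommendations(user_id):
        """Serve the personalized row, degrading to a static default on failure."""
        try:
            return fetch_personalized(user_id)
        except Exception:
            logging.warning("personalization unavailable; serving defaults")
            return DEFAULT_ROW

    print(recommendations(42))  # the static default row, not a 500 error
    ```

    The user sees slightly worse recommendations; they do not see an error page. That asymmetry is what graceful degradation buys you.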
  96. @skimbrel (OVER)COMMUNICATE
    TALK TO YOUR USERS. Update your status page when you even *think* there might be a problem. Speaking of status pages…
  97. @skimbrel AMAZON (https://aws.amazon.com/message/41926/)
    "we were unable to update the individual services' status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3"
    Yes, this happened. Put your status page somewhere completely independent of your infrastructure. You're on AWS? Great, put it on Google Cloud Platform.
  98. @skimbrel GITLAB DATABASE FAILURE (https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/)
    GitLab had a major incident with their primary database deployment, and kept a public Google Doc with incident notes in realtime.
  99. @skimbrel (OVER)COMMUNICATE
    Err towards overcommunication. Staff up social media and Zendesk (or whatever you use). AND LISTEN TO THOSE PEOPLE.
  100. @skimbrel MEASURE SUPPORT PERFORMANCE
    Uptime isn't the only SLA!
    - Time to first response
    - Time to resolution
    - Overall satisfaction score
    - etc.
  101. @skimbrel DISASTER RECOVERY
    Because it *will* happen.
  102. @skimbrel IDENTIFY FAULT DOMAINS
  103. @skimbrel FAULT TOLERANCE HAS COSTS
    How much does it cost to survive the failure of a host? An AWS AZ? An AWS region? …?
  104. @skimbrel PRACTICE
    Exercise your failover mechanisms and backup recovery ahead of time, in controlled conditions. You will thank me later.
  105. @skimbrel SECURITY
    Please do consider security.
  106. @skimbrel OWASP
    The Open Web Application Security Project: immensely useful guides to just about everything.
  107. @skimbrel KNOW YOUR THREAT MODEL
    - Valuable assets
    - Vectors of attack
    - Mitigations
  108. @skimbrel DON'T CHECK CREDENTIALS INTO GIT
    Yes, I really need to say this.
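    One common way to keep credentials out of git is to read them from the environment at startup. A small sketch (the variable name `DATABASE_PASSWORD` is just an example; use whatever your deployment or secret manager provides):

    ```python
    import os

    def get_db_password():
        """Read the DB password from the environment, never from source."""
        password = os.environ.get("DATABASE_PASSWORD")
        if password is None:
            # Fail loudly at startup rather than limping along unconfigured.
            raise RuntimeError(
                "DATABASE_PASSWORD is not set; configure it in your "
                "environment or secret manager, not in source control"
            )
        return password
    ```

    A hardcoded secret lives forever in git history even after you delete it; an environment variable never enters the repository at all.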
  109. @skimbrel AND DON'T DO THIS
  110. @skimbrel COMMUNICATE (AGAIN)
    Treat security breaches like any other incident. The longer you keep one secret, the worse the backlash will be. What was compromised? For how many people? How? Can it happen again (no, it can't)?
  111. So that was a lot! I hope this advice helps you get more confident and comfortable no matter how big or small your system is. We may not all be Google or Facebook, but we can all learn from their paths to the dizzying heights of scale, and we can all adopt code and ideas from them and everyone else who came before us to build amazing new bowers of technology for our users.
  112. And finally… before I thought to use bowerbirds as the metaphor, the best thing I had was dung beetles. Aren't you glad you got a talk with pretty bird pictures instead?
  113. @skimbrel HAPPY BOWER-BUILDING!
  114. @skimbrel FURTHER READING
    https://samkimbrel.com/posts/bowerbirds.html
    I was a bowerbird when I built this talk, so here are some of the pieces that inspired me. (Or there will be, shortly.)
  115. @skimbrel SOURCES
    CC BY-NC-ND 2.0 ccdoh1 https://www.flickr.com/photos/ccdoh1/5282484075/
    CC BY 2.0 rileyfive https://www.flickr.com/photos/rileyfive/25506971724
    CC BY-SA 3.0 Andrew West https://commons.wikimedia.org/wiki/File:Wikipedia_view_distribution_by_article_rank.png
    CC BY-ND 2.0 Melanie Underwood https://www.flickr.com/photos/warblerlady/7664022750/
    CC BY-NC 2.0 Nick Morieson https://www.flickr.com/photos/ngmorieson/8056110806/
    CC BY-NC-ND 2.0 Julie Bergher https://www.flickr.com/photos/sunphlo/11578609646/
    CC BY-NC-ND 2.0 Nathan Rupert https://www.flickr.com/photos/nathaninsandiego/20291539244/
    CC BY-NC-ND 2.0 Nathan Rupert https://www.flickr.com/photos/nathaninsandiego/20726159958/
    CC BY-NC-ND 2.0 Julie Burgher https://www.flickr.com/photos/sunphlo/11522540164/
    CC BY-NC-ND 2.0 Neil Saunders https://www.flickr.com/photos/nsaunders/22748694318/
    CC BY 2.0 thinboyfatter https://www.flickr.com/photos/1234abcd/4717190370/
    CC BY-SA 2.0 Jim Bendon https://www.flickr.com/photos/jim_bendon_1957/11722386055
    https://cispa.saarland/wp-content/uploads/2015/02/MongoDB_documentation.pdf
    CC BY-SA 3.0 Kay-africa https://commons.wikimedia.org/wiki/File:Flightless_Dung_Beetle_Circellium_Bachuss,_Addo_Elephant_National_Park,_South_Africa.JPG
    CC BY-SA 2.0 Robyn Jay https://www.flickr.com/photos/learnscope/14602494872