Slide 1

Slide 1 text

@skimbrel BOWERBIRDS OF TECHNOLOGY, PYCON 2018. SAM KITAJIMA-KIMBREL, @SKIMBREL

Slide 2

Slide 2 text

These are bowerbirds! They build these structures called bowers out of sticks and colorful objects they find in their environment in an effort to attract mates. More on them later; for now enjoy the nice photos of birds that I found on Flickr.

Slide 3

Slide 3 text

@skimbrel CAL HENDERSON, DJANGOCON US 2008: "Most websites aren't in the top 100 web sites." Speaking of Flickr, Cal Henderson used to work there. "It turns out all but 100 of them are not in the top 100."

Slide 4

Slide 4 text

@skimbrel ZIPF DISTRIBUTION Zipfian distributions: # links / traffic / users / whatever metric you want is inversely proportional to rank among all sites. (This graph happens to be Wikipedia articles, and the axes are log/log.) Many empirical studies and measurements of the web have shown that this holds true.
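The "inversely proportional to rank" claim can be sketched in a few lines of Python. This is an illustrative model only; `zipf_share` is my own helper name, not something from the talk.

```python
# A minimal sketch of a Zipfian distribution: the metric (links,
# traffic, users) for the item at rank k is proportional to 1/k.
def zipf_share(rank, n_items):
    """Fraction of the total metric held by the item at `rank` (1-based)."""
    harmonic = sum(1 / k for k in range(1, n_items + 1))
    return (1 / rank) / harmonic

# Among 100 sites, the #1 site alone holds roughly 19% of the total,
# while the entire bottom half combined holds under 15%.
top_site = zipf_share(1, 100)
bottom_half = sum(zipf_share(k, 100) for k in range(51, 101))
```

The steepness is the whole point of the slide: the head of the curve lives in a different world from the long tail the rest of us occupy.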

Slide 5

Slide 5 text

@skimbrel YOU ARE NOT GOOGLE, AND THAT'S OKAY! Or: most of us, excepting the ones who work at Google, are not Google. And that's okay! Tons of products do just great at not-Google scale.

Slide 6

Slide 6 text

@skimbrel I AM ALSO NOT GOOGLE I am also not Google. And that is me with a slightly different hair color, if you're in the cheap seats. Currently at Nuna (healthcare data, you haven't heard of it), previously at Twilio; eight years on large-and-fast-growing-but-not-Facebook-scale web services.

Slide 7

Slide 7 text

@skimbrel WHAT ARE GOOGLE'S PROBLEMS? What do Facebook, Amazon, and Google worry about? - Absurdly high throughput and data storage requirements - Tens of thousands of servers in dozens to hundreds of datacenters worldwide - Thousands of engineers and the ability to specialize them into very, very tiny niches - Virtually limitless resources and the patience to train people on their systems How does this manifest? Let's look at examples.

Slide 8

Slide 8 text

@skimbrel ABSURDLY HIGH THROUGHPUT AND STORAGE DEMANDS

Slide 9

Slide 9 text

@skimbrel TENS OF THOUSANDS OF SERVERS, HUNDREDS OF DATACENTERS

Slide 10

Slide 10 text

@skimbrel THOUSANDS OF DEVS

Slide 11

Slide 11 text

@skimbrel (NEAR-)UNLIMITED RESOURCES

Slide 12

Slide 12 text

@skimbrel CASE STUDIES Let's do a few brief case studies.

Slide 13

Slide 13 text

@skimbrel UBER SCHEMALESS* *INVOKING UBER AS A TECHNOLOGICAL EXAMPLE DOES NOT CONSTITUTE AN ENDORSEMENT OF UBER'S BUSINESS STRATEGY, ETHICS, OR CULTURE (Uber has given us plenty of bad examples in non-technological things, and this is emphatically not an endorsement of any of their behavior towards human beings.) Case study: Uber hit scaling issues with Postgres and built themselves a new datastore. What were they looking for?

Slide 14

Slide 14 text

@skimbrel "LINEARLY ADD CAPACITY BY ADDING MORE SERVERS"

Slide 15

Slide 15 text

@skimbrel "FAVOR WRITE AVAILABILITY OVER READ-YOUR-WRITE SEMANTICS"

Slide 16

Slide 16 text

@skimbrel EVENT NOTIFICATIONS (TRIGGERS) Side note: "we had an asynchronous event system built on Kafka 0.7 and we couldn't get it to run lossless". Have you tried upgrading?

Slide 17

Slide 17 text

@skimbrel WHAT IS IT, THEN? So what did they build?

Slide 18

Slide 18 text

@skimbrel "APPEND-ONLY SPARSE THREE-DIMENSIONAL PERSISTENT HASH MAP, VERY SIMILAR TO GOOGLE'S BIGTABLE" To which my only reply is this comic.

Slide 19

Slide 19 text

This comic is probably famous enough now but here it is again.

Slide 20

Slide 20 text

@skimbrel "So, how do I query the database?" "It's not a database. It's a key-value store!"

Slide 21

Slide 21 text

@skimbrel "Ok, it's not a database. How do I query it?" "You write a distributed map reduce function in Erlang!"

Slide 22

Slide 22 text

@skimbrel "Did you just tell me to go **** myself?" "I believe I did, Bob."

Slide 23

Slide 23 text

@skimbrel THIS HAS A COST Point is: this has a cost.

Slide 24

Slide 24 text

@skimbrel NEW ABSTRACTIONS The boundary between app and database changes: the app has to know and enforce schemas and persistence strategies.

Slide 25

Slide 25 text

@skimbrel EVENTUAL CONSISTENCY You can't read things you just wrote, and neither can other processes.

Slide 26

Slide 26 text

@skimbrel FLEXIBLE QUERIES - Have to know query patterns ahead of time - Mandatory sharding: can't read globally without extra work - No joins

Slide 27

Slide 27 text

@skimbrel DEVELOPER FAMILIARITY - Can't hire fast, or can't ramp fast - People aren't going to walk in knowing this - Good luck with contractors

Slide 28

Slide 28 text

@skimbrel AMAZON AND SERVICE ARCHITECTURE Steve Yegge quit Amazon and went to Google, then accidentally made public a long rant about how AWS was going to eat Google's lunch. A major focus: how Amazon got its service-oriented architecture. In 2002-ish, Bezos gave the following orders.

Slide 29

Slide 29 text

@skimbrel "All teams will henceforth expose their data and functionality through service interfaces."

Slide 30

Slide 30 text

@skimbrel "Teams must communicate with each other through these interfaces."

Slide 31

Slide 31 text

@skimbrel "There will be no other form of interprocess communication allowed […] the only communication allowed is via service interface calls over the network."

Slide 32

Slide 32 text

@skimbrel "It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn't matter."

Slide 33

Slide 33 text

@skimbrel "All service interfaces, without exception, must be designed from the ground up to be externalizable." Externalizable here means "exposed to the outside world", i.e. to customers, and sold as a product.

Slide 34

Slide 34 text

@skimbrel "Anyone who doesn't do this will be fired." So yeah. Amazon's way of making systems and developer teams scale. They were and are serious about this.

Slide 35

Slide 35 text

@skimbrel AMAZON LEARNED SOME THINGS This also had a cost. Steve, on what Amazon learned:

Slide 36

Slide 36 text

@skimbrel "pager escalation gets way harder, because a ticket might bounce through 20 service calls before the real owner is identified"

Slide 37

Slide 37 text

@skimbrel "every single one of your peer teams suddenly becomes a potential DOS attacker"

Slide 38

Slide 38 text

@skimbrel "monitoring and QA are the same thing…"

Slide 39

Slide 39 text

@skimbrel "…the only thing still functioning in the server is the little component that knows how to say 'I'm fine, roger roger, over and out' in a cheery droid voice"

Slide 40

Slide 40 text

@skimbrel "you won't be able to find any of them without a service-discovery mechanism [...] which is itself another service" Which requires a service *registry*, which…

Slide 41

Slide 41 text

@skimbrel MASSIVELY SCALABLE INFRASTRUCTURE COSTS DEVELOPER TIME Again: the crux is that massively scalable infra costs dev time because the mental model is so much more complicated. And you don't have a lot of dev time.

Slide 42

Slide 42 text

@skimbrel BUT I WANT TO BE GOOGLE! "But I *want* to be Google!", you may cry.

Slide 43

Slide 43 text

@skimbrel GOOGLE WASN'T GOOGLE OVERNIGHT…

Slide 44

Slide 44 text

@skimbrel BEN GOMES (http://readwrite.com/2012/02/29/interview_changing_engines_mid-flight_qa_with_goog/) "When I joined Google, it would take us about a month to crawl and build an index of about 50 million pages." Google in 1999: one month to crawl and index 50MM pages; 10k queries per day. Google in 2006: 10k queries per second. Google in 2012: one minute to index 50MM pages.

Slide 45

Slide 45 text

@skimbrel AND SOMETIMES STILL IS NOT GOOGLE. On good authority: many things internal to Google still run on vanilla MySQL. Really! Even Google doesn't solve problems they don't have. Which goes to show…

Slide 46

Slide 46 text

@skimbrel "BORING" TECH CAN GO REALLY FAR PyCon US 2017: Instagram is still a Django monolith! And doesn't even seem to be using asyncio! Horizontally sharded RDBMSes: 15-year-old tech that goes 20k MPS at Twilio and still gives us full ACID.

Slide 47

Slide 47 text

@skimbrel EXPONENTIAL GROWTH FEELS SLOW AT FIRST If you and your product are lucky enough to experience the joys of irrational exuberance and exponential growth… - The low part of the curve is gentle enough to give you warning - There is no single point where your system will keel over and die instantly

Slide 48

Slide 48 text

@skimbrel ITERATE ITERATE ITERATE Find the Most On Fire thing. Evolve or replace it. Repeat.

Slide 49

Slide 49 text

@skimbrel OK, I'M NOT GOOGLE (YET)… SO WHAT? At this point in the talk I hope you're starting to think: "OK, I'm not Google (yet). So what? What does this mean I should worry about instead?"

Slide 50

Slide 50 text

@skimbrel USER TRUST ABOVE ALL ELSE Maintain your users' trust. Meet their needs.

Slide 51

Slide 51 text

@skimbrel FAST, SAFE ITERATION Move fast *without* breaking things. Your team's time is one of your most precious resources; make the most of it.

Slide 52

Slide 52 text

@skimbrel HEALTHY TEAMS To do that, our team needs to be healthy. I'll focus on on-call in a bit, but there are plenty of other things to consider.

Slide 53

Slide 53 text

@skimbrel INCLUSIVE TEAMS Beyond on-call being manageable, having an inclusive team will help you. More here later too.

Slide 54

Slide 54 text

@skimbrel LET'S BE BOWERBIRDS! You don't have to reinvent the wheel. Build bowers instead. Bowerbirds build structures from found materials to attract mates; the modern software ecosystem is our found environment. We want healthy relationships with our users and our team, so find what we need and combine it. First: technical decisions within this framework, and then how to run a team and business.

Slide 55

Slide 55 text

@skimbrel BE A PICKY BIRD First, let's talk about picking technologies. OK, so we need a… bottle cap, it seems. Database? Browser framework? Web server? Who knows. What do we want to think about?

Slide 56

Slide 56 text

@skimbrel PROJECT MATURITY Not brand-spanking new. Not in the Apache Attic.

Slide 57

Slide 57 text

@skimbrel MAINTAINERSHIP Ideally not just the originating company. Apache is the standard here if the project isn't big enough for, say, a DSF. Release velocity?

Slide 58

Slide 58 text

@skimbrel SECURITY Search for CVEs. How many? Were they resolved? How quickly? How hard is your deployment going to be?

Slide 59

Slide 59 text

@skimbrel STABILITY Two types! API stability (is v2.0 going to come out and break all the things?) and system stability (does the database, well, database?).

Slide 60

Slide 60 text

@skimbrel PROJECT ECOSYSTEM Library support for your language(s). Developer awareness and familiarity: can you hire people fast enough, and how fast do they ramp up? Will consultants know it? Picking tech that everyone knows means you won't have to wait three months for your new developers to be productive on your stack.

Slide 61

Slide 61 text

@skimbrel "OUT OF THE BOX" "Out of the box"-ness, a.k.a. friction. What are the first 30 minutes like? Are there Dockerfiles? Chef cookbooks?

Slide 62

Slide 62 text

@skimbrel DOCUMENTATION Existent? Up-to-date? Comprehensive? Searchable? Discoverable?

Slide 63

Slide 63 text

@skimbrel SUPPORT AND CONSULTANTS Can you get a support contract from *someone*? When your main DB dies at 1 AM and your backup turns out to be corrupt… you will want help.

Slide 64

Slide 64 text

@skimbrel LICENSING LANDMINES GPL software can't go in Apple's App Store! Or, in the news recently: Apache declared Facebook's license-plus-patent-grant model no good. Panic ensued; Facebook ended up relicensing with MIT.

Slide 65

Slide 65 text

@skimbrel BUY VS. BUILD Open source and DIY obviously aren't our only choices. We can pay money for things! How do we decide?

Slide 66

Slide 66 text

@skimbrel WHAT WOULD IT COST TO BUILD IT?

Slide 67

Slide 67 text

@skimbrel HOW LONG WOULD IT TAKE? And what would you lose in the meantime to not having it tomorrow? Not only do you not get The Shiny tomorrow, you have to choose something else *not* to build, because you're using up some dev time for this.

Slide 68

Slide 68 text

@skimbrel HOW HARD IS IT TO REPLACE? What happens if the vendor goes down? Goes out of business?

Slide 69

Slide 69 text

@skimbrel RELATIONSHIPS Last: how do we run services, projects, and businesses? What should our relationships with our customers (whom we care about deeply because we want to acquire a billion of them) look like? What does a healthy team look like?

Slide 70

Slide 70 text

@skimbrel TEAMS So, about healthy teams. I'm going to talk about a few things here: on-call, psychological safety, and inclusivity.

Slide 71

Slide 71 text

@skimbrel SUSTAINABLE ON-CALL First up, on-call. We have a problem with on-call and pager rotations. Who does on-call right? Hospitals, nuclear power plants, firefighters…

Slide 72

Slide 72 text

@skimbrel 168 HOURS ÷ 40-HOUR WORK WEEK = 4.2 PEOPLE Let's do some math.

Slide 73

Slide 73 text

@skimbrel 168 HOURS ÷ 40-HOUR WORK WEEK = 5 PEOPLE Oops, we can't have 0.2 of a person. Five.

Slide 74

Slide 74 text

@skimbrel 168 HOURS ÷ 40-HOUR WORK WEEK = 6 PEOPLE Are we okay with only 32 hours per week for PTO and sick time? Better make it six. Show of hands, please. Right. So how do we make on-call less awful?
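The arithmetic in these three slides can be sketched directly. `min_rotation_size` and its `slack_hours` knob are my own illustrative names, not from the talk; the point is simply that 24x7 coverage divided by a humane work week forces a minimum team size.

```python
import math

def min_rotation_size(weekly_hours=40, slack_hours=0):
    """Minimum people needed to cover a 24x7 (168-hour) on-call week
    without anyone exceeding weekly_hours, optionally reserving
    slack_hours of each person's week for PTO, sick time, and so on."""
    return math.ceil(168 / (weekly_hours - slack_hours))

min_rotation_size()               # 168 / 40 = 4.2, rounded up -> 5 people
min_rotation_size(slack_hours=8)  # reserve a day each: 168 / 32 -> 6 people
```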

Slide 75

Slide 75 text

@skimbrel EMPOWERMENT AND AUTOMATION One of the things that's come out of devops culture is that employing humans to be robots is bad. So don't do it. On-call's job: get paged at 2 AM *maybe once a week, ideally once a month*, find the thing that broke, and make it *never do that again*. Give them the time and space to do this.

Slide 76

Slide 76 text

@skimbrel APPROPRIATE AVAILABILITY AND SCALABILITY As shown earlier: you won't go from 10k/day to 10k/sec overnight, or even in a year. Have an obvious path to 10x, and line of sight to 100x. Also think about how available and reliable you need to be: telecom vs… say, a doctor's-office appointment system.

Slide 77

Slide 77 text

@skimbrel 99% UPTIME? 3.65 days/year. 7.2 hours/month. 1.68 hours/week. 14.4 minutes/day. MATH TIME AGAIN!

Slide 78

Slide 78 text

@skimbrel THREE NINES 8.76 hours/year. 43.8 minutes/month. 10.1 minutes/week. 1.44 minutes/day.

Slide 79

Slide 79 text

@skimbrel *FOUR* NINES? 52.56 minutes/year. 1.01 minutes/week. 8.64 seconds/day.

Slide 80

Slide 80 text

@skimbrel FIVE?! 5.26 minutes/year.
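The downtime budgets on the last four slides all fall out of one formula: allowed downtime = (1 - availability) × period. A quick sketch (`downtime_budget_minutes` is an illustrative helper name of mine):

```python
def downtime_budget_minutes(availability, period_hours=365 * 24):
    """Allowed downtime, in minutes, for a given availability over a period."""
    return (1 - availability) * period_hours * 60

downtime_budget_minutes(0.99) / (24 * 60)  # 99%: ~3.65 days/year
downtime_budget_minutes(0.999) / 60        # three nines: ~8.76 hours/year
downtime_budget_minutes(0.9999)            # four nines: ~52.6 minutes/year
downtime_budget_minutes(0.99999)           # five nines: ~5.26 minutes/year
```

Each extra nine divides the budget by ten, which is exactly why the next slide asks whether you actually need that SLA.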

Slide 81

Slide 81 text

@skimbrel DO YOU NEED THAT SLA? Don't overcommit yourself. Office appointment app? Odds are your users *won't even notice* an offline DB migration over the weekend.

Slide 82

Slide 82 text

@skimbrel HEALTHY TEAMS A humane on-call schedule, consciously chosen SLAs, and sensible alerting create a culture where people who might otherwise not have joined can show up: people with children, disabled people, and so on. Which goes hand in hand with building a safe and inclusive work environment.

Slide 83

Slide 83 text

@skimbrel PSYCHOLOGICAL SAFETY Google's research on team effectiveness (Project Aristotle) found psychological safety to be the strongest predictor of a successful team.

Slide 84

Slide 84 text

@skimbrel INCLUSIVITY, NOT JUST DIVERSITY This is a *start* towards building an inclusive team. Diversity isn't enough; people of differing backgrounds need to be comfortable being themselves.

Slide 85

Slide 85 text

@skimbrel SET GROUND RULES Set some ground rules with your teams; make a charter. Some things that might come up: - Code reviews - Meeting etiquette (no interruptions; the 3-in-1 / 1-in-3 rule; give credit) - Space for learning (don't feign surprise; no RTFM) - Handling conflict

Slide 86

Slide 86 text

@skimbrel ADDRESS BIAS Be aware of, and take steps to address, conscious and unconscious bias.

Slide 87

Slide 87 text

@skimbrel RETAIN AND PROMOTE It's not enough just to hire women/POC/queer people/disabled people/…; ensure they have equal access to growth opportunities.

Slide 88

Slide 88 text

@skimbrel GET PROFESSIONAL ASSISTANCE Final note: there are consulting firms that help with this. Engage one and pay them; don't just make the few underrepresented people you do have do this work as an unpaid side gig!

Slide 89

Slide 89 text

@skimbrel SUMMARY: HUMANS FIRST So: make your on-call reasonable, and make your teams inclusive and safe for *everyone* working with you, because at the end of the day… you work with humans first.

Slide 90

Slide 90 text

@skimbrel HAPPY USERS Closing with users: how do we keep *them* happy?

Slide 91

Slide 91 text

@skimbrel CUSTOMER EMPATHY Have empathy.

Slide 92

Slide 92 text

@skimbrel KNOW YOUR USER TECH BASE First, this means knowing who your users are. Trade-offs, as always, but make sure you have the data and the stories.

Slide 93

Slide 93 text

@skimbrel KNOW YOUR IMPACT Empathy also means knowing the impact on your users when you make changes, or when you go down. Especially when you go down.

Slide 94

Slide 94 text

@skimbrel SET EXPECTATIONS And using that empathy, manage your users' expectations *ahead of time*. "Underpromise but overdeliver" is *always* a good strategy.

Slide 95

Slide 95 text

@skimbrel DEGRADE GRACEFULLY Netflix has default recommendations in case the personalization engine is down when you open the app.
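The pattern behind that slide can be sketched in a few lines. These names (`get_recommendations`, `DEFAULT_RECOMMENDATIONS`, `fetch_personalized`) are hypothetical illustrations, not Netflix's actual API: the idea is simply to serve a canned fallback when the personalized path fails.

```python
# A minimal graceful-degradation sketch: if the personalization
# engine errors out, fall back to a static default list instead of
# failing the whole page.
DEFAULT_RECOMMENDATIONS = ["Popular Show A", "Popular Show B", "Popular Show C"]

def get_recommendations(user_id, fetch_personalized):
    """Return personalized recommendations, degrading to defaults on error."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        # In a real system you would also log/alert here; the user
        # still gets a working page either way.
        return DEFAULT_RECOMMENDATIONS
```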

Slide 96

Slide 96 text

@skimbrel (OVER)COMMUNICATE TALK TO YOUR USERS. Update your status page when you even *think* there might be a problem. Speaking of status pages…

Slide 97

Slide 97 text

@skimbrel AMAZON (https://aws.amazon.com/message/41926/) "we were unable to update the individual services' status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3" Yes, this happened. Put your status page somewhere completely independent of your infrastructure. You're on AWS? Great, put it on Google Cloud Platform.

Slide 98

Slide 98 text

@skimbrel GITLAB DATABASE FAILURE https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ GitLab had a major incident with their primary database deployment, and kept a public Google Doc with incident notes in real time.

Slide 99

Slide 99 text

@skimbrel (OVER)COMMUNICATE Err towards overcommunication. Staff up: social media, Zendesk or whatever you use. AND LISTEN TO THOSE PEOPLE.

Slide 100

Slide 100 text

@skimbrel MEASURE SUPPORT PERFORMANCE Uptime isn't the only SLA! - Time to first response - Time to resolution - Overall satisfaction score - etc.

Slide 101

Slide 101 text

@skimbrel DISASTER RECOVERY Because it _will_ happen.

Slide 102

Slide 102 text

@skimbrel IDENTIFY FAULT DOMAINS

Slide 103

Slide 103 text

@skimbrel FAULT TOLERANCE HAS COSTS How much does it cost to survive the failure of a: - Host - AWS AZ - AWS region - …?

Slide 104

Slide 104 text

@skimbrel PRACTICE Exercise your failover mechanisms and backup recovery ahead of time, under controlled conditions. You will thank me later.

Slide 105

Slide 105 text

@skimbrel SECURITY Please do consider security.

Slide 106

Slide 106 text

@skimbrel OWASP Open Web Application Security Project. Immensely useful guides to just about everything.

Slide 107

Slide 107 text

@skimbrel KNOW YOUR THREAT MODEL - Valuable assets - Vectors of attack - Mitigations

Slide 108

Slide 108 text

@skimbrel DON'T CHECK CREDENTIALS INTO GIT Yes, I really need to say this.
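One common alternative, sketched here as an assumption rather than the talk's prescription: keep secrets out of the repository entirely by reading them from the environment (or a secrets manager) and failing loudly at startup if one is missing. `require_env` is my own illustrative helper.

```python
import os

def require_env(name):
    """Return the named environment variable, or fail fast if unset.

    Failing at startup beats discovering a missing credential mid-request,
    and nothing secret ever lands in version control."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

# db_password = require_env("DB_PASSWORD")  # hypothetical usage
```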

Slide 109

Slide 109 text

@skimbrel AND DON'T DO THIS

Slide 110

Slide 110 text

@skimbrel COMMUNICATE (AGAIN) - Treat security breaches like any other incident. The longer you keep one secret, the worse the backlash will be. - What was compromised? For how many people? How? Can it happen again (no, it can't)?

Slide 111

Slide 111 text

So that was a lot! I hope this advice helps you get more content and comfortable no matter how big or small your system is. We may not all be Google or Facebook, but we can all learn from their paths to the dizzying heights of scale, and we can all adopt code and ideas from them and everyone else who came before us to build amazing new bowers of technology for our users.

Slide 112

Slide 112 text

And finally… before I thought to use bowerbirds as the metaphor, the best thing I had was dung beetles. Aren't you glad you got a talk with pretty bird pictures instead?

Slide 113

Slide 113 text

@skimbrel HAPPY BOWER-BUILDING!

Slide 114

Slide 114 text

@skimbrel FURTHER READING https://samkimbrel.com/posts/bowerbirds.html I was a bowerbird when I built this talk, so here are some of the pieces that inspired me. (Or there will be, shortly.)

Slide 115

Slide 115 text

@skimbrel SOURCES CC BY-NC-ND 2.0 ccdoh1 https://www.flickr.com/photos/ccdoh1/5282484075/ CC BY 2.0 https://www.flickr.com/photos/rileyfive/25506971724 CC BY-SA 3.0 Andrew West https://commons.wikimedia.org/wiki/File:Wikipedia_view_distribution_by_article_rank.png CC BY-ND 2.0 Melanie Underwood https://www.flickr.com/photos/warblerlady/7664022750/ CC BY-NC 2.0 Nick Morieson https://www.flickr.com/photos/ngmorieson/8056110806/ CC BY-NC-ND 2.0 Julie Bergher https://www.flickr.com/photos/sunphlo/11578609646/ CC BY-NC-ND 2.0 Nathan Rupert https://www.flickr.com/photos/nathaninsandiego/20291539244/ CC BY-NC-ND 2.0 Nathan Rupert https://www.flickr.com/photos/nathaninsandiego/20726159958/ CC BY-NC-ND 2.0 Julie Burgher https://www.flickr.com/photos/sunphlo/11522540164/ CC BY-NC-ND 2.0 Neil Saunders https://www.flickr.com/photos/nsaunders/22748694318/ CC BY 2.0 thinboyfatter https://www.flickr.com/photos/1234abcd/4717190370/ CC BY-SA 2.0 Jim Bendon https://www.flickr.com/photos/jim_bendon_1957/11722386055 https://cispa.saarland/wp-content/uploads/2015/02/MongoDB_documentation.pdf CC BY-NC-ND 2.0 Julie Bergher https://www.flickr.com/photos/sunphlo/11578609646/ CC BY-SA 3.0 Kay-africa https://commons.wikimedia.org/wiki/File:Flightless_Dung_Beetle_Circellium_Bachuss,_Addo_Elephant_National_Park,_South_Africa.JPG CC BY-SA 2.0 Robyn Jay https://www.flickr.com/photos/learnscope/14602494872