Sam Kitajima-Kimbrel - Bowerbirds of Technology: Architecture and Teams at Less-than-Google Scale

@skimbrel BOWERBIRDS OF TECHNOLOGY
PYCON 2018, SAM KITAJIMA-KIMBREL, @SKIMBREL

These are bowerbirds! They build these structures called bowers out of sticks and colorful objects they find in their environment in an effort to attract mates. More on them later; for now enjoy the nice photos of birds that I found on Flickr.

@skimbrel CAL HENDERSON, DJANGOCON US 2008
"Most websites aren't in the top 100 web sites."
Speaking of Flickr, Cal Henderson used to work there. "It turns out all but 100 of them are not in the top 100."

@skimbrel ZIPF DISTRIBUTION
Zipfian distributions: the number of links, traffic, users, or whatever metric you want is inversely proportional to a site's rank among all sites. (This graph happens to be Wikipedia articles, and the axes are log/log.) Many empirical studies and measurements of the web have shown this holds true.
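
A minimal sketch of what "inversely proportional to rank" implies, assuming an idealised Zipf law with exponent 1 and a hypothetical web of one million sites (both numbers are illustrative assumptions, not measurements from the talk):

```python
def zipf_shares(n_sites: int, exponent: float = 1.0) -> list[float]:
    """Share of total traffic per rank, with weight = 1 / rank**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_sites + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical web of one million sites (an assumption for illustration).
shares = zipf_shares(1_000_000)
top_100_share = sum(shares[:100])

print(f"Top 100 sites: {top_100_share:.1%} of all traffic")
print(f"Rank 1 gets {shares[0] / shares[9]:.1f}x the traffic of rank 10")
```

Under these assumptions the top 100 sites take a large slice of the total, and everything else shares the long tail. That tail is where most of us live.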

@skimbrel YOU ARE NOT GOOGLE, AND THAT'S OKAY!
Or: most of us, excepting the ones who work at Google, are not Google. And that's okay! Tons of products do just great at not-Google scale.

@skimbrel I AM ALSO NOT GOOGLE
I am also not Google. And that is me with a slightly different hair color, if you're in the cheap seats. Currently at Nuna (healthcare data, you haven’t heard of it), previously at Twilio; 8 years on large-and-fast-growing-but-not-Facebook-scale web services.

@skimbrel WHAT ARE GOOGLE'S PROBLEMS?
What do Facebook, Amazon, and Google worry about?
- Absurdly high throughput and data storage requirements
- Tens of thousands of servers in dozens to hundreds of datacenters worldwide
- Thousands of engineers and the ability to specialize them into very, very tiny niches
- Virtually limitless resources and the patience to train people on their systems
How does this manifest? Let's look at examples.

@skimbrel ABSURDLY HIGH THROUGHPUT AND STORAGE DEMANDS

@skimbrel TENS OF THOUSANDS OF SERVERS, HUNDREDS OF DATACENTERS

@skimbrel THOUSANDS OF DEVS

@skimbrel (NEAR-)UNLIMITED RESOURCES

@skimbrel CASE STUDIES
Let's do a few brief case studies.

@skimbrel UBER SCHEMALESS*
* INVOKING UBER AS A TECHNOLOGICAL EXAMPLE DOES NOT CONSTITUTE AN ENDORSEMENT OF UBER'S BUSINESS STRATEGY, ETHICS, OR CULTURE
(Uber has given us plenty of bad examples in non-technological things, and this is emphatically not an endorsement of any of their behavior towards human beings.)
Case study: Uber hit scaling issues with Postgres and made themselves a new datastore. What were they looking for?

@skimbrel "LINEARLY ADD CAPACITY BY ADDING MORE SERVERS"

@skimbrel "FAVOR WRITE AVAILABILITY OVER READ-YOUR-WRITE SEMANTICS"

@skimbrel EVENT NOTIFICATIONS (TRIGGERS)
Side note: "we had an asynchronous event system built on Kafka 0.7 and we couldn't get it to run lossless". Have you tried upgrading?

@skimbrel WHAT IS IT, THEN?
So what did they build?

@skimbrel "APPEND-ONLY SPARSE THREE-DIMENSIONAL PERSISTENT HASH MAP, VERY SIMILAR TO GOOGLE'S BIGTABLE"
To which my only reply is this comic.

This comic is probably famous enough now but here it
is again.

@skimbrel "So, how do I query the database?" "It's not
a database. It's a key- value store!"

@skimbrel "Ok, it's not a database. How do I query
it?" "You write a distributed map reduce function in Erlang!"

@skimbrel "Did you just tell me to go **** myself?"
"I believe I did, Bob."

@skimbrel THIS HAS A COST
Point is: this has a cost.

@skimbrel NEW ABSTRACTIONS
Boundary between app and database changes: the app has to know and enforce schemas and persistence strategies.

@skimbrel EVENTUAL CONSISTENCY
You can't read things you just wrote, and neither can other processes.
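
A toy sketch of the read-your-writes problem, assuming a store where writes land on a primary and reads are served from a replica that only catches up when `sync()` runs. The class and method names are hypothetical, and this is not how Uber's datastore is implemented:

```python
class LaggingReplicaStore:
    """Toy key-value store: writes go to a primary, reads come from a replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet copied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key):
        # Served from the replica, which may be behind the primary.
        return self.replica.get(key)

    def sync(self):
        # Stand-in for asynchronous replication eventually catching up.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


store = LaggingReplicaStore()
store.write("trip:42", "requested")
stale = store.read("trip:42")   # replica has not seen the write yet
store.sync()
fresh = store.read("trip:42")   # now the write is visible
```

In a real system the stale window is usually short, but application code still has to be written as if it can happen.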

@skimbrel FLEXIBLE QUERIES
- Have to know query patterns ahead of time
- Mandatory sharding: can't read globally w/o extra work
- No joins
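
To make "can't read globally w/o extra work" concrete, here is a minimal sketch of hash-based sharding. The shard count, key format, and helper names are assumptions for illustration. A lookup by key touches one shard; any other question has to visit all of them:

```python
import hashlib

N_SHARDS = 4  # illustrative; real deployments pick this carefully
shards = [dict() for _ in range(N_SHARDS)]


def shard_for(key: str) -> int:
    """Map a key to a shard with a stable hash (not Python's per-process hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS


def put(key: str, row: dict) -> None:
    shards[shard_for(key)][key] = row


def get(key: str):
    # Cheap: the key tells us exactly which shard to ask.
    return shards[shard_for(key)].get(key)


def scan_all(predicate) -> list:
    # Expensive: a query that is not keyed by the shard key must scatter to
    # every shard and gather the results in the application.
    return [row for shard in shards for row in shard.values() if predicate(row)]


put("user:1", {"name": "Ada", "plan": "pro"})
put("user:2", {"name": "Bob", "plan": "free"})
put("user:3", {"name": "Cy", "plan": "pro"})

pro_users = scan_all(lambda row: row["plan"] == "pro")
```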

@skimbrel DEVELOPER FAMILIARITY
- Can't hire fast or can't ramp fast
- People aren't gonna walk in knowing this
- Good luck w/ contractors

@skimbrel AMAZON AND SERVICE ARCHITECTURE
Steve Yegge quit Amazon and went to Google. He accidentally made public a long rant about how AWS was going to eat Google's lunch. Major focus: how Amazon got its service-oriented architecture. In 2002ish, Bezos gave the following orders.

@skimbrel "All teams will henceforth expose their data and functionality
through service interfaces."

@skimbrel "Teams must communicate with each other through these interfaces."

@skimbrel "There will be no other form of interprocess communication
allowed […]the only communication allowed is via service interface calls over the network."

@skimbrel "It doesn't matter what technology they use. HTTP, Corba,
Pubsub, custom protocols — doesn't matter."

@skimbrel "All service interfaces, without exception, must be designed from
the ground up to be externalizable." Externalizable here means "exposed to the outside world", i.e. to customers, and sold as a product.

@skimbrel "Anyone who doesn't do this will be fired."
So yeah. Amazon's way of making systems and developer teams scale. They were and are serious about this.

@skimbrel AMAZON LEARNED SOME THINGS
This also had a cost. Steve, on what Amazon learned:

@skimbrel "pager escalation gets way harder, because a ticket might bounce through 20 service calls before the real owner is identified"

@skimbrel "every single one of your peer teams suddenly becomes
a potential DOS attacker"

@skimbrel "monitoring and QA are the same thing…"

@skimbrel "…the only thing still functioning in the server is the little component that knows how to say 'I'm fine, roger roger, over and out' in a cheery droid voice"
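
One way to read this quote: a health check that only proves the process can answer tells you very little. A minimal sketch contrasting a shallow check with one that exercises dependencies follows; the function names and the shape of the `checks` mapping are hypothetical, not any framework's API:

```python
def shallow_health() -> dict:
    """The cheery droid: reports ok as long as this function can run at all."""
    return {"status": "ok"}


def deep_health(checks: dict) -> dict:
    """Run one callable per dependency and report each result.

    `checks` maps a dependency name to a zero-argument callable that returns
    a truthy value when the dependency works. An exception counts as failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = "ok" if all(results.values()) else "degraded"
    return {"status": status, "checks": results}


# Hypothetical dependencies: the database answers, the queue raises.
report = deep_health({
    "database": lambda: True,
    "queue": lambda: 1 / 0,
})
```

Here `shallow_health()` would still say "ok" while `report` shows the queue is broken.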

@skimbrel "you won't be able to find any of them without a service-discovery mechanism [...] which is itself another service"
Which requires a service *registry*, which…

@skimbrel MASSIVELY SCALABLE INFRASTRUCTURE COSTS DEVELOPER TIME
Again: the crux is that massively-scalable infra costs dev time because the mental model is so much more complicated. And dev time is something you don't have a lot of.

@skimbrel BUT I WANT TO BE GOOGLE!
"But I *want* to be Google!", you may cry.

@skimbrel GOOGLE WASN'T GOOGLE OVERNIGHT…

@skimbrel BEN GOMES (http://readwrite.com/2012/02/29/interview_changing_engines_mid-flight_qa_with_goog/)
“When I joined Google, it would take us about a month to crawl and build an index of about 50 million pages.”
Google in 1999:
- 1 month to crawl and index 50MM pages
- 10k queries per day
Google 2006: 10k queries per second
Google 2012: 1 minute to index 50MM pages
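
A quick back-of-the-envelope sketch of what those query numbers imply, assuming smooth exponential growth between the two data points (a simplification; the dates and volumes are the ones quoted above):

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

queries_per_day_1999 = 10_000
queries_per_day_2006 = 10_000 * SECONDS_PER_DAY  # 10k per second, all day
years = 2006 - 1999

total_growth = queries_per_day_2006 / queries_per_day_1999
annual_growth = total_growth ** (1 / years)
doubling_time_months = 12 * math.log(2) / math.log(annual_growth)

print(f"Total growth: {total_growth:,.0f}x over {years} years")
print(f"Implied annual growth: {annual_growth:.1f}x")
print(f"Implied doubling time: {doubling_time_months:.1f} months")
```

Under that assumption traffic grew roughly fivefold per year. That is fast, but it still gave seven years of warning rather than a single overnight jump.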

@skimbrel AND SOMETIMES STILL IS NOT GOOGLE.
On good authority: many things internal to Google still run on vanilla MySQL. Really! Even Google doesn't solve problems they don't have. Which goes to show…

@skimbrel "BORING" TECH CAN GO REALLY FAR
PyCon US 2017: Instagram is still a Django monolith! And doesn't even seem to be using asyncio!
Horizontally-sharded RDBMSes: 15-year-old tech that goes 20k MPS at Twilio and still gives us full ACID.

@skimbrel EXPONENTIAL GROWTH FEELS SLOW AT FIRST
If you and your product are lucky enough to experience the joys of irrational exuberance and exponential growth…
- Low part of the curve is gentle enough to give you warning
- There is no single point where your system will keel over and die instantly

@skimbrel ITERATE ITERATE ITERATE
- Find the Most On Fire thing
- Evolve/replace it
- Repeat

@skimbrel OK, I'M NOT GOOGLE (YET)…. SO WHAT?
At this point in the talk I hope you're starting to think "OK, I'm not Google (yet). So what? What does this mean I should worry about instead?"

@skimbrel USER TRUST ABOVE ALL ELSE
Maintain your users' trust. Meet their needs.

@skimbrel FAST, SAFE ITERATION
Move fast *without* breaking things. Your team’s time is one of the most precious resources; make the most of it.

@skimbrel HEALTHY TEAMS
To do that, our team needs to be healthy. I will focus on on-call in a bit, but there are plenty of other things to consider.

@skimbrel INCLUSIVE TEAMS
Beyond on-call being manageable, having an inclusive team will help you. More here later too.

@skimbrel LET'S BE BOWERBIRDS!
You don't have to reinvent the wheel. Build bowers instead.
Bowerbirds: build structures from found materials to attract mates.
Modern software ecosystem is our found environment.
We want healthy relationships w/ our users and our team: find what we need & combine.
First: technical decisions within this framework, and then how to run a team and business.

@skimbrel BE A PICKY BIRD
First, let's talk about picking technologies. OK, so we need a… bottle cap, it seems. Database? Browser framework? Web server? Who knows. What do we want to think about?

@skimbrel PROJECT MATURITY
- Not brand-spanking new
- Not in Apache Attic

@skimbrel MAINTAINERSHIP
- Not the originating company
- Apache is the standard here if the project isn't big enough for, say, a DSF.
- Release velocity?

@skimbrel SECURITY
Search for CVEs. How many? Were they resolved? How quickly? How hard is your deployment going to be?

@skimbrel STABILITY
Two types!
- API stability (is v2.0 gonna come out and break all the things?)
- System stability (does the database, well, database)

@skimbrel PROJECT ECOSYSTEM
- Library support for your language(s)
- Developer awareness/familiarity: can you hire people fast enough and how fast do they ramp up?
- Will consultants know it?
Picking tech that everyone knows means you won't have to wait three months for your new developers to be productive on your stack.

@skimbrel "OUT OF THE BOX"
"Out of the box"-iness, aka friction. What's the first 30 minutes like? Are there Dockerfiles? Chef cookbooks?

@skimbrel DOCUMENTATION
Existent? Up-to-date? Comprehensive? Searchable? Discoverable?

@skimbrel SUPPORT AND CONSULTANTS
Can you get a support contract from *someone*? When your main DB dies at 1 AM and your backup turns out to be corrupt… you will want help.

@skimbrel LICENSING LANDMINES
GPL software can't go in Apple's App Store! Or in the news recently: Apache declared Facebook's license + patent grant model no good. Panic ensued; Facebook ended up re-licensing with MIT.

@skimbrel BUY VS BUILD
Open-source and DIY obviously aren't our only choices. We can pay money for things! How do we decide?

@skimbrel WHAT WOULD IT COST TO BUILD IT?

@skimbrel HOW LONG WOULD IT TAKE?
And what would you lose in the meantime to not having it tomorrow? Not only do you not get The Shiny tomorrow, you have to choose something else *not* to build because you're using up some dev time for this.

@skimbrel HOW HARD IS IT TO REPLACE?
What happens if the vendor goes down? Goes out of business?

@skimbrel RELATIONSHIPS
Last: how do we run services, projects, and businesses? What should our relationships with our customers (whom we care about deeply because we want to acquire a billion of them) look like? What does a healthy team look like?

@skimbrel TEAMS
So, about healthy teams. I’m going to talk about a few things here: on-call, psychological safety, and inclusivity.

@skimbrel SUSTAINABLE ON-CALL
First up, on-call. We have a problem with on-call and pager rotations. Who does on-call right? Hospitals, nuclear power plants, firefighters…

@skimbrel 168 HOURS ÷ 40 HOUR WORKWEEK = 4.2 PEOPLE
Let's do some math.

@skimbrel 168 HOURS ÷ 40 HOUR WORKWEEK = 5 PEOPLE (UP FROM 4.2)
Oops, we can't have 0.2 of a person. Five.

@skimbrel 168 HOURS ÷ 40 HOUR WORKWEEK = 6 PEOPLE (UP FROM 5)
Five people at 40 hours each leave only 32 spare hours per week (200 - 168) for PTO and sick time. Are we okay with that? Better make it six. Show of hands please. Right. So how do we make on-call be less awful?
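
The same arithmetic as a small sketch, so you can plug in your own workweek. The function names are illustrative, and it assumes strictly one person on call at a time with no overlap for handoffs:

```python
import math

HOURS_PER_WEEK = 7 * 24  # 168


def minimum_people(workweek_hours: float = 40) -> int:
    """Fewest whole people needed to cover every hour of the week."""
    return math.ceil(HOURS_PER_WEEK / workweek_hours)


def slack_hours(people: int, workweek_hours: float = 40) -> float:
    """Hours per week left over for PTO, sick time, and everything else."""
    return people * workweek_hours - HOURS_PER_WEEK


for team_size in (5, 6):
    print(f"{team_size} people: {slack_hours(team_size):.0f} spare hours/week")
```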

@skimbrel EMPOWERMENT AND AUTOMATION
One of the things that's come out of devops culture is that employing humans to be robots is bad. So don't do it. On-call's job: get paged at 2 AM *maybe once a week, ideally once a month*, find the thing that broke and make it *never do that again*. Give them the time and space to do this.

@skimbrel APPROPRIATE AVAILABILITY AND SCALABILITY
As shown earlier: you won't go from 10k/day to 10k/sec overnight. Or even in a year. Aim for an obvious path to 10x, and line of sight to 100x. Also think about how available and reliable you need to be: telecom vs… say, a doctor's office appointment system.

@skimbrel 99% UPTIME?
- 3.65 days/year
- 7.2 hours/month
- 1.68 hours/week
- 14.4 minutes/day
MATH TIME AGAIN!

@skimbrel THREE NINES
- 8.76 hours/year
- 43.8 minutes/month
- 10.1 minutes/week
- 1.44 minutes/day

@skimbrel *FOUR* NINES?
- 52.56 minutes/year
- 1.01 minutes/week
- 8.64 seconds/day

@skimbrel FIVE?!
- 5.26 minutes/year
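
The downtime figures on these slides are the allowed downtime for each availability target. A minimal sketch for reproducing them follows, assuming a 365-day year. Months are left out because the answer depends on which month length you assume:

```python
SECONDS_PER_DAY = 24 * 60 * 60

PERIOD_SECONDS = {
    "day": SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
    "year": 365 * SECONDS_PER_DAY,  # assumption: non-leap year
}


def downtime_budget(availability: float) -> dict[str, float]:
    """Seconds of allowed downtime per period for a given availability (0..1)."""
    return {
        period: (1.0 - availability) * seconds
        for period, seconds in PERIOD_SECONDS.items()
    }


for label, target in [("99%", 0.99), ("99.9%", 0.999),
                      ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    budget = downtime_budget(target)
    print(f"{label:>8}: {budget['year'] / 60:9.2f} min/year, "
          f"{budget['day']:8.2f} s/day")
```

Each extra nine cuts the budget by a factor of ten, which is why the jump from three nines to five is so expensive.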

@skimbrel DO YOU NEED THAT SLA?
Don’t overcommit yourself. Office appointment app? Odds are your users *won’t even notice* an offline DB migration over the weekend.

@skimbrel HEALTHY TEAMS
Humane on-call schedule + consciously-chosen SLAs + sensible alerting create a culture where people who might otherwise not have joined can show up: people with children, disabled people, etc. Which goes hand in hand w/ building a safe and inclusive work environment.

@skimbrel PSYCHOLOGICAL SAFETY
Google study etc.

@skimbrel INCLUSIVITY, NOT JUST DIVERSITY
This is a *start* towards building an inclusive team. Diversity isn’t enough: people of differing backgrounds need to be comfortable being themselves.

@skimbrel SET GROUND RULES
Set some ground rules with your teams: make a charter. Some things that might come up:
- Code reviews
- Meeting etiquette (no interruptions; 3 in 1 / 1 in 3 rule; give credit)
- Space for learning (don’t feign surprise; no RTFM)
- Handling conflict

@skimbrel ADDRESS BIAS
Be aware of and take steps to address conscious and unconscious bias.

@skimbrel RETAIN AND PROMOTE
It's not enough just to hire women/POC/queer people/disabled people/…; ensure they have equal access to growth opportunities.

@skimbrel GET PROFESSIONAL ASSISTANCE
Final note: there are consulting firms helping with this. Engage one and pay them; don’t just make the few URMs you do have do this work as an unpaid side gig!

@skimbrel SUMMARY: HUMANS FIRST
So: make your on-call reasonable, and make your teams inclusive and safe for *everyone* working with you, because at the end of the day… you work with humans first.

@skimbrel HAPPY USERS
Close with users: how do we keep *them* happy?

@skimbrel CUSTOMER EMPATHY
Have empathy.

@skimbrel KNOW YOUR USER TECH BASE
First, this means knowing who your users are. Trade-offs, as always, but make sure you have the data and stories.

@skimbrel KNOW YOUR IMPACT
Empathy also means knowing the impact on your users when you make changes, or when you go down. Especially when you go down.

@skimbrel SET EXPECTATIONS
And using that empathy, manage your users' expectations *ahead of time*. "Underpromise but overdeliver" is *always* a good strategy.

@skimbrel DEGRADE GRACEFULLY
Netflix has default recommendations in case the personalization engine is down when you open the app.
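
A minimal sketch of that fallback pattern. The function names and the default list are hypothetical, and this is not Netflix's actual implementation: try the personalized path, and serve a precomputed generic answer if it fails.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical precomputed fallback, e.g. refreshed by a periodic batch job.
DEFAULT_RECOMMENDATIONS = ["popular-title-1", "popular-title-2", "popular-title-3"]


def recommendations_for(user_id, personalize, defaults=DEFAULT_RECOMMENDATIONS):
    """Return personalized picks, or the defaults if personalization fails."""
    try:
        return personalize(user_id)
    except Exception:
        # Log it so someone can fix the real problem, but keep the page working.
        logger.warning("personalization failed for %s; serving defaults", user_id)
        return list(defaults)


def broken_personalizer(user_id):
    raise ConnectionError("personalization engine is down")


def working_personalizer(user_id):
    return [f"pick-for-{user_id}"]


degraded = recommendations_for("u1", broken_personalizer)
normal = recommendations_for("u1", working_personalizer)
```

In production you would likely add a timeout around the personalized call as well, since a slow dependency hurts users as much as a dead one.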

@skimbrel (OVER)COMMUNICATE
TALK TO YOUR USERS. Update your status page when you even *think* there might be a problem. Speaking of status pages…

@skimbrel AMAZON (https://aws.amazon.com/message/41926/)
"we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3"
Yes, this happened. Put your status page somewhere that is completely independent of your infrastructure. You're on AWS? Great, put it on Google Cloud Platform.

@skimbrel GITLAB DATABASE FAILURE
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
GitLab had a major incident with their primary database deployment. Public Google doc w/ incident notes in realtime.

@skimbrel (OVER)COMMUNICATE
Err towards overcommunication. Staff up:
- Social media
- Zendesk or w/e
AND LISTEN TO THOSE PEOPLE.

@skimbrel MEASURE SUPPORT PERFORMANCE
Uptime isn't the only SLA!
- Time to first response
- Time to resolution
- Overall satisfaction score
- etc

@skimbrel DISASTER RECOVERY
Because it _will_ happen.

@skimbrel IDENTIFY FAULT DOMAINS

@skimbrel FAULT TOLERANCE HAS COSTS
How much does it cost to survive a failure of a:
- Host
- AWS AZ
- AWS region
- …?
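
A rough sketch of the trade-off, under a loud assumption: replicas fail independently, which real hosts sharing a rack, AZ, or deploy pipeline often do not. The per-replica availability and unit cost are made-up inputs for illustration:

```python
def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is up.

    Assumes failures are independent, which is optimistic in practice.
    """
    return 1.0 - (1.0 - single) ** replicas


SINGLE_AVAILABILITY = 0.99   # hypothetical availability of one copy
UNIT_COST = 1.0              # hypothetical cost of running one copy

for n in (1, 2, 3):
    availability = combined_availability(SINGLE_AVAILABILITY, n)
    print(f"{n} replica(s): {availability:.6f} available, {n * UNIT_COST:.0f}x cost")
```

The cost grows linearly with each copy while the availability gain shrinks, and that is before counting the engineering time needed to make failover actually work.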

@skimbrel PRACTICE
Exercise your failover mechanisms and backup recovery ahead of time in controlled conditions. You will thank me later.

@skimbrel SECURITY
Please do consider security.

@skimbrel OWASP
Open Web Application Security Project. Immensely useful guides to just about everything.

@skimbrel KNOW YOUR THREAT MODEL
- Valuable assets
- Vectors of attack
- Mitigations

@skimbrel DON'T CHECK CREDENTIALS INTO GIT
- Yes, I really need to say this.
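
One common alternative, sketched under assumptions: the secret is injected through an environment variable by your deployment tooling. The variable name `EXAMPLE_API_KEY` is hypothetical. The point is that the code fails loudly instead of falling back to a value committed to the repository:

```python
import os


def load_api_key(name: str = "EXAMPLE_API_KEY", env=os.environ) -> str:
    """Read a secret from the environment; never from source control."""
    key = env.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; refusing to fall back to a hard-coded secret"
        )
    return key


# Usage with an explicit mapping, so the example runs without real secrets.
fake_env = {"EXAMPLE_API_KEY": "not-a-real-secret"}
api_key = load_api_key(env=fake_env)
```

A dedicated secrets manager is the sturdier option, but even this much keeps credentials out of your git history.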

@skimbrel AND DON'T DO THIS

@skimbrel COMMUNICATE (AGAIN)
- Treat security breaches like any other incident. The longer you keep it secret the worse the backlash will be.
- What was compromised? For how many people? How? Can it happen again (no, it can't)?

So that was a lot! I hope this advice helps
you get more content and comfortable no matter how big or small your system is. We may not all be Google or Facebook, but we can all learn from their paths to the dizzying heights of scale, and we can all adopt code and ideas from them and everyone else who came before us to build amazing new bowers of technology for our users.

And finally… before I thought to use bowerbirds as the metaphor, the best thing I had was dung beetles. Aren't you glad you got a talk with pretty bird pictures instead?

@skimbrel HAPPY BOWER-BUILDING!

@skimbrel FURTHER READING
https://samkimbrel.com/posts/bowerbirds.html
I was a bowerbird when I built this talk, so here are some of the pieces that inspired me. (Or there will be, shortly)

@skimbrel SOURCES
CC BY-NC-ND 2.0 ccdoh1 https://www.flickr.com/photos/ccdoh1/5282484075/
CC BY 2.0 https://www.flickr.com/photos/rileyfive/25506971724
CC BY-SA 3.0 Andrew West https://commons.wikimedia.org/wiki/File:Wikipedia_view_distribution_by_article_rank.png
CC BY-ND 2.0 Melanie Underwood https://www.flickr.com/photos/warblerlady/7664022750/
CC BY-NC 2.0 Nick Morieson https://www.flickr.com/photos/ngmorieson/8056110806/
CC BY-NC-ND 2.0 Julie Bergher https://www.flickr.com/photos/sunphlo/11578609646/
CC BY-NC-ND 2.0 Nathan Rupert https://www.flickr.com/photos/nathaninsandiego/20291539244/
CC BY-NC-ND 2.0 Nathan Rupert https://www.flickr.com/photos/nathaninsandiego/20726159958/
CC BY-NC-ND 2.0 Julie Burgher https://www.flickr.com/photos/sunphlo/11522540164/
CC BY-NC-ND 2.0 Neil Saunders https://www.flickr.com/photos/nsaunders/22748694318/
CC BY 2.0 thinboyfatter https://www.flickr.com/photos/1234abcd/4717190370/
CC BY-SA 2.0 Jim Bendon https://www.flickr.com/photos/jim_bendon_1957/11722386055
https://cispa.saarland/wp-content/uploads/2015/02/MongoDB_documentation.pdf
CC BY-NC-ND 2.0 Julie Bergher https://www.flickr.com/photos/sunphlo/11578609646/
CC BY-SA 3.0 Kay-africa https://commons.wikimedia.org/wiki/File:Flightless_Dung_Beetle_Circellium_Bachuss,_Addo_Elephant_National_Park,_South_Africa.JPG
CC BY-SA 2.0 Robyn Jay https://www.flickr.com/photos/learnscope/14602494872
