What do you look for in an SRE?
1. Automation
3. Curiosity
Slide 9
Slide 9 text
GLOSSARY OF SYSTEMS DESIGN
www.sitereliabilityenginering.co .
www.yurynino.com
Load Balancing
Data Partitioning
Proxies
Caching
Indexes
Redundancy
Replication
SQL vs NoSQL
Consistent Hashing
CAP Theorem
PACELC Theorem
Bloom
Quorum
Leader and Follower
Slide 10
Slide 10 text
Consistent Core
Follower Reads
Generation Clock
Gossip Dissemination
HeartBeat
Hybrid Clock
Idempotent Receiver
State Watch
Quorum
SYSTEMS DESIGN PATTERNS
https://martinfowler.com/articles/patterns-of-distributed-systems/
Slide 11
Slide 11 text
SYSTEMS DESIGN FALLACIES
The Network is Reliable
Latency is Zero
The Topology Doesn't Change
Transport Cost is Zero
Bandwidth is Infinite
The Network is Secure
Slide 12
Slide 12 text
WHAT IS NALSD?
NON ABSTRACT LARGE SYSTEM DESIGN
* Iterative style for designing and implementing systems.
* SRE ability to assess, design, and evaluate large systems.
* Robust and scalable designs with low operational costs.
Slide 13
Slide 13 text
NALSD IN DETAIL
Google SREs are expected to be able
to start resource planning with a basic
whiteboard diagram of a system,
think through the various scaling and
failure domains, and focus their design
into a concrete proposal for resources.
Slide 14
Slide 14 text
WHY NALSD?
Google has learned (the hard way) that the people
designing distributed systems need to develop and
continuously exercise the muscle of turning a design into
concrete estimates of resources at multiple steps in
the process.
Slide 15
Slide 15 text
NALSD IN DETAIL
One Machine
Consider running our entire application on a single computer.
Distributed System
Now we'll need multiple machines; what's the best design to join them?
Design Process
* Is it possible?
* Can we do better?
* Is it feasible?
* Is it resilient?
Initial Requirements
* Read & Understand
* Required SLOs
* Ask that you consider
Slide 16
Slide 16 text
HOW TO BEGIN?
https://danrl.com/sre-flash-cards/SRE%20Flash%20Cards.pdf
‘The numbers everyone should know’
From: https://cloud.google.com/blog/products/management-tools/sre-principles-and-flashcards-to-design-nalsd
* Time: main memory reference
* Time: round trip within same datacenter
* Time: read 1 MB sequentially from memory
* Speed: read sequentially from SSD
* Power of ten? ns / us / ms
Slide 17
Slide 17 text
HOW TO BEGIN?
Slide 18
Slide 18 text
USE CASE
● The Google AdWords service displays text advertisements on Google Web Search.
● The click-through rate (CTR) metric tells advertisers how well their ads are
performing.
● CTR is the ratio of the number of times the ad is clicked to the number of times it is shown.
AdWords Challenge
Design a system capable of measuring and reporting an accurate CTR for every AdWords ad.
Slide 19
Slide 19 text
INITIAL REQUIREMENTS
Each advertiser may have
multiple advertisements.
Each ad is keyed by ad_id
and is associated with a list
of search terms selected
by the advertiser.
* How often did this search term trigger this ad to be shown?
* How many times was the ad clicked by someone who saw it?
* With this information, we can calculate the CTR.
CTR: the number of clicks divided by the number of impressions.
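As a minimal sketch of the definition above, the following computes CTR per (ad_id, search_term) pair. The sample events and the choice to report 0.0 for an ad that was never shown are assumptions made for illustration; they are not part of the talk.

```python
from collections import Counter


def click_through_rate(clicks: int, impressions: int) -> float:
    """Clicks divided by impressions; 0.0 when the ad was never shown (assumption)."""
    return clicks / impressions if impressions else 0.0


# Hypothetical sample events, each keyed by (ad_id, search_term).
shown = [(1, "flowers"), (1, "flowers"), (1, "roses"), (2, "flowers")]
clicked = [(1, "flowers")]

impressions = Counter(shown)   # how often each search term triggered each ad
clicks = Counter(clicked)      # how often each shown ad was clicked

ctr_by_key = {
    key: click_through_rate(clicks[key], count)
    for key, count in impressions.items()
}
```

Counting impressions and clicks under the same key is what lets the two numbers be divided directly.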
Slide 20
Slide 20 text
INITIAL REQUIREMENTS
● We know our advertisers care about two things:
○ That the dashboard displays quickly!
○ That the data is recent.
Therefore, we will consider our requirements in terms of SLOs:
● 99.9% of dashboard queries complete in < 1 second.
● 99.9% of the time, the CTR data displayed is less than 5 minutes old.
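One way to make these two SLOs checkable is to measure the fraction of observations that fall under each threshold. The sketch below assumes we already have a list of query latencies and a list of sampled data ages in seconds; the function names and the sample data are hypothetical.

```python
LATENCY_SLO_SECONDS = 1.0        # from the slide: queries complete in < 1 second
FRESHNESS_SLO_SECONDS = 5 * 60   # from the slide: data less than 5 minutes old
TARGET = 0.999                   # from the slide: 99.9%


def fraction_within(values, threshold):
    """Fraction of observations strictly below the threshold."""
    values = list(values)
    if not values:
        return 1.0  # assumption: no observations means no violations
    return sum(v < threshold for v in values) / len(values)


def meets_slo(values, threshold, target=TARGET):
    """True when at least `target` of the observations are under the threshold."""
    return fraction_within(values, threshold) >= target


# Hypothetical measurements: 10,000 dashboard queries, 5 of them slow.
latencies = [0.2] * 9995 + [1.5] * 5
# Hypothetical periodic samples of how old the displayed CTR data was.
data_ages = [120.0] * 1000

latency_ok = meets_slo(latencies, LATENCY_SLO_SECONDS)
freshness_ok = meets_slo(data_ages, FRESHNESS_SLO_SECONDS)
```

Treating the freshness SLO as a fraction of evenly spaced samples is a simplification of "99.9% of the time".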
Slide 21
Slide 21 text
ONE MACHINE
For every web search query, we log:
* TIME: the time the query occurred
* QUERY_ID: a unique identifier
* AD_ID: the ad IDs of the AdWords advertisements shown for the search
* SEARCH_TERM: the query content
Slide 22
Slide 22 text
ONE MACHINE
Calculations
* TIME: 64-bit integer, 8 bytes
* QUERY_ID: 64-bit integer, 8 bytes
* AD_ID: three 64-bit integers, 24 bytes
* SEARCH_TERM: a long string, up to 500 bytes
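The field sizes above can be sanity-checked with a small sketch. The little-endian unsigned layout and the names below are assumptions for illustration; the talk only gives the sizes.

```python
import struct

# TIME, QUERY_ID, and three AD_IDs, each a 64-bit integer.
# "<" (little-endian, no padding) and "Q" (unsigned) are assumptions.
FIXED_FIELDS = struct.Struct("<QQ3Q")
MAX_SEARCH_TERM_BYTES = 500  # from the slide

# Upper bound for one log entry made of the four listed fields.
MAX_ENTRY_BYTES = FIXED_FIELDS.size + MAX_SEARCH_TERM_BYTES

# Hypothetical entry: a timestamp, a query ID, and three ad IDs.
packed = FIXED_FIELDS.pack(1_700_000_000, 42, 1001, 1002, 1003)
```

The listed fields add up to 540 bytes at most; the volume estimate in this deck uses a more conservative 2 × 10^3 bytes per entry.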
Slide 23
Slide 23 text
ONE MACHINE
The volume of query logs generated in a 24-hour period:
* (5 × 10^5 queries/sec) × (8.64 × 10^4 seconds/day) × (2 × 10^3 bytes) = 86.4 TB/day
--
A common 4 TB HDD sustains 200 input/output operations per second (IOPS):
* (5 × 10^5 queries/sec) / (200 IOPS/disk) = 2.5 × 10^3 disks, or 2,500 disks
--
* (100 TB) / (64 GB RAM/machine) = 1,563 machines
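The three estimates above can be reproduced with a few lines of arithmetic. This is only a sketch: the constant names are hypothetical, the inputs are the figures from the slide, and decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) are assumed.

```python
import math

QUERIES_PER_SEC = 5 * 10**5
SECONDS_PER_DAY = 86_400          # 8.64 × 10^4
BYTES_PER_ENTRY = 2 * 10**3
IOPS_PER_DISK = 200
TB = 10**12
GB = 10**9

# Daily query log volume in terabytes.
log_tb_per_day = QUERIES_PER_SEC * SECONDS_PER_DAY * BYTES_PER_ENTRY / TB

# Disks needed if every query costs one I/O operation.
disks_needed = math.ceil(QUERIES_PER_SEC / IOPS_PER_DISK)

# Machines needed to hold 100 TB at 64 GB of RAM per machine.
machines_needed = math.ceil(100 * TB / (64 * GB))

print(f"{log_tb_per_day:.1f} TB/day, {disks_needed} disks, {machines_needed} machines")
```

Rounding up with `math.ceil` matches the slide, where 1,562.5 machines becomes 1,563.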
Slide 24
Slide 24 text
ASSESSMENT
We cannot reasonably support our SLOs if
one of these components fails.
The one-machine design looks
infeasible.
Slide 25
Slide 25 text
EXPLORE ANOTHER IDEA
* We can process and join the logs with a MapReduce.
* We can grab the accumulated query logs and click logs.
MapReduce will produce a data set organized by ad_id that shows
the number of clicks each search_term received.
Unfortunately, this type of batch process can’t
meet our SLO of joined log availability within 5
minutes of logs being received.
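To make the batch join concrete, here is a single-process stand-in written in plain Python. It is not a MapReduce implementation, and the record shapes are assumptions: query log entries are taken to carry `query_id` and `search_term`, and click log entries `query_id` and `ad_id` (the click log format is not shown in the talk).

```python
from collections import defaultdict


def join_and_count(query_logs, click_logs):
    """Join clicks to queries on query_id; count clicks per ad_id and search_term."""
    # Index phase: look up each query's search term by its query_id.
    term_by_query = {q["query_id"]: q["search_term"] for q in query_logs}

    # Aggregation phase: group clicks by ad_id, then by search_term.
    counts = defaultdict(lambda: defaultdict(int))
    for click in click_logs:
        term = term_by_query.get(click["query_id"])
        if term is None:
            continue  # click with no matching query log entry in this batch
        counts[click["ad_id"]][term] += 1
    return {ad_id: dict(terms) for ad_id, terms in counts.items()}


# Hypothetical accumulated logs.
query_logs = [
    {"query_id": 1, "search_term": "flowers"},
    {"query_id": 2, "search_term": "roses"},
]
click_logs = [
    {"query_id": 1, "ad_id": 10},
    {"query_id": 1, "ad_id": 10},
    {"query_id": 2, "ad_id": 10},
    {"query_id": 3, "ad_id": 11},  # no matching query in this batch
]
clicks_by_ad = join_and_count(query_logs, click_logs)
```

The join only sees logs that have already accumulated, which is why a batch of this kind struggles with a five-minute freshness target.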
Slide 26
Slide 26 text
EXPLORE ANOTHER IDEA
LogJoiner
Slide 27
Slide 27 text
DISTRIBUTED SYSTEM
The amount of network throughput LogJoiner needs to process the logs:
* (10^4 clicks/sec) × (2 × 10^3 bytes) = 2 × 10^7 bytes/sec = 20 MB/sec = 160 Mbps
--
* 3 × (5 × 10^5 queries/sec) × (8.64 × 10^4 seconds/day) × (8 bytes + 8 bytes) ≈ 2 × 10^12 bytes/day = 2 TB/day for QueryMap
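Both figures can be checked with the short sketch below. The constant names are hypothetical and the inputs are the slide's numbers; the factor of 3 and the 16 bytes per entry are taken as given from the slide, and decimal units are assumed.

```python
CLICKS_PER_SEC = 10**4
QUERIES_PER_SEC = 5 * 10**5
SECONDS_PER_DAY = 86_400          # 8.64 × 10^4
BYTES_PER_LOG_ENTRY = 2 * 10**3
BYTES_PER_MAP_ENTRY = 8 + 8       # as written on the slide
ENTRIES_PER_QUERY = 3             # the factor of 3 on the slide

# Network throughput LogJoiner needs for the click stream.
bytes_per_sec = CLICKS_PER_SEC * BYTES_PER_LOG_ENTRY
megabytes_per_sec = bytes_per_sec // 10**6
megabits_per_sec = megabytes_per_sec * 8

# Daily storage growth of QueryMap, in bytes.
querymap_bytes_per_day = (
    ENTRIES_PER_QUERY * QUERIES_PER_SEC * SECONDS_PER_DAY * BYTES_PER_MAP_ENTRY
)

print(megabytes_per_sec, "MB/sec =", megabits_per_sec, "Mbps")
print(querymap_bytes_per_day / 10**12, "TB/day")
```

The exact QueryMap product is a little over 2 × 10^12 bytes, which the slide rounds to 2 TB/day.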
The next step in scaling the design is to shard the
inputs and outputs:
divide the incoming query logs and
click logs into multiple streams.
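A minimal sketch of one possible sharding scheme follows. Routing by `query_id` modulo the shard count is an assumption chosen for illustration, not necessarily the scheme used in the final design; the record shapes and function names are hypothetical as well.

```python
def shard_for(query_id: int, num_shards: int) -> int:
    """Pick a shard for a record; modulo on query_id is an illustrative choice."""
    return query_id % num_shards


def split_into_streams(records, num_shards):
    """Divide a log into num_shards streams, preserving order within each stream."""
    streams = [[] for _ in range(num_shards)]
    for record in records:
        streams[shard_for(record["query_id"], num_shards)].append(record)
    return streams


# Hypothetical logs: six queries and two clicks.
query_log = [{"query_id": i, "search_term": f"term-{i}"} for i in range(6)]
click_log = [{"query_id": 4, "ad_id": 10}, {"query_id": 5, "ad_id": 11}]

query_streams = split_into_streams(query_log, num_shards=3)
click_streams = split_into_streams(click_log, num_shards=3)
```

Because both logs are routed with the same function, a click and the query it belongs to land in the same shard number, so each shard can be joined independently.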
Slide 28
Slide 28 text
THE FINAL DESIGN
Slide 29
Slide 29 text
If it doesn't HURT, it doesn't WORK! :O
Slide 30
Slide 30 text
THANKS
Welcome to Colombia
Slide 31
Slide 31 text
#DevOpsDaysMDE
Thank you very much!
Slide 32
Slide 32 text
#DevOpsDaysMDE
Join the conversation on social media
@DevopsdaysMed
DevOpsDays Medellín
Devopsdays_medellin
DevOpsDaysMedellin