Slide 1

Slide 1 text

DevOpsDays Medellín 2021. The world's leading DevOps conference comes to Medellín! July 30-31, 2021

Slide 2

Slide 2 text

#DevOpsDaysMDE Join the conversation on social media: @DevopsdaysMed, DevOpsDays Medellín, Devopsdays_medellin, DevOpsDaysMedellin

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Non-Abstract Large System Design (NALSD)

Slide 5

Slide 5 text

YURY NIÑO ROA. Site Reliability Engineer, Chaos Engineering Advocate, ADL Digital Labs. www.sitereliabilityenginering.co . www.yurynino.com . yury nino

Slide 6

Slide 6 text

AGENDA
* Systems Design: Fundamentals
* What is NALSD? An Introduction
* How NALSD? A Use Case

Slide 7

Slide 7 text

CREDITS

Slide 8

Slide 8 text

What do you look for in an SRE? 1. Automation 3. Curiosity

Slide 9

Slide 9 text

GLOSSARY OF SYSTEMS DESIGN
Load Balancing, Data Partitioning, Proxies, Caching, Indexes, Redundancy, Replication, SQL vs NoSQL, Consistent Hashing, CAP Theorem, PACELC Theorem, Bloom Filters, Quorum, Leader and Follower

Slide 10

Slide 10 text

SYSTEMS DESIGN PATTERNS
Consistent Core, Follower Reads, Generation Clock, Gossip Dissemination, HeartBeat, Hybrid Clock, Idempotent Receiver, State Watch, Quorum
Source: https://martinfowler.com/articles/patterns-of-distributed-systems/

Slide 11

Slide 11 text

SYSTEMS DESIGN FALLACIES
* The Network Is Reliable
* Latency Is Zero
* The Topology Doesn't Change
* Transport Cost Is Zero
* Bandwidth Is Infinite
* The Network Is Secure

Slide 12

Slide 12 text

WHAT IS NALSD? NON-ABSTRACT LARGE SYSTEM DESIGN
* An SRE's ability to assess, design, and evaluate large systems.
* An iterative style for designing and implementing systems.
* Robust and scalable designs with low operational costs.

Slide 13

Slide 13 text

NALSD IN DETAIL Google SREs are expected to be able to start resource planning with a basic whiteboard diagram of a system, think through the various scaling and failure domains, and focus their design into a concrete proposal for resources.

Slide 14

Slide 14 text

WHY NALSD? Google has learned (the hard way) that the people designing distributed systems need to develop and continuously exercise the muscle of turning a design into concrete estimates of resources at multiple steps in the process.

Slide 15

Slide 15 text

NALSD IN DETAIL
Initial Requirements: * Read & Understand * Required SLOs * Ask that you consider
Design Process: * Is it possible? * Can we do better? * Is it feasible? * Is it resilient?
One Machine: Consider running our entire application on a single computer.
Distributed System: Now we'll need multiple machines; what's the best design to join them?

Slide 16

Slide 16 text

HOW TO BEGIN? ‘The numbers everyone should know’
Flash cards: https://danrl.com/sre-flash-cards/SRE%20Flash%20Cards.pdf
Sample cards (power of ten? ns / us / ms):
* Time: main memory reference
* Time: round trip within same datacenter
* Time: read 1 MB sequentially from memory
* Speed: read sequentially from SSD
From: https://cloud.google.com/blog/products/management-tools/sre-principles-and-flashcards-to-design-nalsd

Slide 17

Slide 17 text

HOW TO BEGIN?

Slide 18

Slide 18 text

USE CASE: AdWords
● The Google AdWords service displays text advertisements on Google Web Search.
● The click-through rate (CTR) metric tells advertisers how well their ads are performing.
● CTR is the ratio of the number of times the ad is clicked to the number of times the ad is shown.
Challenge: Design a system capable of measuring and reporting an accurate CTR for every AdWords ad.

Slide 19

Slide 19 text

INITIAL REQUIREMENTS
Each advertiser may have multiple advertisements. Each ad is keyed by ad_id and is associated with a list of search terms selected by the advertiser.
* How often did this search term trigger this ad to be shown?
* How many times was the ad clicked by someone who saw it?
* With this information, we can calculate the CTR.
CTR: the number of clicks divided by the number of impressions.
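
The CTR definition above can be sketched in a few lines. This is a minimal illustration rather than anything shown in the talk: the function name `click_through_rate`, the zero-impression behavior, and the sample counts are all assumptions.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return clicks divided by impressions for one (ad_id, search_term) pair."""
    if clicks < 0 or impressions < 0:
        raise ValueError("counts must be non-negative")
    if impressions == 0:
        # Assumption: report 0.0 for an ad that was never shown
        # instead of dividing by zero.
        return 0.0
    return clicks / impressions


# Hypothetical counts: the ad was shown 1,000 times and clicked 25 times.
ctr = click_through_rate(clicks=25, impressions=1000)
print(f"CTR = {ctr:.2%}")
```

Keeping the two counters separate, instead of storing only the ratio, lets the dashboard recompute CTR over any time window.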

Slide 20

Slide 20 text

INITIAL REQUIREMENTS
● We know our advertisers care about two things:
○ That the dashboard displays quickly!
○ That the data is recent.
Therefore, we will consider our requirements in terms of SLOs:
● 99.9% of dashboard queries complete in < 1 second.
● 99.9% of the time, the CTR data displayed is less than 5 minutes old.
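
One way to read these two SLOs is as a threshold plus a target fraction. The sketch below checks a batch of measurements against such a target. The helper names `fraction_within` and `meets_slo` and all sample values are hypothetical. Treating the freshness SLO as a fraction of periodic samples is also an assumption; it only approximates "99.9% of the time".

```python
def fraction_within(samples, threshold):
    """Fraction of samples strictly below the threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s < threshold) / len(samples)


def meets_slo(samples, threshold, target=0.999):
    """True when at least `target` of the samples fall below `threshold`."""
    return fraction_within(samples, threshold) >= target


# Hypothetical dashboard latencies in seconds: 5 slow queries out of 10,000.
latencies_sec = [0.2] * 9995 + [1.5] * 5
print(meets_slo(latencies_sec, threshold=1.0))

# Hypothetical data-age samples in seconds: 20 of 10,000 older than 5 minutes.
ages_sec = [60] * 9980 + [400] * 20
print(meets_slo(ages_sec, threshold=300))
```

With these made-up numbers the latency SLO holds (99.95% under 1 second) and the freshness SLO does not (99.8% under 5 minutes).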

Slide 21

Slide 21 text

ONE MACHINE
For every web search query, we log:
* TIME: the time the query occurred
* QUERY_ID: a unique query identifier
* AD_ID: the ad IDs of the AdWords advertisements shown for the search
* SEARCH_TERM: the query content

Slide 22

Slide 22 text

ONE MACHINE: Calculations
* TIME: 64-bit integer, 8 bytes
* QUERY_ID: 64-bit integer, 8 bytes
* AD_ID: three 64-bit integers, 24 bytes
* SEARCH_TERM: a long string, up to 500 bytes
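
Summing the listed field sizes gives a lower bound on the size of one query log entry. The sketch below does that arithmetic and compares it with the 2 × 10^3 bytes per entry used in the volume calculation on the next slide. The slides do not say why the figure is rounded up, so the interpretation as headroom is an assumption, as are the variable names.

```python
# Field sizes in bytes, as listed on the slide.
FIELD_BYTES = {
    "time": 8,           # 64-bit integer
    "query_id": 8,       # 64-bit integer
    "ad_id": 3 * 8,      # three 64-bit integers
    "search_term": 500,  # upper bound for a long string
}

listed_bytes = sum(FIELD_BYTES.values())

# Per-entry size used in the later volume calculation (from the slides).
ROUNDED_ENTRY_BYTES = 2 * 10**3

# Assumption: the gap is headroom for data not itemized here.
headroom_factor = ROUNDED_ENTRY_BYTES / listed_bytes

print(listed_bytes, "bytes listed per entry")
print(f"rounded figure is about {headroom_factor:.1f}x the listed fields")
```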

Slide 23

Slide 23 text

ONE MACHINE
The volume of query logs generated in a 24-hour period:
* (5 × 10^5 queries/sec) × (8.64 × 10^4 seconds/day) × (2 × 10^3 bytes) = 86.4 TB/day
A common 4 TB HDD sustains 200 input/output operations per second (IOPS), so writing one log entry per query needs:
* (5 × 10^5 queries/sec) / (200 IOPS/disk) = 2.5 × 10^3 disks, or 2,500 disks
Holding a day of logs in RAM instead, rounded up to 100 TB:
* (100 TB) / (64 GB RAM/machine) = 1,563 machines
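
The three estimates on this slide can be reproduced with integer arithmetic. This is a sketch under stated assumptions: decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), one disk operation per logged query, and a hypothetical helper `ceil_div`. All input figures come from the slide.

```python
QUERIES_PER_SEC = 5 * 10**5
SECONDS_PER_DAY = 86_400          # 8.64 x 10^4
BYTES_PER_ENTRY = 2 * 10**3
IOPS_PER_DISK = 200
RAM_NEEDED_BYTES = 100 * 10**12   # 100 TB, decimal units assumed
RAM_PER_MACHINE = 64 * 10**9      # 64 GB, decimal units assumed


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up: you cannot buy half a disk or machine."""
    return -(-a // b)


# Daily log volume in bytes, then in TB.
bytes_per_day = QUERIES_PER_SEC * SECONDS_PER_DAY * BYTES_PER_ENTRY
tb_per_day = bytes_per_day / 10**12

# Disks needed if each query costs one I/O operation (assumption).
disks_for_iops = ceil_div(QUERIES_PER_SEC, IOPS_PER_DISK)

# Machines needed to hold 100 TB in RAM.
machines_for_ram = ceil_div(RAM_NEEDED_BYTES, RAM_PER_MACHINE)

print(tb_per_day, "TB/day")
print(disks_for_iops, "disks")
print(machines_for_ram, "machines")
```

Rounding up matters for the last figure: 100 TB / 64 GB is 1,562.5, which becomes 1,563 machines.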

Slide 24

Slide 24 text

ASSESSMENT We cannot reasonably support our SLOs if one of these components fails. The one-machine design looks infeasible.

Slide 25

Slide 25 text

EXPLORE ANOTHER IDEA
* We can process and join the logs with a MapReduce.
* We can grab the accumulated query logs and click logs.
MapReduce will produce a data set organized by ad_id that shows the number of clicks each search_term received. Unfortunately, this type of batch process can't meet our SLO of joined log availability within 5 minutes of logs being received.
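
The join described above can be sketched as a small single-process batch job. This is a stand-in for the idea, not an actual MapReduce job and not the speaker's implementation. The function name `join_and_count`, the dictionary record layout, and the assumption that each click log entry carries a `query_id` and an `ad_id` are all hypothetical.

```python
from collections import Counter


def join_and_count(query_logs, click_logs):
    """Join clicks to queries on query_id; count clicks per (ad_id, search_term)."""
    # Index the accumulated query logs by query_id.
    term_by_query = {q["query_id"]: q["search_term"] for q in query_logs}

    clicks = Counter()
    for click in click_logs:
        term = term_by_query.get(click["query_id"])
        if term is None:
            continue  # no matching query log in this batch; skip the click
        clicks[(click["ad_id"], term)] += 1
    return clicks


# Hypothetical accumulated logs.
query_logs = [
    {"query_id": 1, "ad_ids": [10, 11, 12], "search_term": "running shoes"},
    {"query_id": 2, "ad_ids": [10], "search_term": "trail shoes"},
]
click_logs = [
    {"query_id": 1, "ad_id": 10},
    {"query_id": 2, "ad_id": 10},
    {"query_id": 1, "ad_id": 11},
    {"query_id": 99, "ad_id": 10},  # query log not present in this batch
]

for (ad_id, term), n in sorted(join_and_count(query_logs, click_logs).items()):
    print(ad_id, term, n)
```

Because the job only runs over logs that have already accumulated, its output is always at least one batch interval old, which is why a batch approach conflicts with the 5-minute freshness SLO.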

Slide 26

Slide 26 text

EXPLORE ANOTHER IDEA LogJoiner

Slide 27

Slide 27 text

DISTRIBUTED SYSTEM
The amount of network throughput LogJoiner needs to process the logs:
* (10^4 clicks/sec) × (2 × 10^3 bytes) = 2 × 10^7 bytes/sec = 20 MB/sec = 160 Mbps
The storage QueryMap needs per day:
* 3 × (5 × 10^5 queries/sec) × (8.64 × 10^4 seconds/day) × (8 bytes + 8 bytes) ≈ 2 × 10^12 bytes/day = 2 TB/day
The next step in scaling the design is to shard the inputs and outputs: divide the incoming query logs and click logs into multiple streams.
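
The sketch below reproduces both estimates and adds a minimal sharding function. The input figures come from the slide; the rest is assumed. The slide does not name a shard key, so sharding by `query_id` modulo the shard count is an illustrative choice, and `shard_for` and `NUM_SHARDS` are hypothetical names. The factor of 3 is read here as three records per query, which is also an assumption.

```python
CLICKS_PER_SEC = 10**4
BYTES_PER_ENTRY = 2 * 10**3
QUERIES_PER_SEC = 5 * 10**5
SECONDS_PER_DAY = 86_400
RECORDS_PER_QUERY = 3        # factor of 3 from the slide (interpretation assumed)
BYTES_PER_RECORD = 8 + 8     # as written on the slide

# Network throughput needed for the click logs.
click_bytes_per_sec = CLICKS_PER_SEC * BYTES_PER_ENTRY
click_mbps = click_bytes_per_sec * 8 / 10**6

# Daily storage for QueryMap.
querymap_bytes_per_day = (
    RECORDS_PER_QUERY * QUERIES_PER_SEC * SECONDS_PER_DAY * BYTES_PER_RECORD
)
querymap_tb_per_day = querymap_bytes_per_day / 10**12

print(click_mbps, "Mbps")
print(round(querymap_tb_per_day, 2), "TB/day")

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(query_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a log entry to a shard by query_id (assumed shard key)."""
    return query_id % num_shards


# A query log entry and a click log entry with the same query_id land on the
# same shard, so each shard can join its own stream independently.
print(shard_for(42), shard_for(42))
```

Plain modulo sharding reshuffles most keys when the shard count changes. Consistent hashing, listed in the glossary earlier in the deck, is one alternative if shards are added or removed often.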

Slide 28

Slide 28 text

THE FINAL DESIGN

Slide 29

Slide 29 text

If it doesn't HURT, it doesn't WORK! :O

Slide 30

Slide 30 text

THANKS. Welcome to Colombia

Slide 31

Slide 31 text

#DevOpsDaysMDE Thank you very much!
