Slide 1

Slide 1 text

Production Engineering for Youngbloods A small collection of things I have learned interacting with production environments.

Slide 2

Slide 2 text

Notes on Distributed Systems for Youngbloods https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Slide 3

Slide 3 text

Hector Castro: @hectcastro on GitHub, Twitter, LinkedIn, etc.

Slide 4

Slide 4 text

We build applications that use maps, location, and aerial imagery for civic and social impact. Azavea https://careers.azavea.com

Slide 5

Slide 5 text

Humming It’s better than raising hands.

Slide 7

Slide 7 text

Overview
1. Databases
2. Caches
3. Queues
4. Something special

Slide 8

Slide 8 text

Databases Rare photo of an actual databass.

Slide 9

Slide 9 text

Connection Count An unapologetic villain that thwarts attempts at horizontal scalability. Databases
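
One way to see why connection count gets in the way of horizontal scaling is simple arithmetic: each application instance typically holds its own pool of connections, so the total grows with the number of instances while the database's limit stays fixed. The sketch below uses made-up numbers; the function names and the idea of reserving a few connections are assumptions for illustration, not something taken from the talk.

```javascript
// Back-of-the-envelope connection math. All numbers are hypothetical.

// Total connections the database sees when every instance fills its pool.
function totalConnections(appInstances, poolSizePerInstance) {
  return appInstances * poolSizePerInstance;
}

// Largest number of app instances that fits under the database's limit,
// leaving `reserved` connections for migrations, consoles, and monitoring.
function maxInstances(connectionLimit, poolSizePerInstance, reserved = 0) {
  return Math.floor((connectionLimit - reserved) / poolSizePerInstance);
}

console.log(totalConnections(10, 20)); // 200
console.log(maxInstances(100, 20, 5)); // 4
```

With these example numbers, adding a fifth instance would push the database past its limit even though each instance looks modest on its own.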

Slide 10

Slide 10 text

Slow Queries The query planner is pretty smart, but an oracle it is not. Databases

Slide 11

Slide 11 text

Schema Changes Adding this single-line ALTER TABLE statement will be trivial. Databases

Slide 12

Slide 12 text

Humming Let’s hear it for databases!

Slide 13

Slide 13 text

Caches Cache rules everything around me.

Slide 14

Slide 14 text

Cache Failure This person is like your system—about to wipe out. Caches

Slide 15

Slide 15 text

Kill Switches Make it easy to tell your application to avoid the cache. Caches
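
A kill switch can be as small as one flag checked before every cache access. The sketch below is a minimal illustration, assuming a synchronous loader and an in-memory `Map` as the cache; `createReader`, `isCacheEnabled`, and `loadFromSource` are hypothetical names. In a real application the flag would more likely come from an environment variable or a feature-flag service.

```javascript
// Wraps a data source with a cache that can be bypassed at runtime.
function createReader({ cache, loadFromSource, isCacheEnabled }) {
  return function read(key) {
    if (!isCacheEnabled()) {
      // Kill switch engaged: skip the cache entirely.
      return loadFromSource(key);
    }
    if (cache.has(key)) {
      return cache.get(key);
    }
    const value = loadFromSource(key);
    cache.set(key, value);
    return value;
  };
}

// Usage with stand-ins for the flag and the data source.
let cacheEnabled = true;
let sourceReads = 0;

const read = createReader({
  cache: new Map(),
  loadFromSource: (key) => {
    sourceReads += 1;
    return `value-for-${key}`;
  },
  isCacheEnabled: () => cacheEnabled,
});

read('user:1'); // miss: loads from the source and fills the cache
read('user:1'); // hit: served from the cache
cacheEnabled = false; // flip the kill switch
read('user:1'); // bypasses the cache and goes straight to the source

console.log(sourceReads); // 2
```

Checking the flag on every read, rather than once at startup, is what lets the switch take effect without a redeploy.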

Slide 16

Slide 16 text

New Class of Bugs Usually, it’s the car that crashes into something, not the other way around. Caches

Slide 17

Slide 17 text

Debugging Complexity Many of the costs of caching aren’t paid up-front. Caches

Slide 18

Slide 18 text

Humming Let’s hear it for caches!

Slide 19

Slide 19 text

Queues I spy an unbounded queue!
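
The slide only names the problem, so the following is one possible remedy rather than the talk's recommendation: give the queue a fixed capacity and make producers handle rejection. `BoundedQueue`, `offer`, and `poll` are hypothetical names chosen for this sketch.

```javascript
// A FIFO queue that refuses new items once it reaches its capacity.
class BoundedQueue {
  constructor(capacity) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.capacity = capacity;
    this.items = [];
  }

  // Returns true if the item was accepted, false if the queue is full.
  offer(item) {
    if (this.items.length >= this.capacity) {
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Removes and returns the oldest item, or undefined if the queue is empty.
  poll() {
    return this.items.shift();
  }

  get size() {
    return this.items.length;
  }
}

const queue = new BoundedQueue(2);
console.log(queue.offer('a')); // true
console.log(queue.offer('b')); // true
console.log(queue.offer('c')); // false: the producer must back off or drop
```

Returning `false` instead of growing without limit pushes the overload decision back to the caller, where it is visible.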

Slide 20

Slide 20 text

Response Time A simple equation that leads to a good mental model for queueing systems. Queues

Slide 21

Slide 21 text

[Diagram labels: Request, Response]
Queues !-> Response Time

Slide 24

Slide 24 text

[Diagram labels: Request, Response]
Queueing delay + Service time = Response time
Queues !-> Response Time
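
To make the equation concrete, here is a small sketch that applies it to a single worker draining a first-in, first-out queue. The arrival and service times are invented for illustration, and `simulateFifo` is a hypothetical helper, not part of the talk.

```javascript
// Single worker, FIFO order. `requests` must be sorted by arrival time.
// Times are in milliseconds.
function simulateFifo(requests) {
  let workerFreeAt = 0;
  return requests.map(({ arrival, service }) => {
    // A request starts when it has arrived AND the worker is free.
    const start = Math.max(arrival, workerFreeAt);
    const queueingDelay = start - arrival;
    workerFreeAt = start + service;
    return {
      queueingDelay,
      serviceTime: service,
      responseTime: queueingDelay + service,
    };
  });
}

// Three requests arrive 10 ms apart, but each needs 100 ms of work.
const results = simulateFifo([
  { arrival: 0, service: 100 },
  { arrival: 10, service: 100 },
  { arrival: 20, service: 100 },
]);

for (const r of results) {
  console.log(r.queueingDelay, r.serviceTime, r.responseTime);
}
// 0 100 100
// 90 100 190
// 180 100 280
```

Service time stays at 100 ms for every request, yet response time keeps climbing because each request waits behind the ones before it.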

Slide 25

Slide 25 text

f(f(x)) = f(x) Idempotence A useful property for tasks in a queue, hidden behind a big word. Queues
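
The property f(f(x)) = f(x) can be checked directly on small examples. The sketch below compares an operation that sets a flag with one that increments a counter; `setSentFlag`, `incrementSendCount`, and `isIdempotentFor` are hypothetical names used only for illustration.

```javascript
const { isDeepStrictEqual } = require('util');

// Setting a flag: applying it twice gives the same state as applying it once.
const setSentFlag = (state) => ({ ...state, sent: true });

// Incrementing a counter: every extra application changes the state again.
const incrementSendCount = (state) => ({ ...state, count: state.count + 1 });

// Checks f(f(x)) = f(x) for one particular input x.
function isIdempotentFor(f, x) {
  return isDeepStrictEqual(f(f(x)), f(x));
}

const initial = { count: 0, sent: false };
console.log(isIdempotentFor(setSentFlag, initial)); // true
console.log(isIdempotentFor(incrementSendCount, initial)); // false
```

A task shaped like `setSentFlag` is safe to run again if the queue redelivers it; one shaped like `incrementSendCount` is not.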

Slide 26

Slide 26 text

const sgMail = require('@sendgrid/mail');

exports.nonIdempotentEmailFunction = (event) => {
  const message = event.data;

  // Send email.
  sgMail.setApiKey(...);
  sgMail.send({..., text: message});
};

Queues !-> Idempotence

Slide 28

Slide 28 text

const sgMail = require('@sendgrid/mail');
const db = nosql.database();

exports.idempotentEmailFunction = (event) => {
  const message = event.data;
  const eventId = event.id;
  const emailRef = db.collection('sentEmails').doc(eventId);

  return shouldSend(emailRef).then(send => {
    if (send) {
      // Send email.
      sgMail.setApiKey(...);
      sgMail.send({..., text: message});
      return markSent(emailRef);
    }
  });
};

function shouldSend(emailRef) {
  return emailRef.get().then(emailDoc => {
    return !emailDoc.exists || !emailDoc.data().sent;
  });
}

function markSent(emailRef) {
  return emailRef.set({sent: true});
}

Queues !-> Idempotence

Slide 29

Slide 29 text

Highlighted lines from the function on Slide 28:

const eventId = event.id;
const emailRef = db.collection('sentEmails').doc(eventId);

Queues !-> Idempotence

Slide 30

Slide 30 text

Highlighted lines from the function on Slide 28:

return shouldSend(emailRef).then(send => {

function shouldSend(emailRef) {
  return emailRef.get().then(emailDoc => {
    return !emailDoc.exists || !emailDoc.data().sent;
  });
}

Queues !-> Idempotence

Slide 31

Slide 31 text

Highlighted lines from the function on Slide 28:

return markSent(emailRef);

function markSent(emailRef) {
  return emailRef.set({sent: true});
}

Queues !-> Idempotence
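
The same check-then-mark pattern can be tried without SendGrid or a database. The sketch below swaps both for in-memory stand-ins (a `Map` for the `sentEmails` collection and an array for the outbox) and runs synchronously; `createIdempotentHandler` and the sample events are hypothetical and exist only to show what happens when a queue delivers the same event twice.

```javascript
// Builds a handler that sends at most one email per event ID.
function createIdempotentHandler({ sentEmails, sendEmail }) {
  return function handle(event) {
    // Same idea as shouldSend(): skip events already marked as sent.
    if (sentEmails.get(event.id) === true) {
      return false;
    }
    sendEmail(event.data);
    // Same idea as markSent(): record the event ID after sending.
    sentEmails.set(event.id, true);
    return true;
  };
}

const outbox = [];
const handle = createIdempotentHandler({
  sentEmails: new Map(),
  sendEmail: (message) => outbox.push(message),
});

// The queue delivers evt-1 twice, as can happen after a retry.
const deliveries = [
  { id: 'evt-1', data: 'Welcome!' },
  { id: 'evt-1', data: 'Welcome!' },
  { id: 'evt-2', data: 'Your receipt' },
];

const handled = deliveries.map((event) => handle(event));

console.log(handled); // [ true, false, true ]
console.log(outbox); // [ 'Welcome!', 'Your receipt' ]
```

The duplicate delivery is absorbed by the lookup on the event ID, so the recipient gets one email per event.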

Slide 32

Slide 32 text

Humming Let’s hear it for queues!

Slide 34

Slide 34 text

“It’s slow” The hardest problem you’ll ever debug.

Slide 35

Slide 35 text

Latency Numbers https://gist.github.com/jboner/2841832

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
SSD random read ........................ 150,000 ns = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs
Round trip within same datacenter ...... 500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms
Disk seek ........................... 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms

It’s Slow

Slide 36

Slide 36 text

Latency Numbers https://gist.github.com/jboner/2841832

Highlighted row from the table on Slide 35:
Main memory reference ...................... 100 ns

It’s Slow

Slide 37

Slide 37 text

Latency Numbers https://gist.github.com/jboner/2841832

Highlighted rows from the table on Slide 35:
Main memory reference ...................... 100 ns
Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms

It’s Slow

Slide 38

Slide 38 text

Humanized Latency Numbers https://gist.github.com/hellerbarde/2843375

L1 cache reference                   0.5 s       One heart beat (0.5 s)
Branch mispredict                    5 s         Yawn
L2 cache reference                   7 s         Long yawn
Mutex lock/unlock                    25 s        Making a coffee
Main memory reference                100 s       Brushing your teeth
Compress 1K bytes with Zippy         50 min      One episode of a TV show
Send 2K bytes over 1 Gbps network    5.5 hr      Lunch to end of work day
SSD random read                      1.7 days    A normal weekend
Read 1 MB sequentially from memory   2.9 days    A long weekend
Round trip within same datacenter    5.8 days    A medium vacation
Disk seek                            16.5 weeks  A semester in university
Read 1 MB sequentially from disk     7.8 months  Producing a new human being

It’s Slow !-> Latency Numbers

Slide 39

Slide 39 text

Humanized Latency Numbers https://gist.github.com/hellerbarde/2843375

The table from Slide 38, with one row added after "Read 1 MB sequentially from disk":
The above 2 together                 1 year

Highlighted row:
Main memory reference                100 s       Brushing your teeth

It’s Slow !-> Latency Numbers

Slide 40

Slide 40 text

Humanized Latency Numbers https://gist.github.com/hellerbarde/2843375

Highlighted rows from the table on Slide 38:
Main memory reference                100 s       Brushing your teeth
Read 1 MB sequentially from disk     7.8 months  Producing a new human being

It’s Slow !-> Latency Numbers

Slide 41

Slide 41 text

Percentiles Just focusing on the mean is mean. It’s Slow
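
A short example shows how a mean can hide the requests that users actually complain about. The latency samples below are invented, and `percentile` uses the nearest-rank method, which is one of several common definitions; both are assumptions made for this sketch.

```javascript
function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// 100 hypothetical request latencies in milliseconds:
// 98 fast requests and 2 very slow ones.
const latenciesMs = [...Array(98).fill(10), 1000, 2000];

console.log(mean(latenciesMs)); // 39.8
console.log(percentile(latenciesMs, 50)); // 10
console.log(percentile(latenciesMs, 99)); // 1000
```

The mean of 39.8 ms describes none of the requests: most took 10 ms, and the slowest took a second or more, which only the high percentiles reveal.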

Slide 42

Slide 42 text

Be Curious About the System Strive to develop a mental model of the application and the architecture it resides on. It’s Slow

Slide 43

Slide 43 text

Humming Let’s hear it for things being slow!

Slide 44

Slide 44 text

Thank you.