Production Engineering for Youngbloods

A lot can be learned about shipping software through books, tutorials, and coursework, but there is a whole class of lessons that only ever shows up in production environments. Those lessons are extremely valuable, but the barrier to learning them is high. I'm hoping to lower that barrier a bit with this talk.

This talk contains a handful of lessons I've learned through operating production environments, as told to beginner engineers. Some are obvious and some are surprising, but none are fiction. The talk is not in-depth, but it is a good starting point for engineers who want to learn about production engineering.

Hector Castro

October 15, 2019

Transcript

  1. Production Engineering for Youngbloods: A small collection of things I have learned interacting with production environments.
  2. Notes on Distributed Systems for Youngbloods: https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
  3. Hector Castro: @hectcastro on GitHub, Twitter, LinkedIn, etc.
  4. Azavea: We build applications that use maps, location, and aerial imagery for civic and social impact. https://careers.azavea.com
  5. Humming: It’s better than raising hands.
  6. None
  7. Overview: (1) Databases, (2) Caches, (3) Queues, (4) Something special.
  8. Databases: Rare photo of an actual databass.
  9. Connection Count: An unapologetic villain that thwarts attempts at horizontal scalability.
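One way to see why connection count resists horizontal scaling is simple arithmetic: if every application instance holds its own connection pool, total connections grow with the instance count while the database's limit stays fixed. The sketch below is not from the talk; the function names (`totalConnections`, `maxInstances`) and all numbers are hypothetical.

```javascript
// Assumption: each application instance keeps its own fixed-size pool.
function totalConnections(instances, poolSizePerInstance) {
  return instances * poolSizePerInstance;
}

// Largest instance count whose pools still fit under the database's
// connection limit, after reserving a few connections for admin use.
function maxInstances(dbMaxConnections, poolSizePerInstance, reserved = 0) {
  return Math.floor((dbMaxConnections - reserved) / poolSizePerInstance);
}

// Hypothetical numbers: 12 instances with 10 connections each.
console.log(totalConnections(12, 10)); // 120
// Hypothetical limit of 100 connections, 3 of them reserved.
console.log(maxInstances(100, 10, 3)); // 9
```

With these made-up numbers, the tenth instance is the one that starts failing to connect, which is why a connection pooler or smaller per-instance pools often enter the conversation at this point.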
  10. Slow Queries: The query planner is pretty smart, but an oracle it is not.
  11. Schema Changes: Adding this single-line ALTER TABLE statement will be trivial.
  12. Humming: Let’s hear it for databases!
  13. Caches: Cache rules everything around me.
  14. Cache Failure: This person is like your system: about to wipe out.
  15. Kill Switches: Make it easy to tell your application to avoid the cache.
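A minimal sketch of what a cache kill switch could look like, assuming an in-process `Map` as the cache and a flag that is checked on every lookup. The names `getWithKillSwitch`, `isCacheDisabled`, and `loadUser` are hypothetical and not from the talk; in practice the flag might come from a feature-flag service or a configuration store.

```javascript
// Hypothetical in-process cache; a real system might use Redis or Memcached.
const cache = new Map();

// isCacheDisabled is called on every lookup so the switch takes effect
// immediately, without a deploy or restart.
function getWithKillSwitch(key, loadFromSource, isCacheDisabled) {
  if (isCacheDisabled()) {
    return loadFromSource(key); // Kill switch on: go straight to the source.
  }
  if (!cache.has(key)) {
    cache.set(key, loadFromSource(key));
  }
  return cache.get(key);
}

// Usage with a stand-in loader and a mutable flag.
let disabled = false;
let sourceCalls = 0;
const loadUser = (id) => {
  sourceCalls += 1;
  return { id };
};

getWithKillSwitch('42', loadUser, () => disabled); // Miss: calls the source.
getWithKillSwitch('42', loadUser, () => disabled); // Hit: served from cache.
disabled = true;
getWithKillSwitch('42', loadUser, () => disabled); // Bypass: calls the source.
console.log(sourceCalls); // 2
```

The design choice worth noting is that the bypass path never touches the cache at all, so the application keeps working (more slowly) even if the cache itself is the thing that is failing.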
  16. New Class of Bugs: Usually, it’s the car that crashes into something, not the other way around.
  17. Debugging Complexity: Many of the costs of caching aren’t paid up-front.
  18. Humming: Let’s hear it for caches!
  19. Queues: I spy an unbounded queue!
  20. Response Time: A simple equation that leads to a good mental model for queueing systems.
  21-23. Request, Response
  24. Request, Response: Queueing delay + Service time = Response time
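The equation on slide 24 can be worked through with a few numbers. The sketch below assumes a single worker draining a first-in, first-out queue with a constant service time, which is a simplification chosen for illustration; the function names and the 50 ms figure are hypothetical.

```javascript
// Response time = queueing delay + service time (the equation from the slide).
function responseTimeMs(queueingDelayMs, serviceTimeMs) {
  return queueingDelayMs + serviceTimeMs;
}

// Assumption: one FIFO worker and a constant service time, so a job waits
// for every job ahead of it to be served in full.
function queueingDelayMs(jobsAhead, serviceTimeMs) {
  return jobsAhead * serviceTimeMs;
}

const serviceTime = 50; // Hypothetical: each job takes 50 ms to process.
const results = [0, 10, 100].map((jobsAhead) =>
  responseTimeMs(queueingDelayMs(jobsAhead, serviceTime), serviceTime)
);
console.log(results); // [ 50, 550, 5050 ]
```

The service time never changes in this example, yet the response time grows a hundredfold, which is the mental model the equation is meant to build: when something is slow, the time is often spent waiting in line rather than doing work.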
  25. Idempotence: f(f(x)) = f(x). A useful property for tasks in a queue, hidden behind a big word.
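As a small illustration of f(f(x)) = f(x), the sketch below compares an operation that sets a value with one that increments a value. The functions `markPaid`, `addCredit`, and `isIdempotentFor` are hypothetical names made up for this example, and the check only tests one input rather than proving the property in general.

```javascript
// Idempotent: setting a status twice leaves the same result as setting it once.
const markPaid = (order) => ({ ...order, status: 'paid' });

// Not idempotent: every application changes the result again.
const addCredit = (account) => ({ ...account, balance: account.balance + 10 });

// Checks f(f(x)) = f(x) for a single input by comparing serialized results.
function isIdempotentFor(f, x) {
  return JSON.stringify(f(f(x))) === JSON.stringify(f(x));
}

console.log(isIdempotentFor(markPaid, { id: 1, status: 'new' })); // true
console.log(isIdempotentFor(addCredit, { id: 1, balance: 0 })); // false
```

This matters for queues because many queueing systems may deliver the same task more than once; a task shaped like `markPaid` tolerates a redelivery, while one shaped like `addCredit` does not.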
  26-27. Idempotence (non-idempotent version):

        const sgMail = require('@sendgrid/mail');

        exports.nonIdempotentEmailFunction = (event) => {
          const message = event.data;

          // Send email.
          sgMail.setApiKey(...);
          sgMail.send({..., text: message});
        };
  28-31. Idempotence (idempotent version):

        const sgMail = require('@sendgrid/mail');
        const db = nosql.database();

        exports.idempotentEmailFunction = (event) => {
          const message = event.data;
          const eventId = event.id;
          const emailRef = db.collection('sentEmails').doc(eventId);

          return shouldSend(emailRef).then(send => {
            if (send) {
              // Send email.
              sgMail.setApiKey(...);
              sgMail.send({..., text: message});
              return markSent(emailRef);
            }
          });
        };

        function shouldSend(emailRef) {
          return emailRef.get().then(emailDoc => {
            return !emailDoc.exists || !emailDoc.data().sent;
          });
        }

        function markSent(emailRef) {
          return emailRef.set({sent: true});
        }
  32. Humming: Let’s hear it for queues!
  33. None
  34. “It’s slow”: The hardest problem you’ll ever debug.
  35-37. Latency Numbers (https://gist.github.com/jboner/2841832)

        L1 cache reference ......................... 0.5 ns
        Branch mispredict ............................ 5 ns
        L2 cache reference ........................... 7 ns
        Mutex lock/unlock ........................... 25 ns
        Main memory reference ...................... 100 ns
        Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
        Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
        SSD random read ........................ 150,000 ns = 150 µs
        Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs
        Round trip within same datacenter ...... 500,000 ns = 0.5 ms
        Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms
        Disk seek ........................... 10,000,000 ns = 10 ms
        Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
        Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms
  38-40. Humanized Latency Numbers (https://gist.github.com/hellerbarde/2843375)

        L1 cache reference                    0.5 s        One heart beat (0.5 s)
        Branch mispredict                     5 s          Yawn
        L2 cache reference                    7 s          Long yawn
        Mutex lock/unlock                     25 s         Making a coffee
        Main memory reference                 100 s        Brushing your teeth
        Compress 1K bytes with Zippy          50 min       One episode of a TV show
        Send 2K bytes over 1 Gbps network     5.5 hr       Lunch to end of work day
        SSD random read                       1.7 days     A normal weekend
        Read 1 MB sequentially from memory    2.9 days     A long weekend
        Round trip within same datacenter     5.8 days     A medium vacation
        Disk seek                             16.5 weeks   A semester in university
        Read 1 MB sequentially from disk      7.8 months   Producing a new human being
        The above 2 together                  1 year
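The humanized figures follow from the raw latency numbers by stretching every nanosecond into a second (0.5 ns becomes 0.5 s, 100 ns becomes 100 s). A small sketch of that conversion is below; the `humanize` function and its unit thresholds are hypothetical choices made for this example, not something taken from either gist.

```javascript
// Scale implied by the two tables: 1 ns of machine time becomes 1 s of human time.
function humanize(nanoseconds) {
  const seconds = nanoseconds; // 1 ns -> 1 s
  if (seconds < 60) return `${seconds} s`;
  if (seconds < 3600) return `${(seconds / 60).toFixed(0)} min`;
  if (seconds < 86400) return `${(seconds / 3600).toFixed(1)} hr`;
  return `${(seconds / 86400).toFixed(1)} days`;
}

console.log(humanize(0.5)); // "0.5 s" (L1 cache reference)
console.log(humanize(3000)); // "50 min" (compress 1K bytes with Zippy)
console.log(humanize(150000)); // "1.7 days" (SSD random read)
console.log(humanize(500000)); // "5.8 days" (round trip within same datacenter)
```

Results can differ from the slide in the last digit because this sketch rounds to the nearest tenth; for example, 20,000 ns comes out as 5.6 hr here, where the slide lists 5.5 hr.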
  41. Percentiles: Just focusing on the mean is mean.
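To make the point about means concrete, here is a minimal sketch that compares the mean with the 50th and 99th percentiles of a made-up latency sample. The data and function names are hypothetical, and the percentile uses the nearest-rank method, which is one of several common definitions.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical sample: 95 requests at 20 ms and 5 requests at 2000 ms.
const latenciesMs = [...Array(95).fill(20), ...Array(5).fill(2000)];

console.log(mean(latenciesMs)); // 119
console.log(percentile(latenciesMs, 50)); // 20
console.log(percentile(latenciesMs, 99)); // 2000
```

In this sample no request actually took anywhere near 119 ms: most users saw 20 ms and a few saw two full seconds, which is what the percentiles reveal and the mean hides.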

  42. Be Curious About the System: Strive to develop a mental model of the application and the architecture it resides on.
  43. Humming: Let’s hear it for things being slow!
  44. Thank you.