The Trouble with Distribution

Centralized applications are easy: your entire system lives in one physical location and you can reason about, vertically scale, and manage your system with minimal friction. Unfortunately, applications aren’t built this way anymore. Our systems are distributed, have external dependencies, and may even have to be geographically redundant.

Dealing with distribution is a must at Fastly, where our applications are deployed all over the world and must be highly performant and resilient. But there are some inherent challenges related to designing and building systems that scale. In this talk we’ll go over the key lessons we learned while building our Image Optimization service (https://www.fastly.com/io): what worked, what didn’t, the tradeoffs we made, and what you can do as a systems engineer to learn from our experiences while building your own applications.

Ines Sombra

May 18, 2017

Transcript

  1. The trouble with Distribution (and pugs)

  2. Hi, I’m Inés Sombra! @Randommood

  3. Globally Distributed & Highly available

  4. Today’s Agenda: Context setting, Hard things, Conclusions & takeaways,

    Random pug photos, Ines fast talking, blah blah blah
  5. TRADEOFFS

  6. IMAGEOPTO

  7. Meet Gordo

  8. None
  9. JPEG 1920 × 1080 px Max quality 2.3MB

  10. A sample site: img-large, img-thumb, img-thumb,

    img-thumb, img-thumb, img-xs, img-xs
  11. Making this site faster Many image sizes stored in your

    origin
  12. Pre-processing Tradeoffs: Good open source options available for DIY; Scale

    and tailor pre-processing pipeline as you need; Setup, hosting, and operability burden on you; Increases number of stack components; Variability (1-2 vs all); Costly & takes time
  13. Crop with aspect ratio: http://www.fastly.io/gordo.jpg?crop=10:7,offset-x80&width=800 (JPEG 800 × 560 px, Quality 85, 47 KB vs 2.3MB)

    Crop the image square & resize the width to 200px: http://www.fastly.io/gordo.jpg?crop=1:1&width=200 (200 × 200 px, Quality 85, 6.93KB vs 2.3MB)
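
The two URLs on slide 13 differ only in their query parameters, so building them can be sketched with the standard library. This is a minimal illustration, not Fastly's client code: the helper name `io_url` is hypothetical, and only the `crop` and `width` parameters shown on the slide are used.

```python
from urllib.parse import urlencode


def io_url(base: str, **params: object) -> str:
    """Append image-transformation parameters to an image URL (hypothetical helper)."""
    # Keep ':' and ',' literal so values like crop=10:7,offset-x80 match the slide.
    query = urlencode(params, safe=":,")
    return f"{base}?{query}" if query else base


# Both examples from the slide, derived from the same single original image.
square_thumb = io_url("http://www.fastly.io/gordo.jpg", crop="1:1", width=200)
wide_crop = io_url("http://www.fastly.io/gordo.jpg", crop="10:7,offset-x80", width=800)

print(square_thumb)
print(wide_crop)
```

Because every variant is addressed by URL alone, each one can be cached independently while the origin stores only the large original.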
  14. Hard Things

  15. Centralized vs Distributed computation

  16. Centralized vs Distributed computation

  17. Centralized vs Distributed computation

  18. Centralized vs Distributed computation

  19. “A distributed system is a collection of independent entities that

    cooperate to solve a problem that cannot be individually solved” Kshemkalyani & Singhal, “Distributed Computing: Principles, Algorithms, and Systems”
  20. Meeting system Goals

  21. Our System’s Goals Store a single large original Perform many

    transformations on the fly Cache everything Do this fast!
  22. Your origin & its images gordo-thumb please! Our System’s Goals

  23. Your origin & its images thanks! Our System’s Goals 2.3

    MB 6.3 KB
  24. Serial vs Parallel Computation

  25. Serial vs Parallel Computation

  26. Image Optimizers

  27. Image Optimizers

  28. Image Optimizers gordo-thumb please!

  29. Image Optimizers gordo-thumb please! original gordo please!

  30. Image Optimizers gordo-thumb please! there you go

  31. Image Optimizers gordo-thumb please! gordo-thumb please!

  32. Image Optimizers gordo-thumb please! Done!

  33. Image Optimizers gordo-thumb please! there you go

  34. Image Optimizers YASS!

  35. Image Optimizers gordo-thumb please!

  36. Image Optimizers YASS!

  37. ImageOpto Tradeoffs Parallel request processing Fully distributed Relying on CDN

    for targeted functionality Stateless We have to deal with different kinds of parallelization Increased system complexity No SPOF but many failure domains
  38. Strategies from Literature Break into subsystems Randomness to make the

    worst-case & average-case the same Only provide strong consistency for the subsystems that need it 1999
  39. CDN for logging CDN for authentication CDN for request management

    CDN for state management CDN for purging CDN for API translation CDN for doing less work! Orthogonality / Composability
  40. Meeting System Goals Use orthogonality to meet your system’s goals

    Graceful degradation under faults is a goal too Simple but composable operations is a good design aesthetic
  41. Pug Tradeoff #1 About 90 dB!

  42. Global Visibility

  43. Centralized vs Distributed insight vs ?

  44. System Design & Visibility Image Optimizers Cost of decoding an

    image Caches Varnish
  45. System Design & Visibility

  46. System Design & Visibility

  47. System Design & Visibility Decodes from an uncompressed format are

    faster About 90% reduction in processing time by saving the decoded image into an uncompressed format Let’s use this strategy to provide speedier transformations!
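
The idea on slide 47 (decode once, keep the uncompressed form, reuse it for later transformations) can be sketched as a small cache. This is a toy under stated assumptions: `zlib` stands in for an expensive JPEG/WebP decode, and `DecodedImageCache` and `thumbnail` are hypothetical names, not part of the real service.

```python
import zlib


class DecodedImageCache:
    """Keep the decoded (uncompressed) form of each original so it is decoded once."""

    def __init__(self):
        self._decoded = {}      # key -> uncompressed bytes
        self.decode_count = 0   # how many expensive decodes actually ran

    def pixels(self, key, encoded):
        if key not in self._decoded:
            # Stand-in for the costly image decode step.
            self._decoded[key] = zlib.decompress(encoded)
            self.decode_count += 1
        return self._decoded[key]


def thumbnail(pixels, step):
    """Toy transformation: keep every step-th byte."""
    return pixels[::step]


original = zlib.compress(bytes(range(256)) * 64)  # 16384 bytes once decoded
cache = DecodedImageCache()

# Two different transformations of the same original share one decode.
small = thumbnail(cache.pixels("gordo.jpg", original), 4)
tiny = thumbnail(cache.pixels("gordo.jpg", original), 16)

print(cache.decode_count, len(small), len(tiny))
```

The sketch shows only the saving in decode work; it says nothing about the purging and shielding interactions that later slides list as costs of this design.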
  48. Image Optimizers Caches Varnish Varnish Varnish Varnish Varnish Varnish Varnish

    System Design & Visibility
  49. Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish Edge Varnish

    Origin System Design & Visibility Shield Varnish
  50. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  51. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  52. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  53. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  54. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  55. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  56. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  57. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  58. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin
  59. Shield Varnish Image Optimizers Varnish Varnish Varnish Varnish Varnish Varnish

    Edge Varnish System Design & Visibility Origin YASS!
  60. Bitmap Cache Tradeoffs Faster requests due to decoding image once

    & caching it Use shielding to increase the chance of getting a HIT for a given resource Complex request cycle: what happened in this request? was difficult to answer Starting with a non-trivial solution takes a lot of time Unexpected interactions with purging & shielding
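
Why shielding raises the chance of a HIT can be shown with a toy cache hierarchy. This is a sketch, not Varnish behavior in detail: `Cache` and `Origin` are hypothetical classes, and eviction, TTLs, and purging are deliberately left out.

```python
class Origin:
    """Stand-in for the customer origin; counts how often it is asked for an object."""

    def __init__(self):
        self.fetches = 0

    def get(self, key):
        self.fetches += 1
        return f"bytes of {key}"


class Cache:
    """A cache node that asks its parent (another cache or the origin) on a MISS."""

    def __init__(self, parent):
        self.parent = parent
        self.store = {}

    def get(self, key):
        if key not in self.store:              # MISS: go one hop closer to the origin
            self.store[key] = self.parent.get(key)
        return self.store[key]                 # HIT on every later request


# Without a shield, every edge that misses goes to the origin itself.
direct_origin = Origin()
for edge in [Cache(direct_origin) for _ in range(3)]:
    edge.get("gordo.jpg?width=200")

# With a shield, the first edge miss fills the shield and the other edges hit it.
shielded_origin = Origin()
shield = Cache(shielded_origin)
for edge in [Cache(shield) for _ in range(3)]:
    edge.get("gordo.jpg?width=200")

print(direct_origin.fetches, shielded_origin.fetches)
```

The extra hop is also where the "what happened in this request?" question gets harder: a response may now have been served by the edge, the shield, or the origin.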
  61. None
  62. Shield Varnish Edge Varnish Designing system Visibility (again) Image Optimizers

    Origin
  63. Edge Varnish Shield Varnish Designing system Visibility (again) Image Optimizers

    Origin
  64. Shield Varnish Edge Varnish Designing system Visibility (again) Image Optimizers

    Origin
  65. Shield Varnish Edge Varnish Designing system Visibility (again) Image Optimizers

    Origin
  66. Shield Varnish Edge Varnish Designing system Visibility (again) Image Optimizers

    Origin
  67. Shield Varnish Edge Varnish Designing system Visibility (again) Image Optimizers

    Origin
  68. Shield Varnish Edge Varnish Designing system Visibility (again) Image Optimizers

    Origin
  69. Shield Varnish Edge Varnish Designing system Visibility (again) Image Optimizers

    Origin
  70. Edge Varnish Shield Varnish Designing system Visibility (again) Image Optimizers

    Origin YASS!
  71. No Bitmap Cache Tradeoffs Fast enough requests Simplified architecture Simpler

    request path Purging works! Not the theoretical fastest Needed a few new caching features We wasted a lot of time & effort
  72. Visibility & metadata

  73. Visibility & metadata Stripping image metadata reduces the file size

    Smaller files are faster to transform & deliver Metadata has quality impacting information EXIF for image orientation ICC profiles define color attributes / viewing requirements
  74. Visibility & ICC metadata

  75. “It’s slow”

  76. Jeff Hodges, “Notes on Distributed Systems for Young Bloods”: “‘It’s

    slow’ might mean: one or more of the systems involved in performing a request is slow… one or more of the parts of a pipeline of transformations is slow. ‘It’s slow’ is hard, in part, because the problem statement doesn’t provide many clues to location of the flaw and, until the degradation becomes very obvious, you won’t receive as many resources (time, money, & tooling) to solve it.”
  77. System Visibility & debugging Image Optimizers Image size & format

    Utilization of hardware resources: CPUs, RAM Concurrency of used libraries The network can make you slow Things you didn’t know you had or that you were not doing well can make you slow
  78. Tools in place: Logging in distributed systems (usefulness, verbosity, aggregation, etc); System health

    & failure detectors; Inspection of code & libraries used; varnishlog, grep, & a whole lotta Jed
  79. System Visibility & debugging

    SCOPE  FORMAT       SIZE     AVG Response Before (ms)  AVG Response After (ms)  Response time decrease
    FR     WebP         250x250  95                        44                       54%
    FR     WebP         72x72    80                        40                       50%
    FR     WebP         641x641  200                       170                      15%
    FR     JPEG         250x250  95                        150                      -58%
    FR     JPEG         72x72    80                        40                       50%
    FR     JPEG         641x641  200                       170                      15%
    INTER  WebP + JPEG  All      750                       250                      67%
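
The "Response time decrease" figures on slide 79 are consistent with (before - after) / before, rounded to a whole percent. A short check of that arithmetic, using only the numbers from the slide (the function name is illustrative):

```python
def response_time_decrease(before_ms: float, after_ms: float) -> int:
    """Percentage decrease in average response time, rounded to a whole percent."""
    return round((before_ms - after_ms) / before_ms * 100)


# (scope, format, size, before ms, after ms) as listed on the slide.
rows = [
    ("FR", "WebP", "250x250", 95, 44),
    ("FR", "WebP", "72x72", 80, 40),
    ("FR", "WebP", "641x641", 200, 170),
    ("FR", "JPEG", "250x250", 95, 150),
    ("FR", "JPEG", "72x72", 80, 40),
    ("FR", "JPEG", "641x641", 200, 170),
    ("INTER", "WebP + JPEG", "All", 750, 250),
]

decreases = [response_time_decrease(before, after) for *_, before, after in rows]
for (scope, fmt, size, before, after), pct in zip(rows, decreases):
    print(f"{scope:5} {fmt:11} {size:8} {before:>4} -> {after:>4} ms  {pct:>4}%")
```

A negative value, as in the JPEG 250x250 row, means the average response time went up rather than down.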
  80. No Global System Visibility: Ability to reason about your system’s

    goals & its dependencies is key; Tracking and fixing “slow” is an ongoing activity; Seemingly small amounts of performance variability in critical components quickly add up to create less than ideal conditions (Ilya Grigorik, Building Fast & Resilient Web Applications)
  81. Pug Tradeoff #2 99.999% Available

  82. None
  83. Resilience

  84. Image Optimizers Edge Varnish System Design & Visibility

  85. Image Optimizers Edge Varnish X System Design & Visibility

  86. Resilience & Geographical distribution

  87. Geographical Distribution Tradeoffs: Dedicated hardware-based ImageOpto POPs; Beefy machines

    with tons of cores & RAM; Fast network connectivity; Harder to dynamically grow the service with customer demand; Cannot accommodate customers with origins not in USA
  88. IO GCP IO POP Resilience & Geographical distribution

  89. Resilience & Operability Image Optimizers

  90. Resilience & Operability Image Optimizers

  91. Resilience & Operability Image Optimizers

  92. Resilience & Operability Image Optimizers

  93. Resilience & Operability Image Optimizers

  94. Resilience & Operability Image Optimizers

  95. Resilience & Mixed Mode Fastly-IO-Info Header ETags are a function

    of the server handling a particular request Use HTTP ETags to guard against output encoding changes
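
One way to read slide 95: if servers running different encoder builds can answer the same request during a mixed-mode rollout, an ETag that depends on the encoder guards against a client revalidating bytes that a different build would now produce. The slides do not say how the service derives its ETags, so the scheme below is an assumption, and `make_etag` and all its inputs are hypothetical.

```python
import hashlib


def make_etag(source_digest: str, transform: str, encoder_version: str) -> str:
    """Derive a quoted ETag from everything that influences the output bytes."""
    material = "\n".join([source_digest, transform, encoder_version])
    return '"' + hashlib.sha256(material.encode("utf-8")).hexdigest()[:16] + '"'


def is_fresh(if_none_match: str, current_etag: str) -> bool:
    """True when a conditional request may be answered with 304 Not Modified."""
    return if_none_match == current_etag


# Same image and transformation, served by an old and a new encoder build.
old_build = make_etag("sha256-of-gordo.jpg", "crop=1:1&width=200", "encoder-1")
new_build = make_etag("sha256-of-gordo.jpg", "crop=1:1&width=200", "encoder-2")

print(is_fresh(old_build, old_build))  # same build: the cached copy is still valid
print(is_fresh(old_build, new_build))  # different build: send the full response
```

The design choice sketched here is to treat the encoder version as part of the resource's identity, so a change in output encoding invalidates cached copies without an explicit purge.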
  96. Resilience & Operability: Complex operations make systems less resilient &

    more incident-prone; New systems/functionality tend to shake new bugs; Expect everything to be awful (always) so try to isolate your failure domains
  97. Operability Tradeoffs: Pay-as-you-go investment model in system operability/resilience; Redundancies

    are key; Less to do the less your API does; Corners cut here will come back at the most inopportune time; Complexity sometimes is unavoidable; Cannot be bolted on
  98. Adding resilience may come at the cost of other desired

    goals (e.g. time, performance, simplicity, cost, etc) Dependencies are hard: customer setup, customer inputs, caching layer, libraries, and other systems. We have to be resilient to all of them Designing for operability increases robustness Ensuring System Resilience
  99. Parting Thoughts

  100. It’s all about tradeoffs: Tradeoffs are made in context and should be revisited often;

    Goal tunnel vision may lead you to work harder on the wrong solution; A narrow API that grows later is great, especially in early phases
  101. Our tradeoffs in hindsight GA’ed in April System evolving &

    growing Operability, performance, & increasing resilience are key Used by companies like Airbnb, Nordstrom Rack, Beatport, Gannett, LaRedoute, 1stdibs, Surfdome, and more! www.fastly.com/io
  102. tl;dr DESIGN VISIBILITY RESILIENCE Simple utilitarian design helps you meet

    system goals Few composable operations & expand API later Keep global system context in mind! Ability to reason about what’s happening in your system is key Use logging, request tracing, & system instrumentation Many perspectives Hardening system against failure domains Have barriers to contain cascading failures Operability design matters
  103. Thank you! github.com/Randommood/TroubleWithDistribution @Randommood Special thanks to: Tyler McMullen, Jed

    Denlea, Adam Thomason, Ian Fung, Joao Taveira, Ezekiel Templin, Ashok Lalwani, Matt Whiteley, Kyle Kingsbury, Peter Bourgon, Camille Fournier, Caitie McCaffrey, Lorenzo Saino, Elaine Greenberg, & Greg Bako.