The Trouble with Distribution

Centralized applications are easy: your entire system lives in one physical location and you can reason about, vertically scale, and manage your system with minimal friction. Unfortunately, applications aren’t built this way anymore. Our systems are distributed, have external dependencies, and may even have to be geographically redundant.

Dealing with distribution is a must at Fastly, where our applications are deployed all over the world and must be highly performant and resilient. But there are some inherent challenges related to designing and building systems that scale. In this talk we’ll go over the key lessons we learned while building our Image Optimization service (https://www.fastly.com/io): what worked, what didn’t, the tradeoffs we made, and what you can do as a systems engineer to learn from our experiences while building your own applications.

Ines Sombra

May 18, 2017

Transcript

  1. Today’s Agenda: Conclusions & takeaways. Context setting blah blah blah. Hard things. Random pug photos. Ines fast talking.
  2. A sample site: img-large, img-thumb, img-thumb, img-thumb, img-thumb, img-xs, img-xs
  3. Pre-processing Tradeoffs: Good open source options available for DIY. Scale and tailor pre-processing pipeline as you need. Setup, hosting, and operability burden on you. Increases number of stack components. Variability (1-2 vs all). Costly & takes time.
  4. Crop with aspect ratio: http://www.fastly.io/gordo.jpg?crop=10:7,offset-x80&width=800 (800 × 560 px, Quality 85, 47 KB vs 2.3MB). Crop the image square & resize the width to 200px: http://www.fastly.io/gordo.jpg?crop=1:1&width=200 (200 × 200 px, Quality 85, 6.93KB vs 2.3MB JPEG).
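The output sizes quoted on this slide are consistent with simple aspect-ratio arithmetic: a 10:7 crop at width 800 gives 800 × 560, and a 1:1 crop at width 200 gives 200 × 200. The sketch below only reproduces that arithmetic from the two URLs on the slide. The helper name `output_size` is hypothetical, and the assumption that the ratio is the first comma-separated part of `crop` is inferred from these two examples, not taken from the service's documentation.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlsplit


def output_size(url: str) -> Tuple[int, int]:
    """Return the (width, height) implied by crop=W:H and width=N in a URL.

    Hypothetical helper: it mirrors the numbers on the slide and is not
    a description of how the service itself computes dimensions.
    """
    query = parse_qs(urlsplit(url).query)
    width = int(query["width"][0])
    # Assumption: the aspect ratio is the first comma-separated part of
    # "crop"; anything after it (such as an offset) is ignored here.
    ratio = query["crop"][0].split(",")[0]
    ratio_w, ratio_h = (int(part) for part in ratio.split(":"))
    height = round(width * ratio_h / ratio_w)
    return width, height


wide = output_size("http://www.fastly.io/gordo.jpg?crop=10:7,offset-x80&width=800")
square = output_size("http://www.fastly.io/gordo.jpg?crop=1:1&width=200")
print(wide, square)
```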
  5. “A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be individually solved” (Kshemkalyani & Singhal, “Distributed Computing: Principles, Algorithms, and Systems”)
  6. Our System’s Goals: Store a single large original. Perform many transformations on the fly. Cache everything. Do this fast!
  7. ImageOpto Tradeoffs: Parallel request processing. Fully distributed. Relying on CDN for targeted functionality. Stateless. We have to deal with different kinds of parallelization. Increased system complexity. No SPOF but many failure domains.
  8. Strategies from Literature: Break into subsystems. Randomness to make the worst-case & average-case the same. Only provide strong consistency for the subsystems that need it. 1999
  9. CDN for logging. CDN for authentication. CDN for request management. CDN for state management. CDN for purging. CDN for API translation. CDN for doing less work! Orthogonality / Composability
  10. Meeting System Goals: Use orthogonality to meet your system’s goals. Graceful degradation under faults is a goal too. Simple but composable operations are a good design aesthetic.
  11. System Design & Visibility: Decodes from an uncompressed format are faster. About 90% reduction in processing time by saving the decoded image into an uncompressed format. Let’s use this strategy to provide speedier transformations!
  12. System Design & Visibility [Diagram labels: Origin, Shield Varnish, Image Optimizers, Varnish (×6), Edge Varnish] YASS!
  13. Bitmap Cache Tradeoffs: Faster requests due to decoding image once & caching it. Use shielding to increase the chance of getting a HIT for a given resource. Complex request cycle: “what happened in this request?” was difficult to answer. Starting with a non-trivial solution takes a lot of time. Unexpected interactions with purging & shielding.
  14. No Bitmap Cache Tradeoffs: Fast enough requests. Simplified architecture. Simpler request path. Purging works! Not the theoretical fastest. Needed a few new caching features. We wasted a lot of time & effort.
  15. Visibility & metadata: Stripping image metadata reduces the file size. Smaller files are faster to transform & deliver. Metadata has quality-impacting information. EXIF for image orientation. ICC profiles define color attributes / viewing requirements.
  16. Jeff Hodges, “Notes on Distributed Systems for Young Bloods”: “‘It’s slow’ might mean: one or more of the systems involved in performing a request is slow… one or more of the parts of a pipeline of transformations is slow. ‘It’s slow’ is hard, in part, because the problem statement doesn’t provide many clues to location of the flaw and, until the degradation becomes very obvious, you won’t receive as many resources (time, money, & tooling) to solve it.”
  17. System Visibility & debugging: Image Optimizers. Image size & format. Utilization of hardware resources: CPUs, RAM. Concurrency of used libraries. The network can make you slow. Things you didn’t know you had or that you were not doing well can make you slow.
  18. Logging in distributed systems: usefulness, verbosity, aggregation, etc. System health & failure detectors. Inspection of code & libraries used. varnishlog, grep, & a whole lotta Jed. Tools in place
  19. System Visibility & debugging

      SCOPE  FORMAT       SIZE     AVG Response Before (ms)  AVG Response After (ms)  Response time decrease
      FR     WebP         250x250  95                        44                       54%
      FR     WebP         72x72    80                        40                       50%
      FR     WebP         641x641  200                       170                      15%
      FR     JPEG         250x250  95                        150                      -58%
      FR     JPEG         72x72    80                        40                       50%
      FR     JPEG         641x641  200                       170                      15%
      INTER  WebP + JPEG  All      750                       250                      67%
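The "Response time decrease" column on this slide matches the usual relative-change formula, (before - after) / before, rounded to a whole percent. The short sketch below recomputes it from the before/after figures; the function name `percent_decrease` is ours, not something from the talk. A negative value means the response got slower, as in the JPEG 250x250 row.

```python
def percent_decrease(before_ms: float, after_ms: float) -> int:
    """Relative decrease in response time, as a rounded percentage."""
    return round((before_ms - after_ms) / before_ms * 100)


# (before ms, after ms) pairs taken from the slide's table.
rows = [(95, 44), (80, 40), (200, 170), (95, 150), (750, 250)]
decreases = [percent_decrease(before, after) for before, after in rows]
print(decreases)
```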
  20. No Global System Visibility: Ability to reason about your system’s goals & its dependencies is key. Tracking and fixing “slow” is an ongoing activity. Seemingly small amounts of performance variability in critical components quickly add up to create less than ideal conditions* (* Ilya Grigorik - Building Fast & Resilient Web Applications)
  21. Geographical Distribution Tradeoffs: Dedicated hardware-based ImageOpto POPs. Beefy machines with tons of cores & RAM. Fast network connectivity. Harder to dynamically grow the service with customer demand. Cannot accommodate customers with origins not in USA.
  22. Resilience & Mixed Mode: Fastly-IO-Info Header. ETags are a function of the server handling a particular request. Use HTTP ETags to guard against output encoding changes.
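One way to read "use HTTP ETags to guard against output encoding changes" is that the validator should change whenever the bytes a server would produce change, so a cached copy from one encoder is never revalidated against another. The sketch below is an illustration under that assumption, not the service's actual scheme: `make_etag`, `not_modified`, and the `encoder_id` string are all hypothetical, and the `If-None-Match` comparison is simplified to a single strong validator.

```python
import hashlib
from typing import Optional


def make_etag(body: bytes, encoder_id: str) -> str:
    """Build a strong ETag from the output bytes plus an encoder identifier.

    Assumption: encoder_id names whatever produced the bytes (for example a
    library version), so two servers with different encoders never share a tag.
    """
    digest = hashlib.sha256(encoder_id.encode("utf-8") + b"\0" + body).hexdigest()
    return '"' + digest[:16] + '"'


def not_modified(if_none_match: Optional[str], current_etag: str) -> bool:
    """True when a conditional request may be answered with 304.

    Simplified: real If-None-Match headers can list several tags,
    weak validators, or "*".
    """
    return if_none_match is not None and if_none_match == current_etag


old_tag = make_etag(b"image-bytes", "encoder-v1")
new_tag = make_etag(b"image-bytes", "encoder-v2")
print(not_modified(old_tag, new_tag))
```

With this shape, a client holding a tag minted by the old encoder gets a full response from a server running the new one instead of a 304 for bytes it never received.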
  23. Resilience & Operability: Complex operations make systems less resilient & more incident-prone. New systems/functionality tend to shake out new bugs. Expect everything to be awful (always) so try to isolate your failure domains.
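The talk does not say which mechanism was used to isolate failure domains. A circuit breaker is one common barrier, sketched here purely as an illustration: after a few consecutive failures it stops calling a dependency for a cool-down period, so a struggling component is not hammered by its callers. The class name, thresholds, and injectable `clock` parameter are all assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised while the breaker is refusing calls to a failing dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the behaviour can be tested
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency temporarily isolated")
            # Cool-down elapsed: allow a trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A caller would wrap each request to a dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to degrade gracefully, for example by serving the unoptimized original.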
  24. Operability Tradeoffs: Pay-as-you-go investment model in system operability/resilience. Redundancies are key. Less to do the less your API does. Corners cut here will come back at the most inopportune time. Complexity sometimes is unavoidable. Cannot be bolted on.
  25. Adding resilience may come at the cost of other desired goals (e.g. time, performance, simplicity, cost, etc). Dependencies are hard: customer setup, customer inputs, caching layer, libraries, and other systems. We have to be resilient to all of them. Designing for operability increases robustness. Ensuring System Resilience
  26. Tradeoffs are made in context and should be revisited often. Goal tunnel vision may lead you to work harder on the wrong solution. A narrow API that grows later is great, especially in early phases. It’s all about tradeoffs
  27. Our tradeoffs in hindsight: GA’ed in April. System evolving & growing. Operability, performance, & increasing resilience are key. Used by companies like Airbnb, Nordstrom Rack, Beatport, Gannett, LaRedoute, 1stdibs, Surfdome, and more! www.fastly.com/io
  28. tl;dr. DESIGN: Simple utilitarian design helps you meet system goals. Few composable operations & expand API later. Keep global system context in mind! VISIBILITY: Ability to reason about what’s happening in your system is key. Use logging, request tracing, & system instrumentation. Many perspectives. RESILIENCE: Hardening system against failure domains. Have barriers to contain cascading failures. Operability design matters.
  29. Thank you! github.com/Randommood/TroubleWithDistribution @Randommood Special thanks to: Tyler McMullen, Jed Denlea, Adam Thomason, Ian Fung, Joao Taveira, Ezekiel Templin, Ashok Lalwani, Matt Whiteley, Kyle Kingsbury, Peter Bourgon, Camille Fournier, Caitie McCaffrey, Lorenzo Saino, Elaine Greenberg, & Greg Bako.