Handling a tremendous amount of images with Fastly / Yamagoya Traverse 2020

Tatsuhiko Kubo
November 26, 2020

These are the slides for my talk on Day 2 of Yamagoya Traverse 2020.

https://www.fastly.jp/yamagoya2020

Transcript

  1. Handling a tremendous amount of images with Fastly
    Tatsuhiko Kubo @cubicdaiya
    Yamagoya Traverse 2020
    2020/11/26

  2. What Is Mercari?
    • Service start: July 2013
    • OS: Android, iOS
      *Can also be accessed by web browsers
    • Usage fee: Free
      *Commission fee for sold items: 10% of the sales price
    • Regions/languages supported: Base specs for Japan/Japanese
    • Total number of listings to date: More than 1.5 billion
    Many sellers enjoy having the items they no longer need
    purchased and used by buyers who need them, and buyers enjoy
    the feeling of hunting for treasure as they search through unique
    and diverse items for lucky finds. In addition to buying and selling,
    users actively communicate through the buyer/seller chat and the
    “Like” feature.
    The Mercari app is a C2C marketplace where individuals can
    easily sell used items. We want to provide both buyers and sellers
    with a service where they can enjoy safe and secure transactions.
    Mercari offers a unique customer experience, with a transaction
    environment that uses the payments Mercari holds in escrow, and
    simple and affordable shipping options.

  3. GitHub, Twitter: @cubicdaiya
    Name: Tatsuhiko Kubo
    Tech Lead, Network at Mercari, Inc.

  4. Responsibilities of the Network team
    • Ensure the reliability of Mercari's edge systems

    • CDN, TLS, DNS, Load Balancing, Reverse Proxy, …

    • Networking for Cloud and On-Premises

    • Routing between multiple DCs, Cloud Interconnect, …

    • Service Mesh

    • Istio, mTLS, …

  5. Topics
    • Fastly in Mercari

    • System architecture of Mercari with Fastly

    • CI/CD pipeline with Fastly

    • Monitoring with Fastly

    • Mercari Image Delivery with Fastly

    • Keeping high Cache Hit Ratio

    • Optimizing Images

    • Automating cache purge

  6. Fastly in Mercari

  7. Fastly in Mercari
    • Both static and dynamic content is handled with Fastly

    • Images (item photo, user profile photo, …)

    • Static assets (JavaScript, CSS, …)

    • API / Web

  8. Fastly in Mercari
    • Scale of traffic

    • 300k+ RPS at peak

    • 20+ Gbps at peak

    • Other stats

    • 40+ services

    • 10+ TLS domains

    • Images account for 80+% of total traffic volume

  9. Edge of Mercari JP Infrastructure
    API/Web
    Static assets (js, css, etc.)
    Image
    ImageFlux, Amazon S3
    Cloud Load Balancing
    GKE
    GCS

  10. CI/CD pipeline for Fastly
    Pull Request → Run CI → Terraform plan/apply → Configure
    Store tfstate: GCS

  11. Monitoring Fastly metrics

  12. Datadog Integration with Fastly
    https://docs.datadoghq.com/integrations/fastly/

  13. Datadog Integration with Fastly
    • Fastly metrics can be shown and customized on Datadog

    • e.g. hit_ratio, requests, bandwidth, status_4xx, status_5xx, etc…

    • Advantages of Datadog Integration with Fastly

    • Easy to integrate (only a Fastly API token and Fastly Service IDs
    need to be registered)

    • We can combine multiple metrics to create custom metrics

  14. CI/CD pipeline for Datadog dashboard and monitor
    Pull Request → Run CI by GitHub Actions → Terraform plan/apply → Configure
    Store tfstate: GCS

  15. Mercari Image Delivery

  16. Images on Mercari app
    • Images are the main content on a lot of screens

    • Timeline, Search Results, Recommended Items

    • Liked Items, Browse Item History

    • Item Details, …
    Timeline Item Details
    → A lot of images are displayed

  17. A tremendous amount of images are delivered from CDN
    • Mercari JP

    • Total number of listings to date: More than 1.5 billion

    • Up to 10 photos can be uploaded per listed item

    • Item photos displayed in the Mercari app are resized and
    converted from JPEG to WebP on the fly

    • Cached objects on the CDN keep increasing
    → The number of images handled by the CDN snowballs

  18. Mercari Image Delivery in JP
    ImageFlux Amazon S3

  19. Mercari Image Delivery in US
    Amazon S3

  20. Mercari Image Delivery in US
    Amazon S3
    +
    Image Optimizer

  21. Fastly Image Optimizer in Mercari US
    • Originally, we used an internal image conversion proxy in Go

    • To resize, crop, convert format, … on-the-fly

    • We switched to Fastly Image Optimizer in 2018

    • Fastly VCL let us keep our original manipulation rules through the switch


    sub vcl_recv {
      # Absorb the difference between our proxy and Image Optimizer:
      # rename the proxy's short query parameters to Image Optimizer's names.
      set req.url = regsub(req.url, "([&\?])w=([0-9]+)", "\1width=\2");
      set req.url = regsub(req.url, "([&\?])h=([0-9]+)", "\1height=\2");
      set req.url = regsub(req.url, "([&\?])fmt=([a-z]+)", "\1format=\2");
    }
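
The same parameter renaming can be sketched outside VCL. The following is only an illustration in Python, not the code Mercari runs; the function name `rewrite_image_url` and the sample URL are assumptions made for this example. Like `regsub`, it rewrites only the first match of each parameter.

```python
import re

# Proxy-style short parameter -> (Image Optimizer name, allowed value pattern),
# mirroring the three regsub rules in the VCL above.
PARAM_RULES = {
    "w": ("width", "[0-9]+"),
    "h": ("height", "[0-9]+"),
    "fmt": ("format", "[a-z]+"),
}


def rewrite_image_url(url: str) -> str:
    """Rename w/h/fmt query parameters; first occurrence only, like regsub."""
    for short_name, (long_name, value_pattern) in PARAM_RULES.items():
        pattern = rf"([&?]){short_name}=({value_pattern})"
        url = re.sub(pattern, rf"\g<1>{long_name}=\g<2>", url, count=1)
    return url


# Hypothetical request path using the old proxy's parameter names.
print(rewrite_image_url("/photos/1.jpg?w=200&h=100&fmt=webp"))
# prints /photos/1.jpg?width=200&height=100&format=webp
```

Requiring `&` or `?` before the parameter name keeps the `h=` rule from matching inside `width=`.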

  22. Our best practice for Image Delivery
    • Keep the Cache Hit Ratio (CHR) high in every case!

    • Enable Origin Shielding

    • Set long TTL in Cache-Control: max-age=…

    • Optimize images while keeping appropriate quality

    • Balance UX and cost saving

    • Pay attention to the image size distribution

    • Automate cache purge

  23. Origin Shielding

  24. Origin Shielding
    • Place a Shielding POP between the Edge POPs and the Origin

    • It absorbs cache misses from the Edge POPs

    • Official document

    • https://docs.fastly.com/en/guides/shielding

  25. Sandwich Shielding POP between Edge POP and Origin
    Edge POP
    Edge POP
    Edge POP
    Shielding POP
    ImageFlux Amazon S3
    Cache Hit on Edge POP
    Cache Hit on Shielding POP
    Cache Miss
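
The three outcomes in this diagram can be sketched as a two-tier lookup. This is a toy model in Python with plain dictionaries standing in for caches; the function name `fetch` and the outcome labels are assumptions for illustration, not Fastly behavior in detail.

```python
def fetch(url, edge_cache, shield_cache, origin):
    """Return (body, outcome) following the edge -> shield -> origin path."""
    if url in edge_cache:
        return edge_cache[url], "hit-edge"
    if url in shield_cache:
        body, outcome = shield_cache[url], "hit-shield"
    else:
        body, outcome = origin[url], "miss"
        shield_cache[url] = body  # the shield keeps a copy for other edge POPs
    edge_cache[url] = body  # the edge keeps a copy for its next request
    return body, outcome


# Two edge POPs share one shield; the origin is contacted only once.
origin = {"/a.jpg": b"image-bytes"}
shield, edge_tokyo, edge_osaka = {}, {}, {}
print(fetch("/a.jpg", edge_tokyo, shield, origin)[1])  # prints miss
print(fetch("/a.jpg", edge_osaka, shield, origin)[1])  # prints hit-shield
print(fetch("/a.jpg", edge_tokyo, shield, origin)[1])  # prints hit-edge
```

The second request misses at its own edge POP but never reaches the origin, which is the effect shielding is meant to provide.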

  26. Pros/Cons of Origin Shielding
    • Pros

    • Cache Hit Ratio improves significantly

    • Cons

    • Additional traffic fees are charged for the Shielding POP

  27. Cache Hit Ratio on Fastly-Stats

  28. Cache Hit Ratio on Fastly-Stats

  29. Cache Hit Ratio on Fastly-Stats
    HIT RATIO does not contain Shielding hits

  30. Cache Hit Ratio on Fastly Stats
    • HIT RATIO does not contain Shielding hits

    • The same applies to hit_ratio in Historical Stats

    • We need to calculate Cache Hit Ratio with Shielding by combining other
    metrics

  31. CHR Calculation (If Shielding is enabled)
    $$\text{Cache Hit Ratio (True)} = \left(1 - \frac{\text{miss} - \text{shield}}{\text{requests} - \text{shield}}\right) \times 100$$
    miss: Number of cache misses

    shield: Number of requests from edge to the shield POP

    requests: Number of Requests Processed
    The truth about cache hit ratios: https://www.fastly.com/blog/truth-about-cache-hit-ratios
    * This does not take into account some request states such as pass
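
As a worked example of the formula above, here is a minimal Python sketch. The function name and the sample counts are made up for illustration. The comments reflect one reading of the formula: every edge-to-shield request is also counted as an edge miss, so subtracting `shield` leaves only the misses that reached the origin.

```python
def true_cache_hit_ratio(requests: int, miss: int, shield: int) -> float:
    """CHR in percent: (1 - (miss - shield) / (requests - shield)) * 100."""
    client_requests = requests - shield  # remove edge-to-shield requests
    origin_misses = miss - shield        # remove misses the shield answered for
    if client_requests <= 0:
        raise ValueError("requests must be greater than shield")
    return (1 - origin_misses / client_requests) * 100


# Hypothetical counts: 1,100,000 requests processed, 120,000 misses,
# 100,000 requests sent from edge POPs to the shield POP.
chr_true = true_cache_hit_ratio(requests=1_100_000, miss=120_000, shield=100_000)
print(round(chr_true, 2))  # prints 98.0
```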

  32. CHR Calculation on Datadog

  33. CHR Calculation on Datadog
    widget {
      query_value_definition {
        autoscale   = false
        custom_unit = "%"
        precision   = 2
        request {
          aggregator = "avg"
          # (1 - (miss - shield) / (requests - shield)) * 100
          q = "(1-(avg:fastly.miss{${local.datadog_tag}}-avg:fastly.shield{${local.datadog_tag}})/(avg:fastly.requests{${local.datadog_tag}}-avg:fastly.shield{${local.datadog_tag}}))*100"
        }
        title = "Cache Hit Rate (True)"
      }
    }
    Terraforming
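
The query string in this widget is long enough that assembling it from parts can make it easier to check. The following Python sketch builds the same expression from the three metric names; the helper name `true_chr_query` and the tag value `service:images` are hypothetical and stand in for whatever `local.datadog_tag` holds.

```python
def true_chr_query(tag: str, aggregator: str = "avg") -> str:
    """Build the Datadog query for CHR including shield hits."""

    def metric(name: str) -> str:
        # e.g. avg:fastly.miss{service:images}
        return f"{aggregator}:fastly.{name}{{{tag}}}"

    miss, shield, requests = metric("miss"), metric("shield"), metric("requests")
    return f"(1-({miss}-{shield})/({requests}-{shield}))*100"


print(true_chr_query("service:images"))
```

Each metric is scoped by the same tag, so the ratio stays within one Fastly service.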

  34. Daily CHR (Mercari Image Delivery in JP)
    CHR with Shielding
    hit_ratio in Historical Stats

  35. Daily CHR (Mercari Image Delivery in US)
    CHR with Shielding
    hit_ratio in Historical Stats

  36. Impact of Origin Shielding
    • Mercari Image Delivery’s Cache Hit Ratio improves significantly

    • Approximately:

    • JP: 96.x% -> 98.x%

    • US: 60~70+% -> 80~90+%
    CHR for a given month when Shielding is enabled
    Cache Hit Rate(Edge): hit_ratio in Historical Stats
    Cache Hit Rate(True): CHR with Shielding

  37. Why is there such a big difference in CHR between JP and US?
    • The United States is larger than Japan

    • Fastly has more POPs in the United States than in Japan

    • As the number of POPs increases, CHR at the edge decreases

    • Japan: 3 POPs, North America: 20+ POPs

    • References

    • Fastly Network Map: https://www.fastly.com/network-map

    • Why having more POPs isn’t always better: https://www.fastly.com/blog/why-having-more-pops-isnt-always-better
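
A toy model can make the POP-count effect concrete. The Python sketch below assumes that each object receives the same number of requests, that requests are spread evenly over the POPs, and that every POP must miss once per object before it can hit. These are simplifying assumptions for illustration, not measurements from the talk, and the function names are invented here.

```python
def edge_chr(requests_per_object: float, pops: int) -> float:
    """Edge CHR in percent when each POP takes one cold miss per object."""
    cold_misses = min(pops, requests_per_object)
    return (1 - cold_misses / requests_per_object) * 100


def shielded_chr(requests_per_object: float) -> float:
    """With a shield, only one miss per object reaches the origin."""
    return edge_chr(requests_per_object, pops=1)


# Hypothetical workload: 60 requests per object before it expires.
for pops in (3, 20):
    print(pops, round(edge_chr(60, pops), 1), round(shielded_chr(60), 1))
```

In this model the edge ratio falls as the POP count grows, while the ratio seen by the origin behind a shield does not depend on the POP count at all.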

  38. Optimizing Images

  39. Image size distribution in Mercari Image Delivery in JP
    1k: ~1KB
    10k: 1KB~10KB
    100k: 10KB~100KB
    1m: 100KB~1MB
    10m: 1MB~10MB
    100m: 10MB~100MB
    1g: 100MB~1GB
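
The bucket boundaries in this legend can be expressed as a small classifier. The Python sketch below is an illustration only: it assumes binary units (1 KB = 1024 bytes), treats each upper bound as inclusive, and uses a made-up `over-1g` label for anything beyond the legend.

```python
from collections import Counter

KB = 1024  # assumption: the legend's KB/MB/GB are binary units

# (label, inclusive upper bound in bytes), smallest bucket first
SIZE_BUCKETS = [
    ("1k", 1 * KB),
    ("10k", 10 * KB),
    ("100k", 100 * KB),
    ("1m", 1024 * KB),
    ("10m", 10 * 1024 * KB),
    ("100m", 100 * 1024 * KB),
    ("1g", 1024 * 1024 * KB),
]


def bucket_for(size_bytes: int) -> str:
    """Return the legend label of the first bucket that fits the size."""
    for label, upper_bound in SIZE_BUCKETS:
        if size_bytes <= upper_bound:
            return label
    return "over-1g"  # hypothetical label; the legend stops at 1GB


def size_distribution(sizes):
    """Count objects per size bucket."""
    return Counter(bucket_for(size) for size in sizes)


# Hypothetical response sizes in bytes.
print(size_distribution([500, 30_000, 50_000, 200_000]))
```

Tracking these counts over time is what makes a shift between buckets visible.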

  40. It’s useful to know the image size distribution
    [Chart: object counts per size bucket, 1k (~1KB) through 1g (100MB~1GB)]

  41. 100k size objects started to increase
    [Chart: object counts per size bucket, 1k (~1KB) through 1g (100MB~1GB)]
    It’s useful to know the image size distribution

  42. 100k size objects started to increase
    [Chart: object counts per size bucket, 1k (~1KB) through 1g (100MB~1GB)]
    It’s useful to know the image size distribution
    Detected and fixed the cause

  43. 100k size objects started to increase: detected and fixed the cause
    The cause: some microservices were delivering JPEG instead of WebP
    [Chart: object counts per size bucket, 1k (~1KB) through 1g (100MB~1GB)]
    It’s useful to know the image size distribution

  44. Optimizing Images
    • Images displayed in the Mercari app are resized and
    converted from JPEG to WebP on the fly

    • Around 2017~2018, on-the-fly resizing and WebP conversion were
    introduced for item photos and user profile photos

    • We decreased the traffic volume by 30~40% at that time
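
The slides do not say how the output format is chosen per request. One common approach is content negotiation on the `Accept` request header, sketched below in Python. This is an assumption for illustration, not a description of Mercari's or ImageFlux's logic, and the function name `pick_format` is invented here. The sketch ignores quality values such as `q=0`.

```python
def pick_format(accept_header: str) -> str:
    """Return "webp" if the client advertises image/webp, else "jpeg"."""
    media_types = [
        part.split(";")[0].strip().lower()  # drop parameters such as q=0.8
        for part in accept_header.split(",")
    ]
    return "webp" if "image/webp" in media_types else "jpeg"


print(pick_format("image/avif,image/webp,image/*;q=0.8"))  # prints webp
print(pick_format("image/png,image/*;q=0.8"))              # prints jpeg
```

A client that never advertises WebP support would always receive the larger JPEG.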

  45. Why is Optimizing Images important?
    • To balance UX and CDN cost savings

    • Optimizing images improves the UX affected by network latency
    while saving costs

    • Optimizing images also reduces users' monthly data usage

  46. Automating cache purge

  47. Cache purge on Slack

  48. Cache purge on Slack
    ① Type /purge_cache URL
    ② Build and transfer an API payload
    ③ Issue a cache purge API request

    Google Cloud Functions

  49. Cache purge on Slack
    • Just type /purge_cache URL in Slack

    • Implemented by Slack Slash Commands

    • https://api.slack.com/interactivity/slash-commands

    • Finally, a Google Cloud Function written in Go runs the cache purge against multiple CDNs
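
The flow from slash command to purge request can be sketched as follows. This is a minimal Python illustration, not the Go function from the talk. It covers Fastly only, and it assumes Fastly's single-URL purge endpoint (`POST https://api.fastly.com/purge/<host and path>` with a `Fastly-Key` header). The function names and the sample URL are hypothetical. The request is built but not sent.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request


def parse_slash_command(form_body: str) -> str:
    """Extract the URL argument from Slack's form-encoded slash command body."""
    fields = parse_qs(form_body)
    if fields.get("command", [""])[0] != "/purge_cache":
        raise ValueError("unexpected command")
    url = fields.get("text", [""])[0].strip()
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("expected an absolute http(s) URL")
    return url


def build_purge_request(url: str, api_token: str) -> Request:
    """Build (but do not send) a single-URL purge request for Fastly."""
    parts = urlsplit(url)
    target = parts.netloc + parts.path  # host and path, without the scheme
    if parts.query:
        target += "?" + parts.query
    return Request(
        f"https://api.fastly.com/purge/{target}",
        method="POST",
        headers={"Fastly-Key": api_token},
    )


body = "command=%2Fpurge_cache&text=https%3A%2F%2Fstatic.example.com%2Fa.jpg"
request = build_purge_request(parse_slash_command(body), api_token="dummy-token")
print(request.get_method(), request.full_url)
# prints POST https://api.fastly.com/purge/static.example.com/a.jpg
```

Validating the URL before building the request keeps malformed Slack input from reaching the CDN API.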

  50. Cloud Functions with Cloud Pub/Sub trigger

  51. Cloud Functions with Cloud Pub/Sub trigger
    • Cloud Functions can be triggered by messages published to Pub/Sub topics

    • https://cloud.google.com/functions/docs/calling/pubsub

    • This is useful for automating event-driven cache purges
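
An event-driven purge handler might look like the sketch below. It is written in Python for illustration and assumes the background-function style in which the Pub/Sub message body arrives base64-encoded in `event["data"]`. The JSON layout with a `urls` field and the injected `purge` callable are hypothetical, chosen so the example runs without any cloud dependency.

```python
import base64
import json


def urls_from_pubsub_event(event: dict) -> list:
    """Decode the Pub/Sub message and return the URLs it asks to purge."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return list(payload.get("urls", []))


def handle_pubsub_event(event: dict, context=None, purge=print) -> int:
    """Entry point sketch: purge every URL named in the message."""
    urls = urls_from_pubsub_event(event)
    for url in urls:
        purge(url)  # in a real function this would call the CDN purge API
    return len(urls)


# Simulate a published message locally.
message = json.dumps({"urls": ["https://static.example.com/a.jpg"]})
event = {"data": base64.b64encode(message.encode("utf-8")).decode("ascii")}
handle_pubsub_event(event)  # prints https://static.example.com/a.jpg
```

Passing the purge step in as a parameter keeps the decoding logic testable on its own.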

  52. References
    • System Integration with Fastly

    • https://speakerdeck.com/cubicdaiya/system-integration-with-fastly

    • Making it easy to purge CDN caches from Slack using Google Cloud Functions

    • https://engineering.mercari.com/blog/entry/2019-09-20-110000/

    • Using Cloud Functions to purge caches of old images that linger on the CDN

    • https://engineering.mercari.com/blog/entry/2019-12-05-180000/
