Handling a tremendous amount of images with Fastly / Yamagoya Traverse 2020

Tatsuhiko Kubo
November 26, 2020

These are the slides for my talk on Day 2 of Yamagoya Traverse 2020.

https://www.fastly.jp/yamagoya2020

Transcript

  1. Handling a tremendous amount of images with Fastly
    Tatsuhiko Kubo @cubicdaiya
    Yamagoya Traverse 2020
    2020/11/26

  3. What Is Mercari?
    • Service start: July 2013
    • OS: Android, iOS
      *Can also be accessed by web browsers
    • Usage fee: Free
      *Commission fee for sold items: 10% of the sales price
    • Regions/languages supported: Base specs for Japan/Japanese
    • Total number of listings to date: More than 1.5 billion
    The Mercari app is a C2C marketplace where individuals can
    easily sell used items. We want to provide both buyers and sellers
    with a service where they can enjoy safe and secure transactions.
    Mercari offers a unique customer experience, with a transaction
    environment that uses the payments Mercari holds in escrow, and
    simple and affordable shipping options.
    Many sellers enjoy having the items they no longer need
    purchased and used by buyers who need them, and buyers enjoy
    the feeling of hunting for treasure as they search through unique
    and diverse items for lucky finds. In addition to buying and selling,
    users actively communicate through the buyer/seller chat and the
    "Like" feature.

  4. GitHub, Twitter: @cubicdaiya
    Name: Tatsuhiko Kubo
    Tech Lead, Network at Mercari, Inc.

  5. Responsibilities of the Network team
    • Ensure the Mercari Edge system reliability

    • CDN, TLS, DNS, Load Balancing, Reverse Proxy, …

    • Networking for Cloud and On-Premises

    • Routing between multiple DCs, Cloud Interconnect, …

    • Service Mesh

    • Istio, mTLS, …

  6. Topics
    • Fastly in Mercari

    • System architecture of Mercari with Fastly

    • CI/CD pipeline with Fastly

    • Monitoring with Fastly

    • Mercari Image Delivery with Fastly

    • Keeping high Cache Hit Ratio

    • Optimizing Images

    • Automating cache purge

  7. Fastly in Mercari

  8. Fastly in Mercari
    • Both static and dynamic contents are handled with Fastly

    • Images (item photo, user profile photo, …)

    • Static assets (JavaScript, CSS, …)

    • API / Web

  9. Fastly in Mercari
    • Scale of traffic

    • 300k+ RPS at peak

    • 20+ Gbps at peak

    • Other stats

    • 40+ services

    • 10+ TLS domains

    • Images account for 80+% of total traffic volume

  10. Edge of Mercari JP Infrastructure
    [Architecture diagram; labels: API/Web, Static assets (js, css, etc.), Image, ImageFlux, Amazon S3, Cloud Load Balancing, GKE, GCS]

  11. CI/CD pipeline for Fastly
    [Pipeline diagram: Pull Request → Run CI → Terraform plan/apply → Configure; tfstate is stored in GCS]

  12. Monitoring Fastly metrics

  13. Datadog Integration with Fastly
    https://docs.datadoghq.com/integrations/fastly/

  14. Datadog Integration with Fastly
    • Fastly metrics can be shown and customized on Datadog

    • e.g. hit_ratio, requests, bandwidth, status_4xx, status_5xx, etc…

    • Advantages of Datadog Integration with Fastly

    • Easy to integrate (only a Fastly API token and Fastly Service IDs need to be registered)

    • We can combine multiple metrics and create original metrics

  15. CI/CD pipeline for Datadog dashboard and monitor
    [Pipeline diagram: Pull Request → Run CI by GitHub Actions → Terraform plan/apply → Configure; tfstate is stored in GCS]

  16. Mercari Image Delivery

  17. Images on Mercari app
    • Images are the main content on a lot of screens

    • Timeline, Search Results, Recommended Items

    • Liked Items, Browse Item History

    • Item Details, …
    [Screenshots: Timeline, Item Details]
    → A lot of images are displayed

  18. A tremendous amount of images are delivered from CDN
    • Mercari JP

    • Total number of listings to date: More than 1.5 billion

    • Up to 10 photos can be uploaded per listed item

    • Item photos displayed in the Mercari app are resized and
    transformed from JPEG to WebP on-the-fly

    • Cached objects on the CDN keep increasing
    → The number of images handled by the CDN snowballs!

  19. Mercari Image Delivery in JP
    ImageFlux Amazon S3

  20. Mercari Image Delivery in US
    Amazon S3

  21. Mercari Image Delivery in US
    Amazon S3
    +
    Image Optimizer

  22. Fastly Image Optimizer in Mercari US
    • Originally, we used an internal image conversion proxy in Go

    • To resize, crop, convert format, … on-the-fly

    • We switched to Fastly Image Optimizer in 2018

    • Fastly VCL was useful to keep the original manipulation rule at that time


    sub vcl_recv {
      # absorb the difference between our proxy and Image Optimizer
      set req.url = regsub(req.url, "([&\?])w=([0-9]+)", "\1width=\2");
      set req.url = regsub(req.url, "([&\?])h=([0-9]+)", "\1height=\2");
      set req.url = regsub(req.url, "([&\?])fmt=([a-z]+)", "\1format=\2");
    }
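
The effect of the three regsub calls can be checked offline with a small Python sketch. This is only an illustration of the same rewrite rule, not code from the talk; the function name `rewrite_image_url` and the sample URL are made up for the example.

```python
import re

# (legacy name, Image Optimizer name, value pattern), mirroring the VCL rules.
RENAMES = [
    ("w", "width", "[0-9]+"),
    ("h", "height", "[0-9]+"),
    ("fmt", "format", "[a-z]+"),
]


def rewrite_image_url(url: str) -> str:
    """Rename legacy query parameters; like regsub, only the first match is replaced."""
    for short, long_name, value in RENAMES:
        # ([&?]) captures the delimiter so that only whole parameter names match.
        pattern = r"([&?])" + re.escape(short) + "=(" + value + ")"
        url = re.sub(pattern, r"\g<1>" + long_name + r"=\g<2>", url, count=1)
    return url


# Hypothetical request URL in the legacy proxy's format.
rewritten = rewrite_image_url("/photos/m1.jpg?w=240&h=240&fmt=webp")
print(rewritten)
```

Capturing the `&` or `?` delimiter is what keeps a rule for `w=` from touching a parameter that merely ends in `w`.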

  23. Our best practice for Image Delivery
    • Keep a high Cache Hit Ratio (CHR) in any case!

    • Enable Origin Shielding

    • Set long TTL in Cache-Control: max-age=…

    • Optimize images while keeping appropriate quality

    • Balance UX and cost saving

    • Pay attention to the image size distribution

    • Automate cache purge

  24. Origin Shielding

  25. Origin Shielding
    • Sandwiches a POP between the Edge POPs and the Origin

    • Covers cache misses on the Edge POPs

    • Official document

    • https://docs.fastly.com/en/guides/shielding

  26. Sandwich Shielding POP between Edge POP and Origin
    [Diagram: Edge POPs → Shielding POP → Origin (ImageFlux, Amazon S3)]
    • Cache Hit on Edge POP
    • Cache Hit on Shielding POP
    • Cache Miss
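
The three outcomes on this slide can be sketched as a two-tier lookup. This is a toy model in Python, not Fastly's implementation: caches are plain dictionaries with no TTL or eviction, and `fetch_with_shield`, `fake_origin` and the POP variable names are hypothetical.

```python
from typing import Callable, Dict, Tuple


def fetch_with_shield(
    url: str,
    edge_cache: Dict[str, bytes],
    shield_cache: Dict[str, bytes],
    fetch_origin: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Look up the edge POP first, then the shield POP, then the origin."""
    if url in edge_cache:
        return edge_cache[url], "edge hit"
    if url in shield_cache:
        body = shield_cache[url]
        edge_cache[url] = body  # the edge keeps a copy for later requests
        return body, "shield hit"
    body = fetch_origin(url)
    shield_cache[url] = body
    edge_cache[url] = body
    return body, "miss"


origin_calls = []


def fake_origin(url: str) -> bytes:
    """Stand-in for the origin; records every request that reaches it."""
    origin_calls.append(url)
    return b"image-bytes"


shield: Dict[str, bytes] = {}
edge_a: Dict[str, bytes] = {}
edge_b: Dict[str, bytes] = {}
results = [
    fetch_with_shield("/a.jpg", edge_a, shield, fake_origin)[1],
    fetch_with_shield("/a.jpg", edge_b, shield, fake_origin)[1],
    fetch_with_shield("/a.jpg", edge_a, shield, fake_origin)[1],
]
print(results, len(origin_calls))
```

In this model the second edge POP misses locally but is served by the shield, so the origin is contacted only once.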

  27. Pros/Cons of Origin Shielding
    • Pros

    • Cache Hit Ratio improves significantly

    • Cons

    • Additional traffic fees are charged for the Shielding POP

  28. Cache Hit Ratio on Fastly-Stats

  29. Cache Hit Ratio on Fastly-Stats

  30. Cache Hit Ratio on Fastly-Stats
    HIT RATIO does not include Shielding hits

  31. Cache Hit Ratio on Fastly Stats
    • HIT RATIO does not include Shielding hits

    • The same applies to hit_ratio in Historical Stats

    • We need to calculate the Cache Hit Ratio with Shielding by combining other metrics

  32. CHR Calculation (If Shielding is enabled)
    $$\text{Cache Hit Ratio (True)} = \left(1 - \frac{\text{miss} - \text{shield}}{\text{requests} - \text{shield}}\right) \times 100$$
    miss: Number of cache misses

    shield: Number of requests from edge to the shield POP

    requests: Number of requests processed
    The truth about cache hit ratios: https://www.fastly.com/blog/truth-about-cache-hit-ratios
    * Some states, such as pass, are not taken into account
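
As a worked example, the formula can be written as a small Python function. The counts below are invented purely to show the arithmetic, and `true_cache_hit_ratio` is a hypothetical helper name.

```python
def true_cache_hit_ratio(requests: float, miss: float, shield: float) -> float:
    """CHR with Shielding: (1 - (miss - shield) / (requests - shield)) * 100."""
    denominator = requests - shield
    if denominator <= 0:
        raise ValueError("requests must be greater than shield")
    return (1 - (miss - shield) / denominator) * 100


# Invented numbers: 1000 requests processed, 100 misses,
# 80 requests sent from edge POPs to the shield POP.
example = true_cache_hit_ratio(requests=1000, miss=100, shield=80)
print(f"{example:.2f}%")
```

With these numbers the ratio is (1 − 20/920) × 100, or roughly 97.8%.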

  33. CHR Calculation on Datadog

  34. CHR Calculation on Datadog
    widget {
      query_value_definition {
        autoscale   = false
        custom_unit = "%"
        precision   = 2
        request {
          aggregator = "avg"
          q          = "(1-(avg:fastly.miss{${local.datadog_tag}}-avg:fastly.shield{${local.datadog_tag}})/(avg:fastly.requests{${local.datadog_tag}}-avg:fastly.shield{${local.datadog_tag}}))*100"
        }
        title = "Cache Hit Rate (True)"
      }
    }
    Terraforming

  35. Daily CHR (Mercari Image Delivery in JP)
    CHR with Shielding
    hit_ratio in Historical Stats

  36. Daily CHR (Mercari Image Delivery in US)
    CHR with Shielding
    hit_ratio in Historical Stats

  37. Impact of Origin Shielding
    • Mercari Image Delivery's Cache Hit Ratio improved significantly

    • Approximately:

    • JP: 96.x% -> 98.x%

    • US: 60~70+% -> 80~90+%
    CHR for a given month when Shielding is enabled
    Cache Hit Rate(Edge): hit_ratio in Historical Stats
    Cache Hit Rate(True): CHR with Shielding

  38. Why is there such a big difference in CHR between JP and US?
    • The United States is larger than Japan

    • Fastly has more POPs in the United States than in Japan

    • As the number of POPs increases, CHR on the edge decreases

    • Japan: 3 POPs, North America: 20+ POPs

    • References

    • Fastly Network Map: https://www.fastly.com/network-map

    • Why having more POPs isn't always better: https://www.fastly.com/blog/why-having-more-pops-isnt-always-better
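
The relationship between POP count and edge CHR can be illustrated with a deliberately simple Python model. It assumes that requests for one object are spread evenly across POPs, that each POP caches independently without eviction, and that only the first request at each POP is a miss. None of these assumptions comes from the talk, and the request count is arbitrary.

```python
def edge_hit_ratio(requests_per_object: int, pops: int) -> float:
    """Edge hit ratio (%) for one object under the toy model described above."""
    if requests_per_object <= 0 or pops <= 0:
        raise ValueError("both arguments must be positive")
    # Each POP that receives at least one request misses exactly once.
    misses = min(requests_per_object, pops)
    return (1 - misses / requests_per_object) * 100


few_pops = edge_hit_ratio(requests_per_object=100, pops=3)
many_pops = edge_hit_ratio(requests_per_object=100, pops=20)
print(f"3 POPs: {few_pops:.0f}%, 20 POPs: {many_pops:.0f}%")
```

The same 100 requests produce more cold misses when they are divided among more independent caches, which is the gap that a shield POP covers.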

  39. Optimizing Images

  40. Image size distribution in Mercari Image Delivery in JP
    1k: ~1KB
    10k: 1KB~10KB
    100k: 10KB~100KB
    1m: 100KB~1MB
    10m: 1MB~10MB
    100m: 10MB~100MB
    1g: 100MB~1GB
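
A distribution like this can be reproduced from a list of object sizes with a few lines of Python. This is a sketch under assumptions: the bucket labels follow the legend above, but the use of 1 KB = 1024 bytes, the inclusive upper bounds, and the `over_1g` overflow label are choices made for the example.

```python
from collections import Counter
from typing import Iterable

KB = 1024  # assumption: binary kilobytes
MB = 1024 * KB
GB = 1024 * MB

# (label, inclusive upper bound in bytes), smallest bucket first.
BUCKETS = [
    ("1k", 1 * KB),
    ("10k", 10 * KB),
    ("100k", 100 * KB),
    ("1m", 1 * MB),
    ("10m", 10 * MB),
    ("100m", 100 * MB),
    ("1g", 1 * GB),
]


def size_bucket(size_bytes: int) -> str:
    """Return the label of the first bucket whose upper bound fits the size."""
    for label, upper in BUCKETS:
        if size_bytes <= upper:
            return label
    return "over_1g"  # hypothetical label for anything above 1 GB


def size_distribution(sizes: Iterable[int]) -> Counter:
    """Count objects per size bucket."""
    return Counter(size_bucket(size) for size in sizes)


# Invented response sizes in bytes.
sample = size_distribution([500, 2_000, 50_000, 60_000])
print(sample)
```

Tracking these counts over time is what makes a shift, such as a sudden growth of the 100k bucket, visible.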

  41. It’s useful to know the image size distribution

  42. 100k size objects started to increase
    It’s useful to know the image size distribution

  43. 100k size objects started to increase
    It’s useful to know the image size distribution
    Detected and fixed the cause

  44. 100k size objects started to increase
    Detected and fixed the cause
    The cause was that JPEG was being delivered instead of WebP in some microservices
    It's useful to know the image size distribution

  45. Optimizing Images
    • Images displayed in the Mercari app are resized and
    transformed from JPEG to WebP on-the-fly

    • Around 2017~2018, on-the-fly resizing and WebP transformation were
    introduced for item photos and user profile photos

    • This decreased the traffic volume by 30~40% at that time

  46. Why is Optimizing Images important?
    • To balance UX and CDN cost savings

    • By optimizing images, the UX impacted by network latency can be
    improved while saving costs

    • Optimizing images also reduces users' monthly data usage

  47. Automating cache purge

  48. Cache purge on Slack

  49. Cache purge on Slack
    ① Type /purge_cache URL
    ② Build and transfer an API payload
    ③ Issue a cache purge API request

    Google Cloud Functions

  50. Cache purge on Slack
    • Just type /purge_cache URL in Slack

    • Implemented by Slack Slash Commands

    • https://api.slack.com/interactivity/slash-commands

    • Finally, a Google Cloud Function written in Go runs a cache purge for multiple CDNs
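
The flow from slash command to purge request might look roughly like the following. This is a Python sketch rather than the Go function described in the talk. It covers Fastly only, omits Slack request signature verification, and builds the request without sending it. The helper names and the example host are hypothetical; the `command` and `text` form fields are part of Slack's slash command payload, and `POST /purge/<host and path>` is Fastly's single-URL purge endpoint.

```python
import urllib.parse
import urllib.request

FASTLY_API = "https://api.fastly.com"


def parse_slash_command(form_body: str) -> str:
    """Extract the URL argument from a Slack slash command form body."""
    fields = urllib.parse.parse_qs(form_body)
    if fields.get("command", [""])[0] != "/purge_cache":
        raise ValueError("unexpected command")
    return fields["text"][0].strip()


def build_fastly_purge_request(target_url: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a single-URL purge request for Fastly."""
    parsed = urllib.parse.urlsplit(target_url)
    # The purge endpoint takes the host and path without the scheme.
    resource = parsed.netloc + parsed.path
    if parsed.query:
        resource += "?" + parsed.query
    return urllib.request.Request(
        f"{FASTLY_API}/purge/{resource}",
        method="POST",
        headers={"Fastly-Key": api_token},
    )


# Hypothetical form body as Slack would post it for "/purge_cache <URL>".
body = "command=%2Fpurge_cache&text=https%3A%2F%2Fstatic.example.com%2Fa.jpg"
purge_request = build_fastly_purge_request(parse_slash_command(body), "dummy-token")
print(purge_request.get_method(), purge_request.full_url)
```

Passing the request to `urllib.request.urlopen` would issue the purge. Supporting multiple CDNs would mean one builder like this per provider.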

  51. Cloud Functions with Cloud Pub/Sub trigger

  52. Cloud Functions with Cloud Pub/Sub trigger
    • Cloud Functions can be triggered by messages published to Pub/Sub topics

    • https://cloud.google.com/functions/docs/calling/pubsub

    • This is useful for automating event-driven cache purges
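
A Pub/Sub-triggered function receives the message body base64-encoded in the event's `data` field. The following Python sketch shows how such a handler could extract purge targets. The JSON message shape with a `urls` list and both function names are assumptions made for this example, and the actual purge call is replaced by a print.

```python
import base64
import json
from typing import List


def purge_targets_from_event(event: dict) -> List[str]:
    """Decode a Pub/Sub event and return the URLs to purge."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    message = json.loads(payload)
    return list(message["urls"])  # "urls" is a hypothetical message field


def on_pubsub_message(event: dict, context=None) -> List[str]:
    """Entry point in the (event, context) style of a background function."""
    urls = purge_targets_from_event(event)
    for url in urls:
        print(f"would purge {url}")  # a real handler would call the CDN's purge API here
    return urls


# Simulate the event a publisher would produce.
raw = json.dumps({"urls": ["https://static.example.com/a.jpg"]}).encode("utf-8")
sample_event = {"data": base64.b64encode(raw).decode("ascii")}
purged = on_pubsub_message(sample_event)
```

Any system that can publish to the topic, for example a job that detects replaced images, can then trigger a purge without going through Slack.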

  53. References
    • System Integration with Fastly

    • https://speakerdeck.com/cubicdaiya/system-integration-with-fastly

    • Making it easy to purge caches on the CDN from Slack using Google Cloud Functions

    • https://engineering.mercari.com/blog/entry/2019-09-20-110000/

    • A Cloud Functions story: purging caches of old images that linger on the CDN

    • https://engineering.mercari.com/blog/entry/2019-12-05-180000/
