Tatsuhiko Kubo
November 26, 2020

Handling a tremendous amount of images with Fastly / Yamagoya Traverse 2020

These are the slides from my talk on Day 2 of Yamagoya Traverse 2020.

https://www.fastly.jp/yamagoya2020

Transcript

  1. What Is Mercari?
    The Mercari app is a C2C marketplace where individuals can easily sell used items. We want to provide both buyers and sellers with a service where they can enjoy safe and secure transactions. Mercari offers a unique customer experience, with a transaction environment that uses the payments Mercari holds in escrow, and simple and affordable shipping options.
    Many sellers enjoy having the items they no longer need purchased and used by buyers who need them, and buyers enjoy the feeling of hunting for treasure as they search through unique and diverse items for lucky finds. In addition to buying and selling, users actively communicate through the buyer/seller chat and the “Like” feature.
    • Service start: July 2013
    • OS: Android, iOS (can also be accessed by web browsers)
    • Usage fee: Free (commission fee for sold items: 10% of the sales price)
    • Regions/languages supported: Base specs for Japan/Japanese
    • Total number of listings to date: More than 1.5 billion
  2. Responsibilities of the Network team
    • Ensure the reliability of the Mercari Edge system
      • CDN, TLS, DNS, Load Balancing, Reverse Proxy, …
    • Networking for Cloud and On-Premises
      • Routing between multiple DCs, Cloud Interconnect, …
    • Service Mesh
      • Istio, mTLS, …
  3. Topics
    • Fastly in Mercari
      • System architecture of Mercari with Fastly
      • CI/CD pipeline with Fastly
      • Monitoring with Fastly
    • Mercari Image Delivery with Fastly
      • Keeping a high Cache Hit Ratio
      • Optimizing Images
      • Automating cache purge
  4. Fastly in Mercari
    • Both static and dynamic content is handled by Fastly
      • Images (item photo, user profile photo, …)
      • Static assets (JavaScript, CSS, …)
      • API / Web
  5. Fastly in Mercari
    • Scale of traffic
      • 300k+ RPS at peak
      • 20+ Gbps at peak
    • Other stats
      • 40+ services
      • 10+ TLS domains
      • Images account for 80+% of total traffic volume
  6. Edge of Mercari JP Infrastructure
    [Diagram labels: API/Web, Static assets (js, css, etc…), Image, ImageFlux, Amazon S3, Cloud Load Balancing, GKE, GCS]
  7. Datadog Integration with Fastly
    • Fastly metrics can be shown and customized on Datadog
      • e.g. hit_ratio, requests, bandwidth, status_4xx, status_5xx, etc…
    • Advantages of Datadog Integration with Fastly
      • Easy to integrate (only a Fastly API token and Fastly Service IDs need to be registered)
      • We can combine multiple metrics and create original metrics
  8. CI/CD pipeline for Datadog dashboard and monitor
    [Diagram labels: Pull Request, Run CI by GitHub Actions, Terraform plan/apply, Configure, Store tfstate, GCS]
  9. Images on Mercari app
    • Images are the main content on a lot of screens
      • Timeline, Search Results, Recommended Items
      • Liked Items, Browse Item History
      • Item Details, …
    [Screenshots: Timeline, Item Details]
    → A lot of images are displayed
  10. A tremendous amount of images are delivered from CDN
    • Mercari JP
      • Total number of listings to date: More than 1.5 billion
      • Up to 10 photos can be uploaded per listed item
      • Item photos displayed in the Mercari app are resized and converted from JPEG to WebP on the fly
      • Cached objects on the CDN increase
    → The number of images handled by the CDN snowballs
  11. Fastly Image Optimizer in Mercari US
    • Originally, we used an internal image conversion proxy written in Go
      • To resize, crop, convert format, … on the fly
    • We switched to Fastly Image Optimizer in 2018
      • Fastly VCL was useful for keeping our original manipulation rules at that time

    sub vcl_recv {
      # absorb the difference between our proxy and Image Optimizer
      …
      set req.url = regsub(req.url, "([&\?])w=([0-9]+)", "\1width=\2");
      set req.url = regsub(req.url, "([&\?])h=([0-9]+)", "\1height=\2");
      set req.url = regsub(req.url, "([&\?])fmt=([a-z]+)", "\1format=\2");
      …
    }
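
To make the effect of the three rewrites shown on this slide easier to see, here is a minimal Python sketch that mirrors them. It is only an illustration: the function name `rewrite_image_url` and the sample URL are hypothetical, and the slide elides other rules (the `…` lines) that are not reproduced here.

```python
import re

# The three rewrites visible on the slide: short query parameter names from
# the old proxy are mapped to longer names (w -> width, h -> height,
# fmt -> format). Other rules elided on the slide are not covered.
REWRITES = [
    (re.compile(r"([&?])w=([0-9]+)"), r"\1width=\2"),
    (re.compile(r"([&?])h=([0-9]+)"), r"\1height=\2"),
    (re.compile(r"([&?])fmt=([a-z]+)"), r"\1format=\2"),
]


def rewrite_image_url(url: str) -> str:
    """Apply each rewrite once, like VCL regsub (first match only)."""
    for pattern, replacement in REWRITES:
        url = pattern.sub(replacement, url, count=1)
    return url


# Hypothetical request URL in the old proxy's style.
print(rewrite_image_url("/photos/item.jpg?w=320&h=240&fmt=webp"))
# prints /photos/item.jpg?width=320&height=240&format=webp
```

URLs that already use the long parameter names pass through unchanged, because each pattern requires `&` or `?` directly before the short name.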
  12. Our best practice for Image Delivery
    • Keep a high Cache Hit Ratio (CHR) in any case!
      • Enable Origin Shielding
      • Set a long TTL in Cache-Control: max-age=…
    • Optimize images while keeping appropriate quality
      • Balance UX and cost saving
      • Pay attention to the image size distribution
    • Automate cache purge
  13. Origin Shielding
    • Sandwich a POP between the Edge POP and the Origin
      • It covers cache misses on the Edge POP
    • Official document
      • https://docs.fastly.com/en/guides/shielding
  14. Sandwich Shielding POP between Edge POP and Origin
    [Diagram labels: Edge POP (×3), Shielding POP, ImageFlux, Amazon S3, Cache Hit on Edge POP, Cache Hit on Shielding POP, Cache Miss]
  15. Pros/Cons of Origin Shielding
    • Pros
      • Cache Hit Ratio improves significantly
    • Cons
      • An additional traffic fee is charged for the Shielding POP
  16. Cache Hit Ratio on Fastly Stats
    • Hit Ratio does not include Shielding hits
      • The same applies to hit_ratio in Historical Stats
    • We need to calculate the Cache Hit Ratio with Shielding by combining other metrics
  17. CHR Calculation (If Shielding is enabled)
    $$\text{Cache Hit Ratio (True)} = \left(1 - \frac{\text{miss} - \text{shield}}{\text{requests} - \text{shield}}\right) \times 100$$
    • miss: Number of cache misses
    • shield: Number of requests from edge to the shield POP
    • requests: Number of requests processed
    • The truth about cache hit ratios: https://www.fastly.com/blog/truth-about-cache-hit-ratios
    * This does not account for requests in some other states, such as pass
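
As a quick worked example of the calculation on this slide, the sketch below computes (1 − (miss − shield) / (requests − shield)) × 100 in Python. The function name and the counter values are made up for illustration and are not Mercari's figures.

```python
def true_cache_hit_ratio(requests: int, miss: int, shield: int) -> float:
    """CHR (%) with shielding: (1 - (miss - shield) / (requests - shield)) * 100.

    requests: number of requests processed
    miss:     number of cache misses
    shield:   number of requests from edge to the shield POP
    """
    denominator = requests - shield
    if denominator <= 0:
        raise ValueError("requests must be greater than shield")
    return (1 - (miss - shield) / denominator) * 100


# Illustrative counters only (assumed values).
chr_true = true_cache_hit_ratio(requests=1_040_000, miss=60_000, shield=40_000)
print(f"{chr_true:.2f}%")  # prints 98.00%
```

Subtracting `shield` from both the numerator and the denominator removes the edge-to-shield hop from the count, so a request that misses on the edge but hits on the shield is not counted as a miss.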
  18. CHR Calculation on Datadog
    Terraforming:

    widget {
      query_value_definition {
        autoscale   = false
        custom_unit = "%"
        precision   = 2
        request {
          aggregator = "avg"
          q          = "(1-(avg:fastly.miss{${local.datadog_tag}}-avg:fastly.shield{${local.datadog_tag}})/(avg:fastly.requests{${local.datadog_tag}}-avg:fastly.shield{${local.datadog_tag}}))*100"
        }
        title = "Cache Hit Rate (True)"
      }
    }
  19. Impact of Origin Shielding
    • Mercari Image Delivery’s Cache Hit Ratio improves significantly
    • Approximately:
      • JP: 96.x% -> 98.x%
      • US: 60~70+% -> 80~90+%
    [Chart: CHR for a given month when Shielding is enabled. Cache Hit Rate (Edge): hit_ratio in Historical Stats. Cache Hit Rate (True): CHR with Shielding]
  20. Why is there such a big difference in CHR between JP and US?
    • The United States is larger than Japan
    • Fastly has more POPs in the United States than in Japan
      • As the number of POPs increases, CHR on the edge decreases
      • Japan: 3 POPs, North America: 20+ POPs
    • References
      • Fastly Network Map: https://www.fastly.com/network-map
      • Why having more POPs isn’t always better: https://www.fastly.com/blog/why-having-more-pops-isnt-always-better
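
The claim that edge CHR drops as the number of POPs grows can be illustrated with a deliberately simple toy model. It assumes each object's requests are spread evenly across the POPs and that every POP misses exactly once per object (on its first request), with no eviction or TTL expiry. The function name and the numbers are assumptions for illustration, not measurements.

```python
def edge_hit_ratio(requests_per_object: int, pops: int) -> float:
    """Edge CHR (%) in a toy model with one cold miss per POP per object."""
    if requests_per_object <= 0 or pops <= 0:
        raise ValueError("both arguments must be positive")
    # A POP can only miss on an object if it receives at least one request.
    cold_misses = min(pops, requests_per_object)
    return (1 - cold_misses / requests_per_object) * 100


# Same assumed popularity (100 requests per object), different POP counts.
for pops in (3, 20):
    print(pops, f"{edge_hit_ratio(100, pops):.1f}%")
# prints:
# 3 97.0%
# 20 80.0%
```

In this model the same traffic is split across more caches, so each object pays more first-request misses. A shield in front of the origin would absorb most of them.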
  21. Image size distribution in Mercari Image Delivery in JP
    [Chart legend, by object size: 1k: ~1KB; 10k: 1KB~10KB; 100k: 10KB~100KB; 1m: 100KB~1MB; 10m: 1MB~10MB; 100m: 10MB~100MB; 1g: 100MB~1GB]
  22. It’s useful to know the image size distribution
    [Chart: same size buckets as slide 21]
  23. It’s useful to know the image size distribution
    [Chart: same size buckets as slide 21]
    • 100k size objects started to increase
  24. It’s useful to know the image size distribution
    [Chart: same size buckets as slide 21]
    • 100k size objects started to increase
    • Detected and fixed the cause
  25. It’s useful to know the image size distribution
    [Chart: same size buckets as slide 21]
    • 100k size objects started to increase
    • Detected and fixed the cause
    • The cause was that JPEG was delivered instead of WebP in some microservices
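
The size buckets in these charts (1k, 10k, 100k, …) can be reproduced from raw response sizes with a few lines of Python. This is a sketch under stated assumptions: 1 KB is treated as 1,000 bytes, upper bounds are exclusive, and the `over_1g` label and the sample sizes are made up for illustration.

```python
from collections import Counter

# (label, exclusive upper bound in bytes), assuming 1 KB = 1,000 bytes.
BUCKETS = [
    ("1k", 1_000),
    ("10k", 10_000),
    ("100k", 100_000),
    ("1m", 1_000_000),
    ("10m", 10_000_000),
    ("100m", 100_000_000),
    ("1g", 1_000_000_000),
]


def bucket_for(size_bytes: int) -> str:
    """Return the label of the first bucket whose upper bound exceeds the size."""
    for label, upper in BUCKETS:
        if size_bytes < upper:
            return label
    return "over_1g"  # hypothetical catch-all, not shown on the slides


def size_distribution(sizes) -> Counter:
    """Count how many objects fall into each bucket."""
    return Counter(bucket_for(size) for size in sizes)


# Made-up sizes: three small images and one larger one.
print(size_distribution([8_000, 9_500, 7_200, 25_000]))
# prints Counter({'10k': 3, '100k': 1})
```

Tracking these counts over time is what makes a shift such as a growing `100k` bucket visible.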
  26. Optimizing Images
    • Images displayed in the Mercari app are resized and converted from JPEG to WebP on the fly
    • Around 2017~2018, on-the-fly resizing and WebP conversion were introduced for item photos and user profile photos
      • We decreased the traffic volume by 30~40% at that time
  27. Why is Optimizing Images important?
    • To balance UX and cost saving for the CDN
      • By optimizing images, the UX impact of network latency can be reduced while saving costs
    • Optimizing images also reduces users’ monthly data usage
  28. Cache purge on Slack
    ① Type /purge_cache URL
    ② Build and transfer an API payload
    ③ Issue a cache purge API request
    [Diagram: steps ①, ②, ③; Google Cloud Functions]
  29. Cache purge on Slack
    • Just type /purge_cache URL in Slack
    • Implemented with Slack Slash Commands
      • https://api.slack.com/interactivity/slash-commands
    • In the end, a Google Cloud Function written in Go runs the cache purge against multiple CDNs
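
The flow on these two slides (parse the slash command, then build a purge request) can be sketched as follows. This is not the Go implementation described in the talk. It is a Python illustration that assumes Slack's form-encoded slash-command body (`command` and `text` fields) and Fastly's single-URL purge endpoint (`POST https://api.fastly.com/purge/<host><path>` with a `Fastly-Key` header). The function names, the sample URL, and the token are hypothetical, and Slack request-signature verification is omitted.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request


def parse_slash_command(body: str) -> str:
    """Extract the URL argument from a Slack slash-command request body."""
    form = parse_qs(body)
    if form.get("command", [""])[0] != "/purge_cache":
        raise ValueError("unexpected command")
    text = form.get("text", [""])[0].strip()
    parts = urlsplit(text)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("usage: /purge_cache URL")
    return text


def build_fastly_purge_request(url: str, api_token: str) -> Request:
    """Build (but do not send) a single-URL purge request for Fastly."""
    parts = urlsplit(url)
    target = parts.netloc + parts.path
    if parts.query:
        target += "?" + parts.query
    return Request(
        "https://api.fastly.com/purge/" + target,
        method="POST",
        headers={"Fastly-Key": api_token},
    )


# Hypothetical slash-command body, as Slack would form-encode it.
body = "command=%2Fpurge_cache&text=https%3A%2F%2Fstatic.example.com%2Fitem.jpg"
request = build_fastly_purge_request(parse_slash_command(body), "DUMMY_TOKEN")
print(request.get_method(), request.full_url)
# prints POST https://api.fastly.com/purge/static.example.com/item.jpg
```

Keeping request construction separate from sending makes it simple to add one builder per CDN, which fits the multi-CDN purge mentioned on the slide.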
  30. Cloud Functions with Cloud Pub/Sub trigger
    • Cloud Functions can be triggered by messages published to Pub/Sub topics
      • https://cloud.google.com/functions/docs/calling/pubsub
    • This is useful for automating event-driven cache purge
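
A minimal sketch of a Pub/Sub-triggered function is shown below, written in Python for illustration (the talk's functions are written in Go). It relies on the background-function convention that the message payload arrives base64-encoded in `event["data"]`. The JSON message schema with a `urls` field and the function name are assumptions, and the actual purge call is left as a placeholder.

```python
import base64
import json


def purge_on_event(event, context):
    """Entry point for a Pub/Sub-triggered background function.

    `event["data"]` holds the published message, base64-encoded.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    urls = payload.get("urls", [])  # hypothetical message schema
    for url in urls:
        # A real function would issue the CDN purge request here.
        print(f"purge requested: {url}")
    return urls


# Local simulation of a published message (no Pub/Sub needed).
message = json.dumps({"urls": ["https://static.example.com/old.jpg"]})
event = {"data": base64.b64encode(message.encode("utf-8")).decode("ascii")}
purge_on_event(event, None)
# prints purge requested: https://static.example.com/old.jpg
```

Because the trigger is a message rather than an HTTP call, any system that can publish to the topic can request a purge without knowing the CDN credentials.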
  31. References
    • System Integration with Fastly
      • https://speakerdeck.com/cubicdaiya/system-integration-with-fastly
    • Making it easy to purge CDN caches from Slack using Google Cloud Functions (in Japanese)
      • https://engineering.mercari.com/blog/entry/2019-09-20-110000/
    • Using Cloud Functions to purge caches of old images that linger on the CDN (in Japanese)
      • https://engineering.mercari.com/blog/entry/2019-12-05-180000/