Slide 1

Slide 1 text

Handling a tremendous amount of images with Fastly
Tatsuhiko Kubo (@cubicdaiya)
Yamagoya Traverse 2020, 2020/11/26

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

What Is Mercari? • Service start: July 2013 • OS: Android, iOS (can also be accessed by web browsers) • Usage fee: Free (commission fee for sold items: 10% of the sales price) • Regions/languages supported: Base specs for Japan/Japanese • Total number of listings to date: More than 1.5 billion

Many sellers enjoy having the items they no longer need purchased and used by buyers who need them, and buyers enjoy the feeling of hunting for treasure as they search through unique and diverse items for lucky finds. In addition to buying and selling, users actively communicate through the buyer/seller chat and the “Like” feature.

The Mercari app is a C2C marketplace where individuals can easily sell used items. We want to provide both buyers and sellers with a service where they can enjoy safe and secure transactions. Mercari offers a unique customer experience, with a transaction environment that uses the payments Mercari holds in escrow, and simple and affordable shipping options.

Slide 4

Slide 4 text

GitHub, Twitter: @cubicdaiya Name: Tatsuhiko Kubo Tech Lead, Network at Mercari, Inc.

Slide 5

Slide 5 text

Responsibilities of the Network team • Ensure the Mercari Edge system reliability • CDN, TLS, DNS, Load Balancing, Reverse Proxy, … • Networking for Cloud and On-Premises • Routing between multiple DCs, Cloud Interconnect, … • Service Mesh • Istio, mTLS, …

Slide 6

Slide 6 text

Topics • Fastly in Mercari • System architecture of Mercari with Fastly • CI/CD pipeline with Fastly • Monitoring with Fastly • Mercari Image Delivery with Fastly • Keeping high Cache Hit Ratio • Optimizing Images • Automating cache purge

Slide 7

Slide 7 text

Fastly in Mercari

Slide 8

Slide 8 text

Fastly in Mercari • Both static and dynamic contents are handled with Fastly • Images (item photo, user profile photo, …) • Static assets (JavaScript, CSS, …) • API / Web

Slide 9

Slide 9 text

Fastly in Mercari • Scale of traffic • 300k+ RPS at peak • 20+ Gbps at peak • Other stats • 40+ services • 10+ TLS domains • Images account for 80+% of total traffic volume

Slide 10

Slide 10 text

Edge of Mercari JP Infrastructure API/Web Static assets (js, css, etc…) Image ImageFlux Amazon S3 Cloud Load Balancing GKE GCS

Slide 11

Slide 11 text

CI/CD pipeline for Fastly Pull Request Run CI Terraform plan/apply Configure Store tfstate GCS

Slide 12

Slide 12 text

Monitoring Fastly metrics

Slide 13

Slide 13 text

Datadog Integration with Fastly + https://docs.datadoghq.com/integrations/fastly/

Slide 14

Slide 14 text

Datadog Integration with Fastly • Fastly metrics can be shown and customized on Datadog • e.g. hit_ratio, requests, bandwidth, status_4xx, status_5xx, etc… • Advantages of Datadog Integration with Fastly • Easy to integrate (Only need to register Fastly API token and Fastly Service IDs) • We can combine multiple metrics and create original metrics

Slide 15

Slide 15 text

CI/CD pipeline for Datadog dashboard and monitor Pull Request Terraform plan/apply Run CI by GitHub Actions Configure Store tfstate GCS

Slide 16

Slide 16 text

Mercari Image Delivery

Slide 17

Slide 17 text

Images on Mercari app • Images are the main content on a lot of screens • Timeline, Search Results, Recommended Items • Liked Items, Browse Item History • Item Details, … Timeline Item Details → A lot of images are displayed

Slide 18

Slide 18 text

A tremendous amount of images is delivered from CDN • Mercari JP • Total number of listings to date: More than 1.5 billion • Up to 10 photos can be uploaded per listed item • Item photos displayed on the Mercari app are resized and transformed from JPEG to WebP on-the-fly • The number of cached objects on the CDN keeps increasing → The number of images handled by the CDN snowballs

Slide 19

Slide 19 text

Mercari Image Delivery in JP ImageFlux Amazon S3

Slide 20

Slide 20 text

Mercari Image Delivery in US Amazon S3

Slide 21

Slide 21 text

Mercari Image Delivery in US Amazon S3 + Image Optimizer

Slide 22

Slide 22 text

Fastly Image Optimizer in Mercari US • Originally, we used an internal image conversion proxy written in Go • To resize, crop, convert format, … on-the-fly • We switched to Fastly Image Optimizer in 2018 • Fastly VCL was useful for keeping the original manipulation rules at that time

    sub vcl_recv {
      # absorb the difference between our proxy and Image Optimizer
      …
      set req.url = regsub(req.url, "([&\?])w=([0-9]+)", "\1width=\2");
      set req.url = regsub(req.url, "([&\?])h=([0-9]+)", "\1height=\2");
      set req.url = regsub(req.url, "([&\?])fmt=([a-z]+)", "\1format=\2");
      …
    }
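The parameter renaming done by the regsub() calls above can be sketched outside VCL to check how a legacy URL would be rewritten. This is only an illustration in Python, not the production configuration: the function name and the sample URL are hypothetical, and the three rules are copied from the slide, while the elided parts of the original VCL are not reproduced.

```python
import re

# (pattern, replacement) pairs mirroring the three regsub() calls on the slide.
REWRITE_RULES = [
    (re.compile(r"([&?])w=([0-9]+)"), r"\1width=\2"),
    (re.compile(r"([&?])h=([0-9]+)"), r"\1height=\2"),
    (re.compile(r"([&?])fmt=([a-z]+)"), r"\1format=\2"),
]


def rewrite_image_url(url: str) -> str:
    """Rename legacy query parameters (w, h, fmt) to width, height, format."""
    for pattern, replacement in REWRITE_RULES:
        # count=1 mirrors VCL regsub(), which replaces only the first match.
        url = pattern.sub(replacement, url, count=1)
    return url


# Hypothetical sample URL, for illustration only.
print(rewrite_image_url("/photos/m123.jpg?w=300&h=200&fmt=webp"))
```

Anchoring each pattern on a leading "&" or "?" is what keeps a parameter such as "show=1" from being mistaken for "w=1".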

Slide 23

Slide 23 text

Our best practice for Image Delivery • Keep high Cache Hit Ratio(CHR) in any case! • Enable Origin Shielding • Set long TTL in Cache-Control: max-age=… • Optimize image while keeping appropriate quality • Balance UX and cost saving • Pay attention to the image size distribution • Automate cache purge

Slide 24

Slide 24 text

Origin Shielding

Slide 25

Slide 25 text

Origin Shielding • Sandwiching a POP between Edge POP and Origin • Cover cache miss on Edge POP • Official document • https://docs.fastly.com/en/guides/shielding

Slide 26

Slide 26 text

Sandwich Shielding POP between Edge POP and Origin Edge POP Edge POP Edge POP Shielding POP ImageFlux Amazon S3 Cache Hit on Edge POP Cache Hit on Shielding POP Cache Miss

Slide 27

Slide 27 text

Pros/Cons of Origin Shielding • Pros • Cache Hit Ratio improves significantly • Cons • Additional traffic fee on Shielding POP is charged

Slide 28

Slide 28 text

Cache Hit Ratio on Fastly-Stats

Slide 29

Slide 29 text

Cache Hit Ratio on Fastly-Stats

Slide 30

Slide 30 text

Cache Hit Ratio on Fastly-Stats HIT RATIO does not contain Shielding hits

Slide 31

Slide 31 text

Cache Hit Ratio on Fastly Stats • HIT RATIO does not include Shielding hits • The same applies to hit_ratio in Historical Stats • To get the Cache Hit Ratio including Shielding, we need to combine other metrics

Slide 32

Slide 32 text

CHR Calculation (If Shielding is enabled)
Cache Hit Ratio (True) = (1 − (miss − shield) / (requests − shield)) × 100
miss: Number of cache misses
shield: Number of requests from edge to the shield POP
requests: Number of requests processed
* Does not take into account some states such as pass
The truth about cache hit ratios: https://www.fastly.com/blog/truth-about-cache-hit-ratios
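As a worked example of the calculation on this slide, here is a minimal Python sketch. The function name and the sample numbers are made up for illustration and are not Mercari's figures. The three inputs correspond to the miss, shield, and requests metrics defined above.

```python
def true_cache_hit_ratio(miss: float, shield: float, requests: float) -> float:
    """Cache hit ratio (%) that also counts hits served by the shield POP.

    miss:     number of cache misses
    shield:   number of requests from edge to the shield POP
    requests: number of requests processed
    """
    denominator = requests - shield
    if denominator <= 0:
        raise ValueError("requests must be greater than shield")
    return (1 - (miss - shield) / denominator) * 100


# Made-up numbers: 1000 requests processed, 100 of them edge-to-shield,
# 120 misses in total.
print(round(true_cache_hit_ratio(miss=120, shield=100, requests=1000), 2))
```

Subtracting shield from both the misses and the requests removes the edge-to-shield hop from the count, so what remains is the share of requests that had to go to the origin.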

Slide 33

Slide 33 text

CHR Calculation on Datadog

Slide 34

Slide 34 text

CHR Calculation on Datadog (Terraforming)

    widget {
      query_value_definition {
        autoscale   = false
        custom_unit = "%"
        precision   = 2
        request {
          aggregator = "avg"
          q          = "(1-(avg:fastly.miss{${local.datadog_tag}}-avg:fastly.shield{${local.datadog_tag}})/(avg:fastly.requests{${local.datadog_tag}}-avg:fastly.shield{${local.datadog_tag}}))*100"
        }
        title = "Cache Hit Rate (True)"
      }
    }

Slide 35

Slide 35 text

Daily CHR (Mercari Image Delivery in JP) CHR with Shielding hit_ratio in Historical Stats

Slide 36

Slide 36 text

Daily CHR (Mercari Image Delivery in US) CHR with Shielding hit_ratio in Historical Stats

Slide 37

Slide 37 text

Impact of Origin Shielding • Mercari Image Delivery’s Cache Hit Ratio improved significantly • Approximately: • JP: 96.x% -> 98.x% • US: 60~70+% -> 80~90+%
Chart: CHR for a given month when Shielding is enabled
Cache Hit Rate(Edge): hit_ratio in Historical Stats
Cache Hit Rate(True): CHR with Shielding

Slide 38

Slide 38 text

Why is there such a big difference in CHR between JP and US? • The United States is larger than Japan • Fastly has more POPs in the United States than in Japan • As the number of POPs increases, CHR on the edge decreases • Japan: 3 POPs, North America: 20+ POPs • References • Fastly Network Map: https://www.fastly.com/network-map • Why having more POPs isn’t always better: https://www.fastly.com/blog/why-having-more-pops-isnt-always-better
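The claim that edge CHR drops as POPs increase can be illustrated with a deliberately simple toy model. It assumes every POP caches independently, every object is requested at every POP, and nothing expires, so each object costs one cold miss per POP. The function name and all numbers are hypothetical; this is not a model of Fastly's network or of Mercari's measured ratios.

```python
def edge_chr_upper_bound(requests: int, objects: int, pops: int) -> float:
    """Best-case edge CHR (%) when each POP must fetch every object once."""
    # Each POP has its own cache, so the first request for an object
    # at that POP is always a miss.
    cold_misses = min(requests, objects * pops)
    return (1 - cold_misses / requests) * 100


# Same made-up workload spread over 3 POPs and over 20 POPs.
for pops in (3, 20):
    print(pops, round(edge_chr_upper_bound(1_000_000, 10_000, pops), 1))
```

In this model the only difference between the two runs is the number of independent caches that have to be warmed. With Origin Shielding, those per-POP cold misses can be answered by the shield POP instead of the origin.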

Slide 39

Slide 39 text

Optimizing Images

Slide 40

Slide 40 text

Image size distribution in Mercari Image Delivery in JP
Legend: 1k: ~1KB, 10k: 1KB~10KB, 100k: 10KB~100KB, 1m: 100KB~1MB, 10m: 1MB~10MB, 100m: 10MB~100MB, 1g: 100MB~1GB

Slide 41

Slide 41 text

It’s useful to know the image size distribution
Legend: 1k: ~1KB, 10k: 1KB~10KB, 100k: 10KB~100KB, 1m: 100KB~1MB, 10m: 1MB~10MB, 100m: 10MB~100MB, 1g: 100MB~1GB
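A distribution with these buckets could be computed from a list of response sizes roughly as follows. This is a sketch under assumptions: the slides do not say where the sizes come from (CDN logs would be one possible source), whether 1KB means 1000 or 1024 bytes, or how boundary values are assigned. The function names and the "over_1g" label are hypothetical.

```python
from collections import Counter

KB = 1024  # assumption: the slide does not say whether 1KB is 1000 or 1024 bytes

# (label, inclusive upper bound in bytes), matching the legend on the slide.
BUCKETS = [
    ("1k", 1 * KB),
    ("10k", 10 * KB),
    ("100k", 100 * KB),
    ("1m", 1024 * KB),
    ("10m", 10 * 1024 * KB),
    ("100m", 100 * 1024 * KB),
    ("1g", 1024 * 1024 * KB),
]


def size_bucket(size_bytes: int) -> str:
    """Return the legend label of the bucket that contains size_bytes."""
    for label, upper in BUCKETS:
        if size_bytes <= upper:
            return label
    return "over_1g"  # hypothetical label for anything larger than 1GB


def size_distribution(sizes) -> Counter:
    """Count how many objects fall into each bucket."""
    return Counter(size_bucket(size) for size in sizes)


# Made-up sizes in bytes.
print(size_distribution([500, 8_000, 60_000, 70_000, 2_500_000]))
```

Tracking these counts over time is what makes a shift visible, such as the growth of one bucket relative to the others.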

Slide 42

Slide 42 text

It’s useful to know the image size distribution
100k size objects started to increase
Legend: 1k: ~1KB, 10k: 1KB~10KB, 100k: 10KB~100KB, 1m: 100KB~1MB, 10m: 1MB~10MB, 100m: 10MB~100MB, 1g: 100MB~1GB

Slide 43

Slide 43 text

It’s useful to know the image size distribution
100k size objects started to increase → Detected and fixed the cause
Legend: 1k: ~1KB, 10k: 1KB~10KB, 100k: 10KB~100KB, 1m: 100KB~1MB, 10m: 1MB~10MB, 100m: 10MB~100MB, 1g: 100MB~1GB

Slide 44

Slide 44 text

It’s useful to know the image size distribution
100k size objects started to increase → Detected and fixed the cause
The cause was that JPEG was being delivered instead of WebP in some microservices
Legend: 1k: ~1KB, 10k: 1KB~10KB, 100k: 10KB~100KB, 1m: 100KB~1MB, 10m: 1MB~10MB, 100m: 10MB~100MB, 1g: 100MB~1GB

Slide 45

Slide 45 text

Optimizing Images • Displayed images on Mercari app are resized and transformed from JPEG to WebP on-the-fly • Around 2017~2018, on-the-fly resizing and WebP transformation were introduced for item photo and user profile photo • We decreased the traffic volume by 30~40% at that time

Slide 46

Slide 46 text

Why is Optimizing Images important? • To balance UX and cost saving for CDN • By optimizing images, the UX impacted by network latency can be improved while saving costs • Optimizing images also reduces users’ monthly data usage

Slide 47

Slide 47 text

Automating cache purge

Slide 48

Slide 48 text

Cache purge on Slack

Slide 49

Slide 49 text

Cache purge on Slack ① Type /purge_cache URL ② Build and transfer an API payload ③ Issue a cache purge API request (Google Cloud Functions)

Slide 50

Slide 50 text

Cache purge on Slack • Just type /purge_cache URL in Slack • Implemented with Slack Slash Commands • https://api.slack.com/interactivity/slash-commands • In the end, a Google Cloud Function written in Go runs the cache purge for multiple CDNs
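The flow on the previous slides (slash command in, purge API request out) can be sketched as below. This is not the Mercari implementation, which according to the slides is written in Go and purges multiple CDNs. The sketch is in Python, covers Fastly only, and only builds the request without sending it. It assumes Slack's form-encoded slash command payload (fields "command" and "text") and Fastly's single-URL purge endpoint with a "Fastly-Key" header. The function name, the example host, and the token are hypothetical, and verifying the Slack request signature is left out.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request

FASTLY_PURGE_ENDPOINT = "https://api.fastly.com/purge/"


def build_purge_request(slack_body: str, api_token: str) -> Request:
    """Turn a slash command payload into a (not yet sent) purge request."""
    form = parse_qs(slack_body)
    command = form.get("command", [""])[0]
    target = form.get("text", [""])[0].strip()
    if command != "/purge_cache":
        raise ValueError(f"unexpected command: {command!r}")

    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a purgeable URL: {target!r}")

    # The purge endpoint takes the cached URL without its scheme.
    resource = parts.netloc + parts.path
    if parts.query:
        resource += "?" + parts.query
    return Request(
        FASTLY_PURGE_ENDPOINT + resource,
        method="POST",
        headers={"Fastly-Key": api_token},
    )


# Hypothetical payload, as Slack would send it for
# "/purge_cache https://static.example.com/a.jpg".
body = "command=%2Fpurge_cache&text=https%3A%2F%2Fstatic.example.com%2Fa.jpg"
request = build_purge_request(body, api_token="dummy-token")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the parsing and validation easy to test without touching the CDN.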

Slide 51

Slide 51 text

Cloud Functions with Cloud Pub/Sub trigger

Slide 52

Slide 52 text

Cloud Functions with Cloud Pub/Sub trigger • Cloud Functions can be triggered by messages published to Pub/Sub topics • https://cloud.google.com/functions/docs/calling/pubsub • This is useful for automating event-driven cache purges
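A Pub/Sub-triggered function receives the published message as base64-encoded data. The sketch below shows that decoding step in Python using the background function signature (event, context). The message schema, a JSON object with a "urls" list, and the function names are assumptions for illustration; the slides do not describe the actual message format, and the real purge call is replaced by a print.

```python
import base64
import json


def purge_targets_from_event(event: dict) -> list:
    """Decode a Pub/Sub event and return the URLs it asks to purge."""
    # Pub/Sub delivers the message body base64-encoded under "data".
    payload = base64.b64decode(event["data"]).decode("utf-8")
    message = json.loads(payload)
    # Hypothetical schema: {"urls": ["https://...", ...]}
    return list(message.get("urls", []))


def on_pubsub_message(event, context):
    """Entry point shape for a Pub/Sub-triggered background function."""
    for url in purge_targets_from_event(event):
        # A real function would issue the CDN purge request here.
        print(f"would purge: {url}")


# Local dry run with a hand-built event.
sample = {"urls": ["https://static.example.com/a.jpg"]}
event = {"data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")}
on_pubsub_message(event, context=None)
```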

Slide 53

Slide 53 text

References
• System Integration with Fastly: https://speakerdeck.com/cubicdaiya/system-integration-with-fastly
• Making it easy to purge CDN caches from Slack using Google Cloud Functions (in Japanese): https://engineering.mercari.com/blog/entry/2019-09-20-110000/
• A Cloud Functions story: purging old image caches that linger on the CDN (in Japanese): https://engineering.mercari.com/blog/entry/2019-12-05-180000/