Keeping Throughput High on Green Saturday (incl. presenter notes)

This is a sponsored talk by Weedmaps.

Every April, we observe Earth Day and celebrate our planet’s beauty and resources, its oceans and trees. But only days earlier, another kind of tree is celebrated, and Weedmaps experiences its highest traffic of the year. Come see techniques we’ve used recently to lighten the latency on our most requested routes ahead of the elevated demand. Do you cache your API responses but want a lift in your hit ratio? Does that Elasticsearch best practice you know you’re ignoring cause nerves? We’ll pass our solutions to these problems—on the lefthand side.

Alexander Reiff

April 30, 2019

Transcript

  1. 2.

    Discovery API

    My team is focused on what we at Weedmaps call our “Discovery API”: a read-only Rails API that serves most of our content on the frontend, for the web and our native mobile apps. Its data source is Elasticsearch, which gives us super quick reads and a simple horizontal scaling mechanism when pressure gets too high on the cluster. We feed Elasticsearch from a queue pipeline we’ve implemented called “Dabbit”, a layer on top of RabbitMQ.
  2. 3.

    ! WM Core -> Discovery API

    Messages are put on the queue by our “Core” application, which houses the CMS and most of the business logic, and maintains the source data in Postgres. (If this setup sounds interesting and you want to talk about it more, come see us at our booth. We’re definitely hiring!)
  3. 4.

    ! Discovery API: 3 main areas of focus

    Determines where I am, the retailers/delivery services near me, their products, and the best deals on those products. When people are searching for green, they are using services my team is responsible for. And there’s one particular day when all kinds of people are searching for green…
  4. 6.

    [Calendar of April: Friday the 19th (Passover, Night 1) · Saturday the 20th (Green Saturday*) · Sunday the 21st (Easter) · Monday the 22nd (Earth Day)]

    Last week’s calendar: Earth Day on Monday the 22nd. Two days earlier, Saturday April 20th: WM’s highest traffic spike of the year, consistently elevated year over year.
  5. 7.

    2016 2017 2018 2019 ❔ ❓

    2016 was my first year at WM. 4/20 at WM, expecting a party; instead, the Core application database was deadlocked within a few hours of the start of business. 2017: Split PG reads and writes. Also had this new “V2” API backed by ES that was used for some routes. 2018: CA went recreational. Needed much more scale. Apps moved to Docker, run by Rancher, so we can quickly scale and deploy any changes as needed. Most read traffic going to “V2”, now the Discovery API. ES2 was shaky; we upgraded to managed ES6 and improved our shard configuration. Boring day. 2019: Baseline traffic is about double from a year ago. A few traffic spikes early in the year made us a little nervous. What else to upgrade?
  6. 8.

    /location /brands/categories Two routes are hit by our FE clients

    on each homepage load (at least once!): /location /brands/categories
  7. 9.

    /location 3 4 ! 5

    Determines a user’s location based on device geolocation-provided coordinates -> maps the location to one or more sales regions based on what services are available -> pulls the list of businesses advertising in those sales regions. In general, this is data that does not change with extremely high frequency. Cache? But we still want a good experience for a business owner who updates their page or buys some new advertising position.
  8. 10.

    microcache!

    Microcaching: caching the entire web response for a brief period. “Micro” refers to the TTL, not the size of the cached payload.
  9. 11.

    http {
      proxy_cache_path /tmp/cache keys_zone=micache:150m max_size=200m;
      ...

      server {
        server_name railsconf2019.local;

        location / {
          proxy_pass http://localhost:3000;
          proxy_redirect off;
          ...

          proxy_cache micache;
          proxy_cache_valid 200 30s;
          proxy_cache_key "$proxy_host$request_uri$http_authorization";
        }
      }
    }

    We love Nginx at Weedmaps; we have it enabled at many layers of our network. Nginx makes this pretty easy: proxy_cache_path, proxy_cache, proxy_cache_key. But… the local file system is used as the cache store. We’re in Docker, so we have a bunch of Nginx instances and they cannot share the cache.
  10. 12.

    
    location / {
      proxy_pass http://localhost:3000;
      proxy_redirect off;
      ...

      proxy_cache micache;
      proxy_cache_valid 200 30s;
      proxy_cache_key "$proxy_host$request_uri$http_authorization";
    }
  11. 13.

    OpenResty

    Enter OpenResty: Nginx with gourmet accoutrement. * LuaJIT * A bunch of Lua libraries * A series of Nginx modules that allow Nginx to talk to non-HTTP upstreams… like Memcached. * Just like Rails.cache, Memcached lets us share a cache store across app instances; in this case the app is Nginx.
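For comparison on the Rails side, pointing Rails.cache at a shared Memcached cluster looks something like the sketch below. The host names and the `:mem_cache_store` configuration are assumptions about a typical setup, not the talk's actual config:

```ruby
# config/environments/production.rb
# A shared cache store: every app instance talks to the same Memcached
# cluster, just as the Nginx instances share one via OpenResty.
# Host names below are illustrative.
Rails.application.configure do
  config.cache_store = :mem_cache_store, "memcached-1.internal", "memcached-2.internal"
end
```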
  12. 14.

    So… the cache hit rate is not great. We don’t have that many kinds of requests, but we get a lot of variation between requests. Why? We rely on mobile devices to send a user’s coordinates to geocode their location… and mobile devices send verrrry precise coordinates. Most applications don’t need this level of precision.
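A quick sketch of why precision kills the hit rate. The key format here is illustrative, not the production scheme:

```ruby
# Three users standing within a few meters of each other. At full device
# precision every request produces a distinct cache key; rounded to four
# decimal places (~11 m), they collapse onto one shared entry.
coords = [
  [44.984549, -93.268504],
  [44.984520, -93.268533],
  [44.984480, -93.268470],
]

exact_keys   = coords.map { |lat, lon| "region_#{lat}_#{lon}" }.uniq
rounded_keys = coords.map { |lat, lon| "region_#{lat.round(4)}_#{lon.round(4)}" }.uniq

exact_keys.size    # => 3 (every request is a cache miss)
rounded_keys.size  # => 1 (two of the three requests become hits)
```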
  14. 16.

    [Graph: decimal degree precision (1 through 6) on the x-axis vs. distance in meters (0.01 to 10,000, log scale) on the y-axis]

    This chart is data stolen from Wikipedia that I turned into a graph. At 45ºN, these are the distances represented by each decimal degree of precision. (Minneapolis is at 44.9ºN, so these should be pretty accurate.) Beyond 5 decimal places, we’re down to sub-meter precision.
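The chart's shape is easy to reproduce with back-of-the-envelope math. The 111,320 meters-per-degree constant and the cosine-latitude correction are standard approximations, not numbers from the talk:

```ruby
# Meters covered by one step at a given decimal precision. A degree of
# latitude is ~111,320 m everywhere; a degree of longitude shrinks by
# cos(latitude), so at 45ºN it is about 70% of that.
METERS_PER_DEGREE = 111_320.0

def precision_in_meters(decimals, latitude_deg: 45.0)
  lat_step = METERS_PER_DEGREE * 10.0**-decimals
  { lat: lat_step, lon: lat_step * Math.cos(latitude_deg * Math::PI / 180) }
end

precision_in_meters(4)[:lat]  # ~11 m: plenty for "what's near me"
precision_in_meters(6)[:lat]  # ~0.11 m: sub-meter, more than any route needs
```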
  15. 17.

    Or more concretely… Here’s the location of our Weedmaps After Party at the Aria, just a few blocks from the Convention Center: 44.984549ºN, 93.268504ºW. So what if we round it to 4 decimals? 44.9845, -93.2685. 2? 44.98, -93.27. 1? 45.0, -93.3 (up I-94 in the north part of the city). So… we can probably drop some of these decimal degrees from our user input. But where do we do that?
  16. 18.

    class Types::LatLon < Dry::Struct::Value
      attribute :latitude, Types::Coercible::Float
      attribute :longitude, Types::Coercible::Float

      def round(decimal_degrees)
        new(
          latitude: latitude.round(decimal_degrees),
          longitude: longitude.round(decimal_degrees)
        )
      end
    end

    We have a Dry::Struct (shout out to dry-rb!) modeling LatLon. We could easily stick a round method in here.
  17. 19.

    coord_params = params.permit(:latitude, :longitude).to_h
    rounded = Types::LatLon[coord_params].round(4)
    Query::Region.new(intersects_with: rounded).first

    And then round our coordinates before passing them to a query. But if we’re doing it in our Rails app, we’re already behind our microcache, so that won’t improve our cache rate. Remember, we have our microcache set up in Nginx. Maybe we do an Nginx rewrite of some sort before the cache is accessed?
  18. 20.

    The sort of Nginx rewrite we chose to implement was done using a Lua plugin for our API gateway, which sits in front of all our API services: the Discovery API, our Core Rails app, our Elixir services, all that. The API gateway is powered by a service called “Kong”. Kong is OpenResty, so it is Nginx, with dynamically implemented routes and an awesome plugin architecture that works just like Rails middleware. There are hooks to modify the request and response at various stages of handling. But I will say… Lua is not so fun. Who’s going to build Nginx bindings for Crystal!?
  20. 22.

    So… here is that before graph: cache rate about 5-6%. We roll out the plugin and enable coordinate rounding…
  21. 23.

    After rounding: ~9-10%. Much better. So cool. Nginx is handling a few thousand requests for us that Rails does not need to handle. Thanks, Nginx!
  22. 24.

    3 4 ! 5 /location

    But we still had thousands more requests to /location to process. So it was time to look into the location route controller to see what else we could really optimize. Our New Relic transaction traces identified our region query as consistently the slowest-running operation. We turned on Elasticsearch’s slow query logs, and they confirmed the same. So out of the three main parts to this request (geolocation, determining the user’s region, and finding the retailers in that region), we chose to focus on the regions. But… our Sales Operations team is not changing region boundaries for established areas very often, so it’s not very likely a given coordinate will yield different region results minute to minute, or even day to day.
  24. 26.

    coord_params = params.permit(:latitude, :longitude).to_h
    rounded = Types::LatLon[coord_params].round(4)
    cache_key = "region_#{rounded}"
    Rails.cache.fetch(cache_key, expires_in: 10.minutes) do
      Query::Region.new(intersects_with: rounded).first
    end

    So let’s cache it! Here’s that rounded region snippet again. This time we’re wrapping the lookup in a standard Rails.cache.fetch with an expiry (10 minutes here, compared to 30 seconds in the microcache). If the key hasn’t been set, or if the old value has expired, the block gets called and we make our query.
  25. 27.

    read-through cache

    This is what’s termed a read-through cache. Pretty standard, pretty simple. But there is an alternative! Ask yourself: will the data I get on a cache refresh likely be different from what I have stored? Will my users notice even if it is? If you can say No to either of these, you might want to replace the simple cache with what’s called a write-behind cache.
  26. 28.

    write-behind cache

    What’s the difference? With the write-behind, if a cache key is expired, rather than refreshing as part of the fetch and returning the new data, the expired cache value gets returned. At the same time, we enqueue a background job to go ahead, fetch the new data, and store it in the cache.
  27. 29.

    class WriteBehindCache
      def initialize(ttl = 1.day)
        @ttl = ttl
        @store = Redis.current
      end

      def set(key, value)
        to_cache = { payload: value, expires_at: Time.current + @ttl }
        @store.set(key, to_cache) # (serialization in and out of Redis elided on the slide)
      end

      def get(key)
        cached = @store.get(key)
        return unless cached

        refresh_cache(key) if cached[:expires_at] < Time.current
        cached[:payload]
      end

      private

    Don’t set the TTL on the cache key itself; store an expiration timestamp as part of the cached value (in #set). When retrieving in #get, check the expiration date. Either way, return the (possibly stale) payload and move on; if it has expired, also enqueue a worker that will go fetch and store the updated cache value. The ultimate goal is to shift any spikes in upstream latency to the background worker layer, rather than latency that gets passed on to our end users as a slow response.
  28. 30.

      def refresh_cache(key)
        WriteBehindRefresher.perform_later(key, @ttl)
      end
    end

    class WriteBehindRefresher < ApplicationJob
      def perform(key, ttl)
        payload = fetch_resource(key)
        cache = WriteBehindCache.new(ttl)
        cache.set(key, payload)
      end

      def fetch_resource(key)
        Faraday.get("https://myapi.org/#{key}").body
      end
    end
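The slide's version leans on Redis and ActiveJob. Here is a self-contained sketch of the same pattern with an in-memory hash and an array standing in for the store and the job queue; those stand-ins are assumptions purely to make it runnable:

```ruby
# Write-behind in miniature: a stale read returns the old payload
# immediately and records a refresh to be done out-of-band, instead of
# making the caller wait on the upstream fetch.
class MiniWriteBehindCache
  attr_reader :pending_refreshes

  def initialize(ttl:)
    @ttl = ttl
    @store = {}                # stand-in for Redis
    @pending_refreshes = []    # stand-in for a background job queue
  end

  def set(key, value, now: Time.now)
    @store[key] = { payload: value, expires_at: now + @ttl }
  end

  def get(key, now: Time.now)
    cached = @store[key]
    return unless cached

    @pending_refreshes << key if cached[:expires_at] < now
    cached[:payload]           # served even when stale
  end
end

cache = MiniWriteBehindCache.new(ttl: 60)
cache.set("region_44.9845_-93.2685", "minneapolis", now: Time.now - 3600)

cache.get("region_44.9845_-93.2685")  # => "minneapolis", instantly
cache.pending_refreshes               # => ["region_44.9845_-93.2685"]
```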
  29. 31.

    A Cautionary Tale

    In this use case, our upstream service is Elasticsearch. What are some ways to add upstream latency to Elasticsearch? (1) Bursts of writes under heavy read loads; on April 20th at noon, a POS syncing an enormous menu. (2) Continually running expensive read queries…
  30. 32.

    [Diagram: “regions” index: parent document (id, name, aliases, region_path, region_geometry_join, …) joined to child document (region_geometry_join, bounding_box, geometry, …)]

    ES has a feature where you can `join` documents into a parent-child relationship. We used this to store our region document metadata separate from the geometry data. Suffice to say it made indexing more convenient for us at the time it was implemented. Querying was still pretty convenient too: find the region that has a child document with geometry intersecting with my coordinate. Not the nicest code to generate the deeply nested ES DSL to express this query, but it worked. [effect] Or did it?
  32. 34.

    “If you care about query performance you should not use this query.”
    —Elasticsearch Query DSL docs, has_child query

    We used this on our main route. As is so often the case in life, what seemed cheap and convenient at the time ends up being expensive and harmful down the road. We rethought our indexing strategy, collapsed the two documents into a single structure, and reworked our updates so we could still maintain delta updates.
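For a sense of the change, here are the two query shapes side by side as Ruby hashes. The type and field names are illustrative stand-ins, not the exact production mappings:

```ruby
# The point we're locating.
point = { type: "point", coordinates: [-93.2685, 44.9845] }

# Before: a has_child query -- find the parent region whose child geometry
# document intersects the point. This is the form the docs warn about.
has_child_query = {
  query: {
    has_child: {
      type: "region_geometry",
      query: {
        geo_shape: { geometry: { shape: point, relation: "intersects" } }
      }
    }
  }
}

# After: geometry collapsed into the region document itself, so the
# geo_shape clause applies directly, with no join to resolve at query time.
flattened_query = {
  query: {
    geo_shape: { geometry: { shape: point, relation: "intersects" } }
  }
}
```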
  33. 35.

    Peak around 45ms spent querying ES in the location route before. Now a peak around 30ms. A 33% improvement. Just what we were hoping for! Now, there was one more place we were using that has_child query.
  34. 36.

    /location /brands/categories

    Remember that other route from the homepage, with the brand categories? That had some legacy inclusion functionality from an earlier iteration of the feature using join documents. We no longer needed that query and could map the new card data to the other structures. With such a great result with regions, replacing this will probably be great too, right?
  35. 37.

    Whooooops. What happened there? Now you might be thinking that big brown spike is Elasticsearch time. But ES is actually the purple. The brown is the time spent converting Elasticsearch’s query response to *our* API response formats. Moving data in and out of Ruby hashes is a pretty time-consuming and expensive operation in C-land!
  36. 38.

    It is beyond the scope of this talk, but take my word for it that a lot of data structures are involved. Take the code behind a simple hash lookup: I’m no C developer, but I do know that, universally, nested conditionals often end up causing some pain.
  37. 39.

    So what caused the big spike in Ruby wasting time with Hash? It turns out that when we started using the new non-join query result to build that legacy response, we ended up parsing the Elasticsearch response two times over. So, twice the CPU time. Okay, darn. How did we fix that?
  38. 40.

    class Query::PromotedProducts
      def results
    -   parser.parse(es_response)
    +   @results ||= parser.parse(es_response)
      end
    end

    Simple memoization. What really was the problem here? We forgot to test the performance of our changes! We made assumptions about how a change manipulating our data would perform, but we never confirmed those assumptions. Once we did run a load test locally, we were able to confirm the latency dropped back down to what we’d expect. Two routes, two improvements, ready for Green Saturday, April 20th. But the only performance test that really counts is how production performs on the big day. Anyone curious what that looked like?
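A toy version of the fix makes the effect visible. The parser and response here are fakes for illustration, not the real query class:

```ruby
# Memoization in isolation: the expensive parse runs once per query
# object, no matter how many collaborators call #results.
class ToyPromotedProducts
  attr_reader :parse_count

  def initialize(es_response)
    @es_response = es_response
    @parse_count = 0
  end

  def results
    @results ||= parse(@es_response)  # the one-line fix from the slide
  end

  private

  def parse(response)
    @parse_count += 1                 # stands in for the costly Hash churn
    response.fetch(:hits)
  end
end

query = ToyPromotedProducts.new(hits: %w[deal_a deal_b])
query.results
query.results                         # second call reuses @results
query.parse_count                     # => 1, not 2
```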
  39. 41.

    80ms avg response time · 100K RPM peak services throughput · 9% microcache hit rate · 53K RPM Discovery API · 20% location requests

    100k requests per minute at peak through Kong; 53k requests per minute overall to the Discovery API. Almost 20% of those were location requests, so all that fine-tuning paid off. 5k requests served out of the cache. Not quite 3x the throughput of a normal Saturday, with our average latency staying under 100ms, at only 80ms.
  40. 42.

    100% uptime

    Quite a success! As you might imagine, that party vibe I was expecting back in 2016 was in the air on this Green Saturday. So, to reflect back: what did we do to achieve this success?
  41. 43.

    Cached a lot of stuff, at different layers of the service. Consider whether your users always require the most up-to-date information coming straight from the data source. If you run a public website, there are probably more than a few cases where you can cache web responses, at least for a brief period.
  42. 44.

    Tune user input to make it more useful

    The various sensors and signals that are carried with us all day in our smartphones can share a wide set of very precise data with our apps. Depending on the use case, you may want to manipulate this data before plugging it into your business logic… or your cache key generator. Your GPS sensor does not know that a particular route only needs to know a user’s location with ZIP code-level granularity.
  43. 45.

    Limit external requests made while serving users

    One of the worst feelings is looking at your application performance monitor, seeing a big spike in your response time, and tracing it back to a corresponding response time in some external service. Where you can, try to move that external request to a background worker and persist the result for your API service to fetch and return to the end user.
  44. 46.

    Reconfirm your schema is right for your queries

    This can apply to any database where you are defining indices: Elasticsearch as well as Postgres or MySQL or whatever. Often as your application evolves, your initial index setup may no longer be ideal. Maybe there’s a screen that filters data on a field that wasn’t originally intended for filtering. Or data is being joined in unanticipated ways. Or, in our case, our index setup just proved to be against best practices for performance. I recommend going back and reviewing the documentation for the latest versions of your databases. Amongst the new feature details, I’ve found existing functionality gets clarified. Gotchas that are discovered in the Real World might be called out. You might even feel called out for a poor implementation, as we did.
  45. 47.

    Benchmark your improvements!

    When you’re in the midst of making these improvements to your application, it’s easy to move a little too fast. Once you have a fix, it’s human nature to want to rush it into production as quickly as possible. But spend the time to set up a performance test. You can simulate load using tools like Apache Benchmark, JMeter, or wrk. Check out your master branch and run your test there to get a baseline, and then compare the results when running against your feature branch. When you validate your assumptions, you may just catch something unexpected before it messes with your Apdex.
  46. 48.

    For HIGH performance, Rails needs to be MELLOW.

    Overall, the key to the high throughput we achieved was giving our Rails app and its database a break. Take advantage of the other services you have set up. Very likely your proxy layer and your cache stores are underutilized. Let them take some of the edge off.