Slide 1

Caching at Bloc: Post Mortem + Tech Talk

Slide 2

Hi, I’m Megan, Tech Lead @ Bloc, a part-time online bootcamp for aspiring developers and designers.

Slide 3

Caching at Bloc: Post Mortem + Tech Talk
● Debugging an unknown downtime
● Learning how Rails caches work
● How Bloc implemented caching
● How we fixed it

Slide 4

Memory Overuse Incident - November

Slide 5

This is a story all about how...

Slide 6

The Timeline
Bloc has two alert mechanisms: Rollbar for exceptions, Pingdom for downtime.
Wednesday 8:46pm: Rollbar error, single occurrence. No Pingdom alert.
Thursday 4:00am: Pingdom alert for various landing pages
Thursday 7:17am: Pingdom alert for bloc.io, intermittent uptime
Thursday 8:00am: Consistent downtime for bloc.io
Thursday 8:30am: Began investigating on the bus
Thursday 9:20am: Adjusted Redis To Go memory limit from 2GB to 5GB

Slide 7

So What Happened?!

Slide 8

The Investigation
The initial error surfaced by Rollbar:
Redis::CommandError: OOM command not allowed when used memory > 'maxmemory'
↳ Search through the codebase for familiar code: none
↳ Check the Heroku database configuration: there’s a Redis To Go add-on
↳ Memory usage at 2GB, 100%: that must be it.
Questions:
● What do we use Redis for?
● How do we increase/decrease usage?
● What if we turned it off?
● Will restarting it do anything?

Slide 9

Interesting Graphs
What’s interesting about these?

Slide 10

Initial Conclusions
● Not a customer scaling issue
● Scales linearly with time
● Key usage and memory usage are correlated
● Memory has climbed before and hit the max; what happened there?
(Graph annotations: linear slope, fails to reset, reaches memory maximum, no data)

Slide 11

Digging In Further
What do we use Redis for?
● Caching
○ Maybe our expirations aren’t working?
● Resque workers (emails, enrollment)
○ /resque/overview has multiple failed jobs
Failed attempts:
● Restart the Redis DB: no effect on memory
● Clear the failed jobs from Redis: no effect on memory (still a problem to fix, though)
Success:
● Just throw money at the problem! (Hot fix alert)
Now we need to figure out the real fix before June...

Slide 12

Rails Caching: An Overview

Slide 13

What is caching? Save it for later.
● Exists at every layer of hardware and software
● Keys with values, which might get expired
Kind of like a menu at a restaurant.

Slide 14

Rails Cache Settings
Multiple types:
● Page caching
● Action caching
● Fragment caching
○ Russian doll caching
● Low-level caching
● SQL caching
Enabled through two variables in the configuration file:
config.action_controller.perform_caching = true (does not affect low-level caching)
config.cache_store = :redis_store
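As a sketch of what those two settings might look like in context (the redis-store gem and the REDIS_URL variable are illustrative assumptions, not necessarily Bloc’s exact setup):

```ruby
# config/environments/production.rb (sketch; assumes the redis-store
# gem is in the Gemfile and REDIS_URL is set in the environment)
Rails.application.configure do
  # Enables page, action, and fragment caching; low-level caching via
  # Rails.cache works regardless of this flag
  config.action_controller.perform_caching = true

  # Point the Rails cache at Redis
  config.cache_store = :redis_store, ENV['REDIS_URL']
end
```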

Slide 15

Rails Caching: Page Caching
Using the actionpack-page_caching gem. Code looks like:
class WeblogController < ActionController::Base
  caches_page :show, :new
end
Keys look like:
cache:views/www.bloc.io/users/carlton-banks/checkpoints/all
(the page URL)
Caveats:
● Can’t be used for controller actions that have before filters

Slide 16

Rails Caching: Action Caching
Using the actionpack-action_caching gem. Code looks like:
class LandingController < ApplicationController
  caches_action :about, expires_in: 6.hours, if: -> { guest? }
  caches_action :mentors, expires_in: 6.hours, if: -> { guest? }
end
Keys look like:
cache:views/www.bloc.io/mentors
(the page URL)

Slide 17

Rails Caching: Fragment Caching (Part 1)
Natively exists in Rails. Can fragment cache a partial or an object instance.
Caching an instance. Code looks like:
/ app/views/alum_stories/layouts/_text_photo_layout.html.haml
- cache story do
  .headline = story.headline
Key looks like:
cache:views/alum_stories/3-20150430203540769345000/44e227db194e26779273184afa632eef
(story.cache_key, the object cache key, followed by an md5 hash of the view contents)
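The key composition above can be sketched in plain Ruby (illustrative only; the exact Rails internals differ): the record’s cache_key (model/id-updated_at) plus an MD5 digest of the template source, so the cached fragment invalidates when either the record or the view changes.

```ruby
# Sketch of how a fragment cache key for an object is composed
# (illustrative, not the exact Rails internals): the record's
# cache_key plus an MD5 digest of the view source.
require 'digest'

story_cache_key = 'alum_stories/3-20150430203540769345000' # model/id-updated_at
template_source = ".headline = story.headline\n"           # view contents
digest = Digest::MD5.hexdigest(template_source)
fragment_key = "views/#{story_cache_key}/#{digest}"
```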

Slide 18

Rails Caching: Fragment Caching (Part 2)
Natively exists in Rails. Can fragment cache a partial or an object instance.
Caching a partial. Code looks like:
/ app/views/layouts/_footer.html.haml
- cache do
  .full-page-footer
    %h6 Programs
Keys look like:
cache:views/www.bloc.io/users/nicky-banks/9e8e8931e0697aff02864b001ebdc99c
(the page URL followed by an md5 hash of the view contents)

Slide 19

Rails Caching: Low-Level Caching
Natively exists in Rails. Code looks like:
Rails.cache.fetch('my_unique_cache_key') do
  Calculator.new.expensive_calculation
end
Keys look like:
cache:my_unique_cache_key
(the specified key)
The cache key is key:
● Unique to the object’s current state
● If you expect to search keys at some point, namespacing is great
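The read-through behavior of Rails.cache.fetch (return the cached value on a hit; run the block and store the result on a miss) can be sketched in plain Ruby. TinyCache is a hypothetical stand-in, not the Rails API:

```ruby
# Minimal sketch of the read-through "fetch" pattern that
# Rails.cache.fetch provides; TinyCache is a hypothetical stand-in.
class TinyCache
  def initialize
    @store = {}
  end

  # Return the cached value for key, or run the block, store its
  # result, and return it.
  def fetch(key)
    return @store[key] if @store.key?(key)
    @store[key] = yield
  end
end

cache = TinyCache.new
calls = 0
2.times { cache.fetch('my_unique_cache_key') { calls += 1; 40 + 2 } }
# The expensive block ran only once; the second fetch was a cache hit.
```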

Slide 20

Rails Caching: SQL Caching

Slide 21

Cache Keys: Expiration
For each level of caching, you can set an expiration:
Rails.cache.fetch(my_unique_cache_key, expires_in: 6.hours)
An expiration value gets set on the key, aka a TTL. The database clears that key and its cached value once it expires.
Watching it happen with > redis-cli info:
Before: Keys: 137 | Expires: 2 | Memory Used: 1.77MB | Expired Keys: 10 | Evicted Keys: 0
Load a cached page that expires in a minute: Keys: 138 | Expires: 3 | Memory Used: 1.86MB | Expired Keys: 10 | Evicted Keys: 0
Wait a minute: Keys: 137 | Expires: 2 | Memory Used: 1.77MB | Expired Keys: 11 | Evicted Keys: 0
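The TTL behavior can be sketched in plain Ruby (ExpiringCache is hypothetical, not the Rails or Redis API): each entry stores a deadline, and a read past the deadline behaves as a miss.

```ruby
# Sketch of TTL-based expiration: each key stores a deadline; reads
# past the deadline behave as a miss and the value is recomputed.
# ExpiringCache is a hypothetical stand-in, not a real API.
class ExpiringCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  # `now` is injectable so expiry can be demonstrated without sleeping.
  def fetch(key, expires_in:, now: Time.now)
    entry = @store[key]
    if entry && now < entry.expires_at
      entry.value # fresh hit
    else
      value = yield # miss or expired: recompute and reset the TTL
      @store[key] = Entry.new(value, now + expires_in)
      value
    end
  end
end

cache = ExpiringCache.new
t0 = Time.now
cache.fetch('page', expires_in: 60, now: t0) { 'rendered page v1' }
hit    = cache.fetch('page', expires_in: 60, now: t0 + 30) { 'rendered page v2' }
missed = cache.fetch('page', expires_in: 60, now: t0 + 61) { 'rendered page v2' }
```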

Slide 22

Cache Keys: Eviction
The Redis configuration has an eviction policy for keys:
noeviction | allkeys-lru | volatile-lru | allkeys-random | volatile-random | volatile-ttl
Some keywords:
● allkeys evicts any key, regardless of whether an expiration is set
● volatile only evicts keys with an expiration set
● random selects keys randomly
● lru selects keys that are least recently used
● ttl selects keys with the least time to live
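A strict LRU policy can be sketched in plain Ruby (LruCache is hypothetical; Redis actually approximates LRU, as covered in the appendix): when the cache is full, the least recently used key is evicted. Ruby hashes preserve insertion order, so re-inserting a key on read marks it most recently used.

```ruby
# Sketch of allkeys-lru style eviction (LruCache is hypothetical, not
# Redis itself): on overflow, evict the least recently used key.
class LruCache
  def initialize(max_keys)
    @max_keys = max_keys
    @store = {} # insertion-ordered; first key is least recently used
  end

  def get(key)
    return nil unless @store.key?(key)
    @store[key] = @store.delete(key) # re-insert: mark most recently used
  end

  def set(key, value)
    @store.delete(key)
    @store[key] = value
    @store.delete(@store.keys.first) if @store.size > @max_keys # evict LRU
  end

  def keys
    @store.keys
  end
end

cache = LruCache.new(2)
cache.set('a', 1)
cache.set('b', 2)
cache.get('a')    # touch 'a', so 'b' is now least recently used
cache.set('c', 3) # cache is full: 'b' gets evicted
```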

Slide 23

Okay, so that’s how caching works. What did we do wrong?

Slide 24

Profiling Our Keys
Using redis-audit, I found the distribution of a sampling of keys:
> ruby redis-audit.rb -h spinyfin.redistogo.com -p 9340 -a mypassword -s 1000
Sampling 1000 keys, 0.2% of the total, the tool found 16 groups.

Slide 25

The problem is caching
The 16 groups found by the tool actually narrowed down to 9, where 4 groups accounted for more than 1%:
Key group | Memory Usage % | Expires
cache:views/www.bloc.io/users/carlton-banks/checkp... | 46.94% | .38%
cache:views/checkpoints/1368-201607251822231046450... | 24.54% | 0%
cache:https://github.com/will/bloc-jams/commit/aef... | 21.34% | 0%
cache:111325,111326,111327,111328,111329,111330,11... | 5.55% | 0%
other caches + workers | 1.63% | 0%

Slide 26

The problem is caching: expiration + eviction
Of all our keys, only 0.06% have expiration dates: 294 keys out of 435,474. Our maxmemory policy is volatile-lru, which only evicts keys with an expiration set, so once memory filled up there was almost nothing Redis could evict.

Slide 27

Fixing Key Eviction
We use Redis To Go as our Redis server host.
Pro: we don’t have to maintain the server.
Con: limited configuration options; just maxmemory is adjustable.

Slide 28

Fixing expiration
● Page caching
● Action caching (our action caching has proper expirations for guest users)
● Fragment caching (71%)
● Low-level caching (27%)
● SQL caching

Slide 29

Fixing the footer cache: 47%
Keys like:
cache:views/www.bloc.io/users/carlton-banks/9e8e8931e0697aff02864b001ebdc99c
Code is:
/ app/views/layouts/_footer.html.haml
- cache do

Slide 30

Fixing the footer cache: 47%
Keys like:
cache:views/www.bloc.io/users/carlton-banks/9e8e8931e0697aff02864b001ebdc99c
Code is:
/ app/views/layouts/_footer.html.haml
- cache do
Debugging questions:
● How big is this key space?
○ Dependent on page URL (including user) and content hash
● What’s wrong with this key?
○ Many unique keys for the same content
● Do we need to cache this?
○ It’s not computationally intensive, so no.
TODO: delete existing keys, remove the caching mechanism in code

Slide 31

Fixing the checkpoint nav cache: 25%
Keys like:
cache:views/checkpoints/1368-20160725182223104645000/roadmap_sections/95-20160211001250316888000/users/2302534-20160904083834996975000/user/f21882266d6ec6d0f9331f1e16d3c176
Code is:
/ app/views/users/checkpoints/_checkpoint_nav.html.haml
- cache [@checkpoint, @checkpoint.section, @user, current_user.role] do

Slide 32

Fixing the checkpoint nav cache: 25%
Keys like:
cache:views/checkpoints/1368-20160725182223104645000/roadmap_sections/95-20160211001250316888000/users/2302534-20160904083834996975000/user/f21882266d6ec6d0f9331f1e16d3c176
Code is:
/ app/views/users/checkpoints/_checkpoint_nav.html.haml
- cache [@checkpoint, @checkpoint.section, @user, current_user.role] do
● How big is this key space?
○ Dependent on checkpoint, section, user, and role: big.
● What’s wrong with this key?
○ There are a lot of them for similar data
● Do we need to cache this?
○ Might be able to use lower-level caching of position, index, etc.
TODO: delete existing keys, set expiration on new keys

Slide 33

Fixing the github commit cache: 21%
Keys like:
cache:https://github.com/will/Blocly/commit/c0ad999d1bd91361858481e807724410407e
Code is:
commit = Rails.cache.fetch(commit_link) do
  Github::Commit.new(commit_link)
end

Slide 34

Fixing the github commit cache: 21%
Keys like:
cache:https://github.com/will/Blocly/commit/c0ad999d1bd91361858481e807724410407e
Code is:
commit = Rails.cache.fetch(commit_link) do
  Github::Commit.new(commit_link)
end
● How big is this key space?
○ The number of unique commits across all students
● What’s wrong with this key?
○ It’s not namespaced
● Do we need to cache this?
○ Seems like a good idea
TODO: namespace the key, set expiration on new keys, delete existing keys
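One way to carry out that TODO, as a sketch (the github_commit: prefix, the helper name, and the 1.day expiration are illustrative choices, not necessarily what Bloc shipped):

```ruby
# Sketch of a namespaced github commit cache key. The "github_commit"
# prefix is an illustrative assumption; namespacing makes the keys
# findable (and deletable) by prefix, e.g. "cache:github_commit:*".
def github_commit_cache_key(commit_link)
  "github_commit:#{commit_link}"
end

key = github_commit_cache_key('https://github.com/will/Blocly/commit/c0ad999d')
# In the app this key would then be used as, roughly:
#   Rails.cache.fetch(key, expires_in: 1.day) { Github::Commit.new(commit_link) }
```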

Slide 35

Fixing the calendar cache: 6%
Keys like:
111325,111326,111327,111328,111329,111330,111331,111332,111333,111334,111335,111336,...
Code is:
key = @appointments.pluck(:id).sort.join(',')
@data = Rails.cache.fetch(key) do
  Appointment.calendar_for(@appointments, @user).export
end

Slide 36

Fixing the calendar cache: 6%
Keys like:
111325,111326,111327,111328,111329,111330,111331,111332,111333,111334,111335,111336,...
Code is:
key = @appointments.pluck(:id).sort.join(',')
@data = Rails.cache.fetch(key) do
  Appointment.calendar_for(@appointments, @user).export
end
● How big is this key space?
○ All combinations of appointment IDs
● What’s wrong with this key?
○ Freakin’ nonsensical; no human words
● Do we need to cache this?
○ Sure, but not with this key
TODO: use a better key, set expiration, delete existing keys

Slide 37

Results

Slide 38

Massive Performance Improvements
We applied the correct caching techniques to our slow API endpoints. (Graph annotation: THE GOAL)

Slide 39

Constant Memory Usage: Predictable!

Slide 40

Another Cache Key Learning
One more mistake when implementing our improved caching.
Code is:
most_recent_appointment = @appointments.reorder(:updated_at).last
cache_key = most_recent_appointment.cache_key
Rails.cache.fetch(cache_key, expires_in: 1.day) do
  Appointment.calendar_for(@appointments, @user).export
end
Keys like:
calendar_export:appointments/709028-20170612032235056661000
Most recent appointment as the cache key! Smart, right? Wrong. We forgot the user; we needed to add the user’s cache key too:
calendar_export:users/2389812-20170612042305184969000:appointments/709028-20170612032235056661000
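The corrected key construction can be sketched in plain Ruby; the Record struct below is a hypothetical stand-in for the ActiveRecord models, whose cache_key is derived from the model name, id, and updated_at:

```ruby
# Plain-Ruby sketch of the fixed calendar cache key: combine the
# user's cache key with the most recently updated appointment's cache
# key, so the cache invalidates when either changes. Record stands in
# for the ActiveRecord models; the timestamps are illustrative.
require 'time'

Record = Struct.new(:model, :id, :updated_at) do
  def cache_key
    "#{model}/#{id}-#{updated_at.strftime('%Y%m%d%H%M%S%9N')}"
  end
end

user = Record.new('users', 2_389_812, Time.utc(2017, 6, 12, 4, 23, 5))
appointments = [
  Record.new('appointments', 709_027, Time.utc(2017, 6, 11)),
  Record.new('appointments', 709_028, Time.utc(2017, 6, 12, 3, 22, 35)),
]

most_recent = appointments.max_by(&:updated_at)
cache_key = "calendar_export:#{user.cache_key}:#{most_recent.cache_key}"
```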

Slide 41

Takeaways
● When determining cache keys:
○ Consider the keyspace
○ Consider how the keyspace correlates to time and scale
○ Set an expiration that makes sense for the volatility of the cached information
● Become familiar with your app’s caching mechanism:
○ Cache implementation, expiration policy, default expiration
● Use caching to quickly make massive performance improvements
● If using Redis, check out redis-cli for cache debugging:
> redis-cli monitor
> redis-cli info
> redis-cli get
> redis-cli -h HOST -p PORT -a MYPASSWORD

Slide 42

The End!
twitter: @meganmarie610
email: [email protected]
And the classic: we’re hiring at Bloc!

Slide 43

Appendix

Slide 44

What do the cache values look like?
For view caches, the full controller response as a string. For object caches, the object encoded in a JSON string.

Slide 45

How do you debug the redis cache?
> redis-cli
> redis-cli -h HOST -p PORT -a MYPASSWORD
Most used commands:
> info
> monitor
> ttl
> get
> expire

Slide 46

How does Redis use the LRU algorithm?
Redis uses an approximated LRU algorithm with a configurable sampling size.
(Figure: the light gray band shows objects that were evicted, the gray band shows objects that were not evicted, and the green band shows objects that were added.)
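Sampled eviction can be sketched in plain Ruby (SampledLruCache is hypothetical and far simpler than Redis’ implementation): on overflow, pick a few random keys and evict the least recently touched of those, trading exactness for cheaper bookkeeping.

```ruby
# Sketch of approximated (sampled) LRU eviction, in the spirit of
# Redis's sampling approach: on overflow, pick a handful of random
# keys and evict the least recently used among them. SampledLruCache
# is a hypothetical illustration, not Redis's actual code.
class SampledLruCache
  def initialize(max_keys, samples: 5)
    @max_keys = max_keys
    @samples = samples
    @store = {}     # key => value
    @last_used = {} # key => logical clock of last access
    @clock = 0
  end

  def set(key, value)
    touch(key)
    @store[key] = value
    evict_one if @store.size > @max_keys
  end

  def get(key)
    touch(key) if @store.key?(key)
    @store[key]
  end

  def size
    @store.size
  end

  private

  def touch(key)
    @clock += 1
    @last_used[key] = @clock
  end

  def evict_one
    candidates = @store.keys.sample(@samples)
    victim = candidates.min_by { |key| @last_used[key] }
    @store.delete(victim)
    @last_used.delete(victim)
  end
end

cache = SampledLruCache.new(3, samples: 2)
(1..10).each { |i| cache.set("key#{i}", i) }
# Memory stays bounded at max_keys; the victim is only approximately
# the least recently used key overall.
```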

Slide 47

References
● Rails cache overview
● Using Redis as an LRU cache
● DHH’s key-based cache expiration overview
● redis-audit