Goruco 2015: Building and Testing Resilient Applications

Simon Hørup Eskildsen

June 20, 2015

Transcript

  1. Shopify

    165,000+ active Shopify merchants; $8 billion+ cumulative GMV; 200+ developers; 500+ servers; 2 datacenters; Ruby on Rails, one of the largest Rails deployments in the world; 3000+ containers running at any time; 10,000+ max checkouts per minute; 12+ deploys per day; 300M unique visits/month; league of Apple, eBay and Amazon.
  2. [Chart: availability (axis from 70 to 100) against number of services (10, 50, 100, 500, 1000), with series labeled 99.95, 99.98, 99.99 and 99.999]
  3. Even monolithic services easily have tens of dependencies: RDBMS, Redis, Memcached, ElasticSearch, S3, CDN, APIs, Mailing, CRM, …
  4. Avoid returning HTTP 500 when a single service fails, or suffer the fate of the (micro)service equation.
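
The "(micro)service equation" is not spelled out in the transcript. A common reading, assumed here, is that a page which fails whenever any one of its n dependencies fails has an availability of a^n, where a is the availability of each dependency and failures are independent. A rough Ruby sketch under that assumption (the method name is made up for illustration):

```ruby
# Assumption: every request needs all `services` dependencies, each one is
# independently available with probability `per_service`, and any single
# failure turns the whole request into an HTTP 500.
def overall_availability(per_service, services)
  per_service**services
end

[10, 100, 1000].each do |n|
  pct = overall_availability(0.9999, n) * 100
  puts format("%4d services at 99.99%% each -> %.2f%% overall", n, pct)
end
```

The more dependencies sit on the request path, the further overall availability falls below that of any single dependency, which is the argument for degrading gracefully instead of failing the whole page.
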
  5. Naïve session code

        def index
          @user = fetch_user
          @posts = Post.all
        end

        private

        def fetch_user
          # accesses potentially two data stores:
          # session + data store that stores user
          User.find(session[:user_id])
        end
  6. Resilient session code

        def index
          @user = fetch_user
          @posts = Post.all
        end

        private

        def fetch_user
          User.find(session[:user_id])
        # return nil if session store is down
        # in this example sessions are in Redis
        rescue Redis::BaseError
          nil
        end
  7. How do we test it?

    Mocks are client-specific and can easily hide as many bugs as they uncover. Testing in production means it’s too late, and failures there are difficult to reproduce. Network-level simulation in development and test environments would give full, reproducible confidence.
  8. Simulate TCP conditions with Toxiproxy (https://github.com/shopify/toxiproxy)

        Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
          Shop.first # this takes at least 1s
        end

        Toxiproxy[/redis/].down do
          session[:user_id] # this will throw an exception
        end
  9. Resiliency Matrix

                           Checkout     Admin        Storefront
        MySQL Shard        Unavailable  Unavailable  Degraded
        MySQL Master       Available    Unavailable  Available
        Kafka              Available    Degraded     Available
        External HTTP API  Degraded     Available    Unavailable
        redis-sessions     Unavailable  Unavailable  Degraded
  10. Write a Toxiproxy test for each cell

        # test/integration/resiliency_matrix_test.rb

        def test_section_a_mq_a_down
          Toxiproxy[:message_queue_a].down do
            get '/section_a'
            assert_response :success
          end
        end

        def test_section_b_datastore_b
          Toxiproxy[:datastore_b].down do
            get '/section_b'
            assert_response 500
          end
        end

        # ... and every other cell
  11. Beware of non-resilient dependencies

        # https://github.com/rails/rails/blob/master/activerecord/lib/active_record/query_cache.rb#L30-45
        def call(env)
          connection    = ActiveRecord::Base.connection
          enabled       = connection.query_cache_enabled
          connection_id = ActiveRecord::Base.connection_id
          connection.enable_query_cache!

          response = @app.call(env)
          response[2] = Rack::BodyProxy.new(response[2]) do
            restore_query_cache_settings(connection_id, enabled)
          end

          response
        rescue Exception => e
          restore_query_cache_settings(connection_id, enabled)
          raise e
        end
  12. Decorators

        class PersistentSet
          def to_a
            redis.smembers(key)
          rescue Redis::BaseError
            []
          end

          def include?(value)
            redis.sismember(key, value)
          rescue Redis::BaseError
            false
          end
        end
  13. With fallbacks the system is still vulnerable to slowness. ECONNREFUSED is a luxury in distributed systems; slowness is the killer.
  14. [Diagram: infrastructure operating normally, 400 RPS. Latency labels: 0.001s, 0.01s, 0.002s, 0.01s, 0.01s, 0.01s, 0.01s, 0.01s]
  15. [Diagram: database latency increases by 10x, throughput drops 10x, 40 RPS. Latency labels: 0.001s, 0.01s, 0.020s, 0.10s, 0.10s, 0.10s, 0.10s, 0.10s]
  16. Timeouts are not good enough

    Response time suffers, even if the timeout is low. Little’s law: capacity is reduced. We need ways to fail even faster! Setting timeouts is problematic: a very low timeout (<10ms) for frequently used resources doesn’t account for natural outliers.
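
Little's law (L = λW) ties together the number of requests in flight L, the throughput λ and the time per request W. With a fixed number of workers, throughput is capped at workers divided by response time. The sketch below assumes 4 workers. That figure is not stated on the slides; it is chosen because it is consistent with 400 RPS at 0.01s per request. The method name is likewise hypothetical.

```ruby
# Little's law: in_flight = throughput * response_time, so with a fixed
# number of workers the maximum throughput is workers / response_time.
# WORKERS = 4 is an assumption chosen to match the 400 RPS slide.
WORKERS = 4

def max_throughput(workers, response_time_s)
  workers / response_time_s
end

normal   = max_throughput(WORKERS, 0.01) # database healthy
degraded = max_throughput(WORKERS, 0.10) # database 10x slower

puts "normal: #{normal.round} RPS, degraded: #{degraded.round} RPS"
```

A timeout only puts an upper bound on W. A worker that is waiting for the timeout to fire still serves nothing in the meantime, which is why the slides look for ways to fail before the timeout is reached.
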
  17. Strong heuristics for failing fast

    Circuit breakers make clients fail fast once they’ve failed repeatedly. Bulkheads control access to resources, which is especially handy if you have high timeouts for some components.
  18. Circuit Breakers

    They give your failing components room to breathe and fail fast after several timeouts. If some timeouts are high, circuit breakers don’t help much: capacity is reduced until they kick in.
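
To make the idea concrete, here is a minimal, single-threaded circuit breaker sketch in plain Ruby. It is illustrative only and is not Semian's API: the class name, the `error_threshold` and `reset_after` parameters and the injectable `clock` are all assumptions made for this example.

```ruby
# Minimal circuit breaker sketch (illustrative only, not Semian's API).
class CircuitBreaker
  class OpenError < StandardError; end

  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(error_threshold:, reset_after:, clock: MONOTONIC_CLOCK)
    @error_threshold = error_threshold # consecutive failures before opening
    @reset_after = reset_after         # seconds to stay open before a trial call
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  # Runs the block unless the circuit is open, in which case it raises
  # OpenError immediately without touching the failing resource.
  def call
    raise OpenError, "circuit open" if open?
    result = yield
    @failures = 0
    @opened_at = nil
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @error_threshold
    raise
  end

  def open?
    return false unless @opened_at
    @clock.call - @opened_at < @reset_after
  end
end

# Usage: after 3 consecutive failures, calls fail fast for 10 seconds.
breaker = CircuitBreaker.new(error_threshold: 3, reset_after: 10)
value = begin
  breaker.call { raise IOError, "simulated timeout" }
rescue CircuitBreaker::OpenError, IOError
  nil # fall back, as in the resilient session code
end
```

Note that the breaker only opens after `error_threshold` calls have each waited for their full timeout, which matches the slide's caveat that capacity stays reduced until it kicks in.
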
  19. Bulkheads

    The mental model is a thread pool, but bulkheads are also useful in process-based servers. They ensure controlled access to resources when response times increase, and their impact is easy to reason about. They fail faster than circuit breakers when the timeout is high.
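
A bulkhead can be sketched as a fixed number of tickets per resource: a caller that cannot get a ticket fails immediately instead of queueing behind slow requests. The version below is an in-process illustration using a `Mutex`; the class and parameter names are hypothetical, and a process-based server would need a cross-process primitive instead, which this sketch does not attempt.

```ruby
# Minimal bulkhead sketch: at most `tickets` callers may use a resource
# at once; everyone else fails immediately instead of waiting.
# Illustrative only, in-process, and not any library's API.
class Bulkhead
  class FullError < StandardError; end

  def initialize(tickets:)
    @available = tickets
    @lock = Mutex.new
  end

  def acquire
    # Take a ticket or fail fast; the lock is held only for bookkeeping.
    @lock.synchronize do
      raise FullError, "no tickets available" if @available.zero?
      @available -= 1
    end
    begin
      yield
    ensure
      # Always hand the ticket back, even if the block raised.
      @lock.synchronize { @available += 1 }
    end
  end
end

# Usage: allow at most 2 concurrent users of a slow data store.
bulkhead = Bulkhead.new(tickets: 2)
result = bulkhead.acquire { :query_result }
```

Unlike the circuit breaker, the rejection here does not depend on any request having timed out first, which is why it can fail faster when timeouts are high.
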
  20. Do I need Circuit Breakers and Bulkheads?

    They are less critical for evented (especially) and multithreaded servers. Consider them if you’ve observed these problems in production (you’re now equipped to), or if you have high timeouts to some data stores and/or services because of legitimate outliers.
  22. Resiliency Maturity Pyramid

    No resiliency effort; Testing with mocks; Toxiproxy tests and matrix; Resiliency Patterns; Production Practise Days (Games); Kill Nodes (Chaos Monkey); Latency Monkey; Application-Specific Fallbacks; Region Gorilla.
  23. Final remarks

    Draw your resiliency matrix, write Toxiproxy tests and implement application-specific fallbacks. Not everyone needs circuit breakers and bulkheads; they may be premature for your application. Be careful when introducing new dependencies.
  24. More examples and docs in the Semian and Toxiproxy docs

    http://github.com/shopify/semian
    https://github.com/shopify/toxiproxy
    http://www.shopify.com/technology/16906928-building-and-testing-resilient-ruby-on-rails-applications
  25. Server by Konstantin Velichko from the Noun Project; basket by Ben Rex Furneaux from the Noun Project; container by Creative Stall from the Noun Project; people by Wilson Joseph from the Noun Project; mesh network by Lance Weisser from the Noun Project; Conductor by Luis Prado from the Noun Project; Jar by Yazmin Alanix from the Noun Project; Broken Chain by Simon Martin from the Noun Project; Book by Ben Rex Furneaux from the Noun Project; network by Jessica Coccimiglio from the Noun Project; server by Creative Stall from the Noun Project; components by icons.design from the Noun Project; switch button by Marco Olgio from the Noun Project; Pile of leaves (autumn) by Aarthi Ramamurthy; Bridge by Toreham Sharman from the Noun Project; collaboration by Alex Kwa from the Noun Project; converge by Creative Stall from the Noun Project; change by Jorge Mateo from the Noun Project; person by Brian Dys Sahagun from the Noun Project; water faucet by Yaroslav Samoilov from the Noun Project; cash register by Gergely Korinek from the Noun Project; lungs by Joris Hoogendoorn; Hour Glass by Arthur Shalin from the Noun Project; Brooklyn Bridge at Night by Dennis Leung.