Goruco 2015: Building and Testing Resilient Applications

Simon Hørup Eskildsen

June 20, 2015

Transcript

  1. Building and Testing Resilient Applications. Simon Eskildsen (@Sirupsen), Goruco 2015, NYC
  3. Shopify: 165,000+ active Shopify merchants; $8 billion+ cumulative GMV; 200+ developers; 500+ servers; 2 datacenters; Ruby on Rails, one of the largest Rails deployments in the world; 3000+ containers running at any time; 10,000+ max checkouts per minute; 12+ deploys per day; 300M unique visits/month; league of Apple, eBay and Amazon
  4. Building reliable bridges in large distributed systems

  5. Most important effort for peace of mind

  6. The Holidays and The Catastrophic Fall of 2014

  7. Resiliency: Building reliable systems from unreliable components

  8. (Micro)service equation: $\text{Uptime} = A^{N}$, where $A$ is the availability per service, $N$ is the number of services, and Uptime is the final availability
  9. [Chart: final availability (%, axis from 70 to 100) against number of services (10, 50, 100, 500, 1000), one curve per per-service availability: 99.95%, 99.98%, 99.99%, 99.999%]
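The (micro)service equation from slide 8 can be tabulated for the values on the chart axes. This is a minimal sketch that assumes every request needs all N services and that the services fail independently, which is what the equation implies; the method name `final_availability` is made up for illustration.

```ruby
# Final availability when a request needs all N services to be up and each
# one is independently available a fraction A of the time: A ** N.
def final_availability(per_service, services)
  per_service**services
end

# Values taken from the chart on slide 9.
PER_SERVICE = [0.9995, 0.9998, 0.9999, 0.99999].freeze
SERVICE_COUNTS = [10, 50, 100, 500, 1000].freeze

PER_SERVICE.each do |a|
  row = SERVICE_COUNTS.map do |n|
    format("N=%-4d %6.2f%%", n, final_availability(a, n) * 100)
  end
  puts "A=#{format('%.3f', a * 100)}%: #{row.join('  ')}"
end
```

Each row shows how quickly the final availability decays as the number of required services grows, even when every individual service looks very reliable.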
  10. Even monolithic services easily have tens of dependencies: RDBMS, Redis, Memcached, ElasticSearch, S3, CDN, APIs, Mailing, CRM, …
  11. Applications should be designed to handle fallbacks

  13. [Diagram labels: search, sessions, carts, mysql-shard, cdn]

  14. Avoid an HTTP 500 when a single service fails, or suffer the fate of the (micro)service equation
  15. Sessions data store unavailable: customer signed out

  16. Naïve session code

        def index
          @user = fetch_user
          @posts = Post.all
        end

        private

        def fetch_user
          # accesses potentially two data stores:
          # session + data store that stores user
          User.find(session[:user_id])
        end
  17. Resilient session code

        def index
          @user = fetch_user
          @posts = Post.all
        end

        private

        def fetch_user
          User.find(session[:user_id])
        # return nil if session store is down
        # in this example sessions are in Redis
        rescue Redis::BaseError
          nil
        end
  18. How do we test it? Mocks are client-specific and can easily cover up as many bugs as they uncover. Production testing means it's too late, and failures are difficult to reproduce. Network-level simulation in development and test environments would give full, reproducible confidence.
  19. Simulate TCP conditions with Toxiproxy (https://github.com/shopify/toxiproxy)

        Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
          Shop.first # this takes at least 1s
        end

        Toxiproxy[/redis/].down do
          session[:user_id] # this will throw an exception
        end
  21. Testing the session fallback with Toxiproxy

        def test_storefront_resilient_sessions
          Toxiproxy[:redis_sessions].down do
            get '/'
            assert_response :success
          end
        end
  22. I'm lost. How do I get an overview of my application?
  23. Resiliency Matrix

                             Checkout     Admin        Storefront
        MySQL Shard          Unavailable  Unavailable  Degraded
        MySQL Master         Available    Unavailable  Available
        Kafka                Available    Degraded     Available
        External HTTP API    Degraded     Available    Unavailable
        redis-sessions       Unavailable  Unavailable  Degraded
  24. Write a Toxiproxy test for each cell

        # test/integration/resiliency_matrix_test.rb
        def test_section_a_mq_a_down
          Toxiproxy[:message_queue_a].down do
            get '/section_a'
            assert_response :success
          end
        end

        def test_section_b_datastore_b
          Toxiproxy[:datastore_b].down do
            get '/section_b'
            assert_response 500
          end
        end

        # ... and every other cell
  25. Beware of non-resilient dependencies

        # https://github.com/rails/rails/blob/master/activerecord/lib/active_record/query_cache.rb#L30-45
        def call(env)
          connection = ActiveRecord::Base.connection
          enabled = connection.query_cache_enabled
          connection_id = ActiveRecord::Base.connection_id
          connection.enable_query_cache!

          response = @app.call(env)
          response[2] = Rack::BodyProxy.new(response[2]) do
            restore_query_cache_settings(connection_id, enabled)
          end

          response
        rescue Exception => e
          restore_query_cache_settings(connection_id, enabled)
          raise e
        end
  26. Another example

        class Product
          def tags
            redis.smembers("product:#{id}:tags")
          end
        end

  27. Slightly better

        class Product
          def tags
            redis.smembers("product:#{id}:tags")
          rescue Redis::BaseError => e
            ErrorReporter.log(e)
            []
          end
        end
  28. Rescues in high-level code smell of a leaky abstraction

  29. Decorators

        class PersistentSet
          def to_a
            redis.smembers(key)
          rescue Redis::BaseError
            []
          end

          def include?(value)
            redis.sismember(key, value)
          rescue Redis::BaseError
            false
          end
        end
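The decorator idea from slide 29 can be tried without a Redis server. The sketch below is an illustration under stated assumptions, not the talk's code: `StoreUnavailable` stands in for `Redis::BaseError`, `FakeSetStore` is a hypothetical in-memory client that only mimics the `sadd`, `smembers` and `sismember` commands, and the store is injected through the constructor so it can be switched off.

```ruby
require "set"

# Stand-in for Redis::BaseError so the sketch runs without the redis gem.
class StoreUnavailable < StandardError; end

# Hypothetical in-memory client exposing the set commands used below.
class FakeSetStore
  attr_writer :up

  def initialize(up: true)
    @up = up
    @sets = Hash.new { |hash, key| hash[key] = Set.new }
  end

  def sadd(key, value)
    check!
    @sets[key].add(value)
  end

  def smembers(key)
    check!
    @sets[key].to_a
  end

  def sismember(key, value)
    check!
    @sets[key].include?(value)
  end

  private

  def check!
    raise StoreUnavailable, "connection refused" unless @up
  end
end

# The decorator owns the fallbacks, so callers never rescue store errors.
class PersistentSet
  def initialize(store, key)
    @store = store
    @key = key
  end

  def to_a
    @store.smembers(@key)
  rescue StoreUnavailable
    []
  end

  def include?(value)
    @store.sismember(@key, value)
  rescue StoreUnavailable
    false
  end
end

# High-level code stays free of rescue clauses.
class Product
  attr_reader :id

  def initialize(id, store)
    @id = id
    @tags = PersistentSet.new(store, "product:#{id}:tags")
  end

  def tags
    @tags.to_a
  end
end
```

With the store up, `Product#tags` returns the stored members; after `store.up = false` it returns an empty array instead of raising, which is the behaviour a Toxiproxy test would assert against a real Redis.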
  30. With fallbacks the system is still vulnerable to slowness. ECONNREFUSED is a luxury in distributed systems; slowness is the killer.
  31. Little’s law

  32. [Diagram: interleaved requests: /impacted 500ms, /ok 20ms, /ok 20ms, /impacted 500ms]

  33. Infrastructure operating normally: 400 RPS [Diagram latencies: 0.001s, 0.01s, 0.002s, 0.01s, 0.01s, 0.01s, 0.01s, 0.01s]

  34. Database latency increases by 10x, throughput drops 10x: 40 RPS [Diagram latencies: 0.001s, 0.01s, 0.020s, 0.10s, 0.10s, 0.10s, 0.10s, 0.10s]
  35. Timeouts are not good enough. Response time suffers, even if the timeout is low. Little's law: capacity is reduced. We need ways to fail even faster! Setting timeouts is problematic: a super low timeout (<10ms) for frequently used resources doesn't account for natural outliers.
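The capacity argument in slides 32 to 35 follows from Little's law, L = λW: with a fixed number of busy workers L and a mean response time W, the sustainable throughput is λ = L / W. The sketch below works through the numbers. The worker count of 4 is an assumption chosen so that a 10ms response time reproduces the 400 RPS on slide 33 (the slides do not state it), and the 50/50 traffic mix is likewise an assumption read off the slide 32 diagram.

```ruby
# Little's law: L = lambda * W. With L workers all busy, the sustainable
# throughput is lambda = L / W.
def max_throughput(workers:, response_time:)
  workers / response_time.to_f
end

# Mean response time for a traffic mix given as { seconds => share }.
def mean_response_time(mix)
  mix.sum { |seconds, share| seconds * share }
end

WORKERS = 4 # assumed worker count, not stated on the slides

normal   = max_throughput(workers: WORKERS, response_time: 0.010)
degraded = max_throughput(workers: WORKERS, response_time: 0.100)
puts format("normal: %.0f RPS, degraded: %.0f RPS", normal, degraded)

# Assumed mix: half the requests hit the slow dependency (500ms),
# half do not (20ms).
mixed = mean_response_time({ 0.500 => 0.5, 0.020 => 0.5 })
capacity = max_throughput(workers: WORKERS, response_time: mixed)
puts format("mixed mean: %.0f ms, capacity: %.1f RPS", mixed * 1000, capacity)
```

The second calculation is the reason a timeout alone is not enough: requests that never touch the slow dependency still lose capacity, because the workers they need are tied up waiting on the impacted requests.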
  36. Your resiliency reference

  37. Strong heuristics for failing fast. Circuit breakers make clients fail fast once they've failed repeatedly. Bulkheads control access to resources, which is especially handy if you have high timeouts for some components.
  38. Circuit Breakers [Flowchart: Start RPC, then check Circuit.open? If YES, throw an exception. If NO, run Driver.query(timeout=200ms) and check Driver.failure? If YES, markFailure() and throw an exception. If NO, markSuccess() and End RPC.]
  39. Circuit Breakers. They give your failing components room to breathe. They fail fast after several timeouts. If some timeouts are high, circuit breakers don't help much: capacity is reduced until they kick in.
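The flowchart on slide 38 maps onto a small class. This is a minimal single-threaded sketch under assumptions, not Semian's or Hystrix's implementation: the names `CircuitBreaker`, `CircuitOpenError`, `error_threshold` and `reset_timeout` are chosen for illustration, any `StandardError` raised by the block counts as a failure, and the clock is injectable so the behaviour can be tested without sleeping.

```ruby
class CircuitOpenError < StandardError; end

class CircuitBreaker
  def initialize(error_threshold: 3, reset_timeout: 10,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold # consecutive failures before opening
    @reset_timeout = reset_timeout     # seconds to stay open before probing
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  # Open means "fail fast". Once reset_timeout has elapsed, calls are let
  # through again to probe whether the dependency has recovered.
  def open?
    return false if @opened_at.nil?

    @clock.call - @opened_at < @reset_timeout
  end

  # Mirrors the flowchart: open? -> raise; otherwise run the call and
  # record success or failure.
  def call
    raise CircuitOpenError, "circuit open, failing fast" if open?

    begin
      result = yield
    rescue StandardError
      mark_failure
      raise
    end
    mark_success
    result
  end

  private

  def mark_failure
    @failures += 1
    @opened_at = @clock.call if @failures >= @error_threshold
  end

  def mark_success
    @failures = 0
    @opened_at = nil
  end
end
```

A caller would wrap each remote call, for example `breaker.call { driver.query(sql) }` with a hypothetical `driver`, and treat `CircuitOpenError` like any other dependency failure by serving the fallback. A production version would also need thread safety and per-dependency state.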
  40. Bulkheads [Diagram: 2 tickets, 1 ticket]

  41. Bulkheads [Diagram: 1 ticket, 1 ticket; SLOW]

  42. Bulkheads [Diagram: 0 tickets, 1 ticket; SLOW]

  43. Bulkheads [Diagram: 0 tickets, 1 ticket; REJECTED, SLOW]

  44. Bulkheads [Diagram: 0 tickets, 0 tickets; SLOW]

  45. Bulkheads. The mental model is a thread pool, but they are also useful in process-based servers. They ensure controlled access to resources when response times increase, and their impact is easy to reason about. They fail faster than circuit breakers when the timeout is high.
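The ticket animation on slides 40 to 44 can be sketched as a counter guarded by a mutex. This is a simplified in-process illustration under assumptions: the names `Bulkhead` and `BulkheadFullError` are made up, tickets are shared between threads of one process rather than between worker processes, and a caller is rejected immediately instead of waiting for a ticket.

```ruby
class BulkheadFullError < StandardError; end

class Bulkhead
  def initialize(tickets:)
    @available = tickets
    @mutex = Mutex.new
  end

  # Runs the block only if a ticket is free; the ticket is always returned,
  # even when the block raises.
  def acquire
    take_ticket
    begin
      yield
    ensure
      release_ticket
    end
  end

  def available
    @mutex.synchronize { @available }
  end

  private

  def take_ticket
    @mutex.synchronize do
      raise BulkheadFullError, "no tickets left, rejecting" if @available.zero?

      @available -= 1
    end
  end

  def release_ticket
    @mutex.synchronize { @available += 1 }
  end
end
```

With `Bulkhead.new(tickets: 2)` guarding a slow resource, at most two callers can be stuck waiting on it at once; a third is rejected straight away and can serve its fallback, so the remaining workers stay free for requests that do not need that resource.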
  46. Circuit Breakers, Bulkheads and Timeouts complement each other well to fail fast!
  47. Do I need Circuit Breakers and Bulkheads? They are less critical for evented (especially) and multithreaded servers. Consider them if you've observed these problems in production (you're now equipped to), or if you have high timeouts to some data stores and/or services because of legitimate outliers.
  48. Resiliency Toolkits: netflix/hystrix, shopify/semian, twitter/finagle


  50. Resiliency Maturity Pyramid [Diagram levels: No resiliency effort; Testing with mocks; Toxiproxy tests and matrix; Resiliency Patterns; Production Practise Days (Games); Kill Nodes (Chaos Monkey); Latency Monkey; Application-Specific Fallbacks; Region Gorilla]
  51. Final remarks. Draw your resiliency matrix, write Toxiproxy tests and implement application-specific fallbacks. Not everyone needs circuit breakers and bulkheads; they may be premature for your application. Be careful when introducing new dependencies.
  52. More examples in the Semian and Toxiproxy docs: http://github.com/shopify/semian, https://github.com/shopify/toxiproxy, http://www.shopify.com/technology/16906928-building-and-testing-resilient-ruby-on-rails-applications
  53. Thank you! Simon Eskildsen, Shopify, @Sirupsen

  54. Image credits: Server by Konstantin Velichko from the Noun Project; basket by Ben Rex Furneaux from the Noun Project; container by Creative Stall from the Noun Project; people by Wilson Joseph from the Noun Project; mesh network by Lance Weisser from the Noun Project; Conductor by Luis Prado from the Noun Project; Jar by Yazmin Alanix from the Noun Project; Broken Chain by Simon Martin from the Noun Project; Book by Ben Rex Furneaux from the Noun Project; network by Jessica Coccimiglio from the Noun Project; server by Creative Stall from the Noun Project; components by icons.design from the Noun Project; switch button by Marco Olgio from the Noun Project; Pile of leaves (autumn) by Aarthi Ramamurthy; Bridge by Toreham Sharman from the Noun Project; collaboration by Alex Kwa from the Noun Project; converge by Creative Stall from the Noun Project; change by Jorge Mateo from the Noun Project; person by Brian Dys Sahagun from the Noun Project; water faucet by Yaroslav Samoilov from the Noun Project; cash register by Gergely Korinek from the Noun Project; lungs by Joris Hoogendoorn; Hour Glass by Arthur Shalin from the Noun Project; Brooklyn Bridge at Night by Dennis Leung