
Zero-downtime payment platforms

Revised version of a talk first given at RailsConf 2013 (https://speakerdeck.com/sikachu/zero-downtime-payment-platforms).

Presented at RailsPacific 2014 on September 27, 2014.

Video is available at https://www.youtube.com/watch?v=N8sYlKheRrk

Prem Sichanugrist

September 27, 2014

Transcript

  1. Zero-downtime payment platforms

  2. Prem Sichanugrist @sikachu /sikachu

  3. Promo code RAILSPACIFIC for https://upcase.com: 50% off first month, 50% off

    everything else (expires in 1 month)
  4. Some Background

  5. • Mobile payments (Android, iOS, WP7) company from Boston •

    Show QR code on phone to cashier to create an order • Order #create to Rails 4.1 app • Eventually hits credit/debit card via payment gateway.
  6. Our Stack • Heroku* cedar • Postgres DB, two followers

    (one on west coast) * Heroku is on AWS.
  7. Downtime sucks.

  8. Two different kinds of downtime: • Us • Them

  9. Them.

  10. • External Database • Email Provider • Caching Provider •

    Payment Gateway External Services
  11. • External Database • Email Provider • Caching Provider •

    Payment Gateway External Services
  12. What if our payment gateway goes down?

  13. New Order

  14. New Order Rejected!

  15. Customer turned away

  16. None
  17. Risk? https://flic.kr/p/81nfaV

  18. None
  19. Manual Shutdown

  20. Everybody panic! https://flic.kr/p/5V1h4R

  21. “Failover Mode”

  22. Failover Mode • Accept low risk orders • Store them

    and charge customer later
  23. Risk Assertion

  24. Risk

      class Risk
        def initialize(order)
          @amount = order.balance.to_f
        end

        def low?
          @amount < 100.0
        end
      end
  25. Pros • Customers can make purchase. • No lost orders.

  26. Cons • Requires a human all the time. • Humans

    do not stay up 24/7.
  27. https://flic.kr/p/cp5WgS

  28. Automated Failover

  29. Timeout & Accept • Wrap a charge in a timeout

    • If it times out, evaluate risk • If low risk, save it and return success • Cron task to retry timed-out orders
  30. Timeout

      # app/models/customer_charger.rb
      def charge
        Timeout.timeout(TIMEOUT_IN_SECONDS) do
          charge_card_via_gateway
        end
      rescue Timeout::Error
        assess_risk_of_saving_order_without_charging_card
      end
  31. def assess_risk_of_saving_order_without_charging_card
        if Risk.new(@order).low?
          true
        else
          @card.errors.add :base, 'card failed!'
          false
        end
      end
  32. def assess_risk_of_saving_order_without_charging_card
        if Risk.new(@order).low?
          @order.gateway_id = "gateway-down-#{SecureRandom.hex(32)}"
          true
        else
          @card.errors.add :base, 'card failed!'
          false
        end
      end
  33. Cron task to retry

      Order.reconcilable.find_each do |order|
        order.reconcile
      end
  34. # app/models/order.rb
      def self.reconcilable
        where("gateway_id LIKE 'gateway-down%'")
      end

      Order.reconcilable.find_each do |order|
        order.reconcile
      end
  35. def reconcile
        # search gateway for similar-looking charge
        if gateway_id = SimilarOrderFinder.new(self).find
          # found one! update this order and don't re-charge
          update_attribute :gateway_id, gateway_id
        else
          charge
          save
        end
      end

      Order.reconcilable.find_each do |order|
        order.reconcile
      end
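SimilarOrderFinder itself isn't shown in the deck. As a rough idea of what a finder like this might do, here is a hypothetical sketch that looks for a recent gateway charge matching the order; GATEWAY stands in for whatever payment-gateway client the app uses, and the matching rules are guesses, not the speaker's actual code.

      # Hypothetical sketch -- SimilarOrderFinder is not shown in the talk.
      class SimilarOrderFinder
        def initialize(order)
          @order = order
        end

        # Returns the gateway transaction id of a charge that looks like this
        # order (same amount, same customer, placed around the same time), or
        # nil if no plausible match exists.
        def find
          match = GATEWAY.recent_transactions.detect do |txn|
            txn.amount == @order.balance &&
              txn.customer_id == @order.customer_id &&
              (txn.created_at - @order.created_at).abs <= 300
          end
          match && match.id
        end
      end

A fuzzy match like this is also why, as slide 37 notes, it can very occasionally pick the wrong order.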
  36. Pros • No humans required. • Developers can get some

    sleep instead of pushing buttons.
  37. Cons • Not really: it worked well for quite a

    while. • Very rarely SimilarOrderFinder might mistakenly find the wrong order.
  38. What about when we are down? (or anything critical in

    our stack.)
  39. We could go down • Application error • Heroku is

    failing • AWS went away
  40. Story time! https://flic.kr/p/5AUVQ5

  41. Let me tell you a story

  42. On Oct 22, 2012 AWS went down.

  43. None
  44. Heroku is on us-east-1.

  45. Heroku is on us-east-1. Crap.

  46. [Chart: number of failed orders per day, 10/19/12 through 10/25/12]
  47. We’ve planned ahead

  48. “Chocolate” Request Replayer

  49. Dynamic failover service Powered by Akamai

  50. Internet

  51. Akamai Dynamic Router Internet

  52. Akamai Dynamic Router Internet Rails 4 Application (Heroku)

  53. Akamai Dynamic Router Internet Chocolate Rails 4 Application (Heroku)

  54. Akamai Dynamic Router Internet Chocolate Rails 4 Application (Heroku) Akamai

    CDN
  55. None
  56. Internet

  57. Akamai Dynamic Router Internet

  58. Rails Application (Heroku) Akamai Dynamic Router Internet

  59. Rails Application (Heroku) Akamai Dynamic Router Internet

  60. Rails Application (Heroku) Akamai Dynamic Router Internet Application Error

  61. Rails Application (Heroku) Akamai Dynamic Router Internet No response within

    15s
  62. Chocolate Rails Application (Heroku) Akamai Dynamic Router Internet

  63. What is Chocolate? https://flic.kr/p/dfCAWM

  64. Separate Sinatra Application

  65. Perform Risk Assertion

  66. Store raw request in DB

  67. “Replay” request back to production

  68. VCR for the web!

  69. Completely separate... • Sinatra app • Deployed to a VM

    on another (non AWS) cloud.
  70. So, if Heroku or AWS is down...

  71. our customers never even notice.

  72. Same risk as before • If an order is accepted

    that can’t be charged, we’re still on the hook. • Our support team follows up with customers to keep lost $$ as low as possible.
  73. How it Works

  74. Chocolate: • Single POST endpoint to save an Order into

    the database. • Pulls out interesting things (amount, customer to charge, etc).
  75. If order looks real... • Calculate risk: • If low,

    saves everything: params, headers, etc. to DB. • Returns a response that looks identical to a production response.
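Chocolate's code isn't shown in the deck, but the description above maps onto a very small Sinatra app. A minimal sketch, assuming a stored_requests table, the same hard-coded $100 threshold as the Risk class earlier, and made-up endpoint, column, and header names:

      # Hypothetical sketch of a Chocolate-style failover endpoint; not the real app.
      require 'sinatra'
      require 'sequel'
      require 'json'

      DB = Sequel.connect(ENV.fetch('DATABASE_URL'))

      post '/orders' do
        payload = JSON.parse(request.body.read)

        # Reject anything that doesn't look low risk (threshold is assumed).
        halt 402, { error: 'card failed!' }.to_json if payload['amount'].to_f >= 100.0

        # Store the raw request so it can be replayed against production later.
        DB[:stored_requests].insert(
          params:     payload.to_json,
          headers:    request.env.select { |k, _| k.start_with?('HTTP_') }.to_json,
          request_id: request.env['HTTP_X_REQUEST_ID'],   # assumed header name
          created_at: Time.now
        )

        # Answer with something shaped like a successful production response.
        status 201
        { status: 'accepted' }.to_json
      end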
  76. Replaying Orders

  77. When we’re back up: • Order model on chocolate has

    a replay method. • Manual process run by support team to track results (and follow up if necessary).
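The replay method isn't shown either. One plausible shape for it, assuming chocolate's Order model wraps the stored_requests table from the sketch above and that production's create endpoint is reachable at a known URL (all names here are placeholders):

      # Hypothetical sketch of replaying a stored order back to production.
      require 'sequel'
      require 'net/http'
      require 'uri'

      DB = Sequel.connect(ENV.fetch('DATABASE_URL'))

      class Order < Sequel::Model(:stored_requests)
        PRODUCTION_ORDERS_URL = URI('https://production.example.com/orders') # placeholder

        # POST the captured params back to the real #create endpoint. The
        # request id header lets production de-dupe (slides 78-79).
        def replay
          req = Net::HTTP::Post.new(PRODUCTION_ORDERS_URL, 'Content-Type' => 'application/json')
          req['X-Request-Id'] = request_id   # assumed header name
          req.body = params

          res = Net::HTTP.start(PRODUCTION_ORDERS_URL.host, PRODUCTION_ORDERS_URL.port, use_ssl: true) do |http|
            http.request(req)
          end
          res.is_a?(Net::HTTPSuccess)
        end
      end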
  78. De-duping • Could be a case where an order is

    in chocolate and in production. • Don’t want to double-charge the customer. • Need to de-dupe.
  79. De-duping • Akamai injects a unique request ID for every

    order we create. • Store this on each order in production and on replays in chocolate. • Chocolate sends this as part of a replay.
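On the production side, de-duping can be as simple as refusing to create a second order for a request ID it has already seen. A sketch of that check, with an assumed akamai_request_id column and header name:

      # Hypothetical sketch -- column and header names are assumptions.
      # app/controllers/orders_controller.rb
      class OrdersController < ApplicationController
        def create
          request_id = request.headers['X-Request-Id']

          if request_id.present? && Order.exists?(akamai_request_id: request_id)
            # Production already processed this request (e.g. it succeeded slowly
            # and chocolate replayed it anyway), so don't charge the card twice.
            head :ok
          else
            order = Order.new(order_params.merge(akamai_request_id: request_id))
            # ... charge the card and save the order as in slide 30 ...
            head :created
          end
        end

        private

        # Strong parameters; attribute list is illustrative.
        def order_params
          params.require(:order).permit(:amount, :customer_id)
        end
      end

A unique database index on the request ID column would make the same guarantee hold under concurrent replays.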
  80. When to Fail Over

  81. Triggering • Akamai has a rule that if a POST

    to our order #create endpoint takes > 15 seconds, retry the exact same request on chocolate. • Sometimes production will actually succeed, but not a problem: chocolate de-dupes.
  82. Pros of using something like Akamai • Allows you to

    auto-replay to separate endpoints. • If done correctly, your site will never appear to be down.
  83. Cons • Adds a fairly significant layer of complexity. •

    Adds non-trivial costs.
  84. Even though our site is up... we would still see

    orders fail over to chocolate.
  85. 0" 100" 200" 300" 400" 500" 600" 2/1/13" 2/2/13" 2/3/13"

    2/4/13" 2/5/13" 2/6/13" 2/7/13" 2/8/13" 2/9/13" 2/10/13" 2/11/13" 2/12/13" 2/13/13" 2/14/13" 2/15/13" 2/16/13" 2/17/13" 2/18/13" 2/19/13" 2/20/13" 2/21/13" 2/22/13" 2/23/13" 2/24/13" 2/25/13" 2/26/13" 2/27/13" 2/28/13" 3/1/13" 3/2/13" 3/3/13" 3/4/13" 3/5/13" 3/6/13" 3/7/13" 3/8/13" 3/9/13" 3/10/13" 3/11/13" 3/12/13" 3/13/13" 3/14/13" 3/15/13" 3/16/13" 3/17/13" 3/18/13" 3/19/13" 3/20/13" 3/21/13" 3/22/13" 3/23/13" 3/24/13" 3/25/13" 3/26/13" 3/27/13" 3/28/13" 3/29/13" 3/30/13" 3/31/13" 4/1/13" 4/2/13" Failovers*per*day*
  86. No services were down.

  87. What could be causing this?

  88. Have you ever heard of random routing?

  89. Dynos get backed up • Every day, a handful of

    orders still end up failing over to chocolate.
  90. Dynos Heroku Router 1 2 3 4 5

  91. Dynos Heroku Router 1 2 3 4 5

  92. Dynos Heroku Router 1 2 3 4 5

  93. Dynos Heroku Router 1 2 3 4 5

  94. Dynos Heroku Router 1 2 3 4 5

  95. Dynos Heroku Router 1 2 3 4 5

  96. Dynos Heroku Router 1 2 3 4 5

  97. Dynos Heroku Router 1 2 3 4 5

  98. Dynos Heroku Router 1 2 3 4 5

  99. Dynos Heroku Router 1 2 3 4 5

  100. Dynos Heroku Router 1 2 3 4 5

  101. t ≥ 15 sec Dynos Heroku Router 1 2 3

    4 5
  102. Dynos Heroku Router 1 2 3 4 5 Timeout

  103. Solutions • Make all endpoints fast to free up dynos

    quickly. • Keep tuning unicorn and failover timeouts. • No guaranteed way to solve this.
  104. We’re Still Investigating... • We’ve been obsessively tuning unicorn worker

    counts, backlog total, etc.
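For context, the Unicorn knobs being tuned look roughly like this; the numbers below are illustrative, not the values from the talk:

      # config/unicorn.rb -- illustrative values only
      worker_processes 4      # more workers means fewer requests stuck behind a slow one
      timeout 15              # kill any worker that runs past the failover window

      # A small listen backlog makes a busy dyno refuse connections quickly
      # instead of silently queueing them behind in-flight requests.
      listen ENV.fetch('PORT', '3000').to_i, backlog: 16

      preload_app true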
  105. Things to Remember

  106. 1. Your site will go down.

  107. 2. Use a replayer for critical web requests.

  108. 3. Accept some risk to keep customers happy.

  109. 4. Keep your endpoints lean and fast.

  110. Prem Sichanugrist @sikachu /sikachu Thank you