Zero-downtime payment platforms

When you're building a payment platform, you want your system to always be available to accept orders. However, the platform's complexity means it can go down when any one of its moving parts fails. In this talk, I will show you the approaches we've taken, and the risks we've had to accept, to ensure that our platform is always available for our customers. Even if you're not building a payment platform, these approaches can help you achieve high availability for your own platform or service.

Co-speaking with Ryan Twomey from SCVNGR at RailsConf 2013 on May 1, 2013.

Video is available at http://www.confreaks.com/videos/2481-railsconf2013-zero-downtime-payment-platforms

Prem Sichanugrist

May 01, 2013

Transcript

  1. Zero-downtime payment platforms Prem Sichanugrist and Ryan Twomey

  2. Prem Sichanugrist

  3. None
  4. Ryan Twomey

  5. None
  6. Promo code: RAILSCONF • 20% off first month of Prime • 20% off everything else on the store • http://learn.thoughtbot.com
  7. Some Background

  8. • Mobile payments (Android, iOS, WP7) company from Boston • Show QR code on phone to cashier to create an order • Order #create to Rails 3.2 app • Eventually hits credit/debit card via payment gateway.
  9. Our Stack • Heroku* cedar • Postgres DB, two followers (one on west coast) * Heroku is on AWS.
  10. $1,000 per minute and growing fast

  11. Downtime sucks.

  12. Two different kinds of downtime: • Us • Them

  13. Them.

  14. External Services • External Database • Email Provider • Caching Provider • Payment Gateway
  15. External Services • External Database • Email Provider • Caching Provider • Payment Gateway
  16. What if our payment gateway goes down?

  17. New Order

  18. New Order Rejected!

  19. Customer turned away

  20. None
  21. Risk? http://www.flickr.com/photos/tambako/4598642399/

  22. None
  23. Manual Shutdown

  24. Everybody panic! http://www.flickr.com/photos/dumbledad/3225255407/

  25. “Failover Mode”

  26. Failover Mode • Accept low risk orders • Store them and charge customer later
  27. Risk Assertion

  28. Risk
      class Risk
        def initialize(order)
          @amount = order.balance.to_f
        end

        def low?
          @amount < 100.0
        end
      end
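A minimal usage sketch of the Risk class from the slide above. The `Order` struct is a hypothetical stand-in for the Rails model; the only thing assumed about it is a `balance` attribute.

```ruby
# Risk as shown on the slide: an order is low risk when its balance is under 100.
class Risk
  def initialize(order)
    @amount = order.balance.to_f
  end

  def low?
    @amount < 100.0
  end
end

# Hypothetical stand-in for the Rails Order model; only #balance is needed.
Order = Struct.new(:balance)

small = Risk.new(Order.new("12.50"))
large = Risk.new(Order.new(250))

puts small.low?
puts large.low?
```

Because the comparison is a strict `<`, an order of exactly 100 is not treated as low risk.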
  29. Pros • Customers can make purchase. • No lost orders.

  30. Cons • Requires a human all the time. • Humans do not stay up 24/7.
  31. https://secure.flickr.com/photos/spenceyc/7481166880/

  32. Automated Failover

  33. Timeout & Accept • Wrap a charge in a timeout • If it times out, evaluate risk • If low risk, save it and return success • Cron task to retry timed-out orders
  34. Timeout
      # app/models/customer_charger.rb
      def charge
        Timeout.timeout(TIMEOUT_IN_SECONDS) do
          charge_card_via_gateway
        end
      rescue Timeout::Error
        assess_risk_of_saving_order_without_charging_card
      end
  35. def assess_risk_of_saving_order_without_charging_card
        if Risk.new(@order).low?
          true
        else
          @card.errors.add :base, 'card failed!'
          false
        end
      end
  36. def assess_risk_of_saving_order_without_charging_card
        if Risk.new(@order).low?
          @order.gateway_id = "gateway-down-#{SecureRandom.hex(32)}"
          true
        else
          @card.errors.add :base, 'card failed!'
          false
        end
      end
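The timeout-and-accept flow from the last three slides can be run outside Rails with a few stand-ins. This is a sketch under assumptions: the `Order` struct, the injectable `gateway` callable, the `errors` array, and the shortened timeout are all hypothetical, added only so the example is self-contained.

```ruby
require "timeout"
require "securerandom"

# Hypothetical stand-in for the Rails Order model.
Order = Struct.new(:balance, :gateway_id)

class Risk
  def initialize(order)
    @amount = order.balance.to_f
  end

  def low?
    @amount < 100.0
  end
end

class CustomerCharger
  TIMEOUT_IN_SECONDS = 0.1 # assumption: shortened so the example runs quickly

  attr_reader :errors

  # gateway is any object responding to #call(order) and returning a gateway id.
  def initialize(order, gateway)
    @order = order
    @gateway = gateway
    @errors = []
  end

  def charge
    Timeout.timeout(TIMEOUT_IN_SECONDS) do
      @order.gateway_id = @gateway.call(@order)
      true
    end
  rescue Timeout::Error
    accept_without_charging_if_low_risk
  end

  private

  def accept_without_charging_if_low_risk
    if Risk.new(@order).low?
      # Mark the order so a later job can find and reconcile it.
      @order.gateway_id = "gateway-down-#{SecureRandom.hex(32)}"
      true
    else
      @errors << "card failed!"
      false
    end
  end
end

# Simulates a gateway that is too slow to answer within the timeout.
slow_gateway = ->(_order) { sleep 1; "never-reached" }

puts CustomerCharger.new(Order.new(20), slow_gateway).charge  # low risk: accepted
puts CustomerCharger.new(Order.new(500), slow_gateway).charge # high risk: rejected
```

Passing the gateway in as a callable keeps the sketch testable without a network; the slides call `charge_card_via_gateway` directly instead.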
  37. Cron task to retry
      Order.reconcilable.find_each do |order|
        order.reconcile
      end

  38. # app/models/order.rb
      def self.reconcilable
        where("gateway_id LIKE 'gateway-down%'")
      end

      Order.reconcilable.find_each do |order|
        order.reconcile
      end
  39. def reconcile
        # search gateway for similar-looking charge
        if gateway_id = SimilarOrderFinder.new(self).find
          # found one! update this order and don't re-charge
          update_attribute :gateway_id, gateway_id
        else
          charge
          save
        end
      end

      Order.reconcilable.find_each do |order|
        order.reconcile
      end
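The slides do not show how SimilarOrderFinder decides that a gateway charge is "similar-looking". One plausible sketch, purely as an assumption, matches on the same card, the same amount, and a creation time within a short window. The structs, field names, window size, and the extra `gateway_charges` argument are all hypothetical; the version on the slide takes only the order.

```ruby
# Hypothetical record of a charge as reported by the payment gateway.
GatewayCharge = Struct.new(:id, :card_token, :amount_cents, :created_at)

# Hypothetical stand-in for the Rails Order model.
Order = Struct.new(:card_token, :amount_cents, :created_at)

class SimilarOrderFinder
  WINDOW_IN_SECONDS = 120 # assumption: how far apart the timestamps may be

  def initialize(order, gateway_charges)
    @order = order
    @gateway_charges = gateway_charges
  end

  # Returns the gateway id of the first matching charge, or nil when none matches.
  def find
    match = @gateway_charges.find do |charge|
      charge.card_token == @order.card_token &&
        charge.amount_cents == @order.amount_cents &&
        (charge.created_at - @order.created_at).abs <= WINDOW_IN_SECONDS
    end
    match && match.id
  end
end

now = Time.at(1_000_000)
order = Order.new("card_1", 1250, now)
charges = [
  GatewayCharge.new("ch_old", "card_1", 1250, now - 3600), # same card and amount, too old
  GatewayCharge.new("ch_hit", "card_1", 1250, now + 30)    # inside the window
]

puts SimilarOrderFinder.new(order, charges).find
```

A heuristic like this also illustrates how a wrong match can happen: two purchases for the same amount on the same card within the window are indistinguishable to it.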
  40. Pros • No humans required. • Developers can get some sleep instead of pushing buttons.
  41. Cons • Not really: it worked well for quite a while. • Very rarely, SimilarOrderFinder might mistakenly find the wrong order.
  42. What about when we are down? (or anything critical in our stack.)
  43. We could go down • Application error • Heroku is failing • AWS went away
  44. Break time! http://www.flickr.com/photos/smason/3020514840/

  45. Let me tell you a story (involving burritos).

  46. On Oct 22 AWS went down.

  47. None
  48. I was still at work

  49. I wanted a burrito. • There’s a Qdoba near my house, but I couldn’t remember its hours. • I pull up their site and...
  50. Yup, Qdoba is on Heroku.

  51. Heroku is on us-east-1.

  52. Heroku is on us-east-1. Crap.

  53. I never got my burrito.

  54. Remember kids... • Always use a CDN to serve your static pages.
  55. Break time over.

  56. What if Heroku goes down for us? (or AWS, or anything else in our stack.)
  57. Number of failed orders [chart: failed orders per day, 10/19/12 to 10/25/12; y-axis from 0 to 2500]
  58. We’ve planned ahead

  59. “Chocolate” Request Replayer

  60. Dynamic failover service Powered by Akamai

  61. Internet

  62. Akamai Dynamic Router Internet

  63. Akamai Dynamic Router Internet Rails 3 Application (Heroku)

  64. Akamai Dynamic Router Internet Chocolate Rails 3 Application (Heroku)

  65. Akamai Dynamic Router Internet Chocolate Rails 3 Application (Heroku) Akamai CDN
  66. None
  67. Internet

  68. Akamai Dynamic Router Internet

  69. Akamai Dynamic Router Internet Rails Application (Heroku)

  70. Akamai Dynamic Router Internet Rails Application (Heroku)

  71. Akamai Dynamic Router Internet Rails Application (Heroku) Application Error

  72. Akamai Dynamic Router Internet Rails Application (Heroku) No response within 15s
  73. Akamai Dynamic Router Internet Chocolate Rails Application (Heroku)

  74. What is Chocolate?

  75. Separate Sinatra Application

  76. Perform Risk Assertion

  77. Store raw request in DB

  78. “Replay” request back to production

  79. VCR for the web!

  80. Completely separate... • Sinatra app • Deployed to a VM on another (non-AWS) cloud.
  81. So, if Heroku or AWS is down...

  82. our customers never even notice.

  83. Same risk as before • If an order is accepted that can’t be charged, we’re still on the hook. • Our support team follows up with customers to keep lost $$ as low as possible.
  84. How it Works

  85. Chocolate: • Single POST endpoint to save an Order into the database. • Pulls out interesting things (amount, customer to charge, etc).
  86. If order looks real... • Calculate risk: • If low, saves everything: params, headers, etc. to DB. • Returns a response that looks identical to a production response.
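The store-and-respond step described on the last two slides might look roughly like the following. This is a framework-free sketch, not the Sinatra app from the talk: the class name, the `amount` and `customer_id` parameter names, the status codes, the response shape, and the in-memory array standing in for the database are all assumptions.

```ruby
require "json"
require "securerandom"

# Hypothetical row holding the raw request plus the "interesting things".
StoredRequest = Struct.new(:params, :headers, :amount, :customer_id)

class FailoverEndpoint
  LOW_RISK_LIMIT = 100.0 # same threshold as the Risk class shown earlier

  attr_reader :stored

  def initialize
    @stored = [] # in-memory stand-in for the database table
  end

  # Returns [status, json_body] for a POSTed order.
  def create_order(params, headers)
    amount = params["amount"].to_f
    customer_id = params["customer_id"].to_s

    # Assumption: an order "looks real" when it has a positive amount and a customer.
    unless amount > 0 && !customer_id.empty?
      return [422, JSON.generate("error" => "invalid order")]
    end

    # Only low-risk orders are accepted without charging the card.
    unless amount < LOW_RISK_LIMIT
      return [402, JSON.generate("error" => "card failed!")]
    end

    # Save everything needed to replay the request later.
    @stored << StoredRequest.new(params, headers, amount, customer_id)
    [201, JSON.generate("order" => { "id" => SecureRandom.uuid, "amount" => amount })]
  end
end

endpoint = FailoverEndpoint.new
status, body = endpoint.create_order(
  { "amount" => "12.50", "customer_id" => "c1" },
  { "X-Request-Id" => "req-1" }
)
puts status
puts body
```

In a real deployment the response body would have to match production's format exactly, as the slide notes; the JSON here is only a placeholder.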
  87. Replaying Orders

  88. When we’re back up: • Order model on chocolate has a replay method. • Manual process run by support team to track results (and follow up if necessary).
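A replay method of the kind mentioned above could rebuild the original POST from the stored params and headers and send it to production. The sketch below uses Ruby's standard Net::HTTP; the class name `StoredOrder` and the production URL are hypothetical, and the talk does not show the actual implementation.

```ruby
require "net/http"
require "uri"

class StoredOrder
  # Hypothetical production endpoint; the real URL is not given in the talk.
  PRODUCTION_URL = URI("https://production.example.com/orders")

  def initialize(params, headers)
    @params = params
    @headers = headers
  end

  # Rebuilds the original POST from the stored params and headers.
  def build_request
    request = Net::HTTP::Post.new(PRODUCTION_URL)
    @headers.each { |name, value| request[name] = value }
    request.set_form_data(@params)
    request
  end

  # Sends the rebuilt request to production and returns the response.
  def replay
    Net::HTTP.start(PRODUCTION_URL.hostname, PRODUCTION_URL.port, use_ssl: true) do |http|
      http.request(build_request)
    end
  end
end

stored = StoredOrder.new(
  { "amount" => "12.50", "customer_id" => "c1" },
  { "X-Request-Id" => "req-1" }
)
request = stored.build_request
puts request.method
puts request.path
puts request.body
```

Separating `build_request` from `replay` lets a support tool inspect or log what will be sent before anything reaches production.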
  89. De-duping • Could be a case where an order is in chocolate and in production. • Don’t want to double-charge the customer. • Need to de-dupe.
  90. De-duping • Akamai injects a unique request ID for every order we create. • Store this on each order in production and on replays in chocolate. • Chocolate sends this as part of a replay.
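De-duping on the injected request ID amounts to making order creation idempotent per ID. A minimal sketch, assuming a hash keyed by request ID stands in for a uniquely indexed column on the production orders table (the class and field names are hypothetical):

```ruby
class OrderCreator
  attr_reader :orders

  def initialize
    @orders = {} # request_id => order; stand-in for a unique-indexed column
  end

  # Returns [order, created]. created is false when the request ID was already
  # seen, in which case the existing order is returned and nothing is charged.
  def create(request_id, params)
    existing = @orders[request_id]
    return [existing, false] if existing

    order = params.merge("request_id" => request_id)
    @orders[request_id] = order
    [order, true]
  end
end

creator = OrderCreator.new

# Production handled the original request...
_, created_first = creator.create("req-1", { "amount" => "12.50" })
# ...and the same request is later replayed with the same ID.
_, created_again = creator.create("req-1", { "amount" => "12.50" })

puts created_first
puts created_again
puts creator.orders.size
```

With a real database, a unique index on the request ID column would be the usual way to keep this check safe under concurrent requests.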
  91. When to Fail Over

  92. Triggering • Akamai has a rule that if a POST to our order #create endpoint takes > 15 seconds, retry the exact same request on chocolate. • Sometimes production will actually succeed, but not a problem: chocolate de-dupes.
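The triggering rule lives in Akamai's configuration, which the slides do not show. To make the behaviour concrete, here is a plain-Ruby model of the same idea: try the primary, and if it does not answer in time, send the identical request to the fallback. The `FailoverRouter` class and the callables are hypothetical, and the timeout is shortened so the example runs quickly.

```ruby
require "timeout"

class FailoverRouter
  def initialize(primary:, fallback:, timeout_in_seconds: 15)
    @primary = primary
    @fallback = fallback
    @timeout_in_seconds = timeout_in_seconds
  end

  # Sends the request to the primary; on timeout, retries it on the fallback.
  def post(request)
    Timeout.timeout(@timeout_in_seconds) { @primary.call(request) }
  rescue Timeout::Error
    @fallback.call(request)
  end
end

# Simulated backends: production is too slow, chocolate answers immediately.
production = ->(request) { sleep 0.5; "production handled #{request[:request_id]}" }
chocolate  = ->(request) { "chocolate stored #{request[:request_id]}" }

router = FailoverRouter.new(primary: production, fallback: chocolate, timeout_in_seconds: 0.05)
puts router.post({ request_id: "req-1" })
```

One difference from the real setup: `Timeout.timeout` interrupts the primary call here, whereas a slow production request may still complete after the router has given up, which is why the de-duping step is needed.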
  93. Pros of using something like Akamai • Allows you to auto-replay to separate endpoints. • If done correctly, your site will never appear to be down.
  94. Cons • Adds a fairly significant layer of complexity. • Adds non-trivial costs.
  95. Even though our site is up... We would still see orders fail over to chocolate.
  96. Failovers per day [chart: daily failover counts, 2/1/13 to 4/2/13; y-axis from 0 to 600]
  97. No services were down.

  98. What could be causing this?

  99. Have you ever heard of random routing?

  100. Have you ever heard of random routing?

  101. Dynos get backed up • Every day, a handful of orders still end up failing over to chocolate.
  102. Dynos Heroku Router 1 2 3 4 5

  103. Dynos Heroku Router 1 2 3 4 5

  104. Dynos Heroku Router 1 2 3 4 5

  105. Dynos Heroku Router 1 2 3 4 5

  106. Dynos Heroku Router 1 2 3 4 5

  107. Dynos Heroku Router 1 2 3 4 5

  108. Dynos Heroku Router 1 2 3 4 5

  109. Dynos Heroku Router 1 2 3 4 5

  110. Dynos Heroku Router 1 2 3 4 5

  111. Dynos Heroku Router 1 2 3 4 5

  112. Dynos Heroku Router 1 2 3 4 5

  113. Dynos Heroku Router 1 2 3 4 5 t ≥ 15 sec
  114. Dynos Heroku Router 1 2 3 4 5 Timeout

  115. Solutions • Make all endpoints fast to free up dynos quickly. • Keep tuning unicorn and failover timeouts. • No guaranteed way to solve this.
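Why random routing backs dynos up can be seen in a toy model. The sketch below hands a burst of requests to five dynos, either at random or evenly, and compares how long the busiest dyno needs to drain its queue. It is a deliberate simplification: the burst size, the service time, the seed, and the one-request-at-a-time dyno are all assumptions, and real arrival patterns and unicorn worker counts are ignored.

```ruby
# Assigns each request in a burst to a dyno and returns the per-dyno counts.
# With an rng, dynos are picked at random; without one, requests are spread evenly.
def route(request_count, dyno_count, rng: nil)
  counts = Array.new(dyno_count, 0)
  request_count.times do |i|
    dyno = rng ? rng.rand(dyno_count) : i % dyno_count
    counts[dyno] += 1
  end
  counts
end

SERVICE_TIME = 2.0 # assumption: seconds each request occupies a dyno

random = route(20, 5, rng: Random.new(42))
even   = route(20, 5)

# The last request queued on a dyno finishes only after everything ahead of it.
puts "random routing: #{random.inspect}, slowest finish #{random.max * SERVICE_TIME}s"
puts "even routing:   #{even.inspect}, slowest finish #{even.max * SERVICE_TIME}s"
```

An even spread gives every dyno the same queue length, so random assignment can never do better on the busiest dyno and usually does worse; in this model, shortening the service time shrinks the slowest finish proportionally.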
  116. We’re Still Investigating... • We’ve been obsessively tuning unicorn worker counts, backlog total, etc.
  117. Things to Remember

  118. 1. Your site will go down.

  119. 2. Use a replayer for critical web requests.

  120. 3. Accept some risk to keep customers happy.

  121. 4. Keep your endpoints lean and fast.

  122. 5. Use a CDN.

  123. Thank you. Prem Sichanugrist prem@thoughtbot.com @sikachu Ryan Twomey ryant@thelevelup.com @rtwomey