
Zero-downtime payment platforms

When you're building a payment platform, you want to make sure that your system is always available to accept orders. However, the complexity of the platform means it can go down when any one of its moving parts fails. In this talk, I will show you the approaches that we've taken, and the risks that we have to take, to ensure that our platform is always available for our customers. Even if you're not building a payment platform, these approaches can be applied to ensure high availability for your platform or service as well.

Co-speaking with Ryan Twomey from SCVNGR at RailsConf 2013 on May 1, 2013.

Video is available at http://www.confreaks.com/videos/2481-railsconf2013-zero-downtime-payment-platforms

Prem Sichanugrist

May 01, 2013

Transcript

  1. RAILSCONF Promo code: 20% off first month of Prime, 20% off everything else on the store
     http://learn.thoughtbot.com
  2. • Mobile payments (Android, iOS, WP7) company from Boston
     • Show QR code on phone to cashier to create an order
     • Order #create to Rails 3.2 app
     • Eventually hits credit/debit card via payment gateway.
  3. Our Stack
     • Heroku* cedar
     • Postgres DB, two followers (one on west coast)
     * Heroku is on AWS.
  4. Risk
     class Risk
       def initialize(order)
         @amount = order.balance.to_f
       end

       def low?
         @amount < 100.0
       end
     end
  5. Timeout & Accept
     • Wrap a charge in a timeout
     • If it times out, evaluate risk
     • If low risk, save it and return success
     • Cron task to retry timed-out orders
  6. Timeout
     # app/models/customer_charger.rb
     def charge
       Timeout.timeout(TIMEOUT_IN_SECONDS) do
         charge_card_via_gateway
       end
     rescue Timeout::Error
       assess_risk_of_saving_order_without_charging_card
     end
  7. def assess_risk_of_saving_order_without_charging_card
       if Risk.new(@order).low?
         true
       else
         @card.errors.add :base, 'card failed!'
         false
       end
     end
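The pieces on slides 4 to 7 can be combined into one runnable sketch. The `Order` struct, the injected `gateway` callable, the `errors` array, and the shortened timeout are assumptions made so the example runs on its own; the talk's real `CustomerCharger` works against a card object and a payment gateway that are not shown.

```ruby
require "timeout"

# Hypothetical stand-in for the app's Order model; only `balance` is used.
Order = Struct.new(:balance)

class Risk
  def initialize(order)
    @amount = order.balance.to_f
  end

  def low?
    @amount < 100.0
  end
end

class CustomerCharger
  TIMEOUT_IN_SECONDS = 0.1 # assumed value, kept short so the demo runs quickly

  attr_reader :errors

  # `gateway` is any callable that charges the card and returns true or false.
  def initialize(order, gateway)
    @order = order
    @gateway = gateway
    @errors = []
  end

  def charge
    Timeout.timeout(TIMEOUT_IN_SECONDS) do
      @gateway.call(@order)
    end
  rescue Timeout::Error
    assess_risk_of_saving_order_without_charging_card
  end

  private

  # Accept low-risk orders even though the card was not charged.
  def assess_risk_of_saving_order_without_charging_card
    if Risk.new(@order).low?
      true
    else
      @errors << "card failed!"
      false
    end
  end
end

slow_gateway = ->(_order) { sleep 1; true } # simulates a gateway that hangs
puts CustomerCharger.new(Order.new(12.50), slow_gateway).charge
puts CustomerCharger.new(Order.new(250.00), slow_gateway).charge
```

With a gateway that hangs, the $12.50 order is accepted without being charged, while the $250.00 order is rejected with an error.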
  8. def assess_risk_of_saving_order_without_charging_card
       if Risk.new(@order).low?
         @order.gateway_id = "gateway-down-#{SecureRandom.hex(32)}"
         true
       else
         @card.errors.add :base, 'card failed!'
         false
       end
     end
  9. def reconcile
       # search gateway for similar-looking charge
       if gateway_id = SimilarOrderFinder.new(self).find
         # found one! update this order and don't re-charge
         update_attribute :gateway_id, gateway_id
       else
         charge
         save
       end
     end

     Order.reconcilable.find_each do |order|
       order.reconcile
     end
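The talk does not show how `SimilarOrderFinder` decides that a gateway charge is "similar-looking". One plausible heuristic, shown here purely as an assumption, is to match on card, amount, and a short time window. The `Charge` and `PendingOrder` structs, their fields, the window length, and the second constructor argument (a list standing in for a search of the gateway) are all hypothetical.

```ruby
# Hypothetical record shapes; the real gateway and order fields are not shown.
Charge = Struct.new(:id, :card_token, :amount_cents, :created_at)
PendingOrder = Struct.new(:card_token, :amount_cents, :created_at)

class SimilarOrderFinder
  WINDOW_IN_SECONDS = 120 # assumed tolerance between order and charge times

  # `recent_charges` stands in for a query against the gateway's API.
  def initialize(order, recent_charges)
    @order = order
    @recent_charges = recent_charges
  end

  # Returns the gateway ID of the first matching charge, or nil if none match.
  def find
    match = @recent_charges.find do |charge|
      charge.card_token == @order.card_token &&
        charge.amount_cents == @order.amount_cents &&
        (charge.created_at - @order.created_at).abs <= WINDOW_IN_SECONDS
    end
    match && match.id
  end
end
```

A heuristic like this cannot tell apart two identical purchases made on the same card within the window, which is one way a finder could match the wrong charge.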
  10. Cons
      • Not really: it worked well for quite a while.
      • Very rarely SimilarOrderFinder might mistakenly find the wrong order.
  11. I wanted a burrito.
      • There’s a Qdoba near my house, but I couldn’t remember its hours.
      • I pull up their site and...
  12. What if Heroku goes down for us? (or AWS, or anything else in our stack.)
  13. [Chart: number of failed orders per day, 10/19/12 to 10/25/12; y-axis from 0 to 2500]
  14. Same risk as before
      • If an order is accepted that can’t be charged, we’re still on the hook.
      • Our support team follows up with customers to keep lost $$ as low as possible.
  15. Chocolate:
      • Single POST endpoint to save an Order into the database.
      • Pulls out interesting things (amount, customer to charge, etc).
  16. If order looks real...
      • Calculate risk:
      • If low, saves everything: params, headers, etc. to DB.
      • Returns a response that looks identical to a production response.
  17. When we’re back up:
      • Order model on chocolate has a replay method.
      • Manual process run by support team to track results (and follow up if necessary).
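The capture-and-replay flow described on slides 15 to 17 could look roughly like the following. This is a minimal sketch under stated assumptions: `CapturedOrder`, the in-memory `STORE`, the `"amount"` parameter name, and the `production` callable are hypothetical, and the production-lookalike response is left out because the talk does not show its shape. The risk threshold reuses the one from the `Risk` class.

```ruby
# In-memory stand-in for chocolate's database table (an assumption; the talk
# only says the request is saved to a DB).
class CapturedOrder
  STORE = []

  attr_reader :params, :headers

  def initialize(params, headers)
    @params = params
    @headers = headers
    @replayed = false
  end

  # Saves the raw request if it is low risk; returns the record, or nil if
  # the order is too risky to accept without charging.
  def self.capture(params, headers)
    return nil unless params.fetch("amount").to_f < 100.0
    record = new(params, headers)
    STORE << record
    record
  end

  # Re-sends the saved request to production once it is back up.
  # `production` is any callable taking (params, headers).
  # Returns false if this record was already replayed.
  def replay(production)
    return false if @replayed
    production.call(@params, @headers)
    @replayed = true
  end
end
```

Saving the headers along with the params means a replay can carry the same request metadata as the original request.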
  18. De-duping
      • Could be a case where an order is in chocolate and in production.
      • Don’t want to double-charge the customer.
      • Need to de-dupe.
  19. De-duping
      • Akamai injects a unique request ID for every order we create.
      • Store this on each order in production and on replays in chocolate.
      • Chocolate sends this as part of a replay.
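De-duping on the request ID can be sketched as a guard that refuses to create a second order for an ID it has already seen. The `OrderCreator` class and its method names are hypothetical; a real app would more likely enforce this with a unique database index on the request ID column, and a `Set` is used here only to keep the example runnable.

```ruby
require "set"

# Hypothetical production-side guard keyed on the injected request ID.
class OrderCreator
  attr_reader :charges

  def initialize
    @seen_request_ids = Set.new
    @charges = []
  end

  # Returns :created for a new request ID and :duplicate for one seen before.
  def create(request_id, amount)
    # Set#add? returns nil when the element is already present.
    return :duplicate unless @seen_request_ids.add?(request_id)
    @charges << amount
    :created
  end
end

creator = OrderCreator.new
puts creator.create("req-1", 12.50) # original request
puts creator.create("req-1", 12.50) # replay of the same request
```

Because the replay carries the same request ID as the original, the second call is recognized as a duplicate and the customer is charged once.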
  20. Triggering
      • Akamai has a rule that if a POST to our order #create endpoint takes > 15 seconds, retry the exact same request on chocolate.
      • Sometimes production will actually succeed, but not a problem: chocolate de-dupes.
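The failover rule itself lives in Akamai's configuration, which the talk does not show. Its behavior can be simulated in plain Ruby to make the flow concrete; `post_order` and the `production` and `chocolate` callables are hypothetical names, and only the 15-second threshold comes from the slide.

```ruby
require "timeout"

FAILOVER_AFTER_SECONDS = 15 # threshold from the slide

# Simulates the edge rule: try production, and if it takes too long, send the
# exact same request (same request ID, same params) to the fallback.
# `production` and `chocolate` are callables taking (request_id, params).
def post_order(request_id, params, production:, chocolate:, limit: FAILOVER_AFTER_SECONDS)
  Timeout.timeout(limit) { production.call(request_id, params) }
rescue Timeout::Error
  chocolate.call(request_id, params)
end
```

One difference from the real setup: `Timeout.timeout` interrupts the slow call, whereas a production server may keep working after the edge has given up and still create the order. That is the case the request ID de-duping is there to handle.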
  21. Pros of using something like Akamai
      • Allows you to auto-replay to separate endpoints.
      • If done correctly, your site will never appear to be down.
  22. Even though our site is up... We would still see orders fail over to chocolate.
  23. [Chart: failovers per day, 2/1/13 to 4/2/13; y-axis from 0 to 600]
  24. Dynos get backed up
      • Every day, a handful of orders still end up failing over to chocolate.
  25. Solutions
      • Make all endpoints fast to free up dynos quickly.
      • Keep tuning unicorn and failover timeouts.
      • No guaranteed way to solve this.
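Why faster endpoints free up dynos can be illustrated with Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request takes. The traffic figures below are hypothetical and are not from the talk.

```ruby
# Little's law: average requests in flight = arrival rate * average duration.
# Each Unicorn worker handles one request at a time, so this is also the
# average number of busy workers.
def busy_workers(requests_per_second, mean_response_seconds)
  requests_per_second * mean_response_seconds
end

# Hypothetical traffic of 20 requests per second.
puts busy_workers(20, 0.25) # fast endpoints: 5.0 workers busy on average
puts busy_workers(20, 2.0)  # slow endpoints: 40.0 workers busy on average
```

This is only an average. Bursts and slow outliers can still occupy every worker and push a queued request past the failover timeout, which fits the slide's point that there is no guaranteed fix.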