Zero-downtime payment platforms

When you're building a payment platform, you want your system to always be available to accept orders. However, the platform's complexity means it can go down when any one of its moving parts fails. In this talk, I will show you the approaches we've taken, and the risks we've had to accept, to keep our platform available for our customers. Even if you're not building a payment platform, you can apply these approaches to make your own platform or service highly available.

Co-speaking with Ryan Twomey from SCVNGR at RailsConf 2013 on May 1, 2013.

Video is available at http://www.confreaks.com/videos/2481-railsconf2013-zero-downtime-payment-platforms

Prem Sichanugrist

May 01, 2013
Transcript

  1. Zero-downtime
    payment platforms
    Prem Sichanugrist and Ryan Twomey

    View Slide

  2. Prem Sichanugrist

    View Slide

  3. View Slide

  4. Ryan Twomey

    View Slide

  5. View Slide

  6. RAILSCONF
    Promo code:
    20% off first month of Prime
    20% off everything else on the store
    http://learn.thoughtbot.com

    View Slide

  7. Some Background

    View Slide

  8. • Mobile payments (Android, iOS, WP7)
    company from Boston
    • Show QR code on phone to cashier to
    create an order
    • Order #create to Rails 3.2 app
    • Eventually hits credit/debit card via
    payment gateway.

    View Slide

  9. Our Stack
    • Heroku* cedar
    • Postgres DB, two followers
    (one on west coast)
    * Heroku is on AWS.

    View Slide

  10. $1,000 per minute
    and growing fast

    View Slide

  11. Downtime sucks.

    View Slide

  12. Two different kinds of
    downtime:
    • Us
    • Them

    View Slide

  13. Them.

    View Slide

  14. External Services
    • External Database
    • Email Provider
    • Caching Provider
    • Payment Gateway

    View Slide

  16. What if our payment
    gateway goes down?

    View Slide

  17. New Order

    View Slide

  18. New Order
    Rejected!

    View Slide

  19. Customer turned away

    View Slide

  20. View Slide

  21. Risk?
    http://www.flickr.com/photos/tambako/4598642399/

    View Slide

  22. View Slide

  23. Manual Shutdown

    View Slide

  24. Everybody panic!
    http://www.flickr.com/photos/dumbledad/3225255407/

    View Slide

  25. “Failover Mode”

    View Slide

  26. Failover Mode
    • Accept low risk orders
    • Store them and charge customer later

    View Slide

  27. Risk Assertion

    View Slide

  28. Risk
    class Risk
      def initialize(order)
        @amount = order.balance.to_f
      end

      def low?
        @amount < 100.0
      end
    end

    View Slide
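
The manual failover mode described above can be sketched roughly as follows. This is a minimal illustration, not the speakers' code: `OrderIntake`, its `enable_failover!` switch, the in-memory `deferred` list, and the `Order` struct are all hypothetical names. Only the `Risk` check and its 100.0 threshold come from the slide.

```ruby
# Hypothetical sketch: a human flips a switch, and low-risk orders are
# stored for later charging instead of being sent to the gateway.
Order = Struct.new(:id, :balance)

class Risk
  LOW_RISK_LIMIT = 100.0 # threshold taken from the slide

  def initialize(order)
    @amount = order.balance.to_f
  end

  def low?
    @amount < LOW_RISK_LIMIT
  end
end

class OrderIntake
  attr_reader :deferred

  def initialize(gateway)
    @gateway = gateway # any object responding to #charge(order)
    @failover = false  # toggled manually by an operator
    @deferred = []     # stand-in for orders saved to charge later
  end

  def enable_failover!
    @failover = true
  end

  def disable_failover!
    @failover = false
  end

  # Returns :charged, :deferred, or :rejected.
  def accept(order)
    unless @failover
      @gateway.charge(order)
      return :charged
    end

    if Risk.new(order).low?
      @deferred << order
      :deferred
    else
      :rejected
    end
  end
end
```

With the switch off, every order goes to the gateway. With it on, only orders under the risk threshold are accepted, and each of them lands in `deferred` for charging once the gateway is back.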

  29. Pros
    • Customers can make purchases.
    • No lost orders.

    View Slide

  30. Cons
    • Requires a human all the time.
    • Humans do not stay up 24/7.

    View Slide

  31. https://secure.flickr.com/photos/spenceyc/7481166880/

    View Slide

  32. Automated Failover

    View Slide

  33. Timeout & Accept
    • Wrap a charge in a timeout
    • If it times out, evaluate risk
    • If low risk, save it and return success
    • Cron task to retry timed-out orders

    View Slide

  34. Timeout
    # app/models/customer_charger.rb
    def charge
      Timeout.timeout(TIMEOUT_IN_SECONDS) do
        charge_card_via_gateway
      end
    rescue Timeout::Error
      assess_risk_of_saving_order_without_charging_card
    end

    View Slide

  35. def assess_risk_of_saving_order_without_charging_card
      if Risk.new(@order).low?
        true
      else
        @card.errors.add :base, 'card failed!'
        false
      end
    end

    View Slide

  36. def assess_risk_of_saving_order_without_charging_card
      if Risk.new(@order).low?
        @order.gateway_id = "gateway-down-#{SecureRandom.hex(32)}"
        true
      else
        @card.errors.add :base, 'card failed!'
        false
      end
    end

    View Slide

  37. Cron task to retry
    Order.reconcilable.find_each do |order|
      order.reconcile
    end

    View Slide

  38. # app/models/order.rb
    def self.reconcilable
      where("gateway_id LIKE 'gateway-down%'")
    end
    Order.reconcilable.find_each do |order|
      order.reconcile
    end

    View Slide

  39. def reconcile
      # search gateway for similar-looking charge
      if (gateway_id = SimilarOrderFinder.new(self).find)
        # found one! update this order and don't re-charge
        update_attribute :gateway_id, gateway_id
      else
        charge
        save
      end
    end
    Order.reconcilable.find_each do |order|
      order.reconcile
    end

    View Slide
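
The slides do not show `SimilarOrderFinder`, so here is one way such a finder could look. Everything in it is an assumption made for illustration: the matching criteria (amount, last four card digits, a time window), the 120-second window, the two structs, and passing the gateway's recent charges into the constructor instead of fetching them.

```ruby
# Hypothetical sketch of a finder that looks for a gateway charge
# resembling an order that was accepted while the gateway timed out.
GatewayCharge = Struct.new(:id, :amount_cents, :card_last4, :created_at)
PendingOrder  = Struct.new(:amount_cents, :card_last4, :created_at)

class SimilarOrderFinder
  WINDOW_IN_SECONDS = 120 # assumed matching window

  # gateway_charges: recent charges already fetched from the gateway
  def initialize(order, gateway_charges)
    @order = order
    @gateway_charges = gateway_charges
  end

  # Returns the gateway charge id when exactly one charge matches,
  # otherwise nil.
  def find
    matches = @gateway_charges.select { |charge| similar?(charge) }
    matches.one? ? matches.first.id : nil
  end

  private

  def similar?(charge)
    charge.amount_cents == @order.amount_cents &&
      charge.card_last4 == @order.card_last4 &&
      (charge.created_at - @order.created_at).abs <= WINDOW_IN_SECONDS
  end
end
```

Returning `nil` on an ambiguous match is a design choice of this sketch. In a `reconcile` flow that charges whenever nothing is found, an ambiguous case may be better routed to manual review than charged again.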

  40. Pros
    • No humans required.
    • Developers can get some sleep instead of
    pushing buttons.

    View Slide

  41. Cons
    • Not really: it worked well for quite a while.
    • Very rarely SimilarOrderFinder might
    mistakenly find the wrong order.

    View Slide

  42. What about when
    we are down?
    (or anything critical in our stack.)

    View Slide

  43. We could go down
    • Application error
    • Heroku is failing
    • AWS went away

    View Slide

  44. Break time!
    http://www.flickr.com/photos/smason/3020514840/

    View Slide

  45. Let me tell you a story
    (involving burritos).

    View Slide

  46. On Oct 22 AWS went
    down.

    View Slide

  47. View Slide

  48. I was still at work

    View Slide

  49. I wanted a burrito.
    • There’s a Qdoba near my house, but I
    couldn’t remember its hours.
    • I pull up their site and...

    View Slide

  50. Yup, Qdoba is on Heroku.

    View Slide

  51. Heroku is on us-east-1.

    View Slide

  52. Heroku is on us-east-1.
    Crap.

    View Slide

  53. I never got my burrito.

    View Slide

  54. Remember kids...
    • Always use a CDN to serve your static
    pages.

    View Slide

  55. Break time over.

    View Slide

  56. What if Heroku goes
    down for us?
    (or AWS, or anything else in our stack.)

    View Slide

  57. Number of failed orders (bar chart by day, 10/19/12 to 10/25/12; axis from 0 to 2,500)

    View Slide

  58. We’ve planned ahead

    View Slide

  59. “Chocolate”
    Request Replayer

    View Slide

  60. Dynamic failover service
    Powered by Akamai

    View Slide

  61. Internet

    View Slide

  62. Akamai Dynamic Router
    Internet

    View Slide

  63. Akamai Dynamic Router
    Internet
    Rails 3 Application
    (Heroku)

    View Slide

  64. Akamai Dynamic Router
    Internet
    Chocolate
    Rails 3 Application
    (Heroku)

    View Slide

  65. Akamai Dynamic Router
    Internet
    Chocolate
    Rails 3 Application
    (Heroku)
    Akamai CDN

    View Slide

  66. View Slide

  67. Internet

    View Slide

  68. Akamai Dynamic Router
    Internet

    View Slide

  69. Akamai Dynamic Router
    Internet
    Rails Application
    (Heroku)

    View Slide

  71. Akamai Dynamic Router
    Internet
    Rails Application
    (Heroku)
    Application Error

    View Slide

  72. Akamai Dynamic Router
    Internet
    Rails Application
    (Heroku)
    No response within 15s

    View Slide

  73. Akamai Dynamic Router
    Internet
    Chocolate
    Rails Application
    (Heroku)

    View Slide

  74. What is Chocolate?

    View Slide

  75. Separate Sinatra
    Application

    View Slide

  76. Perform Risk Assertion

    View Slide

  77. Store raw request in DB

    View Slide

  78. “Replay” request
    back to production

    View Slide

  79. VCR for the web!

    View Slide

  80. Completely separate...
    • Sinatra app
    • Deployed to a VM on another (non-AWS)
    cloud.

    View Slide

  81. So, if Heroku or AWS is
    down...

    View Slide

  82. our customers
    never even notice.

    View Slide

  83. Same risk as before
    • If an order is accepted that can’t be
    charged, we’re still on the hook.
    • Our support team follows up with
    customers to keep lost $$ as low as
    possible.

    View Slide

  84. How it Works

    View Slide

  85. Chocolate:
    • Single POST endpoint to save an Order
    into the database.
    • Pulls out interesting things (amount,
    customer to charge, etc).

    View Slide

  86. If order looks real...
    • Calculate risk:
    • If low, saves everything: params, headers,
    etc. to DB.
    • Returns a response that looks identical to a
    production response.

    View Slide
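
The endpoint behaviour described above (check that the order looks real, assess risk, store the raw request, return a production-like response) could be sketched like this. It is framework-free so that it runs on its own. In the Sinatra app the speakers describe, the same logic would sit behind a single POST route. The class name, the field names `customer_id` and `amount`, the status codes, and the response shape are all assumptions.

```ruby
require "json"
require "securerandom"

# Hypothetical sketch of a fallback endpoint that stores raw order
# requests for later replay instead of processing them.
class FallbackOrderEndpoint
  LOW_RISK_LIMIT = 100.0 # same threshold as the Risk slide

  attr_reader :stored

  def initialize
    @stored = [] # stand-in for a database table of raw requests
  end

  # headers: Hash of request headers, body: raw JSON String.
  # Returns [status, response_body].
  def create(headers, body)
    params = parse(body)
    return [422, JSON.generate({ "error" => "invalid order" })] unless looks_real?(params)
    return [402, JSON.generate({ "error" => "card failed!" })] unless low_risk?(params)

    # Keep everything needed to replay the request verbatim later.
    @stored << { headers: headers, body: body, params: params }
    [201, JSON.generate({ "id" => "fallback-#{SecureRandom.hex(8)}",
                          "amount" => params["amount"] })]
  end

  private

  def parse(body)
    parsed = JSON.parse(body)
    parsed.is_a?(Hash) ? parsed : {}
  rescue JSON::ParserError
    {}
  end

  def looks_real?(params)
    !params["customer_id"].to_s.empty? && params["amount"].to_f > 0
  end

  def low_risk?(params)
    params["amount"].to_f < LOW_RISK_LIMIT
  end
end
```

Storing the untouched headers and body alongside the parsed fields is what makes a later replay possible without reconstructing the request.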

  87. Replaying Orders

    View Slide

  88. When we’re back up:
    • Order model on chocolate has a replay
    method.
    • Manual process run by support team to
    track results (and follow up if necessary).

    View Slide
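
A replay step along the lines described above might look like the following sketch. The `StoredRequest` struct, the `Replayer` class, and the injected `transport` callable (standing in for the HTTP POST back to production) are hypothetical. The slides only say that the order model has a replay method and that support tracks the results.

```ruby
# Hypothetical sketch: replay stored requests and report which ones
# still need a human follow-up.
StoredRequest = Struct.new(:headers, :body, :replayed_at, :last_status)

class Replayer
  # transport: callable taking (headers, body) and returning an
  # Integer HTTP status from production.
  def initialize(transport)
    @transport = transport
  end

  # Replays every request that has not yet succeeded and returns the
  # ones that failed again, for the support team to follow up on.
  def replay_pending(stored_requests)
    pending = stored_requests.reject(&:replayed_at)
    pending.each { |request| replay(request) }
    pending.reject(&:replayed_at)
  end

  private

  def replay(request)
    request.last_status = @transport.call(request.headers, request.body)
    request.replayed_at = Time.now if (200..299).cover?(request.last_status)
  end
end
```

Because only successful replays are stamped with `replayed_at`, running `replay_pending` again retries just the failures.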

  89. De-duping
    • Could be a case where an order is in
    chocolate and in production.
    • Don’t want to double-charge the customer.
    • Need to de-dupe.

    View Slide

  90. De-duping
    • Akamai injects a unique request ID for
    every order we create.
    • Store this on each order in production and
    on replays in chocolate.
    • Chocolate sends this as part of a replay.

    View Slide
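
On the production side, de-duping by request ID could be as simple as the sketch below. The `OrderCreator` class and the in-memory hash are assumptions. In a real application the hash would more likely be a unique database index on the stored request ID.

```ruby
# Hypothetical sketch: refuse to create a second order for a request
# ID that production has already processed.
class OrderCreator
  def initialize
    @orders_by_request_id = {} # stand-in for a unique index
  end

  # Returns [order, created]. created is false when the request ID was
  # already seen, in which case the existing order is returned and the
  # customer is not charged again.
  def create(request_id, params)
    existing = @orders_by_request_id[request_id]
    return [existing, false] if existing

    order = { request_id: request_id, amount: params[:amount] }
    @orders_by_request_id[request_id] = order
    [order, true]
  end
end
```

A replay that carries the same request ID as an order production already completed gets the original order back instead of creating a duplicate.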

  91. When to Fail Over

    View Slide

  92. Triggering
    • Akamai has a rule that if a POST to our
    order #create endpoint takes > 15
    seconds, retry the exact same request on
    chocolate.
    • Sometimes production will actually
    succeed, but not a problem: chocolate
    de-dupes.

    View Slide
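
The routing rule itself lives in Akamai's configuration, which the slides do not show. To illustrate the behaviour only, here is a Ruby stand-in: try the primary, and if it takes longer than the timeout (or answers with a server error, as in the earlier "Application Error" slide), send the same request to the fallback. `FailoverRouter` and the callable interface are hypothetical.

```ruby
require "timeout"

# Hypothetical sketch of the failover rule: same request, second
# destination, triggered by a timeout or a 5xx response.
class FailoverRouter
  # primary / fallback: callables taking a request and returning
  # [status, body].
  def initialize(primary:, fallback:, timeout_in_seconds: 15)
    @primary = primary
    @fallback = fallback
    @timeout_in_seconds = timeout_in_seconds
  end

  def route(request)
    status, body = Timeout.timeout(@timeout_in_seconds) { @primary.call(request) }
    return [status, body] if status < 500

    @fallback.call(request)
  rescue Timeout::Error
    # The primary may still finish the request in the background,
    # which is why the replayed order has to be de-duped later.
    @fallback.call(request)
  end
end
```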

  93. Pros of using something
    like Akamai
    • Allows you to auto-replay to separate
    endpoints.
    • If done correctly, your site will never appear
    to be down.

    View Slide

  94. Cons
    • Adds a fairly significant layer of complexity.
    • Adds non-trivial costs.

    View Slide

  95. Even though our site is
    up...
    We would still see orders fail over to chocolate.

    View Slide

  96. Failovers per day (chart of daily counts, 2/1/13 to 4/2/13; axis from 0 to 600)

    View Slide

  97. No services were
    down.

    View Slide

  98. What could be causing
    this?

    View Slide

  99. Have you ever heard of
    random routing?

    View Slide

  101. Dynos get backed up
    • Every day, a handful of orders still end up
    failing over to chocolate.

    View Slide

  102. Dynos
    Heroku
    Router
    1 2 3 4 5

    View Slide

  113. Dynos
    Heroku
    Router
    1 2 3 4 5
    t ≥ 15 sec

    View Slide

  114. Dynos
    Heroku
    Router
    1 2 3 4 5
    Timeout

    View Slide
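
The effect shown in the diagrams above can be reproduced with a toy simulation. This is a deliberately simplified model and not a description of Heroku's actual router: each dyno handles one request at a time, requests queue behind whatever the chosen dyno is already doing, and the workload numbers are invented. The 15-second limit matches the failover timeout from the slides.

```ruby
# Hypothetical simulation: how request latency depends on the way a
# router picks dynos.
Request = Struct.new(:arrival, :duration)

# Returns the latency (queue wait plus processing time) of each request.
# chooser: callable that receives the array of times at which each dyno
# becomes free and returns the index of the dyno to use.
# requests must be ordered by arrival time.
def simulate(requests, dyno_count, chooser)
  free_at = Array.new(dyno_count, 0.0)
  requests.map do |request|
    dyno = chooser.call(free_at)
    start = [request.arrival, free_at[dyno]].max
    free_at[dyno] = start + request.duration
    free_at[dyno] - request.arrival
  end
end

# Counts requests that would have hit the failover timeout.
def timed_out(latencies, limit = 15.0)
  latencies.count { |latency| latency >= limit }
end

rng = Random.new(42)
random_routing = ->(free_at) { rng.rand(free_at.size) }
least_busy     = ->(free_at) { free_at.each_with_index.min.last }

# Invented workload: two requests per second, 5% of them slow.
workload = Array.new(200) do |i|
  Request.new(i * 0.5, rng.rand < 0.05 ? 10.0 : 0.2)
end

puts "random routing timeouts: #{timed_out(simulate(workload, 5, random_routing))}"
puts "least-busy timeouts:     #{timed_out(simulate(workload, 5, least_busy))}"
```

Under random routing a fast request can be placed behind one or more slow ones even while other dynos sit idle, which is how an order can cross the 15-second limit with no service actually down.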

  115. Solutions
    • Make all endpoints fast to free up dynos
    quickly.
    • Keep tuning unicorn and failover timeouts.
    • No guaranteed way to solve this.

    View Slide

  116. We’re Still Investigating...
    • We’ve been obsessively tuning unicorn
    worker counts, backlog total, etc.

    View Slide
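
For reference, the knobs mentioned above map onto Unicorn configuration directives such as the ones below. The values are placeholders chosen for illustration, not the speakers' settings, and the `WEB_CONCURRENCY` variable name is an assumption.

```ruby
# config/unicorn.rb (illustrative values only)

# Number of worker processes per dyno.
worker_processes Integer(ENV.fetch("WEB_CONCURRENCY", 3))

# Kill a worker that spends longer than this many seconds on a request.
timeout 15

# Load the app before forking workers.
preload_app true

# A small backlog makes a saturated dyno refuse new connections
# quickly instead of queueing them.
listen ENV.fetch("PORT", 3000), backlog: 16
```

The trade-off is between a backlog long enough to absorb short bursts and one short enough that requests do not wait behind a stuck worker until the failover timeout fires.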

  117. Things to Remember

    View Slide

  118. 1. Your site will go
    down.

    View Slide

  119. 2. Use a replayer for
    critical web requests.

    View Slide

  120. 3. Accept some risk to
    keep customers happy.

    View Slide

  121. 4. Keep your endpoints
    lean and fast.

    View Slide

  122. 5. Use a CDN.

    View Slide

  123. Thank you.
    Prem Sichanugrist
    @sikachu
    Ryan Twomey
    @rtwomey

    View Slide