Zero-downtime payment platforms

Revised version of the talk given at RailsConf 2013 (https://speakerdeck.com/sikachu/zero-downtime-payment-platforms)

Presented at RailsPacific 2014 on September 27, 2014.

Video is available at https://www.youtube.com/watch?v=N8sYlKheRrk

Prem Sichanugrist

September 27, 2014

Transcript

  1. Zero-downtime
    payment platforms

  2. Prem Sichanugrist
    @sikachu
    /sikachu

  3. https://upcase.com
    Promo code: RAILSPACIFIC
    50% off first month
    50% off everything else
    (expires in 1 month)

  4. Some Background

  5. • Mobile payments (Android, iOS, WP7)
    company from Boston
    • Show QR code on phone to cashier to
    create an order
    • Order #create to Rails 4.1 app
    • Eventually hits credit/debit card via
    payment gateway.

  6. Our Stack
    • Heroku* cedar
    • Postgres DB, two followers
    (one on west coast)
    * Heroku is on AWS.

  7. Downtime sucks.

  8. Two different kinds of
    downtime:
    • Us
    • Them

  9. Them.

  10. External Services
    • External Database
    • Email Provider
    • Caching Provider
    • Payment Gateway

  12. What if our payment
    gateway goes down?

  13. New Order

  14. New Order
    Rejected!

  15. Customer turned away

  17. Risk?
    https://flic.kr/p/81nfaV

  19. Manual Shutdown

  20. Everybody panic!
    https://flic.kr/p/5V1h4R

  21. “Failover Mode”

  22. Failover Mode
    • Accept low risk orders
    • Store them and charge customer later

  23. Risk Assessment

  24. Risk
    class Risk
      def initialize(order)
        @amount = order.balance.to_f
      end

      def low?
        @amount < 100.0
      end
    end

  25. Pros
    • Customers can make purchase.
    • No lost orders.

  26. Cons
    • Requires a human all the time.
    • Humans do not stay up 24/7.

  27. https://flic.kr/p/cp5WgS

  28. Automated Failover

  29. Timeout & Accept
    • Wrap a charge in a timeout
    • If it times out, evaluate risk
    • If low risk, save it and return success
    • Cron task to retry timed-out orders

  30. Timeout
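    # app/models/customer_charger.rb
    require 'timeout'
    def charge
      # Give the gateway a bounded amount of time to respond.
      Timeout.timeout(TIMEOUT_IN_SECONDS) do
        charge_card_via_gateway
      end
    rescue Timeout::Error
      # The gateway did not answer in time: decide whether to accept anyway.
      assess_risk_of_saving_order_without_charging_card
    end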

  31. def assess_risk_of_saving_order_without_charging_card
        if Risk.new(@order).low?
          true
        else
          @card.errors.add :base, 'card failed!'
          false
        end
      end

  32. def assess_risk_of_saving_order_without_charging_card
        if Risk.new(@order).low?
          # Mark the order so the retry task can find it later.
          @order.gateway_id =
            "gateway-down-#{SecureRandom.hex(32)}"
          true
        else
          @card.errors.add :base, 'card failed!'
          false
        end
      end
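
The fragments on slides 24 and 30 to 32 can be pulled together into one runnable sketch. Everything outside those slides is an assumption made for illustration: the `Order` struct, the `gateway` callable passed to the constructor, the `errors` array standing in for `@card.errors`, and the very short timeout that keeps the demo quick.

```ruby
require "timeout"
require "securerandom"

# Stand-in for the order record; only the fields the sketch needs (assumed).
Order = Struct.new(:balance, :gateway_id)

class Risk
  def initialize(order)
    @amount = order.balance.to_f
  end

  def low?
    @amount < 100.0
  end
end

class CustomerCharger
  TIMEOUT_IN_SECONDS = 0.05 # tiny value so the demo finishes quickly

  attr_reader :errors

  # `gateway` is any callable that charges the card and returns a gateway ID.
  def initialize(order, gateway)
    @order = order
    @gateway = gateway
    @errors = []
  end

  def charge
    Timeout.timeout(TIMEOUT_IN_SECONDS) do
      @order.gateway_id = @gateway.call(@order)
    end
    true
  rescue Timeout::Error
    assess_risk_of_saving_order_without_charging_card
  end

  private

  def assess_risk_of_saving_order_without_charging_card
    if Risk.new(@order).low?
      # Marker that the retry task can later find and reconcile.
      @order.gateway_id = "gateway-down-#{SecureRandom.hex(32)}"
      true
    else
      @errors << "card failed!"
      false
    end
  end
end

# Simulates a gateway that does not answer in time.
hung_gateway = ->(_order) { sleep 1 }

small = Order.new(20.0)
large = Order.new(500.0)

puts CustomerCharger.new(small, hung_gateway).charge
puts small.gateway_id.start_with?("gateway-down-")
puts CustomerCharger.new(large, hung_gateway).charge
```

With the gateway hung, the 20.0 order is accepted and tagged for later reconciliation, while the 500.0 order is rejected.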

  33. Cron task to retry
    Order.reconcilable.find_each do |order|
      order.reconcile
    end

  34. # app/models/order.rb
      def self.reconcilable
        where("gateway_id LIKE 'gateway-down%'")
      end

  35. def reconcile
        # Search the gateway for a similar-looking charge.
        if (found_gateway_id = SimilarOrderFinder.new(self).find)
          # Found one! Update this order and don't re-charge.
          update_attribute :gateway_id, found_gateway_id
        else
          charge
          save
        end
      end
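
The talk does not show `SimilarOrderFinder`, so the following is only a guess at what "similar-looking" might mean: same customer, same amount, and a charge time close to the order time. The `Charge` and `PendingOrder` structs, the 120-second window, and the second constructor argument (a list of recent gateway charges) are all hypothetical.

```ruby
# Hypothetical shapes: the talk does not show the order or gateway records.
Charge = Struct.new(:id, :customer_id, :amount_cents, :created_at)
PendingOrder = Struct.new(:customer_id, :amount_cents, :created_at)

class SimilarOrderFinder
  WINDOW_IN_SECONDS = 120 # assumed tolerance between order time and charge time

  def initialize(order, gateway_charges)
    @order = order
    @gateway_charges = gateway_charges
  end

  # Returns the gateway ID of the first matching charge, or nil when none matches.
  def find
    match = @gateway_charges.find do |charge|
      charge.customer_id == @order.customer_id &&
        charge.amount_cents == @order.amount_cents &&
        (charge.created_at - @order.created_at).abs <= WINDOW_IN_SECONDS
    end
    match && match.id
  end
end

now = Time.at(1_400_000_000)
charges = [
  Charge.new("ch_1", 7, 1250, now + 30),
  Charge.new("ch_2", 8, 1250, now + 30)
]

p SimilarOrderFinder.new(PendingOrder.new(7, 1250, now), charges).find
p SimilarOrderFinder.new(PendingOrder.new(9, 1250, now), charges).find
```

A heuristic like this also shows how a wrong match can happen: two genuine purchases by the same customer for the same amount inside the window are indistinguishable.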

  36. Pros
    • No humans required.
    • Developers can get some sleep instead
    of pushing buttons.

  37. Cons
    • Not really: it worked well for quite a while.
    • Very rarely SimilarOrderFinder might
    mistakenly find the wrong order.

  38. What about when
    we are down?
    (or anything critical in our stack.)

  39. We could go down
    • Application error
    • Heroku is failing
    • AWS went away

  40. Story time!
    https://flic.kr/p/5AUVQ5

  41. Let me tell you a story

  42. On Oct 22, 2012
    AWS went down.

  44. Heroku is on us-east-1.

  45. Heroku is on us-east-1.
    Crap.

  46. [Bar chart: number of failed orders per day, 10/19/12 to 10/25/12; count axis from 0 to 2,500]

  47. We’ve planned ahead

  48. “Chocolate”
    Request Replayer

  49. Dynamic failover service
    Powered by Akamai

  50-54. [Diagram, built up over five slides: Internet, Akamai Dynamic Router, Rails 4 application (Heroku), Chocolate, Akamai CDN]

  56-62. [Diagram, built up over seven slides: Internet, Akamai Dynamic Router, Rails application (Heroku). After the Rails application is shown with "Application Error" and then "No response within 15s", Chocolate is added to the diagram.]

  63. What is Chocolate?
    https://flic.kr/p/dfCAWM

  64. Separate Sinatra
    Application

  65. Perform Risk
    Assessment

  66. Store raw request in DB

  67. “Replay” request
    back to production

  68. VCR for the web!

  69. Completely separate...
    • Sinatra app
    • Deployed to a VM on another (non AWS)
    cloud.

  70. So, if Heroku or AWS
    is down...

  71. our customers
    never even notice.

  72. Same risk as before
    • If an order is accepted that can’t be
    charged, we’re still on the hook.
    • Our support team follows up with
    customers to keep lost $$ as low as
    possible.

  73. How it Works

  74. Chocolate:
    • Single POST endpoint to save an Order
    into the database.
    • Pulls out interesting things (amount,
    customer to charge, etc).

  75. If order looks real...
    • Calculate risk:
    • If low, saves everything: params,
    headers, etc. to DB.
    • Returns a response that looks identical to
    a production response.
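
The behaviour described on slides 74 and 75 might look roughly like the sketch below. It is plain Ruby rather than the Sinatra app itself, and the class name, the `amount` and `customer_id` parameter names, the status codes, and the JSON bodies are all assumptions; the 100.0 threshold is borrowed from the `Risk` class on slide 24.

```ruby
require "json"

class FallbackOrderReceiver
  LOW_RISK_LIMIT = 100.0 # same threshold as the Risk class on slide 24

  attr_reader :stored_requests

  def initialize
    @stored_requests = [] # stands in for the database table
  end

  # Returns [status, headers, body], in the style of a Rack response.
  def call(params, headers)
    raw_amount = params["amount"].to_s
    looks_real = raw_amount.match?(/\A\d+(\.\d+)?\z/) && !params["customer_id"].nil?
    return reject("invalid order") unless looks_real
    return reject("order too risky") unless raw_amount.to_f < LOW_RISK_LIMIT

    # Keep the raw request so it can be replayed against production later.
    @stored_requests << { params: params, headers: headers, replayed: false }
    [201, { "Content-Type" => "application/json" }, JSON.generate({ "status" => "created" })]
  end

  private

  def reject(message)
    [422, { "Content-Type" => "application/json" }, JSON.generate({ "error" => message })]
  end
end

receiver = FallbackOrderReceiver.new
status, = receiver.call({ "amount" => "12.50", "customer_id" => "42" },
                        { "X-Request-Id" => "abc123" })
puts status
puts receiver.stored_requests.size
```

In a real deployment the success body would have to mirror the production response exactly, since the client should not be able to tell which application answered.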

  76. Replaying Orders

  77. When we’re back up:
    • Order model on chocolate has a replay
    method.
    • Manual process run by support team to
    track results (and follow up if necessary).

  78. De-duping
    • Could be a case where an order is in
    chocolate and in production.
    • Don’t want to double-charge the
    customer.
    • Need to de-dupe.

  79. De-duping
    • Akamai injects a unique request ID for
    every order we create.
    • Store this on each order in production
    and on replays in chocolate.
    • Chocolate sends this as part of a replay.
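
A minimal sketch of de-duping on the injected request ID, assuming production keeps a unique record of every request ID it has already turned into an order. The `Production` and `StoredRequest` names and the `:created` / `:duplicate` return values are made up for the example; a real application would more likely rely on a unique index in the database.

```ruby
require "set"

# Stand-in for production: remembers the request ID stored on each order.
class Production
  attr_reader :orders

  def initialize
    @seen_request_ids = Set.new
    @orders = []
  end

  # Returns :created for a new order, :duplicate when the ID was already handled.
  def create_order(request_id, params)
    # Set#add? returns nil when the element is already present.
    return :duplicate unless @seen_request_ids.add?(request_id)

    @orders << params
    :created
  end
end

# Stand-in for a request saved on chocolate.
StoredRequest = Struct.new(:request_id, :params) do
  def replay(production)
    production.create_order(request_id, params)
  end
end

production = Production.new
# Production was slow but eventually succeeded on its own for req-1.
production.create_order("req-1", { "amount" => "12.50" })

p StoredRequest.new("req-1", { "amount" => "12.50" }).replay(production)
p StoredRequest.new("req-2", { "amount" => "8.00" }).replay(production)
p production.orders.size
```

The replay of `req-1` is recognised as a duplicate, so the customer is charged once even though the order exists in both systems.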

  80. When to Fail Over

  81. Triggering
    • Akamai has a rule that if a POST to our
    order #create endpoint takes > 15
    seconds, retry the exact same request on
    chocolate.
    • Sometimes production will actually
    succeed, but not a problem: chocolate
    de-dupes.
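
The rule itself lives in Akamai configuration, which the talk does not show. As a rough model of its logic only, the same idea can be written in Ruby: give the primary a deadline and, if it is missed, send the identical request to the fallback. The `route` method and its `primary`, `fallback`, and `deadline` parameters are invented for this sketch.

```ruby
require "timeout"

FAILOVER_AFTER_SECONDS = 15

# `primary` and `fallback` are callables that take a request and return a response.
def route(request, primary:, fallback:, deadline: FAILOVER_AFTER_SECONDS)
  Timeout.timeout(deadline) { primary.call(request) }
rescue Timeout::Error
  # Unlike this sketch, a real router cannot cancel work already running on
  # production, so the primary may still complete the order. The request ID
  # travels with the retry so the duplicate can be detected later.
  fallback.call(request)
end

request = { "request_id" => "req-1", "amount" => "12.50" }
slow_production = ->(_req) { sleep 1; :handled_by_production }
chocolate = ->(_req) { :handled_by_chocolate }

# A short deadline keeps the demo quick; the talk uses 15 seconds.
p route(request, primary: slow_production, fallback: chocolate, deadline: 0.05)
```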

  82. Pros of using
    something like Akamai
    • Allows you to auto-replay to separate
    endpoints.
    • If done correctly, your site will never
    appear to be down.

  83. Cons
    • Adds a fairly significant layer of
    complexity.
    • Adds non-trivial costs.

  84. Even though our site is up...
    We would still see orders fail over to chocolate.

  85. [Chart: failovers per day, 2/1/13 to 4/2/13; count axis from 0 to 600]

  86. No services were down.

  87. What could be
    causing this?

  88. Have you ever heard
    of random routing?

  89. Dynos get backed up
    • Every day, a handful of orders still end up
    failing over to chocolate.

  90-102. [Diagram, animated over 13 slides: Heroku Router in front of dynos 1 to 5. The last two frames are labelled "t ≥ 15 sec" and "Timeout".]
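
How a request can wait past the 15-second failover limit while other dynos sit idle can be shown with a small deterministic calculation. This is a simplification under stated assumptions: each dyno serves one request at a time (Unicorn actually runs several workers per dyno), the router picks a dyno without regard to load, dynos are numbered from 0 here, and the arrival and service times are invented.

```ruby
# `requests` is a list of [arrival_time, dyno, service_time] tuples in arrival
# order. Returns how long each request waits before its dyno starts on it.
def queue_waits(requests, dyno_count)
  free_at = Array.new(dyno_count, 0.0) # time at which each dyno becomes idle
  requests.map do |arrival, dyno, service|
    start = [arrival, free_at[dyno]].max
    free_at[dyno] = start + service
    start - arrival
  end
end

requests = [
  [0.0, 0, 20.0], # a slow request lands on dyno 0
  [1.0, 1, 0.2],
  [2.0, 0, 0.2],  # routed to dyno 0 again, although dynos 1 to 4 are idle
  [3.0, 2, 0.2]
]

waits = queue_waits(requests, 5)
p waits # => [0.0, 0.0, 18.0, 0.0]
p waits.count { |wait| wait >= 15 } # => 1
```

The third request would normally finish in 0.2 seconds, yet it waits 18 seconds behind the slow one and fails over even though nothing is down.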

  103. Solutions
    • Make all endpoints fast to free up dynos quickly.
    • Keep tuning unicorn and failover timeouts.
    • No guaranteed way to solve this.

  104. We’re Still
    Investigating...
    • We’ve been obsessively tuning unicorn
    worker counts, backlog total, etc.
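
For reference, these are the knobs in a Unicorn configuration file that this kind of tuning usually touches. The values below are placeholders, not the speaker's settings, and the `WEB_CONCURRENCY` and `PORT` environment variable names are assumptions about the deployment.

```ruby
# config/unicorn.rb (illustrative values only)

# Number of worker processes per dyno; more workers mean fewer queued requests,
# at the cost of memory.
worker_processes Integer(ENV.fetch("WEB_CONCURRENCY", 3))

# Kill a worker that spends longer than this many seconds on one request.
timeout 15

# Load the application before forking workers.
preload_app true

# A short backlog makes a busy dyno refuse connections quickly instead of
# letting requests pile up behind a slow one.
listen Integer(ENV.fetch("PORT", 3000)), backlog: 16
```

The worker timeout and the 15-second failover rule interact, which may be part of why the slides describe tuning both together.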

  105. Things to Remember

  106. 1. Your site will go down.

  107. 2. Use a replayer for
    critical web requests.

  108. 3. Accept some risk
    to keep customers
    happy.

  109. 4. Keep your endpoints
    lean and fast.

  110. Prem Sichanugrist
    @sikachu
    /sikachu
    Thank you
