
Zero-downtime payment platforms

Revised version of a talk first given at RailsConf 2013 (https://speakerdeck.com/sikachu/zero-downtime-payment-platforms).

Presented at RailsPacific 2014 on September 27, 2014.

Video is available at https://www.youtube.com/watch?v=N8sYlKheRrk

Prem Sichanugrist

September 27, 2014

Transcript

  1. Zero-downtime payment platforms

  2. Prem Sichanugrist @sikachu /sikachu

  3. Promo code RAILSPACIFIC for https://upcase.com: 50% off first month, 50% off

    everything else (expires in 1 month)
  4. Some Background

  5. • Mobile payments (Android, iOS, WP7) company from Boston •

    Show QR code on phone to cashier to create an order • Order #create to Rails 4.1 app • Eventually hits credit/debit card via payment gateway.
  6. Our Stack • Heroku* cedar • Postgres DB, two followers

    (one on west coast) * Heroku is on AWS.
  7. Downtime sucks.

  8. Two different kinds of downtime: • Us • Them

  9. Them.

  10. • External Database • Email Provider • Caching Provider •

    Payment Gateway External Services
  11. • External Database • Email Provider • Caching Provider •

    Payment Gateway External Services
  12. What if our payment gateway goes down?

  13. New Order

  14. New Order Rejected!

  15. Customer turned away

  16. None
  17. Risk? https://flic.kr/p/81nfaV

  18. None
  19. Manual Shutdown

  20. Everybody panic! https://flic.kr/p/5V1h4R

  21. “Failover Mode”

  22. Failover Mode • Accept low risk orders • Store them

    and charge customer later
  23. Risk Assertion

  24. Risk

      class Risk
        def initialize(order)
          @amount = order.balance.to_f
        end

        def low?
          @amount < 100.0
        end
      end
  25. Pros • Customers can make purchase. • No lost orders.

  26. Cons • Requires a human all the time. • Humans

    do not stay up 24/7.
  27. https://flic.kr/p/cp5WgS

  28. Automated Failover

  29. Timeout & Accept • Wrap a charge in a timeout

    • If it times out, evaluate risk • If low risk, save it and return success • Cron task to retry timed-out orders
  30. Timeout

      # app/models/customer_charger.rb
      def charge
        Timeout.timeout(TIMEOUT_IN_SECONDS) do
          charge_card_via_gateway
        end
      rescue Timeout::Error
        assess_risk_of_saving_order_without_charging_card
      end
  31. def assess_risk_of_saving_order_without_charging_card
        if Risk.new(@order).low?
          true
        else
          @card.errors.add :base, 'card failed!'
          false
        end
      end
  32. def assess_risk_of_saving_order_without_charging_card
        if Risk.new(@order).low?
          @order.gateway_id = "gateway-down-#{SecureRandom.hex(32)}"
          true
        else
          @card.errors.add :base, 'card failed!'
          false
        end
      end
  33. Cron task to retry

      Order.reconcilable.find_each do |order|
        order.reconcile
      end
  34. # app/models/order.rb
      def self.reconcilable
        where("gateway_id LIKE 'gateway-down%'")
      end

      Order.reconcilable.find_each do |order|
        order.reconcile
      end
  35. def reconcile
        # search gateway for similar-looking charge
        if gateway_id = SimilarOrderFinder.new(self).find
          # found one! update this order and don't re-charge
          update_attribute :gateway_id, gateway_id
        else
          charge
          save
        end
      end

      Order.reconcilable.find_each do |order|
        order.reconcile
      end
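SimilarOrderFinder itself isn't shown in the deck. As a rough idea of what a finder like this might do, here is a hypothetical sketch that looks for a recent gateway charge matching the order; GATEWAY stands in for whatever payment-gateway client the app uses, and the matching rules are guesses, not the speaker's actual code.

      # Hypothetical sketch -- SimilarOrderFinder is not shown in the talk.
      class SimilarOrderFinder
        def initialize(order)
          @order = order
        end

        # Returns the gateway transaction id of a charge that looks like this
        # order (same amount, same customer, placed around the same time), or
        # nil if no plausible match exists.
        def find
          match = GATEWAY.recent_transactions.detect do |txn|
            txn.amount == @order.balance &&
              txn.customer_id == @order.customer_id &&
              (txn.created_at - @order.created_at).abs <= 300
          end
          match && match.id
        end
      end

A fuzzy match like this is also why, as slide 37 notes, it can very occasionally pick the wrong order.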
  36. Pros • No humans required. • Developers can get some

    sleep instead of pushing buttons.
  37. Cons • Not really: it worked well for quite a

    while. • Very rarely SimilarOrderFinder might mistakenly find the wrong order.
  38. What about when we are down? (or anything critical in

    our stack.)
  39. We could go down • Application error • Heroku is

    failing • AWS went away
  40. Story time! https://flic.kr/p/5AUVQ5

  41. Let me tell you a story

  42. On Oct 22, 2012 AWS went down.

  43. None
  44. Heroku is on us-east-1.

  45. Heroku is on us-east-1. Crap.

  46. [Chart: number of failed orders per day, 10/19/12 through 10/25/12]
  47. We’ve planned ahead

  48. “Chocolate” Request Replayer

  49. Dynamic failover service Powered by Akamai

  50. Internet

  51. Akamai Dynamic Router Internet

  52. Akamai Dynamic Router Internet Rails 4 Application (Heroku)

  53. Akamai Dynamic Router Internet Chocolate Rails 4 Application (Heroku)

  54. Akamai Dynamic Router Internet Chocolate Rails 4 Application (Heroku) Akamai

    CDN
  55. None
  56. Internet

  57. Akamai Dynamic Router Internet

  58. Rails Application (Heroku) Akamai Dynamic Router Internet

  59. Rails Application (Heroku) Akamai Dynamic Router Internet

  60. Rails Application (Heroku) Akamai Dynamic Router Internet Application Error

  61. Rails Application (Heroku) Akamai Dynamic Router Internet No response within

    15s
  62. Chocolate Rails Application (Heroku) Akamai Dynamic Router Internet

  63. What is Chocolate? https://flic.kr/p/dfCAWM

  64. Separate Sinatra Application

  65. Perform Risk Assertion

  66. Store raw request in DB

  67. “Replay” request back to production

  68. VCR for the web!

  69. Completely separate... • Sinatra app • Deployed to a VM

    on another (non AWS) cloud.
  70. So, if Heroku or AWS is down...

  71. our customers never even notice.

  72. Same risk as before • If an order is accepted

    that can’t be charged, we’re still on the hook. • Our support team follows up with customers to keep lost $$ as low as possible.
  73. How it Works

  74. Chocolate: • Single POST endpoint to save an Order into

    the database. • Pulls out interesting things (amount, customer to charge, etc).
  75. If order looks real... • Calculate risk: • If low,

    saves everything: params, headers, etc. to DB. • Returns a response that looks identical to a production response.
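Chocolate's code isn't shown in the deck, but the description above maps onto a very small Sinatra app. A minimal sketch, assuming a stored_requests table, the same hard-coded $100 threshold as the Risk class earlier, and made-up endpoint, column, and header names:

      # Hypothetical sketch of a Chocolate-style failover endpoint; not the real app.
      require 'sinatra'
      require 'sequel'
      require 'json'

      DB = Sequel.connect(ENV.fetch('DATABASE_URL'))

      post '/orders' do
        payload = JSON.parse(request.body.read)

        # Reject anything that doesn't look low risk (threshold is assumed).
        halt 402, { error: 'card failed!' }.to_json if payload['amount'].to_f >= 100.0

        # Store the raw request so it can be replayed against production later.
        DB[:stored_requests].insert(
          params:     payload.to_json,
          headers:    request.env.select { |k, _| k.start_with?('HTTP_') }.to_json,
          request_id: request.env['HTTP_X_REQUEST_ID'],   # assumed header name
          created_at: Time.now
        )

        # Answer with something shaped like a successful production response.
        status 201
        { status: 'accepted' }.to_json
      end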
  76. Replaying Orders

  77. When we’re back up: • Order model on chocolate has

    a replay method. • Manual process run by support team to track results (and follow up if necessary).
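The replay method isn't shown either. One plausible shape for it, assuming chocolate's Order model wraps the stored_requests table from the sketch above and that production's create endpoint is reachable at a known URL (all names here are placeholders):

      # Hypothetical sketch of replaying a stored order back to production.
      require 'sequel'
      require 'net/http'
      require 'uri'

      DB = Sequel.connect(ENV.fetch('DATABASE_URL'))

      class Order < Sequel::Model(:stored_requests)
        PRODUCTION_ORDERS_URL = URI('https://production.example.com/orders') # placeholder

        # POST the captured params back to the real #create endpoint. The
        # request id header lets production de-dupe (slides 78-79).
        def replay
          req = Net::HTTP::Post.new(PRODUCTION_ORDERS_URL, 'Content-Type' => 'application/json')
          req['X-Request-Id'] = request_id   # assumed header name
          req.body = params

          res = Net::HTTP.start(PRODUCTION_ORDERS_URL.host, PRODUCTION_ORDERS_URL.port, use_ssl: true) do |http|
            http.request(req)
          end
          res.is_a?(Net::HTTPSuccess)
        end
      end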
  78. De-duping • Could be a case where an order is

    in chocolate and in production. • Don’t want to double-charge the customer. • Need to de-dupe.
  79. De-duping • Akamai injects a unique request ID for every

    order we create. • Store this on each order in production and on replays in chocolate. • Chocolate sends this as part of a replay.
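On the production side, de-duping can be as simple as refusing to create a second order for a request ID it has already seen. A sketch of that check, with an assumed akamai_request_id column and header name:

      # Hypothetical sketch -- column and header names are assumptions.
      # app/controllers/orders_controller.rb
      class OrdersController < ApplicationController
        def create
          request_id = request.headers['X-Request-Id']

          if request_id.present? && Order.exists?(akamai_request_id: request_id)
            # Production already processed this request (e.g. it succeeded slowly
            # and chocolate replayed it anyway), so don't charge the card twice.
            head :ok
          else
            order = Order.new(order_params.merge(akamai_request_id: request_id))
            # ... charge the card and save the order as in slide 30 ...
            head :created
          end
        end

        private

        # Strong parameters; attribute list is illustrative.
        def order_params
          params.require(:order).permit(:amount, :customer_id)
        end
      end

A unique database index on the request ID column would make the same guarantee hold under concurrent replays.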
  80. When to Fail Over

  81. Triggering • Akamai has a rule that if a POST

    to our order #create endpoint takes > 15 seconds, retry the exact same request on chocolate. • Sometimes production will actually succeed, but not a problem: chocolate de-dupes.
  82. Pros of using something like Akamai • Allows you to

    auto-replay to separate endpoints. • If done correctly, your site will never appear to be down.
  83. Cons • Adds a fairly significant layer of complexity. •

    Adds non-trivial costs.
  84. Even though our site is up... we would still see

    orders fail over to chocolate.
  85. 0" 100" 200" 300" 400" 500" 600" 2/1/13" 2/2/13" 2/3/13"

    2/4/13" 2/5/13" 2/6/13" 2/7/13" 2/8/13" 2/9/13" 2/10/13" 2/11/13" 2/12/13" 2/13/13" 2/14/13" 2/15/13" 2/16/13" 2/17/13" 2/18/13" 2/19/13" 2/20/13" 2/21/13" 2/22/13" 2/23/13" 2/24/13" 2/25/13" 2/26/13" 2/27/13" 2/28/13" 3/1/13" 3/2/13" 3/3/13" 3/4/13" 3/5/13" 3/6/13" 3/7/13" 3/8/13" 3/9/13" 3/10/13" 3/11/13" 3/12/13" 3/13/13" 3/14/13" 3/15/13" 3/16/13" 3/17/13" 3/18/13" 3/19/13" 3/20/13" 3/21/13" 3/22/13" 3/23/13" 3/24/13" 3/25/13" 3/26/13" 3/27/13" 3/28/13" 3/29/13" 3/30/13" 3/31/13" 4/1/13" 4/2/13" Failovers*per*day*
  86. No services were down.

  87. What could be causing this?

  88. Have you ever heard of random routing?

  89. Dynos get backed up • Every day, a handful of

    orders still end up failing over to chocolate.
  90. Dynos Heroku Router 1 2 3 4 5

  91. Dynos Heroku Router 1 2 3 4 5

  92. Dynos Heroku Router 1 2 3 4 5

  93. Dynos Heroku Router 1 2 3 4 5

  94. Dynos Heroku Router 1 2 3 4 5

  95. Dynos Heroku Router 1 2 3 4 5

  96. Dynos Heroku Router 1 2 3 4 5

  97. Dynos Heroku Router 1 2 3 4 5

  98. Dynos Heroku Router 1 2 3 4 5

  99. Dynos Heroku Router 1 2 3 4 5

  100. Dynos Heroku Router 1 2 3 4 5

  101. t ≥ 15 sec Dynos Heroku Router 1 2 3

    4 5
  102. Dynos Heroku Router 1 2 3 4 5 Timeout

  103. Solutions • Make all endpoints fast to free up dynos

    quickly. • Keep tuning unicorn and failover timeouts. • No guaranteed way to solve this.
  104. We’re Still Investigating... • We’ve been obsessively tuning unicorn worker

    counts, backlog total, etc.
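For context, the Unicorn knobs being tuned look roughly like this; the numbers below are illustrative, not the values from the talk:

      # config/unicorn.rb -- illustrative values only
      worker_processes 4      # more workers means fewer requests stuck behind a slow one
      timeout 15              # kill any worker that runs past the failover window

      # A small listen backlog makes a busy dyno refuse connections quickly
      # instead of silently queueing them behind in-flight requests.
      listen ENV.fetch('PORT', '3000').to_i, backlog: 16

      preload_app true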
  105. Things to Remember

  106. 1. Your site will go down.

  107. 2. Use a replayer for critical web requests.

  108. 3. Accept some risk to keep customers happy.

  109. 4. Keep your endpoints lean and fast.

  110. Prem Sichanugrist @sikachu /sikachu Thank you