The ups and downs of building a payments API

This talk was given at the University of Warwick Computing Society (UWCS).

The talk was given right before Halloween, hence the bats.

---

Lots of companies need to collect payments online, but nobody wants the hassle of building their own payments system. So what's it like if you're the one building it for everyone else?

GoCardless is a company doing just that. In this talk, we'll explore two themes - the challenges of building APIs, and the challenges of building a payments product. We'll look at how they tie together, and how they shaped the way we build software at GoCardless.

We'll wrap up by exploring a case where things didn't go to plan - an API outage - and how we learned from that failure.

After that, there'll be plenty of time for Q&A.

Chris Sinjakli

October 30, 2018

Transcript

  1. Hi

    View Slide

  2. @ChrisSinjo

    View Slide

  3. @ChrisSinjo

    View Slide

  4. GOCARDLESS

    View Slide

  5. Site Reliability
    Engineer

    View Slide

  6. Site Reliability
    Engineer
    (a Software Engineer who cares more about
    systems…ish)

    View Slide

  7. The ups and downs
    of building
    a payments API
    @ChrisSinjo

    View Slide

  8. GOCARDLESS

    View Slide

  9. GOCARDLESS

    View Slide

  10. What is a GoCardless then?

    View Slide

  11. –The sales pitch
    “A global bank-to-bank payments
    network”

    View Slide

  12. –The sales pitch
    “A global bank-to-bank payments
    network”

    View Slide

  13. Direct Debit systems

    View Slide

  14. API (HTTP + JSON)
    Direct Debit systems

    View Slide

  15. POST /payments HTTP/1.1
    {
    "amount": 100,
    "currency": "GBP"
    }

    View Slide

  16. API (HTTP + JSON)

    View Slide

  17. API (HTTP + JSON)

    View Slide

  18. That’s what a GoCardless is

    View Slide

  19. 3 parts

    View Slide

  20. Challenges

    View Slide

  21. Challenges
    Approach

    View Slide

  22. Challenges
    Approach
    A tale from production

    View Slide

  23. CS in industry
    Variety of approaches
    Realities of production

    View Slide

  24. Challenges
    Approach
    A tale from production

    View Slide

  25. 2 categories

    View Slide

  26. Banking systems
    Reliability

    View Slide

  27. The banking
    systems
    Challenge 1

    View Slide

  28. So, about hiding those details…
    API (HTTP + JSON)

    View Slide

  29. File formats
    Payment timings
    Weird edge cases

    View Slide

  30. File formats
    Payment timings
    Weird edge cases

    View Slide


  31. <account>55779911</account>
    <sort_code>200000</sort_code>
    <amount>100</amount>
    XML

    View Slide

  32. account,sort_code,amount
    55779911,200000,100
    CSV

    View Slide

  33. account sort_code amount
    55779911 200000 100
    Column-aligned

    View Slide
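
The three file-format slides show one payment instruction in three layouts. A minimal Python sketch of rendering the same record each way, assuming the field names from the CSV slide. The XML element names and the 10-character column width are illustrative guesses, not any bank's real specification:

```python
import csv
import io
from xml.etree import ElementTree as ET

# Field names taken from the CSV slide; values from the same example.
FIELDS = ["account", "sort_code", "amount"]
payment = {"account": "55779911", "sort_code": "200000", "amount": "100"}


def to_xml(record):
    # Element names are hypothetical; real schemes define their own schemas.
    root = ET.Element("payment")
    for name in FIELDS:
        ET.SubElement(root, name).text = record[name]
    return ET.tostring(root, encoding="unicode")


def to_csv(record):
    # One header row followed by one data row.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


def to_fixed_width(record, width=10):
    # Every column is padded to the same (assumed) width.
    header = "".join(name.ljust(width) for name in FIELDS).rstrip()
    row = "".join(record[name].ljust(width) for name in FIELDS).rstrip()
    return header + "\n" + row + "\n"
```

Keeping the record itself format-agnostic and pushing the layout into small per-format functions mirrors the idea that file formats are a detail to hide from API users.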

  34. File formats
    Payment timings
    Weird edge cases

    View Slide

  35. Direct Debit takes
    multiple working
    days

    View Slide

  36. https://gocardless.com/blog/bacs-sepa-processing-dates-2018/

    View Slide

  37. View Slide

  38. why?? (╯°□°)╯︵ ┻━┻
    why??

    View Slide

  39. Lots
    of calendar
    maths

    View Slide

  40. Merchant gives collection
    date

    We work out the rest

    View Slide
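
The "calendar maths" can be sketched as working backwards from the merchant's collection date over working days only. This is a simplified illustration: the three-day lead time and the holiday list are assumptions made for the example, not the real Bacs or SEPA rules:

```python
from datetime import date, timedelta

# Hypothetical bank holidays; a real system would load the scheme's calendar.
BANK_HOLIDAYS = {date(2018, 12, 25), date(2018, 12, 26)}


def is_working_day(day):
    # Monday is 0, so 5 and 6 are Saturday and Sunday.
    return day.weekday() < 5 and day not in BANK_HOLIDAYS


def subtract_working_days(day, count):
    # Step back one calendar day at a time, counting only working days.
    while count > 0:
        day -= timedelta(days=1)
        if is_working_day(day):
            count -= 1
    return day


def submission_date(collection_date, lead_days=3):
    # lead_days is an assumed figure, not a real scheme rule.
    return subtract_working_days(collection_date, lead_days)
```

For a collection date of Friday 28 December 2018, the sketch skips the two holidays and the weekend and lands on Friday 21 December, which shows how quickly a "few days" stretches across a holiday period.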

  41. File formats
    Payment timings
    Weird edge cases

    View Slide

  42. Payment failures
    in "

    View Slide

  43. Explicit failure
    Implicit success

    View Slide

  44. –A flawless plan
    “Wait 2 days and if you don’t hear,
    it’s fine.”

    View Slide

  45. –Reality
    “Wait 2 days and if you don’t hear,
    it’s fine, except…”

    View Slide

  46. Explicit failure
    Implicit success

    View Slide

  47. Explicit failure
    Explicit success

    View Slide
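
Turning "explicit failure, implicit success" into an explicit status can be sketched as a small decision function. This is a toy model under stated assumptions: the two-day wait comes from the quoted plan, the function counts calendar days for brevity, and none of the "except…" edge cases are modelled:

```python
from datetime import date


def payment_outcome(submitted_on, today, failure_reported, wait_days=2):
    # The banking scheme only ever reports failures.
    if failure_reported:
        return "failed"
    # No news after the waiting period is treated as success.
    if (today - submitted_on).days >= wait_days:
        return "successful"
    # Too early to tell either way.
    return "pending"
```

The API user only ever sees one of the three explicit values; the waiting and inference stay inside the integration.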

  48. It’s our job
    to handle the
    complexity

    View Slide

  49. Expectations of
    reliability
    Challenge 2

    View Slide

  50. POST /payments HTTP/1.1
    {
    "amount": 100,
    "currency": "GBP"
    }

    View Slide

  51. High per-request

    View Slide

  52. Reliability is

    View Slide

  53. Failure =>

    View Slide

  54. So we had to start
    early

    View Slide

  55. Move
    fast and
    break
    things

    View Slide

  56. Move
    fast and
    break
    things

    View Slide

  57. What did that
    mean?

    View Slide

  58. Avoid shipping bugs

    View Slide

  59. Avoid shipping bugs
    Be available

    View Slide

  60. Avoid shipping bugs
    Be available
    Don’t lose payments

    View Slide

  61. Challenges
    Approach
    A tale from production

    View Slide

  62. Challenges
    Approach
    A tale from production

    View Slide

  63. Find the right
    abstractions
    Approach 1

    View Slide

  64. Our challenge:
    Abstracting pain
    away from
    merchants

    View Slide

  65. Split the problem:
    core model vs
    integrations

    View Slide

  66. Core
    Model
    API
    Requests

    View Slide

  67. Core
    Model
    Banking
    Integration
    Banking
    Integration
    Banking
    Integration
    API
    Requests

    View Slide

  68. Core
    Model
    Banking
    Integration
    Banking
    Integration
    Banking
    Integration
    Banks
    API
    Requests

    View Slide

  69. Core
    Model
    Banking
    Integration
    Banking
    Integration
    Banking
    Integration
    Banks
    API
    Requests

    View Slide

  70. Core
    Model
    API
    Requests
    Cares about:
    - Payment status
    - Fees
    - Merchant payout

    View Slide

  71. Banking
    Integration
    Banking
    Integration
    Banking
    Integration
    Cares about:
    - File formats
    - Timings
    - Payment events

    View Slide

  72. So about the
    status of a
    payment…

    View Slide

  73. Created Submitted
    Successful
    Failed

    View Slide

  74. Created Submitted
    Successful
    Failed
    Submit

    View Slide

  75. Created Submitted
    Successful
    Failed
    Submit Succeed
    Fail

    View Slide

  76. Created Submitted
    Successful
    Failed
    Submit Succeed
    Fail
    Retry

    View Slide

  77. A familiar
    Computer Science
    abstraction

    View Slide

  78. State machines!

    View Slide

  79. https://github.com/gocardless/statesman

    View Slide

  80. Example state machine

    View Slide
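
The payment lifecycle from the earlier slides can be sketched as a small state machine. This is a plain-Python illustration, not the API of the statesman library linked above. The states and events are read off the slides; treating "Retry" as a move from Failed back to Submitted is an assumption:

```python
class InvalidTransition(Exception):
    """Raised when an event is not allowed from the current state."""


# (current state, event) -> next state
TRANSITIONS = {
    ("created", "submit"): "submitted",
    ("submitted", "succeed"): "successful",
    ("submitted", "fail"): "failed",
    ("failed", "retry"): "submitted",  # assumed direction of the Retry arrow
}


class Payment:
    def __init__(self):
        self.state = "created"
        self.history = ["created"]  # every state the payment has been in

    def fire(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise InvalidTransition(f"cannot {event} from {self.state}")
        self.state = TRANSITIONS[key]
        self.history.append(self.state)
        return self.state
```

Keeping the allowed transitions in one table means an impossible move, such as submitting an already successful payment, fails loudly instead of silently corrupting the payment's status.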

  81. Lots
    of calendar
    maths

    View Slide

  82. https://gocardless.com/blog/bacs-sepa-processing-dates-2018/

    View Slide

  83. https://github.com/gocardless/business

    View Slide

  84. GoCardless

    open source

    View Slide

  85. Detect bugs
    early
    Approach 2

    View Slide

  86. Automated
    testing

    View Slide

  87. 109,000 lines of prod code

    View Slide

  88. 109,000 lines of prod code
    216,000 lines of test code

    View Slide

  89. 109,000 lines of prod code
    216,000 lines of test code
    ~1:2 ratio

    View Slide

  90. 30,000 tests on
    every
    git push

    View Slide

  91. And we don’t
    merge if they’re
    broken

    View Slide

  92. Production
    monitoring

    View Slide

  93. Error rate (%) = (errors / total reqs) × 100

    View Slide

  94. API requests
    server
    v1
    server
    v1
    server
    v1

    View Slide

  95. API requests
    server
    v1
    server
    v1
    server
    v1
    Error rate: 0.001%

    View Slide

  96. API requests
    server
    v2
    server
    v2
    server
    v2
    Error rate: 0.5%

    View Slide

  97. API requests
    server
    v1
    server
    v1
    server
    v1
    Error rate: 0.001%

    View Slide
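
The deploy-then-roll-back sequence in these slides can be sketched as an error-rate calculation plus a comparison against the previous version. The `max_ratio` threshold is a hypothetical parameter chosen for the example; the slides do not say what rule was actually used:

```python
def error_rate_percent(errors, total_requests):
    """Error rate (%) = (errors / total requests) x 100."""
    if total_requests == 0:
        return 0.0
    return errors / total_requests * 100


def should_roll_back(baseline_percent, current_percent, max_ratio=10.0):
    # max_ratio is an assumed tolerance: how many times worse than the
    # previous version's error rate the new version is allowed to be.
    if baseline_percent == 0:
        return current_percent > 0
    return current_percent / baseline_percent > max_ratio
```

With the figures from the slides, a jump from 0.001% to 0.5% is 500 times the baseline, so the check asks for a rollback to v1.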

  98. View Slide

  99. View Slide

  100. A goal:
    canary deploys

    View Slide

  101. API requests
    server
    v1
    server
    v1
    server
    v1
    33.3% 33.3% 33.3%

    View Slide

  102. API requests
    server
    v1
    server
    v1
    server
    v1
    server
    v2
    33.3% 33.3% 33.3%

    View Slide

  103. API requests
    server
    v1
    server
    v1
    server
    v1
    server
    v2
    33% 33% 33% 1%

    View Slide

  104. API requests
    server
    v1
    server
    v1
    server
    v1
    server
    v2
    33% 33% 33% 1%

    View Slide

  105. API requests
    server
    v1
    server
    v1
    server
    v1
    33.3% 33.3% 33.3%

    View Slide

  106. More
    automated use
    of error rates

    View Slide
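
The traffic split in the canary slides can be sketched as a weight table and a weighted random choice. The server names and the helper functions are hypothetical; a real load balancer would be configured rather than coded like this:

```python
import random


def canary_weights(stable_servers, canary_server, canary_percent=1.0):
    # The stable servers share whatever the canary does not take.
    share = (100.0 - canary_percent) / len(stable_servers)
    weights = {name: share for name in stable_servers}
    weights[canary_server] = canary_percent
    return weights


def pick_server(weights, rng):
    # Choose one server, with probability proportional to its weight.
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


weights = canary_weights(["v1-a", "v1-b", "v1-c"], "v2-canary")
server = pick_server(weights, random.Random(0))
```

Three stable servers and a 1% canary give the 33% / 33% / 33% / 1% split on the slide. If the canary's error rate looks bad, only a small share of requests were ever exposed to the new version.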

  107. Make retries
    end-to-end
    Approach 3

    View Slide

  108. The reality of
    production:
    things break

    View Slide

  109. Networks break
    Machines break
    Applications break

    View Slide

  110. GoCardless
    (simplified)

    View Slide

  111. Web backend
    Load balancer
    Postgres
    database
    GoCardless
    (simplified)

    View Slide

  112. API User
    Web backend
    Load balancer
    Postgres
    database
    GoCardless
    (simplified)

    View Slide

  113. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)

    View Slide

  114. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)

    View Slide

  115. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  116. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  117. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  118. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  119. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  120. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  121. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  122. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  123. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  124. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  125. This is the only
    way to win

    View Slide

  126. http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf

    View Slide

  127. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  128. API User
    Web backend
    Load balancer
    Postgres
    database
    Network
    GoCardless
    (simplified)
    Broken (temp)
    Retry

    View Slide

  129. This is an
    optimisation

    View Slide

  130. So how do
    we do it?

    View Slide

  131. Retry
    requires
    idempotency

    View Slide

  132. Many invocations
    Effect only happens once

    View Slide
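
"Many invocations, effect only happens once" can be sketched with an idempotency key that the API user sends on every attempt. Everything here is an in-memory stand-in: the class names, the key handling, and the simulated network fault are assumptions for illustration, not GoCardless's implementation:

```python
import uuid


class PaymentsApi:
    """Toy server that remembers the response for each idempotency key."""

    def __init__(self):
        self.payments = []
        self.responses_by_key = {}

    def create_payment(self, idempotency_key, amount, currency):
        # A repeated key returns the stored response instead of creating again.
        if idempotency_key in self.responses_by_key:
            return self.responses_by_key[idempotency_key]
        payment = {
            "id": f"PM{len(self.payments) + 1}",
            "amount": amount,
            "currency": currency,
        }
        self.payments.append(payment)
        self.responses_by_key[idempotency_key] = payment
        return payment


class FlakyNetwork:
    """Loses the response of the first `failures` calls, after the server ran."""

    def __init__(self, api, failures):
        self.api = api
        self.failures = failures

    def post_payment(self, key, amount, currency):
        response = self.api.create_payment(key, amount, currency)
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("response lost in transit")
        return response


def create_payment_with_retry(network, amount, currency, attempts=3):
    key = str(uuid.uuid4())  # one key shared by every attempt
    for attempt in range(attempts):
        try:
            return network.post_payment(key, amount, currency)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
```

The retry lives with the API user, at the end of the chain, so it covers a failure in any component in between. The key is what makes that retry safe: the server processed the first request even though its response was lost, and the later attempts find the stored result instead of charging twice.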

  133. View Slide

  134. View Slide

  135. Now, there’s
    one more trick

    View Slide

  136. https://github.com/gocardless/gocardless-pro-ruby

    View Slide

  137. View Slide

  138. Replicate data
    synchronously
    Approach 4

    View Slide

  139. Losing payments
    => bad times

    View Slide

  140. Web backend
    Postgres

    View Slide

  141. Web backend
    Postgres

    View Slide

  142. Web backend
    Postgres
    0
    GOCARDLESS

    View Slide

  143. Web backend
    Postgres

    View Slide

  144. Web backend
    Postgres
    Backup in remote storage
    (e.g. Amazon S3)

    View Slide

  145. Slow to restore
    A bit behind

    View Slide

  146. Slow to restore
    A bit behind

    View Slide

  147. Web backend
    Postgres

    View Slide

  148. Postgres
    Postgres
    Replication
    Web backend

    View Slide

  149. Postgres
    Postgres
    Replication
    Web backend

    View Slide

  150. Postgres
    Postgres
    Web backend

    View Slide

  151. Postgres
    Postgres
    Web backend

    View Slide

  152. Postgres
    Postgres
    Web backend
    Replication

    View Slide

  153. Primary: accepts writes
    Replica: copies primary

    View Slide

  154. A caveat

    View Slide

  155. Postgres
    Postgres
    Replication
    Web backend

    View Slide

  156. Postgres
    Postgres
    Synchronous
    Web backend
    Replication

    View Slide

  157. Postgres
    Postgres
    Synchronous
    Web backend
    Replication
    INSERT INTO payments…

    View Slide

  158. Postgres
    Postgres
    Synchronous
    Web backend
    Replication
    INSERT INTO payments…

    View Slide

  159. Postgres
    Postgres
    Synchronous
    Web backend
    Replication
    INSERT INTO payments…
    1

    View Slide

  160. Postgres
    Postgres
    Asynchronous
    Web backend
    Replication
    INSERT INTO payments…

    View Slide

  161. Postgres
    Postgres
    Asynchronous
    Web backend
    Replication
    INSERT INTO payments…

    View Slide

  162. Postgres
    Postgres
    Asynchronous
    Web backend
    Replication
    1
    INSERT INTO payments…

    View Slide

  163. Postgres
    Postgres
    Synchronous
    Web backend
    Replication
    INSERT INTO payments…
    1

    View Slide
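
The difference between the synchronous and asynchronous slides comes down to when the primary acknowledges a write. A toy model of that ordering, assuming one primary and one replica; it says nothing about how Postgres implements replication internally:

```python
class Replica:
    def __init__(self):
        self.rows = []

    def apply(self, row):
        self.rows.append(row)


class Primary:
    def __init__(self, replica, synchronous):
        self.replica = replica
        self.synchronous = synchronous
        self.rows = []
        self.unsent = []  # committed locally but not yet on the replica

    def insert(self, row):
        self.rows.append(row)
        if self.synchronous:
            # Acknowledge only once the replica also has the row.
            self.replica.apply(row)
        else:
            # Acknowledge straight away; the replica catches up later.
            self.unsent.append(row)
        return "ok"

    def ship_pending(self):
        # Stands in for the background replication stream in async mode.
        while self.unsent:
            self.replica.apply(self.unsent.pop(0))
```

In the asynchronous case there is a window where the client has been told "ok" but the row exists only on the primary. If the primary is lost inside that window, so is the payment, which is the risk synchronous replication removes at the cost of slower writes.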

  164. Automate database
    failover
    Approach 5

    View Slide

  165. Postgres
    Postgres
    Replication
    Web backend

    View Slide

  166. Postgres
    Postgres
    Replication
    Web backend

    View Slide

  167. View Slide

  168. Postgres
    Postgres
    Replication
    Web backend
    2

    View Slide

  169. Postgres
    Postgres
    Web backend
    2

    View Slide

  170. Postgres
    Postgres
    Web backend
    2

    View Slide

  171. Postgres
    Postgres
    Web backend
    2

    View Slide

  172. Postgres
    Postgres
    Web backend
    2

    View Slide

  173. Postgres
    Postgres
    Web backend
    2
    Replication

    View Slide

  174. Kinda okay,
    but…

    View Slide

  175. Slow recovery
    Error prone

    View Slide

  176. You gotta perform:
    - Many steps
    - In the right order
    - Perfectly

    View Slide

  177. Don’t make a
    tired
    SRE think

    View Slide

  178. Add automation

    View Slide

  179. Pacemaker
    A clustering tool

    View Slide

  180. How do we know a
    node has failed?

    View Slide

  181. B
    A

    View Slide

  182. B
    A

    View Slide

  183. B
    A
    ? ?

    View Slide

  184. B
    A

    View Slide

  185. B
    A

    View Slide

  186. B
    A

    View Slide

  187. We can’t choose
    a primary

    View Slide

  188. Quorum

    View Slide

  189. A majority of nodes
    must be available

    View Slide

  190. (n+1)/2, rounded up

    View Slide

  191. ⌈(n+1)/2⌉

    View Slide

  192. By example

    View Slide

  193. Nodes Quorum
    2 2

    View Slide

  194. Nodes Quorum
    2 2
    3 2

    View Slide

  195. Nodes Quorum
    2 2
    3 2
    4 3

    View Slide

  196. Nodes Quorum
    2 2
    3 2
    4 3
    5 3

    View Slide
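
The quorum rule and the table above can be checked with a few lines of Python. The `can_elect_primary` helper is a hypothetical name added for the example:

```python
import math


def quorum(nodes):
    # Majority of the cluster: (n + 1) / 2, rounded up.
    return math.ceil((nodes + 1) / 2)


def can_elect_primary(reachable_nodes, cluster_size):
    # A group of nodes may only act if it can see a majority.
    return reachable_nodes >= quorum(cluster_size)


table = [(n, quorum(n)) for n in (2, 3, 4, 5)]
```

With two nodes, quorum is 2, so a node that loses sight of its peer cannot safely promote itself. With three nodes, the two that can still see each other form a majority, which is why the setup that follows adds a third node.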

  197. B
    A

    View Slide

  198. B
    A
    C

    View Slide

  199. B
    A
    C

    View Slide

  200. B
    A
    C

    View Slide

  201. Let’s fix our
    setup

    View Slide

  202. But first,
    an apology

    View Slide

  203. View Slide

  204. 3

    View Slide

  205. View Slide

  206. Postgres
    Postgres
    Replication
    Web backend

    View Slide

  207. Postgres
    Postgres
    Postgres
    Repl Repl
    Web backend

    View Slide

  208. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    Web backend

    View Slide

  209. Postgres
    Postgres
    Postgres
    Repl Repl
    Pacemaker Pacemaker Pacemaker
    Web backend
    Connection proxy

    View Slide

  210. Postgres
    Postgres
    Postgres
    Repl Repl
    Pacemaker Pacemaker Pacemaker
    Web backend
    Connection proxy

    View Slide

  211. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    Web backend
    Connection proxy

    View Slide

  212. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    Web backend
    Connection proxy
    Repl

    View Slide

  213. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    Web backend
    Connection proxy
    Repl

    View Slide

  214. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    Web backend
    Connection proxy
    Repl
    Repl

    View Slide

  215. Sorted!

    View Slide

  216. An endless list
    Approach 6, 7, 8…

    View Slide

  217. Challenges
    Approach
    A tale from production

    View Slide

  218. Challenges
    Approach
    A tale from production

    View Slide

  219. https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/

    View Slide

  220. Postgres
    Postgres
    Postgres
    Repl Repl
    Pacemaker Pacemaker Pacemaker
    Web backend
    Connection proxy

    View Slide

  221. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    Web backend

    View Slide

  222. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend

    View Slide

  223. Postgres
    Postgres
    Postgres Repl
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker
    Web backend

    View Slide

  224. Web backend
    Postgres
    Postgres
    Postgres Repl
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker

    View Slide

  225. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend

    View Slide

  226. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend

    View Slide

  227. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend

    View Slide

  228. Our API
    was down

    View Slide

  229. Priority:
    “Stop the bleeding”

    View Slide

  230. What’s the quickest
    way we can be
    back up?

    View Slide

  231. Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend

    View Slide

  232. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend

    View Slide

  233. Our API
    was down

    View Slide

  234. Priority:
    “Stop the bleeding”

    View Slide

  235. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend

    View Slide

  236. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend
    crm resource cleanup

    View Slide

  237. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend
    “pacemaker forget errors”

    View Slide

  238. Our API
    was down

    View Slide

  239. Priority:
    “Stop the bleeding”

    View Slide

  240. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend

    View Slide

  241. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend
    2

    View Slide

  242. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    Web backend
    2
    Repl

    View Slide

  243. Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP
    2
    Repl
    Web backend

    View Slide

  244. Our API
    was up

    View Slide

  245. Priority:
    “Stop the bleeding”

    View Slide

  246. What’s the quickest
    way we can be
    back up?

    View Slide

  247. The investigation
    that followed

    View Slide

  248. https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/

    View Slide

  249. View Slide

  250. View Slide

  251. View Slide

  252. CS in industry
    Variety of approaches
    Realities of production

    View Slide

  253. CS in industry
    Variety of approaches
    Realities of production

    View Slide

  254. Idempotency
    End-to-end principle
    Quorum

    View Slide

  255. CS in industry
    Variety of approaches
    Realities of production

    View Slide

  256. End-to-end retry
    Automated failure handling
    People!

    View Slide

  257. CS in industry
    Variety of approaches
    Realities of production

    View Slide

  258. Networks break
    Machines break
    Applications break

    View Slide

  259. @ChrisSinjo
    @GoCardlessEng
    Thank you
    ❤

    View Slide

  260. We’re hiring
    ❤
    @ChrisSinjo
    @GoCardlessEng
    bit.ly/gc-interns-2019

    View Slide

  261. We’re hiring
    ❤
    @ChrisSinjo
    @GoCardlessEng
    bit.ly/gc-software-engineer

    View Slide

  262. Image credits
    • Castle with bats - CC0 - https://pixabay.com/en/bats-castle-evil-flying-full-moon-2027875/
    • call me maybe? - CC-BY - https://www.flickr.com/photos/myuibe/7880399646/
    • Rope - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/

    View Slide

  263. @ChrisSinjo
    @GoCardlessEng
    Questions?
    ❤

    View Slide