The ups and downs of building a payments API

This talk was given at the University of Warwick Computing Society (UWCS).

The talk was given right before Halloween, hence the bats.

---

Lots of companies need to collect payments online, but nobody wants the hassle of building their own payments system. So what's it like if you're the one building it for everyone else?

GoCardless is a company doing just that. In this talk, we'll explore two themes - the challenges of building APIs, and the challenges of building a payments product. We'll look at how they tie together, and how they shaped the way we build software at GoCardless.

We'll wrap up by exploring a case where things didn't go to plan - an API outage - and how we learned from that failure.

After that, there'll be plenty of time for Q&A.

Chris Sinjakli

October 30, 2018

Transcript

  1. Hi

  2. @ChrisSinjo

  3. @ChrisSinjo

  4. GOCARDLESS

  5. Site Reliability Engineer

  6. Site Reliability Engineer (a Software Engineer who cares more about systems…ish)

  7. The ups and downs of building a payments API @ChrisSinjo

  8. GOCARDLESS

  9. GOCARDLESS

  10. What is a GoCardless then?

  11. –The sales pitch “A global bank-to-bank payments network”

  12. –The sales pitch “A global bank-to-bank payments network”

  13. Direct Debit systems

  14. API (HTTP + JSON) Direct Debit systems

  15. POST /payments HTTP/1.1 { "amount": 100, "currency": "GBP" }

  16. API (HTTP + JSON)

  17. API (HTTP + JSON)

  18. That’s what a GoCardless is

  19. 3 parts

  20. Challenges

  21. Challenges Approach

  22. Challenges Approach A tale from production

  23. CS in industry Variety of approaches Realities of production

  24. Challenges Approach A tale from production

  25. 2 categories

  26. Banking systems Reliability

  27. The banking systems Challenge 1

  28. So, about hiding those details… API (HTTP + JSON)

  29. File formats Payment timings Weird edge cases

  30. File formats Payment timings Weird edge cases

  31. <payment> <account>55779911</account> <sort_code>200000</sort_code> <amount>100</amount> </payment> XML

  32. account,sort_code,amount 55779911,200000,100 CSV

  33. account sort_code amount 55779911 200000 100 Column-aligned

  34. File formats Payment timings Weird edge cases

  35. Direct Debit takes multiple working days

  36. https://gocardless.com/blog/bacs-sepa-processing-dates-2018/

  37. None
  38. why?? (╯°□°)╯︵ ┻━┻ why??

  39. Lots of calendar maths

  40. Merchant gives collection date ↓ We work out the rest
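
The calendar maths on these slides can be sketched roughly as follows. This is a minimal Python sketch, not GoCardless's implementation: the two-working-day lead time, the function names and the caller-supplied holiday set are all assumptions made for illustration, since the slides only say that the merchant gives a collection date and the rest is worked out from it.

```python
from datetime import date, timedelta


def is_working_day(day, holidays):
    """Monday to Friday, excluding any supplied bank holidays."""
    return day.weekday() < 5 and day not in holidays


def subtract_working_days(day, count, holidays=frozenset()):
    """Step back `count` working days from `day`."""
    current = day
    remaining = count
    while remaining > 0:
        current -= timedelta(days=1)
        if is_working_day(current, holidays):
            remaining -= 1
    return current


def submission_date(collection_date, lead_days=2, holidays=frozenset()):
    """Latest day a payment could be submitted to the bank.

    `lead_days` is a hypothetical lead time; real schemes define their own.
    """
    return subtract_working_days(collection_date, lead_days, holidays)
```

With these assumptions, a collection date of Monday 5 November 2018 gives a submission date of Thursday 1 November 2018, because the weekend is skipped.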

  41. File formats Payment timings Weird edge cases

  42. Payment failures in "

  43. Explicit failure Implicit success

  44. –A flawless plan “Wait 2 days and if you don’t hear, it’s fine.”

  45. –Reality “Wait 2 days and if you don’t hear, it’s fine, except…”

  46. Explicit failure Implicit success

  47. Explicit failure Explicit success

  48. It’s our job to handle the complexity

  49. Expectations of reliability Challenge 2

  50. POST /payments HTTP/1.1 { "amount": 100, "currency": "GBP" }

  51. High per-request

  52. Reliability is

  53. Failure =>

  54. So we had to start early

  55. Move fast and break things

  56. Move fast and break things

  57. What did that mean?

  58. Avoid shipping bugs

  59. Avoid shipping bugs Be available

  60. Avoid shipping bugs Be available Don’t lose payments

  61. Challenges Approach A tale from production

  62. Challenges Approach A tale from production

  63. Find the right abstractions Approach 1

  64. Our challenge: Abstracting pain away from merchants

  65. Split the problem: core model vs integrations

  66. Core Model API Requests

  67. Core Model Banking Integration Banking Integration Banking Integration API Requests

  68. Core Model Banking Integration Banking Integration Banking Integration Banks API Requests

  69. Core Model Banking Integration Banking Integration Banking Integration Banks API Requests

  70. Core Model API Requests Cares about: - Payment status - Fees - Merchant payout

  71. Banking Integration Banking Integration Banking Integration Cares about: - File formats - Timings - Payment events

  72. So about the status of a payment…

  73. Created Submitted Successful Failed

  74. Created Submitted Successful Failed Submit

  75. Created Submitted Successful Failed Submit Succeed Fail

  76. Created Submitted Successful Failed Submit Succeed Fail Retry

  77. A familiar Computer Science abstraction

  78. State machines!

  79. https://github.com/gocardless/statesman

  80. Example state machine
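
The payment lifecycle from the preceding slides can be sketched as a small state machine. This is an illustrative Python sketch and does not use Statesman's API (Statesman is a Ruby library). The states and event names come from the slides; the assumption that Retry moves a payment from Failed back to Submitted is mine, as the slide text does not say where the arrow points.

```python
class InvalidTransition(Exception):
    """Raised when an event is not allowed from the current state."""


# (current state, event) -> next state
TRANSITIONS = {
    ("created", "submit"): "submitted",
    ("submitted", "succeed"): "successful",
    ("submitted", "fail"): "failed",
    ("failed", "retry"): "submitted",  # assumed target state
}


class Payment:
    def __init__(self):
        self.state = "created"
        self.history = ["created"]  # every state the payment has been in

    def fire(self, event):
        """Apply an event, refusing anything the table does not allow."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise InvalidTransition(f"cannot {event} from {self.state}")
        self.state = TRANSITIONS[key]
        self.history.append(self.state)
        return self.state
```

Keeping the allowed transitions in one table means an impossible change, such as a successful payment being submitted again, fails loudly instead of silently corrupting the payment's status.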

  81. Lots of calendar maths

  82. https://gocardless.com/blog/bacs-sepa-processing-dates-2018/

  83. https://github.com/gocardless/business

  84. GoCardless ❤ open source

  85. Detect bugs early Approach 2

  86. Automated testing

  87. 109,000 lines of prod code

  88. 109,000 lines of prod code 216,000 lines of test code

  89. 109,000 lines of prod code 216,000 lines of test code ~1:2 ratio

  90. 30,000 tests on every git push

  91. And we don’t merge if they’re broken

  92. Production monitoring

  93. Error rate (%) = (errors / total reqs) × 100

  94. API requests server v1 server v1 server v1

  95. API requests server v1 server v1 server v1 Error rate: 0.001%

  96. API requests server v2 server v2 server v2 Error rate: 0.5%

  97. API requests server v1 server v1 server v1 Error rate: 0.001%

  98. None
  99. None
  100. A goal: canary deploys

  101. API requests server v1 server v1 server v1 33.3% 33.3% 33.3%

  102. API requests server v1 server v1 server v1 server v2 33.3% 33.3% 33.3%

  103. API requests server v1 server v1 server v1 server v2 33% 33% 33% 1%

  104. API requests server v1 server v1 server v1 server v2 33% 33% 33% 1%

  105. API requests server v1 server v1 server v1 33.3% 33.3% 33.3%

  106. More automated use of error rates
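
One way to picture automated use of error rates during a canary deploy is the sketch below. It is a hedged Python illustration only: the function names and the `tolerance` threshold are hypothetical, and the talk does not describe how GoCardless makes this decision in practice.

```python
def error_rate_percent(errors, total_requests):
    """Error rate (%) = (errors / total requests) x 100."""
    if total_requests == 0:
        return 0.0  # no traffic yet, so nothing to report
    return errors / total_requests * 100


def canary_is_healthy(baseline_rate, canary_rate, tolerance=0.1):
    """Accept the canary only if its error rate stays close to the baseline.

    `tolerance` is a hypothetical allowance in percentage points.
    """
    return canary_rate <= baseline_rate + tolerance
```

Using the figures from the earlier slides, a new version at 0.5% against a baseline of 0.001% would fail this check and be rolled back, while only a small share of traffic was exposed to it.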

  107. Make retries end-to-end Approach 3

  108. The reality of production: things break

  109. Networks break Machines break Applications break

  110. GoCardless (simplified)

  111. Web backend Load balancer Postgres database GoCardless (simplified)

  112. API User Web backend Load balancer Postgres database GoCardless (simplified)

  113. API User Web backend Load balancer Postgres database Network GoCardless (simplified)

  114. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp)

  115. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  116. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  117. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  118. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  119. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  120. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  121. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  122. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  123. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  124. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  125. This is the only way to win

  126. http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf

  127. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  128. API User Web backend Load balancer Postgres database Network GoCardless (simplified) Broken (temp) Retry

  129. This is an optimisation

  130. So how do we do it?

  131. Retry requires idempotency

  132. Many invocations Effect only happens once
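
A common way to get "many invocations, effect only happens once" is for the caller to send a unique key with each logical request and reuse it on every retry. The Python sketch below illustrates that idea under stated assumptions: `PaymentService`, `FlakyNetwork` and `create_with_retries` are hypothetical names, the store is in memory, and none of this is taken from the GoCardless API or client libraries.

```python
class PaymentService:
    """In-memory stand-in for a payments API (hypothetical)."""

    def __init__(self):
        self._payments = {}  # payment id -> payment
        self._by_key = {}    # idempotency key -> payment id

    def create_payment(self, idempotency_key, amount, currency):
        # A repeated key returns the original payment instead of a new one.
        if idempotency_key in self._by_key:
            return self._payments[self._by_key[idempotency_key]]
        payment_id = f"PM{len(self._payments) + 1:04d}"
        payment = {"id": payment_id, "amount": amount, "currency": currency}
        self._payments[payment_id] = payment
        self._by_key[idempotency_key] = payment_id
        return payment

    def payment_count(self):
        return len(self._payments)


class FlakyNetwork:
    """Delivers each request, then loses the first `drops` responses."""

    def __init__(self, service, drops):
        self._service = service
        self._drops = drops

    def create_payment(self, idempotency_key, amount, currency):
        payment = self._service.create_payment(idempotency_key, amount, currency)
        if self._drops > 0:
            self._drops -= 1
            raise ConnectionError("response lost")
        return payment


def create_with_retries(client, idempotency_key, amount, currency, attempts=3):
    """End-to-end retry: repeat the whole request with the same key."""
    for attempt in range(attempts):
        try:
            return client.create_payment(idempotency_key, amount, currency)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
```

The flaky network here models the awkward case: the request reached the server but the response was lost, so the caller cannot tell whether the payment was created. Because the key is reused, retrying is safe either way.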

  133. None
  134. None
  135. Now, there’s one more trick

  136. https://github.com/gocardless/gocardless-pro-ruby

  137. None
  138. Replicate data synchronously Approach 4

  139. Losing payments => bad times

  140. Web backend Postgres

  141. Web backend Postgres

  142. Web backend Postgres 0 GOCARDLESS

  143. Web backend Postgres

  144. Web backend Postgres Backup in remote storage (e.g. Amazon S3)

  145. Slow to restore A bit behind

  146. Slow to restore A bit behind

  147. Web backend Postgres

  148. Postgres Postgres Replication Web backend

  149. Postgres Postgres Replication Web backend

  150. Postgres Postgres Web backend

  151. Postgres Postgres Web backend

  152. Postgres Postgres Web backend Replication

  153. Primary: accepts writes Replica: copies primary

  154. A caveat

  155. Postgres Postgres Replication Web backend

  156. Postgres Postgres Synchronous Web backend Replication

  157. Postgres Postgres Synchronous Web backend Replication INSERT INTO payments…

  158. Postgres Postgres Synchronous Web backend Replication INSERT INTO payments…

  159. Postgres Postgres Synchronous Web backend Replication INSERT INTO payments… 1

  160. Postgres Postgres Asynchronous Web backend Replication INSERT INTO payments…

  161. Postgres Postgres Asynchronous Web backend Replication INSERT INTO payments…

  162. Postgres Postgres Asynchronous Web backend Replication 1 INSERT INTO payments…

  163. Postgres Postgres Synchronous Web backend Replication INSERT INTO payments… 1

  164. Automate database failover Approach 5

  165. Postgres Postgres Replication Web backend

  166. Postgres Postgres Replication Web backend

  167. None
  168. Postgres Postgres Replication Web backend 2

  169. Postgres Postgres Web backend 2

  170. Postgres Postgres Web backend 2

  171. Postgres Postgres Web backend 2

  172. Postgres Postgres Web backend 2

  173. Postgres Postgres Web backend 2 Replication

  174. Kinda okay, but…

  175. Slow recovery Error prone

  176. You gotta perform: - Many steps - In the right order - Perfectly

  177. Don’t make a tired SRE think

  178. Add automation

  179. Pacemaker A clustering tool

  180. How do we know a node has failed?

  181. B A

  182. B A

  183. B A ? ?

  184. B A

  185. B A

  186. B A

  187. We can’t choose a primary

  188. Quorum

  189. A majority of nodes must be available

  190. (n+1)/2, rounded up

  191. ⌈(n+1)/2⌉

  192. By example

  193. Nodes: 2 | Quorum: 2

  194. Nodes: 2, 3 | Quorum: 2, 2

  195. Nodes: 2, 3, 4 | Quorum: 2, 2, 3

  196. Nodes: 2, 3, 4, 5 | Quorum: 2, 2, 3, 3
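
The quorum rule from these slides, (n+1)/2 rounded up, is small enough to check directly. The Python sketch below reproduces the table; the function names are mine and are not part of Pacemaker.

```python
import math


def quorum(nodes):
    """Smallest number of nodes that forms a majority: ceil((n + 1) / 2)."""
    return math.ceil((nodes + 1) / 2)


def has_quorum(reachable, total):
    """True if the reachable nodes are enough to safely pick a primary."""
    return reachable >= quorum(total)
```

This shows why two nodes are not enough: with a quorum of 2, losing either node leaves the survivor unable to act, whereas a three-node cluster can lose one node and still have a majority.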

  197. B A

  198. B A C

  199. B A C

  200. B A C

  201. Let’s fix our setup

  202. But first, an apology

  203. None
  204. 3

  205. None
  206. Postgres Postgres Replication Web backend

  207. Postgres Postgres Postgres Repl Repl Web backend

  208. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker Web backend

  209. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker Web backend Connection proxy

  210. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker Web backend Connection proxy

  211. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker Web backend Connection proxy

  212. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker Web backend Connection proxy Repl

  213. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker Web backend Connection proxy Repl

  214. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker Web backend Connection proxy Repl Repl

  215. Sorted!

  216. An endless list Approach 6, 7, 8…

  217. Challenges Approach A tale from production

  218. Challenges Approach A tale from production

  219. https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/

  220. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker Web backend Connection proxy

  221. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker Web backend

  222. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker VIP Web backend

  223. Postgres Postgres Postgres Repl Repl VIP Pacemaker Pacemaker Pacemaker Web backend

  224. Web backend Postgres Postgres Postgres Repl Repl VIP Pacemaker Pacemaker Pacemaker

  225. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker VIP Web backend

  226. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker VIP Web backend

  227. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker VIP Web backend

  228. Our API was down

  229. Priority: “Stop the bleeding”

  230. What’s the quickest way we can be back up?

  231. Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker VIP Web backend

  232. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP Web backend

  233. Our API was down

  234. Priority: “Stop the bleeding”

  235. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP Web backend

  236. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP Web backend crm resource cleanup

  237. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP Web backend “pacemaker forget errors”

  238. Our API was down

  239. Priority: “Stop the bleeding”

  240. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP Web backend

  241. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP Web backend 2

  242. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP Web backend 2 Repl

  243. Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP 2 Repl Web backend

  244. Our API was up

  245. Priority: “Stop the bleeding”

  246. What’s the quickest way we can be back up?

  247. The investigation that followed

  248. https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/

  249. None
  250. None
  251. None
  252. CS in industry Variety of approaches Realities of production

  253. CS in industry Variety of approaches Realities of production

  254. Idempotency End-to-end principle Quorum

  255. CS in industry Variety of approaches Realities of production

  256. End-to-end retry Automated failure handling People!

  257. CS in industry Variety of approaches Realities of production

  258. Networks break Machines break Applications break

  259. @ChrisSinjo @GoCardlessEng Thank you 5❤

  260. We’re hiring 5❤ @ChrisSinjo @GoCardlessEng bit.ly/gc-interns-2019

  261. We’re hiring 5❤ @ChrisSinjo @GoCardlessEng bit.ly/gc-software-engineer

  262. Image credits • Castle with bats - CC0 - https://pixabay.com/en/bats-castle-evil-flying-full-moon-2027875/ • call me maybe? - CC-BY - https://www.flickr.com/photos/myuibe/7880399646/ • Rope - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/

  263. @ChrisSinjo @GoCardlessEng Questions? 5❤