Slide 1

Building a real-time data pipeline
10.28.2014
David Pick (@davidpick)

Slide 2

Simple, powerful payments.

Slide 3

No content

Slide 4

What is a data pipeline and why do we need one?

Slide 5

What is a data pipeline and why do we need one?_
BUT FIRST: what problems are we actually trying to solve?

Slide 6

What is a data pipeline and why do we need one?

Slide 7

Keeping your data warehouse in sync is difficult!

Slide 8

Keeping your data warehouse in sync is difficult!_
• Deletes are nearly impossible to keep track of.
• You have to keep track of which data changed.
• Batch updates are really slow, and it's difficult to know how long they'll take.

Slide 9

No content

Slide 10

Advanced Search_
• Powered by SQL
• Ran against our primary database
• Slow!
• Extremely limited

Slide 11

Problems We're Trying to Solve_
• Getting data into our data warehouse
• Getting data into our search platform
• Keeping that data in sync

Slide 12

[Diagram: Database → "Something" → Elasticsearch / Redshift]

Slide 13

[Same diagram as slide 12]

Slide 14

[Same diagram as slide 12]

Slide 15

[Same diagram as slide 12]

Slide 16

PGQ_

Slide 17

PGQ_
• Designed to facilitate asynchronous batch processing of live transactions
• Events are persisted in a special Postgres table until either every registered consumer has seen them or a set amount of time has passed
• ACID
• Fast!
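
PGQ is driven entirely through SQL functions, so a consumer is just a polling loop over a database connection. Below is a minimal sketch of that cycle from Clojure using clojure.java.jdbc; the queue name, consumer name, connection spec, and handler are illustrative, not from the talk.

(require '[clojure.java.jdbc :as jdbc])

(def db {:subprotocol "postgresql"
         :subname "//localhost/payments"}) ; hypothetical connection spec

;; One-time setup: create the queue and register a consumer on it.
(jdbc/query db ["SELECT pgq.create_queue('datastream')"])
(jdbc/query db ["SELECT pgq.register_consumer('datastream', 'pipeline')"])

;; The polling cycle: pgq.next_batch returns NULL until a batch is ready;
;; pgq.finish_batch marks every event in that batch as consumed.
(defn consume-batch [handle-event]
  (when-let [batch-id (-> (jdbc/query db ["SELECT pgq.next_batch('datastream', 'pipeline')"])
                          first vals first)]
    (doseq [event (jdbc/query db ["SELECT * FROM pgq.get_batch_events(?)" batch-id])]
      (handle-event event))
    (jdbc/query db ["SELECT pgq.finish_batch(?)" batch-id])))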

Slide 18

PGQ_
[Diagram: Database → PGQ → Elasticsearch / Redshift]

Slide 19

PGQ_

Slide 20

PGQ_
[Diagram: Database → PGQ → Elasticsearch / Redshift]

Slide 21

KAFKA_

Slide 22

PGQ_
[Diagram: Database → PGQ → Kafka → Elasticsearch / Redshift]

Slide 23

Kafka: Anatomy_
[Diagram: multiple producers write into a Kafka cluster; multiple consumers read from it]
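
The consumer side of clj-kafka appears on the code slides later (with-resource, consumer, messages); for symmetry, producing looks roughly like this. A sketch with an illustrative topic name, assuming clj-kafka's producer namespace, not code shown in the talk.

(require '[clj-kafka.producer :as kafka])

;; DefaultEncoder sends raw byte arrays; metadata.broker.list points at the cluster.
(def p (kafka/producer {"metadata.broker.list" "localhost:9092"
                        "serializer.class"     "kafka.serializer.DefaultEncoder"}))

;; Each message is appended to exactly one partition of the topic.
(kafka/send-message p (kafka/message "datastream" (.getBytes "payload")))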

Slide 24

Kafka: Inside a Topic_
[Diagram: a topic made of Partition 0, Partition 1, and Partition 2, each an ordered, append-only log of numbered offsets; writes go to the new end of each partition]

Slide 25

Kafka: Consumers_
[Diagram: Consumer 1 and Consumer 2 read from Partition 2, each tracking its own offset]

Slide 26

How are we using Kafka?

Slide 27

Database_
[Diagram: Gateway, Funding, and Apply databases feeding partitioned topics that flow into Redshift and Elasticsearch]

Slide 28

Kafka Pipeline_
[Diagram: Datastream and Eventstream topics (partitions 0–3 each) feeding a Redshift Loader and an Elasticsearch Loader]

Slide 29

Data Pipeline_
[Diagram: Database → PGQ → Kafka → Elasticsearch / Redshift]

Slide 30

CLOJURE_

Slide 31

Why Clojure?

Slide 32

Why Clojure?_
• JVM
• Concurrency
• Abstractions

Slide 33

Laziness_

(with-resource [c (consumer config)]
  shutdown
  (doseq [message (messages c "topic")]
    (process-message message)))

Slide 34

Laziness_

(let [c (consumer config)]
  (try
    (doseq [message (messages c "topic")]
      (process-message message))
    (finally
      (shutdown c))))

Slide 35

Laziness_

(defn process-messages [messages]
  (doseq [message messages]
    (process-message message)))
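
The thread running through these three slides: messages returns a lazy sequence, so processing only happens when something forces that sequence. A sketch of the trap as I read it (not a slide from the talk): if the lazy seq escapes the with-resource scope, shutdown runs before anything is realized, and consuming it afterwards fails.

;; Broken: the unrealized lazy seq is returned out of with-resource,
;; so the consumer is shut down before any message is read.
(def stream
  (with-resource [c (consumer config)]
    shutdown
    (messages c "topic")))
;; (process-messages stream) ; too late: the consumer is already closed

Keeping the doseq inside the scope, as on the previous slides, forces the sequence while the consumer is still alive.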

Slide 36

JVM_

Slide 37

Concurrency_
• Regular threads weren't enough
• clojure.core.async wasn't even enough
• Actors were the right abstraction for our problem

Slide 38

Actors_
• Allowed us to think about a single merchant at a time
• Put back pressure on Kafka
• Provided a mechanism for ensuring the process continued to work in spite of failures
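
The talk doesn't show the actor code. To illustrate the shape of the idea, here is a sketch using plain Clojure agents, one per merchant, so each merchant's messages are handled serially while different merchants run in parallel. It is not the talk's implementation, and agents alone don't provide the supervision or Kafka back pressure mentioned above.

(def merchant-agents (atom {}))

(defn merchant-agent
  "Returns the agent that owns this merchant, creating it on first use."
  [merchant-id]
  (get (swap! merchant-agents
              (fn [m] (if (contains? m merchant-id)
                        m
                        (assoc m merchant-id (agent nil)))))
       merchant-id))

(defn dispatch [message]
  ;; Actions sent to one agent run one at a time, so a single merchant's
  ;; messages are never processed concurrently.
  (send-off (merchant-agent (:merchant-id message))
            (fn [_] (process-message message))))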

Slide 39

Elasticsearch provides a similar abstraction

Slide 40

Merchant Abstractions_
[Diagram: numbered merchant aliases mapped onto Braintree's Elasticsearch indices]

Slide 41

Why Aliases?_
• curl localhost:9200/6/transaction/_count
• Searches are guaranteed to be scoped to a merchant
• Keeping all of a merchant's transactions together greatly improves performance
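
The 6 in that curl is an alias, not an index. One way to get the behavior described (a sketch, not necessarily Braintree's setup) is a filtered, routed alias per merchant over a shared index, created through Elasticsearch's standard _aliases endpoint; here via clj-http with hypothetical index and field names.

(require '[clj-http.client :as http]
         '[cheshire.core :as json])

;; Alias "6" scopes every search to merchant 6's documents, and routing
;; keeps that merchant's documents together on the same shard.
(http/post "http://localhost:9200/_aliases"
  {:content-type :json
   :body (json/generate-string
          {:actions [{:add {:index   "transactions"
                            :alias   "6"
                            :routing "6"
                            :filter  {:term {:merchant_id "6"}}}}]})})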

Slide 42

[Diagram: Database → PGQ → Kafka → Elasticsearch / Redshift]

Slide 43

Lessons Learned

Slide 44

Lessons Learned_
• GC is the enemy!
• Use the Garbage-First garbage collector (G1GC)
• Tune your heap size
• Monitor everything (JMX is great for this)
• Don't use the default configs
• Use a model that avoids race conditions
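
In Leiningen terms, the GC and JMX advice amounts to JVM options like these. The flag names are standard HotSpot/JMX options, but the values are illustrative; the talk doesn't give its exact settings.

;; project.clj (excerpt)
:jvm-opts ["-XX:+UseG1GC"                             ; Garbage-First collector
           "-Xms4g" "-Xmx4g"                          ; fix the heap size once tuned
           "-Dcom.sun.management.jmxremote"           ; expose JMX for monitoring
           "-Dcom.sun.management.jmxremote.port=9010"
           "-Dcom.sun.management.jmxremote.authenticate=false"
           "-Dcom.sun.management.jmxremote.ssl=false"]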

Slide 45

Future Uses_
• Real-time fraud monitoring
• Report generation
• Webhooks
• Any async data processing!

Slide 46

Thank you.