Building a Data Pipeline with Clojure and Kafka

David Pick
October 28, 2014

At some point in its lifetime, every large software application must turn to service-oriented architecture to deal with complexity. This often involves separating data between applications and creating a way for those applications to talk to each other. Inevitably, pieces of the system (e.g. a data warehouse or a search cluster) end up needing to know more about the shape of the data in the main application than a separate piece of architecture should. To combat this issue, Braintree developed a data pipeline built on PGQ, Kafka, Zookeeper, and Clojure. In this talk, David Pick will give an in-depth review of how the data pipeline functions and talk through some of the issues encountered along the way.


Transcript

  1. Building a real-time data pipeline 10.28.2014 David Pick @davidpick

  2. Simple, powerful payments.

  3. None
  4. What is a data pipeline and why do we need one?
  5. What is a data pipeline and why do we need one?_
     BUT FIRST, What problems are we actually trying to solve?
  6. What is a data pipeline and why do we need one?
  7. Keeping your data warehouse in sync is difficult!
  8. Keeping your data warehouse in sync is difficult!_
     • Deletes are nearly impossible to keep track of.
     • You have to keep track of data that changed.
     • Batch updates are really slow and it's difficult to know how long they'll take.
  9. None

  10. Advanced Search_
     • Powered by SQL
     • Ran against our primary database
     • Slow!
     • Extremely limited
  11. Problems We're Trying to Solve_
     • Getting data into our data warehouse
     • Getting data into our search platform
     • Keeping that data in sync
  12. [diagram: Database → Something → Elasticsearch and Redshift]
  13. [same diagram]
  14. [same diagram]
  15. [same diagram]

  16. PGQ_

  17. PGQ_
     • Designed to facilitate asynchronous batch processing of live transactions
     • Events are persisted in a special Postgres table until either every registered consumer has seen them or a set amount of time has passed
     • ACID
     • Fast!
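
     In practice PGQ is driven through a small SQL API. A minimal sketch of a
     consumer loop, assuming the clojure.java.jdbc library, PGQ installed in
     the database, and a queue and consumer already registered with
     pgq.create_queue and pgq.register_consumer; process-event is a
     hypothetical handler, not from the talk:

       (require '[clojure.java.jdbc :as jdbc])

       (def db {:dbtype "postgresql" :dbname "main" :user "pipeline"})

       (defn consume-batch! [queue consumer]
         ;; pgq.next_batch returns NULL (nil here) when no new events are ready
         (let [{:keys [next_batch]}
               (first (jdbc/query db ["SELECT pgq.next_batch(?, ?)" queue consumer]))]
           (when next_batch
             (doseq [event (jdbc/query db ["SELECT * FROM pgq.get_batch_events(?)"
                                           next_batch])]
               (process-event event))               ; hypothetical handler
             ;; finishing the batch lets PGQ eventually delete the events
             (jdbc/query db ["SELECT pgq.finish_batch(?)" next_batch]))))
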
  18. PGQ_ [diagram: Database → PGQ → Elasticsearch and Redshift]

  19. PGQ_

  20. PGQ_ [diagram: Database → PGQ → Elasticsearch and Redshift]

  21. KAFKA_

  22. PGQ_ [diagram: Database → PGQ → Kafka → Elasticsearch and Redshift]

  23. Kafka: Anatomy_ [diagram: several Producers writing into a Kafka Cluster, several Consumers reading from it]
  24. Kafka: Inside a Topic_ [diagram: a topic split into Partition 0, Partition 1, and Partition 2; each partition is an ordered, numbered log with writes appended at the new end]
  25. Kafka: Consumers [diagram: Partition 2's numbered log being read by Consumer 1 and Consumer 2, each tracking its own offset]
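
     The consumer, messages, shutdown, and with-resource calls that appear in
     the Clojure slides later in the deck match the clj-kafka library's API.
     A minimal producing-and-consuming sketch with that library; the broker
     address, ZooKeeper address, group id, and topic name are illustrative:

       (require '[clj-kafka.producer :as p]
                '[clj-kafka.consumer.zk :as zk]
                '[clj-kafka.core :refer [with-resource]])

       ;; Producer: with the default partitioner, messages that share a key
       ;; land on the same partition, preserving per-key ordering.
       (def producer
         (p/producer {"metadata.broker.list" "localhost:9092"
                      "serializer.class"     "kafka.serializer.DefaultEncoder"}))

       (p/send-message producer (p/message "datastream" (.getBytes "payload")))

       ;; Consumer: within a consumer group, each partition is read by at most
       ;; one consumer at a time.
       (def config {"zookeeper.connect" "localhost:2181"
                    "group.id"          "redshift-loader"
                    "auto.offset.reset" "smallest"})

       (with-resource [c (zk/consumer config)]
         zk/shutdown
         (doseq [m (zk/messages c "datastream")]
           (println (String. (:value m)))))
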
  26. How are we using Kafka?

  27. Database_ [diagram: Gateway, Funding, and Apply databases with numbered partitions 0–3, feeding Redshift and Elasticsearch]
  28. Kafka Pipeline_ [diagram: Datastream and Eventstream topics, each with partitions 0–3, flowing into a Redshift Loader and an Elasticsearch Loader]
  29. Data Pipeline_ [diagram: Database → PGQ → Kafka → Elasticsearch and Redshift]

  30. CLOJURE_

  31. Why Clojure?

  32. Why Clojure?_
     • JVM
     • Concurrency
     • Abstractions

  33. Laziness_
     (with-resource [c (consumer config)]
       shutdown
       (doseq [message (messages c "topic")]
         (process-message message)))
  34. Laziness_
     (let [c (consumer config)]
       (try
         (doseq [message (messages c "topic")]
           (process-message message))
         (finally (shutdown c))))
  35. Laziness_
     (defn process-messages [messages]
       (doseq [message messages]
         (process-message message)))
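
     The trap these three slides guard against (presumably the point of the
     "Laziness_" title): messages returns a lazy sequence, so if that sequence
     escapes the resource scope unrealized, it is consumed only after the
     consumer has been shut down. A sketch of the broken version, under the
     same clj-kafka-style API as the slides:

       ;; BROKEN: with-resource shuts the consumer down as soon as the body
       ;; returns, but the body returns an *unrealized* lazy seq.
       (def msgs
         (with-resource [c (consumer config)]
           shutdown
           (messages c "topic")))

       (doseq [message msgs]            ; the consumer is already closed here
         (process-message message))

     Forcing the work inside the scope, with doseq as on the slides (or with
     doall), avoids the problem.
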

  36. JVM_

  37. Concurrency_
     • Regular threads weren't enough
     • clojure.core.async wasn't even enough
     • Actors were the right abstraction for our problem
  38. Actors_
     • Allowed us to think about a single merchant at a time
     • Put back pressure on Kafka
     • Provided a mechanism for ensuring the process continued to work in spite of failures
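
     The slides don't show actor code, but the per-merchant serialization idea
     can be roughly approximated with plain Clojure agents; a real actor
     library adds supervision and mailbox back pressure on top of this.
     merchant-agent, :merchant-id, and process-message are hypothetical names,
     not from the talk:

       ;; One agent per merchant: actions sent to the same agent run one at a
       ;; time, in order, so each merchant's messages are processed serially.
       (def merchant-agents (atom {}))

       (defn merchant-agent [merchant-id]
         (get (swap! merchant-agents
                     (fn [m] (if (contains? m merchant-id)
                               m
                               (assoc m merchant-id (agent nil)))))
              merchant-id))

       (defn dispatch [message]
         (send-off (merchant-agent (:merchant-id message))
                   (fn [_] (process-message message))))   ; hypothetical handler
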
  39. Elasticsearch provides a similar abstraction

  40. Merchant Abstractions_ [diagram: a Braintree Elasticsearch cluster with numbered merchant indexes/aliases]
  41. Why Aliases?_
     • curl localhost:9200/6/transaction/_count
     • Searches are guaranteed to be scoped to a merchant
     • Keeping all of a merchant's transactions together greatly improves performance
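
     The merchant-scoped "index" in that curl command is an Elasticsearch
     filtered alias. A sketch of how such an alias can be set up, in the same
     curl style as the slide; the underlying index name, the routing value,
     and the merchant_id field are illustrative assumptions, not from the
     talk:

       curl -XPOST localhost:9200/_aliases -d '{
         "actions": [
           {"add": {"index": "transactions_001",
                    "alias": "6",
                    "routing": "6",
                    "filter": {"term": {"merchant_id": "6"}}}}
         ]}'

     Queries through alias "6" are then routed to a single shard and filtered
     to that merchant, which is what makes the scoping guarantee and the
     locality win possible.
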
  42. [diagram: Database → PGQ → Kafka → Elasticsearch and Redshift]

  43. Lessons Learned

  44. Lessons Learned_
     • GC is the enemy!
     • Use the Garbage First Garbage Collector (G1GC)
     • Tune your heap size
     • Monitor everything (JMX is great for this)
     • Don't use the default configs
     • Use a model that avoids race conditions
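
     A sketch of what the GC and monitoring advice can look like as Leiningen
     JVM options; the project name and every flag value here are illustrative,
     not the talk's actual configuration:

       ;; project.clj
       (defproject pipeline "0.1.0"
         :jvm-opts ["-Xms4g" "-Xmx4g"                ; fixed heap, tuned to the workload
                    "-XX:+UseG1GC"                   ; Garbage First collector
                    "-XX:MaxGCPauseMillis=200"       ; G1 pause-time target
                    "-Dcom.sun.management.jmxremote" ; expose JMX metrics
                    "-Dcom.sun.management.jmxremote.port=9999"
                    "-Dcom.sun.management.jmxremote.authenticate=false"
                    "-Dcom.sun.management.jmxremote.ssl=false"])
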
  45. Future Uses_
     • Real-time fraud monitoring
     • Report generation
     • Webhooks
     • Any async data processing!
  46. Thank you.