
Building a Data Pipeline with Clojure and Kafka

David Pick
October 28, 2014


At some point in every large software application's lifetime, it must turn to service-oriented architecture to deal with complexity. This often involves separating data between applications and creating a way for those applications to talk to each other. Inevitably, pieces of the system (e.g. a data warehouse and search) end up needing to know more about the shape of the data in the main application than a separate piece of architecture should. To combat this issue, Braintree developed a data pipeline built on PGQ, Kafka, Zookeeper, and Clojure. In this talk, David Pick will give an in-depth review of how the data pipeline functions and talk through some of the issues encountered along the way.


Transcript

  1. But first: what problems are we actually trying to solve? What is a data pipeline and why do we need one?
  2. Keeping your data warehouse in sync is difficult!
     • Deletes are nearly impossible to keep track of.
     • You have to keep track of data that changed.
     • Batch updates are really slow and it's difficult to know how long they'll take.
  3. Advanced Search
     • Powered by SQL
     • Ran against our primary database
     • Slow!
     • Extremely limited
  4. Problems We're Trying to Solve
     • Getting data into our data warehouse
     • Getting data into our search platform
     • Keeping that data in sync
  5. PGQ
     • Designed to facilitate asynchronous batch processing of live transactions
     • Events are persisted in a special Postgres table until either every registered consumer has seen them or a set amount of time has passed
     • ACID
     • Fast!
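
     A registered consumer polls PGQ for batches of these persisted events. As a rough, hypothetical sketch of that polling loop from Clojure (the connection map, queue and consumer names, and the use of clojure.java.jdbc are illustrative assumptions, not code from the talk):

       (require '[clojure.java.jdbc :as jdbc])

       (def db {:dbtype "postgresql" :dbname "pipeline"}) ; placeholder connection

       ;; One polling cycle: grab the next batch of events (if any), process
       ;; each event, then finish the batch so PGQ can eventually drop the rows.
       (defn consume-batch! [queue consumer process-event!]
         (let [[{batch :next_batch}]
               (jdbc/query db ["SELECT pgq.next_batch(?, ?)" queue consumer])]
           (when batch
             (doseq [event (jdbc/query db ["SELECT * FROM pgq.get_batch_events(?)" batch])]
               (process-event! event))
             (jdbc/query db ["SELECT pgq.finish_batch(?)" batch]))))
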
  6. Kafka: Inside a Topic
     (Diagram: a topic split into Partition 0, Partition 1, and Partition 2; each partition is an ordered, append-only log of numbered offsets, with old messages at one end and new writes appended at the other.)
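
     A minimal Clojure sketch of writing to a partitioned topic, using the current Java producer client via interop (the talk predates this client; broker address, topic, and key are placeholders). With the default partitioner, messages that share a key always land in the same partition, which preserves per-key ordering:

       (import '(org.apache.kafka.clients.producer KafkaProducer ProducerRecord)
               '(java.util Properties))

       (def props
         (doto (Properties.)
           (.put "bootstrap.servers" "localhost:9092")
           (.put "key.serializer"   "org.apache.kafka.common.serialization.StringSerializer")
           (.put "value.serializer" "org.apache.kafka.common.serialization.StringSerializer")))

       ;; Each send appends to the "new" end of whichever partition the key
       ;; hashes to, and the broker assigns the next offset in that partition.
       (with-open [producer (KafkaProducer. props)]
         (doseq [i (range 5)]
           (.send producer (ProducerRecord. "transactions" "merchant-1" (str "event-" i)))))
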
  7. Kafka: Consumers
     (Diagram: a partition shown as a numbered log, with Consumer 1 and Consumer 2 each reading from it at their own offset.)
  8. Database
     (Diagram: the Gateway, Funding, and Apply databases feeding the pipeline, with Redshift and Elasticsearch as destinations.)
  9. Kafka Pipeline
     (Diagram: a Datastream topic and an Eventstream topic, each with partitions 0–3, feeding a Redshift loader and an Elasticsearch loader.)
  10. Laziness
      ;; Consume a lazy sequence of messages from a topic, making sure the
      ;; consumer is shut down even if processing throws.
      (let [c (consumer config)]
        (try
          (doseq [message (messages c "topic")]
            (process-message message))
          (finally
            (shutdown c))))
  11. Concurrency
      • Regular threads weren't enough
      • clojure.core.async wasn't even enough
      • Actors were the right abstraction for our problem
  12. Actors
      • Allowed us to think about a single merchant at a time
      • Put back pressure on Kafka
      • Provided a mechanism for ensuring the process continued to work in spite of failures
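
      The talk doesn't show the actor code itself, so this is only a rough approximation of the "one actor per merchant" idea using plain Clojure agents (the library choice, names, and the :merchant-id routing key are assumptions; a real actor system would also supply the back pressure and failure handling mentioned above):

        ;; One agent per merchant; each agent processes its merchant's
        ;; messages serially and in order, independent of other merchants.
        (def merchant-actors (atom {}))

        (defn actor-for [merchant-id]
          (get (swap! merchant-actors
                      (fn [actors]
                        (if (contains? actors merchant-id)
                          actors
                          (assoc actors merchant-id (agent nil)))))
               merchant-id))

        (defn dispatch! [message]
          (send-off (actor-for (:merchant-id message))
                    (fn [_] (process-message message))))
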
  13. Why Aliases?
      • curl localhost:9200/6/transaction/_count
      • Searches are guaranteed to be scoped to a merchant
      • Keeping all of a merchant's transactions together greatly improves performance
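
      The "6" in the curl example above is presumably a per-merchant alias. A hedged sketch of creating such an alias with clj-http and cheshire (index name, field name, and merchant id are illustrative): the filter scopes every search through the alias to one merchant, and routing keeps that merchant's documents together on one shard.

        (require '[clj-http.client :as http]
                 '[cheshire.core :as json])

        (http/post "http://localhost:9200/_aliases"
          {:content-type :json
           :body (json/generate-string
                   {:actions [{:add {:index   "transactions"
                                     :alias   "6"
                                     :filter  {:term {:merchant_id "6"}}
                                     :routing "6"}}]})})
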
  14. Lessons Learned
      • GC is the enemy!
      • Use the Garbage First Garbage Collector (G1GC)
      • Tune your heap size
      • Monitor everything (JMX is great for this)
      • Don't use the default configs
      • Use a model that avoids race conditions
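
      As a concrete but illustrative example of the GC advice, these are the kinds of JVM options you might set in a Leiningen project.clj; the exact flags and sizes used at Braintree aren't given in the talk, so treat the values as placeholders to tune:

        :jvm-opts ["-XX:+UseG1GC"                    ; Garbage First collector
                   "-Xms4g" "-Xmx4g"                 ; explicit heap size instead of defaults
                   "-XX:MaxGCPauseMillis=200"        ; G1 pause-time target
                   "-Dcom.sun.management.jmxremote"] ; expose JMX for monitoring
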
  15. Future Uses
      • Real-time fraud monitoring
      • Report generation
      • Webhooks
      • Any async data processing!