Building a Data Pipeline with Clojure and Kafka

David Pick
October 28, 2014

At some point in its lifetime, every large software application must turn to service-oriented architecture to deal with complexity. This often involves separating data between applications and creating a way for those applications to talk to each other. Inevitably, pieces of the system (e.g. a data warehouse or a search cluster) end up needing to know more about the shape of the data in the main application than a separate piece of architecture should. To combat this issue, Braintree developed a data pipeline built on PGQ, Kafka, Zookeeper, and Clojure. In this talk, David Pick will give an in-depth review of how the data pipeline functions and talk through some of the issues encountered along the way.


Transcript

  1. Building a real-time data pipeline 10.28.2014 David Pick @davidpick

  2. Simple, powerful payments.

  3. None
  4. What is a data pipeline and why do we need one?
  5. What is a data pipeline and why do we need one?_
     BUT FIRST, What problems are we actually trying to solve?
  6. What is a data pipeline and why do we need one?
  7. Keeping your data warehouse in sync is difficult!
  8. Keeping your data warehouse in sync is difficult!_
     • Deletes are nearly impossible to keep track of.
     • You have to keep track of data that changed.
     • Batch updates are really slow and it's difficult to know how long they'll take.
  9. None

  10. Advanced Search_
     • Powered by SQL
     • Ran against our primary database
     • Slow!
     • Extremely limited
  11. Problems We're Trying to Solve_
     • Getting data into our data warehouse
     • Getting data into our search platform
     • Keeping that data in sync
  12. [diagram: Database → Something → Elasticsearch and Redshift]
  13. [same diagram]
  14. [same diagram]
  15. [same diagram]

  16. PGQ_

  17. PGQ_
     • Designed to facilitate asynchronous batch processing of live transactions
     • Events are persisted in a special Postgres table until either every registered consumer has seen them or a set amount of time has passed
     • ACID
     • Fast!
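
     In practice PGQ is driven through a small SQL API. A minimal sketch of a
     consumer loop, assuming the clojure.java.jdbc library, PGQ installed in
     the database, and a queue and consumer already registered with
     pgq.create_queue and pgq.register_consumer; process-event is a
     hypothetical handler, not from the talk:

       (require '[clojure.java.jdbc :as jdbc])

       (def db {:dbtype "postgresql" :dbname "main" :user "pipeline"})

       (defn consume-batch! [queue consumer]
         ;; pgq.next_batch returns NULL (nil here) when no new events are ready
         (let [{:keys [next_batch]}
               (first (jdbc/query db ["SELECT pgq.next_batch(?, ?)" queue consumer]))]
           (when next_batch
             (doseq [event (jdbc/query db ["SELECT * FROM pgq.get_batch_events(?)"
                                           next_batch])]
               (process-event event))               ; hypothetical handler
             ;; finishing the batch lets PGQ eventually delete the events
             (jdbc/query db ["SELECT pgq.finish_batch(?)" next_batch]))))
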
  18. PGQ_ [diagram: Database → PGQ → Elasticsearch and Redshift]

  19. PGQ_

  20. PGQ_ [diagram: Database → PGQ → Elasticsearch and Redshift]

  21. KAFKA_

  22. PGQ_ [diagram: Database → PGQ → Kafka → Elasticsearch and Redshift]

  23. Kafka: Anatomy_ [diagram: several Producers writing into a Kafka Cluster, several Consumers reading from it]
  24. Kafka: Inside a Topic_ [diagram: a topic split into Partition 0, Partition 1, and Partition 2; each partition is an ordered, numbered log with writes appended at the new end]
  25. Kafka: Consumers [diagram: Partition 2's numbered log being read by Consumer 1 and Consumer 2, each tracking its own offset]
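
     The consumer, messages, shutdown, and with-resource calls that appear in
     the Clojure slides later in the deck match the clj-kafka library's API.
     A minimal producing-and-consuming sketch with that library; the broker
     address, ZooKeeper address, group id, and topic name are illustrative:

       (require '[clj-kafka.producer :as p]
                '[clj-kafka.consumer.zk :as zk]
                '[clj-kafka.core :refer [with-resource]])

       ;; Producer: with the default partitioner, messages that share a key
       ;; land on the same partition, preserving per-key ordering.
       (def producer
         (p/producer {"metadata.broker.list" "localhost:9092"
                      "serializer.class"     "kafka.serializer.DefaultEncoder"}))

       (p/send-message producer (p/message "datastream" (.getBytes "payload")))

       ;; Consumer: within a consumer group, each partition is read by at most
       ;; one consumer at a time.
       (def config {"zookeeper.connect" "localhost:2181"
                    "group.id"          "redshift-loader"
                    "auto.offset.reset" "smallest"})

       (with-resource [c (zk/consumer config)]
         zk/shutdown
         (doseq [m (zk/messages c "datastream")]
           (println (String. (:value m)))))
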
  26. How are we using Kafka?

  27. Database_ [diagram: Gateway, Funding, and Apply databases with numbered partitions 0–3, feeding Redshift and Elasticsearch]
  28. Kafka Pipeline_ [diagram: Datastream and Eventstream topics, each with partitions 0–3, flowing into a Redshift Loader and an Elasticsearch Loader]
  29. Data Pipeline_ [diagram: Database → PGQ → Kafka → Elasticsearch and Redshift]

  30. CLOJURE_

  31. Why Clojure?

  32. Why Clojure?_
     • JVM
     • Concurrency
     • Abstractions

  33. Laziness_
     (with-resource [c (consumer config)]
       shutdown
       (doseq [message (messages c "topic")]
         (process-message message)))
  34. Laziness_
     (let [c (consumer config)]
       (try
         (doseq [message (messages c "topic")]
           (process-message message))
         (finally (shutdown c))))
  35. Laziness_
     (defn process-messages [messages]
       (doseq [message messages]
         (process-message message)))
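
     The trap these three slides guard against (presumably the point of the
     "Laziness_" title): messages returns a lazy sequence, so if that sequence
     escapes the resource scope unrealized, it is consumed only after the
     consumer has been shut down. A sketch of the broken version, under the
     same clj-kafka-style API as the slides:

       ;; BROKEN: with-resource shuts the consumer down as soon as the body
       ;; returns, but the body returns an *unrealized* lazy seq.
       (def msgs
         (with-resource [c (consumer config)]
           shutdown
           (messages c "topic")))

       (doseq [message msgs]            ; the consumer is already closed here
         (process-message message))

     Forcing the work inside the scope, with doseq as on the slides (or with
     doall), avoids the problem.
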

  36. JVM_

  37. Concurrency_
     • Regular threads weren't enough
     • clojure.core.async wasn't even enough
     • Actors were the right abstraction for our problem
  38. Actors_
     • Allowed us to think about a single merchant at a time
     • Put back pressure on Kafka
     • Provided a mechanism for ensuring the process continued to work in spite of failures
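
     The slides don't show actor code, but the per-merchant serialization idea
     can be roughly approximated with plain Clojure agents; a real actor
     library adds supervision and mailbox back pressure on top of this.
     merchant-agent, :merchant-id, and process-message are hypothetical names,
     not from the talk:

       ;; One agent per merchant: actions sent to the same agent run one at a
       ;; time, in order, so each merchant's messages are processed serially.
       (def merchant-agents (atom {}))

       (defn merchant-agent [merchant-id]
         (get (swap! merchant-agents
                     (fn [m] (if (contains? m merchant-id)
                               m
                               (assoc m merchant-id (agent nil)))))
              merchant-id))

       (defn dispatch [message]
         (send-off (merchant-agent (:merchant-id message))
                   (fn [_] (process-message message))))   ; hypothetical handler
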
  39. Elasticsearch provides a similar abstraction

  40. Merchant Abstractions_ [diagram: a Braintree Elasticsearch cluster with numbered merchant indexes/aliases]
  41. Why Aliases?_
     • curl localhost:9200/6/transaction/_count
     • Searches are guaranteed to be scoped to a merchant
     • Keeping all of a merchant's transactions together greatly improves performance
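
     The merchant-scoped "index" in that curl command is an Elasticsearch
     filtered alias. A sketch of how such an alias can be set up, in the same
     curl style as the slide; the underlying index name, the routing value,
     and the merchant_id field are illustrative assumptions, not from the
     talk:

       curl -XPOST localhost:9200/_aliases -d '{
         "actions": [
           {"add": {"index": "transactions_001",
                    "alias": "6",
                    "routing": "6",
                    "filter": {"term": {"merchant_id": "6"}}}}
         ]}'

     Queries through alias "6" are then routed to a single shard and filtered
     to that merchant, which is what makes the scoping guarantee and the
     locality win possible.
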
  42. [diagram: Database → PGQ → Kafka → Elasticsearch and Redshift]

  43. Lessons Learned

  44. Lessons Learned_
     • GC is the enemy!
     • Use the Garbage First Garbage Collector (G1GC)
     • Tune your heap size
     • Monitor everything (JMX is great for this)
     • Don't use the default configs
     • Use a model that avoids race conditions
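
     A sketch of what the GC and monitoring advice can look like as Leiningen
     JVM options; the project name and every flag value here are illustrative,
     not the talk's actual configuration:

       ;; project.clj
       (defproject pipeline "0.1.0"
         :jvm-opts ["-Xms4g" "-Xmx4g"                ; fixed heap, tuned to the workload
                    "-XX:+UseG1GC"                   ; Garbage First collector
                    "-XX:MaxGCPauseMillis=200"       ; G1 pause-time target
                    "-Dcom.sun.management.jmxremote" ; expose JMX metrics
                    "-Dcom.sun.management.jmxremote.port=9999"
                    "-Dcom.sun.management.jmxremote.authenticate=false"
                    "-Dcom.sun.management.jmxremote.ssl=false"])
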
  45. Future Uses_
     • Real-time fraud monitoring
     • Report generation
     • Webhooks
     • Any async data processing!
  46. Thank you.