Clojure in Big Data pipeline

Talk given at the Vilnius Clojure meetup about Clojure in the big data pipeline at Vinted.

Saulius Grigaliunas

January 07, 2015

Transcript

  1. Big Data? (http://bigdatapix.tumblr.com)
    Big data is data that exceeds the processing capacity of conventional database systems.
  2. Goals
    • How well is the product doing?
    • Advanced product features using machine learning, collaborative filtering
    • MySQL scalability issues
    • Too slow for “big”-OLAP
    • Can’t store tracking data
  3. Big data ingestion @ Vinted
    • up to 19 billion events / month
    • up to 0.7 billion events / day
    • growing
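To put these volumes in per-second terms, here is a back-of-the-envelope calculation. It assumes a 30-day month and evenly spread traffic; both are simplifying assumptions, not figures from the talk, and real traffic is likely burstier.

```clojure
;; Rough average rates implied by the slide's "up to" figures.
;; Assumptions (not from the talk): 30-day month, uniform arrival.
(def events-per-month 19e9)
(def events-per-day 0.7e9)
(def seconds-per-day (* 24 60 60))

(defn avg-per-second
  "Average events per second for `events` spread over `days` days."
  [events days]
  (/ events (* days seconds-per-day)))

(def monthly-rate (long (avg-per-second events-per-month 30))) ; roughly 7,300 events/s
(def daily-rate   (long (avg-per-second events-per-day 1)))    ; roughly 8,100 events/s
```

Both averages sit well below the 70 000 writes per second quoted for the messaging system on the next slide.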
  4. A high throughput distributed messaging system
    • Distributed: partitions messages across multiple nodes
    • Reliable: messages replicated across multiple nodes
    • Persistent: all messages are persisted to disk
    • Performant: works just fine with 70 000 message writes per second. Up to 700Mb/s (when replicating/rebalancing).
  5. • Broker - Kafka server
    • Producer - N producers send messages to Brokers
    • Consumer - N consumers read messages from Brokers. Each at its own pace.
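The ideas on the last two slides (messages partitioned by key, consumers reading at their own pace) can be illustrated with a toy in-memory model. This is not the Kafka API: `new-log`, `partition-for`, `append-message` and `poll` are hypothetical names, and `hash` is only a stand-in for whatever partitioner a real client uses.

```clojure
;; Toy in-memory model of a partitioned, append-only log.
;; NOT the Kafka API; every function name here is hypothetical.

(defn new-log
  "A log is a vector of partitions; each partition is a vector of messages."
  [partition-count]
  (vec (repeat partition-count [])))

(defn partition-for
  "Picks a partition from the message key. `hash` is a stand-in partitioner."
  [log k]
  (mod (hash k) (count log)))

(defn append-message
  "Producer side: appends value `v` to the partition chosen by key `k`."
  [log k v]
  (update-in log [(partition-for log k)] conj v))

(defn poll
  "Consumer side: returns [messages new-offset] for one partition,
  reading everything from `offset` onward. The log itself is unchanged,
  so every consumer keeps its own offset and reads at its own pace."
  [log partition offset]
  (let [msgs (subvec (nth log partition) offset)]
    [msgs (+ offset (count msgs))]))

;; Usage: two messages with the same key land in the same partition, in order.
(def log (-> (new-log 3)
             (append-message "user-1" {:event "view"})
             (append-message "user-1" {:event "click"})))
(def p (partition-for log "user-1"))
(def slow-consumer (poll log p 0)) ; starts from the beginning
(def fast-consumer (poll log p 1)) ; has already read the first message
```

Because reading never removes messages, a consumer that falls behind (or is down for a while) can resume from its last offset as long as the data is still retained.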
  6. @ Vinted
    • 6 Broker nodes
    • ~4TB of usable space
    • Allows us to safely keep 3+ weeks' worth of event tracking data (= Hadoop can be down for 3 weeks without data loss)
  7. @ Vinted
    • 2 services in production. Deployed across 39 + 9 hosts.
    • 1 service in pre-production state.
    • 1 service in hackathon state.
    • 4 people proficient at writing Clojure code.
    • Clojure code written using Sublime, Vim, Emacs.
    • ~1800 LOC, ~1500 lines of test code
    • Still learning
  8. Why Clojure?
    • Dynamically typed
    • Easier transition from Ruby
    • REPL++
    • Ruby is an acceptable Lisp?
    • Eco-system
    • Native Avro support
    • Native Kafka library (JVM)
    • mysql-connector-java
  9. Standard Clojure project @ Vinted
    • Small codebase (micro service?)
    • Continuous testing with Jenkins
    • Continuous deployment with Jenkins + Chef
    • Configuration management with Chef
  10. Standard Clojure project @ Vinted
    • Clojure 1.6
    • Leiningen
    • Trapperkeeper (component pattern)
    • Logback
    • clojure.test
    • Aiming for dev-prod parity with Vagrant + Ansible
  11. Component
    (defrecord Database [host port connection]
      ;; Implement the Lifecycle protocol
      component/Lifecycle
      (start [component]
        (println ";; Starting database")
        (let [conn (connect-to-database host port)]
          (assoc component :connection conn)))
      (stop [component]
        (println ";; Stopping database")
        (.close connection)
        (assoc component :connection nil)))
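A minimal sketch of how such a record moves through its lifecycle. To keep it runnable without dependencies, it defines its own stand-in `Lifecycle` protocol in place of the library's `component/Lifecycle`, and `connect-to-database` is a hypothetical fake that returns an atom instead of a real connection.

```clojure
;; Stand-in for the library's Lifecycle protocol, so the sketch has no
;; dependencies. This is an assumption made for illustration only.
(defprotocol Lifecycle
  (start [component])
  (stop [component]))

;; Hypothetical fake connection: an atom holding the connection state.
(defn connect-to-database [host port]
  (atom {:host host :port port :open? true}))

(defrecord Database [host port connection]
  Lifecycle
  (start [component]
    (if connection
      component ; already started: return unchanged
      (assoc component :connection (connect-to-database host port))))
  (stop [component]
    (when connection
      (swap! connection assoc :open? false)) ; "close" the fake connection
    (assoc component :connection nil)))

;; Usage: each step returns a new immutable record.
(def db      (map->Database {:host "localhost" :port 3306}))
(def started (start db))
(def stopped (stop started))
```

Since `start` and `stop` return updated records rather than mutating the component, a system can be started, stopped and restarted from the REPL, which fits the runtime configuration plans on the next slide.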
  12. What’s next? Dynamic Clojure component configuration
    • Apache Zookeeper
    • Changing log levels
    • Enabling/disabling REPL at runtime
  13. More stream processing
    • Build real time dashboards / reports / OLAP
    • Detect anomalies in event streams - identify failures quicker than we get a new support ticket
    • Join event streams with application metric or logging streams for root cause identification?
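As an illustration of the anomaly detection idea, here is one simple approach: flag a per-interval event count when it deviates from the mean of a trailing window by more than `k` standard deviations. The talk does not say which method Vinted planned to use; the function names, window size and threshold below are all hypothetical.

```clojure
;; Hypothetical sketch: trailing-window outlier detection on event counts.
;; Not taken from the talk; one simple technique among many.

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn std-dev
  "Population standard deviation of xs."
  [xs]
  (let [m (mean xs)]
    (Math/sqrt (mean (map #(let [d (- % m)] (* d d)) xs)))))

(defn anomalies
  "Returns [index count] pairs for every point that lies more than
  k standard deviations from the mean of the previous `window` points.
  `counts` must be a vector."
  [counts window k]
  (for [i (range window (count counts))
        :let [history (subvec counts (- i window) i)
              x       (nth counts i)
              m       (mean history)
              s       (std-dev history)]
        :when (> (Math/abs (- x m)) (* k s))]
    [i x]))

;; Usage: a sudden drop in events per minute (e.g. a broken tracker).
(def counts [100 102 98 101 99 100 5 100])
(def flagged (anomalies counts 5 3))
```

With these sample counts only the drop to 5 at index 6 is flagged. The outlier then inflates the window's variance, so points right after it are judged leniently; a production detector would need to handle that.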
  14. Thanks!
    • The Log: What every software engineer should know about real-time data's unifying abstraction
    • All Aboard the Databus! Linkedin’s Scalable Consistent Change Data Capture Platform
    • Wormhole pub/sub system: Moving data through space and time
    • The “Big Data” Ecosystem at LinkedIn
    • The Unified Logging Infrastructure for Data Analytics at Twitter
    • Kafka: A Distributed Messaging System for Log Processing
    • Building LinkedIn’s Real-time Activity Data Pipeline