multiple nodes
• Reliable: messages are replicated across multiple nodes
• Persistent: all messages are persisted to disk
• Performant: comfortably handles 70,000 message writes per second, up to 700 Mb/s when replicating/rebalancing (see the sketch below)
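A minimal sketch (not our production code) of a durable, replicated write from Clojure, assuming the standard Java producer client (org.apache.kafka.clients.producer) via interop; the broker address, topic and serializers are placeholder assumptions. With acks=all the leader only acknowledges the write once the in-sync replicas have it, which is what backs the "reliable" and "persistent" bullets above:

(ns example.producer
  (:import (org.apache.kafka.clients.producer KafkaProducer ProducerRecord)
           (java.util Properties)))

(defn producer
  "Producer that waits for all in-sync replicas to acknowledge each write."
  []
  (let [props (doto (Properties.)
                (.put "bootstrap.servers" "localhost:9092") ; placeholder broker address
                (.put "acks" "all")                         ; replicate before acknowledging
                (.put "key.serializer" "org.apache.kafka.common.serialization.StringSerializer")
                (.put "value.serializer" "org.apache.kafka.common.serialization.StringSerializer"))]
    (KafkaProducer. props)))

(defn send-event!
  "Send a single key/value event to a topic."
  [^KafkaProducer p topic k v]
  (.send p (ProducerRecord. topic k v)))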
+ 9 hosts.
• 1 service in pre-production state.
• 1 service in hackathon state.
• 4 people proficient at writing Clojure code.
• Clojure code written using Sublime, Vim, Emacs.
• ~1800 LOC, ~1500 lines of test code.
• Still learning.
/ OLAP
• Detect anomalies in event streams: identify failures faster than new support tickets arrive (see the sketch below)
• Join event streams with application metric or logging streams for root-cause identification?
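One way the anomaly idea could look, as a rough sketch only: a hypothetical three-sigma rule over per-window event counts, in plain Clojure with no streaming framework assumed.

(defn stats
  "Mean and standard deviation of a sequence of counts."
  [xs]
  (let [n        (count xs)
        mean     (/ (reduce + xs) n)
        variance (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs)) n)]
    {:mean (double mean) :stddev (Math/sqrt (double variance))}))

(defn anomaly?
  "True when the latest per-window event count deviates from the
   historical windows by more than three standard deviations."
  [history latest]
  (let [{:keys [mean stddev]} (stats history)]
    (and (pos? stddev)
         (> (Math/abs (- latest mean)) (* 3 stddev)))))

;; e.g. events per one-minute window; a sudden drop is flagged
(anomaly? [120 118 131 125 122 119 127] 12)   ;=> true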
about real-time data's unifying abstraction
• All Aboard the Databus! LinkedIn’s Scalable Consistent Change Data Capture Platform
• Wormhole pub/sub system: Moving data through space and time
• The “Big Data” Ecosystem at LinkedIn
• The Unified Logging Infrastructure for Data Analytics at Twitter
• Kafka: A Distributed Messaging System for Log Processing
• Building LinkedIn’s Real-time Activity Data Pipeline