Clojure 4 BigData

Slides from my talk at AI and Big Data Day 2018 in Lviv

Michael Pershyn

November 03, 2018

Transcript

  1. Clojure 4 Big Data Michael Pershyn 2018-11-03

  2. About me and why Clojure 4 Big Data

    • Making software since 2005, working with Big Data since 2012
    • Work for ADITION Technologies AG
      – Leading European ad-serving provider
      – Part of the European tech stack VirtualMinds
      – >2.5 billion events per day processed in real time
      – Extra ~12 billion data points in (batch) ETL daily
      – 250 TB of data in a Hadoop data lake
      – Several own data centers
      – Low latency requirements
      – Written mostly in Clojure
  3. 3

  4. Agenda

    • Why Clojure in 3 Minutes
    • Apache Storm
    • Apache Trident
    • Incanter
    • Cascalog
  5. Why Clojure?

  6.
    • Makes you think differently, approach problems differently, and solve them faster
    • Immutability, functions and map-reduce
    • Powerful, interactive, small, concise
    • Makes it hard to fall back to imperative style
  7. 7

  8.
    • Distributed real-time computation system
    • Apache Top-Level Project since September 2014
    • Free and open source

  9. Core Concepts of Storm

    • Spouts
    • Bolts
    • Topology
    • Stream
    • Cluster (Nimbus & Workers)

  10. Storm and Clojure

  11. 11

  12. 12

  13. Storm Pros and Cons

    • No "exactly once" guarantee
    • Fast, simple
    • Multitenancy and debugging
    • Integrations

  14. Trident

    • The "Cascading" of Storm
    • High-level abstraction processing library on top of Storm
    • Rich API with joins, aggregations, grouping, etc.
    • Provides stateful, exactly-once processing primitives

  15. Marceline

    Marceline provides a DSL that allows you to define all of the primitives that Trident has to offer from Clojure
  16. 16

  17. 17

  18. Trident compiles to Storm

  19. Incanter

  20. 20

  21. Incanter and openhub.net

  22. Cascalog

  23.
    • Cascading - a Java API for
      – defining complex data flows
      – integrating those flows with back-end systems
      – query planner for mapping and executing logical flows onto a computing platform
    • Cascalog – Clojure DSL for Cascading
  24. Cascading Concepts

    • Decouple application logic from integration
    • Flow, source, sink, taps, schemes

  25. Cascading Pros and Cons

    • Hive
      – Pros: SQL (non-standard); low learning curve; UDF
      – Cons: testability; reusability; flow control; spread logic; UDF programming
    • Pig
      – Pros: Pig Latin; low learning curve; UDF
      – Cons: testability; reusability; spread logic; UDF programming
    • Cascading
      – Pros: Java API; unit testable; flow control (if, try-catch); good reusability
      – Cons: programming
  26. 26

  27. https://hortonworks.com/blog/cascading-hadoop-big-data-whatever/

  28. Trident and Cascalog

    • Trident for Storm is like Cascading for Hadoop
  29. "Simplicity is about living life with more enjoyment and less pain" - John Maeda
    https://www.ted.com/speakers/john_maeda

  30. There are also other Clojure tools

    • Flambo – Clojure DSL for Apache Spark
    • Riemann (http://riemann.io/) – monitors distributed systems
    • ...

  31. Thanks! Questions?
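
Slide 6 lists "immutability, functions and map-reduce" as reasons for choosing Clojure. The following is a minimal sketch of that idea in plain Clojure. It is not taken from the slides: the `events` data and the `tokenize` and `count-words` names are hypothetical, chosen only for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input: each string stands for one line of an event log.
(def events ["click view" "view" "click click"])

(defn tokenize
  "Splits one line into its whitespace-separated words."
  [line]
  (str/split line #"\s+"))

(defn count-words
  "Map step: expand every line into words (mapcat).
   Reduce step: fold the words into an immutable map of word -> count."
  [lines]
  (->> lines
       (mapcat tokenize)
       (reduce (fn [acc word] (update acc word (fnil inc 0))) {})))

;; The input vector is never modified; every step returns a new value.
(println (count-words events))
```

The same map-then-reduce shape is what the higher-level tools in the talk (Trident aggregations, Cascalog queries) express at cluster scale, though their actual APIs differ from this sketch.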