Clojure 4 BigData

Slides from my talk at AI and Big Data Day 2018 in Lviv

Michael Pershyn

November 03, 2018

Transcript

  1. 2.

    About me and why Clojure 4 Big Data

    • Making software since 2005, working with Big Data since 2012
    • Work for ADITION Technologies AG
      – Leading European ad-serving provider
      – Part of the European tech stack VirtualMinds
      – >2.5 billion events per day processed in real time
      – An extra ~12 billion data points in (batch) ETL daily
      – 250 TB of data in a Hadoop data lake
      – Several own data centers
      – Low-latency requirements
      – Written mostly in Clojure
  2. 3.

    3

  3. 4.

    Agenda

    • Why Clojure in 3 Minutes
    • Apache Storm
    • Apache Trident
    • Incanter
    • Cascalog
  4. 6.

    • Makes you think differently, approach problems differently, and solve them faster
    • Immutability, functions and map-reduce
    • Powerful, interactive, small, concise
    • Makes it hard to fall back to imperative style
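
The "immutability, functions and map-reduce" point can be illustrated with a minimal sketch in plain Clojure. The event strings and the function names (`parse-event`, `clicks-per-banner`) are made up for illustration and are not taken from the talk.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input: one "action banner-id" string per event.
(def events
  ["click banner-1" "view banner-2" "click banner-2" "click banner-1"])

(defn parse-event
  "Turns a raw line into an immutable map."
  [line]
  (let [[action banner] (str/split line #" ")]
    {:action action :banner banner}))

(defn clicks-per-banner
  "Map step: parse and project. Reduce step: count per banner.
  No collection is modified in place; every step returns a new value."
  [lines]
  (->> lines
       (map parse-event)
       (filter #(= "click" (:action %)))
       (map :banner)
       (reduce (fn [counts banner] (update counts banner (fnil inc 0))) {})))

(clicks-per-banner events)
;; => {"banner-1" 2, "banner-2" 1}
```

Because `events` is immutable, it can be re-evaluated at the REPL or shared between threads without defensive copying.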
  5. 7.

    7

  6. 8.
  7. 9.

    Core Concepts of Storm

    • Spouts
    • Bolts
    • Topology
    • Stream
    • Cluster (Nimbus & Workers)
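
How these concepts fit together can be sketched as a conceptual model in plain Clojure. This is not the Storm API: a real spout is an unbounded source and bolts run distributed across workers. All names below (`sentence-spout`, `split-bolt`, `count-bolt`, `run-topology`) are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Spout: the source of a stream of tuples.
;; Here a small finite vector; in Storm the stream would be unbounded.
(defn sentence-spout []
  ["storm processes streams" "clojure processes data"])

;; Bolt: consumes one tuple and emits zero or more tuples.
(defn split-bolt [sentence]
  (str/split sentence #"\s+"))

;; A stateful bolt, modeled as a reducing step over the incoming stream.
(defn count-bolt [counts word]
  (update counts word (fnil inc 0)))

;; Topology: the wiring of the spout to the bolts.
(defn run-topology []
  (->> (sentence-spout)
       (mapcat split-bolt)
       (reduce count-bolt {})))

(run-topology)
;; => {"storm" 1, "processes" 2, "streams" 1, "clojure" 1, "data" 1}
```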
  8. 11.

    11

  9. 12.

    12

  10. 13.

    Storm Pros and Cons

    • No “exactly once” guarantee
    • Fast, simple
    • Multitenancy and debugging
    • Integrations
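
Without an "exactly once" guarantee, a tuple may be replayed after a failure, so downstream logic has to tolerate duplicates. A minimal sketch of one common workaround, deduplicating by event id, in plain Clojure. The `:id` field and all function names are assumptions for illustration, not part of Storm.

```clojure
;; Hypothetical delivered events; the event with :id 1 was replayed.
(def delivered
  [{:id 1 :clicks 1} {:id 2 :clicks 1} {:id 1 :clicks 1}])

(defn naive-total
  "Counts every delivery, so replays inflate the result."
  [events]
  (reduce + (map :clicks events)))

(defn apply-once
  "Applies an event only if its id has not been seen before."
  [{:keys [seen] :as state} {:keys [id clicks]}]
  (if (contains? seen id)
    state
    (-> state
        (update :seen conj id)
        (update :total + clicks))))

(defn idempotent-total [events]
  (:total (reduce apply-once {:seen #{} :total 0} events)))

(naive-total delivered)      ;; => 3
(idempotent-total delivered) ;; => 2
```

Keeping every seen id is only workable for bounded windows; it is shown here to make the duplicate problem visible, not as a production design.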
  11. 14.

    Trident

    • The “Cascading” of Storm
    • High-level abstraction processing library on top of Storm
    • Rich API with joins, aggregations, grouping, etc.
    • Provides stateful, exactly-once processing primitives
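
The idea behind stateful, exactly-once updates can be sketched in plain Clojure: tuples arrive in batches with a transaction id, the state stores the id of the last applied batch, and a replayed batch is skipped. This is a simplified model assuming batches are applied in order and a replayed batch has the same contents. It does not use the Trident API, and `apply-batch` and the `:txid` and `:tuples` keys are hypothetical names.

```clojure
(defn apply-batch
  "Applies a batch to the state unless its txid was already applied."
  [state {:keys [txid tuples]}]
  (if (= txid (:txid state))
    state ; replayed batch: the state already contains its effect
    {:txid txid
     :total (+ (:total state) (count tuples))}))

;; Hypothetical batches; batch 2 is delivered twice after a failure.
(def batches
  [{:txid 1 :tuples ["a" "b"]}
   {:txid 2 :tuples ["c"]}
   {:txid 2 :tuples ["c"]}
   {:txid 3 :tuples ["d" "e"]}])

(reduce apply-batch {:txid nil :total 0} batches)
;; => {:txid 3, :total 5}
```

Storing the transaction id together with the value is what makes the update safe to retry: the check and the write refer to the same piece of state.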
  12. 15.

    Marceline

    Marceline provides a DSL that allows you to define all of the primitives that Trident has to offer from Clojure
  13. 16.

    16

  14. 17.

    17

  15. 20.

    20

  16. 23.

    • Cascading - a Java API
      – defining complex data flows
      – integrating those flows with back-end systems
      – query planner for mapping and executing logical flows onto a computing platform
    • Cascalog - a Clojure DSL for Cascading
  17. 25.

    Cascading Pros and Cons

    Pros
    • Hive: SQL (non-standard); low learning curve; UDF
    • Pig: Pig Latin; low learning curve; UDF
    • Cascading: Java API; unit testable; flow control (if, try-catch); good reusability

    Cons
    • Hive: testability; reusability; flow control; spread logic; UDF programming
    • Pig: testability; reusability; spread logic; UDF programming
    • Cascading: programming
  18. 26.

    26

  19. 29.

    “Simplicity is about living life with more enjoyment and less pain”
    - John Maeda, https://www.ted.com/speakers/john_maeda
  20. 30.

    There are also other Clojure tools

    • Flambo - a Clojure DSL for Apache Spark
    • Riemann (http://riemann.io/) - monitors distributed systems
    • ...