Clojure 4 BigData

Slides to my talk on AI and Big Data Day 2018 in Lviv

Michael Pershyn

November 03, 2018

Transcript

  1. 2 About me and why Clojure 4 Big Data

    • Make software since 2005, work with Big Data since 2012
    • Work for ADITION Technologies AG
      – Leading European adserving provider
      – Part of the European tech stack VirtualMinds
      – >2.5 billion events per day processed in real time
      – Extra ~12 billion data points in (batch) ETL daily
      – 250 TB of data in the Hadoop data lake
      – Several own data centers
      – Low latency requirements
      – Written mostly in Clojure
  2. 3

  3. 4 Agenda

    • Why Clojure in 3 Minutes
    • Apache Storm
    • Apache Trident
    • Incanter
    • Cascalog
  4. 6

    • Makes you think differently, approach problems differently, and solve them faster
    • Immutability, functions and map-reduce
    • Powerful, interactive, small, concise
    • Makes it hard to fall back to imperative style
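
    The "immutability, functions and map-reduce" point can be illustrated with a minimal sketch in plain Clojure. The event data and the function name `clicks-per-campaign` are hypothetical and not taken from the talk; only `clojure.core` functions are used.

```clojure
;; Hypothetical sample data: a few ad events as plain immutable maps.
(def events
  [{:campaign "a" :type :click}
   {:campaign "b" :type :view}
   {:campaign "a" :type :view}
   {:campaign "a" :type :click}])

;; "Map" side: keep the clicks and project each one to its campaign id.
;; "Reduce" side: fold the ids into a map of campaign -> count.
(defn clicks-per-campaign [evts]
  (->> evts
       (filter #(= :click (:type %)))
       (map :campaign)
       (reduce (fn [acc campaign] (update acc campaign (fnil inc 0))) {})))

(clicks-per-campaign events)
;; => {"a" 2}

;; Immutability: conj returns a new vector, the original is untouched.
(def more-events (conj events {:campaign "b" :type :click}))

(count events)
;; => 4
(clicks-per-campaign more-events)
;; => {"a" 2, "b" 1}
```

    Because every step returns a new value, the same pipeline can be re-run on `events` and `more-events` without one call affecting the other.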
  5. 7

  6. 9 Core Concepts of Storm

    • Spouts
    • Bolts
    • Topology
    • Stream
    • Cluster (Nimbus & Workers)
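
    As a rough mental model of how spouts, bolts, and a topology relate, the sketch below mimics them with plain Clojure functions in a single process. It deliberately does not use the Storm API: all names (`sentence-spout`, `split-bolt`, `count-bolt`, `run-topology`) are hypothetical, and a finite vector stands in for an unbounded stream. In a real cluster, Nimbus would distribute this work across workers.

```clojure
(require '[clojure.string :as str])

;; Spout: the source of tuples. A finite vector stands in for an endless stream.
(defn sentence-spout []
  ["the quick brown fox" "the lazy dog"])

;; Bolt: consumes one tuple and emits zero or more new tuples.
(defn split-bolt [sentence]
  (str/split sentence #" "))

;; Stateful bolt: folds each incoming word into a running count.
(defn count-bolt [counts word]
  (update counts word (fnil inc 0)))

;; Topology: the wiring that connects the spout to the bolts.
(defn run-topology []
  (->> (sentence-spout)
       (mapcat split-bolt)
       (reduce count-bolt {})))

(run-topology)
;; => {"the" 2, "quick" 1, "brown" 1, "fox" 1, "lazy" 1, "dog" 1}
```

    The stream here is simply the sequence flowing between the stages; Storm adds distribution, acknowledgement, and replay on top of this shape.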
  7. 11

  8. 12

  9. 13 Storm Pros and Cons

    • No “exactly once” guarantee
    • Fast, simple
    • Multitenancy and debugging
    • Integrations
  10. 14 Trident

    • The “Cascading” of Storm
    • High level abstraction processing library on top of Storm
    • Rich API with joins, aggregations, grouping, etc.
    • Provides stateful, exactly-once processing primitives
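
    One common way to get exactly-once state updates on top of at-least-once delivery is to store the id of the last applied batch next to each value and skip a batch that has already been applied. The plain-Clojure sketch below illustrates that idea only; it is not the Trident or Marceline API, and it assumes a replayed batch carries the same transaction id and the same contents. The names `apply-update`, `apply-batch`, and the batch maps are hypothetical.

```clojure
;; State shape (an assumption for this sketch): {key {:txid n, :value v}}

;; Apply one delta to one key unless this txid was already applied to it.
(defn apply-update [state txid k delta]
  (let [{last-txid :txid value :value :or {value 0}} (get state k)]
    (if (= last-txid txid)
      state                                          ; replayed batch: skip
      (assoc state k {:txid txid :value (+ value delta)}))))

;; Apply every per-key delta of a batch.
(defn apply-batch [state {:keys [txid counts]}]
  (reduce-kv (fn [s k delta] (apply-update s txid k delta)) state counts))

(def batch-1 {:txid 1 :counts {"click" 3 "view" 5}})
(def batch-2 {:txid 2 :counts {"click" 1}})

;; batch-1 is delivered twice, as could happen after a failure and replay.
(def final-state
  (reduce apply-batch {} [batch-1 batch-1 batch-2]))

(get-in final-state ["click" :value])
;; => 4
(get-in final-state ["view" :value])
;; => 5
```

    The replayed `batch-1` leaves the counts unchanged, so each event contributes to the state once even though it was processed twice.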
  11. 15 Marceline

    Marceline provides a DSL that allows you to define all of the primitives that Trident has to offer from Clojure
  12. 16

  13. 17

  14. 20

  15. 23

    • Cascading - a Java API
      – defining complex data flows
      – integrating those flows with back-end systems
      – query planner for mapping and executing logical flows onto a computing platform
    • Cascalog
      – Clojure DSL for Cascading
  16. 25 Cascading Pros and Cons

    • Hive
      – Pros: SQL (non-standard), low learning curve, UDF
      – Cons: testability, reusability, flow control, spread logic, UDF programming
    • Pig
      – Pros: Pig Latin, low learning curve, UDF
      – Cons: testability, reusability, spread logic, UDF programming
    • Cascading
      – Pros: Java API, unit testable, flow control (if, try-catch), good reusability
      – Cons: programming
  17. 26

  18. 29

    “Simplicity is about living life with more enjoyment and less pain” - John Maeda
    https://www.ted.com/speakers/john_maeda
  19. 30 There are also other Clojure tools

    • Flambo – Clojure DSL for Apache Spark
    • Riemann (http://riemann.io/) – monitors distributed systems
    • ...