Introduction to Apache NiFi and Storm

Jungtaek Lim
November 26, 2016

Transcript

  1. WHO AM I?
     • Staff Software Engineer @ Hortonworks
     • Remote worker
     • Open source prosumer
     • Committer of Jedis
     • PMC member of Apache Storm
     • Contributor to Apache (Spark, Zeppelin, Ambari, Calcite), Redis, and so on
     • Contact: [email protected]
     • Twitter / LinkedIn / GitHub / Facebook: @heartsavior
  2. DATA IN MOTION IN HORTONWORKS DATAFLOW (HDF)
     • Figure labels: Sources, Regional Infrastructure, Core Infrastructure
     • → Constrained → High-latency → Localized context
     • → Hybrid – cloud / on-premises → Low-latency → Global context
     • Source: http://ko.hortonworks.com/products/data-center/hdf/
  3. • Created by the United States National Security Agency (NSA)
     • Originally named Niagarafiles
     • In 2014 the NSA submitted the source code to the Apache Software Foundation via the NSA Technology Transfer Program; the project entered incubation in December 2014
     • Development of Apache NiFi continued at Onyara, Inc., a startup company
     • Became an Apache Top-Level Project in July 2015
     • Hortonworks acquired Onyara, Inc. in August 2015
  4. • Data acquisition and delivery
     • Simple transformation and data routing
     • Simple event processing
     • End-to-end provenance
     • Edge intelligence and bi-directional comms.
  5. Highly configurable
     • Loss tolerant vs guaranteed delivery
     • Low latency vs high throughput
     • Dynamic prioritization
     • Flow can be modified at runtime
     • Back pressure
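The "Back pressure" bullet on slide 5 can be pictured with a bounded queue standing in for a connection between two processors. This is only a conceptual sketch in plain Python, not NiFi code: the `offer` helper, the threshold of 3, and the `flowfile-N` names are all assumptions made for illustration.

```python
import queue


def offer(connection, item):
    """Try to enqueue an item; return False when the connection is full.

    A False result is the back-pressure signal: the upstream component
    should stop producing until the downstream side drains the queue.
    """
    try:
        connection.put_nowait(item)
        return True
    except queue.Full:
        return False


# Hypothetical connection with an object-count threshold of 3.
connection = queue.Queue(maxsize=3)

# The upstream side tries to push five items; the last two are refused.
accepted = [offer(connection, f"flowfile-{i}") for i in range(5)]

# Once the downstream side consumes one item, the upstream side can resume.
connection.get_nowait()
resumed = offer(connection, "flowfile-5")
```

Here the first three offers succeed and the next two are refused until an item is consumed, which is the general shape of threshold-based back pressure.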
  6. More…
     • Designed for extension
       • Build your own processors and more
     • Secure
       • SSL, SSH, HTTPS, encrypted content, etc.
       • Multi-tenant authorization and internal authorization/policy management
     • MiNiFi subproject
       • Reduces footprint to ~40 MB
  7. • Spout: a source of streams in a topology
     • Bolt: a processing component, which also covers sinks
     • Stream: an unbounded sequence of tuples, defined with a schema
     • Stream groupings: define how a stream should be partitioned among the bolt's tasks
     • Topology: the logic for a realtime application, represented as a DAG
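To make the "stream groupings" bullet on slide 7 concrete, the sketch below partitions tuples among a bolt's tasks by hashing one field, in the spirit of a fields grouping. It is a plain-Python illustration and not the Storm API: `fields_grouping`, the `word` field, and the task count of 4 are assumptions, and Storm's own hashing differs in detail.

```python
import zlib


def fields_grouping(tup, field, num_tasks):
    """Pick a task index for a tuple from the value of one field.

    A stable hash (CRC32) is used so that equal field values always
    map to the same task, which is the property a fields grouping gives.
    """
    key = str(tup[field]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


# Hypothetical stream of tuples with a single "word" field.
stream = [{"word": w} for w in ["storm", "nifi", "storm", "kafka", "nifi"]]

# Route every tuple to one of 4 bolt tasks.
assignments = [fields_grouping(t, "word", 4) for t in stream]
```

Both `storm` tuples land on the same task, as do both `nifi` tuples, so a per-word counter kept inside a task sees every occurrence of its words.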
  8.                     Core                                       Trident
     Computation Unit    Record (tuple)                             Micro batch
     Latency             Very low (sub-seconds)                     High (up to batch size), similar to Spark Streaming
     Delivery Guarantee  At least once                              Exactly once
     API                 Compositional                              Declarative
     Stateful Operator   Supported from v1.0.0                      Core feature (exactly-once)
     Windowing           Time (processing time, event time), Count  Tumbling window, Sliding window
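The windowing row on slide 8 names tumbling and sliding windows. A minimal count-based sketch of the difference follows, in plain Python rather than Storm code; the function names and the choice to emit only full windows are assumptions made for the example.

```python
def tumbling_windows(events, size):
    """Split events into back-to-back, non-overlapping windows of `size`."""
    return [events[i:i + size] for i in range(0, len(events) - size + 1, size)]


def sliding_windows(events, size, slide):
    """Emit a window of `size` events every `slide` events; windows overlap
    whenever `slide` is smaller than `size`."""
    return [events[i:i + size] for i in range(0, len(events) - size + 1, slide)]


events = [1, 2, 3, 4, 5, 6]

tumbling = tumbling_windows(events, size=3)
sliding = sliding_windows(events, size=3, slide=1)
```

With six events, the tumbling case yields two windows that share no events, while the sliding case yields four windows in which each event can appear up to three times. A tumbling window is the special case of a sliding window whose slide equals its size.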
  9. • Supports a number of connectors (17 connectors in master branch)
     • Automatic back-pressure
     • Distributed Cache
     • Flux (constructing topology via YAML)
     • Distributed Log Search
     • Dynamic Worker Profiling
     • Dynamic Log Levels
     • Topology Event Inspector
     • Resource Aware Scheduler
     • SQL (Experimental)
  10. • Clojure to Java translation
      • Unified Stream API with exactly-once support
      • Reworked Metrics feature
      • Apache Beam runner
      • Streaming SQL with Apache Calcite
      • And more…
        • Performance
        • Usability