
From Zero to Stream Processing

Kenny Gorman
November 06, 2017


Getting started with Apache Flink (and Apache Kafka). An introduction to Flink using the Table API in Java and SQL, using ADS-B streaming data as an example.



Transcript

  1. Getting started with Apache Flink (and Apache Kafka): from zero to stream processing. Kenny Gorman, Founder and CEO. www.eventador.io | www.kennygorman.com | @kennygorman

  2. I am a data nerd. I have done database foo for my whole career, going on 25 years: Sybase, Oracle DBA, PostgreSQL DBA, MySQL aficionado, MongoDB early adopter, and founder of two companies based on data technologies. Streaming data is a game changer. Fell in love with Apache Kafka and Apache Flink, and we went 'all in'. ('02: had hair. Now... lol)

  3. What is stream processing? Performing some operation on a boundless data stream. Apache Kafka FTW, but how to process the data?

  4. Stream Processing Frameworks: the landscape is evolving fast
    - Apache Spark: Traditionally more of a batch execution environment, born from the Apache Hadoop ecosystem. Good streaming API with a micro-batch streaming model. Mature.
    - Apache Storm: True boundless stream processing, based around the concepts of "topologies", "spouts", and "bolts". Open sourced by Twitter (who are now working on Heron).
    - Apache Kafka: Traditionally a transport mechanism for data, now has APIs for streaming (Kafka Streams, KSQL). Popular, management required. Just went 1.0.
    - Apache Flink: Pure streaming execution environment, exactly-once semantics, checkpointing, high availability, source/sink connectors, powerful APIs with higher-order functionality for windowing, recoverability, state ...

  5. Apache Flink: Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink provides a high-throughput, low-latency streaming engine[7] as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics.[8] Programs can be written in Java, Scala,[9] Python,[10] and SQL[11] and are automatically compiled and optimized[12] into dataflow programs that are executed in a cluster or cloud environment.

  6. Flink Development APIs Decomposed
    - DataSet API (batch) vs DataStream API (streaming)
    - DataStream is the most powerful of Flink's APIs (see the sketch after this list)
    - Table API is the most convenient and simple of Flink's APIs
    - Flink SQL (Apache Calcite)

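    For contrast with the Table API code later in the deck, a minimal DataStream API sketch; the topic name, broker address, and filter predicate here are placeholder assumptions.

    import java.util.Properties;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class DataStreamSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // assumed Kafka connection details for the sketch
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "datastream-sketch");

            // read raw JSON strings from a Kafka topic
            DataStream<String> lines = env.addSource(
                new FlinkKafkaConsumer010<>("input-topic", new SimpleStringSchema(), props));

            // the DataStream API operates record by record with arbitrary user functions
            lines.filter(line -> !line.isEmpty())
                 .print();

            env.execute("DataStream API sketch");
        }
    }
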
  7. Anatomy of a Flink job
    - Table API: a declarative DSL centered around the concept of a dynamic table
    - Follows an extended relational model
    - Describes what logical operation should be performed on the data
    - Table API + SQL FTW
    - Sources and Sinks ← !!!!!!!
    - Kafka, CSV, roll your own (a sketch follows this list)

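    To illustrate the point about alternative sources, a minimal sketch using a CSV-backed table source instead of Kafka; the file path and column names are made-up placeholders.

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.Types;
    import org.apache.flink.table.api.java.StreamTableEnvironment;
    import org.apache.flink.table.sources.CsvTableSource;

    public class CsvSourceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

            // hypothetical file and columns; handy for testing the same SQL
            // you will later point at a Kafka-backed table
            CsvTableSource csvSource = new CsvTableSource(
                "/tmp/flights.csv",
                new String[] { "icao", "lat", "lon", "altitude" },
                new TypeInformation<?>[] { Types.STRING(), Types.STRING(), Types.STRING(), Types.STRING() }
            );

            // registered tables are queried exactly like the Kafka-backed table below
            tableEnv.registerTableSource("flights_csv", csvSource);
        }
    }
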
  8. Table API vs Table API + SQL

    // Table API
    // scan a registered table source
    Table orders = tableEnv.scan("Orders");
    Table result = orders.groupBy("a").select("a, b.sum as d");

    // Table API + SQL
    // query the registered table source with SQL
    Table result = tableEnv.sql("SELECT a, SUM(b) AS d FROM Orders GROUP BY a");

  9. Our example stream processor
    - Simple usage of the Table API
    - Streaming data from aircraft via ADS-B (http://www.eventador.io/planestream.html)
    - Produce data into Kafka topic A
    - Consume from Kafka topic B
    - Do some filtering in between A and B (a test-producer sketch follows this list)
    (Diagram: Kafka source → Flink, your code performing the filtering → destinations)

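    To exercise the pipeline end to end, something has to write records into the read topic. A minimal test-producer sketch using the plain Kafka client; the broker address, topic name, and the JSON payload are assumptions for illustration, not the actual PlaneStream feed.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SampleFlightProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // hypothetical ADS-B-style record shaped like the schema registered in the Flink job
            String json = "{\"flight\":\"ABC123\",\"icao\":\"a1b2c3\",\"lat\":\"30.26\","
                        + "\"lon\":\"-97.74\",\"altitude\":\"35000\"}";

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("flights-in", json)); // assumed read topic
            }
        }
    }
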
  10. Line by line

    public class FlinkReadWriteKafkaJSON {
        public static void main(String[] args) throws Exception {
            // Read parameters from the command line
            final ParameterTool params = ParameterTool.fromArgs(args);
            if (params.getNumberOfParameters() < 4) {
                System.out.println("\nUsage: FlinkReadWriteKafkaJSON --read-topic <topic> --write-topic <topic> --bootstrap.servers <kafka brokers> --group.id <groupid>");
                return;
            }

  11. Line by line

    // set up the Flink environment; getExecutionEnvironment() works both locally and when deployed to a cluster
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // a couple of example settings
    env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000));
    env.enableCheckpointing(300000); // checkpoint every 300 seconds for recovery
    env.getConfig().setGlobalJobParameters(params);

  12. Line by line

    // create a table environment
    StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

    // define a schema to pass to the table source
    TypeInformation<Row> typeInfo = Types.ROW(
        new String[] { "flight", "timestamp_verbose", "msg_type", "track", "timestamp", "altitude",
                       "counter", "lon", "icao", "vr", "lat", "speed" },
        new TypeInformation<?>[] { Types.STRING(), Types.STRING(), Types.STRING(), Types.STRING(),
                                   Types.SQL_TIMESTAMP(), Types.STRING(), Types.STRING(), Types.STRING(),
                                   Types.STRING(), Types.STRING(), Types.STRING(), Types.STRING() }
    );

  13. Line by line

    // create a new table source of JSON from Kafka
    KafkaJsonTableSource kafkaTableSource = new Kafka010JsonTableSource(
        params.getRequired("read-topic"),
        params.getProperties(),
        typeInfo
    );

  14. Line by line

    // define a simple filtering SQL statement
    String sql = "SELECT icao, lat, lon, altitude FROM flights WHERE altitude <> ''";

    // or maybe something more complicated..
    String sql = "SELECT icao, MAX(altitude) FROM flights GROUP BY TUMBLE(timestamp, INTERVAL '5' SECOND), icao";

    // register the source and apply the statement to the table
    tableEnv.registerTableSource("flights", kafkaTableSource);
    Table result = tableEnv.sql(sql);

  15. Line by line

    // Flink needs to know how to partition the data when writing to Kafka
    FlinkFixedPartitioner partition = new FlinkFixedPartitioner();

    // create a sink to put the data into
    KafkaJsonTableSink kafkaTableSink = new Kafka09JsonTableSink(
        params.getRequired("write-topic"),
        params.getProperties(),
        partition
    );

    // write the result and start the job
    result.writeToSink(kafkaTableSink);
    env.execute("FlinkReadWriteKafkaJSON"); // required to actually launch the streaming job

  16. In Summary
    - Table API plus SQL is super cool
    - Calcite supports loads of SQL operations
    - At some level of complexity, choose the lower-level DataStream API
    - Growing amount of development on the Table API/SQL by the community

    https://github.com/kgorman/TrafficAnalyzer
    https://github.com/kgorman/TrafficAnalyzer/blob/master/src/main/java/io/eventador/FlinkReadWriteKafkaJSON.java
    https://github.com/kgorman/TrafficAnalyzer/blob/master/src/main/java/io/eventador/FlinkReadWriteKafkaSinker.java
    https://ci.apache.org/projects/flink/flink-docs-release-1.3/
    https://calcite.apache.org/docs/reference.html