Slide 1

Slide 1 text

Building Streaming Data Pipelines using Flink SQL

Slide 2

Slide 2 text

HELLO! I’m Zabeer Farook
Technical Architect, Credit Agricole CIB
- Passionate about stream data processing, event-driven architecture, cloud, DevOps, etc.
- Love travelling & exploring places
https://sg.linkedin.com/in/zabeer-farook

Slide 3

Slide 3 text

Agenda
▸ Stream processing
  ▹ Background
  ▹ Use cases
  ▹ Popular frameworks
▸ Apache Flink
  ▹ Overview
  ▹ Use cases & companies using Flink
  ▹ Architecture
  ▹ APIs
▸ How to choose between Flink, Kafka Streams and Spark?
▸ Demo

Slide 4

Slide 4 text

Stream Processing 1

Slide 5

Slide 5 text

Worldwide Data Volume Growth
Source: https://www.statista.com/

Slide 6

Slide 6 text

“Data has long become the new oil and data processing platforms have become the refineries to harness the information”

Slide 7

Slide 7 text

Stream Processing
▸ “a type of data processing that is designed with infinite data sets in mind. Nothing more”
▸ “Continuous processing of data that is continuously generated”
▸ “Processing unbounded data”
▸ “Processing of data in motion”
▸ “Processing infinite sequence of data”

Slide 8

Slide 8 text

Stream Processing – Use Cases
◎ Real-time fraud detection
◎ Real-time stock trading
◎ Cybersecurity
◎ Online gaming
◎ Clickstream analytics
◎ Ride sharing apps
◎ Training ML models
◎ Real-time tracking in logistics
◎ Real-time monitoring
◎ Recommendation engines
◎ Up-to-date retail inventory
◎ Social media feeds
◎ Sensor (IoT) data

Slide 9

Slide 9 text

Stream Processing Vs Batch Processing

Slide 10

Slide 10 text

Stream Processing Vs Batch Processing

Slide 11

Slide 11 text

Why Realtime Stream Processing Matters?

Slide 12

Slide 12 text

Why Realtime Stream Processing Matters?
“Your groceries will be delivered at 10:30 AM today. Some items from your order are not available and we will refund you the corresponding amount”
“Your income tax is due on May 15th 2023. Please ignore if you have already paid”

Slide 13

Slide 13 text

Why Realtime Stream Processing Matters?

Slide 14

Slide 14 text

Is Batch ETL Dead?
◎ Batch processing is still relevant for some use cases, but it need not be the default any more
◎ What if you just need a daily report of the number of users who subscribed to your blog site? Use batch ETL.
◎ What if the upstream legacy application only delivers a batch file at EOD? It doesn’t make sense to have a stream job waiting all day
◎ Does that mean we can’t do stream processing until the legacy application is rewritten to produce streams?
  ○ Not necessarily: CDC (Change Data Capture) can help by turning database changes into a stream of events
◎ How would you prefer to detect fraud? EOD batch or real-time streaming?
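A rough illustration of the CDC idea mentioned above: every insert, update or delete on the legacy table becomes an event, and a downstream consumer replays those events in order to keep its own copy current. This is a minimal sketch only. The event shape (`operation, key, row`) and the helper name `apply_changes` are assumptions made for illustration and do not reflect the format of any particular CDC tool.

```python
from typing import Dict, Iterable, Optional, Tuple

# (operation, primary key, new row or None): an assumed, simplified event shape
ChangeEvent = Tuple[str, int, Optional[dict]]


def apply_changes(events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Replay change events in order to rebuild the current table state."""
    table: Dict[int, dict] = {}
    for op, key, row in events:
        if op in ("insert", "update"):
            table[key] = row          # upsert the latest version of the row
        elif op == "delete":
            table.pop(key, None)      # ignore deletes for unknown keys
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


# Hypothetical changes captured from a legacy "accounts" table
events = [
    ("insert", 1, {"name": "alice", "balance": 100}),
    ("insert", 2, {"name": "bob", "balance": 50}),
    ("update", 1, {"name": "alice", "balance": 70}),
    ("delete", 2, None),
]
replica = apply_changes(events)
print(replica)
```

The legacy application itself is untouched here; only its table changes are observed, which is why CDC can bridge a batch-oriented source into a streaming pipeline.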

Slide 15

Slide 15 text

Popular Stream Processing Platforms
Image Credits: Kai Waehner, Field CTO, Confluent

Slide 16

Slide 16 text

Apache Flink 2

Slide 17

Slide 17 text

Apache Flink - Overview
• Distributed open source stream processing framework offering
  • Low latency
  • High throughput
  • Fault tolerance with exactly-once support
  • High scalability
• Support for both stream & batch processing (bounded stream)
• Support for event-time processing
• Checkpoint and savepoint features
• Written in Java & Scala
• Stream jobs can be written in Java, Scala, Python or even SQL
• Latest version: 1.18, released in October 2023
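To make the link between checkpoints and exactly-once results more concrete, here is a toy, single-process sketch. It is not Flink's actual mechanism, which is distributed and considerably more involved. The class `CountingJob` and its methods are hypothetical names. The point it tries to show is that state and input position are saved together, so after a failure the job resumes from the snapshot and each record affects the state exactly once.

```python
import copy
from typing import Dict, List, Tuple

Snapshot = Tuple[int, Dict[str, int]]


class CountingJob:
    """Toy stateful job: counts words and tracks its position in the input."""

    def __init__(self) -> None:
        self.offset = 0
        self.counts: Dict[str, int] = {}

    def process(self, log: List[str], upto: int) -> None:
        """Consume records from the current offset up to (not including) upto."""
        while self.offset < upto:
            word = log[self.offset]
            self.counts[word] = self.counts.get(word, 0) + 1
            self.offset += 1

    def checkpoint(self) -> Snapshot:
        # State and input position are captured together.
        return self.offset, copy.deepcopy(self.counts)

    def restore(self, snapshot: Snapshot) -> None:
        offset, counts = snapshot
        self.offset = offset
        self.counts = copy.deepcopy(counts)


log = ["a", "b", "a", "c", "a", "b"]   # replayable input, e.g. a durable log

job = CountingJob()
job.process(log, upto=3)
snapshot = job.checkpoint()
job.process(log, upto=5)               # progress made after the checkpoint

job = CountingJob()                    # simulated crash: in-memory state is lost
job.restore(snapshot)                  # roll back to the last checkpoint
job.process(log, upto=len(log))        # replay from the saved offset
print(job.counts)
```

The sketch relies on the input being replayable from a saved offset, which is one reason a durable log such as Kafka pairs well with a checkpointing engine.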

Slide 18

Slide 18 text

Apache Flink - History
• Started off in 2010 as a research project, “Stratosphere”, in collaboration with a few German universities
• Became an Apache Incubator project in March 2014 and was accepted as an Apache top-level project in December 2014
• Alibaba created an internal fork of Flink, “Blink”, in 2015, which was merged back into Flink in 2019/2020
• Fun fact: Flink means fast or agile in German. The red squirrel logo was chosen because squirrels are fast and agile, and squirrels in Berlin apparently have a shade of reddish brown :)

Slide 19

Slide 19 text

Popular use cases & companies using Flink
§ Event Driven Applications
§ Batch Data Analytics
§ Streaming Data Analytics
§ Complex Event Processing
§ ETL jobs
§ Data pipelines

Slide 20

Slide 20 text

Why Flink is so popular
VC Funding
▸ Data Artisans/Ververica acquired by Alibaba
▸ eventador.io acquired by Cloudera
▸ Immerok acquired by Confluent in early 2023 and Flink integrated into the Confluent Cloud platform
▸ Other companies building managed streaming solutions on top of Flink, such as Decodable, Aiven.io and DeltaStream
Strong Community
▸ Community support with large organizations using Flink
▸ Support and contributions from managed service providers as well
▸ Top Apache project in terms of user activity
USP
▸ Leading choice for large scale stateful stream processing with high throughput and low latency
▸ Powerful and battle-tested runtime
▸ Support for multiple programming languages and connectors
▸ Streaming-first approach for both stream & batch processing
▸ Useful extensions like Flink CDC, Flink SQL, Flink ML, PyFlink etc.

Slide 21

Slide 21 text

Let’s ask BBC why Flink is popular

Slide 22

Slide 22 text

Flink - Architecture
▸ JobManager - Schedules and coordinates the distributed execution of Flink applications
▸ TaskManagers - Also called workers; they execute the tasks of a dataflow
▸ Client program - Prepares the dataflow graph and sends it to the JobManager

Slide 23

Slide 23 text

Flink – Deployment Modes
§ Standalone
§ YARN
§ Mesos (deprecated in Flink 1.13 and removed in 1.14)
§ Kubernetes
§ As a library - in memory (not for production)

Slide 24

Slide 24 text

APIs in Flink
• Unified batch & stream processing support
• Batch data is treated as a finite / bounded stream and stream data is treated as an infinite / unbounded stream
• Flink SQL supports ANSI standard SQL
[Figure: Flink APIs by level of abstraction]
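The "batch is just a bounded stream" idea can be sketched without Flink at all. In the plain-Python example below, the function name `running_total` and the sample inputs are made up for illustration. The same logic runs unchanged over a finite list and over a source that never ends; only the bounded case ever has a final answer.

```python
import itertools
from typing import Iterable, Iterator


def running_total(amounts: Iterable[int]) -> Iterator[int]:
    """Emit an updated total after every event; works for finite or infinite input."""
    total = 0
    for amount in amounts:
        total += amount
        yield total


# Bounded input ("batch"): the stream ends, so a final result exists.
batch_result = list(running_total([5, 10, 20]))[-1]

# Unbounded input ("stream"): results keep arriving, so we only sample the first three.
stream_sample = list(itertools.islice(running_total(itertools.count(1)), 3))

print(batch_result, stream_sample)
```

A streaming-first engine applies this view throughout: one programming model, with bounded input as the special case where the job eventually finishes.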

Slide 25

Slide 25 text

Flink Vs Kafka – Friends or Foes?
▸ Kafka & Flink are complementary technologies
▸ Kafka provides the distributed storage layer for streaming data
▸ Flink adds the stream processing engine on top
▸ Kafka Streams & KSQL can also be used for stream processing and overlap with Flink in some functionality
▸ Kafka is Flink’s most popular connector

Slide 26

Slide 26 text

Should we choose Flink over Kafka streams or Spark? It Depends…..

Slide 27

Slide 27 text

How to choose between Flink, Kafka Streams or Spark?
▸ The key points to consider
  ▹ Batch workload or streaming workload?
  ▹ Volume & rate of data to process (throughput)
  ▹ Latency requirements
  ▹ Stateful or stateless?
  ▹ Supported languages and expertise in the team
  ▹ Existing tech stack
  ▹ Community support & documentation
  ▹ Ordering & delivery guarantees
  ▹ Deployment modes
  ▹ State management

Slide 28

Slide 28 text

How to choose between Flink, Kafka Streams or Spark? (contd.)
▸ Stateless stream processing with Kafka – Kafka Streams
▸ When both input and output are Kafka topics – Kafka Streams
▸ Complex stateful stream processing with multiple sources – Flink
▸ Stream analytics – Kafka / Flink
▸ More batch oriented workloads – Spark
▸ Kafka & Flink are mostly complementary (Storage Vs Processing)

Slide 29

Slide 29 text

Any challenges with Flink?
▸ Relatively steep learning curve
▸ Concepts like state management, event-time processing and watermarks may seem complex to people new to stream processing systems
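Event time and watermarks are easier to grasp with a small simulation. The sketch below is plain Python, not Flink code, and it simplifies Flink's real semantics. The constants `WINDOW_SIZE` and `MAX_LATENESS`, the function `tumbling_counts` and the sample events are all assumptions for illustration. Events carry their own timestamps and may arrive out of order. The watermark trails the highest timestamp seen so far, and a window is emitted only once the watermark has passed its end.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

WINDOW_SIZE = 10   # seconds per tumbling window (assumed value)
MAX_LATENESS = 3   # tolerated out-of-orderness in seconds (assumed value)


def tumbling_counts(events: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, int]]:
    """Yield (window_start, count) once the watermark passes the window end.

    Each event is (event_time_seconds, payload); arrival order may differ
    from event-time order.
    """
    open_windows: Dict[int, int] = defaultdict(int)
    watermark = float("-inf")
    for event_time, _payload in events:
        window_start = event_time - event_time % WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            continue  # too late: this event's window was already emitted
        open_windows[window_start] += 1
        # Watermark: "no event older than this is expected any more".
        watermark = max(watermark, event_time - MAX_LATENESS)
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                yield start, open_windows.pop(start)
    # End of a bounded input: flush the windows that are still open.
    for start in sorted(open_windows):
        yield start, open_windows[start]


# Arrival order differs from event-time order: 8 arrives after 12, 2 after 14.
events = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (14, "e"), (2, "f"), (25, "g")]
result = list(tumbling_counts(events))
print(result)
```

In this run the event at time 8 still lands in its window because the watermark has not yet passed 10, while the event at time 2 arrives after that window was emitted and is dropped. That trade-off between waiting for stragglers and emitting results promptly is the main thing watermarks control.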

Slide 30

Slide 30 text

Further Learning on Flink
▸ https://nightlies.apache.org/flink/flink-docs-master/
▸ https://developer.confluent.io/courses/apache-flink/intro/

Slide 31

Slide 31 text

Demo 3

Slide 32

Slide 32 text

Demo Use Case

Slide 33

Slide 33 text

Demo Repo
https://github.com/Zabi82/flinksql-demo

Slide 34

Slide 34 text

THANKS! Any questions?