SG Kafka | Flink Meetup Nov 2023 : Building Streaming Data Pipelines Using Flink SQL

Slides from the Flink Talk as part of the Singapore Apache Kafka | Flink Meetup on 30th November 2023

Zabeer Farook

January 03, 2024

Transcript

  1. HELLO! I’m Zabeer Farook, Technical Architect, Credit Agricole CIB

    - Passionate about stream data processing, event-driven architecture, cloud, DevOps, etc.
    - Love travelling and exploring places
    - https://sg.linkedin.com/in/zabeer-farook
  2. Agenda

    - Stream processing
      - Background
      - Use cases
      - Popular frameworks
    - Apache Flink
      - Overview
      - Use cases & companies using Flink
      - Architecture
      - APIs
    - How to choose between Flink, Kafka Streams and Spark?
    - Demo
  3. “Data has long become the new oil and data processing platforms have become the refineries to harness the information”
  4. Stream Processing

    - “a type of data processing that is designed with infinite data sets in mind. Nothing more”
    - “Continuous processing of data that is continuously generated”
    - “Processing unbounded data”
    - “Processing of data in motion”
    - “Processing infinite sequence of data”
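
The definitions above share one idea: the input never ends, so results have to be updated as data arrives. A minimal Python sketch of that idea follows. The `sensor_readings` source and the running-average operator are made up for illustration and are not from the talk.

```python
import itertools
from typing import Iterable, Iterator


def sensor_readings() -> Iterator[int]:
    """Hypothetical unbounded source: yields readings forever."""
    for i in itertools.count():
        yield 20 + (i % 5)


def running_average(stream: Iterable[int]) -> Iterator[float]:
    """Emit an updated average after every element, never waiting for an 'end'."""
    total = 0
    count = 0
    for value in stream:
        total += value
        count += 1
        yield total / count


# The source never ends, so we can only look at the results produced so far.
first_results = list(itertools.islice(running_average(sensor_readings()), 5))
print(first_results)
```

A batch job would wait for all the input before computing one average; here there is no "all the input", only the latest answer.
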
  5. Stream Processing – Use Cases

    - Real-time fraud detection
    - Real-time stock trading
    - Cybersecurity
    - Online gaming
    - Clickstream analytics
    - Ride-sharing apps
    - Training ML models
    - Real-time tracking in logistics
    - Real-time monitoring
    - Recommendation engines
    - Up-to-date retail inventory
    - Social media feeds
    - Sensor (IoT) data
  6. Why Does Real-Time Stream Processing Matter?

    - “Your groceries will be delivered at 10:30 AM today. Some items from your order are not available and we will refund you the corresponding amount”
    - “Your income tax is due on May 15th 2023. Please ignore if you have already paid”
  7. Is Batch ETL Dead?

    - Batch processing is still relevant for some use cases, but it need not be the default any more
    - What if you just need a daily report of the number of users who subscribed to your blog site? Use batch ETL.
    - What if the upstream legacy application only delivers a batch file at EOD? It doesn’t make sense to have a stream job waiting all day
    - Does that mean we can’t do stream processing until the legacy application is rewritten to produce streams?
      - Not necessarily; CDC (Change Data Capture) can help
    - How would you prefer to detect fraud: EOD batch or real-time streaming?
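
To make the CDC point concrete, here is a rough Python sketch of what "capturing changes" produces: a stream of insert, update and delete events derived from a table. It compares two snapshots, which is a simplification chosen for brevity; log-based CDC tools usually read the database's transaction log instead. The table contents and the `diff_snapshots` helper are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
ChangeEvent = Tuple[str, int, Row]


def diff_snapshots(before: Dict[int, Row], after: Dict[int, Row]) -> List[ChangeEvent]:
    """Compare two snapshots keyed by primary key and emit change events."""
    events: List[ChangeEvent] = []
    for key in sorted(after.keys() - before.keys()):
        events.append(("INSERT", key, after[key]))
    for key in sorted(before.keys() & after.keys()):
        if before[key] != after[key]:
            events.append(("UPDATE", key, after[key]))
    for key in sorted(before.keys() - after.keys()):
        events.append(("DELETE", key, before[key]))
    return events


# Hypothetical legacy table at two points in time.
before = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 4: {"name": "Dee"}}
after = {1: {"name": "Ann"}, 2: {"name": "Bobby"}, 3: {"name": "Cy"}}

changes = diff_snapshots(before, after)
for change in changes:
    print(change)
```

The legacy application keeps writing to its database as before, while downstream jobs consume the change events as a stream.
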
  8. Apache Flink - Overview

    - Distributed open source stream processing framework offering:
      - Low latency
      - High throughput
      - Fault tolerance with exactly-once support
      - High scalability
    - Support for both stream & batch processing (bounded stream)
    - Support for event-time processing
    - Checkpoint and savepoint features
    - Written in Java & Scala
    - Stream jobs can be written in Java, Scala, Python or even SQL
    - Latest version 1.18 released in October 2023
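
The checkpoint idea behind "fault tolerant with exactly-once support" can be sketched in a few lines of Python: snapshot the operator state together with the input position, and after a failure restore both and replay from there. This is a toy single-process illustration with made-up names (`CountingJob`, `checkpoint`, `restore`), not Flink's API; Flink's actual mechanism takes distributed snapshots across many tasks.

```python
import copy
from typing import Dict, List, Optional


class CountingJob:
    """Toy stateful job: counts words and remembers how far it has read."""

    def __init__(self) -> None:
        self.offset = 0                    # position in the input
        self.counts: Dict[str, int] = {}   # operator state

    def process(self, source: List[str], limit: Optional[int] = None) -> None:
        """Consume records from the current offset up to `limit` (or the end)."""
        end = len(source) if limit is None else min(limit, len(source))
        while self.offset < end:
            word = source[self.offset]
            self.counts[word] = self.counts.get(word, 0) + 1
            self.offset += 1

    def checkpoint(self) -> dict:
        # Snapshot state and input position together so they stay consistent.
        return {"offset": self.offset, "counts": copy.deepcopy(self.counts)}

    @classmethod
    def restore(cls, snapshot: dict) -> "CountingJob":
        job = cls()
        job.offset = snapshot["offset"]
        job.counts = copy.deepcopy(snapshot["counts"])
        return job


source = ["a", "b", "a", "c", "a"]

job = CountingJob()
job.process(source, limit=2)
snapshot = job.checkpoint()
job.process(source, limit=4)   # progress made after the checkpoint...
del job                        # ...is lost when the job "crashes"

recovered = CountingJob.restore(snapshot)
recovered.process(source)      # replays from the checkpointed offset
print(recovered.counts)
```

Because the state and the offset are restored as a pair, every record affects the counts exactly once, even though some records were read twice.
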
  9. Apache Flink - History

    - Started off in 2010 as a research project, “Stratosphere”, in collaboration with a few German universities
    - Became an Apache Incubator project in March 2014 and was accepted as an Apache top-level project in December 2014
    - Alibaba created an internal fork, “Blink”, from Flink in 2015, which was merged back into Flink in 2019/2020
    - Fun fact: Flink means fast or agile in German. The red squirrel logo was chosen because squirrels are fast and agile, and squirrels in Berlin apparently have a shade of reddish brown :)
  10. Popular use cases & companies using Flink

    - Event-driven applications
    - Batch data analytics
    - Streaming data analytics
    - Complex event processing
    - ETL jobs
    - Data pipelines
  11. Why Flink is so popular

    - VC funding
      - Data Artisans/Ververica acquired by Alibaba
      - eventador.io acquired by Cloudera
      - Immerok acquired by Confluent in early 2023, and Flink integrated into the Confluent Cloud platform
      - Other companies building managed streaming solutions on top of Flink, like Decodable, Aiven.io, Deltastream
    - Strong community
      - Community support, with large organizations using Flink
      - Support and contributions from managed service providers as well
      - Top Apache project in terms of user activity
    - USP
      - Leading choice for large-scale stateful stream processing with high throughput and low latency
      - Powerful and battle-tested runtime
      - Support for multiple programming languages and connectors
      - Streaming-first approach for both stream & batch processing
      - Useful extensions like Flink CDC, Flink SQL, Flink ML, PyFlink etc.
  12. Flink - Architecture

    - JobManager: schedules and coordinates the distributed execution of Flink applications
    - TaskManagers: also called workers; they execute the tasks of a dataflow
    - Client program: prepares the dataflow graph and sends it to the JobManager
  13. Flink – Deployment Modes

    - Standalone
    - YARN
    - Mesos (older releases only; Mesos support was removed in Flink 1.14)
    - Kubernetes
    - As a library, in memory (not for production)
  14. APIs in Flink

    - Unified batch & stream processing support
    - Batch data is treated as a finite / bounded stream, and stream data is treated as an infinite / unbounded stream
    - Flink SQL supports ANSI standard SQL
    - (Diagram: Flink APIs by level of abstraction)
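
The "batch is just a bounded stream" idea can be illustrated with plain Python iterators: one operator, written once, runs over a finite list and over an endless source. This is only an analogy under that assumption and does not use Flink; `count_by_key` is a made-up operator.

```python
import itertools
from typing import Dict, Iterable, Iterator


def count_by_key(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield the updated per-key counts after each event."""
    counts: Dict[str, int] = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield dict(counts)


# Bounded input ("batch"): the stream ends, so the last update is the final result.
batch = ["a", "b", "a"]
batch_result = list(count_by_key(batch))[-1]

# Unbounded input ("stream"): same operator, but we only ever see results so far.
stream = itertools.cycle(["a", "b", "a"])
stream_so_far = next(itertools.islice(count_by_key(stream), 5, None))

print(batch_result)
print(stream_so_far)
```

The only difference between the two runs is whether the input eventually ends, which is the distinction the slide draws.
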
  15. Flink vs Kafka – Friends or Foes?

    - Kafka & Flink are complementary technologies
    - Kafka provides the distributed storage layer for streaming data
    - Flink adds the stream processing engine on top
    - Kafka Streams & KSQL can also be used for stream processing and have some functionality that overlaps with Flink
    - Kafka is Flink’s most popular connector
  16. How to choose between Flink, Kafka Streams or Spark?

    - The key points to consider:
      - Batch workload or streaming workload?
      - Volume & rate of data to process (throughput)
      - Latency requirements
      - Stateful or stateless?
      - Supported languages and expertise in the team
      - Existing tech stack
      - Community support & documentation
      - Ordering & delivery guarantees
      - Deployment modes
      - State management
  17. How to choose between Flink, Kafka Streams or Spark? (contd.)

    - Stateless stream processing with Kafka – Kafka Streams
    - When both input and output are Kafka topics – Kafka Streams
    - Complex stateful stream processing with multiple sources – Flink
    - Stream analytics – Kafka / Flink
    - More batch-oriented workloads – Spark
    - Kafka & Flink are mostly complementary (storage vs processing)
  18. Any challenges with Flink?

    - Relatively steep learning curve
    - Concepts like state management, event-time processing and watermarks may seem complex to people new to stream processing systems
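
For readers meeting these concepts for the first time, here is a small Python sketch of how state, event time and watermarks fit together in a tumbling-window count. It is a simplified illustration, not Flink code: the window size, the allowed lateness and the `tumbling_counts` function are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE = 10       # seconds per tumbling window (assumed)
MAX_OUT_OF_ORDER = 5   # how far behind the newest event we still wait (assumed)


def tumbling_counts(events: Iterable[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """Count events per event-time window; emit a window once the watermark passes its end."""
    open_windows: Dict[int, int] = defaultdict(int)  # state: window start -> count
    emitted: List[Tuple[int, int]] = []
    watermark = float("-inf")
    for event_time, _payload in events:
        window_start = event_time - (event_time % WINDOW_SIZE)
        if window_start + WINDOW_SIZE <= watermark:
            continue  # too late: this window was already emitted, so drop the event
        open_windows[window_start] += 1
        # Watermark: "no event older than this is expected any more".
        watermark = max(watermark, event_time - MAX_OUT_OF_ORDER)
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                emitted.append((start, open_windows.pop(start)))
    return emitted


# (event_time, payload) pairs in arrival order; note 8 and 3 arrive out of order.
events = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (16, "e"), (3, "f"), (27, "g")]
result = tumbling_counts(events)
print(result)
```

The event at time 8 arrives after the one at time 12 but is still counted in the first window, because the watermark has not yet passed that window's end. The event at time 3 arrives after the window closed and is dropped. The window starting at 20 is never emitted in this run because the watermark has not reached its end.
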