JavaZone Workshop - Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Robin Moffatt
September 11, 2018

Code: https://cnfl.io/ksql-workshop

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again! Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with the Kafka Connect API, which is part of Apache Kafka. KSQL is the open-source SQL streaming engine for Apache Kafka, and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.

In this talk we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API, and KSQL.

Gasp as we filter events in real time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!

Transcript

  1. Apache Kafka and KSQL in Action : Let’s Build a

    Streaming Data Pipeline! @rmoff [email protected] confluent.io/ksql
  2. https://cnfl.io/ksql-workshop-prereqs
    • Make sure you allocate Docker >=8GB memory (check with: docker system info | grep Memory)
    • Clone the repo
    • Should default to branch 5.0.0-post
    • Pull the Docker images as instructed in the doc
  3. https://cnfl.io/ksql-workshop
    3: Start up the Stack
  4. $ whoami
    • Developer Advocate @ Confluent
    • Working in data & analytics since 2001
    • Oracle ACE Director & Dev Champion
    • Blogging : http://rmoff.net & http://cnfl.io/rmoff
    • Twitter: @rmoff
    • Geek stuff
    • Beer & Fried Breakfasts
    https://speakerdeck.com/rmoff/
  5. A bit of a mess…
    [diagram labels: App ×4, search, Hadoop, DWH, monitoring, security, MQ ×2, cache ×2]
  6. Kafka is a Streaming Platform
    [diagram labels: KAFKA, DWH, Hadoop, App ×8, request-response, messaging OR stream processing, streaming data pipelines, changelogs]
  7. Analytics - Database Offload
    [diagram labels: RDBMS, CDC, HDFS / S3 / BigQuery etc]
  8. Stream Processing with Apache Kafka and KSQL
    [diagram labels: RDBMS, CDC, customer, order events, Stream Processing, customer orders]
  9. Real-time Event Stream Enrichment
    [diagram labels: RDBMS, CDC, customer, order events, Stream Processing, customer orders, <y>]
  10. Transform Once, Use Many
    [diagram labels: RDBMS, CDC, customer, order events, Stream Processing, customer orders, <y>, New App <x>]
  11. Transform Once, Use Many
    [diagram labels: RDBMS, CDC, customer, order events, Stream Processing, customer orders, <y>, HDFS / S3 / etc, New App <x>]
  12. Let’s Build It!
    [diagram labels: Rating events, User data, Join events to users, and filter, Push notification to Slack, Operational Dashboard, Data Lake]
  13. Let’s Build It!
    [diagram labels: Rating events, User data, Join events to users, and filter, Push notification to Slack, Operational Dashboard, Data Lake, RDBMS, S3/HDFS/SnowflakeDB etc, Elasticsearch, App ×2, Producer API, Consumer API]
  14. Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free!
    [diagram: Confluent Platform]
    Data: Database Changes, Log Events, IoT Data, Web Events, …
    Systems and uses: CRM, Data Warehouse, Database, Hadoop, Data Integration, …, Monitoring, Analytics, Custom Apps, Transformations, Real-time Applications, …
    • Apache Kafka®: Core | Connect API | Streams API
    • Data Compatibility: Schema Registry
    • Monitoring & Administration: Confluent Control Center | Security
    • Operations: Replicator | Auto Data Balancing
    • Development and Connectivity: Clients | Connectors | REST Proxy | CLI
    • SQL Stream Processing: KSQL
    Legend: Apache Open Source, Confluent Open Source, Confluent Enterprise
  15. [diagram labels: Rating events, User data, Join events to users, and filter, Push notification to Slack, Operational Dashboard, Data Lake, RDBMS, S3/HDFS/SnowflakeDB etc, Elasticsearch, App ×2, Producer API, Consumer API, Kafka Connect ×4]
  16. Streaming Integration with Kafka Connect
    [diagram labels: Sources, Sinks, Kafka Connect, Tasks, Workers, Kafka Brokers, Amazon S3, syslog, flat file, CSV, JSON, MQTT ×2]
  17. Kafka Connect
    Reliable and scalable integration of Kafka with other systems – no coding required.
    { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" }
    https://docs.confluent.io/current/connect/
    ✓ Centralized management and configuration
    ✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3
    ✓ Supports CDC ingest of events from RDBMS
    ✓ Preserves data schema
    ✓ Fault tolerant and automatically load balanced
    ✓ Extensible API
    ✓ Single Message Transforms
    ✓ Part of Apache Kafka, included in Confluent Open Source
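A connector configuration like the one on this slide is usually submitted to a Kafka Connect worker through its REST interface. As a rough sketch, the Python below wraps the slide's settings in the name/config payload used by `POST /connectors` and builds the request without sending it. The connector name `jdbc-source-demo` and the worker address `http://localhost:8083` are assumptions made for illustration, and a real JDBC source would likely need further settings (for example `mode` and `topic.prefix`) that the slide does not show.

```python
import json
import urllib.request

# Settings copied from the slide; nothing else is configured here.
connector_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers",
}


def build_create_request(name, config, worker_url="http://localhost:8083"):
    """Build (without sending) a POST /connectors request for a Connect worker.

    The worker URL default is an assumption, not something the slides state.
    """
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{worker_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "jdbc-source-demo" is a hypothetical connector name.
request = build_create_request("jdbc-source-demo", connector_config)
print(request.get_method(), request.full_url)

# Against a running worker, the request could then be submitted with:
# urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes the payload easy to inspect before anything is created on the worker.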
  18. Kafka Connect + Schema Registry = WIN
    [diagram labels: RDBMS, Kafka Connect ×2, Avro Message, Avro Schema, Schema Registry, Elasticsearch]
  19. Kafka Connect + Schema Registry = WIN
    [diagram labels: RDBMS, Kafka Connect ×2, Avro Message, Avro Schema, Schema Registry, Elasticsearch]
  20. Confluent Hub: hub.confluent.io
    • One-stop place to discover and download :
    • Connectors
    • Transformations
    • Converters
  21. Demo Time!
    [diagram labels: MySQL, Debezium, Kafka Connect, Producer API]
  22. https://cnfl.io/ksql-workshop
    4 & 5: Setup & Inspect source data
  23. Let’s Build It!
    [diagram labels: Rating events, User data, Join events to users, and filter, Push notification to Slack, Operational Dashboard, Data Lake, RDBMS, S3/HDFS/SnowflakeDB etc, Elasticsearch, App ×2, Producer API, Consumer API, Kafka Connect ×3]
  24. [diagram labels: Rating events, User data, Join events to users, and filter, Push notification to Slack, Operational Dashboard, Data Lake, RDBMS, S3/HDFS/SnowflakeDB etc, Elasticsearch, App ×2, Producer API, Consumer API, Kafka Connect ×3, KSQL ×2]
  25. KSQL is a Declarative Stream Processing Language
  26. KSQL is the Streaming SQL Engine for Apache Kafka
  27. KSQL in Development and Production
    • Interactive KSQL for development and testing: REST, “Hmm, let me try out this idea...”
    • Headless KSQL for Production: Desired KSQL queries have been identified
  28. POOR_RATINGS: Filter all ratings where STARS<3
    Rating event (Producer API): { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" }
    CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS < 3;
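To make the semantics of the POOR_RATINGS filter concrete, here is a minimal Python sketch of the same idea: each event is examined once as it arrives and passed on only if it has fewer than 3 stars. This is an illustration of the logic, not how KSQL is implemented. Apart from `rating_id` 5313, the sample events are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def poor_ratings(ratings: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only events with fewer than 3 stars, one at a time, in arrival order."""
    for event in ratings:
        if event["stars"] < 3:
            yield event


# Hypothetical sample events; only rating_id 5313 is taken from the slide.
sample = [
    {"rating_id": 5313, "user_id": 3, "stars": 4, "channel": "web"},
    {"rating_id": 5314, "user_id": 7, "stars": 1, "channel": "ios"},
    {"rating_id": 5315, "user_id": 2, "stars": 2, "channel": "web"},
]

poor = list(poor_ratings(sample))
print([event["rating_id"] for event in poor])
```

Because the function is a generator, it never needs the whole input at once, which mirrors how a continuous query keeps running over an unbounded stream.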
  29. https://cnfl.io/ksql-workshop
    6: KSQL CLI
    7: Querying the Ratings topic
    8: Populating a Kafka topic with KSQL
  30. Do you think that’s a table you are querying?
  31. The Table Stream Duality
    Stream (Account ID, Amount)    Table (Account ID, Balance)
    12345  +€50                    12345  €50
    12345  +€25                    12345  €75
    12345  -€60                    12345  €15
    (rows in time order; the table shows the balance after each stream event)
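The duality on this slide can be sketched in a few lines of Python: replaying the stream of account changes in order rebuilds the table, and the table at any point is just the latest state per key. The function name `replay` and the data layout are assumptions for illustration; the amounts are the ones from the slide.

```python
from typing import Dict, List, Tuple


def replay(events: List[Tuple[str, int]]) -> List[Dict[str, int]]:
    """Apply each (account_id, amount) event in order.

    Returns a snapshot of the table (account_id -> balance) after every event.
    """
    table: Dict[str, int] = {}
    snapshots = []
    for account_id, amount in events:
        table[account_id] = table.get(account_id, 0) + amount
        snapshots.append(dict(table))  # copy, so later updates don't alter it
    return snapshots


# The stream from the slide: +€50, +€25, -€60 for account 12345.
stream = [("12345", 50), ("12345", 25), ("12345", -60)]
snapshots = replay(stream)
print([snapshot["12345"] for snapshot in snapshots])  # [50, 75, 15]
```

The final snapshot is what a table query would return, while the full list of events is what a stream query would see.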
  32. “The truth is the log. The database is a cache of a subset of the log.”
    Pat Helland, Immutability Changes Everything
    http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
    Photo by Bobby Burch on Unsplash
  33. RATINGS_WITH_CUSTOMER_DATA: Join each rating to customer data
    Rating event (Producer API): { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" }
    Customer record (Kafka Connect): { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "[email protected]", "gender": "Female", "club_status": "platinum", "comments": "none" }
    CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS SELECT * FROM RATINGS R LEFT JOIN CUSTOMERS C ON R.USER_ID = C.ID;
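The join on this slide enriches each rating with the customer whose `id` matches the rating's `user_id`. A minimal Python sketch of that left-join behaviour follows, assuming the customer table is available as a dictionary keyed by `id`. In KSQL the table would keep changing as customer records are updated, whereas this sketch uses a fixed snapshot. The helper name `enrich`, the chosen subset of customer fields, and the rating for `user_id` 99 are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Customer columns carried onto each rating (a subset, chosen for brevity).
CUSTOMER_FIELDS = ("first_name", "last_name", "club_status")


def enrich(ratings: Iterable[Dict], customers: Dict[int, Dict]) -> Iterator[Dict]:
    """Left join: every rating is emitted; customer columns are None if no match."""
    for rating in ratings:
        customer = customers.get(rating["user_id"], {})
        yield {**rating, **{field: customer.get(field) for field in CUSTOMER_FIELDS}}


# Snapshot of the customer table, keyed by id (values taken from the slide).
customers = {
    3: {"first_name": "Merilyn", "last_name": "Doughartie", "club_status": "platinum"},
}

# The second rating is hypothetical and has no matching customer.
ratings = [
    {"rating_id": 5313, "user_id": 3, "stars": 4},
    {"rating_id": 5314, "user_id": 99, "stars": 1},
]

enriched = list(enrich(ratings, customers))
for row in enriched:
    print(row["rating_id"], row["club_status"])
```

The unmatched rating is still emitted, with empty customer columns, which is what distinguishes a left join from an inner join.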
  34. UNHAPPY_PLATINUM_CUSTOMERS: Filter for just PLATINUM customers
    Input: RATINGS_WITH_CUSTOMER_DATA (each rating joined to customer data; sample rating and customer messages as on slide 33)
    CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3 AND CLUB_STATUS = 'platinum';
  35. RATINGS_BY_CLUB_STATUS_1MIN: Aggregate per-minute by CLUB_STATUS
    Input: RATINGS_WITH_CUSTOMER_DATA (each rating joined to customer data; sample rating and customer messages as on slide 33)
    CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) FROM RATINGS_WITH_CUSTOMER_DATA WINDOW TUMBLING (SIZE 1 MINUTES) GROUP BY CLUB_STATUS;
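A tumbling window splits time into fixed, non-overlapping buckets, and the aggregate is computed per bucket and per key. The Python sketch below counts ratings per `club_status` in one-minute windows. It assumes `rating_time` (epoch milliseconds) is the timestamp used for windowing and that windows are aligned to the epoch; apart from the first timestamp, the sample events are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_MS = 60_000  # one minute, matching SIZE 1 MINUTES on the slide


def count_by_club_status(events: Iterable[Dict]) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, club_status) using 1-minute tumbling windows."""
    counts: Counter = Counter()
    for event in events:
        # Round the timestamp down to the start of its one-minute window.
        window_start = event["rating_time"] - event["rating_time"] % WINDOW_MS
        counts[(window_start, event["club_status"])] += 1
    return dict(counts)


# Hypothetical events; only the first timestamp appears on the slide.
events = [
    {"rating_time": 1519304105213, "club_status": "platinum"},
    {"rating_time": 1519304110000, "club_status": "platinum"},
    {"rating_time": 1519304115000, "club_status": "bronze"},
    {"rating_time": 1519304165000, "club_status": "platinum"},
]

counts = count_by_club_status(events)
for (window_start, status), n in sorted(counts.items()):
    print(window_start, status, n)
```

The first three events fall in the window starting at 1519304100000 and the fourth in the next one, so the same key can appear once per window in the result.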
  36. Stream to Elasticsearch
  37. https://cnfl.io/ksql-workshop
    9: Joining Data in KSQL
    10: Daisy-chaining derived streams
    11: Streaming Aggregates
  38. Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
  39. Useful links
    • Embrace the Anarchy : Apache Kafka's Role in Modern Data Architectures (Recording & Slides)
    • Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL
    • Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL (Recording & Slides)
    • https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
    • https://github.com/confluentinc/ksql/
  40. Resources
    • CDC Spreadsheet
    • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
    • #partner-engineering on Slack for questions
    • BD team (#partners / [email protected]) can help with introductions on a given sales op
    #EOF