Slide 1

Slide 1 text

Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL PGConf UK 2018 3 Jul 2018 / Robin Moffatt @rmoff robin@confluent.io https://speakerdeck.com/rmoff/

Slide 2

Slide 2 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 2 • Developer Advocate @ Confluent • Working in data & analytics since 2001 • Oracle ACE Director & Dev Champion • Blogging : http://rmoff.net & http://cnfl.io/rmoff • Twitter: @rmoff • Geek stuff • Beer & Fried Breakfasts $ whoami https://speakerdeck.com/rmoff/

Slide 3

Slide 3 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 3 App App App App search Hadoop DWH monitoring security MQ MQ cache cache A bit of a mess…

Slide 4

Slide 4 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 4 The Streaming Platform KAFKA DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs

Slide 5

Slide 5 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 5 Database offload → Analytics HDFS / S3 / BigQuery etc RDBMS CDC

Slide 6

Slide 6 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 6 Streaming ETL with Apache Kafka and KSQL order events customer customer orders Stream Processing RDBMS CDC

Slide 7

Slide 7 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 7 Real-time Event Stream Enrichment order events customer Stream Processing customer orders RDBMS CDC

Slide 8

Slide 8 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 8 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS New App CDC

Slide 9

Slide 9 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 9 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS HDFS / S3 / etc New App CDC

Slide 10

Slide 10 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 10 KSQL Streaming ETL with Apache Kafka

Slide 11

Slide 11 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 11 Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks Amazon S3 syslog flat file CSV JSON

Slide 12

Slide 12 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 12 The Connect API of Apache Kafka® ✓ Fault tolerant and automatically load balanced ✓ Extensible API ✓ Single Message Transforms ✓ Part of Apache Kafka, included in
 Confluent Open Source Reliable and scalable integration of Kafka with other systems – no coding required. { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" } https://docs.confluent.io/current/connect/ ✓ Centralized management and configuration ✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3 ✓ Supports CDC ingest of events from RDBMS ✓ Preserves data schema

Slide 13

Slide 13 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 13 Integrating Postgres with Kafka Kafka Connect & Debezium Kafka Connect & JDBC Sink

Slide 14

Slide 14 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 14 Confluent Hub hub.confluent.io • Launched June 2018 • One-stop place to discover and download : • Connectors • Transformations • Converters

Slide 15

Slide 15 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 15 KSQL Streaming ETL with Apache Kafka

Slide 16

Slide 16 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 Declarative Stream Language Processing KSQL is a

Slide 17

Slide 17 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 KSQL is the Streaming SQL Engine for Apache Kafka

Slide 18

Slide 18 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 KSQL for Streaming ETL CREATE STREAM vip_actions AS 
 SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id 
 WHERE u.level = 'Platinum'; Joining, filtering, and aggregating streams of event data

Slide 19

Slide 19 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 KSQL for Anomaly Detection CREATE TABLE possible_fraud AS
 SELECT card_number, count(*)
 FROM authorization_attempts 
 WINDOW TUMBLING (SIZE 5 SECONDS)
 GROUP BY card_number
 HAVING count(*) > 3; Identifying patterns or anomalies in real-time data, surfaced in milliseconds

Slide 20

Slide 20 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

Slide 21

Slide 21 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 21 KSQL Streaming ETL with Apache Kafka

Slide 22

Slide 22 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 22 Kafka Connect Producer API Elasticsearch Kafka Connect { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "mdoughartie1@dedecms.com", "gender": "Female", "club_status": "platinum", "comments": "none" } Postgres Demo Time! Kafka Connect Postgres

Slide 23

Slide 23 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 23 Producer API { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } POOR_RATINGS Filter all ratings where STARS<3 CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS <3

Slide 24

Slide 24 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 24 Kafka Connect Producer API { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "mdoughartie1@dedecms.com", "gender": "Female", "club_status": "platinum", "comments": "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data UNHAPPY_PLATINUM_CUSTOMERS Filter for just PLATINUM customers CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3

Slide 25

Slide 25 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 25 Kafka Connect Producer API { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "mdoughartie1@dedecms.com", "gender": "Female", "club_status": "platinum", "comments": "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data RATINGS_BY_CLUB_STATUS_1MIN Aggregate per-minute by CLUB_STATUS CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) FROM RATINGS_WITH_CUSTOMER_DATA WINDOW TUMBLING (SIZE 1 MINUTES) GROUP BY CLUB_STATUS;

Slide 26

Slide 26 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 26 Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! Database Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data
 Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Apache Open Source Confluent Open Source Confluent Enterprise Confluent Platform Confluent Platform Apache Kafka® Core | Connect API | Streams API Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center | Security Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy | CLI Apache Open Source Confluent Open Source Confluent Enterprise SQL Stream Processing KSQL

Slide 27

Slide 27 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 27 Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle

Slide 28

Slide 28 text

@rmoff robin@confluent.io http://cnfl.io/slack https://www.confluent.io/download/

Slide 29

Slide 29 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 29 • Postgres integration into Kafka • http://debezium.io/docs/connectors/postgresql/ • https://www.simple.com/engineering/a-change-data-capture-pipeline-from-postgresql-to-kafka • https://www.slideshare.net/JeffKlukas/postgresql-kafka-the-delight-of-change-data-capture • https://blog.insightdatascience.com/from-postgresql-to-redshift-with-kafka-connect-111c44954a6a • Streaming ETL • Embrace the Anarchy : Apache Kafka's Role in Modern Data Architectures Recording & Slides • Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL • Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL Recording & Slides • https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data • https://github.com/confluentinc/ksql/ Useful links

Slide 30

Slide 30 text

@rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 30 • CDC Spreadsheet • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / partners@confluent.io) can help with introductions on a given sales op Resources #EOF