Slide 1

Look Ma, no Code!
Building Streaming Data Pipelines with Apache Kafka
Kafka Summit London, 24 Apr 2018 / Robin Moffatt
@rmoff / [email protected] / https://speakerdeck.com/rmoff/

Slide 2

Let’s take a trip back in time. Each application has its own database for storing information, but we want that information elsewhere for analytics and reporting.

Slide 3

We don't want to query the transactional system, so we create a process to extract data from the source into a data warehouse or data lake.

Slide 4

Let’s take a trip back in time. We want to unify data from multiple systems, so we create conformed dimensions and batch processes to federate our data. This is all batch-driven, so latency is built in by design.

Slide 5

Let’s take a trip back in time. As well as our data warehouse, we want to use our transactional data to populate search replicas, graph databases, NoSQL stores…all introducing more point-to-point dependencies in our system.

Slide 6

Let’s take a trip back in time. Ultimately we end up with a spaghetti architecture: it can't scale easily, it's tightly coupled, it's generally batch-driven, and we can't get data when we want it, where we want it.

Slide 7

But…there's hope!

Slide 8

Apache Kafka, a distributed streaming platform, enables us to decouple all our applications creating data from those utilising it. We can create low-latency streams of data, transformed as necessary.

Slide 9

But…to use stream processing, we need to be Java coders…don't we?

Slide 10

Happy days! We can actually build streaming data pipelines using just our bare hands, configuration files, and SQL.

Slide 11

Streaming ETL, with Apache Kafka and Confluent Platform

Slide 12

$ whoami
• Partner Technology Evangelist @ Confluent
• Working in data & analytics since 2001
• Oracle ACE Director & Dev Champion
• Blogging: http://rmoff.net & https://www.confluent.io/blog/author/robin/
• Twitter: @rmoff
• Geek stuff
• Beer & Fried Breakfasts
https://speakerdeck.com/rmoff/

Slide 13

No content

Slide 14

No content

Slide 15

Build Pipelines (diagram labels: App Server, raw logs, HDFS / S3)

Slide 16

Build Complex Pipelines (diagram labels: App Server, raw logs, error logs, Stream Processing, Elasticsearch, HDFS / S3)

Slide 17

Build Applications (diagram labels: App Server, raw logs, Stream Processing, SLA breaches, Alert App)

Slide 18

Build Applications + Pipelines (diagram labels: App Server, raw logs, error logs, Stream Processing, SLA breaches, Alert App, Elasticsearch, HDFS / S3)

Slide 19

No content

Slide 20

The Connect API of Apache Kafka®
Reliable and scalable integration of Kafka with other systems – no coding required.
✓ Centralized management and configuration
✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3
✓ Supports CDC ingest of events from RDBMS
✓ Preserves data schema
✓ Fault tolerant and automatically load balanced
✓ Extensible API
✓ Single Message Transforms
✓ Part of Apache Kafka, included in Confluent Open Source

  {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers"
  }

https://docs.confluent.io/current/connect/
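A connector like this is deployed by POSTing it to the Kafka Connect REST API (by default on port 8083), wrapped in a JSON payload that names the connector. A minimal sketch; the connector name here is hypothetical:

  POST http://localhost:8083/connectors
  {
    "name": "jdbc-source-demo",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
      "table.whitelist": "sales,orders,customers"
    }
  }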

Slide 21

Kafka Connect (diagram labels: Kafka Brokers, Kafka Connect tasks, workers, sources, sinks, Amazon S3, syslog, flat file, CSV, JSON)
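Sinks are configured the same way as sources. A sketch of an Elasticsearch sink connector, assuming a local Elasticsearch instance and a topic named orders (both hypothetical):

  {
    "name": "es-sink-orders",
    "config": {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "connection.url": "http://localhost:9200",
      "topics": "orders",
      "type.name": "kafka-connect",
      "key.ignore": "true"
    }
  }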

Slide 22

Streaming Application Data to Kafka
• Modifying applications is not always possible or desirable
• And what if the data gets changed within the database, or by other apps?
• Query-based CDC with Kafka Connect's JDBC source is one option for extracting data (see the sketch below):
  • Polls the source DB → latency & load implications
  • Requires an incrementing key and/or timestamp column in the schema
  • Can't capture deletes
https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc
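A sketch of a query-based JDBC source using a timestamp plus an incrementing ID to detect changes; the table and column names are hypothetical:

  {
    "name": "jdbc-source-orders",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
      "table.whitelist": "orders",
      "mode": "timestamp+incrementing",
      "timestamp.column.name": "updated_at",
      "incrementing.column.name": "order_id",
      "poll.interval.ms": "1000",
      "topic.prefix": "mysql-"
    }
  }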

Slide 23

Liberate Application Data into Kafka with log-based CDC
• Relational databases use transaction logs to ensure durability of data
• Log-based Change-Data-Capture (CDC) mines the log to get raw events from the database
• CDC tools that integrate with Kafka Connect include Debezium, GoldenGate, Attunity, and more (see the sketch below)
https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc
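A sketch of a Debezium MySQL source connector; hostnames, credentials, and table names are hypothetical, and exact property names vary between Debezium versions:

  {
    "name": "debezium-mysql-demo",
    "config": {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "database.hostname": "localhost",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "dbz",
      "database.server.id": "42",
      "database.server.name": "demo",
      "table.whitelist": "demo.orders",
      "database.history.kafka.bootstrap.servers": "localhost:9092",
      "database.history.kafka.topic": "dbhistory.demo"
    }
  }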

Slide 24

Single Message Transform (SMT): Extract, TRANSFORM, Load…
• Modify events before storing in Kafka (see the sketch below):
  • Mask/drop sensitive information
  • Set partitioning key
  • Store lineage
• Modify events going out of Kafka:
  • Route high-priority events to faster data stores
  • Direct events to different Elasticsearch indexes
  • Cast data types to match destination
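As a sketch, two of the SMTs that ship with Apache Kafka, chained in a connector config to mask a sensitive field and set the record key (the field names are hypothetical):

  "transforms": "maskCC,setKey",
  "transforms.maskCC.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.maskCC.fields": "card_number",
  "transforms.setKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
  "transforms.setKey.fields": "customer_id"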

Slide 25

But I need to join…aggregate…filter…

Slide 26

KSQL is a Declarative Stream Processing Language

Slide 27

KSQL is the Streaming SQL Engine for Apache Kafka

Slide 28

KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
• Enables stream processing with zero coding required
• The simplest way to process streams of data in real time
• Powered by Kafka: scalable, distributed, battle-tested
• All you need is Kafka – no complex deployments of bespoke systems for stream processing
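Getting started means pointing the KSQL CLI at the cluster and declaring streams over existing topics. A minimal sketch, assuming an Avro-serialised topic named ratings (a hypothetical name) so the schema can be picked up from the Schema Registry:

  ksql> SHOW TOPICS;
  ksql> PRINT 'ratings';
  ksql> CREATE STREAM ratings WITH (KAFKA_TOPIC='ratings', VALUE_FORMAT='AVRO');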

Slide 29

KSQL for Streaming ETL

  CREATE STREAM vip_actions AS
    SELECT userid, page, action
    FROM clickstream c
    LEFT JOIN users u ON c.userid = u.user_id
    WHERE u.level = 'Platinum';

Joining, filtering, and aggregating streams of event data.
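The stream and table queried above must first be declared over their underlying topics; a sketch, assuming Avro-serialised topics named clickstream and users (hypothetical names):

  CREATE STREAM clickstream WITH (KAFKA_TOPIC='clickstream', VALUE_FORMAT='AVRO');
  CREATE TABLE users WITH (KAFKA_TOPIC='users', VALUE_FORMAT='AVRO', KEY='user_id');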

Slide 30

KSQL for Anomaly Detection

  CREATE TABLE possible_fraud AS
    SELECT card_number, COUNT(*)
    FROM authorization_attempts
    WINDOW TUMBLING (SIZE 5 SECONDS)
    GROUP BY card_number
    HAVING COUNT(*) > 3;

Identifying patterns or anomalies in real-time data, surfaced in milliseconds.

Slide 31

KSQL for Real-Time Monitoring
• Log data monitoring, tracking, and alerting
• syslog data
• Sensor / IoT data

  CREATE STREAM SYSLOG_INVALID_USERS AS
    SELECT HOST, MESSAGE
    FROM SYSLOG
    WHERE MESSAGE LIKE '%Invalid user%';

http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

Slide 32

KSQL for Data Transformation

  CREATE STREAM views_by_userid
    WITH (PARTITIONS=6, REPLICAS=5, VALUE_FORMAT='AVRO', TIMESTAMP='view_time') AS
    SELECT * FROM clickstream PARTITION BY user_id;

Make simple derivations of existing topics from the command line.

Slide 33

Streaming ETL, powered by Apache Kafka and Confluent Platform (architecture diagram, featuring KSQL)

Slide 34

Demo Time!

Slide 35

Demo architecture (diagram labels: ratings stream, MySQL, Debezium, Kafka Connect, Elasticsearch)

Slide 36

No content

Slide 37

Demo architecture (diagram labels: ratings stream, MySQL, Debezium, Kafka Connect, Elasticsearch)

Slide 38

Demo architecture (diagram labels: ratings stream, MySQL, Debezium, Kafka Connect, Elasticsearch, slack_notify.py)
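A KSQL query along these lines can feed the notifier: a derived stream of low ratings for slack_notify.py to consume. The stream name, field, and threshold here are hypothetical:

  CREATE STREAM POOR_RATINGS AS
    SELECT * FROM ratings WHERE stars < 3;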

Slide 39

Streaming ETL, powered by Apache Kafka and Confluent Platform (architecture diagram, featuring KSQL)

Slide 40

Confluent Open Source: Apache Kafka with a bunch of cool stuff! For free!

Sources: Database Changes, Log Events, IoT Data, Web Events, …
Destinations & consumers: CRM, Data Warehouse, Database, Hadoop, Data Integration, Monitoring, Analytics, Custom Apps, Transformations, Real-time Applications, …

Confluent Platform components:
• Apache Kafka® Core | Connect API | Streams API (Apache open source)
• Development and Connectivity: Clients | Connectors | REST Proxy | CLI (Confluent Open Source)
• Data Compatibility: Schema Registry (Confluent Open Source)
• SQL Stream Processing: KSQL (Confluent Open Source)
• Monitoring & Administration: Confluent Control Center | Security (Confluent Enterprise)
• Operations: Replicator | Auto Data Balancing (Confluent Enterprise)

Slide 41

Streaming ETL, powered by Apache Kafka and Confluent Platform
@rmoff / [email protected]
https://slackpass.io/confluentcommunity
https://www.confluent.io/download/

Slide 42

Kafka Summit San Francisco 2018
Dates: October 16-17, 2018
Location: Pier 27, San Francisco
CFP opens: April 30, 2018
www.kafka-summit.org

Slide 43

Look Ma, no Code!
Building Streaming Data Pipelines with Apache Kafka
Kafka Summit London, 24 Apr 2018 / Robin Moffatt
@rmoff / [email protected] / https://speakerdeck.com/rmoff/