Slide 1

Slide 1 text

PROCESSING STREAMING DATA AT A LARGE SCALE WITH KAFKA

Slide 2

Slide 2 text

I’m Thijs, I work at AppSignal

Slide 3

Slide 3 text

My name is pronounced like this: Ice-T

Slide 4

Slide 4 text

My name is pronounced like this: T-Ice

Slide 5

Slide 5 text

I’m from Amsterdam, The Netherlands.

Slide 6

Slide 6 text

Today is King’s Day!

Slide 7

Slide 7 text

We monitor errors and performance for Ruby and Elixir apps.

Slide 8

Slide 8 text

How hard can it be?

Slide 9

Slide 9 text

It turns out that you have to process a lot of streaming data.

Slide 10

Slide 10 text

Streaming data is:
• Generated continuously
• Coming from multiple data sources
• Sent simultaneously
• Sent in small sizes

Slide 11

Slide 11 text

Problems…
• Database locking
• Load balancing
• Manipulating datasets
• Routing: getting similar data to the same server

Slide 12

Slide 12 text

So let’s try a “simple” streaming data challenge.

Slide 13

Slide 13 text

We have loads of visitors from all over the world on different servers.

Slide 14

Slide 14 text

We want to process the server log files and get some statistics from them.

Slide 15

Slide 15 text

Loads of lines in a log file, streaming to us. An excerpt:

146.185.155.137 - - [02/Jan/2017:05:43:19 +0000] "GET / HTTP/1.1" 200 2469 "-" "Ruby"
191.7.178.2 - - [02/Jan/2017:05:45:01 +0000] "GET /wp-login.php HTTP/1.1" 404 2200 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
76.72.172.208 - - [02/Jan/2017:05:45:13 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)"
128.199.222.25 - - [02/Jan/2017:05:49:23 +0000] "GET / HTTP/1.1" 200 2469 "-" "Ruby"
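Lines in this format can be picked apart with a regular expression, much like the preprocessor does later in the talk. A minimal sketch, assuming a simplified Combined Log Format pattern (the talk's actual regex is stricter):

```ruby
require 'time'

# Simplified matcher for one access log line: IP, timestamp,
# method, path, status and body size.
LOG_LINE = /^(\S+) - - \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)/

def parse_log_line(line)
  match = LOG_LINE.match(line) or return nil
  {
    'ip'     => match[1],
    'time'   => Time.strptime(match[2], '%d/%b/%Y:%H:%M:%S %z'),
    'method' => match[3],
    'path'   => match[4],
    'status' => match[5].to_i,
  }
end

line = '146.185.155.137 - - [02/Jan/2017:05:43:19 +0000] "GET / HTTP/1.1" 200 2469 "-" "Ruby"'
parse_log_line(line)
# => { 'ip' => '146.185.155.137', 'method' => 'GET', 'path' => '/', 'status' => 200, ... }
```

Lines that don't match (truncated or garbled ones) simply return nil and can be skipped.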

Slide 16

Slide 16 text

Visits from beautiful countries! [Bar chart: visits per country, scale 0–200]

Slide 17

Slide 17 text

Is this hard to do?

Slide 18

Slide 18 text

Simple approach: Just update the database! [Diagram: one DATABASE with NL, US, RU, DE counters]

Slide 19

Slide 19 text

For every log line, run this query:

UPDATE countries SET count = count + 1 WHERE code = 'US';

Slide 20

Slide 20 text

Uh-oh! Locking! [Diagram: the DATABASE with NL, US, RU, DE counters — DATABASE LOCKING]

Slide 21

Slide 21 text

Just shard the data! [Diagram: NL, US, RU, DE spread over DATABASE 1, DATABASE 2 and DATABASE 3]

Slide 22

Slide 22 text

Sharding has some downsides

Slide 23

Slide 23 text

Querying data… What if I want to run a query over multiple databases? [Diagram: DATABASE 1, DATABASE 2, DATABASE 3 with NL, US, RU, DE spread across them]

Slide 24

Slide 24 text

What if we want to change the sharding? [Diagram: DATABASE 1, DATABASE 2, DATABASE 3]

Slide 25

Slide 25 text

At AppSignal, we do much more to our data than just a simple increment.

Slide 26

Slide 26 text

Sharding our customers. [Diagram: CUSTOMER 1 → WORKER 1 → DATABASE 1, CUSTOMER 2 → WORKER 2 → DATABASE 2, CUSTOMER 3 → WORKER 3 → DATABASE 3]

Slide 27

Slide 27 text

Not so reliable… A worker is a single point of failure. :( [Diagram: WORKER 1 has died ☠, leaving CUSTOMER 1 without a worker]

Slide 28

Slide 28 text

Load balance with random distribution. [Diagram: CUSTOMER 1, 2 and 3, each with HOST 1–3, randomly distributed over WORKER 1, 2 and 3, which write to DATABASE 1, 2 and 3 — tired of drawing lines, use your imagination!]

Slide 29

Slide 29 text

Data is fragmented over different workers. Not ideal; it's hard to work with the data. [Diagram: each customer's hosts spread over all three workers and databases]

Slide 30

Slide 30 text

Our life would be easier if… we get all data for one customer in the same worker.

Slide 31

Slide 31 text

We’d go from loads of queries:

UPDATE countries SET count = count + 1 WHERE code = 'US';
UPDATE countries SET count = count + 1 WHERE code = 'US';
UPDATE countries SET count = count + 1 WHERE code = 'US';
UPDATE countries SET count = count + 1 WHERE code = 'US';
UPDATE countries SET count = count + 1 WHERE code = 'US';
UPDATE countries SET count = count + 1 WHERE code = 'US';

To a lot less:

UPDATE countries SET count = count + 6 WHERE code = 'US';
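The batching idea above can be sketched in a few lines of Ruby: count the visits per country in memory, then emit one UPDATE per country instead of one per log line. Table and column names mirror the slide's example query; this is an illustration, not the talk's actual code.

```ruby
# Aggregate a stream of country codes into one UPDATE statement
# per country, instead of one statement per visit.
def batched_updates(country_codes)
  counts = Hash.new(0)
  country_codes.each { |code| counts[code] += 1 }
  counts.map do |code, count|
    "UPDATE countries SET count = count + #{count} WHERE code = '#{code}';"
  end
end

batched_updates(%w[US US US US US US])
# => ["UPDATE countries SET count = count + 6 WHERE code = 'US';"]
```

Six incoming log lines become a single query, which is exactly the reduction the slide shows.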

Slide 32

Slide 32 text

We want to do much more complicated stuff than just +N-ing a record.

Percentiles: you have to have a complete dataset to do percentiles right.
Histograms: show the full distribution of all values in the dataset.
Smarter data collection: compare 50 errors and only display the differences!
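The percentile point is worth making concrete: unlike a counter, a percentile can't be maintained incrementally with a simple `+N`. A sketch using the nearest-rank definition (one common definition among several):

```ruby
# Nearest-rank percentile: needs the full dataset, because the answer
# depends on the sorted position of every value seen so far.
def percentile(values, p)
  sorted = values.sort
  rank = (p / 100.0 * sorted.length).ceil - 1
  sorted[[rank, 0].max]
end

durations_ms = [12, 15, 20, 22, 30, 95, 100, 110, 250, 400]
percentile(durations_ms, 50) # => 30
percentile(durations_ms, 90) # => 250
```

This is why routing all of a customer's data to one worker matters: that worker can hold the complete dataset needed to sort.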

Slide 33

Slide 33 text

So… customer goes to same worker? [Diagram: CUSTOMER 1/2/3, each with HOST 1–3, → WORKER 1/2/3 → DATABASE 1/2/3]

Slide 34

Slide 34 text

Less locking, no more sharding? [Diagram: WORKER 1/2/3 → a single DATABASE]

Slide 35

Slide 35 text

Back where we started: single point of failure. [Diagram: WORKER 1 has died ☠, CUSTOMER 1 is stuck]

Slide 36

Slide 36 text

We need something else… [Diagram: CUSTOMER 1/2/3 with their hosts, a single DATABASE, and a gap in between]

Slide 37

Slide 37 text

Kafka makes it possible to…
• Load balance
• Scale your infrastructure
• Route your data: getting similar data to the same server

Slide 38

Slide 38 text

The difficult thing about understanding Kafka…

Slide 39

Slide 39 text

Topic: has a name (e.g. access_logs) and you can write data to it. Conceptually the same as a log.
Partition: each partition within a topic has a part of the data. You choose which data goes together (e.g. per country).
Broker: a Kafka server that actually stores the topics and partitions.
Consumer: a Kafka client that reads from a topic and delivers data for processing by your code.

Slide 43

Slide 43 text

A topic has partitions which contain messages, ordered from old to new.

Partition 0: 0 1 2 3 4 5 6 7 8 9
Partition 1: 0 1 2 3 4 5 6 7 8 9
Partition 2: 0 1 2 3 4 5 6 7 8 9

Slide 44

Slide 44 text

You decide how to partition, e.g. by country.

Partition 0: 0 1 2 3 4 5 6 7 8 9
Partition 1: 0 1 2 3 4 5 6 7 8 9
Partition 2: 0 1 2 3 4 5 6 7 8 9
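Under the hood, "partitioning by country" just means hashing a partition key and taking it modulo the partition count. A sketch of the idea (ruby-kafka's default partitioner uses CRC32 like this; treat the exact hash function as an implementation detail):

```ruby
require 'zlib'

# Map a partition key to a partition number. The same key always
# hashes to the same partition, so all visits for one country end
# up together.
def partition_for(partition_key, partition_count)
  Zlib.crc32(partition_key) % partition_count
end

partition_for('NL', 3) == partition_for('NL', 3) # => true
```

This determinism is what later lets one consumer see every message for a given country.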

Slide 45

Slide 45 text

Broker: remember, a broker is a Kafka server. The topic’s partitions live on these servers. Each broker is primary for some partitions, and secondary for some other partitions.

Slide 46

Slide 46 text

A Kafka cluster with three brokers.
Broker 1: primary for partitions 1 to 3, secondary for partitions 4 to 6.
Broker 2: primary for partitions 4 to 6, secondary for partitions 7 to 9.
Broker 3: primary for partitions 7 to 9, secondary for partitions 1 to 3.

Slide 47

Slide 47 text

Broker 3 died. Kafka redistributes the partitions.
Broker 1: primary for partitions 1 to 5, secondary for partitions 6 to 9.
Broker 2: primary for partitions 6 to 9, secondary for partitions 1 to 5.
Broker 3: ☠ RIP.

Slide 48

Slide 48 text

Scaling up from 3 to 6 servers? Kafka will automatically redistribute the primary and secondary partitions!

Slide 49

Slide 49 text

Consumer: a Kafka client that reads from a topic and delivers data for processing by your code.

Slide 50

Slide 50 text

You can have multiple consumers.
Partition 0: 0 1 2 3 4 5 6 7 8 9
Consumer Slack: at offset 0. Consumer Email: at offset 0.

Slide 51

Slide 51 text

Slack is down. Email delivery is fine.
Partition 0: 0 1 2 3 4 5 6 7 8 9
Consumer Slack: at offset 0. Consumer Email: at offset 6.

Slide 52

Slide 52 text

Slack comes back up.
Partition 0: 0 1 2 3 4 5 6 7 8 9
Consumer Slack: at offset 4. Consumer Email: at offset 8.

Slide 53

Slide 53 text

Both consumers are at the latest offset!
Partition 0: 0 1 2 3 4 5 6 7 8 9
Consumer Slack: at offset 9. Consumer Email: at offset 9.
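The point of the Slack/Email story is that each consumer keeps its own offset into the same partition, so a stalled consumer never blocks a healthy one. A hypothetical in-memory sketch of that idea (not the ruby-kafka API):

```ruby
# Each consumer holds its own cursor ("offset") into a shared
# partition, and catches up independently.
class OffsetConsumer
  attr_reader :offset, :seen

  def initialize
    @offset = 0
    @seen = []
  end

  # Read everything from our own offset to the end of the partition.
  def poll(partition)
    while @offset < partition.length
      @seen << partition[@offset]
      @offset += 1
    end
  end
end

partition = (0..9).to_a
slack = OffsetConsumer.new
email = OffsetConsumer.new

email.poll(partition) # Email keeps delivering while Slack is down
slack.poll(partition) # Slack later catches up from its own offset
slack.seen == email.seen # => true; both processed every message
```

Because the partition retains its messages, "Slack was down for a while" just means its offset lags until it polls again.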

Slide 54

Slide 54 text

How about consuming multiple partitions?

Slide 55

Slide 55 text

Consumers can be in a group. Consumers that have the same group name all get assigned a segment of the available partitions.

Slide 56

Slide 56 text

A topic with 9 partitions and a consumer group:
Consumer 1 consumes partitions 1 to 3.
Consumer 2 consumes partitions 4 to 6.
Consumer 3 consumes partitions 7 to 9.

Slide 57

Slide 57 text

Consumer 1 died: ☠ RIP. The group rebalances over the topic’s 9 partitions:
Consumer 2 consumes partitions 1 to 4.
Consumer 3 consumes partitions 5 to 9.
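What a rebalance achieves can be sketched as a simple assignment function: spread the topic's partitions over whichever consumers are currently alive, and re-run it when membership changes. (Kafka's real group protocol is more involved; this round-robin version is only an illustration.)

```ruby
# Spread partitions over the live consumers in a group, round-robin.
def assign(partitions, consumers)
  assignment = Hash.new { |hash, key| hash[key] = [] }
  partitions.each_with_index do |partition, index|
    assignment[consumers[index % consumers.length]] << partition
  end
  assignment
end

partitions = (1..9).to_a
assign(partitions, %w[consumer1 consumer2 consumer3])
# each of the three consumers gets 3 partitions

assign(partitions, %w[consumer2 consumer3])
# consumer1 died: the same 9 partitions, now spread over 2 consumers
```

No partition is ever orphaned: a dead consumer's partitions are simply dealt out to the survivors.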

Slide 58

Slide 58 text

Remember this? [Diagram: CUSTOMER 1/2/3 → WORKER 1/2/3 → DATABASE; WORKER 1 has died ☠]

Slide 59

Slide 59 text

Kafka fixes this for us. [Diagram: the same setup, with a KAFKA CLUSTER between the customers and the workers — it does its magic to redistribute!]

Slide 60

Slide 60 text

Let’s look at the code then!

Slide 61

Slide 61 text

Let’s build this very simple visitor analytics system.

Slide 62

Slide 62 text

We’ll convert access logs (the server log lines shown earlier)

Slide 63

Slide 63 text

To visits per country

Slide 64

Slide 64 text

Our system uses two Kafka topics and three Rake tasks.

Slide 65

Slide 65 text

The end goal is to update data in an ActiveRecord model.

Slide 66

Slide 66 text

The ActiveRecord model:

class CountryStat < ApplicationRecord
  def self.update_country_counts(country_counts)
    country_counts.each do |country, count|
      CountryStat
        .find_or_create_by(country_code: country)
        .increment!(:visit_count, count)
    end
  end
end

Slide 67

Slide 67 text

rake processor:import → raw_page_views → rake processor:preprocess → page_views → rake processor:aggregate → CountryStat

Slide 68

Slide 68 text

Preprocessing avoids “hotspots”

Slide 69

Slide 69 text

Step 1: Import logs

Slide 70

Slide 70 text

Import:

task :import => :environment do
  Dir.glob('log/access/*') do |file|
    File.read(file).lines.each do |line|
      puts line
      $kafka.deliver_message(
        line,
        topic: 'raw_page_views'
      )
    end
  end
  puts 'Imported all available logs in log/access'
end

Slide 71

Slide 71 text

Step 2: Pre-process log lines

Slide 72

Slide 72 text

Preprocess 1/3:

task :preprocess => :environment do
  puts 'Started processor'

  log_line_regex = %r{^(\S+) - - \[(\S+ \+\d{4})\] "(\S+ \S+ [^"]+)" (\d{3}) (\d+|-) "(.*?)" "([^"]+)"$}
  geo_ip = GeoIP.new('GeoLiteCity.dat')

  consumer = $kafka.consumer(group_id: 'preprocesser')
  consumer.subscribe('raw_page_views')

Slide 73

Slide 73 text

Preprocess 2/3:

  consumer.each_message do |message|
    # We've received a message, parse the log line
    log_line = parse(log_line_regex, geo_ip, message)
    # (elided on the slide: deriving city, user_agent and url from log_line)

    # Convert it to an intermediary format
    page_view = {
      'time' => log_line[2],
      'ip' => log_line[1],
      'country' => city.country_name,
      'browser' => user_agent.browser,
      'url' => url
    }

Slide 74

Slide 74 text

Preprocess 3/3:

    # Write it to a topic
    $kafka.deliver_message(
      page_view.to_json,
      topic: 'page_views',
      partition_key: city.country_code2 # MAGIC HERE
    )
  end
end

Slide 75

Slide 75 text

Step 3: Aggregate visit counts

Slide 76

Slide 76 text

Aggregate 1/2:

task :aggregate => :environment do
  consumer = $kafka.consumer(group_id: 'aggregator')
  consumer.subscribe('page_views')

  @count = 0
  @country_counts = Hash.new(0)
  @last_tick_time = Time.now.to_i

  consumer.each_message do |message|
    page_view = JSON.parse(message.value)

    @count += 1
    @country_counts[page_view['country']] += 1

Slide 77

Slide 77 text

Aggregate 2/2:

    now_time = Time.now.to_i
    if @last_tick_time + 5 < now_time
      # Update stats in the database
      CountryStat.update_country_counts(@country_counts)

      # Clear aggregation
      @count = 0
      @country_counts.clear

      @last_tick_time = now_time
    end
  end
end

Slide 78

Slide 78 text

Display results

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

Controller:

class HomepageController < ApplicationController
  def index
    @country_stats = CountryStat.order('visit_count desc')
    @total_visit_count = CountryStat.sum(:visit_count)
    @max_visit_count = CountryStat.maximum(:visit_count)
  end
end

Slide 81

Slide 81 text

View (the surrounding HTML table markup was lost in extraction; only the ERB survives):

<% @country_stats.each do |country_stat| %>
  <%= country_stat.country_code %>
  <%= country_stat.visit_count %>
<% end %>

Slide 82

Slide 82 text

Thank you!

Thijs Cadier, Co-Founder AppSignal
Email: [email protected] · Twitter: @thijsc · GitHub: thijsc
Use the coupon code railsconf and get $50 credit!
Example code is available here: https://github.com/appsignal/kafka-talk-demo