[2017.12 Meetup] [TALK] Processing Streaming Data at a Large Scale with Kafka

DevOps Lisbon
December 11, 2017

In this talk we'll see how we can leverage Kafka to build and painlessly scale an analytics pipeline. We'll talk about Kafka's unique properties that make this possible, and we'll go through a full demo application step by step. At the end of the talk you'll have a good idea of when and how to get started with Kafka yourself.

Thijs Cadier is a co-founder of AppSignal, based in Amsterdam, The Netherlands. Besides many years of programming, he also runs operations for https://appsignal.com (https://appsignal.com/), a monitoring platform that handles over 4 billion requests per month.


Transcript

  1. Processing streaming data at a large scale with Kafka
  2. We monitor errors and performance for Ruby and Elixir apps.
  3. This is a talk about what you can achieve with Kafka.
  4. A monitoring product: how hard can it be?
  5. It turns out that you have to process a lot of streaming data.
  6. Streaming data is:
     • Generated continuously
     • Coming from multiple data sources
     • Sent simultaneously
     • Sent in small sizes
  7. Problems…
     • Database locking
     • Load balancing
     • Manipulating datasets
     • Routing: getting similar data to the same server
  8. So let’s try an “easy” streaming data challenge.
  9. Loads of lines in a log file, streaming to us:

146.185.155.137 - - [02/Jan/2017:05:43:19 +0000] "GET / HTTP/1.1" 200 2469 "-" "Ruby"
191.7.178.2 - - [02/Jan/2017:05:45:01 +0000] "GET /wp-login.php HTTP/1.1" 404 2200 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
76.72.172.208 - - [02/Jan/2017:05:45:13 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)"
(… many more lines like these, arriving continuously …)
  10. Simple approach: just update the database! [Diagram: NL, US, RU, and DE log lines all flowing into one DATABASE]
  11. For every log line, run this query:

      UPDATE countries SET count = count + 1 WHERE code = 'US';
  12. Uh-oh! Locking! [Diagram: NL, US, RU, and DE all hitting one DATABASE, which locks]
  13. Just shard the data! [Diagram: NL, US, RU, and DE spread over DATABASE 1, 2, and 3]
  14. Querying data… what if I want to run a query over multiple databases?
  15. What if we want to change the sharding?
  16. At AppSignal, we do much more to our data than just a simple increment.
  17. Sharding our customers. [Diagram: CUSTOMER 1, 2, and 3 each routed through WORKER 1, 2, and 3 to DATABASE 1, 2, and 3]
  18. Not so reliable… each worker is a single point of failure. :( [Diagram: the same setup with dead workers ☠]
  19. Load balance with random distribution. [Diagram: each customer’s hosts spread randomly over workers and databases — tired of drawing lines, use your imagination!]
  20. Data is fragmented over different workers. Not ideal: there is no full set of data per customer.
  21. Our life would be easier if… we got all data for one customer in the same worker.
  22. We’d go from loads of queries:

      UPDATE countries SET count = count + 1 WHERE code = 'US';
      UPDATE countries SET count = count + 1 WHERE code = 'US';
      (… four more of the same …)

      to a lot less:

      UPDATE countries SET count = count + 6 WHERE code = 'US';
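The reduction on this slide can be sketched in a few lines of Ruby: buffer the increments in memory, then emit one query per distinct country. The table and column names are illustrative only.

```ruby
# Buffer per-country increments in memory instead of issuing one
# UPDATE per log line. Table/column names are illustrative only.
visits = %w[US US NL US DE US US RU]

counts = Hash.new(0)
visits.each { |code| counts[code] += 1 }

queries = counts.map do |code, count|
  "UPDATE countries SET count = count + #{count} WHERE code = '#{code}';"
end
# 8 log lines collapse into 4 queries, one per distinct country.
```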
  23. We want to do much more complicated stuff than just +N-ing a record:
      • Percentiles: you need the complete dataset to compute percentiles correctly.
      • Histograms: show the full distribution of all values in the dataset.
      • Smarter data collection: compare 50 errors and only display the differences!
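As an aside on why percentiles force the complete dataset into one place: a nearest-rank percentile must sort all values, so it cannot be computed from per-shard fragments. A minimal sketch with made-up request durations:

```ruby
# Nearest-rank percentile: needs every value, sorted, in one place.
# This is why all data for one customer should reach the same worker.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.length).ceil - 1
  sorted[[rank, 0].max]
end

durations_ms = [12, 15, 20, 22, 30, 45, 80, 120, 250, 900]
percentile(durations_ms, 95) # dominated by the slowest requests
```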
  24. So… customer goes to the same worker? [Diagram: each customer’s hosts pinned to one worker]
  25. Less locking, no more sharding? [Diagram: all workers writing to a single DATABASE]
  26. Back where we started: a single point of failure. [Diagram: dead workers again ☠]
  27. We need something else…
  28. Kafka makes it possible to:
      • Load balance
      • Scale your infrastructure
      • Route your data: getting similar data to the same server
  29. The difficult thing about understanding Kafka…
  30. The core concepts:
      • Topic: has a name (e.g. access-logs) and you can write data to it. Conceptually the same as a log.
      • Partition: each partition within a topic holds a part of the data. You choose which data goes together (e.g. per country).
      • Broker: a Kafka server that actually stores the topics and partitions.
      • Consumer: a Kafka client that reads from a topic and delivers data for processing by your code.
  31.–33. (The same definitions, highlighting each concept in turn.)
  34. A topic has partitions which contain messages. [Diagram: Partition 0, 1, and 2, each holding messages at offsets 0–9, running from old to new]
  35. You decide how to partition, e.g. by country. [Same diagram, with country flags spread over the partitions]
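How keyed partitioning keeps one country’s data together can be sketched by hashing the key modulo the partition count. Real Kafka clients use their own partitioner (the Java client uses murmur2), so CRC32 here is purely illustrative:

```ruby
require 'zlib'

# Messages with the same key always hash to the same partition, so
# all traffic for one country stays together. CRC32 stands in for
# the client's real partitioner in this sketch.
PARTITION_COUNT = 3

def partition_for(country_code)
  Zlib.crc32(country_code) % PARTITION_COUNT
end

assignments = %w[US NL US DE US].map { |code| partition_for(code) }
# assignments[0], [2], and [4] are identical: every "US" message
# lands on the same partition.
```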
  36. Remember, a broker is a Kafka server. The topic’s partitions live on these servers. Each broker is primary for some partitions and secondary for some others.
  37. A Kafka cluster with three brokers:
      • Broker 1: primary for partitions 1 to 3, secondary for 4 to 6
      • Broker 2: primary for partitions 4 to 6, secondary for 7 to 9
      • Broker 3: primary for partitions 7 to 9, secondary for 1 to 3
  38. Broker 3 died. ☠ Kafka redistributes the partitions:
      • Broker 1: primary for partitions 1 to 5, secondary for 6 to 9
      • Broker 2: primary for partitions 6 to 9, secondary for 1 to 5
  39. Scaling up from 3 to 6 servers? Kafka will automatically redistribute the primary and secondary partitions!
  40. Consumer: a Kafka client that reads from a topic and delivers data for processing by your code.
  41. Consumers live in groups: consumers that share the same group name each get assigned a segment of the available partitions.
  42. A consumer group has its own offset: a consumer in a group has a position in the partition (called the offset), independent of consumers in other groups.
  43. Example: two consumer groups. Consumer group Slack at offset 0, consumer group Email at offset 0. [Diagram: Partition 0 with offsets 0–9]
  44. Slack is down; email delivery is fine. Slack at offset 0, Email at offset 6.
  45. Slack comes back up. Slack at offset 4, Email at offset 8.
  46. Both consumer groups are at the latest offset! Slack at offset 9, Email at offset 9.
  47. Consumer commit: a consumer will “commit” its offsets every 30 seconds by default; you can also do this manually if needed.
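The offset bookkeeping from the last few slides can be modeled in a few lines; `consume` below is a toy stand-in for a consumer group reading messages and committing its own position:

```ruby
# Two consumer groups share one partition but track independent
# offsets; a slow group never holds back a fast one.
PARTITION = (0..9).to_a
offsets = { 'slack' => 0, 'email' => 0 }

def consume(offsets, group, n)
  from = offsets[group]
  messages = PARTITION[from, n] || []
  offsets[group] = from + messages.length # "commit" the new position
  messages
end

consume(offsets, 'email', 6) # Email delivery races ahead
consume(offsets, 'slack', 4) # Slack catches up after its outage
# offsets now differ per group: email at 6, slack at 4.
```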
  48. How about consuming multiple partitions?
  49. A consumer group on a topic with 9 partitions: Consumer 1 consumes partitions 1 to 3, Consumer 2 consumes 4 to 6, Consumer 3 consumes 7 to 9.
  50. Consumer 1 died. ☠ The group rebalances: Consumer 2 consumes partitions 1 to 4, Consumer 3 consumes 5 to 9.
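The rebalancing on these two slides can be sketched with a simple range split. The actual assignment strategy is configurable in Kafka clients; this even split is illustrative only:

```ruby
# Split a topic's partitions over the live members of a consumer
# group; when a member dies, rerunning the split reassigns its share.
def assign(partitions, consumers)
  slice = (partitions.size.to_f / consumers.size).ceil
  consumers.zip(partitions.each_slice(slice).to_a).to_h
end

partitions = (1..9).to_a
assign(partitions, %w[consumer1 consumer2 consumer3])
# => three partitions per consumer
assign(partitions, %w[consumer2 consumer3])
# consumer1 died: the survivors split the nine partitions between them
```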
  51. Remember this? [Diagram: customers, dead workers, single database — the single point of failure from before ☠]
  52. Kafka fixes this for us. [Same diagram, with a Kafka cluster doing its magic to redistribute!]
  53. Let’s build this very simple visitor analytics system.
  54. We’ll convert access logs:

146.185.155.137 - - [02/Jan/2017:05:43:19 +0000] "GET / HTTP/1.1" 200 2469 "-" "Ruby"
191.7.178.2 - - [02/Jan/2017:05:45:01 +0000] "GET /wp-login.php HTTP/1.1" 404 2200 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
(… the same access log as before …)
  55. Our system uses two Kafka topics from three processes.
  56. The end goal is to update data in a relational database.
  57. The database model:

class CountryStat < ApplicationRecord
  def self.update_country_counts(country_counts)
    country_counts.each do |country, count|
      CountryStat
        .find_or_create_by(country_code: country)
        .increment!(:visit_count, count)
    end
  end
end
  58. The pipeline: rake processor:import writes raw log lines to the raw-page-views topic, rake processor:preprocess turns them into page views on the page-views topic, and rake processor:aggregate updates CountryStat.
  59. Import log lines:

task :import => :environment do
  Dir.glob('log/access/*') do |file|
    File.read(file).lines.each do |line|
      puts line
      producer.produce(
        payload: line,
        topic: 'raw-page-views'
      )
    end
  end
  puts 'Imported all available logs in log/access'
end
  60. Preprocess 1/4:

task :preprocess => :environment do
  puts 'Started processor'

  log_line_regex = %r{^(\S+) - - \[(\S+ \+\d{4})\] "(\S+ \S+ [^"]+)" (\d{3}) (\d+|-) "(.*?)" "([^"]+)"$}
  geoip = GeoIP.new('GeoLiteCity.dat')

  consumer = Rdkafka::Config.new.consumer
  consumer.subscribe('raw-page-views')

  delivery_handles = []
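To see what the regex on this slide captures, here it is applied to one line of the access log shown earlier:

```ruby
# The preprocess regex run against a single access log line,
# extracting IP, timestamp, request, status, size, referrer, and agent.
LOG_LINE_REGEX = %r{^(\S+) - - \[(\S+ \+\d{4})\] "(\S+ \S+ [^"]+)" (\d{3}) (\d+|-) "(.*?)" "([^"]+)"$}

line = '146.185.155.137 - - [02/Jan/2017:05:43:19 +0000] ' \
       '"GET / HTTP/1.1" 200 2469 "-" "Ruby"'

m = LOG_LINE_REGEX.match(line)
ip, timestamp, request, status = m[1], m[2], m[3], m[4]
# ip is "146.185.155.137", status is "200", the user agent (m[7]) is "Ruby"
```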
  61. Preprocess 2/4:

  consumer.each do |message|
    # We've received a message, parse the log line
    log_line = parse(log_line_regex, geo_ip, message)

    # Convert it to an intermediary format
    page_view = {
      'time' => log_line[2],
      'ip' => log_line[1],
      'country' => city.country_name,
      'browser' => user_agent.browser,
      'url' => url
    }
  62. Preprocess 3/4:

    # Write it to a topic
    handle = producer.produce(
      payload: page_view.to_json,
      topic: 'page-views',
      key: city.country_code2 # MAGIC HERE
    )
    delivery_handles.push(handle)
  end
end
  63. Preprocess 4/4:

    now_time = Time.now.to_i
    # Run this code every 5 seconds
    if @last_tick_time + 5 < now_time
      # Wait for delivery of all messages
      delivery_handles.each(&:wait)
      delivery_handles.clear

      # Commit consumer position
      consumer.commit

      @last_tick_time = now_time
    end
  64. Aggregate 1/2:

task :aggregate => :environment do
  consumer = Rdkafka::Config.new.consumer
  consumer.subscribe('page-views')

  @count = 0
  @country_counts = Hash.new(0)
  @last_tick_time = Time.now.to_i

  consumer.each do |message|
    page_view = JSON.parse(message.payload)

    @count += 1
    @country_counts[page_view['country']] += 1
  65. Aggregate 2/2:

    now_time = Time.now.to_i
    if @last_tick_time + 5 < now_time
      # Update stats in the database
      CountryStat.update_country_counts(@country_counts)

      consumer.commit

      # Clear aggregation
      @count = 0
      @country_counts.clear

      @last_tick_time = now_time
    end
  66. Controller:

class HomepageController < ApplicationController
  def index
    @country_stats = CountryStat.order('visit_count desc')
    @total_visit_count = CountryStat.sum(:visit_count)
    @max_visit_count = CountryStat.maximum(:visit_count)
  end
end
  67. View:

<table class="stats">
  <% @country_stats.each do |country_stat| %>
    <tr>
      <th><%= country_stat.country_code %></th>
      <td>
        <%= country_stat.visit_count %>
        <div style="width: <%= width %>%"></div>
      </td>
    </tr>
  <% end %>
</table>
  68. Thank you!
      Thijs Cadier, co-founder of AppSignal
      Email: [email protected] · Twitter: @thijsc · GitHub: thijsc
      Use the coupon code lisboa and get $50 credit!
      Example code is available here: https://github.com/appsignal/kafka-talk-demo