
Processing streaming data at a large scale with Kafka

Using a standard Rails stack is great, but when you want to process streams of data at a large scale you'll hit the stack's limitations. What if you want to build an analytics system on a global scale and want to stay within the Ruby world you know and love?

In this talk we'll see how we can leverage Kafka to build and painlessly scale an analytics pipeline. We'll talk about Kafka's unique properties that make this possible, and we'll go through a full demo application step by step. At the end of the talk you'll have a good idea of when and how to get started with Kafka yourself.

Demo code is available at https://github.com/appsignal/kafka-talk-demo

Thijs Cadier

April 27, 2017
Transcript

  1. Processing streaming data at a large scale with Kafka
  2. My name is pronounced like this: Ice-T
  3. My name is pronounced like this: T-Ice
  4. We monitor errors and performance for Ruby and Elixir apps.
  5. It turns out that you have to process a lot of streaming data.
  6. Streaming data is: • generated continuously • coming from multiple data sources • sent simultaneously • sent in small sizes
  7. Problems… • database locking • load balancing • manipulating datasets • routing (getting similar data to the same server)
  8. So let’s try a “simple” streaming data challenge.
  9. 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 146.185.155.137 - - [02/Jan/2017:05:43:19 +0000] "GET

    / HTTP/1.1" 200 2469 "-" "Ruby" 191.7.178.2 - - [02/Jan/2017:05:45:01 +0000] "GET /wp-login.php HTTP/1.1" 404 2200 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" 191.7.178.2 - - [02/Jan/2017:05:45:02 +0000] "GET / HTTP/1.1" 200 5517 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" 76.72.172.208 - - [02/Jan/2017:05:45:13 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 128.199.222.25 - - [02/Jan/2017:05:49:23 +0000] "GET / HTTP/1.1" 200 2469 "-" "Ruby" 204.79.180.91 - - [02/Jan/2017:05:49:45 +0000] "GET /api/1/search.json? token=5485c77b14932d75a8000000&query=false&callback=jQuery16104168975954468962_1483336183061&_=1483336183277 HTTP/1.1" 200 73 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)" 50.23.94.74 - - [02/Jan/2017:05:50:13 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 64.237.49.203 - - [02/Jan/2017:05:55:13 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 178.62.38.143 - - [02/Jan/2017:05:55:20 +0000] "GET / HTTP/1.1" 200 2472 "-" "Ruby" 178.255.154.2 - - [02/Jan/2017:06:00:13 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 180.76.15.145 - - [02/Jan/2017:06:00:52 +0000] "GET /assets/modernizr-335dfd57319af4815822b1b264e6c88a333a133317e5ffb077da6f1d9fbee380.js HTTP/1.1" 200 3899 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" 173.255.192.178 - - [02/Jan/2017:06:01:17 +0000] "GET / HTTP/1.1" 200 2470 "-" 
"Ruby" 184.75.210.90 - - [02/Jan/2017:06:05:13 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 66.249.79.145 - - [02/Jan/2017:06:06:01 +0000] "GET /robots.txt HTTP/1.1" 200 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/ bot.html)" 66.249.79.141 - - [02/Jan/2017:06:06:02 +0000] "GET /apple-app-site-association HTTP/1.1" 404 921 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http:// www.google.com/bot.html)" 58.160.83.212 - - [02/Jan/2017:06:06:47 +0000] "GET /wp-login.php HTTP/1.1" 404 2200 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" 58.160.83.212 - - [02/Jan/2017:06:06:48 +0000] "GET / HTTP/1.1" 200 5517 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" 104.131.24.242 - - [02/Jan/2017:06:07:17 +0000] "GET / HTTP/1.1" 200 2468 "-" "Ruby" 109.123.101.103 - - [02/Jan/2017:06:10:13 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/ 1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 146.185.155.137 - - [02/Jan/2017:06:13:17 +0000] "GET / HTTP/1.1" 200 2470 "-" "Ruby" 89.163.242.206 - - [02/Jan/2017:06:15:14 +0000] "GET /api/1/search.json?token=4dbfc79e3f61b05b53000011&query=ruby&callback=jsonp1316430637976 HTTP/1.1" 200 53 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 66.249.85.16 - - [02/Jan/2017:06:15:33 +0000] "GET /api/1/search.json? 
token=514602663f61b0121e0001fc&query=La%20sirenita&callback=jQuery1910040832826641725495_1483337719027&_=1483337719028 HTTP/1.1" 200 77 "http:// www.hachemuda.com/2008/07/faceinhole-pon-tu-cara-en-el-cuerpo-de-cualquier-famoso/" "Mozilla/5.0 (Linux; Android 6.0; MotoG3 Build/MPI24.65-33.1-2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.85 Mobile Safari/537.36" 66.249.85.16 - - [02/Jan/2017:06:15:37 +0000] "GET /api/1/search.json? token=514602663f61b0121e0001fc&query=La%20sirenita&callback=jQuery1910040832826641725495_1483337719027&_=1483337719029 HTTP/1.1" 200 77 "http:// www.hachemuda.com/2008/07/faceinhole-pon-tu-cara-en-el-cuerpo-de-cualquier-famoso/" "Mozilla/5.0 (Linux; Android 6.0; MotoG3 Build/MPI24.65-33.1-2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.85 Mobile Safari/537.36" 66.249.85.16 - - [02/Jan/2017:06:15:45 +0000] "GET /api/1/search.json? token=514602663f61b0121e0001fc&query=La%20sirenita&callback=jQuery1910040832826641725495_1483337719031&_=1483337719032 HTTP/1.1" 200 77 "http:// www.hachemuda.com/2008/07/faceinhole-pon-tu-cara-en-el-cuerpo-de-cualquier-famoso/" "Mozilla/5.0 (Linux; Android 6.0; MotoG3 Build/MPI24.65-33.1-2) Loads of lines in a log file,
 streaming to us.
  10. Simple approach: just update the database! [diagram: a single database holding counts for NL, US, RU, DE]
  11. For every log line, run this query: UPDATE countries SET count = count + 1 WHERE code = 'US';
  12. Uh-oh! Locking! [diagram: the single database, locked]
  13. Just shard the data! [diagram: three databases]
  14. Querying data… what if I want to run a query over multiple databases?
  15. What if we want to change the sharding?
  16. At AppSignal, we do much more to our data than just a simple increment.
  17. Sharding our customers. [diagram: customers 1–3 routed through workers 1–3 to databases 1–3]
  18. Not so reliable… worker = single point of failure :(
  19. Load balance with random distribution. [diagram: hosts spread over workers — tired of drawing lines, use your imagination!]
  20. Data is fragmented over different workers. Not ideal, hard to work with the data.
  21. Our life would be easier if… we got all data for one customer in the same worker.
  22. We’d go from loads of queries:
      UPDATE countries SET count = count + 1 WHERE code = 'US'; (six times over)
      to a lot less:
      UPDATE countries SET count = count + 6 WHERE code = 'US';
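The batching idea on this slide can be sketched in plain Ruby, with no database involved (the `events` array here is made up for illustration):

```ruby
# Count many "+1" events in memory first, then emit a single
# "+N" update per country instead of one update per event.
events = %w[US US NL US DE NL US]

counts = Hash.new(0)
events.each { |country| counts[country] += 1 }

counts.each do |country, count|
  puts "UPDATE countries SET count = count + #{count} WHERE code = '#{country}';"
end
```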
  23. We want to do much more complicated stuff than just +N-ing a record.
      Percentiles: you have to have a complete dataset to do percentiles right.
      Histograms: show the full distribution of all values in the data set.
      Smarter data collection: compare 50 errors and only display the differences!
  24. So… customer goes to the same worker?
  25. Less locking, no more sharding?
  26. Back where we started: single point of failure.
  27. We need something else…
  28. Kafka makes it possible to: • load balance • scale your infrastructure • route your data (getting similar data to the same server)
  29. The difficult thing about understanding Kafka…
  30. Topic: has a name (e.g. access_logs) and you can write data to it. Conceptually the same as a log.
      Partition: each partition within a topic has a part of the data. You choose which data goes together (e.g. per country).
      Broker: the Kafka server that actually stores the topics and partitions.
      Consumer: a Kafka client that reads from a topic and delivers data for processing by your code.
  31, 32, 33. [The same four definitions, repeated while walking through each term.]
  34. A topic has partitions which contain messages. [diagram: partitions 0–2, each holding messages at offsets 0–9, running from old to new]
  35. You decide how to partition, e.g. by country. [diagram: the same partitions, with messages grouped by country]
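Partitioning by a key boils down to hashing the key and taking it modulo the partition count. A rough sketch (ruby-kafka's default partitioner is CRC32-based, but treat the exact hash function as a client implementation detail):

```ruby
require 'zlib'

# Pick a partition for a message by hashing its partition key.
# The same key always maps to the same partition, so all messages
# for one country end up together.
def partition_for(key, num_partitions)
  Zlib.crc32(key) % num_partitions
end

us_partition = partition_for('US', 3)
nl_partition = partition_for('NL', 3)
# partition_for is deterministic: repeated calls with the same key
# always return the same partition.
```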
  36. Remember, a broker is a Kafka server. The topic’s partitions live on these servers. Each broker is primary for some partitions and secondary for some other partitions.
  37. A Kafka cluster with three brokers. Broker 1: primary for partitions 1 to 3, secondary for 4 to 6. Broker 2: primary for 4 to 6, secondary for 7 to 9. Broker 3: primary for 7 to 9, secondary for 1 to 3.
  38. Broker 3 died. Kafka redistributes the partitions. Broker 1: primary for partitions 1 to 5, secondary for 6 to 9. Broker 2: primary for 6 to 9, secondary for 1 to 5.
  39. Scaling up from 3 to 6 servers? Kafka will automatically redistribute the primary and secondary partitions!
  40. Consumer: a Kafka client that reads from a topic and delivers data for processing by your code.
  41. You can have multiple consumers. [diagram: partition 0 with offsets 0–9] Consumer "Slack" at offset 0; consumer "Email" at offset 0.
  42. Slack is down. Email delivery is fine. Consumer "Slack" at offset 0; consumer "Email" at offset 6.
  43. Slack comes back up. Consumer "Slack" at offset 4; consumer "Email" at offset 8.
  44. Both consumers are at the latest offset! Consumer "Slack" at offset 9; consumer "Email" at offset 9.
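The offset behaviour in these slides can be modeled with a plain array and a per-consumer cursor. This is a toy sketch, not the Kafka client API:

```ruby
# A partition is an ordered log; each consumer keeps its own offset,
# so a stalled consumer (Slack) never blocks the others (Email).
partition = (0..9).map { |i| "message #{i}" }
offsets = { 'slack' => 0, 'email' => 0 }

def consume(partition, offsets, consumer, max_messages)
  max_messages.times do
    break if offsets[consumer] >= partition.size
    partition[offsets[consumer]]  # hand the message to the consumer's code
    offsets[consumer] += 1        # advance (commit) this consumer's offset
  end
end

consume(partition, offsets, 'email', 6) # Email keeps going...
# ...while Slack is down: offsets => { 'slack' => 0, 'email' => 6 }
```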
  45. How about consuming multiple partitions?
  46. Consumers can be in a group. Consumers that have the same group name all get assigned a segment of the available partitions.
  47. Topic with 9 partitions. Consumer group: consumer 1 consumes partitions 1 to 3, consumer 2 consumes 4 to 6, consumer 3 consumes 7 to 9.
  48. Consumer 1 died. Consumer 2 now consumes partitions 1 to 4; consumer 3 consumes 5 to 9.
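A range-style assignment like the one in these slides can be sketched as follows. It is a toy model: in reality the group coordinator inside the Kafka cluster performs the rebalance, not your code:

```ruby
# Divide a topic's partitions over the live members of a consumer group.
# Rerunning the assignment after a member dies is the "rebalance".
def assign(partitions, members)
  per_member = (partitions.size.to_f / members.size).ceil
  slices = partitions.each_slice(per_member).to_a
  members.each_with_index.to_h { |member, i| [member, slices[i] || []] }
end

partitions = (1..9).to_a
assign(partitions, %w[c1 c2 c3]) # => {"c1"=>[1, 2, 3], "c2"=>[4, 5, 6], "c3"=>[7, 8, 9]}
assign(partitions, %w[c2 c3])    # => {"c2"=>[1, 2, 3, 4, 5], "c3"=>[6, 7, 8, 9]}
```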
  49. Remember this? [diagram: workers as a single point of failure in front of the database]
  50. Kafka fixes this for us. [the same diagram, with a Kafka cluster in between that does its magic to redistribute!]
  51. Let’s build this very simple visitor analytics system.
  52. We’ll convert access logs. [the same access log excerpt as slide 9]
  53. Our system uses two Kafka topics, fed by three Rake tasks.
  54. The end goal is to update data in an ActiveRecord model.
  55. The ActiveRecord model:

      class CountryStat < ApplicationRecord
        def self.update_country_counts(country_counts)
          country_counts.each do |country, count|
            CountryStat
              .find_or_create_by(country_code: country)
              .increment!(:visit_count, count)
          end
        end
      end
  56. The pipeline: rake processor:import → raw_page_views → rake processor:preprocess → page_views → rake processor:aggregate → CountryStat
  57. Import:

      task :import => :environment do
        Dir.glob('log/access/*') do |file|
          File.read(file).lines.each do |line|
            puts line
            $kafka.deliver_message(
              line,
              topic: 'raw_page_views'
            )
          end
        end
        puts 'Imported all available logs in log/access'
      end
  58. Preprocess 1/3:

      task :preprocess => :environment do
        puts 'Started processor'

        log_line_regex = %r{^(\S+) - - \[(\S+ \+\d{4})\] "(\S+ \S+ [^"]+)" (\d{3}) (\d+|-) "(.*?)" "([^"]+)"$}
        geoip = GeoIP.new('GeoLiteCity.dat')

        consumer = $kafka.consumer(group_id: 'preprocesser')
        consumer.subscribe('raw_page_views')
  59. Preprocess 2/3:

        consumer.each_message do |message|
          # We've received a message, parse the log line
          log_line = parse(log_line_regex, geoip, message)
          # ...
          # Convert it to an intermediary format
          page_view = {
            'time'    => log_line[2],
            'ip'      => log_line[1],
            'country' => city.country_name,
            'browser' => user_agent.browser,
            'url'     => url
          }
  60. Preprocess 3/3:

          # Write it to a topic
          $kafka.deliver_message(
            page_view.to_json,
            topic: 'page_views',
            partition_key: city.country_code2 # MAGIC HERE
          )
        end
      end
  61. Aggregate 1/2:

      task :aggregate => :environment do
        consumer = $kafka.consumer(group_id: 'aggregator')
        consumer.subscribe('page_views')

        @count = 0
        @country_counts = Hash.new(0)
        @last_tick_time = Time.now.to_i
        # ...
        consumer.each_message do |message|
          page_view = JSON.parse(message.value)

          @count += 1
          @country_counts[page_view['country']] += 1
  62. Aggregate 2/2:

          now_time = Time.now.to_i
          if @last_tick_time + 5 < now_time
            # Update stats in the database
            CountryStat.update_country_counts(@country_counts)

            # Clear aggregation
            @count = 0
            @country_counts.clear

            @last_tick_time = now_time
          end
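The flush-every-few-seconds pattern in the aggregate task can be isolated into a small plain-Ruby sketch. The class name, block-based flush, and explicit `now` parameter are illustrative, not part of the demo's API:

```ruby
# Accumulate counts and flush them whenever the interval has elapsed,
# mirroring the aggregate task's @last_tick_time check.
class IntervalAggregator
  def initialize(interval_seconds, &flush)
    @interval = interval_seconds
    @flush = flush
    @counts = Hash.new(0)
    @last_tick = Time.now.to_i
  end

  def record(country, now = Time.now.to_i)
    @counts[country] += 1
    if @last_tick + @interval < now
      @flush.call(@counts.dup) # e.g. CountryStat.update_country_counts
      @counts.clear
      @last_tick = now
    end
  end
end

agg = IntervalAggregator.new(5) { |counts| puts counts.inspect }
agg.record('US')                    # buffered, no flush yet
agg.record('NL', Time.now.to_i + 6) # interval elapsed, flush happens here
```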
  63. Controller:

      class HomepageController < ApplicationController
        def index
          @country_stats = CountryStat.order('visit_count desc')
          @total_visit_count = CountryStat.sum(:visit_count)
          @max_visit_count = CountryStat.maximum(:visit_count)
        end
      end
  64. View:

      <table class="stats">
        <% @country_stats.each do |country_stat| %>
          <tr>
            <th><%= country_stat.country_code %></th>
            <td>
              <%= country_stat.visit_count %>
              <div style="width: <%= width %>%"></div>
            </td>
          </tr>
        <% end %>
      </table>
  65. Thank you! Thijs Cadier, co-founder of AppSignal. Email: [email protected] · Twitter: @thijsc · GitHub: thijsc. Use the coupon code railsconf and get $50 credit! Example code is available here: https://github.com/appsignal/kafka-talk-demo