
Event Stream Processing with Kafka and Samza

Zach Cox
November 01, 2014


Presented at Iowa Code Camp Fall 2014.


Transcript

  1. Why? Businesses generate and process events Unified event log promotes

    data integration Process event streams to take actions quickly
  2. References Kafka: Kafka Documentation; The Log: What every software

    engineer should know about real-time data's unifying abstraction; Benchmarking Apache Kafka. Samza: Samza Documentation; Questioning the Lambda Architecture; Moving faster with data streams: The rise of Samza at LinkedIn; Why local state is a fundamental primitive in stream processing; Real time insights into LinkedIn's performance using Apache Samza
  3. Why? Businesses generate and process events Unified event log promotes

    data integration Process event streams to take actions quickly
  4. Event Describes what happened Who did it? What did they

    do? What was the result? Provides context When did it happen? Where did it happen? How did they do it? Why did they do it?
  5. Event Example: Pageview User viewed web page User ID: a2be9031-9465-4ecb-9302-9b962fa854ac

    IP: 65.121.142.238 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36 Web Page URL: https://www.mycompany.com/page.html Context Time: 2014-10-14T10:49:24.438-05:00
  6. Event Example: Clickthrough User clicked link User ID: a2be9031-9465-4ecb-9302-9b962fa854ac IP:

    65.121.142.238 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36 Link URL: https://www.mycompany.com/product.html Referer: https://www.othersite.com/foo.html Context Time: 2014-10-14T10:49:24.438-05:00
  7. Event Example: User Update User changed first name User ID:

    161fa4bf-6ae9-4f4e-b72e-01c40e7783e5 First name: Zach Context Time: 2014-10-14T10:59:56.481-05:00 IP: 65.121.142.238
  8. Event Example: User Update User uploaded a new profile image

    User ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5 Profile Image URL: http://profile-images.s3.amazonaws.com/katy-perry.jpg Context Time: 2014-10-14T10:59:56.481-05:00 IP: 65.121.142.238 Using: webcam
  9. Event Example: Tweet User posted a tweet User ID: Username:

    @zcox Name: Zach Cox Bio: Developer @BannoHQ | @iascala organizer | co-founded @Pongr Tweet ID: 527152511568719872 URL: https://twitter.com/zcox/status/527152511568719872 Text: Going to talk about processing event streams using @apachekafka and @samzastream this Saturday @iowacodecamp Mentions: @apachekafka, @samzastream, @iowacodecamp URLs: http://iowacodecamp.com/session/list#66, http://iowacodecamp.com/session/list#66 Context Time: 2014-10-14T10:59:56.481-05:00 Using: Twitter for Android Location: 41.7146365,-93.5914038
  10. Event Example: HTTP Request Latency Some measured code took some

    time to execute Code production.my-app.some-server.http.get-user-profile Time to execute Min: 20 msec Max: 950 msec Average: 190 msec Median: 110 msec 50%: 100 msec 75%: 120 msec 95%: 150 msec 99%: 500 msec Context Time: 2014-10-14T11:17:01.597-05:00
  11. Event Example: Runtime Exception Some code threw a runtime exception

    Some code Stack trace: [...] Exception Message: HBase read timed out Context Time: 2014-10-14T11:21:23.749-05:00 Application: my-app Machine: some-server.my-company.com
  12. Event Example: Application Logging Some code logged some information

    [INFO] [2014-10-14 11:25:44,750] [sentry-akka.actor.default-dispatcher-2] a.e.s.Slf4jEventHandler: Slf4jEventHandler started
    Message: Slf4jEventHandler started Level: INFO Time: 2014-10-14 11:25:44,750 Thread: sentry-akka.actor.default-dispatcher-2 Logger: akka.event.slf4j.Slf4jEventHandler
  13. Why? Businesses generate and process events Unified event log promotes

    data integration Process event streams to take actions quickly
  14. Unified Log Events need to be sent somewhere Events should

    be accessible to any program Log provides a place for events to be sent and accessed Kafka is a great log service
  15. Log Sequence of records Append-only Ordered by time Each record

    assigned unique sequential number Records stored persistently on disk
  16. Log for Event Streams Simple to send events to Broadcasts

    events to all consumers Buffers events on disk: producers and consumers decoupled Consumers can start reading at any offset
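
    A minimal in-memory sketch of the log abstraction from the last two slides (illustrative only, not how Kafka is implemented): records are appended in order, each gets a sequential offset, and a reader can start at any offset.

      // Illustrative append-only log: not Kafka, just the abstraction.
      class EventLog[A] {
        private val records = scala.collection.mutable.ArrayBuffer.empty[A]

        // Append a record to the end of the log and return its offset.
        def append(record: A): Long = {
          records += record
          records.size - 1L
        }

        // Read every record from the given offset onward; consumers choose where to start.
        def read(fromOffset: Long): Seq[A] = records.drop(fromOffset.toInt).toSeq
      }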
  17. Kafka Apache OSS, mainly from LinkedIn Handles all the logs/event

    streams High-throughput: millions of events/sec High-volume: TBs - PBs of events Low-latency: single-digit msec from producer to consumer Scalable: topics are partitioned across cluster Durable: topics are replicated across cluster Available: auto failover
  18. Twitter Example Receive messages via long-lived HTTP connection as JSON

    Write messages to a Kafka topic Twitter Streaming API
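
    A minimal sketch of the producer side of this example, assuming the Kafka 0.8-era Scala producer API and a hypothetical twitter-raw topic; each JSON message received from the Streaming API is sent as a message value.

      import java.util.Properties
      import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

      object TweetProducer {
        // Hypothetical broker list and topic name; adjust for your cluster.
        private val props = new Properties()
        props.put("metadata.broker.list", "localhost:9092")
        props.put("serializer.class", "kafka.serializer.StringEncoder")

        private val producer = new Producer[String, String](new ProducerConfig(props))

        // Call once per JSON message received from the Twitter Streaming API.
        def send(json: String): Unit =
          producer.send(new KeyedMessage[String, String]("twitter-raw", json))
      }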
  19. Twitter Example Twitter rate-limits clients <1% sample, ~50-100 tweets/sec 400

    keywords, ? tweets/sec 1 weird trick to get more tweets: multiple clients, same Kafka topic!
  20. Why? Businesses generate and process events Unified event log promotes

    data integration Process event streams to take actions quickly
  21. Event Stream Processing Turn events into valuable, actionable information Process

    events as they happen, not later (batch) Do all of this reliably, at scale
  22. Samza Event stream processing framework Apache OSS, mainly from LinkedIn

    Simple Java API Scalable: runs jobs in parallel across cluster Reliable: fault-tolerance and durability built-in Tools for stateful stream processing
  23. Samza Job 1) Class that extends StreamTask:

    class MyTask extends StreamTask {
      override def process(envelope: IncomingMessageEnvelope,
                           collector: MessageCollector,
                           coordinator: TaskCoordinator): Unit = {
        // process message in envelope
      }
    }

    2) my-task.properties config file

    job.factory.class=org.apache.samza.job.local.ThreadJobFactory
    job.name=my-task
    task.class=com.banno.MyTask
    ...
  24. Stateless Processing One event at a time Take action using

    only that event
    SELECT * FROM raw_messages WHERE message_type = 'status';
  25. Samza Job: Separate Message Types Many message types from Twitter

    Samza job to separate into type-specific streams Other jobs process specific message types
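
    A minimal sketch of the separating job on slide 25, assuming a JSON serde has already deserialized each raw message into a java.util.Map and assuming hypothetical output topic names.

      import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
      import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

      class SeparateMessageTypes extends StreamTask {
        // Hypothetical type-specific output streams.
        private val statuses = new SystemStream("kafka", "twitter-statuses")
        private val deletes  = new SystemStream("kafka", "twitter-deletes")
        private val other    = new SystemStream("kafka", "twitter-other")

        override def process(envelope: IncomingMessageEnvelope,
                             collector: MessageCollector,
                             coordinator: TaskCoordinator): Unit = {
          val message = envelope.getMessage.asInstanceOf[java.util.Map[String, Object]]
          val out =
            if (message.containsKey("text")) statuses        // tweets carry a "text" field
            else if (message.containsKey("delete")) deletes  // delete notices
            else other
          collector.send(new OutgoingMessageEnvelope(out, message))
        }
      }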
  26. Stateful Stream Processing One event at a time Take action

    using that event and state State = data built up from past events Aggregation Grouping Joins
  27. Aggregation State = aggregated values (e.g. count) Incorporate each new

    event into that aggregation Output aggregated values as events to new stream What happens if job stops? Crash, deploy, ... Can't lose state! Samza handles this all for you
    SELECT COUNT(*) FROM statuses;
  28. Samza Job: Total Status Count Increment a counter on every

    status (tweet) Periodically output current count
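
    A minimal sketch of the counting job on slide 28, assuming Samza's WindowableTask for the periodic output and a hypothetical output topic; the in-memory counter is only for illustration, and a production job would keep it in a changelog-backed key-value store so it survives restarts (slide 27).

      import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
      import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator, WindowableTask}

      class TotalStatusCount extends StreamTask with WindowableTask {
        // Hypothetical output topic for the running total.
        private val output = new SystemStream("kafka", "twitter-status-counts")
        private var count = 0L

        // Called once per status (tweet).
        override def process(envelope: IncomingMessageEnvelope,
                             collector: MessageCollector,
                             coordinator: TaskCoordinator): Unit =
          count += 1

        // Called periodically, as configured by task.window.ms.
        override def window(collector: MessageCollector,
                            coordinator: TaskCoordinator): Unit =
          collector.send(new OutgoingMessageEnvelope(output, count.toString))
      }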
  29. Grouping State = some data per group Two Samza jobs:

    Output statuses by user (map) Count statuses per user (reduce) Output: (user, count) Could use as input to job that sorts by count (most active users)
    SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id;
    SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id ORDER BY COUNT(user_id) DESC LIMIT 5;
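
    A minimal sketch of the "reduce" half of slide 29, assuming the upstream "map" job keyed each status by user id, and assuming a key-value store named user-counts is declared in the job config with a changelog so the per-user counts survive restarts. Topic and store names are hypothetical.

      import org.apache.samza.config.Config
      import org.apache.samza.storage.kv.KeyValueStore
      import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
      import org.apache.samza.task._

      class CountStatusesPerUser extends StreamTask with InitableTask {
        private val output = new SystemStream("kafka", "user-status-counts")
        private var store: KeyValueStore[String, java.lang.Long] = _

        override def init(config: Config, context: TaskContext): Unit =
          store = context.getStore("user-counts").asInstanceOf[KeyValueStore[String, java.lang.Long]]

        override def process(envelope: IncomingMessageEnvelope,
                             collector: MessageCollector,
                             coordinator: TaskCoordinator): Unit = {
          val userId = envelope.getKey.asInstanceOf[String]
          val count  = Option(store.get(userId)).map(_.longValue).getOrElse(0L) + 1
          store.put(userId, java.lang.Long.valueOf(count))
          // Emit (user, count) keyed by user id for downstream jobs (e.g. most active users).
          collector.send(new OutgoingMessageEnvelope(output, userId, count.toString))
        }
      }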
  30. Joins Samza job has multiple input streams Stream-Stream join: ad

    impressions + ad clicks Stream-Table join: page views + user zip code Table-Table join: user data + user settings Joins involving tables need DB changelog
    SELECT u.username, s.text FROM statuses s JOIN users u ON u.id = s.user_id;
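
    A minimal sketch of the stream-table join on slide 30 (page views + user zip code), assuming two input streams: a user-profiles changelog that keeps a local zip-codes store up to date, and a page-views stream that gets enriched as it arrives. Topic names, store name, and string payloads are illustrative assumptions.

      import org.apache.samza.config.Config
      import org.apache.samza.storage.kv.KeyValueStore
      import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
      import org.apache.samza.task._

      class PageViewZipCodeJoin extends StreamTask with InitableTask {
        private val output = new SystemStream("kafka", "pageviews-with-zip")
        private var zipCodes: KeyValueStore[String, String] = _

        override def init(config: Config, context: TaskContext): Unit =
          zipCodes = context.getStore("zip-codes").asInstanceOf[KeyValueStore[String, String]]

        override def process(envelope: IncomingMessageEnvelope,
                             collector: MessageCollector,
                             coordinator: TaskCoordinator): Unit = {
          val userId = envelope.getKey.asInstanceOf[String]
          envelope.getSystemStreamPartition.getStream match {
            case "user-profiles" =>
              // Table side: remember the latest zip code per user.
              zipCodes.put(userId, envelope.getMessage.asInstanceOf[String])
            case "page-views" =>
              // Stream side: look up the zip code and emit the enriched page view.
              val url = envelope.getMessage.asInstanceOf[String]
              val zip = Option(zipCodes.get(userId)).getOrElse("unknown")
              collector.send(new OutgoingMessageEnvelope(output, userId, s"$url,$zip"))
            case _ => // ignore anything else
          }
        }
      }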
  31. What else can we compute? Tweets per sec/min/hour (recent, not

    for-all-time) Enrich tweets with weather at current location Most active users, locations, etc Emojis: % of tweets that contain, top emojis Hashtags: % of tweets that contain, top #hashtags URLs: % of tweets that contain, top domains Photo URLs: % of tweets that contain, top domains Text analysis: sentiment, spam
  32. Druid Send it events Druid reads from Kafka topic That

    Kafka topic is a Samza output stream Super fast time-series queries: aggregations, filters, top-n, etc http://druid.io
  33. Why? Businesses generate and process events Unified event log promotes

    data integration Process event streams to take actions quickly
  34. References Kafka: Kafka Documentation; The Log: What every software

    engineer should know about real-time data's unifying abstraction; Benchmarking Apache Kafka. Samza: Samza Documentation; Questioning the Lambda Architecture; Moving faster with data streams: The rise of Samza at LinkedIn; Why local state is a fundamental primitive in stream processing; Real time insights into LinkedIn's performance using Apache Samza