engineer should know about real-time data's unifying abstraction
Benchmarking Apache Kafka
Samza Documentation
Questioning the Lambda Architecture
Moving faster with data streams: The rise of Samza at LinkedIn
Why local state is a fundamental primitive in stream processing
Real time insights into LinkedIn's performance using Apache Samza
IP: 65.121.142.238
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Web Page URL: https://www.mycompany.com/page.html
Context Time: 2014-10-14T10:49:24.438-05:00
65.121.142.238
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Link URL: https://www.mycompany.com/product.html
Referer: https://www.othersite.com/foo.html
Context Time: 2014-10-14T10:49:24.438-05:00
INFO] [2014-10-14 11:25:44,750] [sentry-akka.actor.default-dispatcher-2] a.e.s.Slf4jEventHandler: Slf4jEventHandler started
Message: Slf4jEventHandler started
Level: INFO
Time: 2014-10-14 11:25:44,750
Thread: sentry-akka.actor.default-dispatcher-2
Logger: akka.event.slf4j.Slf4jEventHandler
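To illustrate how a raw log line like the one above can become a structured event, here is a minimal Scala sketch. The `LogEvent` case class, the `LogParser` object, and the assumed `[LEVEL] [TIME] [THREAD] LOGGER: MESSAGE` layout (including the leading `[`) are illustrative assumptions, not code from the talk. The sketch also keeps the logger name as it appears in the line (`a.e.s.Slf4jEventHandler`); expanding it to the full class name would need extra information.

```scala
// Hypothetical structured form of a log line; field names follow the slide.
final case class LogEvent(
    level: String,
    time: String,
    thread: String,
    logger: String,
    message: String)

object LogParser {
  // Assumed layout: "[LEVEL] [TIME] [THREAD] LOGGER: MESSAGE"
  private val Line = """\[(\w+)\] \[([^\]]+)\] \[([^\]]+)\] (\S+): (.*)""".r

  // Returns None for lines that do not match the assumed layout.
  def parse(line: String): Option[LogEvent] = line match {
    case Line(level, time, thread, logger, message) =>
      Some(LogEvent(level, time, thread, logger, message))
    case _ => None
  }
}
```

Returning an `Option` lets a downstream job skip or reroute lines it cannot parse instead of failing on them.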
streams
High-throughput: millions of events/sec
High-volume: TBs - PBs of events
Low-latency: single-digit msec from producer to consumer
Scalable: topics are partitioned across cluster
Durable: topics are replicated across cluster
Available: auto failover
amTask:

    import org.apache.samza.system.IncomingMessageEnvelope
    import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

    class MyTask extends StreamTask {
      override def process(
          envelope: IncomingMessageEnvelope,
          collector: MessageCollector,
          coordinator: TaskCoordinator): Unit = {
        // process message in envelope
      }
    }

2) my-task.properties config file

    job.factory.class=org.apache.samza.job.local.ThreadJobFactory
    job.name=my-task
    task.class=com.banno.MyTask
    ...
event into that aggregation
Output aggregated values as events to new stream
What happens if job stops? Crash, deploy, ...
Can't lose state!
Samza handles this all for you

    SELECT COUNT(*) FROM statuses;
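One way to picture how a running count can survive a crash or deploy is a local store backed by a replayable changelog. The sketch below is a toy model in plain Scala, not Samza's API: `Changelog` and `CountingTask` are hypothetical names, and the in-memory buffer stands in for a durable stream such as a Kafka topic.

```scala
import scala.collection.mutable

// Toy stand-in for a durable changelog (assumption: a real job would use a replicated stream).
final class Changelog {
  private val entries = mutable.ArrayBuffer.empty[(String, Long)]
  def append(key: String, value: Long): Unit = entries += ((key, value))
  def replay: Seq[(String, Long)] = entries.toSeq
}

// Counts every event it sees, like SELECT COUNT(*) FROM statuses.
final class CountingTask(changelog: Changelog) {
  private val store = mutable.Map.empty[String, Long]

  // On startup, rebuild local state by replaying the changelog;
  // later entries overwrite earlier ones for the same key.
  changelog.replay.foreach { case (key, value) => store.update(key, value) }

  def process(event: String): Long = {
    val next = store.getOrElse("count", 0L) + 1
    store.update("count", next)
    changelog.append("count", next) // record the new value so a restart can recover it
    next
  }

  def count: Long = store.getOrElse("count", 0L)
}
```

A second `CountingTask` created with the same `Changelog` starts from the count the first one reached, which is the behavior the slide asks for when a job stops and restarts.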
Output statuses by user (map)
Count statuses per user (reduce)
Output: (user, count)
Could use as input to job that sorts by count (most active users)

    SELECT user_id, COUNT(user_id)
    FROM statuses
    GROUP BY user_id;

    SELECT user_id, COUNT(user_id)
    FROM statuses
    GROUP BY user_id
    ORDER BY COUNT(user_id) DESC
    LIMIT 5;
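The two queries above can be mirrored on an in-memory batch with ordinary Scala collections. This is only a sketch of the map/reduce shape, not a streaming job; `Status`, `StatusCounts`, and the tie-breaking rule (lower user id first) are assumptions made for illustration.

```scala
// Hypothetical status event; only the fields the queries need.
final case class Status(userId: Long, text: String)

object StatusCounts {
  // GROUP BY user_id with COUNT: group statuses by user, then count each group.
  def countByUser(statuses: Seq[Status]): Map[Long, Int] =
    statuses.groupBy(_.userId).map { case (userId, group) => userId -> group.size }

  // ORDER BY count DESC LIMIT n; ties are broken by user id (an assumption).
  def mostActive(counts: Map[Long, Int], n: Int = 5): Seq[(Long, Int)] =
    counts.toSeq.sortBy { case (userId, count) => (-count, userId) }.take(n)
}
```

In a streaming setting the first step would update counts incrementally per event, and the sort would run as a separate job consuming the `(user, count)` output, as the slide suggests.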
impressions + ad clicks
Stream-Table join: page views + user zip code
Table-Table join: user data + user settings
Joins involving tables need DB changelog

    SELECT u.username, s.text
    FROM statuses s
    JOIN users u ON u.id = s.user_id;
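A stream-table join like the query above can be sketched as a task that consumes two inputs: the table's changelog, which keeps a local copy of `users` up to date, and the `statuses` stream, which is looked up against that copy. The names `UserChange`, `StatusUpdate`, and `StatusUserJoin` are hypothetical, and the in-memory map stands in for whatever local store a real job would use.

```scala
import scala.collection.mutable

// One row change from the users table, as it might appear in a DB changelog (assumed shape).
final case class UserChange(id: Long, username: String)
final case class StatusUpdate(userId: Long, text: String)

final class StatusUserJoin {
  // Local copy of the users table, maintained from the changelog.
  private val users = mutable.Map.empty[Long, String]

  // Apply a changelog entry: insert or overwrite the user's row.
  def onUserChange(change: UserChange): Unit =
    users.update(change.id, change.username)

  // Inner-join semantics: emit (username, text) only if the user is known.
  def onStatus(status: StatusUpdate): Option[(String, String)] =
    users.get(status.userId).map(username => (username, status.text))
}
```

Because the lookup is local, each status is joined without a remote database call; the cost is that results reflect whatever changelog entries have been applied so far.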
for-all-time)
Enrich tweets with weather at current location
Most active users, locations, etc
Emojis: % of tweets that contain, top emojis
Hashtags: % of tweets that contain, top #hashtags
URLs: % of tweets that contain, top domains
Photo URLs: % of tweets that contain, top domains
Text analysis: sentiment, spam