Building data pipelines: from simple to more advanced - hands-on experience / CrunchConf - Oct 29, 2015

Building data pipelines: from simple to more advanced - hands-on
Sergii Khomenko, Data Scientist
[email protected], @lc0d3r
CrunchConf - October 29, 2015

Sergii Khomenko
Data scientist at one of the biggest fashion communities, Stylight. Data analysis and visualisation hobbyist, working on problems not only at work but also in free time, for fun and for personal data visualisations. Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014, Puppet Camp London, Berlin Buzzwords 2015, Tableau Conference on Tour and Budapest BI Forum 2015.

Stylight – Make Style Happen
• Profitable Leads: Stylight provides its partners with high-quality leads, enabling partner shops to leverage Stylight as an ROI-positive traffic channel.
• Inspiration: Stylight offers shoppable inspiration that makes it easy to know what to buy and how to style it.
• Branding & Reach: Stylight offers a unique opportunity for brands to reach an audience that is actively looking for style online.
• Shopping: Stylight helps users search and shop fashion and lifestyle products smarter across hundreds of shops.
• Core Target Group: Stylight helps aspiring women between 18 and 35 to evolve their style through shoppable inspiration.

Stylight – acting on a global scale

Experienced & Ambitious Team
Innovative cross-functional organisation with flat hierarchy builds a unique team spirit.
• +200 employees
• 40 PhDs/Engineers
• 28 years average age
• 63% female
• 23 nationalities
• 0 suits

Agenda
• The Good, The Bad And The Legacy
• Open Source stack
• Amazon AWS
• Google Cloud
• Tips, tricks and best practices

In computing, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one.
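The definition above can be sketched in a few lines of Python. This is only an illustration of "elements connected in series"; the stage names (parse, enrich, to_row) and the sample input are hypothetical and do not come from the talk.

```python
from functools import reduce
from typing import Any, Callable

Stage = Callable[[Any], Any]


def build_pipeline(*stages: Stage) -> Stage:
    """Connect stages in series: each stage's output is the next stage's input."""
    def run(value: Any) -> Any:
        return reduce(lambda acc, stage: stage(acc), stages, value)
    return run


# Hypothetical stages, for illustration only.
def parse(line: str) -> dict:
    user, action = line.strip().split(",")
    return {"user": user, "action": action}


def enrich(event: dict) -> dict:
    return {**event, "is_purchase": event["action"] == "buy"}


def to_row(event: dict) -> tuple:
    return (event["user"], event["action"], event["is_purchase"])


pipeline = build_pipeline(parse, enrich, to_row)
rows = [pipeline(line) for line in ["anna,view", "bob,buy"]]
```

Adding a new processing step then means adding one more function to the build_pipeline call, as long as it accepts the previous stage's output.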

The Good, The Bad And The Legacy

Sources of data:
• Web tracking
• Metrics tracking
• Behaviour tracking
• Business intelligence ETL
• Internal Services
• ML tagging service

Access patterns
• Real-time
• Nearly real-time
• Daily batches

Properties
• Data consistency
• Doesn’t scale
• Hard to add new sources
• Complex system
• Many interfaces
• As lean and legacy as possible
• No need for special services

Streaming

Open Source Stack

http://lambda-architecture.net/
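The slide only links to the lambda architecture site, so the following is a minimal sketch of the general idea rather than the setup used at Stylight: a batch view recomputed from the full dataset, a real-time view over events the batch has not yet covered, and a query that merges both. The event shape and the page-count metric are assumptions made for illustration.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute the view from the full master dataset."""
    return Counter(event["page"] for event in master_events)


def realtime_view(recent_events):
    """Speed layer: cover only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, realtime):
    """Serving side: merge the batch view and the real-time view."""
    return batch[page] + realtime[page]


# Hypothetical events, for illustration only.
master = [{"page": "/dresses"}, {"page": "/shoes"}, {"page": "/dresses"}]
recent = [{"page": "/dresses"}]

batch = batch_view(master)
realtime = realtime_view(recent)
dress_views = query("/dresses", batch, realtime)
```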

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
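To make the "commit log" idea concrete, here is a toy in-memory sketch. It is not Kafka's API and it is not distributed; the class and method names are hypothetical. It only shows the core property: records are appended once, and each consumer group tracks its own read offset, so several consumers can read the same log independently.

```python
from collections import defaultdict


class CommitLog:
    """Toy append-only log, a stand-in for a single topic partition."""

    def __init__(self):
        self._records = []
        # Consumer group name -> offset of the next record to read.
        self._offsets = defaultdict(int)

    def append(self, record) -> int:
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> list:
        """Return the next records for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = CommitLog()
for event in ["view", "click", "buy"]:
    log.append(event)

bi_batch = log.poll("bi")                 # one group reads everything at once
ml_first = log.poll("ml", max_records=2)  # another group reads at its own pace
ml_rest = log.poll("ml")
```

Because reading does not remove records, adding a new consumer does not affect existing ones.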

http://www.ipponusa.com/wp-content/uploads/2014/10/spark-architecture.jpg

Results
• Scalable
• Flexible
• High costs of maintenance
• Not so easy to set up

A programming language is low level when its programs require attention to the irrelevant.
Alan Jay Perlis / Epigrams on Programming

Amazon AWS

Kinesis Streams

Diagram labels: business development & finance, website events, enrichment, Business Intelligence

Kinesis Firehose, Kinesis Analytics

Diagram labels: custom unification pipeline, Product Processing, Business Intelligence, ML/Tagging, Product events, variety of event types and structures
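One way to picture a unification step for a variety of event types and structures is a set of per-source normalizers that map each raw payload onto one shared schema. The sketch below is an assumption-laden illustration: the UnifiedEvent fields, the source names and the raw payload shapes are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedEvent:
    """Hypothetical common schema shared by all downstream consumers."""
    source: str
    event_type: str
    product_id: str


# Hypothetical raw shapes; real payloads would differ.
def from_web(raw: dict) -> UnifiedEvent:
    return UnifiedEvent("web", raw["type"], str(raw["productId"]))


def from_tagging(raw: dict) -> UnifiedEvent:
    return UnifiedEvent("ml_tagging", "tagged", str(raw["product"]["id"]))


NORMALIZERS = {"web": from_web, "ml_tagging": from_tagging}


def unify(source: str, raw: dict) -> UnifiedEvent:
    """Route a raw event to the normalizer registered for its source."""
    try:
        normalizer = NORMALIZERS[source]
    except KeyError:
        raise ValueError(f"no normalizer registered for source {source!r}")
    return normalizer(raw)


events = [
    unify("web", {"type": "click", "productId": 42}),
    unify("ml_tagging", {"product": {"id": "42"}, "tags": ["dress"]}),
]
```

Supporting a new source then means registering one more normalizer, while consumers keep reading the same schema.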

AWS Data Pipeline

Google Cloud

Tips, tricks and best practices

Cross-Functional Team
• Department: mission-oriented team with all resources and the least dependencies
• Product Team: builds the software the department or its customers use
• Squad: team that executes the product development
Org chart labels: Department, Product Team, Squad, PO, Engineer, Engineer, Designer, Data Scientist, Head of, Business Role, Business Role

Cross-Functional Team
• You build it - you run it
• You check your numbers (domain knowledge)
• You provide your data as interface layer
• Data report comes after data tracking

I think that it's extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun.
Alan Jay Perlis / The Structure and Interpretation of Computer Programs

www.stylight.com [email protected] @lc0d3r

Related talks
• Helping Data Teams with Puppet / Puppet Camp London
• Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift / Tableau Conference on Tour - Berlin
• Google Cloud Dataflow: Two Worlds Become a Much Better One
