

[Sergii Khomenko] Building data pipelines: from simple to more advanced - hands-on experience

Presentation from GDG DevFest Ukraine 2016.
Learn more at: https://devfest.gdg.org.ua

Google Developers Group Lviv

September 10, 2016


Transcript

  1. Building data pipelines: from simple to more advanced - hands-on experience
    Sergii Khomenko, Lead Data Scientist
    [email protected], @lc0d3r
    GDG DevFest Ukraine 2016 - September 10, 2016
  2. Sergii Khomenko
    Lead Data Scientist at one of the biggest fashion communities, Stylight. Data analysis and visualisation hobbyist, working on problems not only during working hours but also in free time, for fun and for personal data visualisations. Originally from a computer engineering background. GDG Munich Gophers Founder. Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014, Puppet Camp London 2015, Berlin Buzzwords 2015, Tableau Conference on Tour 2015, Budapest BI Forum 2015, Crunchsconf 2015, FOSDEM 2016, PyData Amsterdam 2016, Munich Applied R User Group, etc.
  3. Stylight – Make Style Happen
    • Profitable Leads: Stylight provides its partners with high-quality leads, enabling partner shops to leverage Stylight as an ROI-positive traffic channel.
    • Inspiration: Stylight offers shoppable inspiration that makes it easy to know what to buy and how to style it.
    • Branding & Reach: Stylight offers a unique opportunity for brands to reach an audience that is actively looking for style online.
    • Shopping: Stylight helps users search and shop fashion and lifestyle products smarter across hundreds of shops.
    • Core Target Group: Stylight helps aspiring women between 18 and 35 to evolve their style through shoppable inspiration.
  4. Experienced & Ambitious Team
    Innovative cross-functional organisation with flat hierarchy builds a unique team spirit.
    • ~200 employees
    • 40 Engineers/PhDs
    • 28 years average age
    • 63% female
    • 23 nationalities
    • 0 suits
  5. Agenda
    • The Good, The Bad And The Legacy
    • Open Source Stack
    • Amazon AWS
    • Google Cloud
    • Tips, tricks and best practices
  6. In computing, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one.
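The pipeline definition on this slide can be illustrated with a minimal Python sketch. It is not taken from the talk: the stage names (`parse_events`, `keep_clicks`, `count_by_user`) and the `user,event` line format are hypothetical, chosen only to show elements connected in series.

```python
from functools import reduce
from typing import Any, Callable

Stage = Callable[[Any], Any]


def make_pipeline(*stages: Stage) -> Stage:
    """Connect stages in series: each stage's output is the next stage's input."""
    def run(value: Any) -> Any:
        return reduce(lambda acc, stage: stage(acc), stages, value)
    return run


# Hypothetical stages for a tiny tracking-event pipeline (illustration only).
def parse_events(lines):
    # Turn "user,event" lines into [user, event] pairs, skipping blank lines.
    return [line.strip().split(",") for line in lines if line.strip()]


def keep_clicks(events):
    # Keep only events whose type is "click".
    return [event for event in events if event[1] == "click"]


def count_by_user(events):
    # Count the remaining events per user.
    counts = {}
    for user, _ in events:
        counts[user] = counts.get(user, 0) + 1
    return counts


pipeline = make_pipeline(parse_events, keep_clicks, count_by_user)
result = pipeline(["u1,click", "u2,view", "u1,click", ""])
print(result)
```

Because every stage only depends on the output of the previous one, a stage can be replaced or a new one added without touching the others.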
  7. Sources of data:
    • Web tracking
    • Metrics tracking
    • Behaviour tracking
    • Business intelligence ETL
    • Internal Services
    • ML tagging service
  8. 11

  9. 12

  10. Properties
    • Data consistency
    • Doesn’t scale
    • Hard to add new sources
    • Complex system
    • Many interfaces
    • As lean and legacy as possible
    • No need for special services
  11. 14

  12. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
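To make the "commit log" idea on this slide concrete, here is a toy in-memory sketch in Python. It does not use the Kafka client API and is not from the talk: the `CommitLog` class and its `append` and `poll` methods are hypothetical names. It only shows the core idea that records are appended in order and each consumer tracks its own read position (offset). Real Kafka additionally partitions, replicates and persists the log.

```python
from collections import defaultdict
from typing import Any, List


class CommitLog:
    """Toy append-only log with per-consumer offsets (illustration only)."""

    def __init__(self) -> None:
        self._records: List[Any] = []
        self._offsets = defaultdict(int)  # consumer name -> next offset to read

    def append(self, record: Any) -> int:
        """Append a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[Any]:
        """Return the next unread records for a consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
for event in ["page_view", "click", "purchase"]:
    log.append(event)

# Two independent subscribers read the same log at their own pace.
analytics_first = log.poll("analytics", max_records=2)
analytics_rest = log.poll("analytics")
billing_all = log.poll("billing")
print(analytics_first, analytics_rest, billing_all)
```

Reading does not remove records from the log, so a new consumer can be added later and still replay everything from the first offset.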
  13. 19

  14. 20

  15. 21

  16. Results
    • Scalable
    • Flexible
    • High maintenance costs
    • Not so easy to set up
  17. A programming language is low level when its programs require attention to the irrelevant.
    Alan Jay Perlis / Epigrams on Programming
  18. 28

  19. 29

  20. 34

  21. 35

  22. 37

  23. 38

  24. 40

  25. 42

  26. 44

  27. 45

  28. 46

  29. 47

  30. 51

  31. 52

  32. Cross-Functional Team
    • Department: mission-oriented team with all resources and the fewest dependencies
    • Product Team: builds the software the department or its customers use
    • Squad: team that executes the product development
    [Diagram: Department, Product Team, Squad; roles: PO, Engineer, Engineer, Designer, Data Scientist, Head of Business Role Business Role]
  33. 54

  34. Cross-Functional Team
    • You build it - you run it
    • You check your numbers (domain knowledge)
    • You provide your data as interface layer
    • Data report comes after data tracking
    [Diagram: Department, Product Team, Squad; roles: PO, Engineer, Engineer, Designer, Data Scientist, Head of Business Role Business Role]
  35. 56

  36. 57

  37. 58

  38. 59

  39. I think that it's extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun.
    Alan Jay Perlis / The Structure and Interpretation of Computer Programs
  40. Related talks
    • Helping Data Teams with Puppet / Puppet Camp London
    • Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift / Tableau Conference on Tour - Berlin
    • Scaling up Business Intelligence from the scratch and to 15 countries worldwide
    • Handle your Lambdas: From event-based processing to Continuous Integration
  41. Related talks
    • R Use-Cases @ Stylight - deploy, scale, enjoy! - R edition
    • Event data pipelines - Some emerging best practices in event data processing