Data Analytics Infrastructure - DSSG 2015-11-23

Dat Le
November 23, 2015

Data Science Singapore meetup 23 Nov 2015


Transcript

  1. Data Analytics Infrastructure
    Le Nguyen The Dat, @lenguyenthedat
    Data Science SG, Nov 2015 Meetup
  2. Background: ZALORA Group (2013 – 2014)
    o  Biggest online fashion retailer in South East Asia
    o  Data Infrastructure & Data Science
  3. Background: Commercialize.TV (2015 – )
    o  Multi-channel media network focusing on Chinese audiences
    o  Data Infrastructure & Insights
  4. Challenges
    No central data source:
    o  Data stored in multiple locations
    o  Unclear ownership
    Data definition and quality:
    o  Little to no documentation
    o  Different formulas and rules owned by different departments
    o  Always dirty no matter what
    Reporting – Descriptive analytics:
    o  Immediate needs, automation
    o  Important to do it right (and quickly!)
  5. Data Warehouse

  6. Database Technologies
    SQL – Relational Databases:
    o  MySQL, PostgreSQL
    o  MS SQL Server, Oracle SQL
    NoSQL:
    o  Redis
    o  Cassandra
    o  MongoDB
    o  DynamoDB (AWS)
    o  RethinkDB
    MapReduce ecosystem:
    o  Hadoop: HDFS – Pig – Hive – HBase
    o  Spark: RDD – Spork – Shark (Spark SQL) – HBase-Spark
    Massively Parallel Processing (MPP):
    o  Vertica (HP) – Greenplum (EMC) – Netezza (IBM)
    o  ParAccel (Amazon Redshift)
  7. Database Technologies
    SQL – Relational Databases: ✔
    o  MySQL, PostgreSQL
    o  MS SQL Server, Oracle SQL
    NoSQL: (?)
    o  Redis ✖
    o  Cassandra ✔
    o  MongoDB ✖
    o  DynamoDB (AWS) ✖
    o  Neo4j ✖
    MapReduce ecosystem: ✔
    o  Hadoop: HDFS – Pig – Hive – HBase
    o  Spark: RDD – Spork – Shark (Spark SQL) – HBase-Spark
    Massively Parallel Processing (MPP): ✔
    o  Vertica (HP) – Greenplum (EMC) – Netezza (IBM)
    o  ParAccel (Amazon Redshift)
  8. Data Warehouse
    Amazon Redshift – aws.amazon.com/redshift
    o  Cloud-based, fully managed
    o  SQL (based on PostgreSQL 8.0.2)
    o  On-demand ($2000/year)
    o  Scalable (petabyte-scale)
    o  FAST! amplab.cs.berkeley.edu/benchmark/
    All in ONE place!
    o  Product information
    o  Customer information
    o  Tracking data
    o  External data sources (social media, 3rd-party datasets)
  9. Extract-Transform-Load (ETL)

  10. ETL
    Amazon Redshift’s COPY command: docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
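
As a rough illustration of the COPY-based load named on this slide, the sketch below only assembles the SQL text for loading gzipped, delimited files from S3. The table name, bucket path, and IAM role ARN are placeholders invented for the example, and the chosen options (`DELIMITER`, `IGNOREHEADER`, `GZIP`) are just one plausible combination; the linked COPY reference is the authority on what a given load needs.

```python
def build_copy_statement(table, s3_path, iam_role, delimiter=",", ignore_header=1):
    """Return the text of a Redshift COPY statement for gzipped delimited files.

    All arguments are interpolated directly into the SQL, so this helper is
    only suitable for trusted, hard-coded values (it is a sketch, not a
    hardened query builder).
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"DELIMITER '{delimiter}'\n"
        f"IGNOREHEADER {ignore_header}\n"
        f"GZIP;"
    )


if __name__ == "__main__":
    # Hypothetical table, bucket and role, used purely for illustration.
    print(build_copy_statement(
        "sales",
        "s3://example-bucket/sales/2015-11-23/",
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    ))
```

The resulting string would then be executed over an ordinary SQL connection to the cluster, so the load itself runs inside Redshift rather than in the client script.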

  11. ETL
    Custom made:
    o  Simple bash scripts, Python, SQL
    o  Use cases:
        •  Scrapers
        •  Excel / CSV imports
    Data Pipeline Frameworks:
    o  Large scale, more complicated
    o  Examples:
        •  Spotify’s Luigi – github.com/spotify/luigi
        •  Yelp’s Mycroft – github.com/Yelp/mycroft
    3rd Party Services:
    o  aws.amazon.com/redshift/partners
        •  FlyData
        •  RJMetrics
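
A custom-made CSV import of the kind listed above can be quite small. The following is a minimal sketch, assuming a two-column file (`sku`, `price`) and using an in-memory SQLite database as a stand-in for the warehouse so that it runs anywhere; the column names, table name, and cleaning rules are all invented for the example.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Parse CSV text (with a header row) into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Normalise SKUs and drop rows whose fields cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            sku = row["sku"].strip().upper()
            price = float(row["price"])
        except (KeyError, AttributeError, TypeError, ValueError):
            continue  # skip dirty rows instead of failing the whole import
        cleaned.append((sku, price))
    return cleaned


def load(conn, records):
    """Upsert the cleaned records into a hypothetical products table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (sku TEXT PRIMARY KEY, price REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO products (sku, price) VALUES (?, ?)", records
    )
    conn.commit()


def run_etl(csv_text, conn):
    """Run extract, transform and load; return the number of rows loaded."""
    records = transform(extract(csv_text))
    load(conn, records)
    return len(records)


if __name__ == "__main__":
    sample = "sku,price\n abc1 ,10.5\nxyz2,not-a-number\ndef3,7\n"
    connection = sqlite3.connect(":memory:")
    print(run_etl(sample, connection), "rows loaded")
```

Keeping the three steps as separate functions is a design choice that makes it easier to move the same logic into a pipeline framework later, where each step would become its own task.
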
  12. Applications

  13. Self-service Dashboards
    Intermediate users
    o  SQL / Excel / WYSIWYG
    Tools
    o  Re:dash – www.redash.io
        •  Open source: github.com/getredash/redash
        •  Try it out: demo.redash.io
        •  Self-managed deployment:
            o  Docker
            o  Pre-baked AMI (Amazon Web Services)
            o  Google Cloud Images
        •  Supports lots of database types (Redshift, MySQL, PostgreSQL, BigQuery, MongoDB…)
        •  Users need to know SQL
        •  Web-based, collaborative
  14. Demo: re:dash data sources usage http://demo.redash.io/queries/756

  15. Demo: NYC Taxis Tip Amounts http://demo.redash.io/queries/753

  16. Self-service Dashboards
    Intermediate users
    o  SQL / Excel / WYSIWYG
    Tools
    o  Tableau – www.tableau.com
        •  Licensed software (14-day trial)
        •  Tableau Public (free: public.tableau.com)
        •  Self-hosted or Tableau-hosted (fully managed)
        •  Supports a lot more database types
        •  Group and user management with customized access rights
        •  Drag-and-drop desktop software as well as web-based
  17. Demo: Social media dashboard
    Baidu → Import.io API → Data Warehouse → Tableau
  18. Demo: Market Research - video platform performance
    SimilarWeb API → Data Warehouse → Tableau
  19. Others (Tableau Public):
    •  SEA Games result history: tiny.cc/seagames
    •  Rakuten – Viki data challenge: tiny.cc/viki-viz
  20. Advanced applications
    Advanced users
    o  Data Warehouse connection (JDBC - PostgreSQL)
    o  Automated, highly customized reports
    o  Data Science:
        •  Recommendation engine
        •  Predictive modeling
        •  Classification
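
To make the "automated report over a warehouse connection" idea concrete, here is a small sketch that runs a fixed SQL query on any Python DB-API connection and formats the result as CSV text. Against Redshift one would pass a connection from a PostgreSQL-compatible driver (psycopg2 is one common choice, which is an assumption here and not something the slides specify); the demo uses in-memory SQLite so it runs without a cluster. The `orders` table and its columns are hypothetical.

```python
import sqlite3

# Hypothetical report: order count and revenue per product category.
REPORT_SQL = """
    SELECT category, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY category
    ORDER BY revenue DESC
"""


def build_report(conn):
    """Run REPORT_SQL on a DB-API connection and return CSV-formatted text."""
    cur = conn.cursor()
    cur.execute(REPORT_SQL)
    lines = ["category,orders,revenue"]
    for category, orders, revenue in cur.fetchall():
        lines.append(f"{category},{orders},{revenue:.2f}")
    return "\n".join(lines)


def demo_connection():
    """Create an in-memory SQLite database that stands in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("dresses", 40.0), ("shoes", 25.5), ("dresses", 60.0)],
    )
    return conn


if __name__ == "__main__":
    print(build_report(demo_connection()))
```

Scheduling a script like this (for example from cron) and sending its output by email or to a shared folder is one simple way such a report could be automated.
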
  21. Advanced applications
    Internal reporting tool
    Data Warehouse → SQL, Python (Django), JS → Product-Finder
    [ Black dress | SKU or ID | Tiffany | Atmosphere ]
    Sales Info, Tracking Info, Product Info
  22. Advanced applications
    Recommendation Engine
    Data Warehouse → SQL, Python, Haskell → ZALORA Website
    Similar products
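
The slides do not say how the "similar products" lists were computed. Purely as an illustration of one common approach, the sketch below ranks products by the cosine similarity of their co-occurrence in purchase baskets; the basket data and function name are invented, and this should not be read as the method used at ZALORA.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def similar_products(baskets, top_n=3):
    """Map each product to its top_n most similar products.

    Similarity is the cosine between binary basket-membership vectors:
    co-occurrence count divided by the square root of the product of the
    two items' basket counts.
    """
    item_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the pair ordering
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(items, 2):
            pair_counts[(a, b)] += 1

    scores = defaultdict(list)
    for (a, b), together in pair_counts.items():
        sim = together / sqrt(item_counts[a] * item_counts[b])
        scores[a].append((b, sim))
        scores[b].append((a, sim))

    # Sort by descending similarity, breaking ties alphabetically.
    return {
        item: [p for p, _ in sorted(cands, key=lambda x: (-x[1], x[0]))[:top_n]]
        for item, cands in scores.items()
    }


if __name__ == "__main__":
    demo_baskets = [
        ["dress", "heels"],
        ["dress", "heels", "clutch"],
        ["dress", "sneakers"],
        ["jeans", "sneakers"],
    ]
    for product, neighbours in sorted(similar_products(demo_baskets).items()):
        print(product, "->", neighbours)
```

In a warehouse-backed setup the baskets would typically come from a SQL query over order data, with only the similarity step done in application code.
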
  23. Conclusions

  24. Team & Technology stack
    o  Small team of 1-4 programmers
    o  Amazon Web Services
        •  No upfront cost
        •  Low maintenance
        •  Scalability
        •  Integrations
    o  Shell scripts, Python, Haskell, D3.js
    o  Unix, open-source technologies
  25. Takeaways
    o  (Good) data infrastructure is important:
        •  Build it first (before you hire a data scientist!)
        •  Build it right: stable – fast – scalable.
    o  There is no silver bullet:
        •  Understand what you need
        •  Always do more research
    o  Data infrastructure is NOT that hard!
        •  Utilize existing, modern technologies
        •  Avoid old, proprietary technologies that were built for the 90s!
  26. References
    Engineering Blogs and Tutorials
    o  https://blog.asana.com/2014/11/stable-accessible-data-infrastructure-startup/
    o  https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83
    o  https://engineering.pinterest.com/blog/powering-interactive-data-analysis-redshift
    o  http://engineering.ifttt.com/data/2015/10/14/data-infrastructure/
    o  https://blog.rjmetrics.com/2015/10/15/the-data-infrastructure-meta-analysis-how-top-engineering-organizations-built-their-big-data-stacks/
    o  https://www.youtube.com/watch?v=reQtXquDpzo
    o  https://www.periscopedata.com/amazon-redshift-guide
    Benchmarks
    o  https://amplab.cs.berkeley.edu/benchmark/
    o  https://www.flydata.com/blog/with-amazon-redshift-ssd-querying-a-tb-of-data-took-less-than-10-seconds/
    o  https://www.flydata.com/blog/hive-and-redshift-a-brief-comparison/
  27. Thank you!