Data Science at honestbee - DSSG 2016-10-24

Dat Le
October 24, 2016

Transcript

  1. data science at honestbee
    Dat Le @lenguyenthedat
    24th Oct 2016 - Data Science SG
  2. honestbee

  3. What is honestbee?
    • Full-service online grocery + laundry delivery company
    • Singapore - Hong Kong - Taiwan - Japan
    • Malaysia - Philippines - Indonesia - Thailand
    • Wide range of supermarkets and boutique stores
    • Referral: GIVE $20 GET $10 honestbee https://honestbee.sg/r/DATL8886 Let me know!
  4. Data Science

  5. Data Science
    Predictive models
    • Item availability predictions
    • Customer lifetime value / customer profitability grading
    • Customer demand forecast & trending
    Recommendation engines
    • Item-based recommendations
    • CRM campaign recommendations
    Clustering analysis, data mining
    • Customer Segmentation (profiling, 360 view, clustering)
    Operational optimizations
    • Task scheduling
    • Route optimization
  6. Item Availability Prediction

  7. Item Availability Prediction
    What?
    • Item not available at the store!
    • We don’t know until the bee is picking the item
    Why?
    • Customer happiness
    • Business profitability
    How?
    • Predictive Model (Binary Classification)
    • Communicate with our customers before they even make a purchase
  8. Item Availability Prediction
    Features
    • Date of delivery (day of week, time slot)
    • Product metadata (brand, name, category, price, discount)
    • Store metadata (store type, location)
    • External data (weather, public holidays, promotion periods, financial data: STI, inflation rate, unemployment rate)
    • Ground truth label (Available vs Out of Stock)
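
As a rough illustration of the feature list above, the sketch below flattens one record into a feature dictionary. The function name, field names, and sample values are all hypothetical; the slides do not show the actual schema, and only a subset of the listed features is included.

```python
from datetime import datetime


def build_features(delivery_time, product, store, is_public_holiday):
    """Flatten one (item, store, delivery slot) record into a feature dict.

    All field names here are illustrative assumptions, not honestbee's schema.
    """
    return {
        "day_of_week": delivery_time.weekday(),  # 0 = Monday ... 6 = Sunday
        "time_slot": delivery_time.hour,         # delivery slot, by starting hour
        "brand": product["brand"],
        "category": product["category"],
        "price": product["price"],
        "discount": product["discount"],
        "store_type": store["type"],
        "store_location": store["location"],
        "is_public_holiday": int(is_public_holiday),
    }


# Hypothetical record: a snack item delivered on 24 Oct 2016 at 18:00.
example = build_features(
    datetime(2016, 10, 24, 18, 0),
    {"brand": "ExampleBrand", "category": "snacks", "price": 3.5, "discount": 0.0},
    {"type": "supermarket", "location": "central"},
    is_public_holiday=False,
)
```

Categorical values such as brand or store type would still need encoding (for example one-hot) before being passed to a tree-based model.
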
  9. Item Availability Prediction
    Algorithm: XGBoost (https://github.com/dmlc/xgboost)
    • Decision tree based Gradient Boosting Machine
    • Available in Python, R, and Julia
    • State-of-the-art, winning algorithm in many Kaggle data science challenges:
      • 1st @ Crowdflower Search Results Relevance
      • 1st @ Microsoft Malware Classification Challenge (BIG 2015)
      • 1st @ Tradeshift Text Classification
      • 1st @ Otto Group Product Classification
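
To make "decision tree based gradient boosting" concrete, here is a toy sketch in plain Python. It is not XGBoost and not the model used at honestbee: it boosts depth-1 trees (stumps) on the first-order gradient of the log loss only, whereas XGBoost also uses second-order information and regularization. The data (`[is_weekend, price]` rows with label 1 meaning out of stock) is made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the (feature, threshold) split minimising squared error on residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # not a real split
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lmean, rmean)
    if best is None:  # all rows identical: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_gbm(X, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the negative gradient of the log loss (y - p)."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row) for s, row in zip(scores, X)]
    return stumps, learning_rate


def predict_proba(model, row):
    stumps, learning_rate = model
    return sigmoid(sum(learning_rate * stump_predict(s, row) for s in stumps))


# Made-up toy data: [is_weekend, price]; label 1 = out of stock.
X = [[0, 1.0], [0, 2.5], [0, 4.0], [1, 1.5], [1, 3.0], [1, 5.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gbm(X, y)
p_weekend = predict_proba(model, [1, 2.0])
p_weekday = predict_proba(model, [0, 2.0])
```

In this toy data the label depends only on `is_weekend`, so the predicted out-of-stock probability rises above 0.5 for weekend rows and falls below 0.5 for weekday rows.
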
  10. Item Availability Prediction
    Evaluation metrics: AUC (Area Under the ROC Curve) score
    • http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
    • AUC vs ACC: http://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy
    • Less affected by highly skewed datasets than accuracy
    • AUC score ranges:
      • 0.5-0.6 (Fail)
      • 0.6-0.7 (Poor)
      • 0.7-0.8 (Fair)
      • 0.8 (Good) - 1.0 (Perfect)
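
The point about skewed datasets can be checked with a small sketch. The helper below computes ROC AUC in plain Python as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count as half); in practice a library routine would be used instead. The 95/5 class split is an invented example, not honestbee data.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as P(score of random positive > score of random negative)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    return sum(int(p == t) for p, t in zip(y_pred, y_true)) / len(y_true)


# Invented skewed dataset: 95 "available" (0) and 5 "out of stock" (1).
y_true = [0] * 95 + [1] * 5
always_available = [0.0] * 100  # a useless model that never predicts out of stock

skewed_accuracy = accuracy(y_true, always_available)
skewed_auc = roc_auc(y_true, always_available)

# Small sanity check on four examples.
small_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The useless model reaches 0.95 accuracy on the skewed data while its AUC stays at 0.5, which falls in the "Fail" band of the scale above.
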
  11. Item Availability Prediction
    • Buy me!
    • On Production: Likely Out of Stock!
  12. Item-based Recommendation Engine

  13. Item-based Recommendation Engine
    What?
    • Recommendation Engine
    • People who bought Tortilla Chips also bought Coca Cola Zero
    Why?
    • Better User Experience
    • Increase cart size
    How?
    • Collaborative Filtering
    • Python Pandas + Jaccard Index
  14. Item-based Recommendation Engine
    Collaborative Filtering
    • traditional & popular technique used in recommendation systems
    • input: User - Item matrix
      • continuous values: User Rating (from 1* to 5*, 0% to 100%)
      • binary values: User Behavior (Purchases / Visits / Clicks)
    • 2 different methodologies: user-based and item-based recommendations
    https://buildingrecommenders.wordpress.com/
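
The binary User - Item matrix described above can be sketched as follows, using only the standard library. The purchase log, user names, and function name are hypothetical; the talk itself mentions pandas, which would typically replace the nested lists here.

```python
def build_user_item_matrix(purchases):
    """Turn (user, item) purchase events into a binary User - Item matrix.

    Returns the row labels (users), column labels (items), and the matrix,
    where matrix[r][c] is 1 if users[r] bought items[c] at least once.
    """
    users = sorted({user for user, _ in purchases})
    items = sorted({item for _, item in purchases})
    bought = set(purchases)  # duplicates collapse: behavior is binary
    matrix = [[1 if (user, item) in bought else 0 for item in items] for user in users]
    return users, items, matrix


# Hypothetical purchase log (bob bought tortilla chips twice).
purchases = [
    ("alice", "tortilla chips"),
    ("alice", "coke zero"),
    ("bob", "tortilla chips"),
    ("carol", "salsa"),
    ("bob", "tortilla chips"),
]
users, items, matrix = build_user_item_matrix(purchases)
```
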
  15. Item-based Recommendation Engine
    Collaborative Filtering
    • user-based: “users like you usually buy these”
    • works for social networks
    • works for “taste”-like recommendations (e.g. movies, fashion, social networks)
    • output: User - User matrix
    • performance scales with number of users
    • user home page, emails, in-app notification
  16. Item-based Recommendation Engine
    Collaborative Filtering
    • item-based: “users who bought X also bought Y”
    • complementary purchases (e-commerce), news suggestions
    • output: Item - Item matrix
    • performance scales with number of items
    • product page, cart page recommendations
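
A minimal sketch of the "users who bought X also bought Y" idea: count how often two items appear in the same user's purchases, then rank the other items by that count. This uses raw co-purchase counts rather than a normalized similarity score, and the purchase sets and function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(purchases_per_user):
    """Count, for every unordered item pair, how many users bought both."""
    counts = Counter()
    for purchased in purchases_per_user:
        for a, b in combinations(sorted(set(purchased)), 2):
            counts[(a, b)] += 1
    return counts


def also_bought(item, counts):
    """Items co-purchased with `item`, most frequent first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [other for other, _ in related.most_common()]


# Hypothetical purchases, one list per user.
purchases_per_user = [
    ["tortilla chips", "coke zero"],
    ["tortilla chips", "coke zero", "salsa"],
    ["salsa", "bread"],
]
counts = co_purchase_counts(purchases_per_user)
suggestions = also_bought("tortilla chips", counts)
```

The `counts` mapping is a sparse form of the Item - Item matrix: its size grows with the number of distinct item pairs, not with the number of users.
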
  17. Item-based Recommendation Engine
    Algorithm: Jaccard Index (https://en.wikipedia.org/wiki/Jaccard_index)
    • Set Theory
    • Ratio of intersection to union gives the similarity score
    • Sensitive to sparse input
    $J(v_1, v_2) = \frac{|U_1 \cap U_2|}{|U_1 \cup U_2|}$
    Example: $J = 2/6$
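
The Jaccard index can be sketched in a few lines of plain Python. The code assumes that U1 and U2 are the sets of users who bought items v1 and v2; the slide does not define them explicitly. The user ids and item names are made up, chosen so that the first pair reproduces the J = 2/6 example.

```python
def jaccard(users_a, users_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def similar_items(target, buyers_by_item):
    """Rank all other items by Jaccard similarity to `target`, highest first."""
    scores = {
        item: jaccard(buyers_by_item[target], buyers)
        for item, buyers in buyers_by_item.items()
        if item != target
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical buyers per item (assumption: U = set of users who bought the item).
buyers_by_item = {
    "tortilla chips": {"u1", "u2", "u3", "u4"},
    "coke zero": {"u3", "u4", "u5", "u6"},
    "diapers": {"u7"},
}

# 2 shared buyers out of 6 distinct buyers.
score = jaccard(buyers_by_item["tortilla chips"], buyers_by_item["coke zero"])
ranking = similar_items("tortilla chips", buyers_by_item)
```

With very few buyers per item, a single shared purchase can swing the score sharply, which is one way to read the "sensitive to sparse input" bullet.
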
  18. Item-based Recommendation Engine
    Pandas: http://pandas.pydata.org/
    • Python
    • Data Analysis toolkit
  19. Item-based Recommendation Engine On Production (soon!): Cooking ingredients!

  20. Item-based Recommendation Engine On Production (soon!): Baby products!

  21. Item-based Recommendation Engine On Production (soon!): BBQ-style Parties!

  22. Data Infrastructure

  23. Data Infrastructure

  24. Data Infrastructure
    Auto Integration & Deployment
    https://mesosphere.com/blog/2015/04/02/continuous-deployment-with-mesos-marathon-docker/

  25. Data Infrastructure
    • Platform: Amazon Web Services with EC2, S3, RDS Postgres, and Redshift
    • Application: Docker, Airflow
    • Code Review, Test and Integration: GitHub + Travis CI
    • Resource management: Apache Mesos, AWS Auto Scaling
    • Application & Discovery management: Marathon
    • Languages: Python, SQL
  26. the end