Data Science at honestbee - DSSG 2016-10-24

Dat Le
October 24, 2016

Transcript

  1. What is honestbee?
    • Full-service online grocery + laundry delivery company
    • Singapore - Hong Kong - Taiwan - Japan
    • Malaysia - Philippines - Indonesia - Thailand
    • Wide range of supermarkets and boutique stores
    • Referral: GIVE $20 GET $10 honestbee https://honestbee.sg/r/DATL8886 Let me know!
  2. Data Science
    Predictive models
    • Item availability predictions
    • Customer lifetime value / customer profitability grading
    • Customer demand forecast & trending
    Recommendation engines
    • Item-based recommendations
    • CRM campaign recommendations
    Clustering analysis, data mining
    • Customer Segmentation (profiling, 360 view, clustering)
    Operational optimizations
    • Task scheduling
    • Route optimization
  3. Item Availability Prediction
    What?
    • Item not available at the store!
    • We don’t know until the bee is picking the item
    Why?
    • Customer happiness
    • Business profitability
    How?
    • Predictive Model (Binary Classification)
    • Communicate with our customers before they even make a purchase
  4. Item Availability Prediction
    Features
    • Date of delivery (day of week, time slot)
    • Product metadata (brand, name, category, price, discount)
    • Store metadata (store type, location)
    • External data (weather, public holiday, promotion periods, financial data: STI, inflation rate, unemployment rate)
    • Ground truth (Available vs Out of Stock)
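As a rough illustration of how the feature groups above might be flattened into one training row, here is a minimal sketch. The function name, the dictionary keys, and the sample values are all hypothetical; the deck does not show the actual schema or pipeline.

```python
from datetime import datetime

def encode_order_line(delivery_time, product, store, is_public_holiday, available):
    """Turn one ordered item into a flat feature row (hypothetical schema)."""
    return {
        "day_of_week": delivery_time.weekday(),   # 0 = Monday ... 6 = Sunday
        "time_slot": delivery_time.hour,          # assumes hourly delivery slots
        "brand": product["brand"],
        "category": product["category"],
        "price": product["price"],
        "discount": product["discount"],
        "store_type": store["type"],
        "is_public_holiday": int(is_public_holiday),
        # Ground truth: 1 = available, 0 = out of stock
        "label": int(available),
    }

# Made-up example record
row = encode_order_line(
    delivery_time=datetime(2016, 10, 24, 18),
    product={"brand": "ExampleBrand", "category": "snacks", "price": 3.5, "discount": 0.0},
    store={"type": "supermarket"},
    is_public_holiday=False,
    available=True,
)
print(row)
```

Categorical fields such as brand or store type would still need a numeric encoding before being passed to a tree-based model.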
  5. Item Availability Prediction
    Algorithm: XGBoost (https://github.com/dmlc/xgboost)
    • Decision tree based Gradient Boosting Machine
    • Available in Python, R, and Julia
    • State-of-the-art, winning algorithm for lots of Kaggle’s data science challenges:
      • 1st @ Crowdflower Search Results Relevance
      • 1st @ Microsoft Malware Classification Challenge (BIG 2015)
      • 1st @ Tradeshift Text Classification
      • 1st @ Otto Group Product Classification
  6. Item Availability Prediction
    Evaluation metrics: AUC (Area Under The Curve) score
    • http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
    • AUC vs ACC http://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy
    • Not affected by highly skewed datasets
    • AUC’s score range:
      • 0.5-0.6 (Fail)
      • 0.6-0.7 (Poor)
      • 0.7-0.8 (Fair)
      • 0.8 (Good) - 1.0 (Perfect)
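To make the "AUC vs ACC" point concrete, the sketch below computes ROC AUC in plain Python as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count as half). The toy labels and scores are invented for illustration. In practice scikit-learn's `roc_auc_score` computes the same quantity.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Skewed toy data: 9 available items (1) and 1 out-of-stock item (0).
labels = [1] * 9 + [0]
constant_scores = [0.9] * 10  # a "model" that always says "available"
informative_scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.9, 0.8, 0.2]

print(accuracy(labels, constant_scores), roc_auc(labels, constant_scores))
print(accuracy(labels, informative_scores), roc_auc(labels, informative_scores))
```

On this skewed sample the constant model reaches 90% accuracy but only 0.5 AUC, which is why AUC is the more informative metric here.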
  7. Item-based Recommendation Engine
    What?
    • Recommendation Engine
    • People who bought Tortilla Chips also bought Coca Cola Zero
    Why?
    • Better User Experience
    • Increase cart size
    How?
    • Collaborative Filtering
    • Python Pandas + Jaccard Index
  8. Item-based Recommendation Engine
    Collaborative Filtering (https://buildingrecommenders.wordpress.com/)
    • traditional & popular technique used in recommendation systems
    • input: User - Item matrix
      • continuous values: User Rating (from 1* to 5*, 0% to 100%)
      • binary values: User Behavior (Purchases / Visits / Clicks)
    • 2 different methodologies: user-based and item-based recommendations
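The binary User - Item input described above can be sketched as follows. The purchase log, user ids, and function name are hypothetical. The deck mentions Pandas, but plain Python lists are used here to keep the example dependency-free.

```python
# Hypothetical purchase log: (user_id, item_name) pairs.
purchases = [
    ("u1", "tortilla chips"), ("u1", "coca cola zero"),
    ("u2", "tortilla chips"), ("u2", "coca cola zero"), ("u2", "salsa"),
    ("u3", "salsa"),
    ("u1", "tortilla chips"),  # repeat purchase: a binary matrix ignores counts
]

def build_binary_matrix(purchases):
    """Return (users, items, matrix) where matrix[r][c] is 1 if user r bought item c."""
    users = sorted({user for user, _ in purchases})
    items = sorted({item for _, item in purchases})
    bought = set(purchases)
    matrix = [[1 if (user, item) in bought else 0 for item in items] for user in users]
    return users, items, matrix

users, items, matrix = build_binary_matrix(purchases)
print(items)
for user, row in zip(users, matrix):
    print(user, row)
```

Rows are users and columns are items. An item-based approach compares columns with each other, while a user-based approach compares rows.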
  9. Item-based Recommendation Engine
    Collaborative Filtering
    • user-based: “users like you usually buy these”
    • works for social networks
    • works for “taste”-like recommendations (e.g. movies, fashions, social networks)
    • output: User - User matrix
    • performance scales with number of users
    • user home page, emails, in-app notification
  10. Item-based Recommendation Engine
    Collaborative Filtering
    • item-based: “users who bought X also bought Y”
    • complementary purchases (e-commerce), news suggestions
    • output: Item - Item matrix
    • performance scales with number of items
    • product page, cart page recommendations
  11. Item-based Recommendation Engine
    Algorithm: Jaccard Index (https://en.wikipedia.org/wiki/Jaccard_index)
    • Set Theory
    • Ratio of intersection to union gives the similarity score
    • Sensitive to sparse input
    $J(v_1, v_2) = \frac{|U_1 \cap U_2|}{|U_1 \cup U_2|}$
    Example: $J = 2/6$
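A minimal sketch of the Jaccard index applied to item-based recommendations, assuming each item is represented by the set of users who bought it (the reading of U1 and U2 used here). The item names, user ids, and helper names are made up. The first pair is chosen to reproduce the J = 2/6 example from the slide.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention for two empty sets
    return len(a & b) / len(union)

# Hypothetical data: item -> set of users who bought it.
buyers = {
    "tortilla chips": {"u1", "u2", "u3", "u4"},
    "coca cola zero": {"u3", "u4", "u5", "u6"},
    "salsa": {"u1", "u2", "u3"},
    "laundry detergent": {"u7"},
}

def similar_items(target, buyers, top_n=3):
    """Rank other items by Jaccard similarity to the target, dropping zero scores."""
    scores = [
        (other, jaccard(buyers[target], users))
        for other, users in buyers.items()
        if other != target
    ]
    scores = [(other, score) for other, score in scores if score > 0]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# 2 shared buyers out of 6 distinct buyers -> 2/6
print(jaccard(buyers["tortilla chips"], buyers["coca cola zero"]))
print(similar_items("tortilla chips", buyers))
```

Items with very few buyers, such as the single-buyer item above, produce scores that swing between 0 and high values on little evidence, which is one way the sensitivity to sparse input shows up.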
  12. Data Infrastructure
    Platform: Amazon Web Services with EC2, S3, RDS Postgres, and Redshift
    Application: Docker, Airflow
    Code Review, Test and Integration: GitHub + Travis CI
    Resource management: Apache Mesos, AWS Autoscaling
    Application & Discovery management: Marathon
    Languages: Python, SQL