
Data Science at honestbee - DSSG 2016-10-24

Dat Le
October 24, 2016

Transcript

  1. Dat Le @lenguyenthedat
    data science at honestbee
    24th Oct 2016 - Data Science SG

  2. honestbee

  3. • What is honestbee?
    • Full-service online grocery + laundry delivery company
    • Singapore - Hong Kong - Taiwan - Japan
    • Malaysia - Philippines - Indonesia - Thailand
    • Wide range of supermarkets and boutique stores
    • Referral: GIVE $20 GET $10
    https://honestbee.sg/r/DATL8886
    Let me know!

  4. Data Science

  5. Predictive models
    • Item availability predictions
    • Customer lifetime value / customer profitability grading
    • Customer demand forecast & trending
    Recommendation engines
    • Item-based recommendations
    • CRM campaign recommendations
    Clustering analysis, data mining
    • Customer Segmentation (profiling, 360 view, clustering)
    Operational optimizations
    • Task scheduling
    • Route optimization

  6. Item Availability Prediction

  7. What?
    • Item not available at the store!
    • We don’t know until the bee is picking the item
    Why?
    • Customer happiness
    • Business profitability
    How?
    • Predictive Model (Binary Classification)
    • Communicate with our customers before they even make a purchase

  8. Features
    • Date of delivery (day of week, time slot)
    • Product metadata (brand, name, category, price, discount)
    • Store metadata (store type, location)
    • External data (weather, public holiday, promotion periods, financial data: STI, inflation rate, unemployment rate)
    • Ground truth (Available vs Out of Stock)
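
The feature groups above have to become a numeric row plus a binary label before a classifier can use them. The following is a minimal sketch of that step, not honestbee's actual pipeline: the record fields (`delivery_date`, `time_slot`, `price`, `discount_pct`, `is_public_holiday`, `out_of_stock`) and the `TIME_SLOTS` vocabulary are hypothetical names chosen for illustration.

```python
from datetime import date

# Hypothetical time-slot vocabulary; the real slots are not given in the slides.
TIME_SLOTS = ["morning", "afternoon", "evening"]


def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def encode(record):
    """Turn one order-item record into (feature_vector, label).

    The label is 1 for Out of Stock and 0 for Available.
    """
    features = one_hot(record["delivery_date"].weekday(), range(7))  # day of week
    features += one_hot(record["time_slot"], TIME_SLOTS)             # time slot
    features += [
        float(record["price"]),
        float(record["discount_pct"]),
        1.0 if record["is_public_holiday"] else 0.0,
    ]
    label = 1 if record["out_of_stock"] else 0
    return features, label


# Made-up example record.
example = {
    "delivery_date": date(2016, 10, 24),  # a Monday
    "time_slot": "evening",
    "price": 3.5,
    "discount_pct": 20,
    "is_public_holiday": False,
    "out_of_stock": True,
}
features, label = encode(example)
```

Text fields such as brand or category would need the same one-hot treatment (or another categorical encoding) over their own vocabularies.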

  9. Algorithm: XGBoost (https://github.com/dmlc/xgboost)
    • Decision tree based Gradient Boosting Machine
    • Available in Python, R, and Julia
    • State-of-the-art, winning algorithm for many of Kaggle’s data science challenges:
    • 1st @ Crowdflower Search Results Relevance
    • 1st @ Microsoft Malware Classification Challenge (BIG 2015)
    • 1st @ Tradeshift Text Classification
    • 1st @ Otto Group Product Classification
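
To make "decision tree based Gradient Boosting Machine" concrete, here is a toy sketch of gradient boosting for binary classification using depth-1 trees (stumps) and log-loss. It is not XGBoost and not the model described in the talk: XGBoost adds second-order gradients, regularization, deeper trees and many engineering optimizations. The function names and the toy data are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) for the split that
    minimises squared error on the residuals, or None if no split exists."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            err = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, left_mean, right_mean)
    return None if best is None else best[1:]


def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.5):
    """Fit an additive model of stumps; y must contain both classes."""
    positive_rate = sum(y) / len(y)
    base = math.log(positive_rate / (1.0 - positive_rate))  # initial log-odds
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of log-loss with respect to the current score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, left_value, right_value = stump
        stumps.append(stump)
        scores = [s + learning_rate * (left_value if row[j] <= t else right_value)
                  for s, row in zip(scores, X)]
    return base, learning_rate, stumps


def predict_proba(model, row):
    """Predicted probability of the positive class (here: out of stock)."""
    base, learning_rate, stumps = model
    score = base
    for j, t, left_value, right_value in stumps:
        score += learning_rate * (left_value if row[j] <= t else right_value)
    return sigmoid(score)


# Toy data (made up): columns are [is_weekend, discount_pct]; label 1 = out of stock.
X = [[0, 0], [0, 10], [0, 0], [1, 30], [1, 20], [0, 30], [1, 0], [1, 25]]
y = [0, 0, 0, 1, 1, 1, 0, 1]
model = fit_boosted_stumps(X, y)
p_discounted = predict_proba(model, [1, 30])
p_regular = predict_proba(model, [0, 0])
```

Each round fits a small tree to the errors of the current ensemble and adds a damped version of it, which is the core idea that XGBoost builds on.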

  10. Evaluation metric: AUC (Area Under the ROC Curve) score
    • http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
    • AUC vs ACC (accuracy): http://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy
    • Less misleading than accuracy on highly skewed datasets
    • AUC’s score range:
    • 0.5-0.6 (Fail)
    • 0.6-0.7 (Poor)
    • 0.7-0.8 (Fair)
    • 0.8 (Good) - 1.0 (Perfect)
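
The skew argument can be checked with a few lines of plain Python. This is a sketch of ROC AUC computed from its pairwise definition (the probability that a randomly chosen positive is scored above a randomly chosen negative, ties counted as half); it is quadratic in the number of examples and meant for illustration only. The data is made up.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Skewed toy data: 95 available items (0) and 5 out-of-stock items (1).
y_true = [0] * 95 + [1] * 5

# A useless model that always says "available" with the same score.
acc_constant = accuracy(y_true, [0] * 100)    # high accuracy despite no skill
auc_constant = roc_auc(y_true, [0.1] * 100)   # chance-level AUC

# A small example with informative scores.
auc_small = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The constant model reaches 95% accuracy on this data while its AUC stays at 0.5, which is why AUC is the more informative metric when Out of Stock is the rare class.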

  11. Item Availability Prediction
    On Production:
    Buy me!
    Likely Out of Stock!

  12. Item-based Recommendation Engine

  13. What?
    • Recommendation Engine
    • People who bought Tortilla Chips also bought Coca Cola Zero
    Why?
    • Better User Experience
    • Increase cart size
    How?
    • Collaborative Filtering
    • Python Pandas + Jaccard Index

  14. Collaborative Filtering
    • traditional & popular technique used in recommendation systems
    • input: User - Item matrix
    • continuous values: User Rating (from 1* to 5*, 0% to 100%)
    • binary values: User Behavior (Purchases / Visits / Clicks)
    • 2 different methodologies: user-based and item-based recommendations
    https://buildingrecommenders.wordpress.com/
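
As an illustration of the binary-valued input, the sketch below builds a User - Item matrix from a purchase log in plain Python. The log format (a list of `(user, item)` pairs) and the helper name `build_user_item_matrix` are assumptions made for this example; the talk names Pandas as the tool used in practice.

```python
def build_user_item_matrix(purchases):
    """Build a binary User - Item matrix from (user, item) purchase events.

    Returns (users, items, matrix) where matrix[r][c] is 1 if users[r]
    bought items[c] at least once, else 0.
    """
    users = sorted({user for user, _ in purchases})
    items = sorted({item for _, item in purchases})
    bought = set(purchases)  # repeat purchases collapse to a single 1
    matrix = [[1 if (user, item) in bought else 0 for item in items]
              for user in users]
    return users, items, matrix


# Made-up purchase log.
purchases = [
    ("alice", "Tortilla Chips"),
    ("alice", "Coca Cola Zero"),
    ("alice", "Tortilla Chips"),  # repeat purchase
    ("bob", "Tortilla Chips"),
    ("bob", "Coca Cola Zero"),
    ("carol", "Diapers"),
]
users, items, matrix = build_user_item_matrix(purchases)
```

Rows of this matrix describe users and columns describe items, so user-based methods compare rows and item-based methods compare columns.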

  15. Collaborative Filtering
    • user-based: “users like you usually buy these”
    • works for social networks
    • works for “taste”-like recommendations (e.g. movies, fashion, social networks)
    • output: User - User matrix
    • computational cost scales with the number of users
    • used on: user home page, emails, in-app notifications

  16. Collaborative Filtering
    • item-based: “users who bought X also bought Y”
    • works for complementary purchases (e-commerce), news suggestions
    • output: Item - Item matrix
    • computational cost scales with the number of items
    • used on: product page, cart page recommendations

  17. Algorithm: Jaccard Index (https://en.wikipedia.org/wiki/Jaccard_index)
    • Set Theory
    • Ratio of intersection size to union size gives the similarity score
    • Sensitive to sparse input
    $J(v_1, v_2) = \frac{|U_1 \cap U_2|}{|U_1 \cup U_2|}$
    Example: $J = 2/6$
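
The formula translates directly into set operations. The sketch below assumes that each item is represented by the set of users who bought it; the item names, user ids and the helper `similar_items` are made up for illustration and are not taken from honestbee's code.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a ∩ b| / |a ∪ b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_items(item, buyers_by_item, top_n=3):
    """Rank other items by Jaccard similarity of their buyer sets."""
    scores = [(other, jaccard(buyers_by_item[item], buyers))
              for other, buyers in buyers_by_item.items() if other != item]
    scores = [(other, score) for other, score in scores if score > 0]
    # Highest similarity first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Made-up data: item -> set of users who bought it.
buyers_by_item = {
    "Tortilla Chips": {"u1", "u2", "u3", "u4"},
    "Coca Cola Zero": {"u3", "u4", "u5", "u6"},
    "Salsa Dip": {"u1", "u2", "u3"},
    "Baby Wipes": {"u7"},
}

# Two shared buyers out of six distinct buyers, matching the 2/6 example.
j = jaccard(buyers_by_item["Tortilla Chips"], buyers_by_item["Coca Cola Zero"])

recommendations = similar_items("Tortilla Chips", buyers_by_item)
```

Items with very few buyers, such as the single-buyer item here, produce scores that swing between 0 and large values on one purchase, which is one way the sensitivity to sparse input shows up.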

  18. Pandas: http://pandas.pydata.org/
    • Python
    • Data Analysis toolkit

  19. Item-based Recommendation Engine
    On Production (soon!): Cooking ingredients!

  20. Item-based Recommendation Engine
    On Production (soon!): Baby products!

  21. Item-based Recommendation Engine
    On Production (soon!): BBQ-style Parties!

  22. Data Infrastructure

  23. Data Infrastructure

  24. Data Infrastructure
    Auto Integration & Deployment
    https://mesosphere.com/blog/2015/04/02/continuous-deployment-with-mesos-marathon-docker/

  25. Data Infrastructure
    • Platform: Amazon Web Services with EC2, S3, RDS Postgres, and Redshift
    • Application: Docker, Airflow
    • Code Review, Test and Integration: GitHub + Travis CI
    • Resource management: Apache Mesos, AWS Auto Scaling
    • Application & Discovery management: Marathon
    • Languages: Python, SQL

  26. the end
