
Double helix multi-stage text classification model to enhance chat user experience in e-commerce website


The study focuses on text-based classification of the chat feature to detect empty product stock from sellers' text messages.

This research was presented at the 2019 conference of the International Federation of Classification Societies (IFCS) in August 2019, Thessaloniki, Greece.

Co-authors: Abdullah Ghifari, Rya Meyvriska

Fiqry Revadiansyah

August 29, 2019



Transcript

  1. Bukalapak IFCS – 2019 - Thessaloniki
    The 2019 conference of the International Federation of Classification Societies
    Thessaloniki, Greece 2019
    Fiqry Revadiansyah | Data Scientist | PT. Bukalapak.com
    Double helix multi-stage text
    classification model to enhance chat
    user experience in e-commerce website


  2. 01 About Bukalapak and Beyond
    02 Problem Statement | Users Problem
    03 Research Idea and Methodology
    04 Data-driven Insight | Data-driven Decision


  3. Overview
    01 About Bukalapak and Beyond


  4. Inside Bukalapak
    Bukalapak is one of the largest tech unicorns in Southeast Asia. Founded in
    2010, it now has more than 50 million active users and more than half a
    million transactions per day across various products, from small kiosks to
    the e-commerce platform. Today, Bukalapak and its 2,600 employees also
    operate internationally, widening the market to ASEAN countries through
    BukaGlobal to contribute more to the ASEAN digital playground.


  5. Our Business Size Today
    Empowering individuals and SMEs in Indonesia
    Active access/sec: 100K+
    Sellers: 4Mio+
    Mitra Bukalapak: 500K+
    Age 18–35: 70%
    *data as per January 2019
    © Employer Branding Bukalapak 2019


  6. Our Team Size Today
    Empowering individuals and SMEs in Indonesia
    Business Squad: 80+
    BI Specialists, Data Scientists, & Data Engineers: 100+
    Big Data stored (structured + unstructured): 4 PB+
    *data as per January 2019


  7. Currently we have ~4 PB of data stored in our system
    Data lake & data warehousing
    Experiment & track everything
    Real-time information
    Distributed query


  8. Background
    02 Problem Statement | Users Problem


  9. Open Marketplace – Bukalapak.com


  10. Open Marketplace – Various Products and Sellers


  11. Open Marketplace – Open Discussion
    Buyer A: Hi, is it ready (stock available)?
    Seller B: Unfortunately, no. Just sold out yesterday. I forgot to update the
    product stock. Sorry.
    Buyer A: What an unfortunate fate. Ok, thanks, gonna find another one.
    (A €200 product; the seller forgot to update their product stock.)


  12. Open Marketplace – Open Discussion
    Over hundreds of our prospective buyers:
    Buyer: Hi, is it ready (stock available)?
    Seller B (forgot to update their product stock): @#$Y#@$@$(@


  13. Open Marketplace – Open Shop Across Any Other E-commerce


  14. How could we help them to locate the solution?
    Our prospective buyers ⇄ Chat (text data) ⇄ Our sellers
    Provide an automatic solution


  15. How could we help them to locate the solution?
    Our prospective buyers ⇄ Chat (text data) ⇄ Our sellers
    Provide an automatic solution:
    01 Intent
    02 Empty / Not Empty


  16. Methodology
    03 Research Idea and Methodology


  17. Inspired by the architecture of our DNA helix
    Our prospective buyers → Chat → Text data → Intent A (buyer text)
    Our sellers → Chat → Text data → Intent A (seller text)


  18. Inspired by the architecture of our DNA helix
    Buyer messages: “Is it ready?” · “Any blue color?” · “Free shipping?” ·
    “Hi, good morning!” · “I’ve transferred the money to xxx” ·
    “Available for two products?”
    Seller messages: “Yes its ready” · “No, how about red?” · “Please use
    voucher” · “Hi, happy shopping” · “Okay, I will proceed” ·
    “Unfortunately, no. Just one. Take it?”


  19. Inspired by the architecture of our DNA helix
    STAGE I – Buyer Text: 1. text preprocessing; 2. classify the intent as
    “asking product stock” vs. “other”.
    STAGE II – Seller Text: 1. text preprocessing; 2. classify the reply as
    “empty” vs. “not empty / available”.
    The two stages run independently.
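The combined two-stage decision can be sketched in Python; the keyword rules below are hypothetical stand-ins for the trained stage-I and stage-II classifiers described in the deck:

```python
def classify_buyer_intent(text: str) -> str:
    # Stand-in for the stage-I model: detect "asking product stock" intent.
    keywords = ("ready", "stock", "available")
    return "ask_stock" if any(k in text.lower() for k in keywords) else "other"

def classify_seller_reply(text: str) -> str:
    # Stand-in for the stage-II model: detect an "empty stock" answer.
    keywords = ("sold out", "empty", "unfortunately, no")
    return "empty" if any(k in text.lower() for k in keywords) else "not_empty"

def detect_empty_stock(buyer_text: str, seller_text: str) -> bool:
    # The second helix only matters when the first helix fires: flag the
    # product when the buyer asked about stock AND the seller says it is empty.
    return (classify_buyer_intent(buyer_text) == "ask_stock"
            and classify_seller_reply(seller_text) == "empty")
```

In production the two keyword stubs would be replaced by the trained buyer and seller models, but the matching logic between the two helices stays this simple.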


  20. General Methodology
    Start → 1. Data Retrieval → 2. Data Preprocessing → 3. Feature Extraction &
    Engineering (TF-IDF normalization | FastText | Word2Vec) → 4. Train Test
    Split (train and test data per feature) → 5. Data Training (Model 1 – TFIDF,
    Model 2 – FastText, Model 3 – Word2Vec) → 6. Model Evaluation and Selection
    → 7. Hyperparameter Optimization → End


  21. Double-helix Multi-Stage Methodology
    First Helix (buyer text): Start → 1. General Methodology → “Ask the Product
    Stock” vs. “Ask Other”
    Second Helix (seller text): Start → 1. General Methodology → “Product is
    Empty” vs. “Other Class”
    2. Match the Result → End


  22. 1. Data Retrieval
    Buyer and Seller Text Conversation
    We took conversation data showing the `emptiness product stock` symptom,
    i.e. containing phrases such as “stock is empty”, “sold out”, etc. (plus
    many slang variants in Bahasa Indonesia), within the first quarter of 2019.
    We also captured the chats sent by buyers as the conversation initiators.



  23. 2. Data Preprocessing
    Data preprocessing for Bahasa Indonesia.
    Preprocessing is one of the most difficult tasks in NLP, because we have to
    understand the intent of each message and transform it into a structure
    better suited for machine learning. We performed six main steps, from
    punctuation removal to stopword removal.
    For example, here is a complete chat message from a buyer to a seller:
    “Hi! could I ask, that this camera is available or not? Ty!”


    1. Remove punctuation → Hi could I ask that this camera is available or not Ty
    2. Tokenization → “Hi” “could” “I” “ask” “that” “this” “camera” “is”
       “available” “or” “not” “ty”
    3. Slang-word transform → “ty” becomes “thank”
    4. Stemming → “could” becomes “can”
    5. Spell checker → (no change in this example)
    6. Stopword removal → “Hi” “I” “ask” “camera” “available” “thank”
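The six steps can be sketched as a small pipeline. The slang, spelling, and stopword dictionaries below are tiny illustrative stand-ins chosen to reproduce the slide's example; the real pipeline uses much larger Bahasa Indonesia resources:

```python
import string

SLANG = {"ty": "thank"}       # slang-word transform (toy dictionary)
STEMS = {"could": "can"}      # stemming / normalization (toy dictionary)
STOPWORDS = {"can", "that", "this", "is", "or", "not"}

def preprocess(message: str) -> list[str]:
    # 1. Remove punctuation.
    text = message.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenization (lowercased whitespace split).
    tokens = text.lower().split()
    # 3. Slang-word transform.
    tokens = [SLANG.get(t, t) for t in tokens]
    # 4./5. Stemming and spell checking (stubbed with one lookup table).
    tokens = [STEMS.get(t, t) for t in tokens]
    # 6. Stopword removal.
    return [t for t in tokens if t not in STOPWORDS]
```

Running it on the slide's example message yields the same final token list as the slide (lowercased in this sketch).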


  24. 3. Feature Extraction and Engineering
    Having a set of text data, we have to transform it into numeric values so
    that our computer can understand and learn from the data. We used three
    distinct feature-extraction methods for NLP: Normalized TF-IDF (Term
    Frequency – Inverse Document Frequency), FastText, and Word2Vec.
    Related academic papers:
    1. TF-IDF: Allahyari et al. 2017. A Brief Survey of Text Mining:
       Classification, Clustering and Extraction Techniques. In Proceedings of
       KDD Bigdas, Halifax, Canada, August 2017, 13 pages.
    2. FastText: Joulin et al. 2017. FastText.zip: Compressing Text
       Classification Models. 5th International Conference on Learning
       Representations Proceedings.
    3. Word2Vec: Mikolov et al. 2013. Distributed Representations of Words and
       Phrases and their Compositionality.


    TF-IDF(1): the appearance frequency of each word vs. the sentence and the
    whole document (TF formula, IDF formula, normalization).
    FastText(2): pre-trained DNN model from Facebook Research.
    Word2Vec(3): pre-trained DNN model using skip-grams.
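As an illustration of the first method, here is a minimal pure-Python normalized TF-IDF on a toy tokenized corpus (the function name and the toy corpus are illustrative; the actual pipeline also used pre-trained FastText and Word2Vec embeddings):

```python
import math

def tfidf_vectors(docs):
    # docs: list of tokenized documents (lists of words).
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Document frequency of each word; idf(w) = log(N / df(w)).
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        # tf(w, d) = count(w in d) / len(d), weighted by idf.
        vec = [(d.count(w) / len(d)) * math.log(n / df[w]) for w in vocab]
        # L2 normalization ("Normalized TF-IDF").
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors
```

Note that words appearing in every document (here "stock") get idf = 0 and contribute nothing, which is exactly the down-weighting TF-IDF is designed for.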


  25. 4. Train Test Split
    We split the data randomly into training and testing datasets with a
    70%:30% proportion, for each feature set.


    (Each feature set – Normalized TF-IDF, FastText, Word2Vec – is split into a
    70% train dataset and a 30% test dataset.)
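A minimal sketch of the random 70/30 split; splitting by shared indices (an assumption here, not stated in the deck) is one way to keep the three feature matrices aligned on the same examples:

```python
import random

def train_test_split_indices(n, test_ratio=0.3, seed=42):
    # Shuffle row indices reproducibly, then cut at the 70% mark.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_ratio))
    return idx[:cut], idx[cut:]
```

The same (train, test) index pair can then slice the TF-IDF, FastText, and Word2Vec matrices so every model sees identical train and test examples.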


  26. 5. Data Training
    We train on our dataset, both buyer and seller data, to get predictions. We
    used several machine learning algorithms and compared the accuracy of each
    model. The algorithms we used are:
    1. Logistic Regression (baseline model)
    2. K-Nearest Neighbor
    3. Naïve Bayes
    4. Decision Tree
    5. Random Forest
    6. Gradient Boosting Classifier
    7. Extreme Gradient Boosting


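The comparison loop can be sketched as below, assuming scikit-learn and synthetic data in place of the buyer/seller feature matrices; only three of the seven algorithms are shown:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for one of the feature matrices (e.g. Word2Vec).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Logistic Regression (baseline)": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                                  # train
    scores[name] = accuracy_score(y_te, model.predict(X_te))  # evaluate
```

The real pipeline repeats this loop once per feature set and per helix, producing the accuracy grid shown later in the results.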


  27. 5. Data Training
    We trained our data using K-fold cross-validation, which evaluates model
    performance by repeatedly subsetting the data into training and validation
    folds.


    K-Fold Cross Validation
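Assuming scikit-learn, K-fold cross-validation can be run in one call; K=5 below is an illustrative choice, as the deck does not state K:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; each of the 5 folds serves once as the validation
# set while the remaining folds train the model.
X, y = make_classification(n_samples=150, n_features=8, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```

Averaging `cv_scores` gives a more stable performance estimate than a single train/validation split.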


  28. 6. Model Evaluation and Selection
    To evaluate model performance, we used several metrics: accuracy score,
    negative recall, and AUC score. In particular, we focused on pushing down
    false positives by keeping the negative-recall value high.
    A higher accuracy score indicates balanced performance in predicting both
    the 0 and 1 classes.
    A higher (negative) recall indicates very good performance at avoiding
    false-positive cases.
    A higher AUC score indicates better reliability of the model on an
    imbalanced dataset.


    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    Precision = TP / (TP + FP)
    Recall(+) = TP / (TP + FN)
    Recall(−) = TN / (TN + FP)
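These metrics can be computed directly from the confusion-matrix counts; the function below is a minimal sketch (its name is chosen here for illustration), with negative recall being the quantity the deck optimizes to push down false positives:

```python
def confusion_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells for binary labels 0/1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall_pos": tp / (tp + fn) if tp + fn else 0.0,
        # Negative recall (specificity): share of true negatives recovered.
        "recall_neg": tn / (tn + fp) if tn + fp else 0.0,
    }
```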


  29. 7. Hyperparameter Optimization
    The hyperparameters of the chosen models are then tuned to reach optimal,
    higher accuracy on the train dataset.
    We used Randomized Search and Bayesian Search to find the best parameters
    in the hyperparameter space for the best selected models.
    Related academic papers:
    1. Random Search: Zabinsky, Zelda B. 2009. Random Search Algorithms.
       University of Washington, USA.
    2. Bayesian Search: Shahriari, Bobak, et al. 2016. Taking the Human Out of
       the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE
       (Volume 104, Issue 1).


    Bayesian Search
    Random Search
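Assuming scikit-learn, randomized search over a small illustrative hyperparameter space looks like this (the Bayesian search used hyperopt and is not shown; the parameter values are examples, not the deck's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data for the buyer feature matrix.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

# Sample 4 random configurations from the space, evaluated with 3-fold CV.
param_space = {"n_estimators": [10, 25, 50], "max_depth": [2, 4, None]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_space, n_iter=4, cv=3, random_state=0)
search.fit(X, y)
best = search.best_params_
```

Randomized search trades exhaustiveness for speed: it samples `n_iter` configurations instead of trying every combination as grid search would.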


  30. Result and Discussion
    04 Data-driven Insight | Data-driven Decision


  31. Results

    Seller Model (accuracy / negative recall / log loss per feature):
    Model                        | TFIDF                  | Word2Vec               | FastText
    Logistic Regression          | 85.37% / 56.20% / 0.37 | 88.52% / 63.28% / 0.33 | 73.65% / 0.00% / 0.58
    K-Nearest Neighbor           | 84.18% / 58.39% / 1.80 | 87.90% / 70.33% / 1.56 | 68.89% / 10.00% / 0.63
    Naïve Bayes                  | 85.52% / 59.29% / 0.39 | 84.95% / 66.88% / 1.04 | 73.65% / 0.00% / 0.58
    Decision Tree                | 84.04% / 62.57% / 4.10 | 84.28% / 72.50% / 5.43 | 73.65% / 0.00% / 0.58
    Random Forest                | 84.65% / 60.20% / 1.16 | 88.33% / 71.77% / 0.95 | 73.65% / 0.00% / 0.58
    Gradient Boosting Classifier | 84.52% / 52.25% / 0.38 | 88.09% / 69.96% / 0.31 | 73.65% / 0.00% / 0.58
    Extreme Gradient Boosting    | 83.23% / 49.54% / 0.40 | 88.42% / 69.96% / 0.30 | 73.65% / 0.00% / 0.58

    Buyer Model (accuracy / negative recall / log loss per feature):
    Model                        | TFIDF                  | Word2Vec               | FastText
    Logistic Regression          | 85.84% / 85.66% / 0.37 | 84.46% / 79.74% / 0.43 | 53.96% / 100.00% / 0.69
    K-Nearest Neighbor           | 81.40% / 80.09% / 0.97 | 84.17% / 84.28% / 1.59 | 47.77% / 20.14% / 0.72
    Naïve Bayes                  | 86.55% / 82.75% / 0.35 | 83.45% / 84.81% / 0.74 | 53.96% / 100.00% / 0.69
    Decision Tree                | 85.69% / 85.13% / 3.16 | 76.83% / 77.59% / 8.00 | 53.96% / 100.00% / 0.69
    Random Forest                | 86.55% / 88.04% / 0.55 | 83.02% / 86.95% / 1.05 | 53.96% / 100.00% / 0.69
    Gradient Boosting Classifier | 86.12% / 87.79% / 0.33 | 84.17% / 84.00% / 0.38 | 53.96% / 100.00% / 0.69
    Extreme Gradient Boosting    | 84.12% / 87.26% / 0.35 | 83.45% / 84.27% / 0.37 | 53.96% / 100.00% / 0.69


  32. Hyperparameter Tuning Results
    The XGBoost (Extreme Gradient Boosting) model outperforms the other models
    in the seller text classification, while the Random Forest model fits best
    on the buyer text classification.
    Buyer Model (First Helix): Random Forest tuned by Random Search → AUC: 88.28%
    Seller Model (Second Helix): XGBoost tuned by Bayesian Hyperopt → AUC: 92.31%



  33. What is Next?
    Buyer Model (First Helix): Random Forest tuned by Random Search → AUC: 88.28%
    Seller Model (Second Helix): XGBoost tuned by Bayesian Hyperopt → AUC: 92.31%
    Store weights (model parameters) → Integrate with the microservices → Deploy


  34. Thank you
    Fiqry Revadiansyah | Data Scientist | @fiqryr
    Abdullah Ghifari | Data Scientist | @abdullahghifari
    Rya Meyvriska | Data Scientist | @rya_mey
