Slide 1

Slide 1 text

IFCS 2019 – Thessaloniki: the 2019 Conference of the International Federation of Classification Societies, Thessaloniki, Greece
Double-helix multi-stage text classification model to enhance chat user experience in an e-commerce website
Fiqry Revadiansyah | Data Scientist | PT. Bukalapak.com

Slide 2

Slide 2 text

01 About Bukalapak and Beyond
02 Problem Statement | Users' Problem
03 Research Idea and Methodology
04 Data-driven Insight | Data-driven Decision

Slide 3

Slide 3 text

Overview: 01 About Bukalapak and Beyond

Slide 4

Slide 4 text

Inside Bukalapak. Bukalapak is one of the largest tech unicorn companies in Southeast Asia. Founded in 2010, it now has more than 50 million active users and more than half a million transactions per day across various products, from small kiosks to the e-commerce platform. Today, Bukalapak and its 2,600 employees are also present internationally, widening the market to ASEAN countries through BukaGlobal to contribute more to the ASEAN digital playground.

Slide 5

Slide 5 text

Our Business Size Today: empowering individuals and SMEs in Indonesia.
Active accesses/sec: 100K+ | Sellers: 4M+ | Mitra Bukalapak: 500K+ | Users aged 18-35: 70%
*data as of January 2019

Slide 6

Slide 6 text

Our Team Size Today: empowering individuals and SMEs in Indonesia.
Business squads: 80+ | BI Specialists, Data Scientists, & Data Engineers: 100+ | Big data stored (structured + unstructured): 4 PB+
*data as of January 2019

Slide 7

Slide 7 text

Currently we have ~4 PB of data stored in our system: data lake & data warehousing, experiment & track everything, real-time information, and distributed query.

Slide 8

Slide 8 text

Background: 02 Problem Statement | Users' Problem

Slide 9

Slide 9 text

Open Marketplace – Bukalapak.com

Slide 10

Slide 10 text

Open Marketplace – Various Products and Sellers

Slide 11

Slide 11 text

Open Marketplace – Open Discussion
Buyer A: "Hi, is it ready (stock available)?"
Seller B: "Unfortunately, no. It just sold out yesterday. I forgot to update the product stock. Sorry."
Buyer A: "What an unfortunate fate. OK, thanks, I will look for it elsewhere."
(A €200 product; the seller forgot to update their product stock.)

Slide 12

Slide 12 text

Open Marketplace – Open Discussion
Over hundreds of our prospective buyers: "Hi, is it ready (stock available)?"
Seller B: "@#$Y#@$@$(@"
(A €200 product; the seller forgot to update their product stock.)

Slide 13

Slide 13 text

Open Marketplace – Seller B Also Opens Shops Across Other E-commerce Platforms

Slide 14

Slide 14 text

How could we help them to locate the solution? Our prospective buyers and our sellers both produce chat text data, which we can use to provide an automatic solution.

Slide 15

Slide 15 text

How could we help them to locate the solution? From the chat text data we detect (01) the buyer's intent and (02) whether the seller's product stock is Empty or Not Empty, in order to provide an automatic solution.

Slide 16

Slide 16 text

Methodology: 03 Research Idea and Methodology

Slide 17

Slide 17 text

Inspired by the architecture of our DNA helix: the chat text data from our prospective buyers and our sellers form two strands, buyer text and seller text, each carrying its own intent.

Slide 18

Slide 18 text

Inspired by the architecture of our DNA helix.
Buyer messages: "Is it ready?", "Any blue color?", "Free shipping?", "Hi, good morning!", "I've transferred the money to xxx", "Available for two products?"
Seller messages: "Yes, it's ready", "No, how about red?", "Please use a voucher", "Hi, happy shopping", "Okay, I will proceed", "Unfortunately, no. Just one. Take it?"

Slide 19

Slide 19 text

Inspired by the architecture of our DNA helix. The two stages run independently:
STAGE I – Buyer text: (1) text preprocessing, then (2) classify the intent: Asking Product Stock vs. Other.
STAGE II – Seller text: (1) text preprocessing, then (2) classify the intent: Empty vs. Not Empty/Available.
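To make the two-stage flow concrete, here is a minimal Python sketch of how the two helices might run independently and have their results matched; `buyer_clf`, `seller_clf`, and `vectorize` stand in for the trained classifiers and feature extractor built in the later steps, and all label names are illustrative assumptions, not the production interface.

```python
def classify_conversation(buyer_text, seller_text,
                          buyer_clf, seller_clf, vectorize):
    """Run both helices independently, then match the results."""
    # First helix (STAGE I): buyer intent, "ask_product_stock" vs "other".
    buyer_intent = buyer_clf.predict(vectorize(buyer_text))[0]
    # Second helix (STAGE II): seller intent, "empty" vs "not_empty".
    seller_intent = seller_clf.predict(vectorize(seller_text))[0]

    # Match the results: act only when the buyer asked about stock
    # and the seller indicated the product is empty.
    if buyer_intent == "ask_product_stock" and seller_intent == "empty":
        return "product_out_of_stock"
    return "no_action"
```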

Slide 20

Slide 20 text

General Methodology: Start -> 1. Data Retrieval -> 2. Data Preprocessing -> 3. Feature Extraction & Engineering (normalized TF-IDF, FastText, Word2Vec) -> 4. Train/Test Split (Model 1 – TF-IDF, Model 2 – FastText, Model 3 – Word2Vec, each with its own train and test data) -> 5. Data Training -> 6. Model Evaluation and Selection -> 7. Hyperparameter Optimization -> End.

Slide 21

Slide 21 text

Double-helix Multi-Stage Methodology: the first helix applies the general methodology to the buyer text (Start -> Ask the Product Stock vs. Ask Other), the second helix applies it to the seller text (Start -> Product is Empty vs. Other Class), and the two results are then matched (Match the Result) before the flow ends.

Slide 22

Slide 22 text

1. Data Retrieval
Buyer and seller text conversations: we took the conversation data from the first quarter of 2019 that shows the "empty product stock" symptom, i.e. messages containing "stock is empty", "sold out", etc. (plus more slang words in Bahasa Indonesia). We captured the chats sent by the buyer as the conversation initiator.
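As a hedged illustration of this retrieval filter, assuming the conversations sit in a pandas DataFrame; the file name, column names, and keyword list (including the Bahasa Indonesia markers) are hypothetical.

```python
import pandas as pd

# Hypothetical source and column names; only the filter logic comes from the slide.
chats = pd.read_csv("chat_conversations_2019q1.csv")

# Markers of the "empty product stock" symptom; the real list also
# contains Bahasa Indonesia slang (e.g. "habis", "kosong").
markers = ["stock is empty", "sold out", "habis", "kosong"]
pattern = "|".join(markers)

sample = chats[
    chats["message"].str.contains(pattern, case=False, na=False)
    & (chats["initiator_role"] == "buyer")  # buyer as conversation initiator
]
```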

Slide 23

Slide 23 text

2. Data Preprocessing
Data preprocessing for the Bahasa Indonesia language. The preprocessing stage is clearly one of the most difficult tasks in NLP, because we have to understand the intent of each message and transform it into a better structure for our machine learning. We performed six main steps, from punctuation removal to stopword removal. For example, here is a complete chat message from a buyer to a seller: "Hi! could I ask, that this camera is available or not? Ty!"
Remove punctuation: Hi could I ask that this camera is available or not Ty
Tokenization: "Hi" "could" "I" "ask" "that" "this" "camera" "is" "available" "or" "not" "ty"
Slang-word transform: "Hi" "could" "I" "ask" "that" "this" "camera" "is" "available" "or" "not" "thank"
Stemming: "Hi" "can" "I" "ask" "that" "this" "camera" "is" "available" "or" "not" "thank"
Spell checker: "Hi" "can" "I" "ask" "that" "this" "camera" "is" "available" "or" "not" "thank" (no change here)
Stopword removal: "Hi" "I" "ask" "camera" "available" "thank"
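A minimal sketch of these six steps in Python; the tiny normalization and stopword dictionaries are illustrative stand-ins, since the real pipeline relies on Bahasa Indonesia resources (slang lexicon, stemmer, spell checker) not shown on the slide.

```python
import string

# Illustrative resources: one small dict stands in for the slang-transform,
# stemming, and spell-checking steps (assumption, for brevity).
NORMALIZE = {"ty": "thank", "could": "can"}
STOPWORDS = {"can", "that", "this", "is", "or", "not"}

def preprocess(message):
    # 1. Punctuation removal
    text = message.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenization
    tokens = text.lower().split()
    # 3.-5. Slang-word transform, stemming, spell checking (collapsed here)
    tokens = [NORMALIZE.get(t, t) for t in tokens]
    # 6. Stopword removal
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Hi! could I ask, that this camera is available or not? Ty!"))
# -> ['hi', 'i', 'ask', 'camera', 'available', 'thank']
```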

Slide 24

Slide 24 text

3. Feature Extraction and Engineering
Having a set of text data, we have to transform it into numeric values so that our computer can understand and learn from the data. We used three distinct feature-extraction methods for NLP: normalized TF-IDF (Term Frequency – Inverse Document Frequency)(1), which weighs how frequently each word appears in a message against the whole document collection; FastText(2), a pre-trained DNN model from Facebook Research; and Word2Vec(3), a pre-trained DNN model using skip-grams.
Related academic papers:
1. TF-IDF: Allahyari et al. 2017. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. In Proceedings of KDD Bigdas, Halifax, Canada, August 2017, 13 pages.
2. FastText: Joulin et al. 2017. FastText.zip: Compressing Text Classification Models. 5th International Conference on Learning Representations Proceedings.
3. Word2Vec: Mikolov et al. 2013. Distributed Representations of Words and Phrases and their Compositionality.
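A sketch of the three representations, assuming scikit-learn for TF-IDF and the gensim 4.x API for Word2Vec and FastText; the slide uses pre-trained models, but tiny ones are trained here to keep the example self-contained, and averaging word vectors into a sentence vector is an illustrative pooling choice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import FastText, Word2Vec  # gensim 4.x API (assumption)

# Tiny toy corpus of preprocessed token lists (illustrative).
corpus = [["hi", "ask", "camera", "available"],
          ["stock", "empty", "sold", "out"],
          ["camera", "stock", "available"]]
raw = [" ".join(tokens) for tokens in corpus]

# 1. Normalized TF-IDF (scikit-learn l2-normalizes each row by default).
tfidf = TfidfVectorizer(norm="l2")
X_tfidf = tfidf.fit_transform(raw)

# 2./3. Word2Vec (skip-gram) and FastText embeddings.
w2v = Word2Vec(corpus, vector_size=50, sg=1, min_count=1)
ft = FastText(corpus, vector_size=50, min_count=1)

def sentence_vector(model, tokens):
    """Average the word vectors of a message (an illustrative pooling choice)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_w2v = np.vstack([sentence_vector(w2v, t) for t in corpus])
X_ft = np.vstack([sentence_vector(ft, t) for t in corpus])
```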

Slide 25

Slide 25 text

4. Train Test Split
We split the data randomly into training (70%) and testing (30%) datasets, separately for each feature representation: normalized TF-IDF, Word2Vec, and FastText each get a 70% train set and a 30% test set.
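With scikit-learn, the 70%:30% random split could look like this; the toy matrix and labels stand in for one of the three feature representations.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins: in the real pipeline X is one of the three feature
# matrices (TF-IDF, Word2Vec, FastText) and y the intent labels.
X = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.9], [0.2, 0.4], [0.7, 0.3],
     [0.9, 0.8], [0.4, 0.6], [0.8, 0.2], [0.6, 0.5], [0.3, 0.7]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Random 70%:30% split, repeated once per feature representation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```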

Slide 26

Slide 26 text

5. Data Training
We train on our dataset, for both the buyer and the seller data, in order to get predictions. We used several machine learning algorithms and compared the accuracy of each model. The machine learning algorithms we used are:
1. Logistic Regression (baseline model)
2. K-Nearest Neighbor
3. Naïve Bayes
4. Decision Tree
5. Random Forest
6. Gradient Boosting Classifier
7. Extreme Gradient Boosting
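The model comparison could be sketched as below, with a synthetic feature matrix standing in for the vectorized chats; the estimators mirror the list above, and GaussianNB is an assumption since the slide does not name the Naïve Bayes variant.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier  # assumption: xgboost package installed

# Synthetic stand-in for the vectorized chat features.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

models = {
    "Logistic Regression (baseline)": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Extreme Gradient Boosting": XGBClassifier(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name}: accuracy = {clf.score(X_test, y_test):.4f}")
```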

Slide 27

Slide 27 text

5. Data Training
We trained our data using K-fold cross-validation, which evaluates the performance of the model by repeatedly splitting the data into training and validation subsets across K folds.
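A minimal cross-validation sketch with scikit-learn; K = 5 is an assumption, as the slide does not state the number of folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the vectorized chat features.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# K = 5 folds (assumption); each fold serves once as the validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="accuracy")
print(f"fold accuracies: {scores}, mean = {scores.mean():.4f}")
```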

Slide 28

Slide 28 text

6. Model Evaluation and Selection
To evaluate model performance, we used several metrics: accuracy score, negative recall, and AUC score. In particular, we focused on pushing down the false positives by keeping the negative recall high. A higher accuracy score stands for balanced performance in predicting both the 0 and 1 classes; a higher negative recall stands for very good performance at avoiding false positives; a higher AUC score stands for good reliability of the model on an imbalanced dataset.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
Negative recall (specificity) = TN / (TN + FP)
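These three metrics map directly onto scikit-learn, where negative recall is recall computed with the negative class as the positive label; the toy predictions below are illustrative.

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Toy predictions standing in for a fitted model's output on the test set.
y_test = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 1, 0, 0]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.1, 0.7, 0.3, 0.4]  # P(class = 1)

accuracy = accuracy_score(y_test, y_pred)
neg_recall = recall_score(y_test, y_pred, pos_label=0)  # TN / (TN + FP)
auc = roc_auc_score(y_test, y_prob)
print(accuracy, neg_recall, auc)
```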

Slide 29

Slide 29 text

7. Hyperparameter Optimization
The chosen models have their hyperparameters tuned in order to reach the optimum, higher accuracy on the train dataset. We used Randomized Search and Bayesian Search to get the best parameters from the hyperparameter space of the best selected models.
Related academic papers:
1. Random Search: Zabinsky, Zelda B. 2009. Random Search Algorithms. University of Washington, USA.
2. Bayesian Search: Shahriari, Bobak, et al. 2016. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE (Volume 104, Issue 1).
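A hedged sketch of both searches, using scikit-learn's RandomizedSearchCV for the random search and hyperopt's TPE for the Bayesian search; the parameter spaces, iteration counts, and synthetic data are illustrative, not the tuning grid actually used.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from hyperopt import fmin, hp, tpe  # assumption: hyperopt package installed
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Randomized search over the Random Forest (buyer model, first helix).
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(100, 500),
                         "max_depth": randint(3, 20)},
    n_iter=20, scoring="roc_auc", cv=5, random_state=42,
)
rf_search.fit(X, y)
print("RF best AUC:", rf_search.best_score_)

# Bayesian (TPE) search over XGBoost (seller model, second helix).
def objective(params):
    clf = XGBClassifier(max_depth=int(params["max_depth"]),
                        learning_rate=params["learning_rate"])
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    return -auc  # hyperopt minimizes, so negate the AUC

best = fmin(fn=objective,
            space={"max_depth": hp.quniform("max_depth", 3, 10, 1),
                   "learning_rate": hp.loguniform("learning_rate", -5, 0)},
            algo=tpe.suggest, max_evals=20)
print("XGB best params:", best)
```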

Slide 30

Slide 30 text

Result and Discussion: 04 Data-driven Insight | Data-driven Decision

Slide 31

Slide 31 text

Results

Seller Model
Model | TFIDF (Accuracy / Neg Recall / Log Loss) | Word2Vec (Accuracy / Neg Recall / Log Loss) | FastText (Accuracy / Neg Recall / Log Loss)
Logistic Regression | 85.37% / 56.20% / 0.37 | 88.52% / 63.28% / 0.33 | 73.65% / 0.00% / 0.58
K-Nearest Neighbor | 84.18% / 58.39% / 1.80 | 87.90% / 70.33% / 1.56 | 68.89% / 10.00% / 0.63
Naïve Bayes | 85.52% / 59.29% / 0.39 | 84.95% / 66.88% / 1.04 | 73.65% / 0.00% / 0.58
Decision Tree | 84.04% / 62.57% / 4.10 | 84.28% / 72.50% / 5.43 | 73.65% / 0.00% / 0.58
Random Forest | 84.65% / 60.20% / 1.16 | 88.33% / 71.77% / 0.95 | 73.65% / 0.00% / 0.58
Gradient Boosting Classifier | 84.52% / 52.25% / 0.38 | 88.09% / 69.96% / 0.31 | 73.65% / 0.00% / 0.58
Extreme Gradient Boosting | 83.23% / 49.54% / 0.40 | 88.42% / 69.96% / 0.30 | 73.65% / 0.00% / 0.58

Buyer Model
Model | TFIDF (Accuracy / Neg Recall / Log Loss) | Word2Vec (Accuracy / Neg Recall / Log Loss) | FastText (Accuracy / Neg Recall / Log Loss)
Logistic Regression | 85.84% / 85.66% / 0.37 | 84.46% / 79.74% / 0.43 | 53.96% / 100.00% / 0.69
K-Nearest Neighbor | 81.40% / 80.09% / 0.97 | 84.17% / 84.28% / 1.59 | 47.77% / 20.14% / 0.72
Naïve Bayes | 86.55% / 82.75% / 0.35 | 83.45% / 84.81% / 0.74 | 53.96% / 100.00% / 0.69
Decision Tree | 85.69% / 85.13% / 3.16 | 76.83% / 77.59% / 8.00 | 53.96% / 100.00% / 0.69
Random Forest | 86.55% / 88.04% / 0.55 | 83.02% / 86.95% / 1.05 | 53.96% / 100.00% / 0.69
Gradient Boosting Classifier | 86.12% / 87.79% / 0.33 | 84.17% / 84.00% / 0.38 | 53.96% / 100.00% / 0.69
Extreme Gradient Boosting | 84.12% / 87.26% / 0.35 | 83.45% / 84.27% / 0.37 | 53.96% / 100.00% / 0.69

Slide 32

Slide 32 text

Hyperparameter Tuning Results
The XGBoost (Extreme Gradient Boosting) model outperforms the other models on the seller text classification, while the Random Forest model fits best on the buyer text classification.
Buyer Model (First Helix): Random Forest, tuned by Random Search -> AUC: 88.28%
Seller Model (Second Helix): XGBoost, tuned by Bayesian Hyperopt -> AUC: 92.31%

Slide 33

Slide 33 text

What is Next?
Buyer Model (First Helix): Random Forest, tuned by Random Search -> AUC: 88.28%
Seller Model (Second Helix): XGBoost, tuned by Bayesian Hyperopt -> AUC: 92.31%
Store the weights (model parameters) -> Integrate with the microservices -> Deploy
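A hedged sketch of the "store weights, integrate, deploy" step, assuming joblib for persistence and a small Flask microservice; the toy models, file names, route, and payload shape are all hypothetical stand-ins for the actual integration.

```python
import joblib
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the tuned buyer (Random Forest) and seller (XGBoost)
# models; in reality these would be the tuned estimators from the helices.
buyer_model = RandomForestClassifier().fit([[0, 0], [1, 1]], [0, 1])
seller_model = RandomForestClassifier().fit([[0, 0], [1, 1]], [0, 1])
joblib.dump(buyer_model, "buyer_model.joblib")    # store the weights
joblib.dump(seller_model, "seller_model.joblib")

app = Flask(__name__)
buyer_clf = joblib.load("buyer_model.joblib")
seller_clf = joblib.load("seller_model.joblib")

@app.route("/classify", methods=["POST"])
def classify():
    payload = request.get_json()
    # In production the raw message would first pass through the same
    # preprocessing and feature extraction used at training time.
    features = [payload["features"]]
    clf = buyer_clf if payload["role"] == "buyer" else seller_clf
    return jsonify({"intent": int(clf.predict(features)[0])})

if __name__ == "__main__":
    app.run()
```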

Slide 34

Slide 34 text

Thank you
Fiqry Revadiansyah | Data Scientist | @fiqryr
Abdullah Ghifari | Data Scientist | @abdullahghifari
Rya Meyvriska | Data Scientist | @rya_mey