
Double helix multi-stage text classification model to enhance chat user experience in e-commerce website


The study focuses on text-based classification of the chat feature to detect empty product stock from sellers' text messages.

This research was presented at the 2019 conference of the International Federation of Classification Societies (IFCS) in August 2019, Thessaloniki, Greece.

Co-authors: Abdullah Ghifari, Rya Meyvriska

Fiqry Revadiansyah

August 29, 2019



Transcript

  1. Bukalapak IFCS – 2019 - Thessaloniki
    The 2019 conference of the International Federation of Classification Societies
    Thessaloniki, Greece 2019
    Fiqry Revadiansyah | Data Scientist | PT. Bukalapak.com
    Double helix multi-stage text
    classification model to enhance chat
    user experience in e-commerce website


  2. 01 About Bukalapak and Beyond
    02 Problem Statement | Users Problem
    03 Research Idea and Methodology
    04 Data-driven Insight | Data-driven Decision


  3. Overview
    01 About Bukalapak and Beyond


  4. Inside Bukalapak
    Bukalapak is one of the largest tech unicorns in Southeast Asia. Founded in
    2010, it now has more than 50 million active users and more than half a
    million transactions per day across various products, from small kiosks to
    the e-commerce platform. Today, Bukalapak and its 2,600 employees also
    operate internationally, widening the market to ASEAN countries through
    BukaGlobal to contribute more to the ASEAN digital playground.


  5. Our Business Size Today
    Empowering individuals and SMEs in Indonesia
    Active access/sec: 100K+
    Sellers: 4Mio+
    Mitra Bukalapak: 500K+
    Age 18–35: 70%
    *data as per January 2019
    © Employer Branding Bukalapak 2019


  6. Our Team Size Today
    Empowering individuals and SMEs in Indonesia
    Business Squad: 80+
    BI Specialists, Data Scientists, & Data Engineers: 100+
    Big Data stored (structured + unstructured): 4 PB+
    *data as per January 2019


  7. Currently we have ~4 PB of data stored in our system
    Data lake & data warehousing
    Experiment & track everything
    Real-time information
    Distributed query


  8. Background
    02 Problem Statement | Users Problem


  9. Open Marketplace – Bukalapak.com


  10. Open Marketplace – Various Products and Sellers


  11. Open Marketplace – Open Discussion
    Buyer A: Hi, is it ready (stock available)?
    Seller B: Unfortunately, no. Just sold out yesterday. I forgot to update the
    product stock. Sorry.
    Buyer A: What an unfortunate fate. Ok, thanks, gonna find another one.
    (A €200 product; the seller forgot to update their product stock.)


  12. Open Marketplace – Open Discussion
    Over hundreds of our prospective buyers:
    Buyer: Hi, is it ready (stock available)?
    Seller B (forgot to update their product stock): @#$Y#@$@$(@


  13. Open Marketplace – Open Shop Across Any Other E-commerce


  14. How could we help them to locate the solution?
    Our prospective buyers ⇄ Chat (text data) ⇄ Our sellers
    Provide an automatic solution


  15. How could we help them to locate the solution?
    Our prospective buyers ⇄ Chat (text data) ⇄ Our sellers
    Provide an automatic solution:
    01 Intent
    02 Empty / Not Empty


  16. Methodology
    03 Research Idea and Methodology


  17. Inspired by the architecture of our DNA helix
    Our prospective buyers → Chat → Text data → Intent A (buyer text)
    Our sellers → Chat → Text data → Intent A (seller text)


  18. Inspired by the architecture of our DNA helix
    Buyer messages: “Is it ready?” · “Any blue color?” · “Free shipping?” ·
    “Hi, good morning!” · “I’ve transferred the money to xxx” ·
    “Available for two products?”
    Seller messages: “Yes its ready” · “No, how about red?” · “Please use
    voucher” · “Hi, happy shopping” · “Okay, I will proceed” ·
    “Unfortunately, no. Just one. Take it?”


  19. Inspired by the architecture of our DNA helix
    STAGE I – Buyer Text: 1. text preprocessing; 2. classify the intent as
    “asking product stock” vs. “other”.
    STAGE II – Seller Text: 1. text preprocessing; 2. classify the reply as
    “empty” vs. “not empty / available”.
    The two stages run independently.
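The combined two-stage decision can be sketched in Python; the keyword rules below are hypothetical stand-ins for the trained stage-I and stage-II classifiers described in the deck:

```python
def classify_buyer_intent(text: str) -> str:
    # Stand-in for the stage-I model: detect "asking product stock" intent.
    keywords = ("ready", "stock", "available")
    return "ask_stock" if any(k in text.lower() for k in keywords) else "other"

def classify_seller_reply(text: str) -> str:
    # Stand-in for the stage-II model: detect an "empty stock" answer.
    keywords = ("sold out", "empty", "unfortunately, no")
    return "empty" if any(k in text.lower() for k in keywords) else "not_empty"

def detect_empty_stock(buyer_text: str, seller_text: str) -> bool:
    # The second helix only matters when the first helix fires: flag the
    # product when the buyer asked about stock AND the seller says it is empty.
    return (classify_buyer_intent(buyer_text) == "ask_stock"
            and classify_seller_reply(seller_text) == "empty")
```

In production the two keyword stubs would be replaced by the trained buyer and seller models, but the matching logic between the two helices stays this simple.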


  20. General Methodology
    Start → 1. Data Retrieval → 2. Data Preprocessing → 3. Feature Extraction &
    Engineering (TF-IDF normalization | FastText | Word2Vec) → 4. Train Test
    Split (train and test data per feature) → 5. Data Training (Model 1 – TFIDF,
    Model 2 – FastText, Model 3 – Word2Vec) → 6. Model Evaluation and Selection
    → 7. Hyperparameter Optimization → End


  21. Double-helix Multi-Stage Methodology
    First Helix (buyer text): Start → 1. General Methodology → “Ask the Product
    Stock” vs. “Ask Other”
    Second Helix (seller text): Start → 1. General Methodology → “Product is
    Empty” vs. “Other Class”
    2. Match the Result → End


  22. 1. Data Retrieval
    Buyer and Seller Text Conversation
    We took conversation data showing the `emptiness product stock` symptom,
    i.e. containing phrases such as “stock is empty”, “sold out”, etc. (plus
    many slang variants in Bahasa Indonesia), within the first quarter of 2019.
    We also captured the chats sent by buyers as the conversation initiators.



  23. 2. Data Preprocessing
    Data preprocessing for Bahasa Indonesia.
    Preprocessing is one of the most difficult tasks in NLP, because we have to
    understand the intent of each message and transform it into a structure
    better suited for machine learning. We performed six main steps, from
    punctuation removal to stopword removal.
    For example, here is a complete chat message from a buyer to a seller:
    “Hi! could I ask, that this camera is available or not? Ty!”


    1. Remove punctuation → Hi could I ask that this camera is available or not Ty
    2. Tokenization → “Hi” “could” “I” “ask” “that” “this” “camera” “is”
       “available” “or” “not” “ty”
    3. Slang-word transform → “ty” becomes “thank”
    4. Stemming → “could” becomes “can”
    5. Spell checker → (no change in this example)
    6. Stopword removal → “Hi” “I” “ask” “camera” “available” “thank”
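The six steps can be sketched as a small pipeline. The slang, spelling, and stopword dictionaries below are tiny illustrative stand-ins chosen to reproduce the slide's example; the real pipeline uses much larger Bahasa Indonesia resources:

```python
import string

SLANG = {"ty": "thank"}       # slang-word transform (toy dictionary)
STEMS = {"could": "can"}      # stemming / normalization (toy dictionary)
STOPWORDS = {"can", "that", "this", "is", "or", "not"}

def preprocess(message: str) -> list[str]:
    # 1. Remove punctuation.
    text = message.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenization (lowercased whitespace split).
    tokens = text.lower().split()
    # 3. Slang-word transform.
    tokens = [SLANG.get(t, t) for t in tokens]
    # 4./5. Stemming and spell checking (stubbed with one lookup table).
    tokens = [STEMS.get(t, t) for t in tokens]
    # 6. Stopword removal.
    return [t for t in tokens if t not in STOPWORDS]
```

Running it on the slide's example message yields the same final token list as the slide (lowercased in this sketch).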


  24. 3. Feature Extraction and Engineering
    Having a set of text data, we have to transform it into numeric values so
    that our computer can understand and learn from the data. We used three
    distinct feature-extraction methods for NLP: Normalized TF-IDF (Term
    Frequency – Inverse Document Frequency), FastText, and Word2Vec.
    Related academic papers:
    1. TF-IDF: Allahyari et al. 2017. A Brief Survey of Text Mining:
       Classification, Clustering and Extraction Techniques. In Proceedings of
       KDD Bigdas, Halifax, Canada, August 2017, 13 pages.
    2. FastText: Joulin et al. 2017. FastText.zip: Compressing Text
       Classification Models. 5th International Conference on Learning
       Representations Proceedings.
    3. Word2Vec: Mikolov et al. 2013. Distributed Representations of Words and
       Phrases and their Compositionality.


    TF-IDF(1): the appearance frequency of each word vs. the sentence and the
    whole document (TF formula, IDF formula, normalization).
    FastText(2): pre-trained DNN model from Facebook Research.
    Word2Vec(3): pre-trained DNN model using skip-grams.
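As an illustration of the first method, here is a minimal pure-Python normalized TF-IDF on a toy tokenized corpus (the function name and the toy corpus are illustrative; the actual pipeline also used pre-trained FastText and Word2Vec embeddings):

```python
import math

def tfidf_vectors(docs):
    # docs: list of tokenized documents (lists of words).
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Document frequency of each word; idf(w) = log(N / df(w)).
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        # tf(w, d) = count(w in d) / len(d), weighted by idf.
        vec = [(d.count(w) / len(d)) * math.log(n / df[w]) for w in vocab]
        # L2 normalization ("Normalized TF-IDF").
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors
```

Note that words appearing in every document (here "stock") get idf = 0 and contribute nothing, which is exactly the down-weighting TF-IDF is designed for.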


  25. 4. Train Test Split
    We split the data randomly into training and testing datasets with a
    70%:30% proportion, for each feature set.


    (Each feature set – Normalized TF-IDF, FastText, Word2Vec – is split into a
    70% train dataset and a 30% test dataset.)
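A minimal sketch of the random 70/30 split; splitting by shared indices (an assumption here, not stated in the deck) is one way to keep the three feature matrices aligned on the same examples:

```python
import random

def train_test_split_indices(n, test_ratio=0.3, seed=42):
    # Shuffle row indices reproducibly, then cut at the 70% mark.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_ratio))
    return idx[:cut], idx[cut:]
```

The same (train, test) index pair can then slice the TF-IDF, FastText, and Word2Vec matrices so every model sees identical train and test examples.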


  26. 5. Data Training
    We train on our dataset, both buyer and seller data, to get predictions. We
    used several machine learning algorithms and compared the accuracy of each
    model. The algorithms we used are:
    1. Logistic Regression (baseline model)
    2. K-Nearest Neighbor
    3. Naïve Bayes
    4. Decision Tree
    5. Random Forest
    6. Gradient Boosting Classifier
    7. Extreme Gradient Boosting


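The comparison loop can be sketched as below, assuming scikit-learn and synthetic data in place of the buyer/seller feature matrices; only three of the seven algorithms are shown:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for one of the feature matrices (e.g. Word2Vec).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Logistic Regression (baseline)": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                                  # train
    scores[name] = accuracy_score(y_te, model.predict(X_te))  # evaluate
```

The real pipeline repeats this loop once per feature set and per helix, producing the accuracy grid shown later in the results.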


  27. 5. Data Training
    We trained our data using K-fold cross-validation, which evaluates model
    performance by repeatedly subsetting the data into training and validation
    folds.


    K-Fold Cross Validation
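Assuming scikit-learn, K-fold cross-validation can be run in one call; K=5 below is an illustrative choice, as the deck does not state K:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; each of the 5 folds serves once as the validation
# set while the remaining folds train the model.
X, y = make_classification(n_samples=150, n_features=8, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```

Averaging `cv_scores` gives a more stable performance estimate than a single train/validation split.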


  28. 6. Model Evaluation and Selection
    To evaluate model performance, we used several metrics: accuracy score,
    negative recall, and AUC score. In particular, we focused on pushing down
    false positives by keeping the negative-recall value high.
    A higher accuracy score indicates balanced performance in predicting both
    the 0 and 1 classes.
    A higher (negative) recall indicates very good performance at avoiding
    false-positive cases.
    A higher AUC score indicates better reliability of the model on an
    imbalanced dataset.


    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    Precision = TP / (TP + FP)
    Recall(+) = TP / (TP + FN)
    Recall(−) = TN / (TN + FP)
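These metrics can be computed directly from the confusion-matrix counts; the function below is a minimal sketch (its name is chosen here for illustration), with negative recall being the quantity the deck optimizes to push down false positives:

```python
def confusion_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells for binary labels 0/1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall_pos": tp / (tp + fn) if tp + fn else 0.0,
        # Negative recall (specificity): share of true negatives recovered.
        "recall_neg": tn / (tn + fp) if tn + fp else 0.0,
    }
```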


  29. 7. Hyperparameter Optimization
    The hyperparameters of the chosen models are then tuned to reach optimal,
    higher accuracy on the train dataset.
    We used Randomized Search and Bayesian Search to find the best parameters
    in the hyperparameter space for the best selected models.
    Related academic papers:
    1. Random Search: Zabinsky, Zelda B. 2009. Random Search Algorithms.
       University of Washington, USA.
    2. Bayesian Search: Shahriari, Bobak, et al. 2016. Taking the Human Out of
       the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE
       (Volume 104, Issue 1).


    Bayesian Search
    Random Search
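Assuming scikit-learn, randomized search over a small illustrative hyperparameter space looks like this (the Bayesian search used hyperopt and is not shown; the parameter values are examples, not the deck's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data for the buyer feature matrix.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

# Sample 4 random configurations from the space, evaluated with 3-fold CV.
param_space = {"n_estimators": [10, 25, 50], "max_depth": [2, 4, None]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_space, n_iter=4, cv=3, random_state=0)
search.fit(X, y)
best = search.best_params_
```

Randomized search trades exhaustiveness for speed: it samples `n_iter` configurations instead of trying every combination as grid search would.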


  30. Result and Discussion
    04 Data-driven Insight | Data-driven Decision


  31. Results

    Seller Model (accuracy / negative recall / log loss per feature):
    Model                        | TFIDF                  | Word2Vec               | FastText
    Logistic Regression          | 85.37% / 56.20% / 0.37 | 88.52% / 63.28% / 0.33 | 73.65% / 0.00% / 0.58
    K-Nearest Neighbor           | 84.18% / 58.39% / 1.80 | 87.90% / 70.33% / 1.56 | 68.89% / 10.00% / 0.63
    Naïve Bayes                  | 85.52% / 59.29% / 0.39 | 84.95% / 66.88% / 1.04 | 73.65% / 0.00% / 0.58
    Decision Tree                | 84.04% / 62.57% / 4.10 | 84.28% / 72.50% / 5.43 | 73.65% / 0.00% / 0.58
    Random Forest                | 84.65% / 60.20% / 1.16 | 88.33% / 71.77% / 0.95 | 73.65% / 0.00% / 0.58
    Gradient Boosting Classifier | 84.52% / 52.25% / 0.38 | 88.09% / 69.96% / 0.31 | 73.65% / 0.00% / 0.58
    Extreme Gradient Boosting    | 83.23% / 49.54% / 0.40 | 88.42% / 69.96% / 0.30 | 73.65% / 0.00% / 0.58

    Buyer Model (accuracy / negative recall / log loss per feature):
    Model                        | TFIDF                  | Word2Vec               | FastText
    Logistic Regression          | 85.84% / 85.66% / 0.37 | 84.46% / 79.74% / 0.43 | 53.96% / 100.00% / 0.69
    K-Nearest Neighbor           | 81.40% / 80.09% / 0.97 | 84.17% / 84.28% / 1.59 | 47.77% / 20.14% / 0.72
    Naïve Bayes                  | 86.55% / 82.75% / 0.35 | 83.45% / 84.81% / 0.74 | 53.96% / 100.00% / 0.69
    Decision Tree                | 85.69% / 85.13% / 3.16 | 76.83% / 77.59% / 8.00 | 53.96% / 100.00% / 0.69
    Random Forest                | 86.55% / 88.04% / 0.55 | 83.02% / 86.95% / 1.05 | 53.96% / 100.00% / 0.69
    Gradient Boosting Classifier | 86.12% / 87.79% / 0.33 | 84.17% / 84.00% / 0.38 | 53.96% / 100.00% / 0.69
    Extreme Gradient Boosting    | 84.12% / 87.26% / 0.35 | 83.45% / 84.27% / 0.37 | 53.96% / 100.00% / 0.69


  32. Hyperparameter Tuning Results
    The XGBoost (Extreme Gradient Boosting) model outperforms the other models
    in the seller text classification, while the Random Forest model fits best
    on the buyer text classification.
    Buyer Model (First Helix): Random Forest tuned by Random Search → AUC: 88.28%
    Seller Model (Second Helix): XGBoost tuned by Bayesian Hyperopt → AUC: 92.31%



  33. What is Next?
    Buyer Model (First Helix): Random Forest tuned by Random Search → AUC: 88.28%
    Seller Model (Second Helix): XGBoost tuned by Bayesian Hyperopt → AUC: 92.31%
    Store weights (model parameters) → Integrate with the microservices → Deploy


  34. Thank you
    Fiqry Revadiansyah | Data Scientist | @fiqryr
    Abdullah Ghifari | Data Scientist | @abdullahghifari
    Rya Meyvriska | Data Scientist | @rya_mey
