Levon Ter-Isahakyan (Agoda), How Agoda uses Machine Learning and Big Data tools to scale aggregation of millions of properties and rooms, CodeFest 2017

CUPID: Entities match maker
A multi-purpose classification framework
Levon Ter-Isahakyan

We are a Super-Aggregator BEST PRICE

DUPLICATES

Solution?

Matching?

Can’t do everything by hand

Automation

Cupid: a multi-purpose classification framework. De-duplicates properties and rooms.

The Toys
Hardware: 50 TB, 10 PB, 11k cores

Data Pipeline
[Diagram labels: Properties from Partner 1, Agoda, Partner 2; ML; Results; Data Preparation; Train; Predict; Model; Cleanup; DB; Manual Verification]

ML (Magic Layer)
Training: Training Set (Training Features + Training Labels) → Training → Learned model (F)
Testing: Test Input → Features → Learned model (F) → Prediction

Matching is a Classification problem?

What is Classification?
• The process in which ideas and objects are recognized
• e.g. generate a prediction of a category based on an image: f(image) = “apple”, f(image) = “tomato”, f(image) = “cow”

Y = F(X)
Y: output; F: prediction function; X: feature (input)
The aim is to find F

Y = F(X)
Y: output; F: prediction function; X: feature (input)
Training: given a set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the prediction function F

Y = F(X)
Y: output; F: prediction function; X: feature (input)
Testing: apply F to a never-before-seen test example x and output the predicted value

Multi class vs Two class
• Multi Class
o Allows for many different categories
• Two Class
o Only 2 categories allowed (+/1, -/0)
o Multi class can be generalized as many two class problems
f(image) = “apple” (1), f(image) = “no apple” (0)

How to find the model (F)
• Machine Learning comes to help
• Lots of techniques exist
• SVM
• Neural Networks
• Logistic classification
• Randomized Forests
• Etc.

Nearest Neighbor
[Figure labels: test example; training examples from class “apple”; training examples from class “non apple”]
f(x) = label of the training example nearest to x
• Represent input images as points in some space (feature space)
• All we need is a distance function for our inputs
• No training required!
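The nearest-neighbor rule above fits in a few lines. This is a minimal sketch, assuming inputs are already points in a numeric feature space and that Euclidean distance is an acceptable distance function; the toy points are made up for illustration.

```python
import math

def nearest_neighbor_label(x, training_examples):
    """Return the label of the training example closest to x.

    training_examples is a list of (feature_vector, label) pairs.
    """
    def distance(a, b):
        # Euclidean distance; any distance function on the inputs would do.
        return math.dist(a, b)

    _, label = min(training_examples, key=lambda ex: distance(x, ex[0]))
    return label

# Toy 2-D feature space (made-up points, for illustration only).
training = [
    ((1.0, 1.0), "apple"),
    ((1.2, 0.8), "apple"),
    ((4.0, 4.0), "non apple"),
    ((4.5, 3.5), "non apple"),
]

print(nearest_neighbor_label((1.1, 1.1), training))
print(nearest_neighbor_label((4.2, 3.9), training))
```

There is no training step: all the work happens at prediction time, which scans every stored example.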

Linear
[Figure: points labeled “apple” and “non apple” on either side of a separating line]
Find a linear function to separate the classes: f(x) = sgn(w · x + b)
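To make the linear decision rule concrete, here is a small sketch that evaluates f(x) = sgn(w · x + b). The weights `w` and bias `b` are hand-picked hypothetical values, not learned ones, and mapping +1 to “apple” is an assumption made for the example.

```python
def sgn(value):
    # Convention assumed here: a value of exactly 0 falls on the positive side.
    return 1 if value >= 0 else -1

def linear_classify(x, w, b):
    """Evaluate f(x) = sgn(w . x + b) for one feature vector x."""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical hand-picked parameters: the boundary is the line x1 + x2 = 5.
w = (1.0, 1.0)
b = -5.0

LABELS = {1: "apple", -1: "non apple"}
print(LABELS[linear_classify((4.0, 4.0), w, b)])  # 4 + 4 - 5 = 3, positive side
print(LABELS[linear_classify((1.0, 1.0), w, b)])  # 1 + 1 - 5 = -3, negative side
```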

Generalization

Simple Classifiers don’t give High Accuracy

Need Classifier that works in high dimensions

SVM, or Support Vector Machine, is a supervised Machine Learning algorithm.

It is a frontier which best segregates two classes (a hyperplane in higher dimensions)
[Figure labels: Support Vectors; axes X and Y]

How to choose the right Hyperplane

Select the hyperplane that best segregates the two classes

Select the one that maximizes the distance between the nearest data points and the hyperplane
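The selection rule can be illustrated by scoring a few candidate hyperplanes by their margin, that is, the distance to the nearest data point. This is only a sketch of the criterion: a real SVM solves an optimization problem over all hyperplanes rather than choosing from a fixed list, and the data and candidates below are made up.

```python
import math

def margin(w, b, points):
    """Smallest distance from any point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x in points)

def separates(w, b, labeled_points):
    """True if every point lies on the side given by its label (+1 or -1)."""
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0
               for x, y in labeled_points)

# Toy data (made up): two classes on either side of a gap.
data = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 4.0), 1), ((5.0, 4.0), 1)]
points = [x for x, _ in data]

# Hypothetical candidate hyperplanes, each given as (w, b).
candidates = [((1.0, 0.0), -3.5), ((0.0, 1.0), -2.5), ((1.0, 1.0), -5.5)]

# Keep only the candidates that separate the classes, then maximize the margin.
valid = [(w, b) for w, b in candidates if separates(w, b, data)]
best = max(valid, key=lambda wb: margin(wb[0], wb[1], points))
print(best, round(margin(best[0], best[1], points), 3))
```

All three candidates separate the toy data, so the margin alone decides which one is kept.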

What about Hotel Matching?

Matching is a 2-class classification problem: use SVM

• Use labeled samples to build a model
• For each new entity, generate all possible pairs
• Classify using the model
• Several pairs can be classified as a match
o Decide between them

The FLOW
Training: Training pairs with Training Labels, e.g. (Oak hotel, oak residence), (poisson resort, poisson), (lebua @S, lebua state), etc. → Feature Extraction → Training → Learned model (f)
Testing: (MBS, Marina bay sands) → Feature Extraction → Learned model (f) → Prediction

Feature Extraction
Name: Pan Pacific Singapore Hotel | Pan Pacific Singapore
Address: 7 Raffles Boulevard, Marina Square, Marina Bay, Singapore, Singapore, 039595 | 7 Raffles-Boulevard, Marina Bay, Singapore, 039595
Coordinates: (1.292, 103.86) | (1.293, 103.858)
Features:
• Name Exact Match: 0
• Name Filtered Match: 1
• Name Edit Distance: 0.22
• Geo-Distance: 0.5
• Address Match: 0
• Address Number: 1
• Zip-code match: 1
• City match: 1
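A few of these pairwise features can be sketched as follows. This is an illustration under stated assumptions, not the talk's actual extractor: the stop-word list, the normalization of edit distance by the longer name, the use of haversine distance in kilometres, and taking the last comma-separated address component as the zip code are all hypothetical choices. The address and city features are left out.

```python
import math
import re

STOP_WORDS = {"hotel", "resort", "residence"}  # hypothetical filter list

def normalize(name):
    """Lower-case, drop punctuation, and remove the assumed stop words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)

def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def extract_features(a, b):
    """Build a feature dict for one candidate pair of property records."""
    longest = max(len(a["name"]), len(b["name"])) or 1
    zip_a = a["address"].split(",")[-1].strip()
    zip_b = b["address"].split(",")[-1].strip()
    return {
        "name_exact_match": int(a["name"] == b["name"]),
        "name_filtered_match": int(normalize(a["name"]) == normalize(b["name"])),
        "name_edit_distance": edit_distance(a["name"], b["name"]) / longest,
        "geo_distance_km": haversine_km(a["geo"], b["geo"]),
        "zip_code_match": int(zip_a == zip_b),
    }

# The two records from the slide.
first = {"name": "Pan Pacific Singapore Hotel",
         "address": "7 Raffles Boulevard, Marina Square, Marina Bay, Singapore, Singapore, 039595",
         "geo": (1.292, 103.86)}
second = {"name": "Pan Pacific Singapore",
          "address": "7 Raffles-Boulevard, Marina Bay, Singapore, 039595",
          "geo": (1.293, 103.858)}

features = extract_features(first, second)
print(features)
```

Under these assumptions the name features agree with the slide (exact match 0, filtered match 1, edit distance about 0.22). The geo-distance value will not, since the slide does not state its unit or scaling.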

SVM as a distance
• SVM combines all weaker features to provide a robust distance metric
• Calculate distance to all possible candidates
• Prediction accuracy increases with distance from the hyperplane
[Figure: the Partner property “Pan Pacific Singapore Hotel” is paired with four Agoda candidates (Amara Sentosa Hotel; Pan Pacific Singapore; Yes Chinatown Point; Nice! by a beary good hostel) as Pair 1 to Pair 4. SVM distance from the hyperplane: low, low, low, high. The high-distance pair is the match.]
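For a linear model, the distance from the hyperplane is (w · x + b) / ‖w‖, and it can be used to rank candidates. The sketch below assumes a linear decision function with hypothetical weights standing in for a trained model; the feature vectors and candidate names are made up, and with a non-linear kernel the score would be a decision value rather than a geometric distance in feature space.

```python
import math

def signed_distance(x, w, b):
    """Signed distance from feature vector x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical weights standing in for a trained linear model.
# Assumed feature order: (name_filtered_match, name_similarity, geo_closeness).
w = (2.0, 1.0, 1.0)
b = -2.0

# Made-up feature vectors for four candidate pairs.
pairs = {
    "candidate_1": (0.0, 0.2, 0.1),
    "candidate_2": (1.0, 0.8, 0.9),
    "candidate_3": (0.0, 0.1, 0.3),
    "candidate_4": (0.0, 0.3, 0.0),
}

scores = {name: signed_distance(x, w, b) for name, x in pairs.items()}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:+.3f}")
print("best candidate:", best)
```

A positive score lands on the "match" side of the hyperplane; the larger it is, the more weight the individual features jointly lend to that pair.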

Several similar candidates
• One Match
o Probably our match
• Many matches
o Not sure which one is the right one…
• No Match
o Either it is missing or the model failed to catch it
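The three outcomes can be expressed as a small triage step over the classifier scores for one new entity. This is a sketch: the function name, the status strings, and the threshold of 0 are assumptions, and what happens to ambiguous cases (tie-break rule or manual verification) is left open.

```python
def triage(scores, threshold=0.0):
    """Classify the outcome for one new entity given its candidate scores.

    scores maps candidate id -> classifier score; a score above threshold
    counts as a predicted match. Returns (status, matching candidate ids),
    with candidates ordered from highest to lowest score.
    """
    matches = sorted((c for c, s in scores.items() if s > threshold),
                     key=lambda c: scores[c], reverse=True)
    if len(matches) == 1:
        return "match", matches      # one match: probably our match
    if len(matches) > 1:
        return "ambiguous", matches  # many matches: needs a further decision
    return "no_match", []            # missing entity, or a model miss

print(triage({"a": 0.7, "b": -1.0}))
print(triage({"a": 0.7, "b": 0.2}))
print(triage({"a": -0.1}))
```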

It works, but does it Scale?

− Yes, but we have to be careful
− Prediction is expensive
− Cannot consider all pairs
− Reduce by only pairing within the same country, geo proximity, etc.
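Restricting the pairs before prediction might look like the sketch below: group the catalog by country, then keep only candidates within a distance cutoff. The record fields, the ids, and the 1 km cutoff are assumptions chosen for the example.

```python
import math
from collections import defaultdict

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def candidate_pairs(new_entities, catalog, max_km=1.0):
    """Yield (new id, catalog id) pairs sharing a country and lying within max_km."""
    by_country = defaultdict(list)
    for item in catalog:
        by_country[item["country"]].append(item)
    for entity in new_entities:
        # Only entries from the same country are ever compared.
        for item in by_country.get(entity["country"], []):
            if haversine_km(entity["geo"], item["geo"]) <= max_km:
                yield entity["id"], item["id"]

# Made-up records for illustration.
catalog = [
    {"id": "A1", "country": "SG", "geo": (1.293, 103.858)},
    {"id": "A2", "country": "SG", "geo": (1.250, 103.820)},
    {"id": "A3", "country": "TH", "geo": (13.75, 100.50)},
]
new_entities = [{"id": "P1", "country": "SG", "geo": (1.292, 103.86)}]

print(list(candidate_pairs(new_entities, catalog)))
```

The trade-off is that records with bad country or coordinate data never reach the classifier at all, so a filter like this relies on those fields being mostly correct.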

Data integrity issues

− Longitude and latitude reversed
− City names different
− Zip code numbers swapped
− Can’t catch them all.
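One of these issues, reversed coordinates, can sometimes be detected from value ranges alone. The sketch below is a hypothetical check, not a description of Agoda's cleanup: it only catches cases where the stored latitude is outside [-90, 90] while the stored longitude would be a valid latitude, so swaps where both values are at most 90 in magnitude go unnoticed.

```python
def looks_swapped(lat, lon):
    """Flag coordinates whose latitude is impossible but valid if swapped."""
    return abs(lat) > 90 and abs(lon) <= 90

def repair_coordinates(lat, lon):
    """Return (lat, lon), swapping the two values when they look reversed."""
    return (lon, lat) if looks_swapped(lat, lon) else (lat, lon)

print(repair_coordinates(103.86, 1.292))  # reversed Singapore coordinates
print(repair_coordinates(1.292, 103.86))  # already in (lat, lon) order
print(repair_coordinates(40.0, 13.75))    # ambiguous: left unchanged
```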

− Run on a larger set, less frequently
− Verify and make up the difference manually

Thank You

Questions? [email protected] Levon Ter-Isahakyan
