Levon Ter-Isahakyan (Agoda), How Agoda uses Machine Learning and Big Data tools to scale aggregation of millions of properties and rooms, CodeFest 2017

CodeFest
January 30, 2018


Agoda is experiencing intense growth these days. Every one to two weeks we add a new partner that brings tens of thousands of properties and many more rooms, and lets us offer our customers the best prices. Do we just kick back and enjoy the results? Not even close.

Every integration can create thousands of duplicate properties. Once we identify these duplicates, an even bigger problem appears: hundreds of thousands of duplicate rooms. Add in data integrity issues, a lack of consistent standards, and the scale we operate at, and it never stops.

The answer to our problems? Machine learning, Big Data tools and an almost fully automated process. This talk will focus on how we tackled all these issues, the compromises we had to make and most importantly how we are scaling it to support our growth.





  1. CUPID: Entities match maker. A multi-purpose classification framework. Levon Ter-Isahakyan

  3. We are a Super-Aggregator BEST PRICE

  9. Solution?

  10. Matching?

  11. Can’t do everything by hand

  12. Automation

  14. Cupid: a multi-purpose classification framework. De-duplicate properties and rooms.

  15. The Toys. Hardware: 50 TB, 10 PB, 11k cores

  16. Data Pipeline [Diagram labels: Partner 1, Agoda, Partner 2, Data Preparation, Train, Predict, Model, Cleanup, DB, Manual Verification, ML, Properties, Results]
  17. ML (Magic Layer) [Diagram labels. Training: Training Set, Training Labels, Features, Learned model (F). Testing: Test Input, Features, Learned model (F), Prediction]
  19. Matching is a Classification problem?

  20. What is Classification?
      • The process in which ideas and objects are recognized
      • e.g. generate a prediction of a category based on an image:
        f(image) = “apple”, f(image) = “tomato”, f(image) = “cow”
  21. Y = F(X), where Y is the output, F the prediction function and X the feature (input).
      The aim is to find F.
  22. Training: given a set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the prediction function F.
  23. Testing: apply F to a never-before-seen test example x and output the predicted value.
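
The training and testing steps can be sketched in a few lines. This is only an illustration, assuming a one-dimensional feature and a hypothetical threshold learner (`train_threshold`); the talk does not say which learner was used at this point.

```python
def train_threshold(examples):
    """Training: estimate F from labeled examples [(x, y), ...], y in {0, 1}.

    F is a threshold rule (predict 1 when x >= t). The threshold t is picked
    among the midpoints between sorted x values so that training errors are
    minimal. This learner is an assumption made for illustration.
    """
    xs = sorted(x for x, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum((x >= t) != bool(y) for x, y in examples)

    best_t = min(candidates, key=errors)
    return lambda x: 1 if x >= best_t else 0


# Labeled examples {(x1, y1), ..., (xN, yN)}; the values are made up.
training_set = [(0.1, 0), (0.2, 0), (0.35, 0), (0.7, 1), (0.8, 1), (0.95, 1)]

F = train_threshold(training_set)   # training: estimate F
prediction = F(0.9)                 # testing: apply F to an unseen x
```

The point is the interface: training consumes labeled pairs and returns a function, and testing only calls that function.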
  24. Multi class vs Two class
      • Multi class
        o Allows for many different categories
      • Two class
        o Only 2 categories allowed (+/1, -/0)
        o Multi class can be decomposed into many two-class problems:
          f(image) = “apple” (1), f(image) = “no apple” (0)
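
The reduction from multi class to many two-class problems is often called one-vs-rest. Below is a minimal sketch of it, assuming 2-D feature tuples and a hypothetical centroid-based binary scorer (`train_binary`) standing in for whatever two-class model is actually used.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def train_binary(positives, negatives):
    """Two-class scorer (illustrative): higher means closer to the positives."""
    cp, cn = centroid(positives), centroid(negatives)
    return lambda x: math.dist(x, cn) - math.dist(x, cp)


def train_one_vs_rest(examples):
    """Train one 'label vs. everything else' scorer per label."""
    labels = sorted({y for _, y in examples})
    scorers = {}
    for label in labels:
        pos = [x for x, y in examples if y == label]
        neg = [x for x, y in examples if y != label]
        scorers[label] = train_binary(pos, neg)
    # Predict the label whose two-class scorer is most confident.
    return lambda x: max(scorers, key=lambda label: scorers[label](x))


# Made-up feature vectors for three categories.
examples = [((1.0, 1.0), "apple"), ((1.2, 0.9), "apple"),
            ((5.0, 5.0), "tomato"), ((5.1, 4.8), "tomato"),
            ((9.0, 1.0), "cow"), ((9.2, 1.1), "cow")]
classify = train_one_vs_rest(examples)
```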
  27. How to find the model (F)
      • Machine Learning comes to help
      • Lots of techniques exist
        • SVM
        • Neural Networks
        • Logistic classification
        • Random Forests
        • Etc.
  28. Nearest Neighbor
      [Figure: a test example among training examples from class “apple” and training examples from class “non apple”]
      f(x) = label of the training example nearest to x
      • Represent input images as points in some space (feature space)
      • All we need is a distance function for our inputs
      • No training required!
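
A minimal sketch of the nearest-neighbor rule, assuming inputs are already points in a feature space and that Euclidean distance is an acceptable distance function (the slide leaves the distance function open). The sample points are made up.

```python
import math


def nearest_neighbor(training, x):
    """Return the label of the training example nearest to x.

    training: list of (point, label) pairs. There is no training step;
    the examples themselves are the model.
    """
    _, label = min(training, key=lambda example: math.dist(example[0], x))
    return label


training = [((1.0, 1.0), "apple"), ((1.5, 0.8), "apple"),
            ((6.0, 5.0), "non apple"), ((7.0, 6.5), "non apple")]

label = nearest_neighbor(training, (1.2, 1.1))
```

Every prediction scans all training examples, which is one reason this simple rule gets expensive on large inventories.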
  29. Linear
      [Figure: points labeled “apple” and “non apple” separated by a line]
      Find a linear function to separate the classes: f(x) = sgn(w · x + b)
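
One simple way to find such a w and b is the perceptron update rule. The sketch below is an illustration under that assumption (the slide does not name a training method) and uses made-up, linearly separable points labeled +1 for “apple” and -1 for “non apple”.

```python
def sgn(s):
    """Sign function: 1, 0 or -1."""
    return (s > 0) - (s < 0)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(examples, max_epochs=1000):
    """Find w, b such that sgn(w . x + b) matches the labels.

    Stops early once an epoch passes with no mistakes, which is only
    guaranteed to happen when the classes are linearly separable.
    """
    w, b = [0.0] * len(examples[0][0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            if y * (dot(w, x) + b) <= 0:      # misclassified or on the line
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


examples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((2.5, 3.0), 1),
            ((-2.0, -2.0), -1), ((-3.0, -1.0), -1), ((-1.0, -3.0), -1)]
w, b = train_perceptron(examples)


def f(x):
    return sgn(dot(w, x) + b)
```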
  30. Generalization

  31. Simple Classifiers don’t give High Accuracy

  32. Need Classifier that works in high dimensions

  34. SVM, or Support Vector Machine, is a supervised Machine Learning algorithm

  35. It is a frontier that best segregates two classes (a hyperplane in higher dimensions)
      [Figure: support vectors plotted on X and Y axes]
  36. How to choose the right Hyperplane

  37. Select the hyperplane that best segregates the two classes

  38. Select the one that maximizes the distances between nearest data points and the hyperplane
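
The selection criterion can be made concrete by computing the margin of a few candidate hyperplanes and keeping the largest. This sketch only compares hand-picked candidates on made-up points; it is not an SVM solver, which searches over all hyperplanes.

```python
import math


def margin(w, b, examples):
    """Distance from the hyperplane w . x + b = 0 to its nearest example.

    Uses the signed distance y * (w . x + b) / ||w||, so the result is
    negative when some example lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in examples)


# Labels: +1 and -1 for the two classes.
examples = [((2.0, 2.0), 1), ((3.0, 3.0), 1),
            ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Three hyperplanes that all separate the classes (hypothetical candidates).
candidates = {
    "A": ((1.0, 1.0), -2.0),   # x + y = 2
    "B": ((1.0, 0.0), -1.0),   # x = 1
    "C": ((0.0, 1.0), -1.0),   # y = 1
}

best = max(candidates, key=lambda name: margin(*candidates[name], examples))
```

All three candidates classify the points correctly; candidate A wins because its nearest points are farther away.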
  39. What about Hotel Matching?

  41. Matching is a 2-class classification problem: use SVM

  42. • Use labeled samples to build a model
      • For each new entity, generate all possible pairs
      • Classify using the model
      • Several pairs can be classified as a match
        o Decide between them
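
The steps on this slide can be sketched as a small loop. The scoring function here (`name_similarity`, a token-overlap score) and the 0.5 threshold are hypothetical stand-ins for the learned model, used only so the example runs on its own.

```python
def name_similarity(a, b):
    """Stand-in for the model: Jaccard overlap of lowercase name tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def match_entity(new_entity, inventory, score=name_similarity, threshold=0.5):
    """Pair a new entity with every existing one, classify, then decide."""
    # 1. Generate all possible pairs for the new entity.
    pairs = [(new_entity, existing) for existing in inventory]
    # 2. Classify each pair: keep those scored as a match.
    scored = [(score(a, b), b) for a, b in pairs]
    matches = [(s, b) for s, b in scored if s >= threshold]
    # 3. Several pairs may match: decide between them by highest score.
    if not matches:
        return None
    return max(matches, key=lambda m: m[0])[1]


inventory = ["Amara Sentosa Hotel", "Pan Pacific Singapore",
             "Chinatown Point", "Nice! by a beary good hostel"]

result = match_entity("Pan Pacific Singapore Hotel", inventory)
```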
  43. The FLOW
      [Diagram. Training: training pairs with labels, e.g. (Oak hotel, oak residence), (poisson resort, poisson), (lebua @S, lebua state), etc. → Feature Extraction → Training → Learned model (f).
       Testing: (MBS, Marina bay sands) → Feature Extraction → Learned model (f) → Prediction]
  44. Feature Extraction
      Property A: Pan Pacific Singapore Hotel; 7 Raffles Boulevard, Marina Square, Marina Bay, Singapore, Singapore, 039595; (1.292, 103.86)
      Property B: Pan Pacific Singapore; 7 Raffles-Boulevard, Marina Bay, Singapore, 039595; (1.293, 103.858)
      Features:
        Name Exact Match      0
        Name Filtered Match   1
        Name Edit Distance    .22
        Geo - Distance        0.5
        Address Match         0
        Address Number        1
        Zip-code match        1
        City match            1
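
A minimal sketch of how such a feature vector might be computed for the Pan Pacific pair. Everything beyond the feature names is an assumption: the stop-word list, normalizing edit distance by the longer name (which gives 6/27 ≈ 0.22 for this pair, in line with the slide, although the talk does not state its formula), and expressing geo distance in kilometres (the unit behind the slide's 0.5 is not given). City match is left out because the records here carry no separate city field.

```python
import math
import re

STOPWORDS = {"hotel", "resort", "residence"}  # hypothetical filter list


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def filtered(name):
    return [t for t in name.lower().split() if t not in STOPWORDS]


def find(pattern, text):
    m = re.search(pattern, text)
    return m.group() if m else None


def same(x, y):
    return int(x is not None and x == y)


def extract_features(a, b):
    na, nb = a["name"].lower(), b["name"].lower()
    return {
        "name_exact_match": int(na == nb),
        "name_filtered_match": int(filtered(na) == filtered(nb)),
        "name_edit_distance": levenshtein(na, nb) / max(len(na), len(nb)),
        "geo_distance_km": haversine_km(a["geo"], b["geo"]),
        "address_match": int(a["address"].lower() == b["address"].lower()),
        # First number in the address, e.g. the street number.
        "address_number_match": same(find(r"\d+", a["address"]),
                                     find(r"\d+", b["address"])),
        # Six-digit postal code, as in the Singapore example.
        "zip_code_match": same(find(r"\b\d{6}\b", a["address"]),
                               find(r"\b\d{6}\b", b["address"])),
    }


prop_a = {"name": "Pan Pacific Singapore Hotel", "geo": (1.292, 103.86),
          "address": "7 Raffles Boulevard, Marina Square, Marina Bay, "
                     "Singapore, Singapore, 039595"}
prop_b = {"name": "Pan Pacific Singapore", "geo": (1.293, 103.858),
          "address": "7 Raffles-Boulevard, Marina Bay, Singapore, 039595"}

features = extract_features(prop_a, prop_b)
```

Individually these features are weak: the names differ, the addresses differ, and the coordinates are a few hundred metres apart. It is the combination that identifies the pair.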
  45. SVM as a distance
      • SVM combines all weaker features to provide a robust distance metric
      • Calculate distance to all possible candidates
      • Accuracy is proportional to distance from the hyperplane
      [Diagram: “Pan Pacific Singapore Hotel” is paired, across Agoda and Partner, with four candidates (Amara Sentosa Hotel, Pan Pacific Singapore, Chinatown Point, Nice! by a beary good hostel). SVM distance from the hyperplane for Pairs 1 to 4: low, low, low, high. The high-distance pair is the match.]
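
The distance in question is the signed distance of a pair's feature vector from the separating hyperplane. A minimal sketch, assuming a linear model with hypothetical weights `w` and bias `b` (not learned here, and not the talk's actual model) and made-up three-feature vectors for the four candidates:

```python
import math


def signed_distance(w, b, x):
    """Signed distance of x from the hyperplane w . x + b = 0.

    Positive means the 'match' side; a larger magnitude means a more
    confident decision.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical model over (name_filtered_match, zip_code_match, geo_close).
w, b = (1.0, 1.0, 1.0), -1.5

# Made-up feature vectors for pairing "Pan Pacific Singapore Hotel"
# with each candidate.
candidates = {
    "Amara Sentosa Hotel": (0, 0, 0),
    "Pan Pacific Singapore": (1, 1, 1),
    "Chinatown Point": (0, 0, 1),
    "Nice! by a beary good hostel": (0, 0, 1),
}

distances = {name: signed_distance(w, b, x) for name, x in candidates.items()}
ranked = sorted(distances, key=distances.get, reverse=True)
best = ranked[0]
```

Ranking by this value turns the two-class decision into an ordering of candidates, which is what allows choosing between several plausible matches.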
  46. Several similar candidates
      • One match: probably our match
      • Many matches: not sure which one is the right one…
      • No match: either it is missing or the model failed to catch it
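
These three cases map naturally onto a small triage function. The outcome names (`accept`, `review`, `no_match`) and the zero threshold are hypothetical; the talk only describes the cases, not how they are routed.

```python
def triage(distances, threshold=0.0):
    """Route a new entity based on how many candidates are classified as matches.

    distances: {candidate_name: signed distance from the hyperplane}.
    """
    matches = sorted((name for name, d in distances.items() if d > threshold),
                     key=distances.get, reverse=True)
    if len(matches) == 1:
        return ("accept", matches[0])   # one match: probably our match
    if matches:
        return ("review", matches)      # many matches: unsure which is right
    return ("no_match", None)           # missing, or the model failed to catch it


one = triage({"a": 0.9, "b": -0.4})
many = triage({"a": 0.9, "b": 0.2, "c": -1.0})
none = triage({"a": -0.1})
```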
  47. It works, but does it Scale?

  48. − Yes, but have to be careful
      − Prediction is expensive
      − Cannot consider all pairs
      − Reduce by only pairing in same country, geo proximity, etc.
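
Restricting which pairs are even considered is commonly called blocking. A minimal sketch, assuming each property is a dict with hypothetical `country`, `lat` and `lon` keys and that proximity is approximated by a coarse grid (the cell size is an arbitrary choice for illustration):

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # hypothetical grid size, roughly 1 km near the equator


def cell(lat, lon):
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def candidate_pairs(new_props, inventory):
    """Yield (new, existing) pairs in the same country and in the same or
    an adjacent grid cell, instead of all len(new) * len(inventory) pairs."""
    index = defaultdict(list)
    for prop in inventory:
        index[(prop["country"], cell(prop["lat"], prop["lon"]))].append(prop)
    for prop in new_props:
        ci, cj = cell(prop["lat"], prop["lon"])
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                key = (prop["country"], (ci + di, cj + dj))
                for other in index.get(key, []):
                    yield prop, other


# Made-up records: one nearby, one far away, one in another country.
inventory = [
    {"name": "Pan Pacific Singapore", "country": "SG", "lat": 1.293, "lon": 103.858},
    {"name": "Far Away Inn", "country": "SG", "lat": 1.40, "lon": 103.70},
    {"name": "Same Spot Other Country", "country": "MY", "lat": 1.292, "lon": 103.86},
]
new_props = [
    {"name": "Pan Pacific Singapore Hotel", "country": "SG", "lat": 1.292, "lon": 103.86},
]

pairs = list(candidate_pairs(new_props, inventory))
```

Only the surviving pairs go on to feature extraction and prediction. The trade-off is that a record with a wrong country or wrong coordinates never reaches the model at all.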
  49. Data integrity issues

  50. − Longitude latitude reversed
      − City names different
      − Zip code numbers swapped
      − Can’t catch them all.
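
As one illustration of why such issues cannot all be caught, here is a hypothetical range check for reversed coordinates (`likely_swapped` is not from the talk). It works only when the swapped value falls outside the valid latitude range.

```python
def likely_swapped(lat, lon):
    """True when lat is not a valid latitude but the pair becomes valid
    once the two values are exchanged."""
    return abs(lat) > 90 and abs(lat) <= 180 and abs(lon) <= 90


# Singapore: the longitude (about 103.86) cannot be a latitude, so a swap is detectable.
singapore_swapped = likely_swapped(103.86, 1.292)

# Paris: both values fit the latitude range, so a swap goes unnoticed.
paris_swapped = likely_swapped(2.35, 48.85)
```

For the second case, other signals are needed, for example comparing the coordinates against the stated country.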
  51. − Run on a larger set, less frequently
      − Verify and make up the difference manually
  52. Thank You

  53. Questions? Ltimath@gmail.com Levon Ter-Isahakyan