Levon Ter-Isahakyan (Agoda), How Agoda uses Machine Learning and Big Data tools to scale aggregation of millions of properties and rooms, CodeFest 2017

CodeFest
January 30, 2018

https://2017.codefest.ru/lecture/1197

Agoda is experiencing intense growth these days. We add a new partner every one to two weeks; each brings tens of thousands of properties and many more rooms, and allows us to offer our customers the best prices. Do we just kick back and enjoy the results? Not even close.

Every integration can create thousands of duplicate properties. Once we identify these duplicates, we uncover an even bigger problem: hundreds of thousands of duplicate rooms. Add in data integrity issues, a lack of consistent standards, and the scale we operate at, and it never stops.

The answer to our problems? Machine learning, Big Data tools, and an almost fully automated process. This talk will focus on how we tackled all these issues, the compromises we had to make, and, most importantly, how we are scaling the process to support our growth.

Transcript

  1. [Pipeline diagram. Labels: Results, ML, Properties, Data Pipeline, Partner 1, Agoda, Partner 2, Data Preparation, Train, Predict, Model, Cleanup, DB, Manual Verification]
  2. ML (Magic Layer)

    [Diagram. Training: Training Set, Training Labels, Features, Training, Learned model (F). Testing: Test Input, Features, Learned model (F), Prediction]
  3. What is Classification?

    • The process in which ideas and objects are recognized
    • e.g. generate a prediction of a category based on an image: f(image) = “apple”, f(image) = “tomato”, f(image) = “cow”
  4. Training

    Y = F(X), where Y is the output, F is the prediction function, and X is the feature (input).
    Given a set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the prediction function F.
  5. Testing

    Y = F(X), where Y is the output, F is the prediction function, and X is the feature (input).
    Apply F to a never-before-seen test example x and output the predicted value.
  6. Multi class vs Two class

    • Multi Class
      o Allows for many different categories
    • Two Class
      o Only 2 categories allowed (+/1, -/0)
      o Multi class can be generalized as many two class problems

    f(image) = “apple” (1), f(image) = “no apple” (0)
  7. How to find the model (F)

    • Machine Learning comes to help
    • Lots of techniques exist
      • SVM
      • Neural Networks
      • Logistic classification
      • Randomized Forests
      • Etc.
  8. Nearest Neighbor

    f(x) = label of the training example nearest to x
    • Represent input images as points in some space (feature space)
    • All we need is a distance function for our inputs
    • No training required!

    [Plot: a test example among training examples from class “apple” and training examples from class “non apple”]
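The nearest-neighbor rule above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the 2-D feature vectors are made up, and Euclidean distance is assumed as the distance function.

```python
import math


def nearest_neighbor_label(x, training_examples):
    """Return the label of the training example closest to x.

    training_examples is a list of (feature_vector, label) pairs.
    Euclidean distance is an assumption; any distance function works.
    """
    _, label = min(
        training_examples,
        key=lambda example: math.dist(x, example[0]),
    )
    return label


# Hypothetical 2-D feature vectors, made up for illustration.
training = [
    ((1.0, 1.0), "apple"),
    ((1.2, 0.8), "apple"),
    ((4.0, 4.5), "non apple"),
    ((5.0, 4.0), "non apple"),
]

print(nearest_neighbor_label((1.1, 1.1), training))
print(nearest_neighbor_label((4.2, 4.1), training))
```

There is no training step: all the work happens at prediction time, which is why the cost grows with the number of stored examples.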
  9. Linear

    Find a linear function to separate the classes: f(x) = sgn(w · x + b)

    [Plot: training points labeled “apple” and “non apple”]
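As a sketch of the linear decision rule f(x) = sgn(w · x + b), the snippet below evaluates it for a hand-picked weight vector. The values of `w` and `b` are hypothetical; in practice they would be learned from the labeled examples.

```python
def sgn(value):
    """Sign function; a score of exactly 0 is mapped to +1 here by convention."""
    return 1 if value >= 0 else -1


def linear_classify(x, w, b):
    """Classify x as +1 or -1 using f(x) = sgn(w . x + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sgn(score)


# Hypothetical parameters: the separating line is x1 + x2 = 5.
w = (1.0, 1.0)
b = -5.0

print(linear_classify((1.0, 1.0), w, b))  # score = -3.0
print(linear_classify((4.0, 4.5), w, b))  # score = 3.5
```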
  10. It is a frontier which best segregates two classes (a hyperplane in higher dimensions)

    [Plot: axes X and Y, with the Support Vectors marked]
  11. • Use labeled samples to build a model

    • For each new entity, generate all possible pairs
    • Classify using the model
    • Several pairs can be classified as a match
      o Decide between them
  12. The FLOW

    Training Labels: (Oak hotel, oak residence), (poisson resort, poisson), (lebua @S, lebua state), etc.
    [Diagram. Training: Training pairs, Feature Extraction, Training, Learned model (f). Testing: (MBS, Marina bay sands), Feature Extraction, Learned model (f), Prediction]
  13. Feature Extraction

    • Pan Pacific Singapore Hotel; 7 Raffles Boulevard, Marina Square, Marina Bay, Singapore, Singapore, 039595; (1.292, 103.86)
    • Pan Pacific Singapore; 7 Raffles-Boulevard, Marina Bay, Singapore, 039595; (1.293, 103.858)

    Name Exact Match:     0
    Name Filtered Match:  1
    Name Edit Distance:   .22
    Geo - Distance:       0.5
    Address Match:        0
    Address Number:       1
    Zip-code match:       1
    City match:           1
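A rough sketch of how pairwise features like these could be computed for the Pan Pacific pair. Everything here is an assumption for illustration: the record layout (separate `zip` and `city` fields), the `GENERIC_WORDS` filter list, normalizing edit distance by the longer name, and measuring geo distance in kilometres with the haversine formula. The slide does not say how its values were scaled, so the geo value is not expected to reproduce the 0.5 shown.

```python
import math
import re

GENERIC_WORDS = {"hotel", "resort", "residence"}  # hypothetical filter list


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def filtered(name):
    """Lowercase the name and drop generic words such as 'hotel'."""
    return " ".join(w for w in name.lower().split() if w not in GENERIC_WORDS)


def first_number(address):
    """Return the first run of digits in the address, or None."""
    match = re.search(r"\d+", address)
    return match.group() if match else None


def extract_features(a, b):
    """Build a feature dict for one candidate pair of property records."""
    name_a, name_b = a["name"].lower(), b["name"].lower()
    longest = max(len(name_a), len(name_b)) or 1  # avoid dividing by zero
    return {
        "name_exact_match": int(name_a == name_b),
        "name_filtered_match": int(filtered(a["name"]) == filtered(b["name"])),
        "name_edit_distance": edit_distance(name_a, name_b) / longest,
        "geo_distance_km": haversine_km(a["geo"], b["geo"]),
        "address_match": int(a["address"].lower() == b["address"].lower()),
        "address_number_match": int(
            first_number(a["address"]) == first_number(b["address"])),
        "zip_match": int(a["zip"] == b["zip"]),
        "city_match": int(a["city"].lower() == b["city"].lower()),
    }


partner = {
    "name": "Pan Pacific Singapore Hotel",
    "address": "7 Raffles Boulevard, Marina Square, Marina Bay, Singapore, Singapore, 039595",
    "zip": "039595", "city": "Singapore", "geo": (1.292, 103.86),
}
agoda = {
    "name": "Pan Pacific Singapore",
    "address": "7 Raffles-Boulevard, Marina Bay, Singapore, 039595",
    "zip": "039595", "city": "Singapore", "geo": (1.293, 103.858),
}

features = extract_features(partner, agoda)
for key, value in features.items():
    print(f"{key}: {value:.2f}")
```

No single feature is decisive: the exact name match fails while the filtered match succeeds, which is why the features are combined by a model rather than used as hard rules.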
  14. SVM as a distance

    • SVM combines all weaker features to provide a robust distance metric
    • Calculate distance to all possible candidates
    • Accuracy is proportional to distance from hyperplane

    [Table. Agoda properties: Amara Sentosa Hotel, Pan Pacific Singapore, Yes Chinatown Point, Nice! by a beary good hostel. Partner property: Pan Pacific Singapore Hotel. SVM (distance from HP) for Pair 1, Pair 2, Pair 3, Pair 4: low, low, low, high. The high-distance pair is the Match]
  15. Several similar candidates

    • One Match
      • Probably our match
    • Many matches
      • Not sure which one is the right one…
    • No Match
      • Either it is missing or the model failed to catch it
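One way to handle the three outcomes is sketched below. It scores each candidate by its signed distance from the hyperplane and, when several candidates are classified as a match, keeps the one farthest on the positive side. That tie-breaking rule, the two-element feature vectors, and the values of `w` and `b` are assumptions for illustration, not the talk's stated method.

```python
import math


def signed_distance(features, w, b):
    """Signed distance of a feature vector from the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, features)) + b) / norm


def resolve(candidates, w, b):
    """Return (status, best_candidate) for a dict of name -> feature vector.

    status is "one_match", "many_matches", or "no_match".
    """
    distances = {name: signed_distance(f, w, b) for name, f in candidates.items()}
    matches = {name: d for name, d in distances.items() if d > 0}
    if not matches:
        return "no_match", None
    best = max(matches, key=matches.get)
    status = "one_match" if len(matches) == 1 else "many_matches"
    return status, best


# Hypothetical model and features: (name_filtered_match, zip_match).
w = (1.0, 1.0)
b = -1.5

candidates = {
    "Amara Sentosa Hotel": (0.0, 0.0),
    "Pan Pacific Singapore": (1.0, 1.0),
    "Yes Chinatown Point": (0.0, 1.0),
}

print(resolve(candidates, w, b))
```

A "many_matches" or "no_match" status is a natural signal to route the entity to manual verification instead of trusting the top candidate.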
  16. − Yes, but have to be careful

    − Prediction is expensive
    − Cannot consider all pairs
    − Reduce by only pairing in same country, geo proximity, etc.
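The pair-reduction idea can be sketched as a simple blocking step: group existing properties by country, then keep only pairs within a distance threshold. The record fields, the `max_km` threshold, and the coordinates of every property other than the Pan Pacific pair are hypothetical.

```python
import math
from collections import defaultdict


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def candidate_pairs(partner_props, agoda_props, max_km=1.0):
    """Yield (partner_name, agoda_name) pairs worth sending to the classifier."""
    by_country = defaultdict(list)
    for prop in agoda_props:
        by_country[prop["country"]].append(prop)

    for new in partner_props:
        # Only compare against properties in the same country.
        for existing in by_country.get(new["country"], []):
            if haversine_km(new["geo"], existing["geo"]) <= max_km:
                yield new["name"], existing["name"]


partner_props = [
    {"name": "Pan Pacific Singapore Hotel", "country": "SG", "geo": (1.292, 103.86)},
]
agoda_props = [
    {"name": "Pan Pacific Singapore", "country": "SG", "geo": (1.293, 103.858)},
    # Coordinates and countries below are made up for illustration.
    {"name": "Far Away Property", "country": "SG", "geo": (1.25, 103.82)},
    {"name": "Other Country Property", "country": "TH", "geo": (13.75, 100.5)},
]

pairs = list(candidate_pairs(partner_props, agoda_props))
print(pairs)
```

The trade-off is that blocking rules can hide true matches whose data is wrong, for example a property with reversed coordinates would never be paired with its duplicate.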
  17. − Longitude and latitude reversed

    − City names different
    − Zip code numbers swapped
    − Can’t catch them all.
  18. − Run on a larger set, less frequently

    − Verify and make up the difference manually