
Helping Travellers Make Better Hotel Choices 500 Million Times a Month

Presentation given at the Big Data Colombia meetup (now Colombia IA), based on my PyCon DE 2016 talk.

Miguel Cabrera

March 15, 2017

Transcript

  1. Helping Travellers Make Better Hotel Choices 500 Million Times a Month
     Miguel Cabrera (@mfcabrera)
     https://www.flickr.com/photos/18694857@N00/5614701858/
  2. ABOUT ME
     •  Neuberliner
     •  Systems and Informatics Engineering, Universidad Nacional - Medellín
     •  M.Sc. in Informatics, TUM; Hons. Technology Management
     •  Work for TrustYou as Data (Scientist|Engineer|Juggler)™
     •  Founder and former organizer of Munich DataGeeks
  3. AGENDA
     •  What we do
     •  Architecture
     •  Technology
     •  Crawling
     •  Textual Processing
     •  Workflow Management and Scale
     •  Sample Application
  4. Tasks
     •  Crawling
     •  Natural Language Processing / Semantic Analysis
     •  Record Linkage / Deduplication
     •  Ranking
     •  Recommendation
     •  Classification
     •  Clustering
  5. Data Flow
     Crawling → Semantic Analysis → Database → API → Clients
     Clients include:
     •  Google
     •  Kayak
     •  TY Analytics
  6. Stack
     •  Batch Layer (Hadoop cluster): Hadoop, Python, Pig*, Java*
     •  Service Layer (application machines): PostgreSQL, MongoDB, Redis, Cassandra
     Data flows from the Hadoop cluster into the service layer backing the application machines.
  7. Scrapy
     •  Build your own web crawlers
     •  Extract data via CSS selectors, XPath, regexes, etc. (see the spider sketch below)
     •  Handles queuing, request parallelism, cookies, throttling …
     •  Comprehensive and well-designed
     •  Commercial support by http://scrapinghub.com/
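A minimal spider sketch, assuming a hypothetical review page layout (the URL, selectors, and field names below are illustrative, not TrustYou's actual crawler):

    import scrapy

    class ReviewSpider(scrapy.Spider):
        name = "reviews"
        start_urls = ["https://example.com/hotel/123/reviews"]  # hypothetical page

        def parse(self, response):
            # Pull each review block out of the page with CSS selectors
            for review in response.css("div.review"):
                yield {
                    "title": review.css("h3.title::text").get(),
                    "text": review.css("p.body::text").get(),
                }
            # Follow pagination; Scrapy handles queuing, parallelism and throttling
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)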
  8. Crawling at TrustYou
     •  2-3 million new reviews/week
     •  Customers want alerts 8-24h after review publication!
     •  Smart crawl frequency & depth, but still high overhead
     •  Pools of constantly refreshed EC2 proxy IPs
     •  Direct API connections with many sites
  9. Crawling at TrustYou
     •  Custom framework very similar to Scrapy
     •  Runs on the Hadoop cluster (100 nodes)
     •  Not 100% suitable for MapReduce
     •  Nodes mostly waiting
     •  Coordination/messaging between nodes required:
        –  Distributed queue
        –  Rate limiting
  10. Text Processing
      Raw text → Sentence splitting → Tokenizing → Stopwords → Stemming
      → Topic Models / Word Vectors / Classification
      (A minimal pipeline sketch follows below.)
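A minimal sketch of such a preprocessing pipeline using NLTK (illustrative only, not TrustYou's production linguistic system; assumes the punkt and stopwords NLTK data are installed):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    def preprocess(raw_text, lang="english"):
        stemmer = SnowballStemmer(lang)
        stop = set(stopwords.words(lang))
        processed = []
        for sentence in nltk.sent_tokenize(raw_text):             # sentence splitting
            tokens = nltk.word_tokenize(sentence)                  # tokenizing
            tokens = [t.lower() for t in tokens if t.isalpha()]
            tokens = [t for t in tokens if t not in stop]          # stopword removal
            tokens = [stemmer.stem(t) for t in tokens]             # stemming
            processed.append(tokens)
        return processed

    print(preprocess("The rooms were great. The breakfast was terrible!"))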
  11. Text Processing
      •  “great rooms”          JJ NN
      •  “great hotel”          JJ NN
      •  “rooms are terrible”   NN VB JJ
      •  “hotel is terrible”    NN VB JJ

      >>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
      [('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
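As a toy illustration of how such part-of-speech patterns can surface opinion phrases (a sketch of the idea only, not the actual semantic analysis system):

    import nltk

    def opinion_phrases(text):
        # Tag the text and keep adjacent adjective + noun pairs, e.g. "great rooms"
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return ["%s %s" % (w1, w2)
                for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                if t1.startswith("JJ") and t2.startswith("NN")]

    # Likely output: ['great rooms', 'great hotel']
    print(opinion_phrases("great rooms and a great hotel"))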
  12. Semantic Analysis
      •  25+ languages
      •  Linguistic system (morphology, taggers, grammars, parsers …)
      •  Hadoop: scale out CPU
      •  ~1B opinions in the database
      •  Python for ML & NLP libraries
  13. Luigi
      •  Build complex pipelines of batch jobs
      •  Dependency resolution
      •  Parallelism
      •  Resume failed jobs
      •  Some support for Hadoop
  14. Luigi
      •  Dependency definition
      •  Hadoop / HDFS integration
      •  Object-oriented abstraction
      •  Parallelism
      •  Resume failed jobs
      •  Visualization of pipelines
      •  Command line integration
  15. Minimal Boilerplate Code

      import luigi

      class WordCount(luigi.Task):
          date = luigi.DateParameter()

          def requires(self):
              return InputText(self.date)

          def output(self):
              return luigi.LocalTarget('/tmp/%s' % self.date)

          def run(self):
              count = {}
              with self.input().open('r') as f:
                  for line in f:
                      for word in line.strip().split():
                          count[word] = count.get(word, 0) + 1
              with self.output().open('w') as out:
                  for word, n in count.items():
                      out.write("%s\t%d\n" % (word, n))
  16. (Repeats the WordCount code from slide 15, highlighting the task parameters.)
  17. (Repeats the WordCount code from slide 15, highlighting the programmatically defined dependencies.)
  18. (Repeats the WordCount code from slide 15, highlighting that each task produces an output.)
  19. (Repeats the WordCount code from slide 15, highlighting that the job logic is written in plain Python.)
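For completeness, a task like the WordCount above is typically triggered from the command line or programmatically via the Luigi scheduler; a minimal programmatic run might look like this (the date value is only an example):

    import datetime
    import luigi

    # Run WordCount (and anything it requires) with the in-process scheduler
    luigi.build([WordCount(date=datetime.date(2017, 3, 15))], local_scheduler=True)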
  20. Hadoop Streaming

      hadoop jar contrib/streaming/hadoop-*streaming*.jar \
          -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
          -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
          -input /user/hduser/text.txt -output /user/hduser/gutenberg-output
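The mapper and reducer referenced above are ordinary scripts that read lines from stdin and write tab-separated key/value pairs to stdout; a minimal word-count pair might look like this (illustrative, not the exact scripts from the deck):

    # mapper.py -- emit "word<TAB>1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    # reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
    # so equal words arrive contiguously and can be summed in one pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))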
  21. Luigi + Hadoop/HDFS

      import luigi
      import luigi.hadoop
      import luigi.hdfs

      class WordCount(luigi.hadoop.JobTask):
          date = luigi.DateParameter()

          def requires(self):
              return InputText(self.date)

          def output(self):
              return luigi.hdfs.HdfsTarget('%s' % self.date)

          def mapper(self, line):
              for word in line.strip().split():
                  yield word, 1

          def reducer(self, key, values):
              yield key, sum(values)
  22. Before
      •  Bash scripts + cron
      •  Manual cleanup
      •  Manual failure recovery
      •  Hard(er) to debug
  23. We use it for…
      •  Standalone executables (see the wrapper sketch below)
      •  Dump data from databases
      •  General Hadoop Streaming
      •  Bash scripts / MRJob
      •  Pig* scripts
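As an illustration of the "standalone executables" point, a plain luigi.Task can shell out to any external program while keeping dependency resolution and resumability; the script name and paths below are hypothetical:

    import subprocess
    import luigi

    class DumpReviewsTask(luigi.Task):
        """Hypothetical wrapper around an external dump script."""
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("/tmp/reviews-%s.csv" % self.date)

        def run(self):
            # Luigi only considers the task done once output() exists
            subprocess.check_call([
                "./dump_reviews.sh",              # hypothetical standalone executable
                "--date", str(self.date),
                "--out", self.output().path,
            ])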
  24. Snippets from Reviews
      “Hips don’t lie”
      “Maid was banging”
      “Beautiful bowl flowers”
      “Irish dance, I love that”
      “No ghost sighting”
      “One ghost touching”
      “Too much cardio, not enough squats in the gym”
      “it is like hugging a bony super model”
  25.
      import os
      import luigi
      from gensim.models.doc2vec import Doc2Vec

      class LearnModelTask(luigi.Task):
          # Parameters.... blah blah blah

          def output(self):
              return luigi.LocalTarget(os.path.join(self.output_directory, self.model_out))

          def requires(self):
              return LearnBigramsTask()

          def run(self):
              sentences = LabeledClusterIDSentence(self.input().path)
              model = Doc2Vec(sentences=sentences,
                              size=int(self.size),
                              dm=int(self.distmem),
                              negative=int(self.negative),
                              workers=int(self.workers),
                              window=int(self.window),
                              min_count=int(self.min_count),
                              train_words=True)
              model.save(self.output().path)
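Downstream tasks can then load the saved model and query it; a hypothetical follow-up step (the path and query text are illustrative) could look like:

    from gensim.models.doc2vec import Doc2Vec

    # Load the model written by LearnModelTask (path is illustrative)
    model = Doc2Vec.load("/tmp/reviews.doc2vec")

    # Infer a vector for an unseen review and find the most similar training documents
    vector = model.infer_vector("too much cardio not enough squats in the gym".split())
    print(model.docvecs.most_similar([vector], topn=5))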
  26. Takeaways
      •  It is possible to use Python as the primary language for large-scale data processing on Hadoop.
      •  It is not a perfect setup, but it works well most of the time.
      •  Keep your ecosystem open to other technologies.