
Helping Travellers Make Better Hotel Choices 500 Million Times a Month

Presentation given at the Big Data Colombia meetup (now Colombia IA), based on my PyCon DE 2016 talk.

Miguel Cabrera

March 15, 2017

Transcript

  1. Helping Travellers Make Better Hotel Choices 500 Million Times a Month
     Miguel Cabrera (@mfcabrera)
     https://www.flickr.com/photos/18694857@N00/5614701858/
  2. ABOUT ME
     •  Neuberliner
     •  Systems and Informatics Engineering, Universidad Nacional - Medellín
     •  M.Sc. in Informatics, TUM; Hons. Technology Management
     •  Work for TrustYou as Data (Scientist|Engineer|Juggler)™
     •  Founder and former organizer of Munich DataGeeks
  3. AGENDA
     •  What we do
     •  Architecture
     •  Technology
     •  Crawling
     •  Textual Processing
     •  Workflow Management and Scale
     •  Sample Application
  4. Tasks
     •  Crawling
     •  Natural Language Processing / Semantic Analysis
     •  Record Linkage / Deduplication
     •  Ranking
     •  Recommendation
     •  Classification
     •  Clustering
  5. Data Flow
     Crawling → Semantic Analysis → Database → API → Clients
     Clients include:
     •  Google
     •  Kayak
     •  TY Analytics
  6. Stack
     •  Batch Layer (Hadoop cluster): Hadoop, Python, Pig*, Java*
     •  Service Layer (application machines): PostgreSQL, MongoDB, Redis, Cassandra
     Data flows from the Hadoop cluster into the service layer backing the application machines.
  7. Scrapy
     •  Build your own web crawlers
     •  Extract data via CSS selectors, XPath, regexes, etc. (see the spider sketch below)
     •  Handles queuing, request parallelism, cookies, throttling …
     •  Comprehensive and well-designed
     •  Commercial support by http://scrapinghub.com/
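A minimal spider sketch, assuming a hypothetical review page layout (the URL, selectors, and field names below are illustrative, not TrustYou's actual crawler):

    import scrapy

    class ReviewSpider(scrapy.Spider):
        name = "reviews"
        start_urls = ["https://example.com/hotel/123/reviews"]  # hypothetical page

        def parse(self, response):
            # Pull each review block out of the page with CSS selectors
            for review in response.css("div.review"):
                yield {
                    "title": review.css("h3.title::text").get(),
                    "text": review.css("p.body::text").get(),
                }
            # Follow pagination; Scrapy handles queuing, parallelism and throttling
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)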
  8. Crawling at TrustYou
     •  2-3 million new reviews/week
     •  Customers want alerts 8-24h after review publication!
     •  Smart crawl frequency & depth, but still high overhead
     •  Pools of constantly refreshed EC2 proxy IPs
     •  Direct API connections with many sites
  9. Crawling at TrustYou
     •  Custom framework very similar to Scrapy
     •  Runs on the Hadoop cluster (100 nodes)
     •  Not 100% suitable for MapReduce
     •  Nodes mostly waiting
     •  Coordination/messaging between nodes required:
        –  Distributed queue
        –  Rate limiting
  10. Text Processing
      Raw text → Sentence splitting → Tokenizing → Stopwords → Stemming
      → Topic Models / Word Vectors / Classification
      (A minimal pipeline sketch follows below.)
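A minimal sketch of such a preprocessing pipeline using NLTK (illustrative only, not TrustYou's production linguistic system; assumes the punkt and stopwords NLTK data are installed):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    def preprocess(raw_text, lang="english"):
        stemmer = SnowballStemmer(lang)
        stop = set(stopwords.words(lang))
        processed = []
        for sentence in nltk.sent_tokenize(raw_text):             # sentence splitting
            tokens = nltk.word_tokenize(sentence)                  # tokenizing
            tokens = [t.lower() for t in tokens if t.isalpha()]
            tokens = [t for t in tokens if t not in stop]          # stopword removal
            tokens = [stemmer.stem(t) for t in tokens]             # stemming
            processed.append(tokens)
        return processed

    print(preprocess("The rooms were great. The breakfast was terrible!"))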
  11. Text Processing
      •  “great rooms”          JJ NN
      •  “great hotel”          JJ NN
      •  “rooms are terrible”   NN VB JJ
      •  “hotel is terrible”    NN VB JJ

      >>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
      [('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
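As a toy illustration of how such part-of-speech patterns can surface opinion phrases (a sketch of the idea only, not the actual semantic analysis system):

    import nltk

    def opinion_phrases(text):
        # Tag the text and keep adjacent adjective + noun pairs, e.g. "great rooms"
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return ["%s %s" % (w1, w2)
                for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                if t1.startswith("JJ") and t2.startswith("NN")]

    # Likely output: ['great rooms', 'great hotel']
    print(opinion_phrases("great rooms and a great hotel"))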
  12. Semantic Analysis
      •  25+ languages
      •  Linguistic system (morphology, taggers, grammars, parsers …)
      •  Hadoop: scale out CPU
      •  ~1B opinions in the database
      •  Python for ML & NLP libraries
  13. Luigi
      •  Build complex pipelines of batch jobs
      •  Dependency resolution
      •  Parallelism
      •  Resume failed jobs
      •  Some support for Hadoop
  14. Luigi
      •  Dependency definition
      •  Hadoop / HDFS integration
      •  Object-oriented abstraction
      •  Parallelism
      •  Resume failed jobs
      •  Visualization of pipelines
      •  Command line integration
  15. Minimal Boilerplate Code

      import luigi

      class WordCount(luigi.Task):
          date = luigi.DateParameter()

          def requires(self):
              return InputText(self.date)

          def output(self):
              return luigi.LocalTarget('/tmp/%s' % self.date)

          def run(self):
              count = {}
              with self.input().open('r') as f:
                  for line in f:
                      for word in line.strip().split():
                          count[word] = count.get(word, 0) + 1
              with self.output().open('w') as out:
                  for word, n in count.items():
                      out.write("%s\t%d\n" % (word, n))
  16. (Repeats the WordCount code from slide 15, highlighting the task parameters.)
  17. (Repeats the WordCount code from slide 15, highlighting the programmatically defined dependencies.)
  18. (Repeats the WordCount code from slide 15, highlighting that each task produces an output.)
  19. (Repeats the WordCount code from slide 15, highlighting that the job logic is written in plain Python.)
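For completeness, a task like the WordCount above is typically triggered from the command line or programmatically via the Luigi scheduler; a minimal programmatic run might look like this (the date value is only an example):

    import datetime
    import luigi

    # Run WordCount (and anything it requires) with the in-process scheduler
    luigi.build([WordCount(date=datetime.date(2017, 3, 15))], local_scheduler=True)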
  20. Hadoop Streaming

      hadoop jar contrib/streaming/hadoop-*streaming*.jar \
          -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
          -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
          -input /user/hduser/text.txt -output /user/hduser/gutenberg-output
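The mapper and reducer referenced above are ordinary scripts that read lines from stdin and write tab-separated key/value pairs to stdout; a minimal word-count pair might look like this (illustrative, not the exact scripts from the deck):

    # mapper.py -- emit "word<TAB>1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    # reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
    # so equal words arrive contiguously and can be summed in one pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))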
  21. Luigi + Hadoop/HDFS

      import luigi
      import luigi.hadoop
      import luigi.hdfs

      class WordCount(luigi.hadoop.JobTask):
          date = luigi.DateParameter()

          def requires(self):
              return InputText(self.date)

          def output(self):
              return luigi.hdfs.HdfsTarget('%s' % self.date)

          def mapper(self, line):
              for word in line.strip().split():
                  yield word, 1

          def reducer(self, key, values):
              yield key, sum(values)
  22. Before
      •  Bash scripts + cron
      •  Manual cleanup
      •  Manual failure recovery
      •  Hard(er) to debug
  23. We use it for…
      •  Standalone executables (see the wrapper sketch below)
      •  Dump data from databases
      •  General Hadoop Streaming
      •  Bash scripts / MRJob
      •  Pig* scripts
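As an illustration of the "standalone executables" point, a plain luigi.Task can shell out to any external program while keeping dependency resolution and resumability; the script name and paths below are hypothetical:

    import subprocess
    import luigi

    class DumpReviewsTask(luigi.Task):
        """Hypothetical wrapper around an external dump script."""
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("/tmp/reviews-%s.csv" % self.date)

        def run(self):
            # Luigi only considers the task done once output() exists
            subprocess.check_call([
                "./dump_reviews.sh",              # hypothetical standalone executable
                "--date", str(self.date),
                "--out", self.output().path,
            ])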
  24. Snippets from Reviews
      “Hips don’t lie”
      “Maid was banging”
      “Beautiful bowl flowers”
      “Irish dance, I love that”
      “No ghost sighting”
      “One ghost touching”
      “Too much cardio, not enough squats in the gym”
      “it is like hugging a bony super model”
  25.
      import os
      import luigi
      from gensim.models.doc2vec import Doc2Vec

      class LearnModelTask(luigi.Task):
          # Parameters.... blah blah blah

          def output(self):
              return luigi.LocalTarget(os.path.join(self.output_directory, self.model_out))

          def requires(self):
              return LearnBigramsTask()

          def run(self):
              sentences = LabeledClusterIDSentence(self.input().path)
              model = Doc2Vec(sentences=sentences,
                              size=int(self.size),
                              dm=int(self.distmem),
                              negative=int(self.negative),
                              workers=int(self.workers),
                              window=int(self.window),
                              min_count=int(self.min_count),
                              train_words=True)
              model.save(self.output().path)
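Downstream tasks can then load the saved model and query it; a hypothetical follow-up step (the path and query text are illustrative) could look like:

    from gensim.models.doc2vec import Doc2Vec

    # Load the model written by LearnModelTask (path is illustrative)
    model = Doc2Vec.load("/tmp/reviews.doc2vec")

    # Infer a vector for an unseen review and find the most similar training documents
    vector = model.infer_vector("too much cardio not enough squats in the gym".split())
    print(model.docvecs.most_similar([vector], topn=5))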
  26. Takeaways
      •  It is possible to use Python as the primary language for large-scale data processing on Hadoop.
      •  It is not a perfect setup, but it works well most of the time.
      •  Keep your ecosystem open to other technologies.