Helping Travellers Make Better Hotel Choices 500 Million Times a Month

Presentation given at the Big Data Colombia meetup (now Colombia IA), based on my PyConDE 2016 talk.

Miguel Cabrera

March 15, 2017

Transcript

  1. Helping Travellers Make Better Hotel Choices 500 Million Times a

    Month Miguel Cabrera @mfcabrera https://www.flickr.com/photos/18694857@N00/5614701858/
  2. ABOUT ME

  3. •  Neuberliner •  Systems and Informatics Engineering, Universidad Nacional - Medellín •  M.Sc. in Informatics, TUM, Hons. Technology Management •  Work for TrustYou as Data (Scientist|Engineer|Juggler)™ •  Founder and former organizer of Munich DataGeeks ABOUT ME
  4. TODAY

  5. •  What we do •  Architecture •  Technology •  Crawling

    •  Textual Processing •  Workflow Management and Scale •  Sample Application AGENDA
  6. WHAT WE DO

  7. For every hotel on the planet, provide a summary of

    traveler reviews.
  8. None
  9. None
  10. None
  11. None
  12. None
  13. None
  14. None
  15. •  Crawling •  Natural Language Processing / Semantic Analysis • 

    Record Linkage / Deduplication •  Ranking •  Recommendation •  Classification •  Clustering Tasks
  16. ARCHITECTURE

  17. Data Flow: Crawling → Semantic Analysis → Database → API → Clients •  Google •  Kayak+ •  TY Analytics
  18. Stack: Batch Layer (Hadoop Cluster) •  Hadoop •  Python •  Pig* •  Java* / Service Layer (Application Machines) •  PostgreSQL •  MongoDB •  Redis •  Cassandra
  19. SOME NUMBERS

  20. 25 supported languages

  21. 500,000+ Properties

  22. 30,000,000+ daily crawled reviews

  23. Deduplicated against 250,000,000+ reviews

  24. 300,000+ daily new reviews

  25. https://www.flickr.com/photos/22646823@N08/2694765397/ Lots of text

  26. TECHNOLOGY

  27. •  Numpy •  NLTK •  Scikit-Learn •  Pandas •  IPython

    / Jupyter •  Scrapy Python
  28. •  Hadoop Streaming •  MRJob •  Oozie •  Luigi • 

    … Python + Hadoop
  29. Crawling

  30. Crawling

  31. None
  32. None
  33. None
  34. None
  35. •  Build your own web crawlers •  Extract data via

    CSS selectors, XPath, regexes, etc. •  Handles queuing, request parallelism, cookies, throttling … •  Comprehensive and well-designed •  Commercial support by http://scrapinghub.com/
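Scrapy supplies the crawling machinery; the extraction step itself is just pulling text out of markup with a selector. As a purely illustrative, stdlib-only stand-in for a CSS selector like `span.review::text` (the class name `review` and the helper names below are made up), the same idea looks like this:

```python
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Collect the text of every <span class="review"> element.

    A stdlib stand-in for what a Scrapy spider does with a CSS
    selector such as 'span.review::text'.
    """

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # Only start collecting inside a span carrying class="review".
        if tag == "span" and ("class", "review") in attrs:
            self.in_review = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_review = False

    def handle_data(self, data):
        if self.in_review:
            self.reviews.append(data.strip())

def extract_reviews(html):
    parser = ReviewExtractor()
    parser.feed(html)
    return [r for r in parser.reviews if r]
```

In a real spider Scrapy would also handle the queuing, parallelism, and throttling listed above; this sketch covers only the selector step.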
  36. None
  37. None
  38. None
  39. None
  40. •  2 - 3 million new reviews/week •  Customers want

    alerts 8 - 24h after review publication! •  Smart crawl frequency & depth, but still high overhead •  Pools of constantly refreshed EC2 proxy IPs •  Direct API connections with many sites Crawling at TrustYou
  41. •  Custom framework very similar to scrapy •  Runs on

    Hadoop cluster (100 nodes) •  Not 100% suitable for MapReduce •  Nodes mostly waiting •  Coordination/messaging between nodes required: –  Distributed queue –  Rate Limiting Crawling at TrustYou
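Rate limiting is one of the coordination problems named above. As a single-node illustration (class name and parameters are hypothetical; the production setup described here would keep this state in a shared store such as a distributed queue), a token-bucket limiter can be sketched in a few lines:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: roughly `rate` requests per
    second, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens refilled per second
        self.capacity = float(capacity)  # maximum bucket size
        self.tokens = float(capacity)    # start with a full bucket
        self.clock = clock               # injectable for testing
        self.last = clock()

    def allow(self):
        """Return True if a request may go out now, consuming one token."""
        now = self.clock()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A crawler node would call `allow()` before each request to a host and back off when it returns False.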
  42. Text Processing

  43. Text Processing: Raw text → Sentence splitting → Tokenizing → Stopwords → Stemming → Topic Models / Word Vectors / Classification
  44. Text Processing

  45. Text Processing •  “great rooms” (JJ NN) •  “great hotel” (JJ NN) •  “rooms are terrible” (NN VB JJ) •  “hotel is terrible” (NN VB JJ)

      >>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
      [('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
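Once tokens are tagged, the JJ NN and NN VB JJ patterns above can be matched with plain Python. This sketch hard-codes the tagged output shown on the slide instead of calling NLTK, and the `extract_opinions` helper is purely illustrative:

```python
def extract_opinions(tagged):
    """Pull (adjective, noun) opinion pairs out of a POS-tagged sentence.

    Matches the two patterns from the slide:
      JJ NN     -> "great rooms"
      NN VB JJ  -> "hotel is terrible"
    """
    pairs = []
    # Pattern 1: adjective directly before a noun ("great rooms").
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        if t1.startswith('JJ') and t2.startswith('NN'):
            pairs.append((w1, w2))
    # Pattern 2: noun, verb, adjective ("hotel is terrible").
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i], tagged[i + 1], tagged[i + 2]
        if t1.startswith('NN') and t2.startswith('VB') and t3.startswith('JJ'):
            pairs.append((w3, w1))
    return pairs

# Tagged output as shown on the slide:
tagged = [('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
```

`extract_opinions(tagged)` yields `[('terrible', 'hotel')]`; the real system uses a full linguistic pipeline (morphology, grammars, parsers) rather than two fixed patterns.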
  46. •  25+ languages •  Linguistic system (morphology, taggers, grammars, parsers

    …) •  Hadoop: Scale out CPU •  ~1B opinions in the database •  Python for ML & NLP libraries Semantic Analysis
  47. Word2Vec/Doc2Vec

  48. Group of algorithms

  49. An instance of shallow learning

  50. Feature learning model

  51. Generates real-valued vector representations of words

  52. “king” – “man” + “woman” = “queen”

  53. Word2Vec Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  54.–58. Word2Vec (build-up slides, same source)

  59. Similar words/documents are nearby vectors

  60. Word2Vec offers a similarity metric for words

  61. Can be extended to paragraphs and documents

  62. A fast Python-based implementation is available via Gensim

  63. None
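The “king” – “man” + “woman” = “queen” analogy and the "nearby vectors" idea can be made concrete with cosine similarity over hand-made vectors. Real Word2Vec embeddings have hundreds of learned dimensions; the 2-d vectors below are fabricated purely so the arithmetic is visible:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d "embeddings": axis 0 = royalty, axis 1 = gender (made up).
vectors = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

def analogy(a, b, c):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = tuple(x - y + z
                   for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))
```

With Gensim the equivalent query is a `most_similar(positive=..., negative=...)` call on a trained model; the geometry is the same.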
  64. Workflow Management and Scale

  65. Crawl → Extract → Clean → Stats / ML / NLP
  66. Luigi: “A Python framework for data flow definition and execution”
  67. Luigi •  Build complex pipelines of batch jobs •  Dependency

    resolution •  Parallelism •  Resume failed jobs •  Some support for Hadoop
  68. Luigi

  69. Luigi •  Dependency definition •  Hadoop / HDFS Integration • 

    Object oriented abstraction •  Parallelism •  Resume failed jobs •  Visualization of pipelines •  Command line integration
  70. Minimal Boilerplate Code

      class WordCount(luigi.Task):
          date = luigi.DateParameter()

          def requires(self):
              return [InputText(self.date)]

          def output(self):
              return luigi.LocalTarget('/tmp/%s' % self.date)

          def run(self):
              count = {}
              for f in self.input():
                  for line in f.open('r'):
                      for word in line.strip().split():
                          count[word] = count.get(word, 0) + 1
              f = self.output().open('w')
              for word, count in six.iteritems(count):
                  f.write("%s\t%d\n" % (word, count))
              f.close()
  71.–74. (Same code, with different parts highlighted:) Task Parameters · Programmatically Defined Dependencies · Each Task produces an output · Write Logic in Python
  75. Hadoop

  76. None
  77. https://www.flickr.com/photos/12914838@N00/15015146343/ Hadoop = Java?

  78. Hadoop Streaming cat input.txt | ./map.py | sort | ./reduce.py

    > output.txt
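The `cat | map | sort | reduce` pipeline above is just two Python scripts reading stdin and writing stdout. A minimal word-count mapper and reducer, written as functions over iterables so the same logic runs on stdin or in a test, might look like this (a sketch, not the TrustYou code):

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated '<word>\t1' record per word."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reducer(records):
    """Sum counts per word; assumes records arrive sorted by key,
    which is what Hadoop's shuffle/sort phase guarantees."""
    keyed = (r.split("\t") for r in records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(c) for _, c in group))

# Locally this wires up as: reducer(sorted(mapper(sys.stdin)))
# In a real job, mapper and reducer run as separate scripts and
# Hadoop performs the sort between them.
```

The shell pipeline on the slide is exactly this wiring, which is why streaming jobs are easy to test locally before shipping them to the cluster.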
  79. Hadoop Streaming

      hadoop jar contrib/streaming/hadoop-*streaming*.jar \
          -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
          -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
          -input /user/hduser/text.txt -output /user/hduser/gutenberg-output
  80. Luigi + Hadoop/HDFS

      class WordCount(luigi.hadoop.JobTask):
          date = luigi.DateParameter()

          def requires(self):
              return InputText(self.date)

          def output(self):
              return luigi.hdfs.HdfsTarget('%s' % self.date)

          def mapper(self, line):
              for word in line.strip().split():
                  yield word, 1

          def reducer(self, key, values):
              yield key, sum(values)
  81. Go and learn:

  82. Data Flow Visualization

  83. Data Flow Visualization

  84. Before •  Bash scripts + Cron •  Manual cleanup • 

    Manual failure recovery •  Hard(er) to debug
  85. Now •  Complex nested Luigi job graphs •  Automatic retries •  Still hard to debug
  86. We use it for… •  Standalone executables •  Dump data

    from databases •  General Hadoop Streaming •  Bash Scripts / MRJob •  Pig* Scripts
  87. You can wrap anything
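"Wrapping anything" in Luigi usually means a task whose run() shells out to an external command and whose output() is the file that command produces. Stripped of Luigi itself, the core pattern is a subprocess call plus an atomic rename, so a crashed command never leaves a partial file that looks complete (the helper name `run_external` is made up for this sketch):

```python
import os
import subprocess

def run_external(cmd, output_path):
    """Run an external command, writing its stdout to output_path atomically.

    Writes to a temporary file first and renames on success, so a failed
    command never produces a file that a downstream task would mistake
    for a finished target -- the same contract a Luigi Target gives you.
    """
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as tmp:
        subprocess.run(cmd, stdout=tmp, check=True)  # raises on failure
    os.replace(tmp_path, output_path)  # atomic rename on POSIX
```

Inside a Luigi task, run() would call something like this and output() would return a Target pointing at `output_path`, which is how standalone executables, bash scripts, and Pig scripts all end up in the same dependency graph.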

  88. Sample Application

  89. Reviews are boring…

  90. None
  91. None
  92. Source: http://www.telegraph.co.uk/travel/hotels/11240430/TripAdvisor-the-funniest-reviews-biggest-controversies-and-best-spoofs.html

  93. Reviews highlight the individuality and personality of users

  94. Snippets from Reviews “Hips don’t lie” “Maid was banging” “Beautiful

    bowl flowers” “Irish dance, I love that” “No ghost sighting” “One ghost touching” “Too much cardio, not enough squats in the gym” “it is like hugging a bony super model”
  95. Hotel Reviews + Gensim + Python + Luigi = ?

  96. ExtractSentences → LearnBigrams → LearnModel → ExtractClusterIds → UploadEmbeddings (Pig)

  97. None
  98. from gensim.models.doc2vec import Doc2Vec

      class LearnModelTask(luigi.Task):
          # Parameters.... blah blah blah

          def output(self):
              return luigi.LocalTarget(os.path.join(self.output_directory,
                                                    self.model_out))

          def requires(self):
              return LearnBigramsTask()

          def run(self):
              sentences = LabeledClusterIDSentence(self.input().path)
              model = Doc2Vec(sentences=sentences,
                              size=int(self.size),
                              dm=int(self.distmem),
                              negative=int(self.negative),
                              workers=int(self.workers),
                              window=int(self.window),
                              min_count=int(self.min_count),
                              train_words=True)
              model.save(self.output().path)
  99. Word2Vec/Doc2Vec offer a similarity metric for words and documents

  100. Similarities are useful for non- personalized recommender systems

  101. Non-personalized recommenders recommend items based on what other consumers have

    said about the items.
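With document vectors in hand, a non-personalized "more hotels like this" list is just a ranking of every other item by similarity to the current one. Gensim exposes this as a `most_similar` call on a trained model; the ranking logic itself, over pre-computed vectors, reduces to a few lines (the hotel names and 2-d vectors below are fabricated toy data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(item, item_vectors, topn=2):
    """Rank all other items by cosine similarity to `item`."""
    query = item_vectors[item]
    others = [(cosine(query, vec), other)
              for other, vec in item_vectors.items() if other != item]
    return [other for _, other in sorted(others, reverse=True)[:topn]]

# Toy review-text embeddings (made up): beach resorts cluster together.
hotels = {
    "Beach Resort A": (0.9, 0.1),
    "Beach Resort B": (0.8, 0.2),
    "Airport Inn":    (0.1, 0.9),
}
```

Because the vectors come from what reviewers wrote, two hotels end up close together when guests describe them similarly, which is exactly the non-personalized signal described above.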
  102. http://demo.trustyou.com

  103. Takeaways

  104. Takeaways •  It is possible to use Python as the primary language for large-scale data processing on Hadoop. •  It is not a perfect setup, but it works well most of the time. •  Keep your ecosystem open to other technologies.
  105. We are hiring miguel.cabrera@trustyou.net

  107. Questions?