PyData Berlin 2015 - Processing Hotel Reviews with Python

Video: https://www.youtube.com/watch?v=eIEsdluhoxY

Miguel Cabrera

May 30, 2015

Transcript

  1. Processing Hotel Reviews with Python Miguel Cabrera @mfcabrera & Friends

    https://www.flickr.com/photos/18694857@N00/5614701858/
  2. About Me

  3. •  Colombian •  Neuberliner •  Work for TrustYou as Data

    (Scientist|Engineer|Juggler)™ •  Python around 2 years •  Founder and former organizer of Munich DataGeeks About Me
  4. Agenda

  5. •  Problem description •  Tools •  Sample Application Agenda

  6. TrustYou

  7. (image-only slide)
  8. (image-only slide)
  9. (image-only slide)
  10. •  Crawling •  Natural Language Processing / Semantic Analysis • 

    Record Linkage / Deduplication •  Ranking •  Recommendation •  Classification •  Clustering Tasks
  11. Batch Layer •  Hadoop •  Python •  Pig* •  Java*

    Service Layer •  PostgreSQL •  MongoDB •  Redis •  Cassandra DATA DATA Hadoop Cluster Application Machines Stack
  12. 25 supported languages

  13. 500,000+ Properties

  14. 30,000,000+ daily crawled reviews

  15. Deduplicated against 250,000,000+ reviews

  16. 200,000+ daily new reviews

  17. https://www.flickr.com/photos/22646823@N08/2694765397/ Lots of text

  18. Clean, Filter, Join and Aggregate

  19. Crawl → Extract → Clean → Stats → ML / NLP (pipeline diagram)
  20. Steps in different technologies

  21. Steps can be run in parallel

  22. Steps have complex dependencies among them

  23. •  Technology •  Parallel / Scale •  Dependency management /

    Orchestration Requirements
  24. Technology

  25. •  Numpy •  NLTK •  Scikit-Learn •  Pandas •  IPython

    / Jupyter Python
  26. Scaling

  27. Hadoop  

  28. https://www.flickr.com/photos/12914838@N00/15015146343/ Hadoop = Java?

  29. •  Hadoop Streaming •  MRJob •  Oozie •  Luigi • 

    … Python + Hadoop
  30. Hadoop Streaming

    cat input.txt | ./map.py | sort | ./reduce.py > output.txt

  31. Hadoop Streaming

    hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
        -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
        -input /user/hduser/text.txt -output /user/hduser/gutenberg-output
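The deck never shows mapper.py or reducer.py themselves. A minimal word-count pair for Hadoop Streaming, assuming plain text arrives on stdin, might look like this (it works with both the local pipe simulation above and the cluster command):

    # mapper.py -- emit (word, 1) for every whitespace-separated token
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    # reducer.py -- sum the counts per word; Streaming delivers keys sorted
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))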
  32. Who likes to write Bash scripts?

  33. Orchestrate

  34. Luigi: “A Python framework for data flow definition and execution”
  35. Luigi •  Dependency definition •  Hadoop / HDFS Integration • 

    Object oriented abstraction •  Parallelism •  Resume failed jobs •  Visualization of pipelines •  Command line integration
  36. Minimal Boilerplate Code

    class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return [InputText(self.date)]

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, c in six.iteritems(count):
                out.write("%s\t%d\n" % (word, c))
            out.close()
  37.–40. (the same WordCount code, repeated with one callout per slide) Task Parameters · Programmatically Defined Dependencies · Each Task produces an output · Write Logic in Python
  41. Luigi + Hadoop/HDFS

    class WordCount(luigi.hadoop.JobTask):
        date = luigi.DateParameter()

        def requires(self):
            return InputText(self.date)

        def output(self):
            return luigi.hdfs.HdfsTarget('%s' % self.date)

        def mapper(self, line):
            for word in line.strip().split():
                yield word, 1

        def reducer(self, key, values):
            yield key, sum(values)
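Both WordCount variants depend on an InputText task that the deck never defines. A plausible stand-in is an ExternalTask that merely points at a date-stamped file, plus Luigi's standard command-line entry point; the path below is invented for illustration:

    import luigi

    class InputText(luigi.ExternalTask):
        """Marks pre-existing raw text as a pipeline input (illustrative)."""
        date = luigi.DateParameter()

        def output(self):
            # assumes a crawler already dropped the text at this path
            return luigi.LocalTarget(self.date.strftime('/data/text/%Y-%m-%d.txt'))

    if __name__ == '__main__':
        luigi.run()

With that in place, the task can be launched as, e.g., python wordcount.py WordCount --date 2015-05-30 --local-scheduler.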
  42. Data Flow Visualization

  43. Luigi •  Minimal boilerplate code •  Programmatically define dependencies •  Integration with HDFS / Hadoop •  Task synchronization •  Can wrap anything
  44. Before •  Bash scripts + Cron •  Manual cleanup • 

    Manual failure recovery •  Hard(er) to debug
  45. Now •  Complex nested Luigi job graphs •  Automatic retries •  Still hard to debug
  46. We use it for… •  Standalone executables •  Dump data

    from databases •  General Hadoop Streaming •  Bash Scripts / MRJob •  Pig* Scripts
  47. You can wrap anything
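A sketch of what "wrap anything" can mean in practice: a task that shells out to an arbitrary executable and only counts as done once its target exists. The script name and output path here are made up for illustration:

    import subprocess
    import luigi

    class LegacyReport(luigi.Task):
        """Wraps an existing shell script as one node in the dependency graph."""

        def output(self):
            return luigi.LocalTarget('/tmp/legacy_report.tsv')

        def run(self):
            # Luigi only cares that output() exists afterwards,
            # so any executable can sit behind a task
            subprocess.check_call(['./make_report.sh', self.output().path])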

  48. You can wrap anything Pig

  49. The right tool for the right job

  50. Pig is a high-level platform for creating MapReduce programs with Hadoop
  51. SQL

    SELECT f3, SUM(f2), AVG(f1)
    FROM relation
    WHERE f1 > 500
    GROUP BY f3

    Pig Latin

    rel = LOAD 'relation' AS (f1: int, f2: int, f3: chararray);
    rel = FILTER rel BY f1 > 500;
    by_f3 = GROUP rel BY f3;
    result = FOREACH by_f3 GENERATE group, SUM(rel.f2), AVG(rel.f1);

    Python

    def map(r):
        if r['f1'] > 500:
            yield r['f3'], [r['f1'], r['f2']]

    def reduce(k, values):
        summ = 0
        avg = 0
        for r in values:
            avg += r[0]
            summ += r[1]
        avg = avg / float(len(values))
        yield k, [summ, avg]
  52. Pig + Python •  Data loading and transformation in Pig

    •  Other logic in Python •  Pig as a Luigi Task •  Pig UDFs defined in Python
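Slide 52's last bullet, sketched out: a Jython UDF registered from a Pig script. The function and field names are invented; Pig makes the outputSchema decorator available to Jython UDF scripts automatically:

    # udfs.py -- a toy Jython UDF for Pig (names are illustrative)
    @outputSchema('clean_text:chararray')
    def normalize(text):
        if text is None:
            return None
        return text.strip().lower()

    -- in the Pig script
    REGISTER 'udfs.py' USING jython AS udfs;
    cleaned = FOREACH reviews GENERATE udfs.normalize(text);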
  53. Sample Application

  54. Reviews are boring…

  55. (image-only slide)
  56. (image-only slide)
  57. Source: http://www.telegraph.co.uk/travel/hotels/11240430/TripAdvisor-the-funniest-reviews-biggest-controversies-and-best-spoofs.html

  58. Reviews highlight the individuality and personality of users

  59. Snippets from Reviews “Hips don’t lie” “Maid was banging” “Beautiful

    bowl flowers” “Irish dance, I love that” “No ghost sighting” “One ghost touching” “Too much cardio, not enough squats in the gym” “it is like hugging a bony super model”
  60. Word2Vec

  61. Group of algorithms

  62. An instance of shallow learning

  63. Feature learning model

  64. Generates real-valued vector representations of words

  65. “king” – “man” + “woman” = “queen”

  66.–71. Word2Vec (figure sequence) Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  72. Similar words are nearby vectors

  73. Word2vec offers a similarity metric for words

  74. Can be extended to paragraphs and documents

  75. A fast Python-based implementation is available via Gensim
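A sketch with the 2015-era Gensim API, assuming sentences is an iterable of token lists built from review text; the toy corpus and query words are illustrative:

    from gensim.models import Word2Vec

    # toy corpus: one token list per review sentence
    sentences = [['the', 'room', 'was', 'spotless'],
                 ['the', 'room', 'was', 'clean'],
                 ['the', 'pool', 'was', 'dirty']]
    model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)

    # similar words are nearby vectors
    print(model.most_similar('clean'))

    # the analogy from slide 65: king - man + woman ≈ queen
    # (requires a real corpus, not this toy one)
    # model.most_similar(positive=['king', 'woman'], negative=['man'])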

  76. Hotel Reviews + Gensim + Python + Luigi = ?

  77. ExtractSentences LearnBigrams LearnModel ExtractClusterIds UploadEmbeddings Pig

  78. (image-only slide)
  79.

    from gensim.models.doc2vec import Doc2Vec

    class LearnModelTask(luigi.Task):
        # Parameters.... blah blah blah

        def output(self):
            return luigi.LocalTarget(os.path.join(self.output_directory,
                                                  self.model_out))

        def requires(self):
            return LearnBigramsTask()

        def run(self):
            sentences = LabeledClusterIDSentence(self.input().path)
            model = Doc2Vec(sentences=sentences,
                            size=int(self.size),
                            dm=int(self.distmem),
                            negative=int(self.negative),
                            workers=int(self.workers),
                            window=int(self.window),
                            min_count=int(self.min_count),
                            train_words=True)
            model.save(self.output().path)
  80. Word2vec/Doc2vec offer a similarity metric for words and documents

  81. Similarities are useful for non-personalized recommender systems

  82. Non-personalized recommenders recommend items based on what other consumers have

    said about the items.
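How that can look on top of the Doc2Vec model trained earlier, assuming each review was tagged with its hotel's ID during training; the tag format and model path are illustrative, using the 2015-era Gensim API:

    from gensim.models.doc2vec import Doc2Vec

    # the artifact written by LearnModelTask
    model = Doc2Vec.load('/tmp/reviews.doc2vec')

    # hotels whose reviews read most like a given hotel's reviews --
    # the core of a non-personalized "more like this" recommendation
    print(model.docvecs.most_similar('HOTEL_12345', topn=5))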
  83. http://demo.trustyou.com

  84. Takeaways

  85. Takeaways •  It is possible to use Python as the primary language for large-scale data processing on Hadoop. •  It is not a perfect setup, but it works well most of the time. •  Keep your ecosystem open to other technologies. •  Product reviews contain much more information than just facts.
  86. Questions?