PyData Berlin 2015 - Processing Hotel Reviews with Python
Video: https://www.youtube.com/watch?v=eIEsdluhoxY

Miguel Cabrera

May 30, 2015

Transcript

  1. Processing Hotel Reviews
    with Python
    Miguel Cabrera
    @mfcabrera
    & Friends
    https://www.flickr.com/photos/18694857@N00/5614701858/

  2. About Me

  3. •  Colombian
    •  Neuberliner
    •  Work for TrustYou as Data (Scientist|Engineer|Juggler)™
    •  Python around 2 years
    •  Founder and former organizer of Munich DataGeeks
    About Me

  4. Agenda

  5. •  Problem description
    •  Tools
    •  Sample Application
    Agenda

  6. TrustYou

  10. •  Crawling
    •  Natural Language Processing / Semantic Analysis
    •  Record Linkage / Deduplication
    •  Ranking
    •  Recommendation
    •  Classification
    •  Clustering
    Tasks

  11. Stack
    Batch Layer (Hadoop Cluster)
    •  Hadoop
    •  Python
    •  Pig*
    •  Java*
    Service Layer (Application Machines)
    •  PostgreSQL
    •  MongoDB
    •  Redis
    •  Cassandra

  12. 25 supported languages

  13. 500,000+ Properties

  14. 30,000,000+ daily crawled reviews

  15. Deduplicated against 250,000,000+ reviews

  16. 200,000+ daily new reviews

  17. https://www.flickr.com/photos/22646823@N08/2694765397/
    Lots of text

  18. Clean, Filter, Join and Aggregate

  19. Pipeline diagram: Crawl → Extract → Clean → Stats / ML / NLP

  20. Steps in different technologies

  21. Steps can be run in parallel

  22. Steps have complex dependencies among them

  23. •  Technology
    •  Parallel / Scale
    •  Dependency management / Orchestration
    Requirements

  24. Technology

  25. •  Numpy
    •  NLTK
    •  Scikit-Learn
    •  Pandas
    •  IPython / Jupyter
    Python

  26. Scaling

  27. Hadoop

  28. https://www.flickr.com/photos/12914838@N00/15015146343/
    Hadoop = Java?

  29. •  Hadoop Streaming
    •  MRJob
    •  Oozie
    •  Luigi
    •  …
    Python + Hadoop

  30. Hadoop
    Streaming
    cat input.txt | ./map.py | sort | ./reduce.py > output.txt
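Here `map.py` and `reduce.py` are ordinary programs that read stdin and write stdout, which is exactly what makes this local pipe simulation possible. A minimal word-count pair might look like the following sketch (illustrative logic, not the talk's actual scripts):

```python
from itertools import groupby

def mapper(lines):
    # map.py: emit "word<TAB>1" for every word on stdin.
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    # reduce.py: sum counts per word; relies on the input being
    # sorted by key, which `sort` (or Hadoop's shuffle) guarantees.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(n) for _, n in group))

# Simulating: cat input.txt | ./map.py | sort | ./reduce.py
counts = list(reducer(sorted(mapper(["to be or", "not to be"]))))
# counts == ["be\t2", "not\t1", "or\t1", "to\t2"]
```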

  31. Hadoop
    Streaming
    hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
    -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
    -input /user/hduser/text.txt -output /user/hduser/gutenberg-output

  32. Who likes to write Bash scripts?

  33. Orchestrate

  34. Luigi
    “ A python framework for data
    flow definition and execution ”

  35. Luigi
    •  Dependency definition
    •  Hadoop / HDFS Integration
    •  Object oriented abstraction
    •  Parallelism
    •  Resume failed jobs
    •  Visualization of pipelines
    •  Command line integration

  36. Minimal Boilerplate Code
    import luigi
    import six

    class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return [InputText(self.date)]

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in six.iteritems(count):
                out.write("%s\t%d\n" % (word, n))
            out.close()

  37. class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return [InputText(self.date)]

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in six.iteritems(count):
                out.write("%s\t%d\n" % (word, n))
            out.close()
    Task Parameters

  38. class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return [InputText(self.date)]

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in six.iteritems(count):
                out.write("%s\t%d\n" % (word, n))
            out.close()
    Programmatically Defined Dependencies

  39. class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return [InputText(self.date)]

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in six.iteritems(count):
                out.write("%s\t%d\n" % (word, n))
            out.close()
    Each Task produces an output

  40. class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return [InputText(self.date)]

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in six.iteritems(count):
                out.write("%s\t%d\n" % (word, n))
            out.close()
    Write Logic in Python

  41. class WordCount(luigi.hadoop.JobTask):
        date = luigi.DateParameter()

        def requires(self):
            return InputText(self.date)

        def output(self):
            return luigi.hdfs.HdfsTarget('%s' % self.date)

        def mapper(self, line):
            for word in line.strip().split():
                yield word, 1

        def reducer(self, key, values):
            yield key, sum(values)
    Luigi + Hadoop/HDFS
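Because `mapper` and `reducer` are plain generator methods, their logic can be checked locally before submitting anything to the cluster. A sketch using standalone copies of the two functions (not Luigi's own test utilities):

```python
from collections import defaultdict

# Standalone copies of the JobTask's mapper/reducer logic, so the
# word-count behaviour can be verified without Hadoop.
def mapper(line):
    for word in line.strip().split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

# Group mapper output by key, as the MapReduce shuffle phase would.
grouped = defaultdict(list)
for line in ["hello world", "hello luigi"]:
    for word, one in mapper(line):
        grouped[word].append(one)

counts = {k: v for key, vals in grouped.items() for k, v in reducer(key, vals)}
# counts == {"hello": 2, "world": 1, "luigi": 1}
```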

  42. Data Flow Visualization

  43. Luigi
    •  Minimal boilerplate code
    •  Programmatically define dependencies
    •  Integration with HDFS / Hadoop
    •  Task synchronization
    •  Can wrap anything

  44. Before
    •  Bash scripts + Cron
    •  Manual cleanup
    •  Manual failure recovery
    •  Hard(er) to debug

  45. Now
    •  Complex nested Luigi jobs graphs
    •  Automatic retries
    •  Still hard to debug

  46. We use it for…
    •  Standalone executables
    •  Dump data from databases
    •  General Hadoop Streaming
    •  Bash Scripts / MRJob
    •  Pig* Scripts

  47. You can wrap anything

  48. You can wrap anything
    Pig

  49. The right tool for the right job

  50. Pig is a high-level platform for creating
    MapReduce programs with Hadoop

  51. SQL
    SELECT f3, SUM(f2), AVG(f1) FROM relation WHERE f1 > 500 GROUP BY f3
    Pig Latin
    rel = LOAD 'relation' AS (f1: int, f2: int, f3: chararray);
    rel = FILTER rel BY f1 > 500;
    by_f3 = GROUP rel BY f3;
    result = FOREACH by_f3 GENERATE group, SUM(rel.f2), AVG(rel.f1);
    Python
    def map(r):
        if r['f1'] > 500:
            yield r['f3'], [r['f1'], r['f2']]

    def reduce(k, values):
        avg = 0
        summ = 0
        l = len(values)
        for r in values:
            summ += r[1]
            avg += r[0]
        avg = avg / float(l)
        yield k, [summ, avg]
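What the Python version leaves implicit is the grouping step that MapReduce supplies between `map` and `reduce`. A minimal in-memory driver (an illustrative sketch with made-up records, not part of the talk) makes it explicit:

```python
from itertools import groupby

def map_record(r):
    # Same logic as the map function above: filter, then key by f3.
    if r['f1'] > 500:
        yield r['f3'], [r['f1'], r['f2']]

def reduce_group(k, values):
    # SUM(f2) and AVG(f1) per group, matching the SQL query.
    summ = sum(v[1] for v in values)
    avg = sum(v[0] for v in values) / float(len(values))
    yield k, [summ, avg]

records = [
    {'f1': 600, 'f2': 10, 'f3': 'a'},
    {'f1': 700, 'f2': 20, 'f3': 'a'},
    {'f1': 100, 'f2': 99, 'f3': 'b'},  # filtered out: f1 <= 500
]

# Shuffle/sort: bring pairs with the same key together.
pairs = sorted(kv for r in records for kv in map_record(r))
result = {}
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    values = [v for _, v in group]
    for k, out in reduce_group(key, values):
        result[k] = out
# result == {'a': [30, 650.0]}
```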

  52. Pig + Python
    •  Data loading and transformation in Pig
    •  Other logic in Python
    •  Pig as a Luigi Task
    •  Pig UDFs defined in Python

  53. Sample Application

  54. Reviews are boring…

  57. Source: http://www.telegraph.co.uk/travel/hotels/11240430/TripAdvisor-the-funniest-reviews-biggest-controversies-and-best-spoofs.html

  58. Reviews highlight the individuality
    and personality of users

  59. Snippets from Reviews
    “Hips don’t lie”
    “Maid was banging”
    “Beautiful bowl flowers”
    “Irish dance, I love that”
    “No ghost sighting”
    “One ghost touching”
    “Too much cardio, not enough squats in the gym”
    “it is like hugging a bony super model”

  60. Word2Vec

  61. Group of algorithms

  62. An instance of shallow learning

  63. Feature learning model

  64. Generates real-valued vector representations
    of words

  65. “king” – “man” + “woman” = “queen”
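This arithmetic can be illustrated with tiny made-up vectors and cosine similarity (real word2vec vectors have hundreds of dimensions learned from text; the numbers below are fabricated for the sketch):

```python
import math

# Made-up 3-d "word vectors", for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "hotel": [0.5, 0.5, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# "king" - "man" + "woman", then find the nearest remaining word.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(vectors[w], target))
# nearest == "queen"
```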

  66. Word2Vec
    Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  72. Similar words are nearby vectors

  73. Word2vec offers a similarity metric
    for words

  74. Can be extended to paragraphs and
    documents

  75. A fast Python-based implementation is
    available via Gensim

  76. Hotel Reviews + Gensim + Python +
    Luigi = ?

  77. Pipeline: ExtractSentences → LearnBigrams → LearnModel →
    ExtractClusterIds → UploadEmbeddings (Pig)

  79. from gensim.models.doc2vec import Doc2Vec

    class LearnModelTask(luigi.Task):
        # Parameters.... blah blah blah

        def output(self):
            return luigi.LocalTarget(os.path.join(self.output_directory,
                                                  self.model_out))

        def requires(self):
            return LearnBigramsTask()

        def run(self):
            sentences = LabeledClusterIDSentence(self.input().path)
            model = Doc2Vec(sentences=sentences,
                            size=int(self.size),
                            dm=int(self.distmem),
                            negative=int(self.negative),
                            workers=int(self.workers),
                            window=int(self.window),
                            min_count=int(self.min_count),
                            train_words=True)
            model.save(self.output().path)

  80. Word2vec/Doc2vec offer a similarity
    metric for words

  81. Similarities are useful for non-personalized
    recommender systems

  82. Non-personalized recommenders
    recommend items based on what
    other consumers have said about the
    items.
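As a sketch of how review-derived vectors could feed such a recommender, item embeddings (e.g. per-hotel Doc2Vec vectors) can be ranked by similarity. The hotel names and numbers below are invented for illustration:

```python
import math

# Hypothetical hotel embeddings (e.g. averaged review vectors from
# Doc2Vec); all values are made up for this sketch.
hotels = {
    "Hotel Alpha": [0.9, 0.1, 0.2],
    "Hotel Beta":  [0.8, 0.2, 0.3],
    "Hotel Gamma": [0.1, 0.9, 0.7],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recommend(name, k=1):
    # "Guests who liked X also liked..." — rank other hotels by
    # vector similarity to the query hotel.
    others = [h for h in hotels if h != name]
    return sorted(others, key=lambda h: cosine(hotels[h], hotels[name]),
                  reverse=True)[:k]
```

For example, `recommend("Hotel Alpha")` returns the most similar other hotel under these toy vectors.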

  83. http://demo.trustyou.com

  84. Takeaways

  85. Takeaways
    •  It is possible to use Python as the primary language for
    large-scale data processing on Hadoop.
    •  It is not a perfect setup, but it works well most of the time.
    •  Keep your ecosystem open to other technologies.
    •  Product reviews contain much more information than just
    facts.

  86. Questions?
