Helping Travellers Make Better Hotel Choices 500 Million Times a Month

Presentation given at the Big Data Colombia meetup (now Colombia IA), based on my PyConDE 2016 talk.

Miguel Cabrera

March 15, 2017
Transcript

  1. Helping Travellers Make
    Better Hotel Choices
    500 Million Times a Month
    Miguel Cabrera
    @mfcabrera
    https://www.flickr.com/photos/18694857@N00/5614701858/

  2. •  Neuberliner
    •  Ing. Sistemas e Inf. Universidad Nacional - Med
    •  M.Sc. in Informatics, TUM; Hons. Technology
    Management.
    •  Work for TrustYou as Data (Scientist|Engineer|
    Juggler)™
    •  Founder and former organizer of Munich DataGeeks
    ABOUT ME

  3. •  What we do
    •  Architecture
    •  Technology
    •  Crawling
    •  Textual Processing
    •  Workflow Management and Scale
    •  Sample Application
    AGENDA

  4. For every hotel on the planet, provide
    a summary of traveler reviews.

  5. •  Crawling
    •  Natural Language Processing / Semantic
    Analysis
    •  Record Linkage / Deduplication
    •  Ranking
    •  Recommendation
    •  Classification
    •  Clustering
    Tasks

  6. ARCHITECTURE

  7. Data Flow
    Crawling → Semantic Analysis → Database → API → Clients
    •  Google
    •  Kayak+
    •  TY Analytics

  8. Batch Layer (Hadoop Cluster):
    •  Hadoop
    •  Python
    •  Pig*
    •  Java*
    Service Layer (Application Machines):
    •  PostgreSQL
    •  MongoDB
    •  Redis
    •  Cassandra
    Stack

  9. SOME NUMBERS

  10. 25 supported languages

  11. 500,000+ Properties

  12. 30,000,000+ daily crawled
    reviews

  13. Deduplicated against 250,000,000+
    reviews

  14. 300,000+ daily new reviews

  15. https://www.flickr.com/photos/22646823@N08/2694765397/
    Lots of text

  16. •  Numpy
    •  NLTK
    •  Scikit-Learn
    •  Pandas
    •  IPython / Jupyter
    •  Scrapy
    Python

  17. •  Hadoop Streaming
    •  MRJob
    •  Oozie
    •  Luigi
    •  …
    Python + Hadoop

  18. •  Build your own web crawlers
    •  Extract data via CSS selectors, XPath,
    regexes, etc.
    •  Handles queuing, request parallelism,
    cookies, throttling …
    •  Comprehensive and well-designed
    •  Commercial support by
    http://scrapinghub.com/

  19. •  2 - 3 million new reviews/week
    •  Customers want alerts 8 - 24h after review
    publication!
    •  Smart crawl frequency & depth, but still high
    overhead
    •  Pools of constantly refreshed EC2 proxy IPs
    •  Direct API connections with many sites
    Crawling at TrustYou

  20. •  Custom framework very similar to Scrapy
    •  Runs on Hadoop cluster (100 nodes)
    •  Not 100% suitable for MapReduce
    •  Nodes mostly waiting
    •  Coordination/messaging between nodes
    required:
    –  Distributed queue
    –  Rate Limiting
    Crawling at TrustYou

  21. Text Processing

  22. Text Processing
    Raw text → Sentence splitting → Tokenizing → Stopwords → Stemming
    → Topic Models / Word Vectors / Classification
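    The steps on this slide can be sketched in plain Python. The stopword list and the crude suffix-stripping "stemmer" below are illustrative stand-ins, not the linguistic system used in production:

    ```python
    import re

    STOPWORDS = {"the", "is", "are", "a", "an", "and"}  # toy list for illustration

    def split_sentences(text):
        # Naive sentence splitting on terminal punctuation.
        return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

    def tokenize(sentence):
        # Lowercase word tokenizer.
        return re.findall(r"[a-z']+", sentence.lower())

    def remove_stopwords(tokens):
        return [t for t in tokens if t not in STOPWORDS]

    def stem(token):
        # Extremely crude suffix stripping, standing in for a real stemmer.
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(text):
        # Full pipeline: sentences -> tokens -> stopword removal -> stems.
        return [[stem(t) for t in remove_stopwords(tokenize(s))]
                for s in split_sentences(text)]

    print(preprocess("The rooms are great. The maid was banging!"))
    # → [['room', 'great'], ['maid', 'was', 'bang']]
    ```

    The output of a pipeline like this is what downstream topic models, word vectors, and classifiers consume.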

  23. Text Processing

  24. •  “great rooms”
    •  “great hotel”
    •  “rooms are terrible”
    •  “hotel is terrible”
    Text Processing
    JJ NN
    JJ NN
    NN VB JJ
    NN VB JJ

    >>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))

    [('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]

  25. •  25+ languages
    •  Linguistic system (morphology, taggers,
    grammars, parsers …)
    •  Hadoop: Scale out CPU
    •  ~1B opinions in the database
    •  Python for ML & NLP libraries
    Semantic Analysis

  26. Word2Vec/Doc2Vec

  27. Group of algorithms

  28. An instance of shallow learning

  29. Feature learning model

  30. Generates real-valued vector
    representations of words

  31. “king” – “man” + “woman” = “queen”
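    With hand-made toy vectors (purely illustrative; real word2vec vectors are learned from data and typically have 100-300 dimensions), the analogy amounts to a nearest-neighbour search around the shifted vector:

    ```python
    import math

    # Hand-crafted toy vectors; the dimensions loosely encode (royalty, gender, ...).
    vectors = {
        "king":  [0.9, 0.9, 0.1, 0.2],
        "queen": [0.9, 0.1, 0.1, 0.2],
        "man":   [0.1, 0.9, 0.3, 0.1],
        "woman": [0.1, 0.1, 0.3, 0.1],
    }

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    def analogy(a, b, c):
        # Answer "a is to b as c is to ?": nearest vector to (b - a + c),
        # excluding the three query words themselves.
        target = [vb - va + vc
                  for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
        candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
        return max(candidates, key=lambda w: cosine(target, candidates[w]))

    print(analogy("man", "king", "woman"))  # → queen
    ```

    Gensim exposes the same operation on trained models as `model.most_similar(positive=["king", "woman"], negative=["man"])`.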

  32. Word2Vec
    Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  33. Word2Vec
    Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  34. Word2Vec
    Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  35. Word2Vec
    Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  36. Word2Vec
    Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  37. Word2Vec
    Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

  38. Similar words/documents are nearby
    vectors

  39. Word2vec offers a similarity metric
    for words

  40. Can be extended to paragraphs and
    documents

  41. A fast Python-based implementation is
    available via Gensim

  42. Workflow Management and Scale

  43. Crawl  
    Extract  
    Clean  
    Stats  
    ML  
    ML  
    NLP  

  44. Luigi
    “A Python framework for data
    flow definition and execution”

  45. Luigi
    •  Build complex pipelines of
    batch jobs
    •  Dependency resolution
    •  Parallelism
    •  Resume failed jobs
    •  Some support for Hadoop

  46. Luigi
    •  Dependency definition
    •  Hadoop / HDFS Integration
    •  Object oriented abstraction
    •  Parallelism
    •  Resume failed jobs
    •  Visualization of pipelines
    •  Command line integration

  47. Minimal Boilerplate Code
    class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return InputText(self.date)

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in count.items():
                out.write("%s\t%d\n" % (word, n))
            out.close()

  48. class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return InputText(self.date)

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in count.items():
                out.write("%s\t%d\n" % (word, n))
            out.close()
    Task Parameters

  49. class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return InputText(self.date)

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in count.items():
                out.write("%s\t%d\n" % (word, n))
            out.close()
    Programmatically Defined Dependencies

  50. class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return InputText(self.date)

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in count.items():
                out.write("%s\t%d\n" % (word, n))
            out.close()
    Each Task produces an output

  51. class WordCount(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return InputText(self.date)

        def output(self):
            return luigi.LocalTarget('/tmp/%s' % self.date)

        def run(self):
            count = {}
            for f in self.input():
                for line in f.open('r'):
                    for word in line.strip().split():
                        count[word] = count.get(word, 0) + 1
            out = self.output().open('w')
            for word, n in count.items():
                out.write("%s\t%d\n" % (word, n))
            out.close()
    Write Logic in Python

  52. https://www.flickr.com/photos/12914838@N00/15015146343/
    Hadoop = Java?

  53. Hadoop
    Streaming
    cat input.txt | ./map.py | sort | ./reduce.py > output.txt
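    The `map.py` and `reduce.py` in that pipeline are plain scripts emitting tab-separated key/value lines; a minimal word-count pair might look like this sketch (in a real job each function would be wrapped in a loop over `sys.stdin` and print its results):

    ```python
    import itertools

    def map_lines(lines):
        # Mapper: emit a "word\t1" line for every word in the input.
        for line in lines:
            for word in line.strip().split():
                yield "%s\t1" % word

    def reduce_lines(sorted_lines):
        # Reducer: input arrives sorted by key (the `sort` step),
        # so all lines for the same word are adjacent.
        pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
        for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
            yield "%s\t%d" % (word, sum(int(count) for _, count in group))

    # Simulate `cat input.txt | ./map.py | sort | ./reduce.py` in memory:
    mapped = sorted(map_lines(["great hotel", "great rooms"]))
    print(list(reduce_lines(mapped)))  # → ['great\t2', 'hotel\t1', 'rooms\t1']
    ```

    The shell `sort` between the two scripts is what groups equal keys, which is why the reducer can rely on `itertools.groupby`.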

  54. Hadoop
    Streaming
    hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
    -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
    -input /user/hduser/text.txt -output /user/hduser/gutenberg-output

  55. class WordCount(luigi.hadoop.JobTask):
        date = luigi.DateParameter()

        def requires(self):
            return InputText(self.date)

        def output(self):
            return luigi.hdfs.HdfsTarget('%s' % self.date)

        def mapper(self, line):
            for word in line.strip().split():
                yield word, 1

        def reducer(self, key, values):
            yield key, sum(values)
    Luigi + Hadoop/HDFS

  56. Go and learn:

  57. Data Flow Visualization

  58. Data Flow Visualization

  59. Before
    •  Bash scripts + Cron
    •  Manual cleanup
    •  Manual failure recovery
    •  Hard(er) to debug

  60. Now
    •  Complex nested Luigi jobs graphs
    •  Automatic retries
    •  Still hard to debug

  61. We use it for…
    •  Standalone executables
    •  Dump data from databases
    •  General Hadoop Streaming
    •  Bash Scripts / MRJob
    •  Pig* Scripts

  62. You can wrap anything

  63. Sample Application

  64. Reviews are boring…

  65. Source: http://www.telegraph.co.uk/travel/hotels/11240430/TripAdvisor-the-funniest-reviews-biggest-controversies-and-best-spoofs.html

  66. Reviews highlight the individuality
    and personality of users

  67. Snippets from Reviews
    “Hips don’t lie”
    “Maid was banging”
    “Beautiful bowl flowers”
    “Irish dance, I love that”
    “No ghost sighting”
    “One ghost touching”
    “Too much cardio, not enough squats in the gym”
    “it is like hugging a bony super model”

  68. Hotel Reviews + Gensim + Python +
    Luigi = ?

  69. ExtractSentences
    LearnBigrams
    LearnModel
    ExtractClusterIds
    UploadEmbeddings
    Pig

  70. from gensim.models.doc2vec import Doc2Vec

    class LearnModelTask(luigi.Task):
        # Parameters.... blah blah blah

        def output(self):
            return luigi.LocalTarget(os.path.join(self.output_directory,
                                                  self.model_out))

        def requires(self):
            return LearnBigramsTask()

        def run(self):
            sentences = LabeledClusterIDSentence(self.input().path)
            model = Doc2Vec(sentences=sentences,
                            size=int(self.size),
                            dm=int(self.distmem),
                            negative=int(self.negative),
                            workers=int(self.workers),
                            window=int(self.window),
                            min_count=int(self.min_count),
                            train_words=True)
            model.save(self.output().path)

  71. Word2vec/Doc2vec offer a similarity
    metric for words and documents

  72. Similarities are useful for
    non-personalized recommender systems

  73. Non-personalized recommenders
    recommend items based on what
    other consumers have said about the
    items.
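    That idea can be sketched with cosine similarity over item vectors; the hotel names and vectors below are invented for illustration, standing in for Doc2Vec embeddings of real reviews:

    ```python
    import math

    # Toy "review embedding" per hotel, standing in for learned Doc2Vec vectors.
    hotel_vectors = {
        "beach_resort":  [0.9, 0.1, 0.3],
        "surf_hostel":   [0.8, 0.2, 0.4],
        "airport_hotel": [0.1, 0.9, 0.2],
    }

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    def similar_hotels(name, top_n=2):
        # Non-personalized recommendation: rank all other hotels by how
        # similar their review embedding is to the given hotel's.
        others = [(other, cosine(hotel_vectors[name], vec))
                  for other, vec in hotel_vectors.items() if other != name]
        return [other for other, _ in sorted(others, key=lambda p: -p[1])][:top_n]

    print(similar_hotels("beach_resort"))  # → ['surf_hostel', 'airport_hotel']
    ```

    No per-user data is involved: the same list is recommended to every visitor, which is what makes the recommender non-personalized.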

  74. http://demo.trustyou.com

  75. Takeaways
    •  It is possible to use Python as the primary
    language for large-scale data processing on
    Hadoop.
    •  It is not a perfect setup but works well most of
    the time.
    •  Keep your ecosystem open to other
    technologies.
