Introduction to Natural Language Processing

Hakka Labs
December 04, 2014

By Max Sklar and Maryam Aly from Foursquare.

Transcript

  1. Today’s Topics
    1) Brief overview of Natural Language Processing
    2) Some ways we use NLP at Foursquare
    3) Introduction to NLTK (the Python library for Natural Language Processing) and some time for trying it out!
  2. Scope of Natural Language Processing
    - Humans communicate in natural language (English, French, etc.)
    - NLP is the interaction between human and machine communication.
    - Turning our text into data
  3. Natural Language Understanding
    Some lofty goals:
    - For machines to understand human language.
    - For machines to understand questions and answer them.
    - For machines to identify ambiguities in our language and know how to resolve them or appropriately ask more questions.
  4. Watson (AI System from IBM)
    - Defeated Jeopardy champions in 2011
    - Currently used in health care to help people develop treatments for patients.
    - In the game, the machine had some really smart answers and some really dumb answers.
  5. Some Common NLP Tasks
    - Language Detection
    - Entity Recognition (names, places, etc.)
    - Classification (Sentiment, Spam, Topics)
    - Parsing and Chunking
    - Information Retrieval, tokenizing & stemming
    - Synonyms
    - Language Modeling
  6. Tools to Accomplish these Tasks
    Linguistic Analysis
    - Stemmers (multilingual)
    - Grammatical Rules
    Statistical NLP
    - Machine learning
    - Bayesian Statistics, Smoothing
    - Deep Learning (word2vec)
  7. We aim to be the best personalized, local recommender system. To accomplish this, we need to incorporate a variety of Natural Language Processing techniques.
  8. Counting Phrases
    MapReduce jobs give us an efficient way to compute counts over large datasets periodically.
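The counting idea on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map and reduce steps, not the actual Foursquare job; the function names (`map_phrases`, `reduce_counts`, `count_phrases`) and the choice of one- and two-word phrases are assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def map_phrases(text, max_n=2):
    """Map step: emit a (phrase, 1) pair for every phrase of 1..max_n words."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), 1


def reduce_counts(left, right):
    """Reduce step: merge two partial count tables into one."""
    merged = Counter(left)
    merged.update(right)
    return merged


def count_phrases(documents, max_n=2):
    """Count phrases per document, then merge the partial counts."""
    partials = []
    for text in documents:
        partial = Counter()
        for phrase, one in map_phrases(text, max_n):
            partial[phrase] += one
        partials.append(partial)
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    tips = ["great burger", "great burger night"]  # made-up sample text
    print(count_phrases(tips).most_common(3))
```

Because the reduce step is just an associative merge of partial counts, the same logic could be spread over many machines and rerun periodically.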
  9. AFINN Word List
    - Written for Twitter
    - 2475 Words (Aug 2011)
    - Wonderful +4, Lol +2, Delicious +0, Try +0, Wicked -2, Lobby -2, Worthless -2, Terrible -3
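A word list like this is typically used by summing the scores of the words found in a text. Below is a minimal sketch of that idea, using only the handful of scores shown on the slide (read as word/score pairs); the names `SAMPLE_SCORES` and `lexicon_score` and the simple regex tokenizer are assumptions for illustration, and a real application would load the full word list.

```python
import re

# Scores as they appear on the slide; the full list has far more entries.
SAMPLE_SCORES = {
    "wonderful": 4,
    "lol": 2,
    "delicious": 0,
    "try": 0,
    "wicked": -2,
    "lobby": -2,
    "worthless": -2,
    "terrible": -3,
}


def lexicon_score(text, scores=SAMPLE_SCORES):
    """Sum the scores of known words; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(scores.get(token, 0) for token in tokens)


if __name__ == "__main__":
    print(lexicon_score("Wonderful burger, lol"))
    print(lexicon_score("Terrible and worthless"))
```

Note that a word scored +0, such as "Delicious" on the slide, contributes nothing to the total, which hints at why a list written for Twitter may fit venue tips poorly.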
  10. Most Highly Correlated Phrases
    Positive Phrases: Highly Recommended, Awesome Food, Can't go wrong with, To die for, Amazing, Is My Favorite, ❤
    Negative Phrases: Worst, Horrible, Bad Service, Not Worth, Terrible, Mediocre, Rude
  11. Algorithms
    Naive Bayes
    - Can be inferred directly from counts
    Logistic Regression
    - Must be trained
    - 3-class sparse logistic regression
    - Better Performance
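To illustrate what "inferred directly from counts" means for Naive Bayes, here is a rough sketch of a multinomial Naive Bayes classifier built from nothing but word and class counts. The function names, the add-one smoothing (`alpha`), and the tiny two-class training set are all assumptions for the example; the slides do not say how the production model was built.

```python
import math
from collections import Counter, defaultdict


def train_counts(labeled_docs):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return class_counts, word_counts


def predict(text, class_counts, word_counts, alpha=1.0):
    """Return the class with the highest smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in class_counts:
        logp = math.log(class_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in text.lower().split():
            logp += math.log((word_counts[label][word] + alpha) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


if __name__ == "__main__":
    data = [  # made-up training examples
        ("pos", "highly recommended awesome food"),
        ("pos", "amazing to die for"),
        ("neg", "bad service rude"),
        ("neg", "terrible mediocre not worth"),
    ]
    class_counts, word_counts = train_counts(data)
    print(predict("awesome food", class_counts, word_counts))
```

No iterative training loop is needed: every probability is a ratio of counts, which is why this approach pairs well with periodic counting jobs. Logistic regression, by contrast, has to fit its weights by optimization.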
  12. Scoring Phrases
    The week is broken up into 168 hours. Each phrase mentioned in our check-in app Swarm is counted over each week-hour bucket. The score of a phrase (p) for a given time bucket is the log of the ratio between that phrase's probability in the bucket and the baseline distribution's probability for the same bucket.
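One way to read the scoring rule above is as a per-bucket log ratio between a phrase's distribution over the 168 week-hours and the baseline distribution over the same buckets. The sketch below follows that reading; the helper names, the Monday-first bucket layout, and the add-one smoothing (used here only to avoid taking the log of zero) are assumptions, not details from the talk.

```python
import math

HOURS_PER_WEEK = 168  # 7 days x 24 hours


def week_hour(weekday, hour):
    """Map (weekday, hour) to a bucket index; weekday 0 is Monday (assumed)."""
    return weekday * 24 + hour


def normalize(counts, alpha=1.0):
    """Turn raw bucket counts into a smoothed probability distribution."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def bucket_scores(phrase_counts, baseline_counts):
    """Score each bucket as log(P(bucket | phrase) / P(bucket | baseline))."""
    p = normalize(phrase_counts)
    q = normalize(baseline_counts)
    return [math.log(pi / qi) for pi, qi in zip(p, q)]


if __name__ == "__main__":
    baseline = [10] * HOURS_PER_WEEK      # made-up flat baseline
    phrase = [0] * HOURS_PER_WEEK
    phrase[week_hour(4, 12)] = 50         # mentions cluster at Friday noon
    scores = bucket_scores(phrase, baseline)
    print(scores[week_hour(4, 12)], scores[0])
```

A positive score means the phrase is over-represented in that hour relative to the baseline, and a negative score means it is under-represented.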
  13. Bhattacharyya Distance
    - Easy to calculate
    - Compare phrases to the baseline to see if they have a timeliness component
    - Compare phrases to each other to see if they are relevant at similar times
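For two discrete distributions p and q over the same buckets, the Bhattacharyya coefficient is the sum of sqrt(p_i * q_i), and the Bhattacharyya distance is the negative log of that coefficient. A minimal sketch follows; the function names are chosen for this example, and the inputs are assumed to be already-normalized probability lists of equal length.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap between two distributions: 1.0 if identical, 0.0 if disjoint."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of buckets")
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def bhattacharyya_distance(p, q):
    """Negative log of the coefficient; infinite when there is no overlap."""
    bc = bhattacharyya_coefficient(p, q)
    if bc == 0:
        return math.inf
    return -math.log(bc)


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    skewed = [1.0, 0.0]
    print(bhattacharyya_coefficient(uniform, uniform))
    print(bhattacharyya_coefficient(uniform, skewed))
    print(bhattacharyya_distance(uniform, skewed))
```

A coefficient near 1 means the phrase follows the baseline closely, so it has little timeliness; a low coefficient means the phrase is concentrated at particular times.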
  14. Bhattacharyya Coefficients WRT a baseline distribution
    friday lunch      0.326298016807
    sunday breakfast  0.330937026235
    happi sunday      0.365173887193
    wednesday         0.665803555093
    burger night      0.6728841103
    breakfast spot    0.673112547539
    swim lesson       0.815864282503
    egg white         0.816790714628
    know how          0.998238840922
    ha ha             0.998261847192
    west              0.998483877379
  15. Tastes
    Using noun phrase detection, sentiment analysis, and TF-IDF
    TF = Term Frequency
    IDF = Inverse Document Frequency
    In other words, how often does a term appear at a venue, normalized by how often it appears across all venues.
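The TF-IDF idea described above can be sketched by treating each venue as a "document". The weighting used here (relative term frequency times log(N / document frequency)) is one common variant, chosen as an assumption; the slides do not state the exact formula, and the names `tf_idf` and `venue_terms` and the sample venues are hypothetical.

```python
import math
from collections import Counter


def tf_idf(venue_terms):
    """venue_terms maps a venue name to the list of terms mentioned there.

    Returns a dict: venue -> {term: tf-idf score}.
    """
    n_venues = len(venue_terms)
    # Document frequency: in how many venues does each term appear?
    doc_freq = Counter()
    for terms in venue_terms.values():
        doc_freq.update(set(terms))

    scores = {}
    for venue, terms in venue_terms.items():
        counts = Counter(terms)
        total = len(terms)
        scores[venue] = {
            term: (count / total) * math.log(n_venues / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


if __name__ == "__main__":
    venues = {  # made-up data
        "burger_bar": ["burger", "fries", "burger"],
        "cafe": ["coffee", "fries"],
    }
    for venue, term_scores in tf_idf(venues).items():
        print(venue, term_scores)
```

In this toy data "fries" appears at every venue, so its IDF is log(1) = 0 and it is not distinctive for either venue, while "burger" and "coffee" score above zero.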