Natural Language Processing in Ruby

Brandon Black

April 29, 2013

Transcript

  1. 3.

    railsconf 2013 | brandon black brandonmblack.com

    Somi

    Some facts about her:
    • She drools a lot
    • She snores louder than me
    • She’s shaped like a potato

    ...but who can say no to that face?
  2. 5.

    Goals & Agenda

    An Introduction to Natural Language Processing
    • What is It?
    • Why is it so Difficult?
    • Why is it Important?

    Understanding the Tools
    • What’s Available
    • How Ruby Measures Up
    • Bridging the Gaps

    Digging Deeper

    What’s Next?
    • Looking Ahead, Learning More and Getting Involved
  3. 6.

    What is It?

    Analyzing, understanding and generating the language that humans use to interface with computers.
  4. 7.

    What is It?

    • Predictive Text
    • Content Categorization
    • Search
    • Spell Checking
    • Auto Summarization
    • Machine Translation
    • Much, Much More...
  5. 8.

    "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."

    • buf-fa-lo (verb): overawe or intimidate (someone): she didn't like being buffaloed.
    • Buf-fa-lo (noun): an industrial city in the northwestern part of the state of New York.
    • buf-fa-lo (noun): a heavily built wild ox with backswept horns.
  6. 9.

    Why It’s So Hard

    No perfect solution exists
    • Experts often disagree on the results
    • Fixing one thing often causes problems elsewhere
  7. 10.
  8. 11.

    Why It’s So Hard

    No perfect solution exists
    • Experts often disagree on the results
    • Fixing one thing often causes problems elsewhere

    Language itself is a moving target
    • Different cultures, grammar, syntax
    • Language is constantly evolving
    • Technological advances influence language
  9. 12.
  10. 13.

    Why It’s So Hard

    No perfect solution exists
    • Experts often disagree on the results
    • Fixing one thing often causes problems elsewhere

    Language itself is a moving target
    • Different cultures, grammar, syntax
    • Language is constantly evolving
    • Technological advances influence language

    Many aspects are computationally complex
    • Today vs. 10 years ago
    • Hardware has advanced, become cheaper
  11. 14.

    AI-Complete

    The difficulty involved in these computational problems is considered to be equivalent to that of solving the central artificial intelligence problem: making computers as intelligent as people.

    In short, passing the Turing test...
  12. 15.

    Why is it Important?

    • Email: 28%
    • Search: 19%
    • Collaboration: 14%
    • Role Specific Tasks: 39%

    Study: MGI (2012) http://brblck.com/YfS1Bb
  13. 16.

    Why is it Important?

    Demand/need is increasing steeply
    • The problem space is growing
    • Everyone has a “big data” problem these days

    Fun facts about data growth:

    Photos:
    • 4 billion in the last year alone, 4x last decade
    • Half found their way onto the Internet

    Information:
    • 1.8 zettabytes annually (Source: IDC 2011)
    • Increase 50x by 2020
  14. 17.

    3 Common Approaches

    • Rule-Based Analysis
    • Statistical Analysis
    • Machine Learning

    Some of the most effective solutions we have today rely on the human-in-the-loop approach, learning from user feedback.
  15. 18.

    Basic Building Blocks

    • Sentence Detection
    • Word Relationships
    • POS Tagging
    • Chunking
    • Tokenizing
    • Co-Reference Resolution
    • Word Stemming
    • Named-Entity Recognition
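
Three of these building blocks (sentence detection, tokenizing and word stemming) can be roughly approximated with regular expressions. The sketch below is only an illustration in plain Ruby: the helper names `detect_sentences`, `tokenize` and `naive_stem` are hypothetical, and real toolkits rely on trained models and proper stemming algorithms rather than rules this crude.

```ruby
# Naive, regex-based stand-ins for three building blocks.
# All helper names here are hypothetical, not part of any library.

# Sentence detection: split after ., ! or ? when followed by whitespace.
def detect_sentences(text)
  text.strip.split(/(?<=[.!?])\s+/)
end

# Tokenizing: pull out runs of letters, digits and apostrophes.
def tokenize(sentence)
  sentence.scan(/[[:alnum:]']+/)
end

# Word stemming: lowercase and strip one common English suffix (very crude).
def naive_stem(word)
  word.downcase.sub(/(ing|ed|es|s)\z/, '')
end

text = "She drools a lot. She snores louder than me!"
sentences = detect_sentences(text)
tokens = sentences.map { |s| tokenize(s) }
stems = tokens.flatten.map { |t| naive_stem(t) }
```

A rule like this breaks quickly (abbreviations such as "Dr." end a sentence early, and "snores" is cut down to "snor"), which is one reason the libraries on the following slides exist.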
  16. 19.

    Tools & Libraries

    NLP Toolkit
    • Leading NLP Toolkit/Framework
    • Strong support from the academic world (SciPy, NumPy)
    • Python was chosen for its expressiveness, ease-of-use

    What about Ruby?
  17. 20.

    Tools & Libraries

    Chronic

        require 'chronic'

        Chronic.parse('tomorrow')
        #=> 2013-04-30 12:00:00 -0700

        Chronic.parse('monday', :context => :past)
        #=> 2013-04-22 12:00:00 -0700

        Chronic.parse('this tuesday 5:00')
        #=> 2013-04-30 17:00:00 -0700

        Chronic.parse('august 29th 1997 at 2:14 am')
        #=> 1997-08-29 02:14:00 -0700
  18. 21.

    Tools & Libraries

    Linguistics

        require 'linguistics'
        Linguistics.use(:en)

        "spy".en.plural              # => "spies"
        "spy".en.a                   # => "a spy"
        "spy".en.present_participle  # => "spying"
        "spy".en.quantify(5)         # => "several spies"

        3.en.ordinal   # => "3rd"
        3.en.numwords  # => "three"

        "be".en.conjugate(:present, :third_person_singular)  # => "is"
        "be".en.conjugate(:present, :first_person_singular)  # => "am"
  19. 22.

    Tools & Libraries

    Punkt Segmenter

        require 'punkt-segmenter'

        tokenizer = Punkt::SentenceTokenizer.new(text)
        result = tokenizer.sentences_from_text(text)
        #=> [[0, 201], [203, 351], [353, 526]]

        trainer = Punkt::Trainer.new()
        trainer.train(training_text)
  20. 23.

    Tools & Libraries

    Ruby Stemmer

        require 'lingua/stemmer'

        stemmer = Lingua::Stemmer.new(:language => "en")
        stemmer.stem("potatoes") #=> "potato"
  21. 24.

    Tools & Libraries

    Treat: “The Ruby NLP Toolkit”
    • Extraction
    • Chunking
    • Sentence Segmentation
    • Stemming
    • Machine Learning
    • Inflection
    • Serialization
  22. 27.

    +

  23. 29.

    It’s the Ruby you know and love with:
    • True Multicore Concurrency (No GIL)
    • Portability of Java
    • JIT-Compilation

    It allows you to leverage well-established, mature Java libraries from within your Ruby code.

    Real globals and constants

    No wait, it’s like this really cool thing!
  24. 30.

    Example Summarization

    1. Tokenize Text
    2. Remove Stop Words
    3. Rank Relevant Words
    4. Extract Sentences
    5. Rank Sentences
    6. Return Summary

        require 'opennlp-tools-1.5.2-incubating'
        require 'snowball-libstemmer'
        include_package 'java.io'

        MIN_SIZE = 4
        SENTENCE_COUNT = 3
        STOP_WORDS_FILE = 'stop_words.txt'
  25. 31.

    Example Summarization

    1. Tokenize Text
    2. Remove Stop Words
    3. Rank Relevant Words
    4. Extract Sentences
    5. Rank Sentences
    6. Return Summary

        # Import the package before referencing its classes
        include_package 'opennlp.tools.tokenize'

        token_model = TokenizerModel.new(
          FileInputStream.new('models/en-token.bin'))
        tokenizer = TokenizerME.new(token_model)
        tokens = tokenizer.tokenize(text).to_a
  26. 32.

    Example Summarization

    1. Tokenize Text
    2. Remove Stop Words
    3. Rank Relevant Words
    4. Extract Sentences
    5. Rank Sentences
    6. Return Summary

        def remove_stop_words(words)
          # Load the stop list once; File.read also closes the file handle.
          # An instance variable replaces @@stop_words, which is not allowed
          # at the top level in current Ruby versions.
          @stop_words ||= File.read(STOP_WORDS_FILE).split
          words.reject! do |w|
            @stop_words.include?(w.downcase) || w.size <= MIN_SIZE
          end
        end

        # filter stop words
        remove_stop_words(tokens)
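
One detail worth knowing about this step: `reject!` filters the array in place and returns `nil` when nothing was removed, so its return value should not be used as the filtered list. Below is a minimal sketch of a non-mutating variant in plain Ruby. The inline stop list, the constants `SKETCH_STOP_WORDS` and `SKETCH_MIN_SIZE`, and the helper `content_words` are hypothetical stand-ins for the slide's `stop_words.txt` and `MIN_SIZE`.

```ruby
require 'set'

# Hypothetical inline stop list; the slide loads its list from stop_words.txt.
SKETCH_STOP_WORDS = Set.new(%w[the a an and of to in is it that])
# Same cutoff as the slide: words of this length or shorter are dropped.
SKETCH_MIN_SIZE = 4

# Non-mutating filter: always returns an array, even when nothing is removed.
def content_words(tokens)
  tokens.reject do |w|
    SKETCH_STOP_WORDS.include?(w.downcase) || w.size <= SKETCH_MIN_SIZE
  end
end

tokens = %w[The buffalo intimidate the other buffalo]
filtered = content_words(tokens)
# filtered is ["buffalo", "intimidate", "other", "buffalo"]; tokens is unchanged

# By contrast, reject! returns nil when no element was removed.
unchanged = %w[buffalo intimidate].reject! { |w| w.size <= SKETCH_MIN_SIZE }
# unchanged is nil
```

A `Set` keeps the membership test fast when the stop list grows to a few hundred words.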
  27. 33.

    Example Summarization

    1. Tokenize Text
    2. Remove Stop Words
    3. Rank Relevant Words
    4. Extract Sentences
    5. Rank Sentences
    6. Return Summary

        def word_stem(word)
          stemmer = Java::org.tartarus.snowball.ext.englishStemmer.new
          stemmer.current = word
          stemmer.stem
          stemmer.current
        end

        # Count how often each stem occurs
        # (`words` is assumed to be the filtered token list from step 2)
        rankings = Hash.new(0)
        words.each { |w| rankings[word_stem(w)] += 1 }

        # Keep the 100 most frequent stems, highest count first
        rankings = rankings.sort_by { |_stem, count| -count }.first(100)
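
The ranking step is a plain frequency count followed by a sort. Here is a small sketch that runs without the Java stemmer, assuming the stems have already been computed; the helper name `top_terms` is hypothetical.

```ruby
# Count how often each stem occurs, then keep the most frequent ones.
def top_terms(stems, limit)
  counts = Hash.new(0)
  stems.each { |stem| counts[stem] += 1 }
  # Sort by descending count; ties fall back to alphabetical order
  # so the output is deterministic.
  counts.sort_by { |stem, count| [-count, stem] }.first(limit)
end

ranked = top_terms(%w[potato face potato drool face potato], 2)
# => [["potato", 3], ["face", 2]]
```

Note that `sort_by` on a hash returns an array of `[stem, count]` pairs, which is why slicing with `first` works on the result but not on the hash itself.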
  28. 34.

    Example Summarization

    1. Tokenize Text
    2. Remove Stop Words
    3. Rank Relevant Words
    4. Extract Sentences
    5. Rank Sentences
    6. Return Summary

        include_package 'opennlp.tools.sentdetect'

        sent_model = SentenceModel.new(
          FileInputStream.new('models/en-sent.bin'))
        detector = SentenceDetectorME.new(sent_model)
        sentences = detector.sent_detect(text).to_a
  29. 35.

    Example Summarization

    1. Tokenize Text
    2. Remove Stop Words
    3. Rank Relevant Words
    4. Extract Sentences
    5. Rank Sentences
    6. Return Summary

        sentence_ranks = {}
        sentences.each_with_index do |s, i|
          rank = 0
          words.each { |w| rank += 1 if s.include?(w) }
          sentence_ranks[i] = rank
        end
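
Because `String#include?` tests for substrings, the ranking above also counts a word that merely appears inside a longer one. The sketch below contrasts that with matching on whole tokens; the helper names `substring_rank` and `token_rank` are hypothetical, and the regex tokenizer is a simplification rather than the OpenNLP tokenizer used in the talk.

```ruby
# Substring matching: "art" is found inside "party".
def substring_rank(sentence, words)
  words.count { |w| sentence.include?(w) }
end

# Token matching: a word only counts when it equals a whole token.
def token_rank(sentence, words)
  tokens = sentence.downcase.scan(/[a-z']+/)
  words.count { |w| tokens.include?(w.downcase) }
end

sentence = "The party ended early."
by_substring = substring_rank(sentence, %w[art end]) # => 2
by_token = token_rank(sentence, %w[art end])         # => 0
```

Which behaviour is preferable depends on whether the word list holds stems (where prefix-style matches can be useful) or full words.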
  30. 36.

    Example Summarization

    1. Tokenize Text
    2. Remove Stop Words
    3. Rank Relevant Words
    4. Extract Sentences
    5. Rank Sentences
    6. Return Summary

        # output summary
        summary = []
        sentence_ranks = sentence_ranks.sort { |a, b| b[1] <=> a[1] }
        sentence_ranks.each do |rank|
          summary << sentences[rank[0]]
          break if summary.size >= SENTENCE_COUNT
        end
        summary.join(' ')
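
For readers without JRuby and the OpenNLP models at hand, the six steps can be sketched end to end in plain Ruby. This is an approximation under stated assumptions, not the talk's implementation: it uses regex tokenizing instead of trained models, skips stemming, scores a sentence by summing the frequencies of its words, and returns the chosen sentences in document order. The names `SUMMARY_STOP_WORDS`, `sentences_of`, `words_of` and `summarize` are hypothetical.

```ruby
require 'set'

# Hypothetical stop list for the sketch.
SUMMARY_STOP_WORDS = Set.new(
  %w[the a an and of to in is it that this with for on as are was]
)

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
def sentences_of(text)
  text.strip.split(/(?<=[.!?])\s+/)
end

# Naive tokenizer: lowercase runs of letters and apostrophes.
def words_of(text)
  text.downcase.scan(/[a-z']+/)
end

def summarize(text, sentence_count)
  # Steps 1-3: tokenize, drop stop words, count what is left.
  counts = Hash.new(0)
  words_of(text).each { |w| counts[w] += 1 unless SUMMARY_STOP_WORDS.include?(w) }

  # Steps 4-5: score each sentence by summing the counts of its words.
  sentences = sentences_of(text)
  scored = sentences.each_with_index.map do |sentence, index|
    score = words_of(sentence).sum { |w| counts.fetch(w, 0) }
    [score, index]
  end

  # Step 6: take the best-scoring sentences, then restore document order.
  best = scored.sort_by { |score, index| [-score, index] }.first(sentence_count)
  best.map { |_score, index| index }.sort.map { |index| sentences[index] }.join(' ')
end

text = "Ruby is fun. Ruby has many NLP gems. Cats sleep."
short = summarize(text, 1)  # => "Ruby has many NLP gems."
longer = summarize(text, 2) # => "Ruby is fun. Ruby has many NLP gems."
```

Summing frequencies favours long sentences; dividing the score by the sentence's word count is one common adjustment if that bias matters.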
  31. 37.

    What’s Next?

    Learn More
    • Don’t be afraid to try other languages/platforms
    • Leverage Coursera / Online Learning
    • Less TV, more books
  32. 38.
  33. 39.

    What’s Next?

    Learn More
    • Don’t be afraid to try other languages/platforms
    • Leverage Coursera / Online Learning
    • Less TV, more books

    Contribute
    • Treat (https://github.com/louismullie/treat)
    • SciRuby Project (http://sciruby.com/)

    Share
    • Local meetups
    • Tech talks at your workplace
    • Blogs