Slide 1

Slide 1 text

railsconf 2013 | brandon black brandonmblack.com @brandonmblack

Slide 2

Slide 2 text

+

Slide 3

Slide 3 text

Somi
Some facts about her:
• She drools a lot
• She snores louder than me
• She’s shaped like a potato
...but who can say no to that face?

Slide 4

Slide 4 text

ONE OF US

Slide 5

Slide 5 text

Goals & Agenda
• An Introduction to Natural Language Processing: What is It? Why is it so Difficult? Why is it Important?
• Understanding the Tools: What’s Available, How Ruby Measures Up, Bridging the Gaps
• Digging Deeper: What’s Next? Looking Ahead, Learning More and Getting Involved

Slide 6

Slide 6 text

What is It?
Analyzing, understanding and generating the language that humans use to interface with computers.

Slide 7

Slide 7 text

What is It?
• Predictive Text
• Content Categorization
• Search
• Spell Checking
• Auto Summarization
• Machine Translation
• Much, Much More...

Slide 8

Slide 8 text

"Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo." buf-fa-lo (verb) overawe or intimidate (someone): she didn't like being buffaloed. Buf-fa-lo (noun) an industrial city in the northwestern part of the state of New York. buf-fa-lo (noun) a heavily built wild ox with backswept horns.

Slide 9

Slide 9 text

Why It’s So Hard
No perfect solution exists
• Experts often disagree on the results
• Fixing one thing often causes problems elsewhere

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Why It’s So Hard
No perfect solution exists
• Experts often disagree on the results
• Fixing one thing often causes problems elsewhere
Language itself is a moving target
• Different cultures, grammar, syntax
• Language is constantly evolving
• Technological advances influence language

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Why It’s So Hard
No perfect solution exists
• Experts often disagree on the results
• Fixing one thing often causes problems elsewhere
Language itself is a moving target
• Different cultures, grammar, syntax
• Language is constantly evolving
• Technological advances influence language
Many aspects are computationally complex
• Today vs. 10 years ago
• Hardware has advanced and become cheaper

Slide 14

Slide 14 text

AI-Complete
The difficulty involved in these computational problems is considered to be equivalent to that of solving the central artificial intelligence problem: making computers as intelligent as people. In short, passing the Turing test...

Slide 15

Slide 15 text

Why is it Important?
• Email: 28%
• Search: 19%
• Collaboration: 14%
• Role Specific Tasks: 39%
Study: MGI (2012) http://brblck.com/YfS1Bb

Slide 16

Slide 16 text

Why is it Important?
Demand/need is increasing steeply
• The problem space is growing
• Everyone has a “big data” problem these days
Fun facts about data growth:
Photos:
• 4 billion in the last year alone, 4x last decade
• Half found their way onto the Internet
Information:
• 1.8 zettabytes annually (Source: IDC 2011)
• Increase 50x by 2020

Slide 17

Slide 17 text

3 Common Approaches
• Rule-Based Analysis
• Statistical Analysis
• Machine Learning
Some of the most effective solutions we have today rely on the human-in-the-loop approach, learning from user feedback.
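
To make the first two approaches concrete, here is a minimal plain-Ruby sketch on one toy question: is a token such as "Dr." an abbreviation, or does its period end a sentence? The method names, the abbreviation list, the thresholds and the training text are all assumptions made up for this illustration. The statistical half is only loosely in the spirit of unsupervised sentence tokenizers; it is not the algorithm of any particular library.

```ruby
# Rule-based: consult a hand-written list (a made-up sample here).
def rule_based_abbreviation?(token, known = %w[dr mr mrs etc])
  known.include?(token.downcase.chomp("."))
end

# Statistical: learn from raw text which words almost always carry a
# trailing period. Words seen fewer than min_count times are ignored.
def learn_abbreviations(training_text, min_ratio: 0.9, min_count: 2)
  with_period = Hash.new(0)
  total = Hash.new(0)
  training_text.split.each do |word|
    key = word.downcase.chomp(".")
    total[key] += 1
    with_period[key] += 1 if word.end_with?(".")
  end
  total.keys.select do |key|
    total[key] >= min_count && with_period[key].fdiv(total[key]) >= min_ratio
  end
end

training = "Dr. Lee met Dr. Kim. Mr. Kim thanked Dr. Lee. Then Kim left. Lee stayed."
learned = learn_abbreviations(training)
puts learned.inspect
puts rule_based_abbreviation?("Mr.")
puts rule_based_abbreviation?("Prof.")
```

On this tiny sample the statistical version picks up "dr" but not "mr", which occurs only once, while the rule-based version knows "Mr." but has no entry for "Prof.". Each approach fails where its input (the rules or the data) is thin.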

Slide 18

Slide 18 text

Basic Building Blocks
• Sentence Detection
• Word Relationships
• POS Tagging
• Chunking
• Tokenizing
• Co-Reference Resolution
• Word Stemming
• Named-Entity Recognition
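
Three of these building blocks (sentence detection, tokenizing and word stemming) can be approximated in a few lines of plain Ruby to get a feel for what each one does. This is only a sketch: the method names, the regular expressions and the suffix list are assumptions invented for this example, and they are far cruder than what a real segmenter or stemmer does.

```ruby
# Sentence detection: split after ., ! or ? when whitespace follows.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

# Tokenizing: runs of letters, digits and apostrophes become tokens.
def naive_tokens(sentence)
  sentence.scan(/[[:alnum:]']+/)
end

# Word stemming: strip one common English suffix, as long as at least
# three characters remain. The suffix list is a made-up sample.
def naive_stem(word, suffixes = %w[ing ed es s])
  word = word.downcase
  suffix = suffixes.find { |s| word.end_with?(s) && word.size - s.size >= 3 }
  suffix ? word[0...-suffix.size] : word
end

text = "She drools a lot. She snores louder than me!"
sentences = naive_sentences(text)
tokens = naive_tokens(sentences.first)
stems = tokens.map { |t| naive_stem(t) }

puts sentences.inspect
puts tokens.inspect
puts stems.inspect
```

The regex splitter breaks on abbreviations such as "Dr.", and the suffix stripper mangles many words, which is exactly why trained models and proper stemming algorithms exist.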

Slide 19

Slide 19 text

Tools & Libraries
NLP Toolkit
• Leading NLP Toolkit/Framework
• Strong support from the academic world (SciPy, NumPy)
• Python was chosen for its expressiveness, ease-of-use
What about Ruby?

Slide 20

Slide 20 text

Tools & Libraries
Chronic
    require 'chronic'
    Chronic.parse('tomorrow')
    #=> 2013-04-30 12:00:00 -0700
    Chronic.parse('monday', :context => :past)
    #=> 2013-04-22 12:00:00 -0700
    Chronic.parse('this tuesday 5:00')
    #=> 2013-04-30 17:00:00 -0700
    Chronic.parse('august 29th 1997 at 2:14 am')
    #=> 1997-08-29 02:14:00 -0700
Activity Level:

Slide 21

Slide 21 text

Tools & Libraries
Linguistics
    require 'linguistics'
    Linguistics.use(:en)
    "spy".en.plural             # => "spies"
    "spy".en.a                  # => "a spy"
    "spy".en.present_participle # => "spying"
    "spy".en.quantify(5)        # => "several spies"
    3.en.ordinal                # => "3rd"
    3.en.numwords               # => "three"
    "be".en.conjugate(:present, :third_person_singular) # => "is"
    "be".en.conjugate(:present, :first_person_singular) # => "am"
Activity Level:

Slide 22

Slide 22 text

Tools & Libraries
Punkt Segmenter
    require 'punkt-segmenter'
    # split text into sentences
    tokenizer = Punkt::SentenceTokenizer.new(text)
    result = tokenizer.sentences_from_text(text)
    #=> [[0, 201], [203, 351], [353, 526]]
    # train on your own text
    trainer = Punkt::Trainer.new()
    trainer.train(training_text)
Activity Level:

Slide 23

Slide 23 text

Tools & Libraries
Ruby Stemmer
    require 'lingua/stemmer'
    stemmer = Lingua::Stemmer.new(:language => "en")
    stemmer.stem("potatoes") #=> "potato"
Activity Level:

Slide 24

Slide 24 text

Tools & Libraries
Treat
“The Ruby NLP Toolkit”
• Extraction
• Chunking
• Sentence Segmentation
• Stemming
• Machine Learning
• Inflection
• Serialization
Activity Level:

Slide 25

Slide 25 text

What happens when you need more?

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

+

Slide 28

Slide 28 text

Did he just say Java?

Slide 29

Slide 29 text

No wait, it’s like this really cool thing!
It’s the Ruby you know and love with:
• True Multicore Concurrency (No GIL)
• Portability of Java
• JIT-Compilation
• Real globals and constants
It allows you to leverage well-established, mature Java libraries from within your Ruby code.

Slide 30

Slide 30 text

Example Summarization
1. Tokenization
2. Remove Stop Words
3. Rank Relevant Words
4. Extract Sentences
5. Rank Sentences
6. Return Summary
    require 'java' # JRuby's Java integration
    require 'opennlp-tools-1.5.2-incubating'
    require 'snowball-libstemmer'
    include_package 'java.io'
    MIN_SIZE = 4
    SENTENCE_COUNT = 3
    STOP_WORDS_FILE = 'stop_words.txt'

Slide 31

Slide 31 text

Example Summarization
1. Tokenize Text
2. Remove Stop Words
3. Rank Relevant Words
4. Extract Sentences
5. Rank Sentences
6. Return Summary
    include_package 'opennlp.tools.tokenize'
    # load the pre-trained English tokenizer model
    token_model = TokenizerModel.new(
      FileInputStream.new('models/en-token.bin'))
    tokenizer = TokenizerME.new(token_model)
    tokens = tokenizer.tokenize(text).to_a

Slide 32

Slide 32 text

Example Summarization
1. Tokenize Text
2. Remove Stop Words
3. Rank Relevant Words
4. Extract Sentences
5. Rank Sentences
6. Return Summary
    # drop stop words and any token of MIN_SIZE characters or fewer
    def remove_stop_words(words)
      stop_words = File.read(STOP_WORDS_FILE).split
      words.reject do |w|
        stop_words.include?(w.downcase) || w.size <= MIN_SIZE
      end
    end
    # filter stop words
    words = remove_stop_words(tokens)

Slide 33

Slide 33 text

Example Summarization
1. Tokenize Text
2. Remove Stop Words
3. Rank Relevant Words
4. Extract Sentences
5. Rank Sentences
6. Return Summary
    def word_stem(word)
      stemmer = Java::org.tartarus.snowball.ext.englishStemmer.new
      stemmer.current = word
      stemmer.stem
      stemmer.current
    end
    # count how often each stem occurs
    rankings = Hash.new(0)
    words.each do |w|
      rankings[word_stem(w)] += 1
    end
    # keep the 100 most frequent stems as [stem, count] pairs
    rankings = rankings.sort_by {|k,v| v}.reverse
    rankings = rankings[0..99]

Slide 34

Slide 34 text

Example Summarization
1. Tokenize Text
2. Remove Stop Words
3. Rank Relevant Words
4. Extract Sentences
5. Rank Sentences
6. Return Summary
    include_package 'opennlp.tools.sentdetect'
    sent_model = SentenceModel.new(
      FileInputStream.new('models/en-sent.bin'))
    detector = SentenceDetectorME.new(sent_model)
    sentences = detector.sent_detect(text).to_a

Slide 35

Slide 35 text

Example Summarization
1. Tokenize Text
2. Remove Stop Words
3. Rank Relevant Words
4. Extract Sentences
5. Rank Sentences
6. Return Summary
    sentence_ranks = {}
    sentences.each_with_index do |s,i|
      rank = 0
      words.each {|w| rank += 1 if s.include?(w)}
      sentence_ranks[i] = rank
    end
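
One thing to be aware of in this ranking step: String#include? does substring matching, so a word also scores when it only appears inside a longer word. The plain-Ruby sketch below shows the difference between substring matching and whole-token matching. The sentence, the word list and the regex tokenizer are made up for this illustration; whether the difference matters depends on the text being summarized.

```ruby
sentence = "Remodeling the kitchen took months."
words = %w[model months]

# Substring matching: "model" is found inside "remodeling".
substring_rank = words.count { |w| sentence.downcase.include?(w) }

# Token matching: split the sentence into words first, then compare
# whole tokens, so "model" no longer matches.
tokens = sentence.downcase.scan(/[a-z']+/)
token_rank = words.count { |w| tokens.include?(w) }

puts substring_rank
puts token_rank
```

Substring matching is not necessarily wrong here, since it gives a rough form of stem matching for free, but it can inflate the rank of sentences that happen to contain longer, unrelated words.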

Slide 36

Slide 36 text

Example Summarization
1. Tokenize Text
2. Remove Stop Words
3. Rank Relevant Words
4. Extract Sentences
5. Rank Sentences
6. Return Summary
    # output summary
    summary = []
    sentence_ranks = sentence_ranks.sort {|a,b| b[1] <=> a[1]}
    sentence_ranks.each do |rank|
      summary << sentences[rank[0]]
      break if summary.size >= SENTENCE_COUNT
    end
    summary.join(' ')
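
For readers who want to try the six steps without JRuby or the OpenNLP model files, the same idea can be sketched in plain Ruby. This is a simplified stand-in, not the code from the slides. It assumes that regex splitting replaces the trained tokenizer and sentence detector, it skips stemming, its stop-word list is a tiny made-up sample, and it returns the chosen sentences in document order rather than rank order. The method name and its keyword arguments are hypothetical.

```ruby
# Frequency-based extractive summary in plain Ruby (no JRuby, no models).
def summarize(text, sentence_count: 3, min_size: 4,
              stop_words: %w[about their there which would these those])
  # Step 4 stand-in: split into sentences with a regex.
  sentences = text.split(/(?<=[.!?])\s+/)

  # Steps 1-2: tokenize, then drop stop words and short tokens.
  words = text.downcase.scan(/[a-z']+/).reject do |w|
    stop_words.include?(w) || w.size <= min_size
  end

  # Step 3: count how often each remaining word occurs.
  counts = Hash.new(0)
  words.each { |w| counts[w] += 1 }

  # Step 5: score each sentence by the counts of the words it contains.
  ranked = sentences.each_with_index.map do |sentence, index|
    tokens = sentence.downcase.scan(/[a-z']+/)
    [index, tokens.sum { |t| counts.fetch(t, 0) }]
  end

  # Step 6: keep the top sentences, then restore document order.
  top = ranked.max_by(sentence_count) { |_index, score| score }
  top.map(&:first).sort.map { |index| sentences[index] }.join(' ')
end

text = "Ruby makes language processing approachable. My dog snores. " \
       "Language processing tools for Ruby keep improving. Lunch was fine."
summary = summarize(text, sentence_count: 2)
puts summary
```

Returning sentences in document order tends to read more naturally than rank order. Sentences with equal scores may be picked in either order, so a tie-breaking rule would be needed for reproducible output on real text.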

Slide 37

Slide 37 text

What’s Next?
Learn More
• Don’t be afraid to try other languages/platforms
• Leverage Coursera / Online Learning
• Less TV, more books

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

What’s Next?
Learn More
• Don’t be afraid to try other languages/platforms
• Leverage Coursera / Online Learning
• Less TV, more books
Contribute
• Treat (https://github.com/louismullie/treat)
• SciRuby Project (http://sciruby.com/)
Share
• Local meetups
• Tech talks at your workplace
• Blogs

Slide 40

Slide 40 text

Thanks