Natural Language Processing with JRuby and OpenNLP

Slide 1

Slide 1 text

Hi, I’m ... Konstantin Tennhard, Ruby Developer at flinc

Slide 2

Slide 2 text

Natural Language Processing (NLP) with JRuby and OpenNLP

Slide 3

Slide 3 text

Motivation Language and stuff ...

Slide 4

Slide 4 text

Sharing Information Language Language is the most natural way to communicate with others. It is excellent for encoding information.

Slide 5

Slide 5 text

Flow of Information Language

Slide 6

Slide 6 text

Flow of Information Language

Slide 7

Slide 7 text

Representation Language Natural language can be represented as a series of sounds or as a series of characters.

Slide 8

Slide 8 text

Intelligent Machines Natural Language Processing With the help of natural language processing methods, we enable machines to understand and process language.

Slide 9

Slide 9 text

Intermediate Processing, e.g., Automatic Translation Natural Language Processing

Slide 10

Slide 10 text

Human-to-Machine Communication Natural Language Processing

Slide 11

Slide 11 text

Examples … we won’t talk about: Machine Translation, Text Summarization, Opinion Mining

Slide 12

Slide 12 text

Examples … we will talk about: Named Entity Recognition, Keyword Extraction

Slide 13

Slide 13 text

A Combination of Many Subjects Natural Language Processing

Slide 14

Slide 14 text

A Combination of Many Subjects Natural Language Processing

Slide 15

Slide 15 text

Linguistic Basics No Ruby, yet. Hang in there.

Slide 16

Slide 16 text

Part of Speech The part of speech or word class of a word denotes its syntactic function. Words can have multiple classes, e.g., ‘to fly’ (Verb) and ‘a fly’ (Noun).

Slide 17

Slide 17 text

Word Stem The stem of a word is the part of the word that is common to all its derived variants. The stem of a word can be an artificial construct.

Slide 18

Slide 18 text

Technology Y u no MRI ...

Slide 19

Slide 19 text

JRuby Ruby is a very expressive language with excellent string processing capabilities. The JVM is a high-performance platform with true multi-threading capabilities. Excellent Java libraries for natural language processing exist.

Slide 20

Slide 20 text

Machine Learning Based NLP Toolkit OpenNLP OpenNLP is solely based on machine learning methods. It uses maximum entropy classification to perform natural language processing tasks. http://opennlp.apache.org/

Slide 21

Slide 21 text

Pre-Trained Models OpenNLP Maximum entropy classifiers have to be trained before they can be utilized. Pre-trained models can be downloaded from SourceForge: http://opennlp.sourceforge.net/models-1.5/

Slide 22

Slide 22 text

Three Steps OpenNLP
1. Load an existing model or create a new one from your own training data.
2. Initialize the classifier using this model as input.
3. Perform the actual classification task.
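
The three steps can be sketched in plain Ruby. This is only an illustration of the workflow: `ToyModel` and `ToyClassifier` are hypothetical names, not part of OpenNLP or the gems, and the "model" is a simple frequency table rather than a maximum entropy model.

```ruby
# Hypothetical stand-ins for illustration only; not the OpenNLP API.
# The "model" is a table of word => label counts built from examples.
class ToyModel
  attr_reader :labels_by_word

  def self.train(examples)
    table = Hash.new { |hash, word| hash[word] = Hash.new(0) }
    examples.each { |word, label| table[word][label] += 1 }
    new(table)
  end

  def initialize(labels_by_word)
    @labels_by_word = labels_by_word
  end
end

class ToyClassifier
  def initialize(model)
    @model = model
  end

  # Returns the label seen most often for the word, or :unknown.
  def classify(word)
    counts = @model.labels_by_word.fetch(word, nil)
    return :unknown if counts.nil? || counts.empty?

    counts.max_by { |_label, count| count }.first
  end
end

# 1. Create a model from training data.
model = ToyModel.train([["Ruby", :noun], ["is", :verb], ["awesome", :adjective]])
# 2. Initialize the classifier with the model.
classifier = ToyClassifier.new(model)
# 3. Perform the classification.
known = classifier.classify("Ruby")
unknown = classifier.classify("Python")
```

Words the toy model has never seen come back as `:unknown`; a trained statistical model would instead make a prediction from context features.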

Slide 23

Slide 23 text

The Gems OpenNLP
Minimal wrapper around the original OpenNLP implementation:
• Automatic conversion between Ruby and Java datatypes
• Unified interface
Separate gems for English and German model files.

Slide 24

Slide 24 text

NLP Basics Automating linguistic analyses ...

Slide 25

Slide 25 text

String → Sequence of Logical Units Segmentation The problem of segmentation is concerned with splitting a text into a sequence of logical units. Different instances of this problem exist.

Slide 26

Slide 26 text

Text → Sentences Sentence Detection Sentence detection is the process of segmenting a text into sentences. The problem is harder than it looks:
• Ruby is awesome. Ruby is great!
• “Stop it!”, Mr. Smith shouted across the yard. He was clearly angry.
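
To see why the second example is hard, here is a naive baseline in plain Ruby. It is a sketch for comparison only (`naive_sentences` is a made-up helper, not how OpenNLP works): it breaks after every `.`, `!` or `?` that is followed by whitespace.

```ruby
# Naive baseline: split after ., ! or ? when whitespace follows.
# Hypothetical helper for illustration; not part of any library.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

easy = naive_sentences("Ruby is awesome. Ruby is great!")
# The easy case works: two sentences.

hard = naive_sentences('"Stop it!", Mr. Smith shouted across the yard. He was clearly angry.')
# The abbreviation "Mr." is mistaken for a sentence end,
# so the text is cut into three pieces instead of two.
```

The rule handles the first example but splits the second one after "Mr.", which is the kind of ambiguity a trained sentence detector is meant to resolve.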

Slide 27

Slide 27 text

Text → Sentences: Sentence Detection

    m = OpenNLP::English.sentence_detection_model
    d = OpenNLP::SentenceDetector.new(m)
    r = d.process <<-TEXT
      Ruby is awesome. Ruby is great!
    TEXT
    r[0] # => "Ruby is awesome."
    r[1] # => "Ruby is great!"

Slide 28

Slide 28 text

Sentence → Words Tokenization Tokenization is the task of detecting word boundaries. Challenges:
• Languages with no visual representation of word boundaries
• Multiple separators
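
For contrast, a rule-based tokenizer fits in one line of plain Ruby. This is a rough sketch (`naive_tokens` is a made-up helper, unrelated to the OpenNLP tokenizer): runs of word characters become tokens, and every other non-space character becomes a token of its own.

```ruby
# Naive baseline: word-character runs, plus each punctuation mark on its own.
# Hypothetical helper for illustration; not part of any library.
def naive_tokens(sentence)
  sentence.scan(/\w+|[^\w\s]/)
end

simple = naive_tokens("I shot an elephant in my pajamas.")
# Works for plain sentences.

contraction = naive_tokens("Don't panic.")
# The apostrophe acts as a second separator, so "Don't" falls apart
# into "Don", "'" and "t".
```

The contraction shows how quickly fixed rules break down once more than one kind of separator is involved.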

Slide 29

Slide 29 text

String → Linguistic Units: Tokenization

    m = OpenNLP::English.tokenization_model
    t = OpenNLP::Tokenizer.new(m)
    r = t.process("I shot an elephant in my pajamas.")
    r # => ["I", "shot", "an", "elephant", "in", "my", "pajamas", "."]

Slide 30

Slide 30 text

Tokens → Tags Part-of-Speech Tagging Part-of-Speech tagging is concerned with identifying a word’s class in a given context. A common format for representing Part-of-Speech tags is the Penn Treebank tag set.
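
Penn Treebank tags are short codes. A small lookup table makes tagger output easier to read; the sketch below covers only a handful of tags, and `PENN_TAGS` and `describe` are names made up for this example.

```ruby
# A few Penn Treebank tags with their meaning (the full set is larger).
PENN_TAGS = {
  "NN"  => "noun, singular or mass",
  "NNS" => "noun, plural",
  "NNP" => "proper noun, singular",
  "VB"  => "verb, base form",
  "VBZ" => "verb, third person singular present",
  "JJ"  => "adjective",
  "DT"  => "determiner",
  "IN"  => "preposition or subordinating conjunction"
}.freeze

# Turns [token, tag] pairs into readable lines.
def describe(tagged_tokens)
  tagged_tokens.map do |token, tag|
    "#{token}: #{PENN_TAGS.fetch(tag, 'unknown tag')}"
  end
end

lines = describe([["Ruby", "NNP"], ["is", "VBZ"], ["awesome", "JJ"]])
```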

Slide 31

Slide 31 text

Tokens → Tags: Part-of-Speech Tagging

    m = OpenNLP::English.pos_tagging_model
    t = OpenNLP::POSTagger.new(m)
    r = t.process(%w[Ruby is awesome])
    r[0] # => NNP
    r[1] # => VBZ
    r[2] # => JJ

Slide 32

Slide 32 text

Inflected word → Word stem Stemming Stemming is the process of applying a set of rules to remove morphological suffixes. Porter’s stemmer is probably the most popular stemmer.
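
What "a set of rules" means can be shown with a toy suffix stripper. This is deliberately not Porter's algorithm, which has many more rules and conditions; `SUFFIX_RULES` and `toy_stem` are hypothetical names for this sketch.

```ruby
# Toy rule set: [pattern, replacement], tried in order. Not Porter's stemmer.
SUFFIX_RULES = [
  [/ies\z/, "y"],
  [/ing\z/, ""],
  [/ed\z/,  ""],
  [/s\z/,   ""]
].freeze

# Applies the first matching rule; very short words are left alone.
def toy_stem(word)
  return word if word.length <= 3

  SUFFIX_RULES.each do |pattern, replacement|
    return word.sub(pattern, replacement) if word.match?(pattern)
  end
  word
end

walking     = toy_stem("walking")      # "walk"
parties     = toy_stem("parties")      # "party"
programming = toy_stem("programming")  # "programm", the double "m" remains
```

The last result shows why real stemmers need more than plain suffix removal: rules such as undoing doubled consonants are what turn "programm" into "program".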

Slide 33

Slide 33 text

Inflected word → Word stem: Stemming

    # https://github.com/raypereda/stemmify
    require 'stemmify'
    "programming".stem # => "program"

Slide 34

Slide 34 text

Tokens → Names | Locations | … Named Entity Recognition Named entities are noun phrases that refer to individuals, organizations, locations, etc. Named Entity Recognition is concerned with identifying named entities in a given text.

Slide 35

Slide 35 text

Tokens → Names | Locations | …: Named Entity Recognition

    tokens = %w[This summer EuRuKo comes to Athens for two days on the 28th and 29th of June .]
    m = OpenNLP::Models.named_entity_recognition_model(:location)
    f = OpenNLP::NameFinder.new(m)
    ranges = f.process(tokens)
    ranges.map { |r| tokens[r] } # => ["Athens"]

Slide 36

Slide 36 text

Software Engineering Bringing it all together ...

Slide 37

Slide 37 text

Properties of NLP Tasks NLP tasks can often be expressed as a sequence of steps that is executed linearly. Hence, we can use processing pipelines to solve NLP problems.

Slide 38

Slide 38 text

Processing Pipelines A processing pipeline is a set of software components connected in series. The output of one component is the input of the next one.
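
The idea can be sketched in plain Ruby without any library. In this sketch each component is a lambda, and `run_pipeline` (a made-up helper) feeds the output of one component into the next. The three components are simplistic stand-ins, not the OpenNLP-based ones.

```ruby
# Each component is anything that responds to #call.
split_sentences = ->(text) { text.split(/(?<=[.!?])\s+/) }
tokenize        = ->(sentences) { sentences.map { |s| s.scan(/\w+|[^\w\s]/) } }
downcase        = ->(sentences) { sentences.map { |tokens| tokens.map(&:downcase) } }

# Runs the components in series: the output of one is the input of the next.
def run_pipeline(components, input)
  components.reduce(input) { |data, component| component.call(data) }
end

result = run_pipeline(
  [split_sentences, tokenize, downcase],
  "Ruby is awesome. Ruby is great!"
)
# result holds one array of lowercased tokens per sentence.
```

Because every component only agrees on the shape of its input and output, components can be reordered, replaced, or reused in other pipelines.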

Slide 39

Slide 39 text

t6d/composable_operations Composable Operations A flexible Ruby implementation of a processing pipeline:
• Operation represents a single processing component.
• ComposedOperation represents a processing pipeline, but can also be used as a component in another pipeline.

Slide 40

Slide 40 text

Pre-Processing Pipeline: Sentence Detection → Tokenization → POS Tagging → Stemming / Lemmatization → Clean Up → Advanced Tasks

Slide 41

Slide 41 text

Pre-Processing Pipeline: Definition

    require 'composable_operations'
    include ComposableOperations

    class PreProcessing < ComposedOperation
      use SentenceDetection
      use Tokenization
      use POSTagging
    end

Slide 42

Slide 42 text

Pre-Processing Pipeline: Sentence Detection Component

    require 'opennlp'
    require 'opennlp-english'
    require 'opennlp-german'
    require 'composable_operations'
    include ComposableOperations

    class SentenceDetection < Operation
      processes :text
      property :language, default: :en, converts: :to_sym, required: true, accepts: [:en, :de]

      def execute
        detector = OpenNLP::SentenceDetector.new(model)
        detector.process(text)
      end

      protected

      def model
        case language
        when :en
          OpenNLP::English.sentence_detection_model
        when :de
          OpenNLP::German.sentence_detection_model
        end
      end
    end

Slide 43

Slide 43 text

Pre-Processing Pipeline: Tokenization Component

    require 'opennlp'
    require 'opennlp-english'
    require 'opennlp-german'
    require 'composable_operations'
    include ComposableOperations

    class Tokenization < Operation
      processes :sentences
      property :language, default: :en, converts: :to_sym, required: true, accepts: [:en, :de]

      def execute
        tokenizer = OpenNLP::Tokenizer.new(model)
        Array(sentences).map do |sentence|
          tokenizer.process(sentence)
        end
      end

      protected

      def model
        # ...
      end
    end

Slide 44

Slide 44 text

Pre-Processing Pipeline: POS Tagging Component

    require 'opennlp'
    require 'opennlp-english'
    require 'opennlp-german'
    require 'composable_operations'
    include ComposableOperations

    class POSTagging < Operation
      processes :sentences
      property :language, default: :en, converts: :to_sym, required: true, accepts: [:en, :de]

      def execute
        tagger = OpenNLP::POSTagger.new(model)
        sentences.map.with_index do |sent, sent_idx|
          tags = tagger.process(sent)
          tags.map.with_index do |tag, tkn_idx|
            [sentences[sent_idx][tkn_idx], tag]
          end
        end
      end

      protected

      def model
        # ...
      end
    end

Slide 45

Slide 45 text

Pre-Processing Pipeline: Execution

    PreProcessing.perform("Ruby is awesome. Ruby is great!")

    # Returns:
    #
    # [
    #   [
    #     ["Ruby", "NNP"],
    #     ["is", "VBZ"],
    #     ["awesome", "JJ"],
    #     [".", "."]
    #   ],
    #   [
    #     ["Ruby", "NNP"],
    #     ["is", "VBZ"],
    #     ["great", "JJ"],
    #     ["!", "."]
    #   ]
    # ]

Slide 46

Slide 46 text

Keyword Extraction Let’s talk about the good stuff ...

Slide 47

Slide 47 text

TextRank TextRank is a graph-based algorithm heavily inspired by Google’s PageRank algorithm. The algorithm was published by Rada Mihalcea and Paul Tarau: http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf

Slide 48

Slide 48 text

Cooccurrence Linguistics ... again!

Slide 49

Slide 49 text

... Ruby is awesome ... Word window Cooccurrence
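
A word window can be turned into cooccurrence counts with a few lines of plain Ruby. This is a minimal sketch, not the code from the keyword_extractor repository; `cooccurrences` and its `window:` parameter are names chosen for this example. Two words cooccur when they appear within `window` tokens of each other.

```ruby
# Counts unordered word pairs that appear within `window` tokens of each other.
# Hypothetical helper for illustration.
def cooccurrences(tokens, window: 2)
  counts = Hash.new(0)
  tokens.each_with_index do |token, i|
    # Look ahead at the next (window - 1) tokens.
    tokens[(i + 1)...(i + window)].to_a.each do |other|
      next if other == token

      counts[[token, other].sort] += 1
    end
  end
  counts
end

pairs = cooccurrences(%w[ruby is awesome ruby is great], window: 2)
# With window: 2 only directly adjacent words are paired, so
# "ruby" and "is" are counted twice, every other pair once.
```

A larger window links words that are further apart, which gives a denser graph in the later steps.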

Slide 50

Slide 50 text

Keyword Extraction Pipeline
1. Preprocessing: Sentence Detection, Tokenization, POS Tagging, Normalization through Stemming, Token Filtering
2. Cooccurrence Calculation
3. Cooccurrence Graph Construction
4. TextRank Calculation
5. Sorting and Extracting Nodes
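
Steps 3 to 5 can be sketched in plain Ruby. The following is a simplified reading of the unweighted TextRank score, S(v) = (1 - d) + d * Σ S(u) / degree(u) over the neighbors u of v, run for a fixed number of iterations instead of checking convergence. It is not the code from the keyword_extractor repository; `text_rank` and its parameters are names chosen for this example.

```ruby
# edges: array of [word_a, word_b] pairs, treated as undirected and unweighted.
# Returns the words sorted from highest to lowest score.
def text_rank(edges, damping: 0.85, iterations: 30)
  # Step 3: build the cooccurrence graph as an adjacency list.
  neighbors = Hash.new { |hash, node| hash[node] = [] }
  edges.each do |a, b|
    neighbors[a] << b unless neighbors[a].include?(b)
    neighbors[b] << a unless neighbors[b].include?(a)
  end
  nodes = neighbors.keys

  # Step 4: iterate the score update, starting from 1.0 everywhere.
  scores = nodes.map { |node| [node, 1.0] }.to_h
  iterations.times do
    scores = nodes.map do |node|
      incoming = neighbors[node].sum { |other| scores[other] / neighbors[other].size }
      [node, (1 - damping) + damping * incoming]
    end.to_h
  end

  # Step 5: sort nodes by descending score.
  scores.sort_by { |_node, score| -score }.map(&:first)
end

ranking = text_rank([%w[ruby is], %w[ruby awesome], %w[ruby great]])
# "ruby" is connected to every other word, so it ranks first.
```

The damping factor of 0.85 is the value commonly used for PageRank; the iteration count of 30 is an arbitrary choice for this sketch.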

Slide 51

Slide 51 text

Keyword Extraction Pipeline

    class KeywordRanking < ComposedOperation
      use PreProcessingPipeline, filter: [/^NN/, /^JJ/]
      use CooccurrenceCalculation
      use CooccurrenceGraphConstruction
      use PageRankCalculation
      use NodeSortingAndExtraction
    end

    KeywordRanking.perform(...)

Slide 52

Slide 52 text

Code The code can be found on GitHub: https://github.com/t6d/keyword_extractor Be nice, it’s just some demo code – not for use in production. ;)

Slide 53

Slide 53 text

Summary

Natural Language Processing with JRuby and OpenNLP by Konstantin Tennhard
GitHub: t6d
Twitter: t6d

Code can be found on GitHub:
* http://github.com/t6d/opennlp
* http://github.com/t6d/opennlp-english
* http://github.com/t6d/opennlp-german
* http://github.com/t6d/opennlp-examples
* http://github.com/t6d/keyword_extractor
* http://github.com/t6d/composable_operations
* http://github.com/t6d/smart_properties

Any questions? Feel free to approach me anytime throughout the conference or send me a tweet, if that’s what you prefer.

Slide 54

Slide 54 text

Summary THANKS