Taming Text
Dev.Talk
September 2015
Marcel Körtgen
Popular Systems
• IBM Watson
• won “Jeopardy”
• Siri (2010), Google Now (2012), Cortana (2014)
• multiple languages
• “Eugene Goostman” passing the Turing Test
• 5 min. text based conversation
• convinced 33% of judges that it was human
Why roll your own search on Open Source?
• Flexibility
• Cost of development
• Who knows your content better than you?
• Price → Can afford to scale
Search & Relevance
Searching SheetMusicPlus.com for “piano”: which docs to return?
•Relevance Scoring
• Boolean model, TF/IDF, Vector space model
→ Lucene’s practical scoring function
•Plus “secret sauce”
• constantly adjusted term boost + custom rules
• Who is your user?
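As a rough illustration of the TF/IDF idea mentioned above, here is a minimal sketch in Python. It is not Lucene's practical scoring function, which also applies normalization and boosts; the function names, the square-root term frequency, and the sample documents are assumptions made for this example only.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document against the query with a simplified TF/IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Rare terms get a higher weight than common ones.
                idf = math.log(n_docs / df[term]) + 1.0
                # Dampen raw term frequency with a square root.
                score += math.sqrt(tf[term]) * idf
        scores.append(score)
    return scores


# Hypothetical mini-catalogue, loosely modeled on the "piano" query.
docs = [
    "piano sonatas for solo piano",
    "guitar chords for beginners",
    "easy piano duets",
]
scores = tf_idf_scores("piano", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print([docs[i] for i in ranked])
```

The "secret sauce" would sit on top of such a base score, for example by multiplying in per-term boosts or applying custom rules before sorting.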
NLP basics
Typical operations applied at the token level
•Case alterations
•Stopword removal
•Expansion
•Part-of-speech tagging
• Penn Treebank Project
•Stemming
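Several of the token-level operations listed above can be chained into a small analysis pipeline. The sketch below is only illustrative: the stopword set, the synonym table, and the suffix-stripping "stemmer" are hypothetical stand-ins, not the Porter stemmer or a Lucene analyzer, and part-of-speech tagging is left out because it needs a trained model.

```python
# Assumed toy resources; real systems use much larger, curated lists.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "to"}
SYNONYMS = {"tv": ["television"]}
SUFFIXES = ("ing", "es", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    # 1. Case alteration: lowercase everything.
    tokens = text.lower().split()
    # 2. Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Expansion: add synonyms next to the original token.
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    # 4. Stemming: reduce tokens to a crude root form.
    return [naive_stem(t) for t in expanded]


print(analyze("Watching the TV in Boxes"))
```

Applying the same pipeline at index time and at query time is what lets "Boxes" in a query match "box" in a document.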
NLP basics
•Extracting text
• “Garbage in, garbage out”
• Apache Tika, OCR
•Much of this is covered by Lucene, Solr, and ELK
Further ingredients
•“Fuzzy” string matching, N-grams
• Identifying people, places & things (OpenNLP)
• Clustering text (Mahout)
• Classification, categorization & tagging (Mahout)
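To make the "fuzzy" string matching and N-gram bullet more concrete, here is a small sketch of two common measures: Levenshtein edit distance and Jaccard overlap of character trigrams. The function names and the default n=3 are choices made for this example, not taken from any particular library.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def char_ngrams(text, n=3):
    """Set of all contiguous character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=3):
    """Share of n-grams the two strings have in common (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


print(levenshtein("kitten", "sitting"))
print(jaccard("piano", "pianos"))
```

Edit distance suits short strings such as misspelled query terms, while n-gram overlap is cheaper to precompute and index across a large vocabulary.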
Untamed Text. What’s next?
• Most critical problems today are global
• Need better communication across multiple languages
• Not only search, but understand & translate
•Manual translation highly accurate but doesn’t scale
• Use computers to supplement
• e.g. statistical models on “aligned text”
• Example: UN reports published in 23+ languages
Untamed Text. What’s next?
•Detecting emotions in content
• Higher order NLP
• semantics, discourse, pragmatics
• e.g. how to translate phrases
Summary
• Most search engines fail at understanding the user
• cannot tweak proprietary “black boxes”
•Succeed by constantly
• incorporating user feedback and
• tweaking search results
•Open Source allows you to do that
• Ultimate “white box”
Summary
•The book “Taming Text” is a great introduction
• targeted towards Java
• gets you going quickly
•Last chapter contains fact-based Q&A kickstarter
• think “light-weight” IBM Watson
• obvious where to optimize from there
•Invitation to contribute on Open Source!
Some other references
• SE Radio #214: Grant Ingersoll on Taming Text
• Theory behind Relevance Scoring (ELK docs)
• Map-Reduce for Machine Learning on Multicore
• https://github.com/tamingtext/book