Slide 1

Slide 1 text

Taming Text
Dev.Talk September 2015
Marcel Körtgen

Slide 2

Slide 2 text

Popular Systems
• IBM Watson
  • won “Jeopardy!”
• Siri (2010), Google Now (2012), Cortana (2014)
  • multiple languages
• “Eugene Goostman” passing the Turing Test
  • 5-minute text-based conversation
  • convinced 33% of judges it was human

Slide 3

Slide 3 text

Why roll your own search - on Open Source?
• Flexibility
• Cost of development
• Who knows your content better than you?
• Price → can afford to scale

Slide 4

Slide 4 text

Search & Relevance
SheetMusicPlus.com, query “piano”: which docs to return?
• Relevance scoring
  • Boolean model, TF/IDF, vector space model → Lucene’s practical scoring function
• Plus “secret sauce”
  • constantly adjusted term boosts + custom rules
  • Who is your user?
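A minimal sketch of the scoring idea above, under stated assumptions: the toy corpus and the function names are made up for illustration, and the tf and idf formulas only loosely follow Lucene's classic TF/IDF similarity (square-root term frequency, logarithmic inverse document frequency). Lucene's practical scoring function also applies field norms, boosts, and other factors that are left out here.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the titles are invented for illustration.
DOCS = {
    "d1": "easy piano pieces for beginners",
    "d2": "piano concerto for piano and orchestra",
    "d3": "guitar chord songbook",
}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term] == 0:
                continue  # Boolean step: a missing term contributes nothing
            tf = math.sqrt(counts[term])                    # dampened term frequency
            idf = 1.0 + math.log(n_docs / (df[term] + 1))   # rarer terms weigh more
            score += tf * idf
        scores[doc_id] = score
    return scores


def rank(query, docs):
    """Return ids of matching documents, best score first."""
    scores = tf_idf_scores(query, docs)
    hits = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda doc_id: scores[doc_id], reverse=True)
```

Calling `rank("piano", DOCS)` drops the guitar songbook and ranks the document that mentions “piano” twice first. The “secret sauce” would sit on top of this as per-term boosts and custom rules.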

Slide 5

Slide 5 text

NLP basics
Typical operations applied at the token level
• Case alterations
• Stopword removal
• Expansion
• Part-of-speech tagging
  • Penn Treebank Project
• Stemming
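A rough sketch of a few of these token-level operations chained into one analysis pipeline. Everything here is an illustrative assumption: the stopword list and synonym table are tiny and made up, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (for example the Porter stemmer shipped with Lucene). Part-of-speech tagging is omitted because it needs a trained model.

```python
import re

# Tiny illustrative stopword list; real analyzers ship much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "is"}

# Hypothetical expansion table mapping a term to extra terms to index.
SYNONYMS = {"piano": ["keyboard"]}


def tokenize(text):
    """Split text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def naive_stem(token):
    """Strip a few common English suffixes (not a real stemming algorithm)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Case folding -> stopword removal -> stemming -> expansion."""
    tokens = [t.lower() for t in tokenize(text)]          # case alteration
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    stems = [naive_stem(t) for t in tokens]               # stemming
    expanded = []
    for stem in stems:                                    # expansion
        expanded.append(stem)
        expanded.extend(SYNONYMS.get(stem, []))
    return expanded
```

For example, `analyze("Playing the Pianos")` yields `["play", "piano", "keyboard"]`. Expansion runs after stemming in this sketch so that the synonym table only needs entries for stems; other orderings are equally valid.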

Slide 6

Slide 6 text

NLP basics
• Extracting text
  • “Garbage in, garbage out”
  • Apache Tika, OCR
• Much of this is covered by Lucene, Solr, ELK

Slide 7

Slide 7 text

Further ingredients
• “Fuzzy” string matching, N-grams
• Identifying people, places & things (OpenNLP)
• Clustering text (Mahout)
• Classification, categorization & tagging (Mahout)
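One way to picture “fuzzy” matching with N-grams is the sketch below: it compares strings by the overlap (Jaccard coefficient) of their character trigrams. This is only one of several possible techniques, edit distance being another common one. The function names and the candidate list are assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces
    so that word beginnings and endings form their own n-grams."""
    pad = " " * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the n-gram sets: |A & B| / |A | B|."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 1.0  # two empty inputs count as identical
    return len(grams_a & grams_b) / len(union)


def best_match(query, candidates, n=3):
    """Pick the candidate whose n-grams overlap most with the query."""
    return max(candidates, key=lambda c: jaccard(query, c, n))
```

With this, a misspelled query such as `best_match("paino", ["piano", "guitar", "violin"])` still resolves to `"piano"`, because the two strings share their leading and trailing trigrams.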

Slide 8

Slide 8 text

Open Source Projects
• Lucene → Solr, Elasticsearch
• OpenNLP → POS, entities
• Mahout → clustering, classification, scale, ...
• Hadoop
• Spark

Slide 9

Slide 9 text

Untamed Text. What’s next?
• Most critical problems today are global
  • Need better communication across multiple languages
  • Not only search, but understand & translate
• Manual translation is highly accurate but doesn’t scale
  • Use computers to supplement
  • e.g. statistical models on “aligned text”
  • Example: UN reports published in 23+ languages

Slide 10

Slide 10 text

Untamed Text. What’s next?
• Detecting emotions in content
• Higher-order NLP
  • semantics, discourse, pragmatics
  • e.g. how to translate phrases

Slide 11

Slide 11 text

Summary
• Most search engines fail at understanding the user
  • cannot tweak proprietary “black boxes”
• Succeed by constantly
  • incorporating user feedback and
  • tweaking search results
• Open Source allows you to do that
  • Ultimate “white box”

Slide 12

Slide 12 text

Summary
• Book “Taming Text” is a great introduction
  • targeted towards Java
  • gets you going real “quick”
• Last chapter contains a fact-based Q&A kickstarter
  • think “light-weight” IBM Watson
  • obvious where to optimize from there
• Invitation to contribute to Open Source!

Slide 13

Slide 13 text

Some other references
• SE Radio #214: Grant Ingersoll on Taming Text
• Theory behind Relevance Scoring (ELK docs)
• Map-Reduce for Machine Learning on Multicore
• https://github.com/tamingtext/book

Slide 14

Slide 14 text

Thank You
Time for Questions!