On the book "Taming Text"

Slides supporting an introductory talk on the book "Taming Text" by Grant Ingersoll et al.

Awesome Incremented

September 04, 2015

Transcript

  1. Popular Systems • IBM Watson • won “Jeopardy!” • Siri (2010), Google Now (2012), Cortana (2014) • multiple languages • “Eugene Goostman” passing Turing Test • 5 min. text-based conversation • fooled 33% of judges into thinking it was human
  2. Why roll your own search - on Open Source? • Flexibility • Cost of development • Who knows your content better than you? • Price → Can afford to scale
  3. Search & Relevance • SheetMusicPlus.com “piano”. What docs to return? • Relevance Scoring • Boolean model, TF/IDF, Vector space model → Lucene’s practical scoring function • Plus “secret sauce” • constantly adjusted term boost + custom rules • Who is your user?
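
To make the TF/IDF and vector space model bullets on slide 3 concrete, here is a minimal Python sketch that ranks a few documents against the query “piano”. The sample documents, the smoothed IDF variant, and all function names are assumptions made for illustration. It is not Lucene's practical scoring function, which adds boosts and normalization factors on top of this idea.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for this illustration.
DOCS = [
    "piano sonata sheet music for solo piano",
    "guitar chords and tabs for beginners",
    "easy piano and guitar duets",
]

def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()

def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (tf[t] / len(tokens)) * idf[t] for t in tf})
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    q_tf = Counter(tokenize(query))
    # Query terms unknown to the corpus are ignored.
    q_vec = {t: q_tf[t] * idf[t] for t in q_tf if t in idf}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)

if __name__ == "__main__":
    for score, idx in rank("piano", DOCS):
        print(f"{score:.3f}  {DOCS[idx]}")
```

With this toy corpus the document that mentions “piano” twice ranks first and the guitar-only document scores zero. The “secret sauce” from the slide (term boosts, custom rules) would be layered on top of such a base score.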
  4. NLP basics • Typical operations applied on the token level • Case alterations • Stopword removal • Expansion • Part-of-speech tagging • Penn Treebank Project • Stemming
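
As a rough illustration of three of the token-level operations on slide 4 (case alteration, stopword removal, stemming), the following Python sketch chains them into a small analysis pipeline. The stopword list, the suffix list, and the function names are assumptions made up for this example. The stemmer is a naive suffix stripper, not the Porter algorithm or any stemmer shipped with Lucene or OpenNLP.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "is", "are"}

# Suffixes tried in order by the toy stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Split text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, drop stopwords, then stem each remaining token."""
    tokens = [t.lower() for t in tokenize(text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]

if __name__ == "__main__":
    print(analyze("The pianists are playing sonatas for the audience"))
```

The same pipeline has to be applied to both documents and queries, otherwise a query for “playing” would never meet the indexed stem “play”.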
  5. NLP basics • Extracting text • “Garbage in, garbage out” • Apache Tika, OCR • Much covered in Lucene, Solr, ELK
  6. Further ingredients • “Fuzzy” string matching, N-grams • Identifying people, places & things (OpenNLP) • Clustering text (Mahout) • Classification, categorization & tagging (Mahout)
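
One simple way to combine the “fuzzy” string matching and N-gram bullets on slide 6 is to compare strings by the overlap of their character trigrams. The Python sketch below does this with Jaccard similarity. The function names and the composer list are hypothetical, and this is only one of several possible fuzzy-matching techniques (edit distance is another), not necessarily the approach taken in the book.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces."""
    pad = " " * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def best_match(query, candidates, n=3):
    """Return (score, candidate) for the candidate most similar to query."""
    q = char_ngrams(query, n)
    scored = [(jaccard(q, char_ngrams(c, n)), c) for c in candidates]
    return max(scored)

if __name__ == "__main__":
    # Hypothetical candidate list; the query contains a typo on purpose.
    composers = ["Beethoven", "Bach", "Brahms"]
    score, name = best_match("Bethoven", composers)
    print(f"{name} ({score:.2f})")
```

Padding the string means that word beginnings and endings contribute their own n-grams, so a typo in the middle of a word costs less than one at the start.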
  7. Open Source Projects • Lucene → Solr, ElasticSearch • OpenNLP → POS, Entities • Mahout → Clustering, Classification, Scale, ... • Hadoop • Spark
  8. Untamed Text. What’s next? • Most critical problems today are global • Need better communication across multiple languages • Not only search, but understand & translate • Manual translation highly accurate but doesn’t scale • Use computers to supplement • e.g. statistical models on “aligned text” • Example: UN reports published in 23+ languages
  9. Untamed Text. What’s next? • Detecting emotions in content • Higher-order NLP • semantics, discourse, pragmatics • e.g. how to translate phrases
  10. Summary • Most search engines fail at understanding the user • cannot tweak proprietary “black boxes” • Succeed by constantly • incorporating user feedback and • tweaking search results • Open Source allows you to do that • Ultimate “white box”
  11. Summary • Book “Taming Text” a great introduction • targeted towards Java • gets you going real “quick” • Last chapter contains fact-based Q&A kickstarter • think “light-weight” IBM Watson • obvious where to optimize from there • Invitation to contribute to Open Source!
  12. Some other references • SE Radio #214: Grant Ingersoll on Taming Text • Theory behind Relevance Scoring (ELK docs) • Map-Reduce for Machine Learning on Multicore • https://github.com/tamingtext/book