Text Mining 101

Most existing data is unstructured text, for example social media posts, scientific publications, and news. Classical data mining techniques can only use this data once it has been structured. Text mining is the discipline that structures text and uses the resulting insights to synthesize new knowledge and make predictions. This talk introduces the problem text mining tackles, the standard approaches to it, the ways machine learning algorithms can be applied within text mining systems, and possible practical applications.

MunichDataGeeks

March 24, 2015

Transcript

  1. Alzheimer's: 107,050 articles; 8,467 words per article on average; 250 words per minute (average reading speed); 8 hours per day; 249 workdays per year
  2. Alzheimer's: 107,050 articles; 8,467 words per article on average; 250 words per minute (average reading speed); 8 hours per day; 249 workdays per year. Approx. 30.3 years of just reading
  3. Alzheimer's: 777.8 articles per month (current publication rate); 283,069.9 new articles in 30.3 years. Approx. 80.2 more years of just reading
  4. Untangling Text Data Mining (Marti Hearst, 1999): “a process […] that leads to the discovery of heretofore unknown information, or to answers for questions for which the answer is not currently known.” “Mining implies extracting precious nuggets of ore from otherwise worthless rock.”
  5. Information Extraction Pipeline: Sentence Splitting, Tokenization, POS Tagging, Sentence Analysis, Event Extraction, NER. Example: “endothelin-1-mediated vasoconstriction”; “endothelin-1 -mediated vasoconstriction”
  6. Information Extraction Pipeline: Sentence Splitting, Tokenization, POS Tagging, Sentence Analysis, Event Extraction, NER. POS tagging: labeling each token in the sentence with its part of speech, e.g. VBG DT NN IN DT NN IN PRP$ NN
  7. Information Extraction Pipeline: Sentence Splitting, Tokenization, POS Tagging, Sentence Analysis, Event Extraction, NER. Machine Learning (repeated six times on the slide)
  8. Deep Multitask Learning: Shared Representation; POS Tagger; Sentence Analyzer; Named Entity Recognizer
  9. Bonus material: Since some people asked, here is the paper on unsupervised feature learning on YouTube videos: http://static.googleusercontent.com/media/research.google.com/de//archive/unsupervised_icml2012.pdf
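
The reading-time figures on slides 1 to 3 can be reproduced with a few lines of arithmetic. The sketch below is not from the talk; the constant and function names are made up for illustration, and it assumes the slide figures use German number formatting (107.050 meaning 107,050 articles, 8.467 meaning 8,467 words).

```python
# Figures from slides 1 to 3, converted from German number formatting.
ARTICLES = 107_050
WORDS_PER_ARTICLE = 8_467
WORDS_PER_MINUTE = 250
HOURS_PER_DAY = 8
WORKDAYS_PER_YEAR = 249
NEW_ARTICLES_PER_MONTH = 777.8


def reading_years(articles: float) -> float:
    """Working years needed to read `articles` at the assumed pace."""
    minutes = articles * WORDS_PER_ARTICLE / WORDS_PER_MINUTE
    workdays = minutes / 60 / HOURS_PER_DAY
    return workdays / WORKDAYS_PER_YEAR


# Time to read everything that is already published.
backlog_years = reading_years(ARTICLES)
# Articles that appear while the backlog is being read.
new_articles = NEW_ARTICLES_PER_MONTH * 12 * backlog_years
# Time to read those newly published articles.
catch_up_years = reading_years(new_articles)

print(f"{backlog_years:.1f} years for the existing articles")
print(f"{new_articles:,.0f} new articles published in the meantime")
print(f"{catch_up_years:.1f} more years to read those")
```

With these inputs the two durations round to 30.3 and 80.2 years, as on the slides. The intermediate article count comes out near 283,100 rather than exactly 283,069.9, presumably because the slide inputs are themselves rounded.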
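
The first two pipeline stages on slide 5 (sentence splitting and tokenization) can be illustrated with a toy example. This is a minimal rule-based sketch, not the approach from the talk, which points to machine learning for every stage. The suffix list and function names are hypothetical; the only thing taken from the slide is the idea that "endothelin-1-mediated" should be split so that the entity name "endothelin-1" becomes its own token.

```python
import re

# Hypothetical list of hyphenated modifiers that often follow an entity name
# in biomedical text. Not taken from the talk.
ENTITY_SUFFIXES = ("-mediated", "-induced", "-dependent")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Whitespace tokenization that drops trailing punctuation and
    detaches known suffixes from the word they are attached to."""
    tokens = []
    for raw in sentence.split():
        word = raw.rstrip(".,;:!?")
        if not word:
            continue
        for suffix in ENTITY_SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                tokens += [word[: -len(suffix)], suffix]
                break
        else:
            tokens.append(word)
    return tokens


# Made-up two-sentence input built around the phrase from slide 5.
text = "Endothelin-1-mediated vasoconstriction was observed. It was blocked."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Fixed rules like these break quickly on real text (abbreviations, decimal numbers, unseen suffixes), which is one reason to learn each stage from annotated data instead.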