
Kun Lu

This talk is about applying text mining to academic publications to extract information such as knowledge graphs of related concepts. The overall goal is to help researchers explore large collections of text documents, in this case academic publications, faster and more effectively. Given the text of thousands or millions of publications, I will show 1) how to use natural language processing techniques to extract concepts, 2) how to use statistics and Bayesian theory to identify related concepts, and 3) how to use a long short-term memory (LSTM) mechanism to learn related concepts in a sequential manner. In the end, we will be able to obtain a "knowledge graph" of the concepts through this text-mining process. Part of the results can be found at: www.neuronbit.io. The results of this work can also be useful for semantic search, document classification, information retrieval, etc.

MunichDataGeeks

September 08, 2016

Transcript

  1. [Opening figure: a cloud of concepts from the vision/ML literature (non-negative matrix factorization, sparse coding, independent component analysis (ICA), elastic bunch graph matching, sparseness constraints, face recognition, predictive coding, Fisher linear discriminant model, visual cortex, rodent brain, Gabor feature based classification, …), with a "# papers" axis]
  2. [Figure: the same concepts, rearranged]
  3. [Figure: the same concepts plotted against publication year]
  4. Overview. Natural Language Processing: Part-of-Speech (POS) tagging, concepts/term identification. Statistical Modelling: Bayesian inference, document frequency, contextual doc. frequency (title-abstract, neighbourhood).
  5. Part-of-speech Tagging. We need to find the "Noun Phrases": …
    "One approach to understanding such response properties of visual neurons has been to consider their relationship to the statistical structure of natural images in terms of efficient coding."
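A minimal sketch of this step in Python, using NLTK (the toolkit discussed on the next slide); the noun-phrase grammar is my own illustrative choice, not necessarily the one used in the talk:

    import nltk

    # One-time model downloads (no-ops if already installed).
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = ("One approach to understanding such response properties of "
                "visual neurons has been to consider their relationship to "
                "the statistical structure of natural images in terms of "
                "efficient coding.")

    # Tokenize, then tag with Penn Treebank POS tags.
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Illustrative noun-phrase grammar: optional adjectives, then nouns.
    chunker = nltk.RegexpParser("NP: {<JJ.*>*<NN.*>+}")
    tree = chunker.parse(tagged)

    noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                    for subtree in tree.subtrees(lambda t: t.label() == "NP")]
    print(noun_phrases)  # e.g. ['response properties', 'visual neurons', ...]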
  6. Part-of-speech Tagging. Different approaches:
    • Simply delimited by stop-words (noise words like “a”, “of”, etc.)
    • Python NLTK (NLP toolkit)
      ◦ Not always correct: 'mechanism linking ...', 'alter gene expression' and 'whereas' all tagged as nouns
    • Train a prediction model
      ◦ Define features and calculate the statistics
    • Look-up table based (using an online dictionary)
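A sketch of the first (and simplest) approach above: treat stop-words as delimiters and keep whatever falls between them as candidate phrases. The stop-word list here is a tiny illustrative subset:

    import re

    # Tiny illustrative stop-word list; a real one would be much longer.
    STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for",
                  "and", "or", "is", "has", "have", "been", "such"}

    def candidate_phrases(text):
        """Split on stop-words; the runs in between are candidate phrases."""
        tokens = re.findall(r"[a-zA-Z-]+", text.lower())
        phrases, current = [], []
        for token in tokens:
            if token in STOP_WORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(token)
        if current:
            phrases.append(" ".join(current))
        return phrases

    print(candidate_phrases("the statistical structure of natural images "
                            "in terms of efficient coding"))
    # ['statistical structure', 'natural images', 'terms', 'efficient coding']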
  7. This paper provides an introduction to mixed-effects models for the analysis of repeated measurement data with subjects and items as crossed random effects. A worked-out example of how to use recent software for mixed-effects modeling is provided. Simulation studies illustrate the advantages offered by mixed-effects analyses compared to traditional analyses based on quasi-F tests, by-subjects analyses, combined by-subjects and by-items analyses, and random regression. Applications and possibilities across a range of domains of inquiry are discussed.
  8. Elementary Features: “the -”, “- is”, “- that”, end-of-sentence, “-tion” => Learned Combinatorial Features => Selection criterion: keep a feature if it increases the certainty. Example: I(f3) = I({f1,f2}) - I({f1,f2,f3}); take “f3” if I(f3) > 0, i.e. learning is based on “Positive Information” or “decreased Entropy”.
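A sketch of this criterion, reading I({...}) as the empirical conditional entropy of the label given a feature set; the toy data and feature names below are invented for illustration:

    import math
    from collections import Counter, defaultdict

    def cond_entropy(samples, feature_set):
        """H(label | feature values), estimated from (features, label) pairs."""
        groups = defaultdict(list)
        for features, label in samples:
            groups[tuple(features[f] for f in feature_set)].append(label)
        n = len(samples)
        h = 0.0
        for labels in groups.values():
            counts = Counter(labels)
            h -= (len(labels) / n) * sum(
                (c / len(labels)) * math.log2(c / len(labels))
                for c in counts.values())
        return h

    def positive_information(samples, base, new_feature):
        """I(f_new) = I(base) - I(base + [f_new]); keep f_new if positive."""
        return (cond_entropy(samples, base)
                - cond_entropy(samples, base + [new_feature]))

    # Toy examples: features of a word -> its POS label.
    samples = [
        ({"ends_tion": True,  "after_the": True},  "noun"),
        ({"ends_tion": True,  "after_the": False}, "noun"),
        ({"ends_tion": False, "after_the": True},  "noun"),
        ({"ends_tion": False, "after_the": False}, "verb"),
        ({"ends_tion": False, "after_the": False}, "verb"),
    ]
    print(positive_information(samples, ["ends_tion"], "after_the") > 0)  # True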
  9. Part-of-speech Tagging. POS look-up table: crawl an online dictionary. Example: “drive” - [v, n], “driven” - [past participle], “generate” - [v]
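As a data structure, the crawled table can simply be a Python dict; the entries below are the ones from this slide, the names are mine:

    # Word -> possible parts of speech, as crawled from an online dictionary.
    POS_LOOKUP = {
        "drive":    ["v", "n"],
        "driven":   ["past participle"],
        "generate": ["v"],
    }

    def lookup(word):
        """Return the known POS tags, or an empty list for unknown words."""
        return POS_LOOKUP.get(word.lower(), [])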
  10. Part-of-speech Tagging
    • Summary
      ◦ Do not aim at 100% accuracy now (high recall, low precision)
      ◦ Statistics will help later (high recall, high precision)
  11. Term Identification: (Adv* Adj* Noun+)+ Use the POS look-up table for pruning, example:
    • If the last word W has the ending “-ed” and Lookup(W) = {past participle}, then drop it
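A sketch of the pattern plus the pruning rule, encoding each Penn Treebank tag as one letter so that (Adv* Adj* Noun+)+ becomes an ordinary regular expression; lookup() is the toy table from the look-up-table slide, and "drop it" is read here as dropping the word:

    import re

    TAG_CLASS = {"RB": "A", "RBR": "A", "RBS": "A",               # adverbs
                 "JJ": "J", "JJR": "J", "JJS": "J",               # adjectives
                 "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N"}  # nouns

    def candidate_terms(tagged):
        """tagged: list of (word, POS tag) pairs for one sentence."""
        # One letter per token, so match positions map back to token indices.
        encoded = "".join(TAG_CLASS.get(tag, "x") for _, tag in tagged)
        terms = []
        for match in re.finditer(r"(?:A*J*N+)+", encoded):
            words = [w for w, _ in tagged[match.start():match.end()]]
            # Pruning rule: a trailing "-ed" word that the look-up table
            # knows only as a past participle was likely mistagged; drop it.
            if words[-1].endswith("ed") and set(lookup(words[-1])) == {"past participle"}:
                words = words[:-1]
            if words:
                terms.append(" ".join(words))
        return terms

    tagged = [("Gabor", "NNP"), ("feature", "NN"), ("based", "JJ"),
              ("classification", "NN"), ("is", "VBZ"), ("driven", "NN")]
    print(candidate_terms(tagged))  # ['Gabor feature based classification']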
  12. Overview (revisited). Natural Language Processing: Part-of-Speech (POS) tagging, concepts/term identification. Statistical Modelling: Bayesian inference, document frequency, contextual doc. frequency (title-abstract, neighbourhood).
  13. Co-occurrence statistics. Example: "Along these lines, a number of studies have attempted to train unsupervised learning algorithms on natural images in the hope of developing receptive fields with similar properties, but none has succeeded in producing a full set that spans the image space and contains all three of the above properties." Data structure: a dictionary in Python (sketched below).
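A sketch of that dictionary: for every pair of extracted terms, count the number of documents in which both occur. The term lists below are invented examples:

    from collections import defaultdict
    from itertools import combinations

    cooccur = defaultdict(int)   # (term_a, term_b) -> number of shared docs
    doc_freq = defaultdict(int)  # term -> number of docs containing it

    def add_document(terms):
        """Update the statistics with the terms extracted from one document."""
        unique = sorted(set(terms))          # sort so each pair has one key
        for term in unique:
            doc_freq[term] += 1
        for a, b in combinations(unique, 2):
            cooccur[(a, b)] += 1

    add_document(["sparse coding", "natural images", "receptive fields"])
    add_document(["sparse coding", "receptive fields", "visual cortex"])
    print(cooccur[("receptive fields", "sparse coding")])  # 2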
  14. Contextual co-occurrence. Not just bag-of-words or bag-of-terms:
    • Terms in TITLE ⇒ Terms in ABSTRACT
    • Neighborhood: within left/right X words/sentences
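A sketch of the title ⇒ abstract variant: a pair is counted only when one term comes from a paper's title and the other from its abstract, keeping the direction (the neighborhood variant would instead restrict pairs to terms at most X words or sentences apart):

    from collections import defaultdict

    title_abstract = defaultdict(int)  # (title term, abstract term) -> count

    def add_paper(title_terms, abstract_terms):
        """Count directed title -> abstract term pairs for one paper."""
        for t in set(title_terms):
            for a in set(abstract_terms):
                if a != t:
                    title_abstract[(t, a)] += 1

    add_paper(["sparse coding"],
              ["natural images", "receptive fields", "efficient coding"])
    print(title_abstract[("sparse coding", "natural images")])  # 1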
  15. What you need to know. What you might need to know. What you don't know. Thank you!
  16. Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
  17. Example output: terms related to "red", ranked by relatedness score:
    [('red', 1.0), ('algal', 0.287), ('yellow', 0.211), ('blue', 0.190), ('panda', 0.188), ('orange', 0.176), ('dwarfs', 0.168), ('pink', 0.168), ('pigments', 0.167), ('algae', 0.156), ('green', 0.143), ('hemoglobin', 0.109), ('colors', 0.108), ('colours', 0.104), ('chloroplasts', 0.102), ('fox', 0.100), ('chloroplast', 0.096), ('wines', 0.091), ('white', 0.087), ('colour', 0.082), ('deer', 0.082), ('grape', 0.081), ('black', 0.072), ('wore', 0.071), ('brown', 0.069), ('color', 0.069), ('flag', 0.069), ('oak', 0.068), ('gray', 0.068), ('blood', 0.067), ('wine', 0.054), ('flowers', 0.046), ('cameras', 0.046), ('light', 0.037), ('star', 0.035), ('cells', 0.031), ('gold', 0.028), ('skin', 0.0218), ('silver', 0.0197), ('arms', 0.0157), ('cell', 0.0151), ('symbol', 0.014), ('variety', 0.0125), ('turn', 0.012), ('derived', 0.0109), ('fish', 0.011), ('top', 0.0106), ('species', 0.0097), ('seen', 0.0087), ('produce', 0.0084), ('plants', 0.00819), ('volume', 0.00812), ('wide', 0.00802), ('small', 0.007995), ('line', 0.006218), ('called', 0.00572), ('typically', 0.0005), ('described', 0.00052), ('short', 0.00051), ('low', 0.00047), ('region', 0.000460), ('right', 0.00036), ('common', 0.00033), ('large', 0.00032), ('high', 0.000322), ('found', 0.000315), ('word', 0.000307), ('study', 0.000306), ('type', 0.000302), ('made', 0.00028), ('name', 0.000271), ('example', 0.000252), ('known', 0.000245), ('years', 0.000227), ('different', 0.000212), ('part', 0.000208), ('use', 0.000179), ('people', 0.000153), ('used', 0.000141), ('one', 0.000126)]