
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks

Experimental work on the use of topic models to implement and improve some common tasks in
Information Retrieval and Word Sense Disambiguation.

Leonardo Di Donato

January 15, 2013

Transcript

  1. Topic Modeling for Information Retrieval
    and Word Sense Disambiguation tasks
    Università degli Studi di Milano - Bicocca
    Di Donato Leonardo
    Text Mining Course - Prof. Fabio Stella


  2. Introduction
    a super-abundant amount of unstructured digital information
    it continues to grow at an astonishing rate (it doubles every two years)
    humans cannot manage it: information overload
    problems: crawling, representing, storing, summarizing, clustering, searching ...
    (general rule: every problem is an opportunity)
    opportunity: automatically extract value from chaos
    what value? how do we do it?


  3. Goals
    the value that we want to extract is: clusters of semantically related
    documents
    our purpose is [1] the unsupervised clustering of a text dataset
    [2] the implementation of information retrieval procedures that exploit the
    representation of documents at the topic level
    [3] the modeling of the ability to computationally identify the meaning of
    words in context (word sense disambiguation)
    Dataset
    our document collection: a partition of the Associated Press dataset
    ~ 2,300 English news articles (dating back to the '90s)
    a characteristic of any text document: it is often messy, with flaws and noise
    we need to clean the data
    we need a structured representation of the data


  4. Pre-Processing
    Google Refine [ link ]
    [1] replacement of abbreviations and common entities with expressions that
    normalize them (e.g., {dlrs, dlr, $, ...} → {dollar}, {mln, mlns, ...} → {million})
    [2] adjustment of flaws and [3] stripping of metadata entities through regular
    expressions
    MALLET [ link ]
    [1] lowercasing of all characters
    [2] tokenization [3] stop-word removal
    [4] proportional vocabulary cut-off, with threshold 0.03
    [5] term-frequency representation of each document
    the corpus is a single file; every line is a document with this format:
    results: |W| = 32,349 token types, 241,908 words
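    As a rough illustration of MALLET steps [1]-[5], here is a minimal sketch in Python
    (the original work used Google Refine and MALLET, not this code; the file handling,
    the toy stop-word list, and the direction of the 0.03 proportional cut-off are
    assumptions):

      # Lowercase, tokenize, remove stop words, apply a proportional vocabulary
      # cut-off and build a term-frequency representation of each document.
      import re
      from collections import Counter

      STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}  # toy list

      def tokenize(text):
          return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

      def build_corpus(lines, cutoff=0.03):
          docs = [tokenize(line) for line in lines]            # one document per line
          df = Counter(t for doc in docs for t in set(doc))    # document frequencies
          n_docs = len(docs)
          # proportional cut-off: here we prune terms seen in fewer than 3% of documents
          vocab = {t for t, c in df.items() if c / n_docs >= cutoff}
          return [Counter(t for t in doc if t in vocab) for doc in docs]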


  5. Topic Models
    probabilistic generative models for uncovering the underlying semantic
    structure of a document collection, based on a Bayesian analysis of the
    original texts [ Blei, 2003 ]
    goal: discover patterns of word use and connect documents that exhibit
    similar patterns
    idea: documents are mixtures of topics (assignments) and each topic is a
    multinomial probability distribution over words
    which topics generated the given corpus of documents with maximum
    likelihood?
    we have to infer 3 latent variables: [1] the distribution of words over topics, [2]
    the distribution of topics over documents, [3] the word-topic assignments
    [1] Φ(j) = P(W | Z = j)   [2] Θ(d) = P(Z | D = d)   [3] P(Z | W)
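    Spelled out, the mixture idea means that the probability of seeing word w in document
    d decomposes as a sum over topics (a restatement in the slide's own notation):

      P(W = w | D = d) = Σ_j P(W = w | Z = j) · P(Z = j | D = d) = Σ_j Φ(j)_w · Θ(d)_j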


  6. Topic Models
    the Latent Dirichlet Allocation (LDA) model associates two smoothing
    hyper-parameters, α and β, with [2] and [1]
    α_j indicates the number of times topic j has been selected for a document
    (α_1, ..., α_T are the parameters of a Dirichlet prior)
    β is the parameter of a Dirichlet prior which indicates the count of words
    extracted from a topic (before observing any corpus document)
    to estimate them we can use different methods (e.g., Gibbs sampling)
    we need to estimate the distributions Φ and Θ: they can be computed
    directly from the matrices of counts
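    A minimal sketch of that last step, assuming Gibbs sampling has already produced a
    word-topic count matrix and a document-topic count matrix (variable names and shapes
    are assumptions; these are the standard smoothed estimators described by Steyvers and
    Griffiths):

      # Estimate Φ = P(W | Z) and Θ = P(Z | D) from Gibbs-sampling count matrices.
      import numpy as np

      def estimate_phi_theta(n_wt, n_dt, alpha, beta):
          # n_wt: V x T matrix, n_wt[w, j] = times word w was assigned to topic j
          # n_dt: D x T matrix, n_dt[d, j] = tokens of document d assigned to topic j
          V, T = n_wt.shape
          phi = (n_wt + beta) / (n_wt.sum(axis=0, keepdims=True) + V * beta)
          theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)
          return phi, theta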


  7. Tuning
    what are the best values for the hyper-parameters? usually α = 50/T and β =
    0.01 are those that give the best results [ Steyvers and Griffiths, 2007 ]
    what is the optimal number of topics T? and the number of iterations I?
    it depends on the specific problem; it is an open problem
    we have set T = 35 and T = 40
    there are topic evaluation techniques that try to address this problem ...
    we have used one of these techniques (i.e., the topic coherence metric, which
    evaluates the semantic coherence of a topic) to compare two model
    configurations: symmetric α versus asymmetric α
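    For example, a model with these common hyper-parameter values could be trained as
    follows with gensim (an assumption: the original experiments used MALLET, and gensim
    uses variational inference rather than Gibbs sampling; docs is the list of token
    lists produced by the pre-processing step):

      # Train an LDA model with the commonly used defaults α = 50/T and β = 0.01.
      from gensim import corpora, models

      dictionary = corpora.Dictionary(docs)
      bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

      T = 35
      lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=T,
                            alpha=50.0 / T, eta=0.01, iterations=1000, passes=10)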


  8. Symmetric α versus Asymmetric α
    an asymmetric configuration (AS) of the α hyper-parameters allows the
    degree of topic sparseness to be calibrated with more flexibility
    it has been empirically demonstrated that optimizing the Dirichlet hyper-
    parameters (α_1, ..., α_T) of the topic-document distribution makes a huge
    difference: topics are not dominated by very common words and they are
    more stable as their number increases [ Wallach, 2009 ]
    this was not confirmed by our experiments: the average topic coherence of
    the AS configuration was worse than that of the SS (symmetric) configuration
    why? in our corpus there isn't a topic that tends to occur
    in every document (or the optimal number of topics T may be greater, or
    the answer may simply be more trivial ...)
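    A sketch of the comparison, assuming a gensim-style pipeline (the deck does not say
    which coherence variant was used; the 'c_v' measure and the variable names below are
    assumptions):

      # Compare average topic coherence for symmetric vs. asymmetric α.
      from gensim.models import LdaModel, CoherenceModel

      def avg_coherence(alpha_setting, bow_corpus, dictionary, docs, T=35):
          lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=T,
                         alpha=alpha_setting, eta=0.01, passes=10)
          cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                              coherence='c_v')
          return cm.get_coherence()

      print("symmetric :", avg_coherence('symmetric', bow_corpus, dictionary, docs))
      print("asymmetric:", avg_coherence('asymmetric', bow_corpus, dictionary, docs))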


  9. Top topics for symmetric α and T = 35


  10. Post-Processing - Information Retrieval
    why should we use topic models to improve information retrieval tasks?
    [1] we can cluster queries according to the extracted topics
    [2] two documents that share no common words can still be measured as
    similar
    the query likelihood model is a basic approach to information retrieval
    in this context (a generative model) we can evaluate how well a document
    matches a query by specifying how the words of the query may have been
    generated by a language model
    we derive a language model for each document (a mixture of topics)
    so the relevant documents will be those whose topic distribution is likely
    to have generated the set of words contained in (or associated with) the query
    → document similarity
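    A minimal sketch of this topic-based query likelihood scoring, assuming Φ and Θ have
    been estimated as above (all names are illustrative):

      # Score documents for a query: P(q | d) = Π_w Σ_j P(w | Z = j) · P(Z = j | D = d).
      import numpy as np

      def query_log_likelihood(query_ids, phi, theta_d):
          # query_ids: vocabulary indices of the query terms
          # phi: V x T matrix P(W | Z); theta_d: length-T vector P(Z | D = d)
          word_probs = phi[query_ids] @ theta_d        # P(w | d) for each query word
          return np.log(word_probs + 1e-12).sum()

      def rank_documents(query_ids, phi, theta):
          scores = [query_log_likelihood(query_ids, phi, theta[d])
                    for d in range(theta.shape[0])]
          return np.argsort(scores)[::-1]              # most relevant documents first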


  11. Document Similarity
    two approaches to compute the similarity between documents:
    [1] the probabilistic query approach
    [2] comparison of the topic distributions of the documents
    how? through divergence metrics (e.g., symmetrised Kullback-Leibler,
    Jensen-Shannon)
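    A small sketch of approach [2], comparing two documents' topic distributions Θ(d1)
    and Θ(d2) with the symmetrised Kullback-Leibler and Jensen-Shannon divergences
    (plain numpy; names are illustrative):

      # Divergence-based similarity between two topic distributions p and q.
      import numpy as np

      def kl(p, q, eps=1e-12):
          p, q = p + eps, q + eps
          return np.sum(p * np.log(p / q))

      def symmetrised_kl(p, q):
          return 0.5 * (kl(p, q) + kl(q, p))

      def jensen_shannon(p, q):
          m = 0.5 * (p + q)
          return 0.5 * kl(p, m) + 0.5 * kl(q, m)

      # lower divergence → more similar documents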


  12. Similar documents for query "forest fire"
    AP880727-0015 X Fire-spitting helicopters were dispatched to Yellowstone National Park on
    Tuesday to help protect the Old Faithful geyser area from a 6,000-acre blaze ...


  13. Post-Processing - Word Sense Disambiguation
    the ability to identify the meaning of words in context in a computational
    manner is usually referred to as Word Sense Disambiguation
    four elements: [1] selection of word senses (i.e., the classes) [2] use of
    external knowledge sources [3] representation of context [4] selection of an
    automatic classification method
    input: a user-specified context document d_c that contains the word w_x to be
    disambiguated
    [1] → given the s words most similar to w_x, for each of these we build a sense
    document capturing synsets, glosses, example phrases, and other relevant relations
    from WordNet
    [2] → WordNet as the external knowledge source used to create the sense documents d_s
    [3] → the topical and the semantic features
    [4] → comparison of document d_c with each of the s sense documents d_s (with one of
    the two approaches presented): the most similar will give the sense of word w_x in
    context d_c
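    As an illustration of how a sense document might be assembled from WordNet with NLTK
    (the exact relations and fields used in the original work are not specified, so this
    selection is an assumption):

      # Build a "sense document" for a word from WordNet synsets, glosses and examples.
      from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

      def sense_document(word):
          parts = []
          for syn in wn.synsets(word):
              parts.extend(syn.lemma_names())   # synonyms in the synset
              parts.append(syn.definition())    # gloss
              parts.extend(syn.examples())      # example phrases
              for hyper in syn.hypernyms():     # one of the "other relevant relations"
                  parts.extend(hyper.lemma_names())
          return " ".join(parts)

      # one sense document d_s per similar word; the context document d_c is then
      # compared against each of them with one of the two similarity approaches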


  14. Word Similarity
    two possible approaches to compute the similarity between words:
    [1] associative relation
    [2] comparison of the topic-word distributions P(Z|W)
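    A sketch of approach [2]: P(Z | W = w) can be obtained from Φ by Bayes' rule, assuming
    the topic proportions P(Z) are known (they are taken as uniform below, which is an
    assumption), and two words are then compared with a divergence such as Jensen-Shannon:

      # Compare two words through their topic distributions P(Z | W).
      import numpy as np

      def topic_given_word(phi, word_id, topic_prior=None):
          # phi: V x T matrix P(W | Z); Bayes: P(Z = j | w) ∝ P(w | Z = j) · P(Z = j)
          T = phi.shape[1]
          pz = topic_prior if topic_prior is not None else np.full(T, 1.0 / T)
          unnorm = phi[word_id] * pz
          return unnorm / unnorm.sum()

      def word_divergence(phi, w1, w2):
          p, q = topic_given_word(phi, w1), topic_given_word(phi, w2)
          m = 0.5 * (p + q)
          kl = lambda a, b: np.sum(a * np.log((a + 1e-12) / (b + 1e-12)))
          return 0.5 * kl(p, m) + 0.5 * kl(q, m)   # lower divergence → more similar words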


  15. Words similar to token "arab"


  16. Future Work
    topic modeling →
    ● train an LDA model with asymmetric α for increasing values of T and evaluate the
    resulting quality of topics
    ● train an LDA model with asymmetric α on a vocabulary on which no proportional
    cut-off has been performed
    ● investigate a possible implementation of a multiple-chain model to obtain more
    stable topics
    ● use other topic evaluation metrics
    information retrieval →
    ● assess and fine-tune the prior probability of a document in the query likelihood model
    ● use other metrics (e.g., the α-skew divergence) for the comparison of distributions
    word sense disambiguation →
    ● implement and evaluate other methods to compare the context document and the sense
    documents (e.g., compute P(d_c, d_s) under the assumption that they are conditionally
    independent given the topic variable)
    ● refine the mechanism of sense selection (e.g., choosing each of the s most probable
    words within a probability interval, in order to minimize the risk that all the most
    similar words refer to closely correlated meanings)


  17. Thank you for your attention.
    Di Donato Leonardo, Università degli Studi di Milano - Bicocca
