Incremental and Hierarchical Document Clustering

PhD Thesis Proposal - September 2011


Rui Encarnação

September 13, 2011


  1. PhD Thesis Proposal

    Doctoral Program in Information Sciences and Technologies
    Incremental and Hierarchical Document Clustering
    Rui Alberto Cardoso da Encarnação
    Advisor: Professor Paulo Gomes
    Department of Informatics Engineering, Faculty of Science and Technology, University of Coimbra
    September 2011
  2. OUTLINE

    Motivation
    Problem Statement & Research Goals
    Background
    Related Work
    Proposed Approach & Expected Contributions
  3. MOTIVATION

    Why document clustering? The growing gap between the rate at which documents are generated and our ability to use them demands automatic tools.
    Why hierarchical clustering? There is no need to specify the number of clusters, and searches can be run at different levels, satisfying different needs.
    Why incremental clustering? Document repositories are constantly being updated.
  4. PROBLEM STATEMENT

    To develop a hierarchical document clustering algorithm that is fully incremental and unsupervised.
  5. RESEARCH QUESTIONS

    What is the most suitable document representation for incremental clustering?
    What is the best dimensionality reduction technique to overcome the curse of dimensionality?
    What changes must be made to conceptual clustering algorithms in order to use them with documents?
  6. RESEARCH GOALS

    The main goal of this research is the creation of a new document clustering algorithm with these requirements:
    Hierarchical
    Incremental
    Unsupervised
    Conceptual clustering
    Minimum performance
  7. BACKGROUND

    Clustering is “the art of finding groups in data”
    Clustering is unsupervised
    High intra-cluster similarity and low inter-cluster similarity
    Document clustering is the automatic organization of a set of documents into clusters
    Types of document clustering algorithms
  8. BACKGROUND

    Document representation: Vector Space Model, bag-of-words; document-term matrix; compound words, character n-grams
    Weights used: term frequency, term occurrence, tf-idf
    Similarity measures: Euclidean distance, cosine measure
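The representation and similarity measure named on this slide can be sketched as follows. This is a minimal illustration, not the proposal's implementation: the whitespace tokenizer, the `tf * log(N / df)` weighting variant, the function names and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine measure between two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample documents, invented for this sketch.
docs = [
    "clustering groups similar documents",
    "hierarchical clustering builds a tree of documents",
    "the weather is sunny today",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # the two documents share two terms
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms
```

The first two documents share terms and get a positive cosine, while the third shares none with the first and gets zero.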
  9. BACKGROUND

    “Curse of dimensionality”
    Dimensionality reduction techniques: feature selection, feature transformation
    Preprocessing
    Evaluation of clustering (internal and external measures)
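As an illustration of an external evaluation measure, the sketch below computes purity, which compares cluster assignments against known class labels. The slide does not name a specific measure, so the choice of purity, the function name and the sample labels are assumptions made for this example.

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of items that belong to the majority true class of their cluster."""
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)
    # For each cluster, count how many members carry its most common true label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(true_labels)

# Hypothetical assignments: cluster 0 holds two "a" and one "b"; cluster 1 holds three "b".
cluster_labels = [0, 0, 0, 1, 1, 1]
true_labels = ["a", "a", "b", "b", "b", "b"]
score = purity(cluster_labels, true_labels)
```

An internal measure, by contrast, would use only the data and the clustering itself, with no reference labels.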
  10. RELATED WORK: HIERARCHICAL CLUSTERING

    Creates a tree, with an all-inclusive cluster at the top and clusters of individual documents at the bottom
    Agglomerative: starts with single nodes and builds the hierarchy bottom-up, joining one pair in each step
    Divisive: splits a global node until only single nodes remain
    Pros: no need to specify the number of clusters
    Cons: no adjustments once a merge or split is made, and not scalable
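The agglomerative strategy described on this slide can be sketched as below. This is a naive illustration under stated assumptions: single-link distance between clusters, Euclidean distance between points, and five invented 2-D points standing in for document vectors. The nested pair search also hints at why the slide calls the approach not scalable.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2, points):
    # Assumption: cluster distance is the distance between the closest pair of members.
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points):
    """Join the closest pair of clusters at each step; return the merges in order."""
    clusters = [(i,) for i in range(len(points))]  # start with single nodes
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical points: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = agglomerative(points)
```

Each entry of `merges` records the two clusters joined and their distance, so the list read in order is the hierarchy built bottom-up. Once a pair is merged it is never revisited.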
  11. RELATED WORK: PARTITIONING CLUSTERING (K-MEANS)

    Builds a flat partition of K (predefined) clusters represented by their centroids
    Chooses K random initial centroids and assigns each document to the closest one. New centroids are then computed and documents are reassigned until the assignment is stable
    Many variants: K-Medoids, Bisecting K-Means
    Pros: relatively efficient, finds all clusters at once
    Cons: K must be specified; sensitive to initial centroids and outliers
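The loop described on this slide can be sketched as follows. It is a minimal illustration, assuming squared Euclidean distance, initial centroids sampled from the data with a fixed seed, and six invented 2-D points in place of document vectors.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Return (assignment, centroids) once the assignment stops changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K random initial centroids drawn from the data
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # stable: no point changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return assignment, centroids

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = k_means(points, k=2)
```

On data this well separated the two groups are recovered whichever points are drawn as initial centroids. On less clean data a different seed can give a different partition, which is the sensitivity the slide lists under cons.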
  12. RELATED WORK: CONCEPTUAL CLUSTERING

    Builds a hierarchy of probabilistic concepts
    COBWEB (Fisher, 1987)
      Uses Category Utility
      4 operators (place in cluster, create cluster, merge, split)
      Nominal attributes
    CLASSIT (Gennari, Langley & Fisher, 1989)
      Numerical attributes (presumes normal distribution)
      Introduces two parameters: acuity and cutoff
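To make Category Utility concrete for nominal attributes, the sketch below scores a partition using the commonly cited form of the measure: the gain in expected correct attribute-value guesses given the clusters, averaged over the number of clusters. The function names and the toy instances are assumptions made for this example, and the code scores a fixed partition rather than implementing COBWEB's four operators.

```python
from collections import Counter

def expected_correct_guesses(instances):
    """Sum over attributes and values of P(attribute = value) squared."""
    n = len(instances)
    total = 0.0
    for i in range(len(instances[0])):
        counts = Counter(inst[i] for inst in instances)
        total += sum((c / n) ** 2 for c in counts.values())
    return total

def category_utility(partition):
    """Category utility of a partition given as a list of clusters of instances."""
    all_instances = [inst for cluster in partition for inst in cluster]
    n = len(all_instances)
    baseline = expected_correct_guesses(all_instances)
    gain = sum(
        (len(cluster) / n) * (expected_correct_guesses(cluster) - baseline)
        for cluster in partition
    )
    return gain / len(partition)

# Hypothetical instances with two nominal attributes (color, shape).
good = [[("red", "round"), ("red", "round")], [("blue", "square"), ("blue", "square")]]
bad = [[("red", "round"), ("blue", "square")], [("red", "round"), ("blue", "square")]]
cu_good = category_utility(good)  # clusters make attribute values predictable
cu_bad = category_utility(bad)    # clusters add nothing over the whole data set
```

A COBWEB-style algorithm would evaluate a score like this for each candidate operator when a new instance arrives and keep the option with the highest value.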

  13. [...] clustering to documents was done by Sahoo (2006 and 2009)

    CLASSIT without changes can’t be used with text
    Replace Normal distribution by Katz’s distribution
    TF-IDF can’t be used in an incremental environment
    Confusion between TF-IDF and TF in distributions
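One way to read the point about TF-IDF in an incremental environment is that the IDF factor depends on the whole collection, so every arriving document can change weights that were already computed. The sketch below illustrates this under assumptions: the `log(N / df)` form of IDF, documents reduced to term sets, and invented sample data.

```python
import math

def idf(term, docs):
    """Inverse document frequency, log(N / df), over documents given as term sets."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# Hypothetical collection of two documents.
docs = [{"clustering", "documents"}, {"clustering", "tree"}]
idf_before = idf("tree", docs)

# A new document arrives that does not contain "tree" at all.
docs.append({"weather", "sunny"})
idf_after = idf("tree", docs)
```

The weight of "tree" changes although no document containing it was added, so vectors stored for earlier documents would have to be recomputed after each arrival. Plain term frequency has no such dependency on the rest of the collection.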
  14. PROPOSED APPROACH

    Study of recent literature in this area
    Implementation of a preliminary version of the algorithm
    Development of a global framework for incremental document clustering
    Implementation of the final version of the algorithm
    Experimentation and evaluation of the algorithm
  15. SCHEDULE

    Timeline columns: 4Q 2011; 1Q to 4Q 2012; 1Q to 4Q 2013; 1Q to 3Q 2014

    Task                                       Duration (months)
    Additional literature review               6
    Preliminary version of the algorithm       6
    Document representation issues             9
    Algorithm adaptation and improvement       9
    Experimentation                            24
    Writing of papers and technical reports    27
    Writing of the PhD thesis                  9
  16. EXPECTED CONTRIBUTIONS

    A new incremental and hierarchical document clustering algorithm;
    The adaptation of conceptual clustering algorithms to texts;
    A document representation and dimensionality reduction techniques suitable for incremental clustering;
    Better mechanisms for backtracking and tree reorganization;
    New measures for evaluating incremental clustering quality.
  17. TARGET CONFERENCES

    Generalist conferences on AI: IJCAI, AAAI, ECAI, EPIA
    Data Mining: ICDM, CIKM, ICML, PKDD, SDM, SIGKDD
    Information Retrieval: SIGIR, TREC, ECIR, JCDL
  18. CONCLUSION

    This project is an answer to the growing demand for automatic document organization tools.
    This work can create a framework for future research on incremental and hierarchical document clustering.
    We believe that our system can be an invaluable tool to keep us from being overwhelmed by documents.