Incremental and Hierarchical Document Clustering

PhD Thesis Proposal
Doctoral Program in Information Sciences and Technologies
Incremental and Hierarchical Document Clustering
Rui Alberto Cardoso da Encarnação
Advisor: Professor Paulo Gomes
Department of Informatics Engineering
Faculty of Science and Technology
University of Coimbra
September 2011

OUTLINE
Motivation
Problem Statement & Research Goals
Background
Related Work
Proposed Approach & Expected Contributions

MOTIVATION
Why document clustering?
  The growing gap between the rate at which documents are generated and our ability to use them demands automatic tools
Why hierarchical clustering?
  No need to specify the number of clusters
  Searches at different levels, satisfying different needs
Why incremental clustering?
  Document repositories are constantly being updated

PROBLEM STATEMENT
To develop a hierarchical document clustering algorithm that is fully incremental and unsupervised.

RESEARCH QUESTIONS
What is the most suitable document representation for incremental clustering?
What is the best dimensionality reduction technique to overcome the curse of dimensionality?
What changes must be made to conceptual clustering algorithms in order to use them with documents?

RESEARCH GOALS
The main goal of this research is the creation of a new document clustering algorithm with these requirements:
  Hierarchical
  Incremental
  Unsupervised
  Conceptual clustering
  Minimum performance

BACKGROUND
Clustering is “the art of finding groups in data”
Clustering is unsupervised
High intra-cluster similarity and low inter-cluster similarity
Document clustering is the automatic organization of a set of documents into clusters
Types of document clustering algorithms

BACKGROUND
Document representation
  Vector Space Model, bag-of-words
  Document-term matrix
  Compound words, character n-grams
Weights used
  Term frequency, term occurrence, tf-idf
Similarity measures
  Euclidean distance
  Cosine measure
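The representation choices on this slide can be illustrated with a minimal sketch. It assumes a naive whitespace tokenizer and the common tf * log(N / df) weighting; the function names (`tokenize`, `tf_idf_vectors`, `cosine`) and the sample documents are illustrative and do not come from the proposal.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} dict per document, weight = tf * log(N / df)."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for c in counts for term in c)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine measure between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical three-document collection.
docs = [
    "clustering groups similar documents",
    "hierarchical clustering builds a tree of documents",
    "the weather is sunny today",
]
vectors = tf_idf_vectors(docs)
```

Here the first two documents share terms and get a positive cosine, while the first and third share none and get zero.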

BACKGROUND
“Curse of dimensionality”
Dimensionality reduction techniques
  Feature selection
  Feature transformation
Preprocessing
Evaluation of clustering (internal and external measures)

RELATED WORK: HIERARCHICAL CLUSTERING
Creates a tree, with an all-inclusive cluster at the top and clusters of individual documents at the bottom
Agglomerative: starts with single nodes and builds the hierarchy bottom-up, joining one pair in each step
Divisive: splits a global node until only single nodes remain
Pros: No need to specify the number of clusters
Cons: No adjustments once a merge or split is made, and not scalable
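A minimal sketch of the agglomerative variant follows. It assumes single-link distance and a naive search over all cluster pairs; the function names and the sample points are illustrative, not taken from the proposal.

```python
def agglomerative(points, distance):
    """Naive bottom-up clustering; returns the merge tree as nested tuples of indices."""
    # Each cluster is (tree, member_indices); start with one cluster per point.
    clusters = [(i, [i]) for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(distance(points[i], points[j])
                        for i in clusters[a][1] for j in clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Join the closest pair; this merge is never revisited later.
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]

def euclidean(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

# Hypothetical 2-D points standing in for document vectors.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
tree = agglomerative(points, euclidean)
```

The sketch shows both drawbacks listed above: every step rescans all cluster pairs, and a merge is final even if later data would suggest a different grouping.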

RELATED WORK: PARTITIONING CLUSTERING (K-MEANS)
Builds a flat partition of K (predefined) clusters, each represented by its centroid
Chooses K initial random centroids and assigns each document to the closest one. New centroids are then computed and documents are reassigned until the assignment is stable
Many variants: K-Medoids, Bisecting K-Means
Pros: Relatively efficient, finds all clusters at once
Cons: Need to specify K; sensitive to initial centroids and outliers
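The loop described on this slide can be sketched as follows. This is a minimal version under stated assumptions: squared Euclidean distance, initial centroids sampled from the data, and an empty cluster keeping its previous centroid. The function name `k_means`, its parameters, and the sample points are illustrative.

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """Assign each point to its nearest centroid, recompute, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K initial centroids drawn from the data
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stability: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
assignment, centroids = k_means(points, k=2)
```

K has to be chosen up front and the result depends on the sampled starting centroids, which is the sensitivity noted under Cons.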

RELATED WORK: CONCEPTUAL CLUSTERING
Builds a hierarchy of probabilistic concepts
COBWEB (Fisher, 1987)
  Uses Category Utility
  4 operators (place in cluster, create cluster, merge, split)
  Nominal attributes
CLASSIT (Gennari, Langley & Fisher, 1989)
  Numerical attributes (presumes normal distribution)
  Introduces two parameters: acuity and cutoff
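Category Utility for nominal attributes can be sketched as below. The sketch assumes the usual formulation, CU = (1/K) * sum over clusters of P(C_k) * [sum of P(A_i = V_ij | C_k)^2 minus sum of P(A_i = V_ij)^2], and represents each instance as a dict of attribute values. The function names and the toy instances are illustrative.

```python
from collections import Counter

def expected_correct_guesses(instances):
    """Sum over attributes and values of P(A_i = V_ij)^2 within `instances`."""
    n = len(instances)
    total = 0.0
    for attr in instances[0]:
        value_counts = Counter(inst[attr] for inst in instances)
        total += sum((count / n) ** 2 for count in value_counts.values())
    return total

def category_utility(partition):
    """Category Utility of a partition given as a list of clusters (lists of dicts)."""
    all_instances = [inst for cluster in partition for inst in cluster]
    n = len(all_instances)
    baseline = expected_correct_guesses(all_instances)
    gain = sum(len(cluster) / n * (expected_correct_guesses(cluster) - baseline)
               for cluster in partition)
    return gain / len(partition)

# Hypothetical instances with two nominal attributes.
a = {"color": "red", "shape": "round"}
b = {"color": "red", "shape": "round"}
c = {"color": "blue", "shape": "square"}
d = {"color": "blue", "shape": "square"}

cu_good = category_utility([[a, b], [c, d]])   # homogeneous clusters
cu_mixed = category_utility([[a, c], [b, d]])  # clusters mirror the whole set
```

When a new instance arrives, an incremental algorithm of this kind can score the partition produced by each of the four operators and keep the one with the highest Category Utility.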

RELATED WORK: CONCEPTUAL CLUSTERING WITH DOCUMENTS
The first application of conceptual clustering to documents was by Sahoo (2006 and 2009)
CLASSIT without changes can’t be used with text
Replaces the Normal distribution with Katz’s distribution
TF-IDF can’t be used in an incremental environment
Confusion between TF-IDF and TF in distributions
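The point about TF-IDF in an incremental setting can be seen in a small sketch. It assumes the common definition idf = log(N / df) and documents given as sets of terms; the function name `idf` and the toy corpus are illustrative.

```python
import math

def idf(term, docs):
    """Inverse document frequency log(N / df) over documents given as term sets."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# Hypothetical collection before a new document arrives.
corpus = [{"clustering", "tree"}, {"clustering", "documents"}, {"weather"}]
tree_before = idf("tree", corpus)
weather_before = idf("weather", corpus)

# One new document arrives.
corpus.append({"tree", "search"})
tree_after = idf("tree", corpus)
weather_after = idf("weather", corpus)
```

The weight of "tree" drops because its document frequency grew, and the weight of "weather" rises even though the new document does not contain it, because N changed. Every vector already stored in the hierarchy would therefore need to be recomputed after each insertion.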

PROPOSED APPROACH
Study of recent literature in this area
Implementation of a preliminary version of the algorithm
Development of a global framework for incremental document clustering
Implementation of the final version of the algorithm
Experimentation and evaluation of the algorithm

SCHEDULE
Timeline by quarter: 4Q 2011 to 3Q 2014
Task                                        Duration (months)
Additional literature review                6
Preliminary version of the algorithm        6
Document representation issues              9
Algorithm adaptation and improvement        9
Experimentation                             24
Writing of papers and technical reports     27
Writing of the PhD thesis                   9

EXPECTED CONTRIBUTIONS
A new incremental and hierarchical document clustering algorithm;
The adaptation of conceptual clustering algorithms to texts;
A document representation and dimensionality reduction techniques suitable for incremental clustering;
Better mechanisms of backtracking and tree reorganization;
New measures to evaluate incremental clustering quality.

TARGET CONFERENCES
Generalist conferences on AI: IJCAI, AAAI, ECAI, EPIA
Data Mining: ICDM, CIKM, ICML, PKDD, SDM, SIGKDD
Information Retrieval: SIGIR, TREC, ECIR, JCDL

CONCLUSION
This project is an answer to the growing demand for automatic document organization tools.
This work can create a framework for future research on incremental and hierarchical document clustering.
We believe that our system can be a valuable tool to keep us from being overwhelmed by documents.
