Slide 1

Slide 1 text

The Number of Topics Optimization: Clustering Approach
Anastasiia Sen, Fedor Krasnov
March 21, 2019

Slide 2

Slide 2 text

Presentation outline
• Introduction;
• Methodology:
  o Topic model;
  o Metrics;
  o Scheme;
• Experiments:
  o Experiment 1: scientific articles;
  o Experiment 2: patents;
• Conclusion.

Slide 3

Slide 3 text

Introduction
Existing estimates:
• Perplexity: for the complete set of topics;
• HDP (hierarchical Dirichlet process): for the whole collection of documents;
• Rényi and Tsallis entropies: for a big collection;
• Matrix approach;
• Partition coefficient, Dunn index, Davies-Bouldin index, silhouette coefficient: quality of clusters.
Regularization, dense representation, cDBI

Slide 4

Slide 4 text

Methodology. Topic model
Distribution of topics:
1) uniform;
2) sparse for main topics and dense for supporting topics.

Slide 5

Slide 5 text

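Methodology. Metrics
Internal metrics:
• Core: $W_t = \{ w \in W : p(t \mid w) \ge \mathrm{threshold} \}$
• Size of core: $|W_t|$
• Purity: $\mathrm{pur}_t = \sum_{w \in W_t} p(w \mid t)$
• Contrast: $\frac{1}{|W_t|} \sum_{w \in W_t} p(t \mid w)$
• Coherence: $\mathrm{coh}_t = \frac{2}{k(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} f(w_i, w_j)$, where $f(w_i, w_j) = \log(\ldots)$
Algorithm for cDBI: [pseudocode listing, not recoverable from the extracted text]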
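The core, core size, purity, and contrast metrics on this slide can be illustrated with a minimal NumPy sketch. It assumes a word-topic matrix `phi` whose column t holds p(w | t); the function name `core_metrics`, the uniform topic prior used to obtain p(t | w), and the default threshold are assumptions of this sketch, not values taken from the slides.

```python
import numpy as np


def core_metrics(phi, topic_prior=None, threshold=0.25):
    """Core size, purity and contrast for every topic.

    phi: array of shape (n_words, n_topics); column t holds p(w | t).
    topic_prior: p(t); assumed uniform when omitted (assumption of this sketch).
    threshold: minimum p(t | w) for a word to enter the core (illustrative value).
    """
    phi = np.asarray(phi, dtype=float)
    n_words, n_topics = phi.shape
    if topic_prior is None:
        topic_prior = np.full(n_topics, 1.0 / n_topics)

    # Bayes' rule: p(t | w) = p(w | t) p(t) / sum_s p(w | s) p(s)
    joint = phi * topic_prior
    word_marginal = joint.sum(axis=1, keepdims=True)
    p_t_given_w = np.divide(
        joint, word_marginal, out=np.zeros_like(joint), where=word_marginal > 0
    )

    core = p_t_given_w >= threshold  # True where word w belongs to the core of topic t
    size = core.sum(axis=0)
    purity = (phi * core).sum(axis=0)
    contrast = np.divide(
        (p_t_given_w * core).sum(axis=0),
        size,
        out=np.zeros(n_topics),
        where=size > 0,
    )
    return size, purity, contrast


# Toy matrix: 3 words, 2 topics (columns sum to 1).
toy_phi = np.array([[0.6, 0.1], [0.3, 0.1], [0.1, 0.8]])
toy_size, toy_purity, toy_contrast = core_metrics(toy_phi, threshold=0.5)
```

Topics with an empty core get a contrast of 0 here; that convention is a choice made for the sketch.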

Slide 6

Slide 6 text

Methodology. Scheme
• Corpus preparation
• Change the order of documents
• Perplexity minimization
• ARTM learning
• Calculation of the metrics for different numbers of topics
• Writing results to the database
• Dictionary creation
• Search for optimal regularization parameters
• Transformation of the sparse topic representation to a dense representation
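The loop over topic counts in this scheme might look roughly like the sketch below. It only illustrates three of the listed steps (changing the document order, computing metrics for different numbers of topics, writing results to a database). `run_sweep`, `train_model`, and `compute_metrics` are hypothetical names: the two callables are supplied by the caller and do not correspond to any library API, and SQLite is used only as a stand-in for the database.

```python
import random
import sqlite3


def run_sweep(documents, topic_counts, train_model, compute_metrics,
              db_path=":memory:", seed=0):
    """Train one model per topic count and store its metrics in SQLite.

    train_model(documents, n_topics) and compute_metrics(model) are
    caller-supplied placeholders (hypothetical, not a library API).
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # change the order of documents

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(n_topics INTEGER, metric TEXT, value REAL)"
    )
    for n_topics in topic_counts:
        model = train_model(docs, n_topics)
        # compute_metrics returns a {metric name: value} mapping
        for name, value in compute_metrics(model).items():
            conn.execute(
                "INSERT INTO results VALUES (?, ?, ?)",
                (n_topics, name, float(value)),
            )
    conn.commit()
    return conn
```

Storing one row per (number of topics, metric) pair keeps the results easy to query later, for example to find the topic count that minimizes a given metric.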

Slide 7

Slide 7 text

First experiment
Collection of 1695 scientific articles.
Dense representation: projection of the topics with preservation of distances.
Dependence of the main internal quality metrics on the number of topics.

Slide 8

Slide 8 text

First experiment
Cluster validation metrics:
1) Silhouette coefficient;
2) Calinski-Harabasz index.
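Both validation metrics are available in scikit-learn, so a curve over candidate cluster counts can be sketched as follows. The slides do not say how the clusters were obtained; k-means on dense topic vectors, the cosine metric for the silhouette, the function name `validation_curves`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def validation_curves(topic_vectors, cluster_counts, seed=0):
    """Return {n_clusters: (silhouette, calinski_harabasz)} for each candidate."""
    scores = {}
    for k in cluster_counts:
        # k-means is an assumption of this sketch, not a method named on the slides
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(topic_vectors)
        scores[k] = (
            silhouette_score(topic_vectors, labels, metric="cosine"),
            calinski_harabasz_score(topic_vectors, labels),
        )
    return scores


# Synthetic stand-in for dense topic vectors: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
toy_vectors = np.vstack([c + 0.05 * rng.standard_normal((20, 3)) for c in centers])
toy_scores = validation_curves(toy_vectors, cluster_counts=[2, 3, 4])
best_k = max(toy_scores, key=lambda k: toy_scores[k][0])  # highest silhouette
```

Higher values are better for both metrics, so the candidate with the largest score is the one to prefer.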

Slide 9

Slide 9 text

First experiment
Cosine Davies-Bouldin index (cDBI)
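One way to read "cosine Davies-Bouldin index" is the standard Davies-Bouldin index with cosine distance substituted for Euclidean distance. The sketch below follows that reading; it is an assumption and may differ from the authors' cDBI algorithm. The function names and the toy vectors are illustrative.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity, clipped at 0 to absorb rounding error."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, 1.0 - float(cos))


def cosine_davies_bouldin(vectors, labels):
    """Davies-Bouldin index computed with cosine distance (lower is better).

    Assumes non-zero vectors and at least two clusters.
    """
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    if len(cluster_ids) < 2:
        raise ValueError("need at least two clusters")

    centroids = [vectors[labels == c].mean(axis=0) for c in cluster_ids]
    # Scatter: mean cosine distance from cluster members to their centroid.
    scatter = [
        np.mean([cosine_distance(x, centroids[i]) for x in vectors[labels == c]])
        for i, c in enumerate(cluster_ids)
    ]

    worst_ratios = []
    for i in range(len(cluster_ids)):
        ratios = []
        for j in range(len(cluster_ids)):
            if j == i:
                continue
            separation = cosine_distance(centroids[i], centroids[j])
            # Collinear centroids have zero separation: treat the ratio as infinite.
            ratios.append(np.inf if separation == 0.0 else (scatter[i] + scatter[j]) / separation)
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))


toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
tight = cosine_davies_bouldin(toy, [0, 0, 1, 1])   # clusters match the two directions
mixed = cosine_davies_bouldin(toy, [0, 1, 1, 1])   # one cluster spans both directions
```

Because lower values indicate compact, well-separated clusters, the number of topics would be chosen where such an index reaches its minimum.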

Slide 10

Slide 10 text

Second experiment
Collection of 50 000 patents.
Average core (kernel) size.
Dependence of the main internal quality metrics on the number of topics.

Slide 11

Slide 11 text

Second experiment
Cluster validation metrics:
1) Silhouette coefficient;
2) Calinski-Harabasz index.

Slide 12

Slide 12 text

Second experiment
Cosine Davies-Bouldin index (cDBI)

Slide 13

Slide 13 text

Conclusion
• A method for determining the optimal number of topics;
• Optimization metric of the approach: cosine Davies-Bouldin index;
• Tests on two collections:
  o Small (scientific articles);
  o Big (patents).
• Usage.