The Number of Topics Optimization: Clustering Approach

Exactpro
March 21, 2019

MACSPro'2019 - Modeling and Analysis of Complex Systems and Processes, Vienna
21 - 23 March 2019

Anastasiia Sen, Fedor Krasnov

Conference website http://macspro.club/

Website https://exactpro.com/

Transcript

  1. 2.

    Presentation outline
    • Introduction;
    • Methodology:
      o Topic model;
      o Metrics;
      o Scheme;
    • Experiments:
      o Experiment 1: scientific articles;
      o Experiment 2: patents;
    • Conclusion.
  2. 3.

    Introduction
    Existing estimates:
    • Perplexity: for the complete set of topics;
    • HDP: for the whole collection of documents;
    • Rényi and Tsallis entropies: for big collections;
    • Matrix approach;
    • Partition coefficient, Dunn index, Davies-Bouldin index, silhouette coefficient: quality of clusters.
    Regularization; dense representation; cDBI
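Perplexity, the first estimate listed above, can be illustrated with a minimal sketch. It assumes the usual topic-model factorization p(w|d) = Σ_t p(w|t) p(t|d); the names `counts`, `phi`, and `theta` are hypothetical and do not come from the slides.

```python
import math
from typing import List


def perplexity(counts: List[List[int]],
               phi: List[List[float]],
               theta: List[List[float]]) -> float:
    """Perplexity of a topic model on a document-word count matrix.

    counts[d][w] = number of occurrences of word w in document d
    phi[t][w]    = p(w|t), theta[d][t] = p(t|d)
    Assumes p(w|d) > 0 for every word that actually occurs in d.
    """
    log_likelihood = 0.0
    total_words = 0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw == 0:
                continue  # absent words do not contribute
            p_wd = sum(phi[t][w] * theta[d][t] for t in range(len(phi)))
            log_likelihood += n_dw * math.log(p_wd)
            total_words += n_dw
    return math.exp(-log_likelihood / total_words)


# One topic that is uniform over 4 words: perplexity equals the vocabulary size.
counts = [[1, 2, 3, 4]]
phi = [[0.25, 0.25, 0.25, 0.25]]
theta = [[1.0]]
value = perplexity(counts, phi, theta)  # 4.0 up to floating-point rounding
```

Lower values mean the model predicts the observed words better, so perplexity is typically compared across models trained with different numbers of topics.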
  3. 5.

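    Methodology. Metrics
    Internal metrics:
    • Core: $W_t = \{ w \in W \mid p(t \mid w) \ge \mathrm{threshold} \}$
    • Size of core: $|W_t|$
    • Purity: $\sum_{w \in W_t} p(w \mid t)$
    • Contrast: $\frac{1}{|W_t|} \sum_{w \in W_t} p(t \mid w)$
    • Coherence: $\mathrm{coh}_t = \frac{2}{k(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \mathrm{PMI}(w_i, w_j)$, where $\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}$
    Algorithm for cDBI: ∶ = ((, , )) ∈ : ∶= ∈ () ∶= 1 dim ∈ ⋅ () | | ∶= 1 dim ∈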
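The core, purity, and contrast metrics above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `phi` is a hypothetical topic-word matrix, and p(t|w) is obtained by Bayes' rule with a uniform topic prior, which the slides do not specify.

```python
from typing import Dict, List


def topic_core_metrics(phi: List[List[float]],
                       threshold: float = 0.25) -> List[Dict[str, float]]:
    """Core size, purity, and contrast for every topic.

    phi[t][w] = p(w|t); rows are topics, columns are words.
    Assumption: uniform p(t), so p(t|w) = p(w|t) / sum_s p(w|s).
    """
    n_topics, n_words = len(phi), len(phi[0])
    col_sums = [sum(phi[t][w] for t in range(n_topics)) for w in range(n_words)]
    results = []
    for t in range(n_topics):
        p_t_given_w = [phi[t][w] / col_sums[w] if col_sums[w] > 0 else 0.0
                       for w in range(n_words)]
        # Core: words that point to this topic strongly enough.
        core = [w for w in range(n_words) if p_t_given_w[w] >= threshold]
        # Purity: probability mass of the topic that lies in its core.
        purity = sum(phi[t][w] for w in core)
        # Contrast: average p(t|w) over the core (0.0 for an empty core).
        contrast = sum(p_t_given_w[w] for w in core) / len(core) if core else 0.0
        results.append({"core_size": len(core),
                        "purity": purity,
                        "contrast": contrast})
    return results


# Two topics over four words, with little overlap between them.
phi = [
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.1, 0.4, 0.5],
]
metrics = topic_core_metrics(phi, threshold=0.6)
```

In this toy matrix each topic has a two-word core, and both purity and contrast come out at about 0.9, which is what well-separated topics should look like.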
  4. 6.

    Methodology. Scheme
    • Corpus preparation
    • Change the order of documents
    • Perplexity minimization
    • ARTM learning
    • Calculation of the metrics for different numbers of topics
    • Writing results to the database
    • Dictionary creation
    • Search for optimal regularization parameters
    • Transformation of the sparse topic representation to a dense representation
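The metric-calculation step can be illustrated with a cluster-quality index computed on dense topic vectors. The sketch below is the standard Davies-Bouldin index with Euclidean distance swapped for cosine distance; it is an assumption about what a cosine Davies-Bouldin index looks like, and the exact cDBI algorithm from the slides may differ. The function names and sample vectors are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cosine_davies_bouldin(points: List[Sequence[float]],
                          labels: List[int]) -> float:
    """Davies-Bouldin index with cosine distance (lower is better).

    Assumes at least two clusters whose centroids are not parallel,
    otherwise the centroid distance in the denominator is zero.
    """
    clusters: Dict[int, List[Sequence[float]]] = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    ids = sorted(clusters)
    centroids = {c: [sum(col) / len(clusters[c]) for col in zip(*clusters[c])]
                 for c in ids}
    # Scatter: mean cosine distance from cluster members to their centroid.
    scatter = {c: sum(cosine_distance(p, centroids[c]) for p in clusters[c])
               / len(clusters[c]) for c in ids}
    total = 0.0
    for i in ids:
        # Worst-case ratio of within-cluster scatter to centroid separation.
        total += max((scatter[i] + scatter[j])
                     / cosine_distance(centroids[i], centroids[j])
                     for j in ids if j != i)
    return total / len(ids)


labels = [0, 0, 1, 1]
tight = cosine_davies_bouldin([[1, 0], [1, 0.1], [0, 1], [0.1, 1]], labels)
loose = cosine_davies_bouldin([[1, 0], [1, 0.8], [0, 1], [0.8, 1]], labels)
```

Compact, well-separated clusters give a smaller value (`tight` is below `loose` here), so one plausible use is to compute the index for each candidate number of topics and keep the number with the lowest value.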
  5. 7.

    First experiment
    Collection of 1695 scientific articles
    Dense representation projection of the topics with preservation of distances.
    Dependencies of the main internal quality metrics on the number of topics
  6. 10.

    Second experiment
    Collection of 50 000 patents
    Average kernel size
    Dependencies of the main internal quality metrics on the number of topics
  7. 13.

    Conclusion
    • Method for determining the optimal number of topics;
    • Optimization metric of the approach: cosine Davies-Bouldin index;
    • Tests on two collections:
      o Small (scientific articles);
      o Big (patents).
    • Using.