Marketing OGZ
September 19, 2022

FuzzyTM: a Python package for state-of-the-art Topic Modeling

Transcript

  1. The COVIDA consortium is a collaboration between Dutch research institutes

    • Eindhoven University of Technology: Prof. dr. ir. Uzay Kaymak, MSc. Emil Rijcken
    • University Medical Center Utrecht: Prof. dr. Floortje Scheepers
    • Universiteit Utrecht: Dr. Pablo Mosteiro
    • Leiden University Medical Center: Prof. dr. Marco Spruit, Dr. Kalliopi Zervanou
  5. Takeaways

    • Our new topic modeling algorithm (FLSA-W) is state-of-the-art
    • Our Python package (FuzzyTM) is easy to use and flexible
  6. Topic Models are a group of unsupervised algorithms that extract abstract topics from texts.

    Input:
    • a collection of written texts
    • the number of topics

    Output: two matrices

    The degree to which a word belongs to a topic:

          t1    t2    …     tC
    w1    0.14  0.40  0.90  0.87
    w2    0.36  0.18  0.11  0.48
    …     0.89  0.76  0.34  0.59
    wM    0.10  0.13  0.76  0.88

    The degree to which a topic belongs to a document:

          d1    d2    …     dN
    t1    0.47  0.09  0.02  0.07
    t2    0.08  0.78  0.36  0.91
    …     0.07  0.02  0.41  0.85
    tC    0.56  0.88  0.14  0.87
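As a rough illustration of these two outputs, the sketch below builds a word-topic matrix and a topic-document matrix with NumPy and reads off the dominant topic per document. The sizes, the random values, and the column normalisation are assumptions made for the example; they are not taken from the slides or from any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
M, C, N = 6, 3, 4  # toy sizes: vocabulary size, number of topics, number of documents

# Random values stand in for the weights a trained topic model would produce.
word_topic = rng.random((M, C))  # row = word, column = topic
topic_doc = rng.random((C, N))   # row = topic, column = document

# Assumption: normalise each column so it sums to 1.
word_topic /= word_topic.sum(axis=0, keepdims=True)
topic_doc /= topic_doc.sum(axis=0, keepdims=True)

# The topic with the highest degree for each document.
dominant_topic = topic_doc.argmax(axis=0)
print(dominant_topic)
```

Each column of `word_topic` describes one topic over the vocabulary, and each column of `topic_doc` describes one document over the topics.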
  7. A fictional example of a topic model

    Input (song lyrics): 100101011010101010 010101010101010101 101001001010101011

    Topic Model

    Output (3 topics and 10 words):

    [(0, '0.0219*"real" + 0.0143*"life" + 0.0126*"fantasy" + 0.0071*"reality" + 0.0067*"galileo" + 0.0055*"poor" + 0.0051*"boy" + 0.005*"mama" + 0.005*"mia" + 0.005*"matters"'),
     (1, '0.0167*"submarine" + 0.011*"yellow" + 0.0065*"sailed" + 0.0064*"sun" + 0.0062*"sea" + 0.0055*"waves" + 0.0047*"friends" + 0.0046*"sky" + 0.0046*"live" + 0.0045*"sailed"'),
     (2, '0.0064*"piano" + 0.0045*"man" + 0.0035*"crowd" + 0.0035*"saturday" + 0.0031*"tonic" + 0.0031*"gin" + 0.0023*"bar" + 0.0018*"drinking" + 0.0017*"beer" + 0.0016*"melody" + 0.0015*"sitting"')]
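A topic listing in this weight*"word" style can be derived from a word-topic matrix by sorting each topic's column. The helper below is a minimal sketch of that idea; `format_topics`, the four-word vocabulary, and the weights are hypothetical and only serve the illustration.

```python
import numpy as np

def format_topics(word_topic, vocab, num_words=10):
    """Return [(topic_id, 'weight*"word" + ...'), ...], one entry per topic column."""
    topics = []
    for t in range(word_topic.shape[1]):
        weights = word_topic[:, t]
        # Indices of the highest-weighted words, largest first.
        top = np.argsort(weights)[::-1][:num_words]
        terms = " + ".join(f'{weights[i]:.4f}*"{vocab[i]}"' for i in top)
        topics.append((t, terms))
    return topics

# Toy vocabulary and weights (made up for the example).
vocab = ["real", "life", "submarine", "yellow"]
word_topic = np.array([
    [0.60, 0.10],
    [0.30, 0.10],
    [0.05, 0.50],
    [0.05, 0.30],
])
topics = format_topics(word_topic, vocab, num_words=2)
for topic in topics:
    print(topic)
```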
  8. Topic Models have a wide variety of applications

    • Topic discovery
    • Text summarization
    • Text categorization
    • Text similarity
  9. Topic models can be used as a text embedding for other downstream tasks

    Start with a corpus → Train a topic model → Calculate topic embeddings → Perform classification
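The workflow on this slide can be sketched as follows: each column of the topic-document matrix becomes the embedding of one document, and a classifier is trained on those embeddings. The matrix values and labels are made up, and the nearest-centroid classifier is a stand-in chosen to keep the example dependency-free; the slides do not say which classifier was used.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one mean embedding (centroid) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(X, classes, centroids):
    """Assign each row of X to the class with the closest centroid."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dist.argmin(axis=1)]

# Toy topic-document matrix (topics x documents) standing in for a trained model's output.
topic_doc = np.array([
    [0.8, 0.7, 0.1, 0.2],
    [0.1, 0.2, 0.8, 0.7],
    [0.1, 0.1, 0.1, 0.1],
])
embeddings = topic_doc.T          # one row (topic embedding) per document
labels = np.array([0, 0, 1, 1])   # hypothetical document labels

classes, centroids = nearest_centroid_fit(embeddings, labels)
new_doc = np.array([[0.75, 0.15, 0.10]])
pred = nearest_centroid_predict(new_doc, classes, centroids)
print(pred)
```

Because each embedding dimension corresponds to a topic, the features a classifier relies on can be traced back to the words of that topic.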
  10. We have developed new, fuzzy-logic-based topic modeling algorithms

    Rijcken, E., Kaymak, U., Scheepers, F., Mosteiro, P., Zervanou, K., & Spruit, M. (2022). Topic Modeling for Interpretable Text Classification From EHRs. Frontiers in Big Data, 5.
  11. Our best-performing algorithm, FLSA-W, has a simple pipeline

    Rijcken, E., Scheepers, F., Mosteiro, P., Zervanou, K., Spruit, M., & Kaymak, U. (2021, December). A comparative study of fuzzy topic models and LDA in terms of interpretability. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-8). IEEE.
  12. FLSA-W, one of the fuzzy topic models, outperforms other state-of-the-art topic models on open datasets in most settings

    Dataset: M10 (scientific publications from 10 research areas)

    Experimental setup: train each algorithm 5 times for each number of topics in 10, 20, …, 100 and report the average coherence, diversity and interpretability scores.

    • Coherence: intra-topic quality
    • Diversity: inter-topic quality
    • Interpretability: coherence × diversity

    Rijcken, E., Mosteiro, P., Zervanou, K., Spruit, M., Scheepers, F., & Kaymak, U. (2022). FuzzyTM: a Software Package for Fuzzy Topic Modeling. In IEEE WCCI 2022: FUZZ-IEEE.
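To make the score definitions concrete, here is a small sketch. It assumes diversity is measured as the share of unique words among the top words of all topics, which is a common definition but is not spelled out on the slide. The coherence value is a placeholder, since the slide does not name the coherence measure; only the product rule (coherence × diversity) comes from the slide.

```python
def topic_diversity(topics):
    """Share of unique words among all top words of all topics (1.0 means no overlap)."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

def interpretability(coherence, diversity):
    """Interpretability score as defined on the slide: coherence times diversity."""
    return coherence * diversity

# Toy topics: "sea" appears in two topics, so 8 of the 9 top words are unique.
topics = [
    ["real", "life", "fantasy"],
    ["submarine", "yellow", "sea"],
    ["piano", "man", "sea"],
]
diversity = topic_diversity(topics)
coherence = 0.5  # placeholder value, not a measured coherence score
score = interpretability(coherence, diversity)
print(round(diversity, 3), round(score, 3))
```

Multiplying the two scores penalises a model that achieves coherent topics only by repeating the same words across topics.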
  13. FuzzyTM: a Python package for state-of-the-art (fuzzy) topic modeling

    • First Python package in the world for fuzzy topic modeling
    • User-friendly pipelines
    • Modular design allows for in-depth modification
  14. We use pyFUME for fuzzy clustering

    Fuchs, C., Spolaor, S., Nobile, M. S., & Kaymak, U. (2020, July). pyFUME: a Python package for fuzzy model estimation. In 2020 IEEE international conference on fuzzy systems (FUZZ-IEEE) (pp. 1-8). IEEE.
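For readers unfamiliar with fuzzy clustering, the sketch below implements plain fuzzy c-means in NumPy. It does not use pyFUME and makes no claim about pyFUME's API or about which clustering method FuzzyTM calls; the function name, parameters, and toy data are all assumptions for illustration. The point is the output: every data point receives a degree of membership in every cluster instead of a single hard label.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means. Returns (centers, U), where U[i, k] is the
    membership degree of point i in cluster k and each row of U sums to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Cluster centers: means of the points weighted by membership^m.
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Memberships: proportional to distance^(-2 / (m - 1)), normalised per point.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

# Two well-separated toy blobs.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centers, U = fuzzy_c_means(X, n_clusters=2)
hard_labels = U.argmax(axis=1)
print(np.round(U, 3))
```

The fuzzifier `m` controls how soft the memberships are: values close to 1 approach hard clustering, larger values spread membership more evenly.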
  15. A fuzzy topic model can be trained with three lines of code

    Three lines of code!
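The slide's code is not part of the transcript, so the following is only a sketch of what the three lines might look like. The names `FLSA_W`, `input_file`, `num_topics`, and `get_matrices` are recalled from the FuzzyTM README and should be checked against the installed version; the wrapper function `train_flsa_w` is hypothetical. The import sits inside the function so the snippet loads even when FuzzyTM is not installed.

```python
def train_flsa_w(tokenized_docs, num_topics=10):
    """Train FLSA-W on a list of token lists. Requires `pip install FuzzyTM`.

    API names are assumptions based on the package README; verify before use.
    """
    from FuzzyTM import FLSA_W                                        # line 1
    model = FLSA_W(input_file=tokenized_docs, num_topics=num_topics)  # line 2
    word_topic, topic_doc = model.get_matrices()                      # line 3
    return model, word_topic, topic_doc
```

Under these assumptions, a call such as `train_flsa_w(corpus, num_topics=10)` with `corpus` as a list of tokenized documents would return the trained model together with the word-topic and topic-document matrices.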
  16. We used topic models to analyze patient feedback forms

    Arendsen, J., Rijcken, E., Zervanou, K., Rietjens, K., Vlems, F., & Kaymak, U. (2022). Analyzing Patient Feedback Data with Topic Modeling. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 248-258). Springer, Cham.
  17. To be expected

    • Paper with experimental results on open datasets (spoiler: FLSA-W outperforms all other algorithms on all datasets)
    • Gensim implementation (https://radimrehurek.com/gensim/)
    • A new algorithm: FLSA-E
  18. Takeaways

    • Our new topic modeling algorithm (FLSA-W) is state-of-the-art
    • Our Python package (FuzzyTM) is easy to use and flexible