Slide 1

Slide 1 text

FuzzyTM: a Python package for state-of-the-art Topic Modeling
Emil Rijcken

Slide 2

Slide 2 text

The COVIDA consortium is a collaboration between Dutch research institutes
• Eindhoven University of Technology: Prof. dr. ir. Uzay Kaymak, MSc. Emil Rijcken
• University Medical Center Utrecht: Prof. dr. Floortje Scheepers
• Universiteit Utrecht: Dr. Pablo Mosteiro
• Leiden University Medical Center: Prof. dr. Marco Spruit, Dr. Kalliopi Zervanou

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Topic Models may help us!

Slide 8

Slide 8 text

Takeaways
• Our new topic modeling algorithm (FLSA-W) is state-of-the-art
• Our Python package (FuzzyTM) is easy to use and flexible

Slide 9

Slide 9 text

Topic Models are a group of unsupervised algorithms that extract abstract topics from texts.

Input:
- collection of written texts
- the number of topics

Output: two matrices

Word-topic matrix (the degree to which a word belongs to a topic):

       t1     t2     …      tC
w1     0.14   0.40   0.90   0.87
w2     0.36   0.18   0.11   0.48
…      0.89   0.76   0.34   0.59
wM     0.10   0.13   0.76   0.88

Topic-document matrix (the degree to which a topic belongs to a document):

       d1     d2     …      dN
t1     0.47   0.09   0.02   0.07
t2     0.08   0.78   0.36   0.91
…      0.07   0.02   0.41   0.85
tC     0.56   0.88   0.14   0.87
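As a rough illustration of how these two matrices are typically read, the sketch below uses small hypothetical matrices. The vocabulary, the values, and the helper names `top_words` and `dominant_topic` are made up for this example and are not part of FuzzyTM.

```python
# Hypothetical toy output of a topic model: M=4 words, C=2 topics, N=3 documents.
vocabulary = ["piano", "sea", "melody", "waves"]

# word_topic[i][j]: degree to which word i belongs to topic j (M x C).
word_topic = [
    [0.40, 0.05],
    [0.05, 0.45],
    [0.50, 0.10],
    [0.05, 0.40],
]

# topic_document[j][k]: degree to which topic j belongs to document k (C x N).
topic_document = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]


def top_words(topic, n):
    """Return the n words with the highest weight for the given topic index."""
    ranked = sorted(
        range(len(vocabulary)),
        key=lambda i: word_topic[i][topic],
        reverse=True,
    )
    return [vocabulary[i] for i in ranked[:n]]


def dominant_topic(doc):
    """Return the index of the topic with the highest weight in a document."""
    weights = [row[doc] for row in topic_document]
    return max(range(len(weights)), key=weights.__getitem__)


print(top_words(0, 2))    # the two strongest words of topic 0
print(dominant_topic(1))  # the strongest topic of document 1
```

Reading a column of the first matrix gives a topic's characteristic words; reading a column of the second gives a document's mix of topics.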

Slide 10

Slide 10 text

A fictional example of a topic model

Input (song lyrics) -> Topic Model -> Output (3 topics and 10 words):

[(0, '0.0219*"real" + 0.0143*"life" + 0.0126*"fantasy" + 0.0071*"reality" + 0.0067*"galileo" + 0.0055*"poor" + 0.0051*"boy" + 0.005*"mama" + 0.005*"mia" + 0.005*"matters"'),
 (1, '0.0167*"submarine" + 0.011*"yellow" + 0.0065*"sailed" + 0.0064*"sun" + 0.0062*"sea" + 0.0055*"waves" + 0.0047*"friends" + 0.0046*"sky" + 0.0046*"live" + 0.0045*"sailed"'),
 (2, '0.0064*"piano" + 0.0045*"man" + 0.0035*"crowd" + 0.0035*"saturday" + 0.0031*"tonic" + 0.0031*"gin" + 0.0023*"bar" + 0.0018*"drinking" + 0.0017*"beer" + 0.0016*"melody" + 0.0015*"sitting"')]
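Output in this shape, a list of (topic index, formatted string) pairs, can be turned back into word lists with a small parser. The sketch below is one way to do it, assuming plain double quotes around each word; `parse_topic` is a hypothetical helper, not a function of any library.

```python
import re

# Matches one `weight*"word"` term inside a formatted topic string.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(formatted):
    """Turn '0.0219*"real" + 0.0143*"life"' into [("real", 0.0219), ("life", 0.0143)]."""
    return [(word, float(weight)) for weight, word in TERM.findall(formatted)]


# Shortened versions of the first two topics from the example above.
topics = [
    (0, '0.0219*"real" + 0.0143*"life" + 0.0126*"fantasy"'),
    (1, '0.0167*"submarine" + 0.011*"yellow" + 0.0065*"sailed"'),
]

for topic_id, formatted in topics:
    words = [word for word, _ in parse_topic(formatted)]
    print(topic_id, words)
```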

Slide 11

Slide 11 text

Topic Models have a wide variety of applications
• Topic discovery
• Text summarization
• Text categorization
• Text similarity

Slide 12

Slide 12 text

Topic models can be used as a text embedding for other downstream tasks
• Start with a corpus
• Train a topic model
• Calculate topic embeddings
• Perform classification
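The last two steps can be sketched as follows. The slide does not say which classifier is used, so a nearest-centroid classifier stands in here purely for illustration. The embeddings, labels, and function names are hypothetical.

```python
import math

# Hypothetical document embeddings: each vector holds one weight per topic
# (for example, one column of the topic-document matrix).
train_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
train_labels = ["music", "music", "sea", "sea"]


def centroids(embeddings, labels):
    """Average the embeddings of each class into one centroid per label."""
    sums, counts = {}, {}
    for vector, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(vector, class_centroids):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(
        class_centroids,
        key=lambda label: math.dist(vector, class_centroids[label]),
    )


model = centroids(train_embeddings, train_labels)
print(classify([0.7, 0.3], model))
```

Because each dimension of the embedding corresponds to a topic, the features that drive a prediction can be traced back to topics and their top words.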

Slide 13

Slide 13 text

We have developed new fuzzy-logic-based topic modeling algorithms

Rijcken, E., Kaymak, U., Scheepers, F., Mosteiro, P., Zervanou, K., & Spruit, M. (2022). Topic Modeling for Interpretable Text Classification From EHRs. Frontiers in Big Data, 5.

Slide 14

Slide 14 text

Our best-performing algorithm, FLSA-W, has a simple pipeline

Rijcken, E., Scheepers, F., Mosteiro, P., Zervanou, K., Spruit, M., & Kaymak, U. (2021, December). A comparative study of fuzzy topic models and LDA in terms of interpretability. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-8). IEEE.

Slide 15

Slide 15 text

FLSA-W, one of the fuzzy topic models, outperforms other state-of-the-art topic models on open datasets in most settings

Dataset: M10 (scientific publications from 10 research areas)
Experimental setup: Train each algorithm 5 times for each number of topics (10, 20, …, 100) and report the average coherence, diversity, and interpretability scores.
• Coherence: intra-topic quality
• Diversity: inter-topic quality
• Interpretability: coherence × diversity

Rijcken, E., Mosteiro, P., Zervanou, K., Spruit, M., Scheepers, F., & Kaymak, U. (2022). FuzzyTM: a Software Package for Fuzzy Topic Modeling. In IEEE WCCI 2022: FUZZ-IEEE.
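To make the interpretability score concrete, here is a minimal sketch. It assumes diversity is measured as the fraction of unique words among all topics' top words, which is a common definition but is not spelled out on the slide. The coherence value and the topics are hypothetical.

```python
def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


def interpretability(coherence, diversity):
    """Interpretability as defined on the slide: coherence x diversity."""
    return coherence * diversity


# Hypothetical top words for three topics; "sea" appears in two of them.
topics = [
    ["real", "life", "fantasy"],
    ["submarine", "yellow", "sea"],
    ["piano", "man", "sea"],
]

diversity = topic_diversity(topics)  # 8 unique words out of 9
coherence = 0.5                      # hypothetical, e.g. from a coherence measure
print(round(interpretability(coherence, diversity), 3))
```

Multiplying the two scores penalizes a model that scores well on one dimension only, such as coherent topics that all repeat the same words.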

Slide 16

Slide 16 text

FuzzyTM: a Python package for state-of-the-art (fuzzy) topic modeling
• First Python package in the world for fuzzy topic modeling
• User-friendly pipelines
• Modular design allows for in-depth modification

Slide 17

Slide 17 text

We use pyFUME for fuzzy clustering

Fuchs, C., Spolaor, S., Nobile, M. S., & Kaymak, U. (2020, July). pyFUME: a Python package for fuzzy model estimation. In 2020 IEEE international conference on fuzzy systems (FUZZ-IEEE) (pp. 1-8). IEEE.

Slide 18

Slide 18 text

Topic embeddings can be created with the top words selected by a number or percentile
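The two selection modes might look like the sketch below. The slide does not define "percentile", so the cumulative-weight reading used here (keep the strongest words until they cover a given fraction of the topic's total weight) is an assumption. The function names and weights are hypothetical and do not reflect FuzzyTM's actual implementation.

```python
def top_n_words(weights, n):
    """Keep the n highest-weighted words of a topic."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n]


def top_percentile_words(weights, fraction):
    """Keep the highest-weighted words until they cover `fraction` of the total weight.

    This cumulative reading of "percentile" is an assumption for illustration.
    """
    total = sum(weights.values())
    selected, covered = [], 0.0
    for word in sorted(weights, key=weights.get, reverse=True):
        if covered >= fraction * total:
            break
        selected.append(word)
        covered += weights[word]
    return selected


# Hypothetical word weights for a single topic.
weights = {"piano": 0.5, "man": 0.25, "crowd": 0.125, "bar": 0.125}

print(top_n_words(weights, 2))
print(top_percentile_words(weights, 0.7))
```

Selecting by number gives every topic the same number of words, while selecting by percentile lets topics with a few dominant words contribute fewer of them.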

Slide 19

Slide 19 text

A fuzzy topic model can be trained with three lines of code

Slide 20

Slide 20 text

A fuzzy topic model can be trained with three lines of code
Three lines of code!
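A sketch of what such a three-line training call might look like. It assumes FuzzyTM exposes an `FLSA_W` class whose constructor takes tokenized documents and `num_topics`, plus `get_matrices()` and `show_topics()` methods. These names are recalled from the package README and should be checked against the repository before use. The wrapper `train_flsa_w` is hypothetical and only exists to keep the example self-contained.

```python
def train_flsa_w(documents, num_topics=10):
    """Train FLSA-W on tokenized documents and return its topics.

    Assumption: FuzzyTM is installed and provides FLSA_W with the
    get_matrices() and show_topics() methods used below.
    """
    # Imported lazily so this sketch can be loaded without the package.
    from FuzzyTM import FLSA_W

    model = FLSA_W(documents, num_topics=num_topics)  # 1. configure the model
    model.get_matrices()                              # 2. train it (computes the two output matrices)
    return model.show_topics()                        # 3. inspect the topics
```

Here `documents` is expected to be a list of tokenized texts, for example `[["yellow", "submarine"], ["piano", "man"]]`, although a real corpus needs to be far larger for training to be meaningful.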

Slide 21

Slide 21 text

We used topic models to analyze patient feedback forms

Arendsen, J., Rijcken, E., Zervanou, K., Rietjens, K., Vlems, F., & Kaymak, U. (2022). Analyzing Patient Feedback Data with Topic Modeling. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 248-258). Springer, Cham.

Slide 22

Slide 22 text

What to expect next
• Paper with experimental results on open datasets (spoiler: FLSA-W outperforms all other algorithms on all datasets)
• Gensim implementation (https://radimrehurek.com/gensim/)
• A new algorithm: FLSA-E

Slide 23

Slide 23 text

Takeaways
• Our new topic modeling algorithm (FLSA-W) is state-of-the-art
• Our Python package (FuzzyTM) is easy to use and flexible

Slide 24

Slide 24 text

Let’s connect
Repository
Get started