Slide 1

Slide 1 text

Ensemble Topic Modelling
Leland McInnes

Slide 2

Slide 2 text

Model a corpus of documents in terms of underlying “topics”

Slide 3

Slide 3 text

Topic Modelling as Matrix Factorization
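
To make the matrix factorization view concrete, here is a minimal sketch, not taken from the talk. It factors a small document-word count matrix V into a document-topic matrix W and a topic-word matrix H using plain non-negative matrix factorization with multiplicative updates. The toy corpus, the vocabulary, and the function name `nmf` are assumptions made for illustration.

```python
import numpy as np

def nmf(V, n_topics, n_iter=500, seed=0, eps=1e-10):
    """Factor V (docs x words) into W (docs x topics) @ H (topics x words).

    Illustrative NMF with multiplicative updates for the squared error;
    a stand-in for any topic model viewed as a matrix factorization.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = V.shape
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_words)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy corpus: rows are documents, columns are word counts.
vocabulary = ["goal", "match", "vote", "election"]
V = np.array([
    [3, 2, 0, 0],
    [6, 4, 0, 0],
    [0, 0, 1, 4],
    [0, 0, 2, 8],
], dtype=float)

W, H = nmf(V, n_topics=2)

# Each row of H is a topic: a weighting over the vocabulary.
for k, topic in enumerate(H):
    top_words = [vocabulary[i] for i in np.argsort(topic)[::-1][:2]]
    print(f"topic {k}: {top_words}")
```

Rows of W describe each document as a mix of topics, and rows of H describe each topic as a weighting over words.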

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

LDA and pLSA are probabilistic matrix factorization methods
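
As a rough illustration of the probabilistic version, the sketch below fits pLSA with a basic EM loop. The model factors P(word | doc) as the sum over topics z of P(z | doc) P(word | z), so both factors are row-stochastic matrices. The function name `plsa`, the toy counts, and the iteration count are assumptions for this example; this is not the implementation from the talk.

```python
import numpy as np

def plsa(X, n_topics, n_iter=100, seed=0, eps=1e-12):
    """Fit pLSA by EM on a count matrix X (docs x words).

    Returns p_z_d (docs x topics) and p_w_z (topics x words),
    whose rows are probability distributions.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each (doc, word) pair,
        # stored with shape (docs, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate both factors from count-weighted responsibilities.
        weighted = X[:, None, :] * resp
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Hypothetical toy counts: 4 documents over a 4-word vocabulary.
X = np.array([
    [3, 2, 0, 0],
    [6, 4, 1, 0],
    [0, 0, 1, 4],
    [0, 1, 2, 8],
], dtype=float)

p_z_d, p_w_z = plsa(X, n_topics=2)

# The product is the model's estimate of P(word | doc) for every document.
p_w_d = p_z_d @ p_w_z
print(np.round(p_w_d, 3))
```

Because both factors are row-stochastic, every row of their product is itself a distribution over words.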

Slide 9

Slide 9 text

(Ensembles of) pLSA

Slide 10

Slide 10 text

Performance?

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Quality?

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Instability?

Slide 15

Slide 15 text

These are hard optimization problems

Slide 16

Slide 16 text

Topics vary from one run to another

Slide 17

Slide 17 text

What are the stable topics?
Inspired by https://github.com/RaRe-Technologies/gensim/pull/2282

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Each cluster represents a stable topic
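
One way to picture this step is sketched below, under loud assumptions. The topic vectors from several runs are simulated rather than fitted: they are noisy, shuffled copies of three made-up topics plus one spurious topic. The clustering is a simple cosine-similarity threshold with connected components, swapped in for illustration; the slide text does not say which clustering algorithm the talk uses. `cluster_topics`, `threshold`, and `min_size` are hypothetical names.

```python
import numpy as np

def cluster_topics(topics, threshold=0.9, min_size=2):
    """Group topic vectors and return one averaged topic per cluster.

    Two topics are linked when their cosine similarity is >= threshold;
    clusters are the connected components of that graph. Clusters with
    fewer than min_size members are treated as unstable and dropped.
    """
    unit = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    similar = (unit @ unit.T) >= threshold
    labels = -np.ones(len(topics), dtype=int)
    n_clusters = 0
    for start in range(len(topics)):
        if labels[start] != -1:
            continue
        # Flood fill over the similarity graph from an unlabelled topic.
        labels[start] = n_clusters
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(similar[i]):
                if labels[j] == -1:
                    labels[j] = n_clusters
                    stack.append(j)
        n_clusters += 1
    stable = []
    for c in range(n_clusters):
        members = topics[labels == c]
        if len(members) >= min_size:
            mean = members.mean(axis=0)
            stable.append(mean / mean.sum())
    return np.array(stable)

# Simulated stand-in for topics collected from 5 independent model runs.
rng = np.random.default_rng(0)
true_topics = np.array([
    [0.5, 0.4, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.5, 0.4, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.1, 0.8],
])
runs = []
for _ in range(5):
    noisy = np.abs(true_topics + rng.normal(scale=0.01, size=true_topics.shape))
    noisy /= noisy.sum(axis=1, keepdims=True)
    runs.append(noisy[rng.permutation(len(noisy))])  # topic order differs per run
spurious = np.full((1, 6), 1 / 6)  # a topic that shows up in only one run
all_topics = np.vstack(runs + [spurious])

stable = cluster_topics(all_topics)
print(stable.shape)
```

Topics that recur across runs end up in the same cluster, while the spurious topic forms a singleton cluster and is discarded.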

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

• Greater stability
• Determines number of topics automatically
• Embarrassingly parallel computation
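
The parallelism claim follows from each ensemble member being an independent fit with its own random seed. The sketch below, an assumption-laden illustration rather than the talk's code, maps a small stand-in fitting function over several seeds with the standard library's `ThreadPoolExecutor`. Threads are used only so the example runs anywhere; a process pool or a job scheduler would be a more typical choice for CPU-bound fits. `fit_one_run` and the toy matrix are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Hypothetical toy document-word count matrix shared by all runs.
V = np.array([
    [3, 2, 0, 0],
    [6, 4, 0, 0],
    [0, 0, 1, 4],
    [0, 0, 2, 8],
], dtype=float)

def fit_one_run(seed, n_topics=2, n_iter=200, eps=1e-10):
    """Stand-in for one ensemble member: a small NMF fit from a given seed.

    Returns the topic-word matrix with each row normalized to sum to 1.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics)) + eps
    H = rng.random((n_topics, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return H / H.sum(axis=1, keepdims=True)

# No run depends on any other, so the seeds can be mapped over workers.
seeds = [0, 1, 2, 3]
with ThreadPoolExecutor(max_workers=4) as pool:
    runs = list(pool.map(fit_one_run, seeds))

# Stack every run's topics into one array, ready for a clustering step.
all_topics = np.vstack(runs)
print(all_topics.shape)
```

The runs share only the read-only input matrix, so the workers never need to coordinate with each other.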

Slide 22

Slide 22 text

Implementation

Slide 23

Slide 23 text

sklearn API

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

https://github.com/lmcinnes/enstop

Slide 26

Slide 26 text

pip install enstop