Slide 4
Di Donato Leonardo, Università degli Studi di Milano - Bicocca
Pre-Processing
Google Refine [ link ]
[1] replacement of abbreviations and common entities with normalized forms
(e.g., {dlrs, dlr, $, ...} → {dollar}; {mln, mlns, ...} → {million})
[2] correction of textual flaws and [3] stripping of metadata entities via
regular expressions (steps [1]-[3] are sketched below)
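A minimal Python sketch of the three Google Refine steps above, assuming regex-based rules; the abbreviation map and the metadata pattern are illustrative placeholders, not the exact rules used in this work.

```python
import re

# [1] normalization map for abbreviations and common entities (assumed examples)
NORMALIZATIONS = [
    (re.compile(r"\bdlrs?\b|\$"), "dollar"),
    (re.compile(r"\bmlns?\b"), "million"),
]
# [3] hypothetical pattern for tag-like metadata entities to strip
METADATA = re.compile(r"<[^>]*>")

def clean(text: str) -> str:
    for pattern, replacement in NORMALIZATIONS:
        text = pattern.sub(replacement, text)
    # [2] flaw corrections would be analogous regex substitutions
    return METADATA.sub("", text).strip()

print(clean("Profits rose 3 mln dlrs <META>"))  # Profits rose 3 million dollar
```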
MALLET [ link ]
[1] lowercasing of all characters
[2] tokenization
[3] stop-word removal
[4] vocabulary proportional cut-off, with threshold 0.03
[5] term-frequency representation of each document (the full pipeline is
sketched below)
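A minimal Python sketch of the five MALLET-side steps above; the stop list is a placeholder, and reading "threshold 0.03" as "drop terms occurring in fewer than 3% of documents" is an assumption, since the slide does not say on which side the cut falls.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # placeholder list

def preprocess(docs, cutoff=0.03):
    # [1] lowercase, [2] tokenize on letter runs, [3] drop stop words
    tokenized = [
        [t for t in re.findall(r"[a-z]+", d.lower()) if t not in STOPWORDS]
        for d in docs
    ]
    # [4] proportional cut-off: keep a term only if its document frequency,
    # as a fraction of all documents, reaches the threshold
    df = Counter(t for toks in tokenized for t in set(toks))
    vocab = {t for t, n in df.items() if n / len(docs) >= cutoff}
    # [5] term-frequency (bag-of-words) representation of each document
    return [Counter(t for t in toks if t in vocab) for toks in tokenized]
```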
the corpus is a single file; every line is one document (the assumed format
is sketched below)
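The format itself is not reproduced here; a plausible assumption is MALLET's default import-file layout, one instance per line as "<name> <label> <text ...>". A minimal writer sketch (the doc names and the dummy label are hypothetical):

```python
# Writes one document per line in the assumed "<name> <label> <text>" layout.
def write_corpus(path, docs):
    with open(path, "w", encoding="utf-8") as out:
        for i, text in enumerate(docs):
            flat = " ".join(text.split())  # a document must stay on one line
            out.write(f"doc{i} nolabel {flat}\n")
```

Such a file can then be read with MALLET's import-file command.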
results: |W| = 32,349 token types; 241,908 word tokens