Slide 70
✓ Clean up text (remove mentions, links, etc.)
✓ Run language detection
✓ If the result is unknown or has a low weight, pretend it’s English; otherwise:
✓ If the language is not determined by its character set, try harder:
✓ Tokenize into words
✓ Take the difference with an English vocabulary
✓ If words remain, run a part-of-speech tagger on each
✓ For words tagged NNS, VBZ, or VBD, run a stemming algorithm
✓ If the stem is in the English vocabulary, remove the word from the remaining list
✓ If remaining list is not empty, calculate:
unusual_word_ratio = size(remaining)/size(words)
✓ If ratio < 20%, pretend it’s English
English / Not English
A lot of this is heuristic, arrived at after some trial and error.
It seems to help with my corpus.
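
A rough sketch of the "try harder" part of the checklist, in Python, to make the ratio concrete. This is not the code behind the slide. It assumes language detection has already run, uses a tiny made-up vocabulary (`ENGLISH_VOCAB`), and swaps the part-of-speech tagger and real stemmer for a crude suffix stripper applied to every leftover word. All function names here are hypothetical.

```python
import re

# Hypothetical toy vocabulary; a real run would load a full English word list.
ENGLISH_VOCAB = {"the", "cat", "dog", "walk", "love", "is", "a", "in", "my", "park"}

MENTION_OR_LINK = re.compile(r"@\w+|https?://\S+")
WORD = re.compile(r"[a-z']+")


def clean_text(text):
    """Strip @mentions and links, then lowercase."""
    return MENTION_OR_LINK.sub(" ", text).lower()


def candidate_stems(word):
    """Crude stand-in for POS tagging plus stemming of NNS/VBZ/VBD words:
    yield the word with common plural and verb suffixes stripped."""
    for suffix in ("ed", "es", "d", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            yield word[: -len(suffix)]


def unusual_word_ratio(text, vocab=ENGLISH_VOCAB):
    """size(remaining) / size(words), as on the slide."""
    words = WORD.findall(clean_text(text))
    if not words:
        return 0.0
    # Difference with the English vocabulary.
    remaining = [w for w in words if w not in vocab]
    # Drop leftovers whose stripped form is in the vocabulary.
    remaining = [
        w for w in remaining
        if not any(stem in vocab for stem in candidate_stems(w))
    ]
    return len(remaining) / len(words)


def looks_english(text, vocab=ENGLISH_VOCAB, threshold=0.20):
    """Pretend it's English when fewer than 20% of the words are unusual."""
    return unusual_word_ratio(text, vocab) < threshold
```

With the toy vocabulary, `looks_english("@bob the dogs walked in the park")` is true because "dogs" and "walked" reduce to known words, while `looks_english("le chat est dans le jardin")` is false. The 20% threshold is the one from the slide; the suffix list is only an illustration and would over-strip on a real vocabulary.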