Filtering n-grams using Machine Learning

Slide 1

Slide 1 text

Filtering n-grams using Machine Learning

Slide 2

Slide 2 text

Unsorted unigrams. 13M closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai

Slide 3

Slide 3 text

Filtered with regexps, 10M closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt Püttmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
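The slides do not show the regexps themselves. As a rough sketch only, the three rejection rules below are assumptions inferred from which sample tokens survive this step: drop anything containing a digit, a dot, or a lowercase-to-uppercase transition. The names `REJECT_PATTERNS` and `passes_regexp_filter` are hypothetical.

```python
import re

# Hypothetical rejection rules inferred from the sample tokens on the slides;
# these are NOT the patterns used in the talk.
REJECT_PATTERNS = [
    re.compile(r"\d"),          # any digit, e.g. CMX309FLC, awful63
    re.compile(r"\."),          # embedded dot, e.g. indexterm.endofrang
    re.compile(r"[a-z][A-Z]"),  # ASCII lowercase then uppercase, e.g. OrbitzSaver
]


def passes_regexp_filter(token):
    """Return True if no rejection pattern matches anywhere in the token."""
    return not any(pattern.search(token) for pattern in REJECT_PATTERNS)


sample = [
    "closetohome", "CMX309FLC", "indexterm.endofrang", "Mirabadi", "awful63",
    "OrbitzSaver", "Campaoré", "backgroundCorrect", "DEDeutschland", "at'ai",
]
kept = [token for token in sample if passes_regexp_filter(token)]
```

On this sample the sketch keeps the same tokens that survive on the slide, including non-words such as DEDeutschland and at'ai, which is the gap the SVM step is meant to close.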

Slide 4

Slide 4 text

Filtered with SVM, 2.5M closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot

Slide 5

Slide 5 text

Data
•  Good data: Wiktionary words
•  Bad data: words filtered out by regexps
•  Features
   – length of word
   – count of uppercase chars (excluding first one)
   – count of non-alpha chars
   – probability of word given 2-char n-grams
   – unigram frequency
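The five features listed above could be computed roughly as follows. This is a minimal sketch, not the talk's code: the `^`/`$` boundary markers, add-one smoothing, the `alphabet_size` constant, the per-transition averaging, and the log scaling of the unigram count are all assumptions, as are the function names.

```python
import math
from collections import Counter


def train_bigram_model(words):
    """Count character bigrams and their left contexts over the good words."""
    pair_counts, context_counts = Counter(), Counter()
    for word in words:
        padded = "^" + word.lower() + "$"  # assumed boundary markers
        for a, b in zip(padded, padded[1:]):
            pair_counts[a + b] += 1
            context_counts[a] += 1
    return pair_counts, context_counts


def bigram_log_prob(word, model, alphabet_size=64):
    """Average log-probability per character transition, add-one smoothed."""
    pair_counts, context_counts = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log(
            (pair_counts[a + b] + 1) / (context_counts[a] + alphabet_size)
        )
    return total / (len(padded) - 1)


def extract_features(word, model, unigram_counts):
    """Build the feature vector in the order listed on the slide."""
    return [
        len(word),                                # length of word
        sum(ch.isupper() for ch in word[1:]),     # uppercase chars, first excluded
        sum(not ch.isalpha() for ch in word),     # non-alpha chars
        bigram_log_prob(word, model),             # 2-char n-gram probability (log)
        math.log1p(unigram_counts.get(word, 0)),  # unigram frequency (log-scaled)
    ]
```

For example, with a model trained on a handful of dictionary words, `extract_features("ENr313", model, counts)` starts with `[6, 1, 3, ...]`: six characters, one inner uppercase letter, three non-alphabetic characters.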

Slide 6

Slide 6 text

Details
•  scikit-learn: Python library for machine learning
•  SVM with Gaussian kernel
•  O(# of features * N^2) to O(# of features * N^3)
•  100k items in training data => 5 min on 2 GHz
•  F1 = 0.98
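A setup like the one above might look as follows in scikit-learn. This is a sketch under stated assumptions: the feature matrix is synthetic toy data standing in for the real one, and the scaling step, the `C` and `gamma` values, the train/test split, and the helper name `train_word_filter` are illustrative choices not taken from the talk.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_word_filter(X, y):
    """Fit an RBF (Gaussian) kernel SVM and report F1 on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    # Scaling is an assumption; RBF kernels are sensitive to feature ranges.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    model.fit(X_train, y_train)
    return model, f1_score(y_test, model.predict(X_test))


# Synthetic toy data, NOT the talk's data. Columns mimic
# [length, inner uppercase count, non-alpha count].
rng = np.random.default_rng(0)
good = np.column_stack([rng.integers(3, 12, 200), np.zeros(200), np.zeros(200)])
bad = np.column_stack(
    [rng.integers(3, 12, 200), rng.integers(1, 4, 200), rng.integers(1, 4, 200)]
)
X = np.vstack([good, bad]).astype(float)
y = np.array([1] * 200 + [0] * 200)  # 1 = good word, 0 = bad word

model, score = train_word_filter(X, y)
```

The toy classes are trivially separable, so the score here says nothing about the F1 reported on the slide; the point is only the shape of the pipeline.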

Slide 7

Slide 7 text

Thank you! Roman Vorushin, Grammarly Inc. http://vorushin.ru http://twitter.com/vorushin