Filtering n-grams using Machine Learning

My lightning talk from the first Kiev AI/NLP group meeting.

vorushin

April 06, 2012

Transcript

  1. Unsorted unigrams, 13M: closetohome CMX309FLC AZ3 Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesFlag backgroundCorrect DEDeutschland at'ai
  2. Filtered with regexps, 10M: closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
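The deck does not show the actual regexps used for this first pass, but a sketch of such a filter might look like the following. The patterns here are illustrative assumptions, chosen so that tokens with digits or internal dots (as in slide 1) are rejected while accented words survive:

```python
import re

# Hypothetical reject patterns -- the talk does not list the real ones.
PATTERNS = [
    re.compile(r"\d"),    # any digit in the token
    re.compile(r"[._]"),  # dots or underscores inside the token
]

def keep(word):
    """Keep a unigram only if no reject pattern matches it."""
    return not any(p.search(word) for p in PATTERNS)

words = ["closetohome", "CMX309FLC", "indexterm.endofrang",
         "4.499E", "überschreibt", "at'ai"]
filtered = [w for w in words if keep(w)]
# filtered -> ["closetohome", "überschreibt", "at'ai"]
```

A pass like this removes obvious non-words cheaply, but as slide 2 shows, many junk tokens still slip through, which motivates the SVM stage.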
  3. Filtered with SVM, 2.5M: closetohome lehanga Mirabadi phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
  4. Data
     • Good data: Wiktionary words
     • Bad data: words filtered out by regexps
     • Features
       – length of word
       – count of uppercase chars (excluding the first one)
       – count of non-alpha chars
       – probability of word given 2-char n-grams
       – unigram frequency
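The five features above can be sketched as a small extraction function. The bigram log-probability and unigram-frequency lookups are stand-ins here, since the talk does not show how those corpus statistics are computed:

```python
def features(word, bigram_logprob, unigram_freq):
    """Feature vector for one unigram, following the five features
    on the slide. bigram_logprob and unigram_freq are assumed
    lookup functions (toy stand-ins, not from the talk)."""
    return [
        len(word),                           # length of word
        sum(c.isupper() for c in word[1:]),  # uppercase chars after the first
        sum(not c.isalpha() for c in word),  # non-alpha chars
        bigram_logprob(word),                # log P(word | 2-char n-grams)
        unigram_freq(word),                  # unigram frequency
    ]

# Toy stand-ins for the corpus statistics:
vec = features("PüZmann", lambda w: -12.3, lambda w: 4)
# vec -> [7, 1, 0, -12.3, 4]
```

Note how "PüZmann" scores 1 on the internal-uppercase feature; features like this let the classifier learn patterns that fixed regexps miss.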
  5. Details
     • scikit-learn – Python library for machine learning
     • SVM with Gaussian kernel
     • O(# of features × N²) – O(# of features × N³)
     • 100k items in training data => 5 min on 2 GHz
     • F1 = 0.98
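In scikit-learn, the setup described above amounts to fitting an `SVC` with the RBF (Gaussian) kernel. The snippet below uses random synthetic features in place of the real word features, so the score it produces is not the F1 = 0.98 reported in the talk:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Synthetic stand-in for the word-feature matrix: two well-separated
# clusters play the roles of "good" and "bad" unigrams.
rng = np.random.default_rng(0)
X_good = rng.normal(0.0, 1.0, size=(200, 5))
X_bad = rng.normal(3.0, 1.0, size=(200, 5))
X = np.vstack([X_good, X_bad])
y = np.array([1] * 200 + [0] * 200)

clf = SVC(kernel="rbf")  # Gaussian (RBF) kernel, as in the talk
clf.fit(X, y)
score = f1_score(y, clf.predict(X))
```

The quadratic-to-cubic growth in N quoted on the slide is why the training set was capped at around 100k items; kernel SVM training does not scale gracefully to the full 10M candidates, which are only scored at prediction time.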
  6. Thank you! Roman Vorushin, Grammarly Inc.
     http://vorushin.ru
     http://twitter.com/vorushin