Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
TDD 実践ミニトーク
contour_gara
0
130
LLMは麻雀を知らなすぎるから俺が教育してやる
po3rin
3
2.2k
『リコリス・リコイル』に学ぶ!! 〜キャリア戦略における計画的偶発性理論と変わる勇気の重要性〜
wanko_it
1
580
WebAssemblyインタプリタを書く ~Component Modelを添えて~
ruccho
1
890
Langfuseと歩む生成AI活用推進
licux
3
290
Honoアップデート 2025年夏
yusukebe
1
830
Claude Code と OpenAI o3 で メタデータ情報を作る
laket
0
130
未来を拓くAI技術〜エージェント開発とAI駆動開発〜
leveragestech
2
170
新世界の理解
koriym
0
140
なぜ今、Terraformの本を書いたのか? - 著者陣に聞く!『Terraformではじめる実践IaC』登壇資料
fufuhu
4
640
パスタの技術
yusukebe
1
400
学習を成果に繋げるための個人開発の考え方 〜 「学習のための個人開発」のすすめ / personal project for leaning
panda_program
1
110
Featured
See All Featured
RailsConf 2023
tenderlove
30
1.2k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
161
15k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
251
21k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
358
30k
Reflections from 52 weeks, 52 projects
jeffersonlam
351
21k
Visualization
eitanlees
146
16k
Docker and Python
trallard
45
3.5k
What’s in a name? Adding method to the madness
productmarketing
PRO
23
3.6k
BBQ
matthewcrist
89
9.8k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
234
17k
Become a Pro
speakerdeck
PRO
29
5.5k
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin