Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
Kiroで始めるAI-DLC
kaonash
2
610
より安全で効率的な Go コードへ: Protocol Buffers Opaque API の導入
shwatanap
1
180
Introducing ReActionView: A new ActionView-compatible ERB Engine @ Rails World 2025, Amsterdam
marcoroth
0
700
概念モデル→論理モデルで気をつけていること
sunnyone
2
280
2025 年のコーディングエージェントの現在地とエンジニアの仕事の変化について
azukiazusa1
24
12k
個人軟體時代
ethanhuang13
0
330
複雑なフォームに立ち向かう Next.js の技術選定
macchiitaka
2
170
AIを活用し、今後に備えるための技術知識 / Basic Knowledge to Utilize AI
kishida
22
5.8k
パッケージ設計の黒魔術/Kyoto.go#63
lufia
3
440
Improving my own Ruby thereafter
sisshiki1969
1
160
意外と簡単!?フロントエンドでパスキー認証を実現する WebAuthn
teamlab
PRO
2
770
1から理解するWeb Push
dora1998
7
1.9k
Featured
See All Featured
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
15
1.7k
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
Site-Speed That Sticks
csswizardry
10
820
Intergalactic Javascript Robots from Outer Space
tanoku
272
27k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
9
810
The Cult of Friendly URLs
andyhume
79
6.6k
A Tale of Four Properties
chriscoyier
160
23k
The World Runs on Bad Software
bkeepers
PRO
70
11k
Fireside Chat
paigeccino
39
3.6k
Code Reviewing Like a Champion
maltzj
525
40k
How STYLIGHT went responsive
nonsquared
100
5.8k
The Invisible Side of Design
smashingmag
301
51k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin