Lock in $30 Savings on PRO—Offer Ends Soon! ⏳
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
ZOZOにおけるAI活用の現在 ~モバイルアプリ開発でのAI活用状況と事例~
zozotech
PRO
9
5.7k
AWS CDKの推しポイントN選
akihisaikeda
1
240
C-Shared Buildで突破するAI Agent バックテストの壁
po3rin
0
390
ID管理機能開発の裏側 高速にSaaS連携を実現したチームのAI活用編
atzzcokek
0
230
AIコーディングエージェント(Gemini)
kondai24
0
230
【CA.ai #3】ワークフローから見直すAIエージェント — 必要な場面と“選ばない”判断
satoaoaka
0
250
AIエージェントを活かすPM術 AI駆動開発の現場から
gyuta
0
420
著者と進める!『AIと個人開発したくなったらまずCursorで要件定義だ!』
yasunacoffee
0
140
Why Kotlin? 電子カルテを Kotlin で開発する理由 / Why Kotlin? at Henry
agatan
2
7.3k
AIコーディングエージェント(NotebookLM)
kondai24
0
200
안드로이드 9년차 개발자, 프론트엔드 주니어로 커리어 리셋하기
maryang
1
120
Giselleで作るAI QAアシスタント 〜 Pull Requestレビューに継続的QAを
codenote
0
190
Featured
See All Featured
Side Projects
sachag
455
43k
Java REST API Framework Comparison - PWX 2021
mraible
34
9k
Speed Design
sergeychernyshev
33
1.4k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
12
1.3k
The World Runs on Bad Software
bkeepers
PRO
72
12k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
16
1.8k
Balancing Empowerment & Direction
lara
5
800
Code Reviewing Like a Champion
maltzj
527
40k
Typedesign – Prime Four
hannesfritz
42
2.9k
Optimizing for Happiness
mojombo
379
70k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
1.8k
Why Our Code Smells
bkeepers
PRO
340
57k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin