Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
そのpreloadは必要?見過ごされたpreloadが技術的負債として爆発した日
mugitti9
2
3.1k
タスクの特性や不確実性に応じた最適な作業スタイルの選択(ペアプロ・モブプロ・ソロプロ)と実践 / Optimal Work Style Selection: Pair, Mob, or Solo Programming.
honyanya
3
140
開発者への寄付をアプリ内課金として実装する時の気の使いどころ
ski
0
360
高度なUI/UXこそHotwireで作ろう Kaigi on Rails 2025
naofumi
4
3.6k
Serena MCPのすすめ
wadakatu
4
920
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
190
階層構造を表現するデータ構造とリファクタリング 〜1年で10倍成長したプロダクトの変化と課題〜
yuhisatoxxx
3
940
なぜあの開発者はDevRelに伴走し続けるのか / Why Does That Developer Keep Running Alongside DevRel?
nrslib
3
380
monorepo の Go テストをはやくした〜い!~最小の依存解決への道のり~ / faster-testing-of-monorepos
convto
2
430
株式会社 Sun terras カンパニーデック
sunterras
0
250
iOSエンジニア向けの英語学習アプリを作る!
yukawashouhei
0
180
Go言語の特性を活かした公式MCP SDKの設計
hond0413
1
200
Featured
See All Featured
GraphQLの誤解/rethinking-graphql
sonatard
73
11k
RailsConf 2023
tenderlove
30
1.2k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
33
2.5k
Designing for Performance
lara
610
69k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3.1k
Learning to Love Humans: Emotional Interface Design
aarron
274
40k
Navigating Team Friction
lara
189
15k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
36
2.5k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
127
53k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.6k
YesSQL, Process and Tooling at Scale
rocio
173
14k
A Modern Web Designer's Workflow
chriscoyier
697
190k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin