Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
570
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
AWS CDKの推しポイント 〜CloudFormationと比較してみた〜
akihisaikeda
3
320
システム成長を止めない!本番無停止テーブル移行の全貌
sakawe_ee
1
160
5つのアンチパターンから学ぶLT設計
narihara
1
160
Flutterで備える!Accessibility Nutrition Labels完全ガイド
yuukiw00w
0
150
今ならAmazon ECSのサービス間通信をどう選ぶか / Selection of ECS Interservice Communication 2025
tkikuc
21
3.9k
AIと”コードの評価関数”を共有する / Share the "code evaluation function" with AI
euglena1215
1
130
20250628_非エンジニアがバイブコーディングしてみた
ponponmikankan
0
650
『自分のデータだけ見せたい!』を叶える──Laravel × Casbin で複雑権限をスッキリ解きほぐす 25 分
akitotsukahara
2
620
Composerが「依存解決」のためにどんな工夫をしているか #phpcon
o0h
PRO
1
250
What Spring Developers Should Know About Jakarta EE
ivargrimstad
0
410
関数型まつりレポート for JuliaTokai #22
antimon2
0
160
20250704_教育事業におけるアジャイルなデータ基盤構築
hanon52_
5
620
Featured
See All Featured
How STYLIGHT went responsive
nonsquared
100
5.6k
Git: the NoSQL Database
bkeepers
PRO
430
65k
Building Applications with DynamoDB
mza
95
6.5k
Building Adaptive Systems
keathley
43
2.6k
The Invisible Side of Design
smashingmag
301
51k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
30
2.1k
Embracing the Ebb and Flow
colly
86
4.7k
Into the Great Unknown - MozCon
thekraken
39
1.9k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
53
2.8k
Large-scale JavaScript Application Architecture
addyosmani
512
110k
A Tale of Four Properties
chriscoyier
160
23k
Gamification - CAS2011
davidbonilla
81
5.3k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin