Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
私の後悔をAWS DMSで解決した話
hiramax
4
210
1から理解するWeb Push
dora1998
7
1.9k
知っているようで知らない"rails new"の世界 / The World of "rails new" You Think You Know but Don't
luccafort
PRO
1
170
CloudflareのChat Agent Starter Kitで簡単!AIチャットボット構築
syumai
2
500
Design Foundational Data Engineering Observability
sucitw
3
200
Swift Updates - Learn Languages 2025
koher
2
490
パッケージ設計の黒魔術/Kyoto.go#63
lufia
3
440
デザイナーが Androidエンジニアに 挑戦してみた
874wokiite
0
520
FindyにおけるTakumi活用と脆弱性管理のこれから
rvirus0817
0
520
概念モデル→論理モデルで気をつけていること
sunnyone
2
280
楽して成果を出すためのセルフリソース管理
clipnote
0
170
AIを活用し、今後に備えるための技術知識 / Basic Knowledge to Utilize AI
kishida
22
5.8k
Featured
See All Featured
Navigating Team Friction
lara
189
15k
Bash Introduction
62gerente
615
210k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
34
6k
It's Worth the Effort
3n
187
28k
The Straight Up "How To Draw Better" Workshop
denniskardys
236
140k
Building Applications with DynamoDB
mza
96
6.6k
The Language of Interfaces
destraynor
161
25k
How GitHub (no longer) Works
holman
315
140k
Keith and Marios Guide to Fast Websites
keithpitt
411
22k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
53k
Raft: Consensus for Rubyists
vanstee
140
7.1k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.4k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin