Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
2
580
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Tweet
Share
Other Decks in Programming
See All in Programming
Devoxx BE - Local Development in the AI Era
kdubois
0
150
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
410
実践Claude Code:20の失敗から学ぶAIペアプログラミング
takedatakashi
18
9.2k
CSC509 Lecture 07
javiergs
PRO
0
250
20251016_Rails News ~Rails 8.1の足音を聴く~
morimorihoge
3
890
Introduce Hono CLI
yusukebe
6
3.2k
釣り地図SNSにおける有料機能の実装
nokonoko1203
0
200
Dive into Triton Internals
appleparan
0
260
他言語経験者が Golangci-lint を最初のコーディングメンターにした話 / How Golangci-lint Became My First Coding Mentor: A Story from a Polyglot Programmer
uma31
0
480
モテるデスク環境
mozumasu
3
1.4k
Temporal Knowledge Graphで作る! 時間変化するナレッジを扱うAI Agentの世界
po3rin
5
1k
組込みだけじゃない!TinyGo で始める無料クラウド開発入門
otakakot
2
380
Featured
See All Featured
We Have a Design System, Now What?
morganepeng
53
7.9k
Rails Girls Zürich Keynote
gr2m
95
14k
BBQ
matthewcrist
89
9.9k
Done Done
chrislema
185
16k
Writing Fast Ruby
sferik
630
62k
The Illustrated Children's Guide to Kubernetes
chrisshort
51
51k
Unsuck your backbone
ammeep
671
58k
What's in a price? How to price your products and services
michaelherold
246
12k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.3k
Building a Modern Day E-commerce SEO Strategy
aleyda
44
7.9k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
9
1k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin