Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
vorushin
April 06, 2012
Programming
590
2
Share
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Other Decks in Programming
See All in Programming
Swiftのレキシカルスコープ管理
kntkymt
0
190
AIエージェントの隔離技術の徹底比較
kawayu
0
430
Zod v4 Codec でスキーマに型変換を埋め込む REST API 設計 #TSKaigi2026
ryutaro_yako
0
150
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
320
Kubernetesを使わない環境にもCloud Nativeなデプロイを実現する / Enabling Cloud Native deployments without the complexity of Kubernetes
linyows
3
560
Why Laravel apps break—Mastering the fundamentals to keep them maintainable
kentaroutakeda
1
210
Stage 3 Decorators でできること / できないこと / TSKaigi 2026
susisu
1
1.1k
[BalkanRuby 2026] Drop your app/services!
palkan
3
690
Are We Really Coding 10× Faster with AI?
kohzas
0
230
20260514 - build with ai 2026 - build LINE Bot with Gemini CLI
line_developers_tw
PRO
0
470
Composerを使ったサプライチェーン攻撃の様子を眺めてみる #phpstudy
o0h
PRO
2
150
自動レビューエンジンの実装と運用 ~レビューのない世界へ~
kurukuru1999
2
260
Featured
See All Featured
Leadership Guide Workshop - DevTernity 2021
reverentgeek
1
290
The Illustrated Guide to Node.js - THAT Conference 2024
reverentgeek
1
360
Abbi's Birthday
coloredviolet
2
7.7k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
The agentic SEO stack - context over prompts
schlessera
0
790
Navigating the Design Leadership Dip - Product Design Week Design Leaders+ Conference 2024
apolaine
1
320
The Cost Of JavaScript in 2023
addyosmani
55
9.9k
Java REST API Framework Comparison - PWX 2021
mraible
34
9.3k
コードの90%をAIが書く世界で何が待っているのか / What awaits us in a world where 90% of the code is written by AI
rkaga
61
44k
Mobile First: as difficult as doing things right
swwweet
225
10k
The World Runs on Bad Software
bkeepers
PRO
72
12k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin