Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Filtering n-grams using Machine Learning
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
vorushin
April 06, 2012
Programming
590
2
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
Filtering n-grams using Machine Learning
My lightning talk from first Kiev AI/NLP group meeting.
vorushin
April 06, 2012
Other Decks in Programming
See All in Programming
肥大化するレガシーコードに立ち向かうためのインターフェース分離と依存の逆転 / JJUG CCC 2026 Spring
hirokunimaeta
0
550
JavaDoc 再入門
nagise
1
340
フロントエンドとバックエンドで「1文字」を揃えよう
youkidearitai
PRO
0
680
Signal Forms: Details & Live Coding @enterJS 2026 in Mannheim
manfredsteyer
PRO
0
130
Composerを使ったサプライチェーン攻撃の様子を眺めてみる #phpstudy
o0h
PRO
2
250
「エンジニアインターン、どうやって取った?」準備のリアルを語るLT会 Progate BAR
akiomatic
0
130
Hunting Vulnerabilities in Symfony with LLMs
vinceamstoutz
0
540
A2UI という光を覗いてみる
satohjohn
1
130
Javaの型とAI時代に型が大事な理由 / java types and type in AI era
kishida
2
130
生成AI時代にこそ効くGo | Why Go Works in the Age of Generative AI
mom0tomo
8
3.2k
Vite+ Unified Toolchain for the Web
naokihaba
0
300
TSKaigi Night Talks 2026_TypeScriptでサプライチェーンの整合性を型に閉じ込める
geekplus_tech
0
350
Featured
See All Featured
Jess Joyce - The Pitfalls of Following Frameworks
techseoconnect
PRO
1
170
Raft: Consensus for Rubyists
vanstee
141
7.5k
What's in a price? How to price your products and services
michaelherold
247
13k
The Pragmatic Product Professional
lauravandoore
37
7.3k
Balancing Empowerment & Direction
lara
6
1.2k
VelocityConf: Rendering Performance Case Studies
addyosmani
333
25k
Bash Introduction
62gerente
615
220k
GraphQLの誤解/rethinking-graphql
sonatard
75
12k
Leadership Guide Workshop - DevTernity 2021
reverentgeek
1
300
How to Build an AI Search Optimization Roadmap - Criteria and Steps to Take #SEOIRL
aleyda
1
2.1k
Effective software design: The role of men in debugging patriarchy in IT @ Voxxed Days AMS
baasie
0
410
Speed Design
sergeychernyshev
33
1.8k
Transcript
Filtering n-‐grams using Machine Learning
Unsorted unigrams. 13M closetohome CMX309FLC AZ3
Lehanga indexterm.endofrang NIC3 N1NB Mirabadi phantomd ANOTHER.EXAMPLE awful63 Zabolotsky Dispencer cremonesi kind.The ECOOP'97 4.499E OrbitzSaver jellying ENr313 paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot getPluginPreferencesF lag backgroundCorrect DEDeutschland at'ai
Filtered with regexps, 10M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs Campaoré überschreibt PüZmann nomalized Profesje Blogzerk imnot DEDeutschland at'ai
Filtered with SVM, 2.5M closetohome lehanga Mirabadi
phantomd Zabolotsky Dispencer cremonesi 0 jellying paulxcs nomalized Profesje Blogzerk imnot
Data • Good data: wikaonary words •
Bad data: words filtered out by regexps • Features – length of word – count of uppercase chars (excluding first one) – count of non-‐alpha chars – probability of word given 2-‐char n-‐grams – unigram frequency
Details • scikit-‐learn – python library for machine
learning • SVM with Gaussian kernel • O(# of features * N2) – O(# of features * N3) • 100k items in training data => 5 min on 2 Ghz • F1 = 0.98
Thank you! Roman Vorushin, Grammarly Inc.
hZp://vorushin.ru hZp://twiZer.com/vorushin