Hiroyuki Deguchi
February 26, 2025
20250226 NLP colloquium: "SoftMatcha: A Soft yet Fast Pattern Matcher for Billion-Word-Scale Corpus Search"
Transcript
Citation: (Radovanovic+, JMLR 2010)
References:
Wang+, arXiv, 2022, "Text Embeddings by Weakly-Supervised Contrastive Pre-training".
Radovanovic+, JMLR 2010, "Hubs in space: Popular nearest neighbors in high-dimensional data".
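Pattern: $\mathbf{p} = (p_1, \dots, p_M) \in \Sigma^*$
Text: $\mathbf{t} = (t_1, \dots, t_N) \in \Sigma^*$
Here $\Sigma^*$ is the set of all finite sequences over $\Sigma$.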
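Each word $w \in \mathcal{V}$ has a $D$-dimensional embedding $\mathbf{v}_w \in \mathbb{R}^D$.
Example: $\cos(\mathbf{v}_{\text{person}}, \mathbf{v}_{\text{people}}) > \cos(\mathbf{v}_{\text{person}}, \mathbf{v}_{\text{bird}})$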
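Soft matching: a text token $t_i$ matches a pattern token $p_j$ when $\cos(\mathbf{v}_{t_i}, \mathbf{v}_{p_j}) \ge \alpha$, where $\alpha$ is the similarity threshold.
Threshold values shown: $\alpha = 1.0$ and $\alpha = 0.7$.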
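The threshold test $\cos(\mathbf{v}_{t_i}, \mathbf{v}_{p_j}) \ge \alpha$ can be sketched in a few lines of plain Python. This is only an illustration: the function names `cosine` and `soft_match` and the 2-D toy vectors are assumptions made for the example, not taken from the slides or from any SoftMatcha code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_match(v_t, v_p, alpha):
    """True when the text-token vector v_t matches the pattern-token vector v_p."""
    return cosine(v_t, v_p) >= alpha


# Toy 2-D embeddings, invented for illustration only (hypothetical values).
toy = {
    "person": [1.0, 0.0],
    "people": [0.8, 0.6],
    "bird": [0.0, 1.0],
}

print(soft_match(toy["people"], toy["person"], alpha=0.7))  # similar word
print(soft_match(toy["bird"], toy["person"], alpha=0.7))    # dissimilar word
print(soft_match(toy["people"], toy["person"], alpha=1.0))  # strict threshold
```

With these toy vectors, "people" passes the $\alpha = 0.7$ test against "person" while "bird" does not, and raising the threshold to $\alpha = 1.0$ leaves only identical vectors matching.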
[Figure: example text of 13 tokens, $t_1, \dots, t_{13}$.]
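Soft set of a word $w$: $\mathcal{S}_w := \{ v \in \mathcal{V} \mid \cos(\mathbf{v}_v, \mathbf{v}_w) \ge \alpha \}$ (the slide writes the similarity with $\mathbf{v}_v^\top \mathbf{v}_w$, which equals the cosine for unit-normalized embeddings).
Example pattern "we talk about", with soft sets $\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}}$:
the pattern matches
$\Leftrightarrow$ the sequence $\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}}$ occurs in the text
$\Leftrightarrow$ $\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}}$ occur at positions $i, i+1, i+2$ for some $i$.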
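A soft set $\mathcal{S}_w$ can be built by scanning the vocabulary and keeping every word whose embedding clears the threshold. The sketch below assumes a small in-memory dictionary of embeddings; `soft_set`, `toy_embeddings`, and the vector values are hypothetical names and numbers chosen for the example, and a brute-force scan stands in for whatever search structure a real system would use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_set(word, embeddings, alpha):
    """Return {v in vocabulary | cos(v_v, v_word) >= alpha}."""
    target = embeddings[word]
    return {v for v, vec in embeddings.items() if cosine(vec, target) >= alpha}


# Toy vocabulary with invented 2-D vectors (illustration only).
toy_embeddings = {
    "talk": [1.0, 0.0],
    "speak": [0.8, 0.6],
    "about": [0.0, 1.0],
}

print(sorted(soft_set("talk", toy_embeddings, alpha=0.7)))
print(sorted(soft_set("about", toy_embeddings, alpha=0.7)))
```

Each word always belongs to its own soft set, since its cosine similarity with itself is 1.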
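Worked example for $\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}}$:
$\mathcal{S}_{\text{we}}$: $\mathcal{M} \leftarrow \{1, 10, 6\}$
$\mathcal{S}_{\text{talk}}$: $\mathcal{M}' \leftarrow \{2 - 1, 11 - 1, 7 - 1\} = \{1, 10, 6\}$, then $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}' = \{1, 10, 6\}$
$\mathcal{S}_{\text{about}}$: $\mathcal{M}' \leftarrow \{8 - 2, 12 - 2\} = \{6, 10\}$, then $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}' = \{6, 10\}$
Result: $\mathcal{M} = \{6, 10\}$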
[Figure: the 13-token example text $t_1, \dots, t_{13}$, annotated with $\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}}$.]
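General procedure for a pattern $\mathbf{p} = (p_1, \dots, p_M)$:
For each soft set $\mathcal{S}_{p_1}, \dots, \mathcal{S}_{p_M}$, take its set of text positions $I_{\mathcal{S}_{p_1}}, \dots, I_{\mathcal{S}_{p_M}}$.
$\mathcal{M} \leftarrow I_{\mathcal{S}_{p_1}}$
For $k = 2, \dots, M$:
$\mathcal{M}' \leftarrow \{ i - k + 1 \mid i \in I_{\mathcal{S}_{p_k}} \}$
$\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}'$
$\mathcal{M}$ then holds the start positions at which $\mathbf{p}$ matches.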
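The shift-and-intersect procedure can be written out as a short Python sketch. It assumes the position set for each pattern word's soft set has already been looked up and uses 1-based positions, as on the slide. The function name `match_positions` is hypothetical, and the position lists are those implied by the slide's worked example ($\{1, 10, 6\}$, $\{2, 11, 7\}$, $\{8, 12\}$).

```python
def match_positions(position_lists):
    """Return the start positions where all pattern words line up.

    position_lists[k - 1] holds the 1-based text positions at which a word
    from the soft set of the k-th pattern word occurs.
    """
    if not position_lists:
        return []
    # M <- I_{S_{p_1}}
    matches = set(position_lists[0])
    for k, positions in enumerate(position_lists[1:], start=2):
        # M' <- { i - k + 1 | i in I_{S_{p_k}} }: shift back to the pattern start
        shifted = {i - k + 1 for i in positions}
        # M <- M intersect M'
        matches &= shifted
    return sorted(matches)


# Position lists implied by the worked example for "we talk about".
index = {
    "we": [1, 10, 6],
    "talk": [2, 11, 7],
    "about": [8, 12],
}

result = match_positions([index["we"], index["talk"], index["about"]])
print(result)  # [6, 10]
```

Shifting every position of the $k$-th word back by $k - 1$ turns "consecutive positions" into "equal positions", so a plain set intersection is enough to find the match starts.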
Citations: Multilingual E5 (Wang+, 2024); Faiss (Douze+, 2024); HNSW (Malkov & Yashunin, IEEE TPAMI, 2018).
Thresholds:
$\alpha = 0.55$ for GloVe (Pennington+, EMNLP 2014)
$\alpha = 0.50$ for the word vectors of Grave+ (arXiv:1802.06893)
References:
Wang+, arXiv:2402.05672, "Multilingual E5 Text Embeddings: A Technical Report".
Douze+, arXiv:2401.08281, "The Faiss library".
Malkov & Yashunin, IEEE TPAMI, 2018, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs".
Pennington+, EMNLP 2014, "GloVe: Global Vectors for Word Representation".
Grave+, arXiv:1802.06893, "Learning Word Vectors for 157 Languages".
Citations: (Crane, IJDL 2023); (Bothwell+, EMNLP 2023).
References:
Crane, IJDL 2023, "The Perseus Digital Library and the future of libraries".
Bothwell+, EMNLP 2023, "Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines".
[Slide text not recovered. Surviving symbols: $I_{\mathcal{S}_{p_k}}$, $O(1)$, $O(\log |B|)$.]