20250226 NLP colloquium: "SoftMatcha: A Soft yet Fast Pattern Matcher for Billion-Word-Scale Corpus Search"

Hiroyuki Deguchi

February 26, 2025

Transcript

  1. (Radovanovic+, JMLR 2010)

    Wang+, arXiv 2022, “Text Embeddings by Weakly-Supervised Contrastive Pre-training”.
    Radovanovic+, JMLR 2010, “Hubs in space: Popular nearest neighbors in high-dimensional data”.
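  2. Pattern: $\mathbf{p} = p_1, \ldots, p_M \in \Sigma^*$

    Text: $\mathbf{t} = t_1, \ldots, t_N \in \Sigma^*$, where $\Sigma^*$ is the set of all finite sequences over $\Sigma$.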
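  3. Each word $w \in \mathcal{V}$ has a $D$-dimensional vector $\mathbf{v}_w \in \mathbb{R}^D$, defined as the embedding of $w$.

    Example: $\cos(\mathbf{v}_{\text{person}}, \mathbf{v}_{\text{people}}) > \cos(\mathbf{v}_{\text{person}}, \mathbf{v}_{\text{bird}})$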
  4. Exact match: $t_i = p_j$. Match by similarity: $\cos(\mathbf{v}_{t_i}, \mathbf{v}_{p_j}) \geq \alpha$

    Threshold values: $\alpha = 1.0$; ※ $\alpha = 0.7$
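  5. $\mathcal{S}_w := \{ v \in \mathcal{V} \mid \cos(\mathbf{v}_v, \mathbf{v}_w) \geq \alpha \}$, the set of words similar to $w$.

    Example sets: $\mathcal{S}_{\text{we}}$, $\mathcal{S}_{\text{talk}}$, $\mathcal{S}_{\text{about}}$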
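  6. Match of $(\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}})$ $\Leftrightarrow$ words from $\mathcal{S}_{\text{we}}$, $\mathcal{S}_{\text{talk}}$, $\mathcal{S}_{\text{about}}$ occur at positions $i, i + 1, i + 2$ for some $i$.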
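  7. Example with $\mathcal{S}_{\text{we}}$, $\mathcal{S}_{\text{talk}}$, $\mathcal{S}_{\text{about}}$:

    $\mathcal{S}_{\text{we}}$: $\mathcal{M} \leftarrow \{1, 10, 6\}$
    $\mathcal{S}_{\text{talk}}$: $\mathcal{M}' \leftarrow \{2 - 1, 11 - 1, 7 - 1\} = \{1, 10, 6\}$, then $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}' = \{1, 10, 6\}$
    $\mathcal{S}_{\text{about}}$: $\mathcal{M}' \leftarrow \{8 - 2, 12 - 2\} = \{6, 10\}$, then $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}' = \{6, 10\}$
    Result: $\mathcal{M} = \{6, 10\}$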
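  8. Each set $\mathcal{S}_{p_1}, \ldots, \mathcal{S}_{p_M}$ has a position list $I_{\mathcal{S}_{p_1}}, \ldots, I_{\mathcal{S}_{p_M}}$.

    $\mathcal{M} \leftarrow I_{\mathcal{S}_{p_1}}$
    For $k = 2, \ldots, M$:
        $\mathcal{M}' \leftarrow \{ i - k + 1 \mid i \in I_{\mathcal{S}_{p_k}} \}$
        $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}'$
    $\mathcal{M}$ then holds the match positions of $\mathbf{p} = p_1, \ldots, p_M$.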
  9. (Wang+, 2024); (Douze+, 2024); (Malkov & Yashunin, IEEE TPAMI, 2018)

    $\alpha = 0.55$ (Pennington+, EMNLP 2014)
    $\alpha = 0.50$ (Grave+, arXiv:1802.06893)

    Wang+, arXiv:2402.05672, “Multilingual E5 Text Embeddings: A Technical Report”.
    Douze+, arXiv:2401.08281, “The Faiss library”.
    Malkov & Yashunin, IEEE TPAMI, 2018, “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs”.
    Pennington+, EMNLP 2014, “GloVe: Global Vectors for Word Representation”.
    Grave+, arXiv:1802.06893, “Learning Word Vectors for 157 Languages”.
  10. (Crane, IJDL 2023); (Bothwell+, EMNLP 2023)

    Crane, IJDL 2023, “The Perseus Digital Library and the future of libraries”.
    Bothwell+, EMNLP 2023, “Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines”.
  11. $O(1)$; $O(\log |B|)$
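
The matching procedure outlined in slides 5 to 8 (per-word similarity sets, position lists, shift and intersect) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy three-dimensional embeddings, the example sentence, and the helper names `similar_words`, `build_index`, and `soft_match` are all hypothetical. Positions here are 0-based, so the shift is $i - k$ with $k$ starting at 0, rather than the slides' $i - k + 1$ with $k$ starting at 1.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similar_words(word, embeddings, alpha):
    """S_w: vocabulary words whose cosine similarity to `word` is >= alpha."""
    target = embeddings[word]
    # The word itself is always included, guarding against rounding at alpha = 1.0.
    return {v for v, vec in embeddings.items()
            if v == word or cosine(vec, target) >= alpha}


def build_index(tokens):
    """Inverted index: word -> set of 0-based positions in the text."""
    index = defaultdict(set)
    for i, tok in enumerate(tokens):
        index[tok].add(i)
    return index


def soft_match(pattern, index, embeddings, alpha):
    """Return sorted start positions where every pattern word softly matches."""
    matches = None
    for k, p in enumerate(pattern):
        # Position list of S_{p_k}: union of positions of all similar words.
        positions = set()
        for v in similar_words(p, embeddings, alpha):
            positions |= index.get(v, set())
        # Shift back by k so positions refer to the start of the pattern.
        shifted = {i - k for i in positions}
        matches = shifted if matches is None else matches & shifted
        if not matches:
            break  # an empty intersection cannot grow again
    return sorted(matches or [])


# Toy data (assumed for illustration only).
embeddings = {
    "we": (1.0, 0.0, 0.0),
    "talk": (0.0, 1.0, 0.0),
    "speak": (0.0, 0.9, 0.1),  # close to "talk"
    "about": (0.0, 0.0, 1.0),
}
tokens = "we talk about cats and we speak about dogs".split()
index = build_index(tokens)

print(soft_match(["we", "talk", "about"], index, embeddings, alpha=0.7))  # [0, 5]
print(soft_match(["we", "talk", "about"], index, embeddings, alpha=1.0))  # [0]
```

With $\alpha = 0.7$ the toy vector for "speak" falls into the similarity set of "talk", so both occurrences are found; with $\alpha = 1.0$ only the exact occurrence remains. The sketch scans the whole vocabulary for each pattern word, which is only reasonable for a toy vocabulary.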