Multi lingual Vector search using LLM

Manabu TERADA
December 15, 2023

「PyCon TH 2023 / LT」
2023-12-15

Transcript

  1. Multi lingual Vector search using LLM

    Manabu TERADA (@terapyon) 「PyCon TH 2023 / LT」 2023-12-15
    copyright © 2023 CMS Communications Inc. all rights reserved.
  2. Self-introduction

    Manabu TERADA (寺田 学)
    • From Tokyo, Japan
    • Python Engineer
    • PSF Fellow
    • Board of PyCon JP (Japan) Association
    • Plone Foundation Ambassador
    • Owner of CMScom (Japanese company)
  3. About Vector search

    • Generate vectors from text documents.
    • Store the vectors in a vector DB.
    • Generate a vector from the search text.
    • Compare and search items by vector, using a similarity algorithm.
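
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the PoC from the talk: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), `ToyVectorStore` is a hypothetical in-memory stand-in for a vector DB, and cosine similarity is assumed as the similarity algorithm.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector DB."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector)

    def add(self, doc_id, text):
        # Steps 1 and 2: generate a vector from the document and store it.
        self._items.append((doc_id, toy_embed(text)))

    def search(self, query, top_k=3):
        # Step 3: generate a vector from the search text.
        query_vec = toy_embed(query)
        # Step 4: compare it with every stored vector and rank by similarity.
        scored = sorted(
            ((cosine(query_vec, vec), doc_id) for doc_id, vec in self._items),
            reverse=True,
        )
        return [(doc_id, score) for score, doc_id in scored[:top_k]]


store = ToyVectorStore()
store.add("doc-python", "python is a programming language")
store.add("doc-thai", "bangkok is the capital of thailand")
store.add("doc-plone", "plone is a content management system written in python")
results = store.search("python programming language", top_k=2)
```

A word-count vector only matches exact words. An LLM embedding model replaces `toy_embed` so that texts with similar meaning, even in different languages, end up close together.
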
  4. Vector search using LLM embedding model

    • I made a PoC system for vector search.
    • I chose a Hugging Face model for embedding. No OpenAI is involved.
  5. Embedding model

    • Selecting the model is important.
    • How to select:
      ◦ Massive Text Embedding Benchmark (MTEB) Leaderboard.
      ◦ https://huggingface.co/spaces/mteb/leaderboard
    • I’m using "intfloat/multilingual-e5-large"
      ◦ Supports 100 languages, including Japanese and Thai.
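
A minimal sketch of calling this model, assuming the `sentence-transformers` package is installed. The talk does not show its embedding code, so the helper names `as_query`, `as_passage`, `embed` and `demo` are hypothetical. The one model-specific detail is that E5 models expect each input to start with "query: " or "passage: ", also for non-English text.

```python
def as_query(text: str) -> str:
    # E5 models expect search text to carry a "query: " prefix.
    return "query: " + text


def as_passage(text: str) -> str:
    # Documents to be indexed carry a "passage: " prefix.
    return "passage: " + text


def embed(texts, model_name="intfloat/multilingual-e5-large"):
    """Encode texts into unit-length vectors (downloads the model on first use)."""
    # Imported lazily so the prefix helpers work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True)


def demo():
    """Score an English query against English and Japanese passages."""
    passages = [
        as_passage("Python is a programming language."),
        as_passage("Pythonはプログラミング言語です。"),
    ]
    vectors = embed(passages + [as_query("What is Python?")])
    doc_vecs, query_vec = vectors[:-1], vectors[-1]
    # With normalised vectors the dot product equals the cosine similarity.
    return doc_vecs @ query_vec
```

Calling `demo()` returns one similarity score per passage. It is not run automatically here because the first call downloads the model weights.
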
  6. Another situation

    • I want better search functionality for an intranet Plone site.
    • Search by whole sentences, not only by words.
    • No OpenAI: intranet data should not go out beyond the boundary.
  7. Structure of my sample package

    • I made a sample package
    • Vector search for Plone site
  8. Technical Feature

    • A new Index class, modeled on ZCTextIndex
    • The Index is added to portal_catalog for automatic indexing.
    • Embedding model is "intfloat/multilingual-e5-large". No OpenAI is involved.
    • As a consequence, new keyword args are added to portal_catalog for search
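
The idea of an index that registers with a catalog and thereby adds a search keyword can be illustrated without Zope. Everything below is hypothetical: `ToyVectorIndex` and `ToyCatalog` are plain Python stand-ins, not the sample package and not real ZCatalog or Plone classes, and the keyword name `vector_text` is invented for the example. Only the method names `index_object` and `unindex_object` are borrowed from ZCatalog's index interface; their signatures here are simplified.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for the embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Hypothetical index: stores one vector per document id."""

    def __init__(self, index_id):
        self.id = index_id
        self._vectors = {}

    def index_object(self, doc_id, text):
        self._vectors[doc_id] = toy_embed(text)

    def unindex_object(self, doc_id):
        self._vectors.pop(doc_id, None)

    def query(self, text, limit=10):
        query_vec = toy_embed(text)
        scored = sorted(
            ((cosine(query_vec, vec), doc_id) for doc_id, vec in self._vectors.items()),
            reverse=True,
        )
        return [(doc_id, score) for score, doc_id in scored[:limit]]


class ToyCatalog:
    """Hypothetical catalog: each registered index id becomes a search keyword."""

    def __init__(self):
        self._indexes = {}

    def add_index(self, index):
        self._indexes[index.id] = index

    def catalog_object(self, doc_id, **fields):
        # Automatic indexing: every registered index sees the matching field.
        for name, value in fields.items():
            if name in self._indexes:
                self._indexes[name].index_object(doc_id, value)

    def search(self, limit=10, **query):
        (name, text), = query.items()  # this sketch supports one keyword per search
        return self._indexes[name].query(text, limit=limit)


catalog = ToyCatalog()
catalog.add_index(ToyVectorIndex("vector_text"))
catalog.catalog_object("page-1", vector_text="how to install plone")
catalog.catalog_object("page-2", vector_text="holiday schedule for the office")
hits = catalog.search(vector_text="install plone")
```

Registering the index is what makes `vector_text` usable as a keyword in `search`, which mirrors the "new keyword args" point above.
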
  9. Added index and interface for Plone

    • Added Index on ZCatalog
    • Added interface on portal_catalog
  10. Problem?

    • The sample package requires a GPU
    • Tried to run it on my MacBook Air (M1), but it did not work
    • How to evaluate?
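
The talk leaves the evaluation question open. One common option for search systems, shown here only as a sketch, is recall@k over a small hand-labelled set of queries: for each query, what fraction of the documents judged relevant appear in the top k results. The function names and the sample data below are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical data: search output per query, plus hand-labelled relevant ids.
runs = [
    (["a", "b", "c"], {"a"}),
    (["d", "e", "f"], {"f", "x"}),
]
recall_2 = mean_recall_at_k(runs, k=2)
recall_3 = mean_recall_at_k(runs, k=3)
```

Running the same labelled queries against both the existing keyword search and the vector search gives two numbers that can be compared directly.
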