Slide 1

Slide 1 text

Copyright © 2023 CMS Communications Inc. All rights reserved.
Multilingual Vector Search Using LLM
Manabu TERADA (@terapyon)
「PyCon TH 2023 / LT」 2023-12-15

Slide 2

Slide 2 text

Self introduction
Manabu TERADA (寺田 学)
● From Tokyo, Japan
● Python Engineer
● PSF Fellow
● Board member of the PyCon JP (Japan) Association
● Plone Foundation Ambassador
● Owner of CMScom (Japanese company)

Slide 3

Slide 3 text

PyCon TH 3rd time

Slide 4

Slide 4 text

About Vector search
● Generate vectors from text documents.
● Store them in a vector DB.
● Generate a vector from the search text.
● Compare/search items by vector using a similarity algorithm.
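The four steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the talk: `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words), and `InMemoryVectorStore` is a hypothetical stand-in for a vector DB.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model.

    Returns a sparse, L2-normalized word-count vector as a dict.
    A real system would call an embedding model here instead.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine_similarity(a, b):
    # Both vectors are already L2-normalized, so the dot product
    # equals the cosine similarity.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB: a list scanned linearly."""

    def __init__(self):
        self._items = []

    def add(self, doc_id, text):
        # Steps 1 and 2: generate a vector from the document and store it.
        self._items.append((doc_id, toy_embed(text)))

    def search(self, query, top_k=3):
        # Step 3: generate a vector from the search text.
        query_vec = toy_embed(query)
        # Step 4: compare against every stored vector and rank by similarity.
        scored = [(cosine_similarity(query_vec, vec), doc_id)
                  for doc_id, vec in self._items]
        scored.sort(reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:top_k]]


store = InMemoryVectorStore()
store.add("doc-1", "Plone is a content management system written in Python")
store.add("doc-2", "Bangkok is the capital of Thailand")
store.add("doc-3", "Vector search compares embeddings by similarity")

results = store.search("python content management", top_k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

With word counts, only documents sharing words with the query score above zero, so "doc-1" ranks first here. A real embedding model replaces `toy_embed` so that texts with similar meaning, not just shared words, end up close together.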

Slide 5

Slide 5 text

Vector search using LLM embedding model
● I made a PoC system for vector search.
● I chose a Hugging Face model for embedding. No OpenAI is involved.

Slide 6

Slide 6 text

Embedding model
● Selecting the right model is important
● How to select:
  ○ Massive Text Embedding Benchmark (MTEB) Leaderboard.
  ○ https://huggingface.co/spaces/mteb/leaderboard
● I’m using "intfloat/multilingual-e5-large"
  ○ Supports 100 languages, including Japanese and Thai.
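As a rough sketch of how this model might be called: the example below assumes the third-party `sentence-transformers` package, which is one common way to load Hugging Face embedding models and not necessarily what the talk's PoC used. The E5 model family expects a "query: " or "passage: " prefix on each input text. The helper names (`as_query`, `as_passage`, `embed`, `demo`) are hypothetical.

```python
MODEL_NAME = "intfloat/multilingual-e5-large"

# The E5 models are trained with these prefixes on their inputs.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(text):
    """Format a search text the way the E5 models expect."""
    return QUERY_PREFIX + text.strip()


def as_passage(text):
    """Format a document text the way the E5 models expect."""
    return PASSAGE_PREFIX + text.strip()


def embed(texts, model=None):
    """Return L2-normalized embeddings for already-prefixed texts.

    Assumption: sentence-transformers is installed. It is imported lazily
    so the helpers above can be used without it.
    """
    from sentence_transformers import SentenceTransformer

    model = model or SentenceTransformer(MODEL_NAME)  # downloads on first use
    return model.encode(texts, normalize_embeddings=True)


def demo():
    """Embed a Japanese and a Thai passage and score them against one query."""
    passages = [
        as_passage("Plone は Python 製の CMS です"),
        as_passage("กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย"),
    ]
    query = as_query("What is the capital of Thailand?")
    vectors = embed(passages + [query])
    passage_vecs, query_vec = vectors[:-1], vectors[-1]
    # Vectors are normalized, so the dot product is the cosine similarity.
    return passage_vecs @ query_vec
```

Calling `demo()` downloads the model on first use, which is a large download, and returns one similarity score per passage.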

Slide 7

Slide 7 text

Another situation
● I want the search on an intranet Plone site to be more capable.
● Search by sentences, not only by words.
● No use of OpenAI: intranet data should not go out beyond the boundary.

Slide 8

Slide 8 text

Structure of my sample package
● I made a sample package
● Vector search for a Plone site

Slide 9

Slide 9 text

Technical Features
● A new Index class, written with reference to ZCTextIndex
● The Index is added to portal_catalog for automatic indexing.
● The embedding model is "intfloat/multilingual-e5-large". No OpenAI is involved.
● As a consequence, new keyword arguments are added to portal_catalog for search

Slide 10

Slide 10 text

Added index and interface for Plone
Added Index on ZCatalog
Added interface on portal_catalog

Slide 11

Slide 11 text

Problem?
● The sample package requires a GPU
● I tried to run it on my MacBook Air (M1), but it did not work
● How to evaluate?

Slide 12

Slide 12 text

Thank you!
Manabu TERADA (@terapyon)