ScalaMatsuri2024: How to Use Embedding Models from Scala

Embedding Under the Hood: What We Need To Use Japanese Embedding Models in Scala

WHOAMI
- SWE @ [REDACTED]
- OSS Contributor
Get In Touch
- GitHub: i10416
- X (Twitter): @by110416
- Speaker Deck: i10416
You can call me “Ito-san”.

What You'll Learn
- Overview of embedding in NLP
- How to call embedding models from Scala (and other JVM languages)

Overview of Embedding

Overview: What Is Embedding
"Embedding" refers to the process of extracting the characteristics of natural language into a representation that is easy to manipulate mathematically and programmatically. The typical representation is a real-valued vector.

Note
There are various ways of obtaining an embedding from text, but in this presentation "embedding" refers to a context-aware feature vector obtained from a large language model, more specifically from Hugging Face Transformers models.

Overview: Embedding Use Cases
For example, how different are Scala, Kotlin, Go and Python? We can naively compare words by Levenshtein distance, but it is not context-aware. One big advantage of LLMs is their context-awareness. For example, the vector of cats (the Scala FP library) is distant from the vector of cats (the animal). This is because the model weights in the network are adjusted to the knowledge in the training data.

By converting words into vectors with an LLM, we can measure the semantic distance between words.
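As a minimal sketch of what "measuring distance between vectors" can look like, the snippet below computes cosine similarity in plain Scala. Cosine similarity is one common choice and is an assumption here, not something the slides prescribe. The three-dimensional vectors and their names are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```scala
// Cosine similarity between two embedding vectors of equal length.
// The result is in [-1, 1]; higher means the vectors point in more similar directions.
def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.indices.map(i => a(i).toDouble * b(i)).sum
  val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
  val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
  require(normA > 0 && normB > 0, "zero vectors have no direction")
  dot / (normA * normB)
}

// Hypothetical toy vectors, NOT real model output.
val catsLibrary    = Array(0.9f, 0.1f, 0.0f) // "cats" in a Scala FP context
val otherFpLibrary = Array(0.8f, 0.2f, 0.1f) // some other FP library
val catsAnimal     = Array(0.0f, 0.2f, 0.9f) // "cats" in an animal context
```

With these toy values, `cosineSimilarity(catsLibrary, otherFpLibrary)` is much larger than `cosineSimilarity(catsLibrary, catsAnimal)`, which is the kind of ordering a context-aware model is expected to produce.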

Overview: Embedding Use Cases
- Semantic search
- Classification
- Clustering
- Recommendation

Overview: Steps to Get Embedding
1. Find a model from Hugging Face
2. Pre-process
3. Encode
4. Model application
   - Utilize the output hidden state of a BERT(-like) model
   - Use Sentence Transformer models

Example: How to Get Embeddings in Python with BERT
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained("bert-base-uncased")
    encoded_input = tokenizer("...", return_tensors='pt')
    output = model(**encoded_input)
    output.last_hidden_state
It is known that the last hidden state of the [CLS] token represents sentence characteristics (to some extent).

Example: How to Get Embeddings in Python with Sentence Transformers
    from sentence_transformers import SentenceTransformer

    sentences = ["..."]
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    embeddings = model.encode(sentences)
Sentence Transformers models map a sentence directly to a dense vector.

In Python, the Hugging Face Transformers package encapsulates the transformation pipeline.
However, we have to strip the abstraction away to understand the internals.

It goes through the following transformation pipeline under the hood.
    String
      // 1: pre-process
      => String                    // cleanup, normalization
      // 2: tokenize
      => Seq[String] | Seq[Byte]   // (sub-)word segmentation, special token assignment
      // 3: encoding
      => Seq[Number]               // map tokens to ids
      => Tensor                    // adapt input ids to the expected tensor shape
      // 4: model application
      => Tensor

Overview: Tokenizer & Model Are Two Important Concepts of Transformers Package

Overview: Model Is Neural Network: Tensor => Tensor

Overview: Tokenizer Is an Adaptor for Each Natural Language
In general, BERT-like models take numerical tensors as input. We need to transform text into numerical values so that we can feed it into a model. This is where the tokenizer comes in.

Overview: Tokenization
    1: original           "This is a great example"
    2: word segmentation  [This, is, a, great, example]
    3: subwording         [This, is, a, gre, ##at, ex, ##ample]
    4: encoding           [42, 24, 6, 8, 4, 16, 110]
Tokenization splits a sentence into tokens. A token is either a word or a part of a word. Encoding assigns a unique number to each token.
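The steps on this slide can be sketched in a few lines of plain Scala. This is a toy, assuming whitespace word segmentation and a greedy longest-match sub-wording rule in the style of WordPiece. The vocabulary `toyVocab` and its ids are hypothetical values chosen to reproduce the numbers on the slide, and the helper names are not from any library.

```scala
// Hypothetical vocabulary; "##" marks a piece that continues a word, as in BERT vocabularies.
val toyVocab: Map[String, Int] = Map(
  "[UNK]" -> 0, "This" -> 42, "is" -> 24, "a" -> 6,
  "gre" -> 8, "##at" -> 4, "ex" -> 16, "##ample" -> 110
)

// Greedy longest-match-first sub-wording of a single word.
def subword(word: String, vocab: Map[String, Int], unk: String = "[UNK]"): List[String] = {
  @annotation.tailrec
  def loop(start: Int, acc: List[String]): List[String] =
    if (start >= word.length) acc.reverse
    else {
      val prefix = if (start == 0) "" else "##"
      // Try the longest remaining substring first, then shrink it.
      val hit = (word.length until start by -1)
        .map(end => prefix + word.substring(start, end))
        .find(vocab.contains)
      hit match {
        case Some(piece) => loop(start + piece.length - prefix.length, piece :: acc)
        case None        => List(unk) // no piece matches: treat the whole word as unknown
      }
    }
  loop(0, Nil)
}

// Word segmentation (by whitespace), sub-wording, then encoding to ids.
def tokenize(sentence: String, vocab: Map[String, Int]): List[String] =
  sentence.split(" ").toList.flatMap(w => subword(w, vocab))

def encode(sentence: String, vocab: Map[String, Int]): List[Int] =
  tokenize(sentence, vocab).map(t => vocab.getOrElse(t, vocab("[UNK]")))
```

Calling `tokenize("This is a great example", toyVocab)` yields the sub-words from step 3, and `encode` maps them to the ids from step 4. A real tokenizer also handles normalization, special tokens and much larger vocabularies.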

How to Use Embedding Models in Scala?
... and what makes it difficult, especially in Japanese?

Models Are Easier to Use from JVM than Tokenizers
because
- the model interface is decoupled from natural language domain knowledge
- there is ONNX, a language-agnostic machine learning model format
- there is ONNX Runtime, which can run ONNX models from various languages

What Is ONNX?
ONNX is an open format built to represent machine learning models. ONNX includes
- the ONNX serialization format (protobuf)
- ONNX Runtime for C, Java, Python, JavaScript, C#, etc.
- the ONNX language (operators) and extensions

What Is ONNX Model?
A model is a kind of graph, and the graph nodes are compositions of ONNX operators.

What Is ONNX Model?
ONNX Runtime is pluggable: developers can add custom operators to perform various tasks, including complex numerical computation, tokenization for NLP, and even API calls. This is similar to the way JavaScript APIs differ between runtimes.

ONNX Limitation
We must use a Sentence Transformers model, which converts text directly into an embedding, because we cannot extract hidden states from the ONNX model result.

Examples of SentenceTransformer supporting Japanese
Some models do not work well with Japanese. We need models that support Japanese text.
- intfloat/multilingual-e5-large
- sentence-transformers/paraphrase-xlm-r-multilingual-v1
- cl-nagoya/sup-simcse-ja-base
- pkshatech/GLuCoSE-base-ja

Show Me The Code!
This is the outline of getting an embedding in Scala. There are some blanks to fill in this example.
    //> using dep "com.microsoft.onnxruntime:onnxruntime:1.18.0"
    import ai.onnxruntime.*
    import scala.jdk.CollectionConverters.* // for asJava

    val env: OrtEnvironment = OrtEnvironment.getEnvironment()
    val sess: OrtSession = env.createSession("path/to/model.onnx")
    val input: Map[String, OnnxTensor] = Map(
      "input_ids" -> ???,
      "attention_mask" -> ???,
      ...
    )
    val result = sess.run(input.asJava)

Question: How to Get model.onnx?
    val sess: OrtSession = env.createSession("path/to/model.onnx")
We need model files that are not tied to a training and inference framework such as torch.

Convert a Hugging Face Model to ONNX
There are several options for converting a model to ONNX, but optimum is the easiest one.
Options
- optimum
- torch.onnx.export
- convert_graph_to_onnx

Convert a Hugging Face Model into ONNX with optimum
Hugging Face provides optimum, a tool to convert a model into ONNX.
    optimum-cli export onnx --model {model identifier} {out dir}
optimum-cli generates {out dir}/model.onnx.

Detour: Convert a Hugging Face Model into ONNX with torch.onnx.export
You can use torch.onnx.export instead of optimum when you need to add some layers to the model.
    import torch.onnx

    model.eval()
    dummy_input = ???
    torch.onnx.export(
        model,
        dummy_input,
        "model.onnx",
        export_params=True,
        opset_version=10,
        do_constant_folding=True,
        input_names=['input_ids', "token_type_ids", "attention_mask"],
        output_names=['output']
    )

Detour: Convert a Hugging Face Model into ONNX (Legacy)
With Hugging Face Transformers version <= 4, there is a convert_graph_to_onnx module. Avoid this API if possible, as it is marked deprecated.
    from pathlib import Path
    from transformers.convert_graph_to_onnx import convert

    model_location: str = "path/to/model/dir"
    output_location: str = "path/to/output/dir"
    convert(
        pipeline_name="...",
        framework="pt",
        model=model_location,
        tokenizer=model_location,
        output=Path(output_location),
        opset=12
    )

Once you export a model as ONNX, you can load it with ONNX Runtime.

Question: How to Get Model Input & Output?
    val input: Map[String, OnnxTensor] = Map(
      "input_ids" -> ???,
      "attention_mask" -> ???,
      ...
    )

Use getInputInfo and getOutputInfo API
    val sess: OrtSession = env.createSession("path/to/model.onnx")
    val (i, o) = (sess.getInputInfo, sess.getOutputInfo)

Alternatively, visit https://netron.app/ and upload model.onnx to check the model input and output.

Question: How To Get Input Tensor?
    val input: Map[String, OnnxTensor] = Map(
      "input_ids" -> ???,
      "attention_mask" -> ???,
      ...
    )

Use HF Tokenizers If Possible
Deep Java Library has a Hugging Face Tokenizers module. It can load some tokenizers from Hugging Face.
    //> using dep "ai.djl.huggingface:tokenizers:0.28.0"
    import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

    val t = HuggingFaceTokenizer
      .newInstance("sentence-transformers/paraphrase-xlm-r-multilingual-v1")
    val encoding = t.encode("これはテストです")
    val (inputIds, typeIds, attentionMask) =
      (encoding.getIds(), encoding.getTypeIds(), encoding.getAttentionMask())

What Does Tokenizer Actually Do?
- pre-process: (some tokenizers) perform cleanup and/or normalization
- tokenize: split text into words and sub-words, which are referred to as "tokens"
- special token handling: add special tokens such as [CLS], [SEP] or [PAD]
- encoding: map tokens: Seq[String] | Seq[Byte] to ids: Seq[Number]
- decoding: map ids: Seq[Number] back to tokens: Seq[String] | Seq[Byte]
Some Japanese embedding models perform tasks specific to Japanese text at the pre-process and tokenize phases, which are difficult for HuggingFaceTokenizer to support.

Difficulties specific to Japanese Tokenization (& Embedding)
Japanese tokenization is more complicated than English tokenization. It often requires special treatment, which makes a model less portable. The difficulties are
- ambiguity in word boundaries
- the variety of characters in use

Ambiguity in Word Boundary
    すもももももももものうち => [すもも, も, もも, も, もも, の, うち]
It is a well-known tongue twister. It means "Both Japanese plums and peaches are kinds of peach". It is not easy even for native speakers to find the correct boundaries at a glance.
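To make the ambiguity concrete, here is a small sketch in plain Scala that segments the tongue twister with a naive greedy longest-match rule over a toy dictionary. Both the rule and the dictionary are assumptions for illustration; real morpheme analyzers use much richer dictionaries and scoring, not this algorithm.

```scala
// The segmentation shown on the slide, and the raw text it was built from.
val expected = List("すもも", "も", "もも", "も", "もも", "の", "うち")
val text = expected.mkString

// Toy dictionary containing every word that appears in the expected segmentation.
val toyDict = Set("すもも", "もも", "も", "の", "うち")

// Naive segmentation: at each position take the longest dictionary word.
def longestMatch(input: String, dict: Set[String]): List[String] = {
  @annotation.tailrec
  def loop(start: Int, acc: List[String]): List[String] =
    if (start >= input.length) acc.reverse
    else {
      val word = (input.length until start by -1)
        .map(end => input.substring(start, end))
        .find(dict.contains)
        .getOrElse(input.substring(start, start + 1)) // fall back to a single character
      loop(start + word.length, word :: acc)
    }
  loop(0, Nil)
}

val greedy = longestMatch(text, toyDict)
```

Here `greedy` is `List("すもも", "もも", "もも", "もも", "の", "うち")`: every piece is a valid dictionary word, yet the particles (も) are swallowed and the result differs from `expected`. A dictionary alone is not enough, which is why grammatical knowledge is needed to pick the right boundaries.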

Ambiguity in Word Boundary
In Japanese, there is no explicit word boundary except punctuation ("、", "。"), in contrast to English, where whitespace acts as a word boundary. Therefore BertTokenizer, which uses whitespace as a word boundary, does not work well with Japanese text.
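The effect of relying on whitespace can be seen with a one-line splitter in plain Scala. `whitespaceTokenize` is a hypothetical helper and far simpler than the real BertTokenizer; it only illustrates the word-boundary assumption.

```scala
// Split on runs of whitespace, dropping empty pieces.
def whitespaceTokenize(text: String): List[String] =
  text.trim.split("\\s+").toList.filter(_.nonEmpty)

val english  = whitespaceTokenize("This is a test")     // four words
val japanese = whitespaceTokenize("これはテストです")     // the whole sentence stays in one piece
```

The English sentence falls apart into four words, while the Japanese sentence comes back as a single unsegmented token, so any later sub-wording step has to work on the whole sentence at once.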

For example, these models do not work well with Japanese text because they use BertTokenizer.
- distiluse-base-multilingual-cased-v2
- quora-distilbert-multilingual
- LaBSE

Ambiguity in Word Boundary
We often rely on a dictionary-based morpheme analyzer to detect Japanese word boundaries (e.g. BertJapaneseTokenizer). Unfortunately, it brings in additional dependencies, and it is not easy to incorporate it into a portable model format in a programming-language-agnostic way.

The Variety of Characters in Use
日本語の文章にはひらがな・カタカナ・漢字や 123 などの半角英数字、１２３などの全角英数字、一、二、三などの漢数字、ABC などのアルファベット、記号や絵文字などが含まれる。
Japanese sentences include hiragana (ひらがな), katakana (カタカナ), kanji (漢字), half-width alphanumeric characters such as 123, full-width alphanumeric characters such as １２３, Chinese numerals such as "一", "二" and "三", alphabetic characters such as ABC, and various symbols and emoji. Tokenizers should take into account that some orders and combinations of character kinds are more likely than others. See https://github.com/tanreinama/Japanese-BPEEncoder_V2/blob/master/README.md for more details.
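On the JVM, part of this variety can be inspected and reduced with the standard library alone. The sketch below assumes Unicode NFKC normalization as the cleanup step, which is an assumption for illustration: the normalization a given model expects depends on its tokenizer configuration. The helper names are hypothetical.

```scala
import java.text.Normalizer
import java.lang.Character.UnicodeScript

// Fold compatibility variants, e.g. full-width alphanumerics and half-width katakana.
def nfkc(text: String): String =
  Normalizer.normalize(text, Normalizer.Form.NFKC)

// Unicode script of each code point, e.g. HIRAGANA, KATAKANA, HAN, LATIN.
def scripts(text: String): List[UnicodeScript] =
  text.codePoints().toArray().toList.map(cp => UnicodeScript.of(cp))

val folded = nfkc("１２３ＡＢＣ")   // full-width digits and letters become ASCII
val kinds  = scripts("あカ漢A")     // one script per character
```

`folded` is `"123ABC"`, and `kinds` is `List(HIRAGANA, KATAKANA, HAN, LATIN)`. A pre-processing step like this narrows the character variety before tokenization, but it must match what the model saw during training.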

Must Choose Right Tokenizer & Model Combination for Preciseness
For preciseness, we must choose the right tokenizer and model combination because
- a model assumes its inputs come from the right tokenizer with the right configuration
- tokenizers commonly use different sub-wording algorithms (WordPiece, BPE, SentencePiece, etc.)
- the id assigned to a token also differs depending on the tokenizer

Examples of Model & Tokenizer Combination
Model                                                  | Tokenizer       | Optimized for Japanese
sentence-transformers/paraphrase-xlm-r-multilingual-v1 | SentencePiece   | No
intfloat/multilingual-e5-large                         | SentencePiece   | No
pkshatech/GLuCoSE-base-ja                              | MLukeTokenizer  | Yes
cl-nagoya/sup-simcse-ja-base                           | MeCab+wordpiece | Yes

Must Choose Right Tokenizer & Model Combination for Portability
For portability, we must choose the right tokenizer and model combination because
- some tokenizers are available in multiple programming languages while others are not
- some tokenizers depend on less common implementations dedicated to Japanese

Examples of Non-Python Tokenizers
- Rust: Huggingface Tokenizers and its binding in Scala
- Java: DJL WordPiece Tokenizer and DJL SentencePiece Binding
- Java: DJL HuggingFace Tokenizers
- ONNX: NLP operators

Some Tokenizers Are Available From JVM
    //> using dep "com.lihaoyi::pprint:0.9.0"
    //> using dep "ai.djl.huggingface:tokenizers:0.28.0"
    import pprint.*
    import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

    val modelName = "sentence-transformers/paraphrase-xlm-r-multilingual-v1"
    val tknz = HuggingFaceTokenizer.newInstance(modelName)
    val encoding = tknz.encode("これはテストです.")
    pprintln(encoding.getTokens())
    //=> Array("<s>", "▁これは", "テスト", "です", ".", "</s>")
    pprintln(tknz.decode(encoding.getIds()))
    //=> "<s> これはテストです.</s>"

Some Japanese Tokenizers Are NOT Portable
    # Python
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/sup-simcse-ja-base")
Here, it builds the tokenizer from the configuration, tokenizer model and vocabulary files, read from either a remote or a local location. It also sets up the MeCab morpheme analyzer for Japanese word segmentation, which is only available from Python.

Cannot Instantiate Some HF Tokenizers
    val tknz = HuggingFaceTokenizer.newInstance("cl-nagoya/sup-simcse-ja-base")
    //=> Error

Hack: Implement Tokenization & Encoding
For example, the cl-nagoya/sup-simcse-ja-base model uses the MeCab morpheme analyzer in combination with Wordpiece.
    {
      ...
      "subword_tokenizer_type": "wordpiece",
      "sudachi_kwargs": null,
      "tokenizer_class": "BertJapaneseTokenizer",
      "unk_token": "[UNK]",
      "word_tokenizer_type": "mecab"
    }

On JVM, MeCab is not available, but Sudachi is available.
We can use the Wordpiece implementation in Deep Java Library. Note that using Sudachi instead of MeCab increases the ratio of [UNK] tokens, because Wordpiece expects its input to be segmented by MeCab.

Hack: Setup Wordpiece
    //> using dep "ai.djl:api:0.28.0"
    //> using dep "com.worksap.nlp:sudachi:0.7.3"
    import java.nio.file.Path
    import ai.djl.modality.nlp.DefaultVocabulary
    import ai.djl.modality.nlp.bert.WordpieceTokenizer

    val voc = DefaultVocabulary
      .builder()
      // download vocab.txt from
      // https://huggingface.co/tohoku-nlp/bert-base-japanese-v3/blob/main/vocab.txt
      .addFromTextFile(Path.of("vocab.txt"))
      .optUnknownToken("[UNK]")
      .build()
    // Java constructors take positional arguments from Scala.
    val wp = WordpieceTokenizer(
      voc,     // vocabulary
      "[UNK]", // unknown token
      512      // max input chars
    )

Hack: Setup Morpheme Analyzer
    import com.worksap.nlp.sudachi
    import com.worksap.nlp.sudachi.DictionaryFactory

    // download system_core.dic from
    // http://sudachi.s3-website-ap-northeast-1.amazonaws.com/sudachidict/
    val settings =
      s"""|{
          |  "systemDict" : "system_core.dic",
          |  "oovProviderPlugin" : [
          |    {
          |      "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          |      "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*"]
          |    }
          |  ]
          |}""".stripMargin
    val sudachiTokenizer = (new DictionaryFactory)
      .create(
        sudachi.Config.fromJsonString(
          settings,
          sudachi.PathAnchor.none()
        )
      ).create()

Hack: Get Model Inputs
    import com.worksap.nlp.sudachi.Tokenizer.SplitMode
    import scala.jdk.CollectionConverters.* // for asScala

    // 1. split the text into words with Sudachi
    val words = sudachiTokenizer
      .tokenize(SplitMode.C, "これはテストです")
      .asScala
      .map(_.surface())
    // 2. split words into sub-words, then map each sub-word to its id
    val inputIds = words
      .flatMap(wp.tokenize.andThen(_.asScala))
      .map(voc.getIndex)
      .toArray
    // We must also prepare token type ids, the attention mask and padding.
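The remaining inputs mentioned in the last comment can be derived from the token ids with plain Scala. The following is a sketch under assumptions, not the talk's implementation: `ModelInput` and `toModelInput` are hypothetical names, and the ids of [CLS], [SEP] and [PAD] are passed in as parameters because they depend on the vocabulary (with DJL they could be looked up via `voc.getIndex`). It also assumes a single-sentence input, so all token type ids are 0.

```scala
// Hypothetical container for the three arrays a BERT-like model usually expects.
final case class ModelInput(
  inputIds: Array[Long],
  attentionMask: Array[Long],
  tokenTypeIds: Array[Long]
)

def toModelInput(
  tokenIds: Seq[Long],
  clsId: Long,
  sepId: Long,
  padId: Long,
  maxLength: Int
): ModelInput = {
  require(maxLength >= 2, "need room for [CLS] and [SEP]")
  val body = tokenIds.take(maxLength - 2) // truncate, leaving room for the special tokens
  val ids = (clsId +: body) :+ sepId      // [CLS] tokens... [SEP]
  val padding = maxLength - ids.length
  ModelInput(
    inputIds = (ids ++ Seq.fill(padding)(padId)).toArray,
    // 1 for real tokens, 0 for padding
    attentionMask = (Seq.fill(ids.length)(1L) ++ Seq.fill(padding)(0L)).toArray,
    // single sentence: every position belongs to segment 0
    tokenTypeIds = Array.fill(maxLength)(0L)
  )
}
```

For example, `toModelInput(Seq(10L, 11L, 12L), clsId = 2L, sepId = 3L, padId = 0L, maxLength = 8)` produces input ids `2, 10, 11, 12, 3, 0, 0, 0` and an attention mask of five ones followed by three zeros. Each array can then be wrapped in a tensor of shape (1, maxLength).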

Create OnnxTensor
    import java.nio.LongBuffer

    ...
    "input_ids" -> OnnxTensor.createTensor(
      env,
      LongBuffer.wrap(inputIds.toArray),
      Array(1L, inputIds.length.toLong) // shape: (batch size, sequence length)
    ),
    ...
    )

Get Embedding
    import scala.jdk.OptionConverters.* // for toScala

    val result = sess.run(
      Map(
        "input_ids" -> OnnxTensor.createTensor(
          env,
          LongBuffer.wrap(inputIds.toArray),
          Array(1L, inputIds.length.toLong)
        ),
        "attention_mask" -> ...
      ).asJava
    )
    val emb = result
      .get("sentence_embedding")
      .toScala
      .get
      .getValue()
      .asInstanceOf[Array[Array[Float]]]

Summary: How to Use Japanese Embedding in Scala?
1. Find an appropriate model
2. Convert the model to ONNX
3. Load the ONNX model on ONNX Runtime
4. Encode the input text to a tensor using HuggingFaceTokenizer from DJL, or manually implement a custom encoder
5. Feed the tensor to the model

Summary: Tokenizer & Model Requirements
The tokenizer should
- be available from JVM (pure JVM language or via FFI binding)
- use a morpheme analyzer available from JVM in combination with a Wordpiece or SentencePiece tokenizer
The model should
- return an embedding
- be convertible to ONNX

Summary: What We Need To Use Japanese Embedding Models
- Java ONNX Runtime
- Deep Java Library (or equivalent tokenizer bindings)
- Classical NLP tooling (e.g. text normalizer, morpheme analyzer)
- Python and optimum-cli

References & Learning Materials
Deep Java Library supports various machine learning tasks as well as model training and tuning.
- https://docs.djl.ai/

References & Learning Materials
ONNX
- https://onnx.ai/onnx/intro/
- https://onnxruntime.ai/docs/
- https://huggingface.co/docs/optimum/index

References & Learning Materials
Articles focusing on Japanese NLP
- https://medium.com/axinc/bertjapanesetokenizer-日本語bert向けトークナイザ-7b54120aa245
- https://tech.yellowback.net/posts/sentence-transformers-japanese-models
- https://speakerdeck.com/nttcom/exploring-publicly-available-japanese-embedding-models

References & Learning Materials Code is available at https://github.com/i10416/embedding-demo
