Slide 1

Slide 1 text

Embedding Under the Hood: What We Need To Use Japanese Embedding Models in Scala

Slide 2

Slide 2 text

WHOAMI
SWE @ [REDACTED]
OSS Contributor
Get In Touch
GitHub: i10416
X(Twitter): @by110416
Speaker Deck: i10416
You can call me "Ito-san"

Slide 3

Slide 3 text

What You'll Learn
- Overview of Embedding in NLP
- How to Call Embedding Models from Scala (& other JVM languages)

Slide 4

Slide 4 text

Overview of Embedding

Slide 5

Slide 5 text

Overview: What Is Embedding "Embedding" refers to the process of extracting the characteristics of natural language into a representation that is easy to manipulate mathematically and programmatically. The typical representation is a real-valued vector.

Slide 6

Slide 6 text

Note There are various ways of obtaining embeddings from text, but in this presentation, "embedding" refers to the context-aware feature vectors obtained from large language models, more specifically, from Hugging Face Transformers models.

Slide 7

Slide 7 text

Overview: Embedding Use Cases For example, how different are Scala, Kotlin, Go, and Python? We can naively compare words by Levenshtein distance, but it is not context-aware. One big advantage of LLMs is their context-awareness. For example, the vector of "cats" (the Scala FP library) is distant from the vector of "cats" (the animal). This is because the model weights in the network are adjusted to the knowledge in the training data.
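For illustration, a minimal Scala sketch of such a naive comparison (not from the original slides). Levenshtein distance counts character edits only, so it cannot distinguish the two senses of "cats" at all:

def levenshtein(a: String, b: String): Int =
  // dp(i)(j) = edit distance between a.take(i) and b.take(j)
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if i == 0 then j else if j == 0 then i else 0
  }
  for i <- 1 to a.length; j <- 1 to b.length do
    val substitutionCost = if a(i - 1) == b(j - 1) then 0 else 1
    dp(i)(j) = List(
      dp(i - 1)(j) + 1,                    // deletion
      dp(i)(j - 1) + 1,                    // insertion
      dp(i - 1)(j - 1) + substitutionCost  // substitution
    ).min
  dp(a.length)(b.length)

levenshtein("Scala", "Koala") // small edit distance, but says nothing about semantics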

Slide 8

Slide 8 text

By converting words into vectors using an LLM, we can measure the semantic distance between words.
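Semantic distance is commonly measured by cosine similarity between the vectors. A minimal Scala sketch (not from the original slides):

def cosineSimilarity(a: Array[Float], b: Array[Float]): Double =
  require(a.length == b.length, "embeddings must have the same dimension")
  val dot   = a.indices.map(i => a(i).toDouble * b(i)).sum
  val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
  val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
  dot / (normA * normB) // close to 1.0 = pointing the same way = semantically similar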

Slide 9

Slide 9 text

Overview: Embedding Use Cases
- Semantic search
- Classification
- Clustering
- Recommendation

Slide 10

Slide 10 text

Overview: Steps to Get Embedding
1. Find a model on Hugging Face
2. Pre-process
3. Encode
4. Model application: either utilize a BERT(-like) model's output hidden state, or use a Sentence Transformers model

Slide 11

Slide 11 text

Example: How to Get Embeddings in Python with BERT

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
encoded_input = tokenizer("...", return_tensors="pt")
output = model(**encoded_input)
cls_embedding = output.last_hidden_state[:, 0, :]  # hidden state of the [CLS] token

It is known that the last hidden state of the [CLS] token represents sentence characteristics (to some extent).

Slide 12

Slide 12 text

Example: How to Get Embeddings in Python with Sentence Transformers

from sentence_transformers import SentenceTransformer

sentences = ["..."]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

Sentence Transformers models map a sentence directly to a dense vector.

Slide 13

Slide 13 text

In Python, the Hugging Face Transformers package encapsulates the transformation pipeline. However, we have to strip the abstraction away to understand the internals.

Slide 14

Slide 14 text

It goes through the following transformation pipeline under the hood.

String
// 1: pre-process
=> String                   // cleanup, normalization
// 2: tokenize
=> Seq[String] | Seq[Byte]  // (sub-)word segmentation, special token assignment
// 3: encoding
=> Seq[Number]              // map tokens to ids
=> Tensor                   // adapt input ids to the expected tensor shape
// 4: model application
=> Tensor

Slide 15

Slide 15 text

Overview: Tokenizer & Model Are the Two Important Concepts of the Transformers Package

Slide 16

Slide 16 text

Overview: A Model Is a Neural Network: Tensor => Tensor

Slide 17

Slide 17 text

Overview: Tokenizer Is an Adaptor for Each Natural Language In general, BERT-like models take real-valued tensors as input. We need to transform text into numerical values so that we can feed them into a model. This is where the tokenizer comes in.

Slide 18

Slide 18 text

Overview: Tokenization

1: original: "This is a great example"
2: word segmentation: [This, is, a, great, example]
3: subwording: [This, is, a, gre, ##at, ex, ##ample]
4: encoding: [42, 24, 6, 8, 4, 16, 110]

Tokenization splits a sentence into tokens. A token is either a word or a part of a word. Encoding assigns a unique number to each token.

Slide 19

Slide 19 text

How to Use Embedding Models in Scala? ...and what makes it difficult, especially for Japanese?

Slide 20

Slide 20 text

Models Are Easier to Use from the JVM than Tokenizers because:
- the model interface is decoupled from natural-language domain knowledge
- there is ONNX, a language-agnostic machine learning model format
- there is ONNX Runtime, which can run ONNX models in various languages

Slide 21

Slide 21 text

What Is ONNX? ONNX is an open format built to represent machine learning models. ONNX includes:
- the ONNX serialization format (protobuf)
- ONNX Runtime for C, Java, Python, JavaScript, C#, etc.
- the ONNX language (operators) and extensions

Slide 22

Slide 22 text

What Is an ONNX Model? A model is a kind of graph, and its nodes are compositions of ONNX operators.

Slide 23

Slide 23 text

What Is an ONNX Model? ONNX Runtime is pluggable: developers can add custom operators to perform various tasks, including mathematical computation, tokenization for NLP, and even API calls. In the same way that JavaScript APIs differ between runtimes, ONNX Runtime can be extended with runtime-specific operations.

Slide 24

Slide 24 text

ONNX Limitation We must use a Sentence Transformers model, which converts text directly into an embedding, because we cannot extract hidden states from the ONNX model's result.

Slide 25

Slide 25 text

Examples of SentenceTransformer Models Supporting Japanese Some models do not work well with Japanese. We need models that support Japanese text.
- intfloat/multilingual-e5-large
- sentence-transformers/paraphrase-xlm-r-multilingual-v1
- cl-nagoya/sup-simcse-ja-base
- pkshatech/GLuCoSE-base-ja

Slide 26

Slide 26 text

Show Me The Code! This is the outline of getting an embedding in Scala. There are some blanks to fill in this example.

//> using dep "com.microsoft.onnxruntime:onnxruntime:1.18.0"
import ai.onnxruntime.*
import scala.jdk.CollectionConverters.* // for .asJava

val env: OrtEnvironment = OrtEnvironment.getEnvironment()
val sess: OrtSession = env.createSession("path/to/model.onnx")
val input: Map[String, OnnxTensor] = Map(
  "input_ids" -> ???,
  "attention_mask" -> ???,
  ...
)
val result = sess.run(input.asJava)

Slide 27

Slide 27 text

Question: How to Get model.onnx?

val sess: OrtSession = env.createSession("path/to/model.onnx")

We need model files that are not tied to a training & inference framework such as torch.

Slide 28

Slide 28 text

Convert a Hugging Face Model to ONNX There are some options for converting a model to ONNX, but optimum is the easiest one. Options:
- optimum
- torch.onnx.export
- convert_graph_to_onnx

Slide 29

Slide 29 text

Convert a Hugging Face Model into ONNX with optimum Hugging Face provides optimum, a tool to convert a model into ONNX.

optimum-cli export onnx --model {model identifier} {out dir}

optimum-cli generates {out dir}/model.onnx.
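For example, to export one of the multilingual models listed earlier (the output directory name here is arbitrary):

optimum-cli export onnx --model sentence-transformers/paraphrase-xlm-r-multilingual-v1 onnx_out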

Slide 30

Slide 30 text

Detour: Convert a Hugging Face Model into ONNX with torch.onnx.export We can use torch.onnx.export instead of optimum when we need to add some layers to the model.

import torch.onnx

model.eval()
dummy_input = ???
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=10,
    do_constant_folding=True,
    input_names=["input_ids", "token_type_ids", "attention_mask"],
    output_names=["output"]
)

Slide 31

Slide 31 text

Detour: Convert a Hugging Face Model into ONNX (Legacy) In Hugging Face Transformers version <= 4, there is a convert_graph_to_onnx module. Avoid this API if possible, as it is marked deprecated.

from pathlib import Path
from transformers.convert_graph_to_onnx import convert

model_location: str = "path/to/model/dir"
output_location: str = "path/to/output/dir"
convert(
    pipeline_name="...",
    framework="pt",
    model=model_location,
    tokenizer=model_location,
    output=Path(output_location),
    opset=12
)

Slide 32

Slide 32 text

Once you export a model as ONNX, you can load it with ONNX Runtime.

Slide 33

Slide 33 text

Question: How to Get Model Input & Output?

val input: Map[String, OnnxTensor] = Map(
  "input_ids" -> ???,
  "attention_mask" -> ???,
  ...
)

Slide 34

Slide 34 text

Use the getInputInfo and getOutputInfo APIs

val sess: OrtSession = env.createSession("path/to/model.onnx")
val (i, o) = (sess.getInputInfo, sess.getOutputInfo)
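For instance, a minimal sketch (reusing the session above) that prints every expected input and output name along with its node info:

sess.getInputInfo.forEach((name, info) => println(s"input:  $name -> $info"))
sess.getOutputInfo.forEach((name, info) => println(s"output: $name -> $info"))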

Slide 35

Slide 35 text

Visit https://netron.app/ and upload model.onnx to check model input and output

Slide 36

Slide 36 text

Question: How To Get the Input Tensor?

val input: Map[String, OnnxTensor] = Map(
  "input_ids" -> ???,
  "attention_mask" -> ???,
  ...
)

Slide 37

Slide 37 text

Use HF Tokenizers If Possible Deep Java Library has a Hugging Face Tokenizers module. It can load some tokenizers from Hugging Face.

//> using dep "ai.djl.huggingface:tokenizers:0.28.0"
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

val t = HuggingFaceTokenizer
  .newInstance("sentence-transformers/paraphrase-xlm-r-multilingual-v1")
val encoding = t.encode("これはテストです")
val (inputIds, typeIds, attentionMask) =
  (encoding.getIds(), encoding.getTypeIds(), encoding.getAttentionMask())

Slide 38

Slide 38 text

What Does a Tokenizer Actually Do?
- pre-process: (some tokenizers) perform cleanup and/or normalization
- tokenize: split text into words and sub-words, which are referred to as "tokens"
- special token handling: add special tokens such as [CLS], [SEP], or [PAD]
- encoding: map tokens: Seq[String] | Seq[Byte] to ids: Seq[Number]
- decoding: map ids: Seq[Number] back to tokens: Seq[String] | Seq[Byte]

Some Japanese embedding models perform tasks specific to Japanese text in the pre-process and tokenize phases, which are difficult for HuggingFaceTokenizer to support.

Slide 39

Slide 39 text

Difficulties Specific to Japanese Tokenization (& Embedding) Japanese tokenization is more complicated than English tokenization. It often requires special treatment, which makes a model less portable. The difficulties are:
- ambiguity in word boundaries
- the variety of characters in use

Slide 40

Slide 40 text

Ambiguity in Word Boundary

すもももももももものうち => [すもも, も, もも, も, もも, の, うち]

This is a well-known tongue twister. It means "Both Japanese plums and peaches are kinds of peach." It is not easy even for native speakers to find the correct boundaries at a glance.

Slide 41

Slide 41 text

Ambiguity in Word Boundary In Japanese, there is no explicit word boundary except punctuation ("、", "。"), in contrast to English, where whitespace acts as a word boundary. Therefore, BertTokenizer, which uses whitespace as a word boundary, does not work well with Japanese text.

Slide 42

Slide 42 text

For example, these models do not work well with Japanese text because they use BertTokenizer:
- distiluse-base-multilingual-cased-v2
- quora-distilbert-multilingual
- LaBSE

Slide 43

Slide 43 text

Ambiguity in Word Boundary We often rely on a dictionary-based morphological analyzer to detect Japanese word boundaries (e.g. BertJapaneseTokenizer). Unfortunately, it brings additional dependencies, and it is not easy to incorporate into a portable model format in a programming-language-agnostic way.

Slide 44

Slide 44 text

The Variety of Characters in Use

日本語の文章にはひらがな・カタカナ・漢字や 123 などの半角英数字、123 などの全角英数字、一二、三などの漢数字、ABC などのアルファベット、記号や絵文字などが含まれる。

Japanese sentences include hiragana (ひらがな), katakana (カタカナ), kanji (漢字), half-width alphanumeric characters such as 123, full-width alphanumeric characters such as 123, Chinese numerals such as 一, 二, and 三, and alphabetic characters such as ABC, as well as various other symbols and emojis. Tokenizers should account for the fact that some orders and combinations of character types are more likely than others. See https://github.com/tanreinama/Japanese-BPEEncoder_V2/blob/master/README.md for more details.

Slide 45

Slide 45 text

Must Choose the Right Tokenizer & Model Combination for Preciseness For preciseness, we must choose the right tokenizer and model combination (see the sketch below) because...
- a model assumes its inputs come from the right tokenizer with the right configuration
- tokenizers commonly differ in their sub-wording algorithm (WordPiece, BPE, SentencePiece, etc.)
- the token id of a given token also differs depending on the tokenizer
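To see the last point concretely, here is a sketch assuming both tokenizers can be loaded through DJL's HuggingFaceTokenizer (introduced later in this deck): the same text encodes to different ids under different tokenizers.

import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

val text = "これはテストです"
val xlmr = HuggingFaceTokenizer.newInstance("sentence-transformers/paraphrase-xlm-r-multilingual-v1")
val e5   = HuggingFaceTokenizer.newInstance("intfloat/multilingual-e5-large")
println(xlmr.encode(text).getIds().mkString(", ")) // ids under one vocabulary
println(e5.encode(text).getIds().mkString(", "))   // same text, different ids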

Slide 46

Slide 46 text

Examples of Model & Tokenizer Combinations

Model | Tokenizer | Optimized for Japanese
sentence-transformers/paraphrase-xlm-r-multilingual-v1 | SentencePiece | No
intfloat/multilingual-e5-large | SentencePiece | No
pkshatech/GLuCoSE-base-ja | MLukeTokenizer | Yes
cl-nagoya/sup-simcse-ja-base | MeCab + WordPiece | Yes

Slide 47

Slide 47 text

Must Choose the Right Tokenizer & Model Combination for Portability For portability, we must choose the right tokenizer and model combination because...
- some tokenizers are available in multiple programming languages while others are not
- some tokenizers depend on less common implementations dedicated to Japanese

Slide 48

Slide 48 text

Examples of Non-Python Tokenizers
- Rust: Hugging Face Tokenizers and its binding in Scala
- Java: DJL WordPiece Tokenizer and DJL SentencePiece binding
- Java: DJL HuggingFace Tokenizers
- ONNX: NLP operators

Slide 49

Slide 49 text

Some Tokenizers Are Available from the JVM

//> using dep "com.lihaoyi::pprint:0.9.0"
//> using dep "ai.djl.huggingface:tokenizers:0.28.0"
import pprint.*
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

val modelName = "sentence-transformers/paraphrase-xlm-r-multilingual-v1"
val tknz = HuggingFaceTokenizer.newInstance(modelName)
val encoding = tknz.encode("これはテストです.")
pprintln(encoding.getTokens())
//=> Array("", "▁これは", "テスト", "です", ".", "")
pprintln(tknz.decode(encoding.getIds()))
//=> " これはテストです."

Slide 50

Slide 50 text

Some Japanese Tokenizers Are NOT Portable

# Python
tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/sup-simcse-ja-base")

Here, it reads the tokenizer configuration from either remote or local files, using the tokenizer model and vocabulary files. It also sets up the MeCab morphological analyzer for Japanese word segmentation, which is only available from Python.

Slide 51

Slide 51 text

Cannot Instantiate Some HF Tokenizers

val tknz = HuggingFaceTokenizer.newInstance("cl-nagoya/sup-simcse-ja-base")
//=> Error

Slide 52

Slide 52 text

Hack: Implement Tokenization & Encoding For example, the cl-nagoya/sup-simcse-ja-base model uses the MeCab morphological analyzer in combination with WordPiece.

{
  ...
  "subword_tokenizer_type": "wordpiece",
  "sudachi_kwargs": null,
  "tokenizer_class": "BertJapaneseTokenizer",
  "unk_token": "[UNK]",
  "word_tokenizer_type": "mecab"
}

Slide 53

Slide 53 text

On the JVM, MeCab is not available, but Sudachi is. We can use the WordPiece implementation in Deep Java Library. Note that using Sudachi instead of MeCab increases the ratio of [UNK] tokens, because the WordPiece vocabulary expects input segmented by MeCab.

Slide 54

Slide 54 text

Hack: Set Up WordPiece

//> using dep "ai.djl:api:0.28.0"
//> using dep "com.worksap.nlp:sudachi:0.7.3"
import java.nio.file.Path
import ai.djl.modality.nlp.DefaultVocabulary
import ai.djl.modality.nlp.bert.WordpieceTokenizer

val voc = DefaultVocabulary
  .builder()
  // download vocab.txt from
  // https://huggingface.co/tohoku-nlp/bert-base-japanese-v3/blob/main/vocab.txt
  .addFromTextFile(Path.of("vocab.txt"))
  .optUnknownToken("[UNK]")
  .build()

// arguments: vocabulary, unknown token, max input chars
// (named arguments are not available for Java-defined constructors)
val wp = WordpieceTokenizer(voc, "[UNK]", 512)

Slide 55

Slide 55 text

Hack: Set Up the Morphological Analyzer

import com.worksap.nlp.sudachi
import com.worksap.nlp.sudachi.DictionaryFactory

// download system_core.dic from
// http://sudachi.s3-website-ap-northeast-1.amazonaws.com/sudachidict/
val settings =
  s"""|{
      |  "systemDict" : "system_core.dic",
      |  "oovProviderPlugin" : [
      |    {
      |      "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
      |      "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ]
      |    }
      |  ]
      |}""".stripMargin

val sudachiTokenizer = (new DictionaryFactory)
  .create(
    sudachi.Config.fromJsonString(
      settings,
      sudachi.PathAnchor.none()
    )
  )
  .create()

Slide 56

Slide 56 text

Hack: Get Model Inputs

import scala.jdk.CollectionConverters.*
import com.worksap.nlp.sudachi.Tokenizer.SplitMode

val words = sudachiTokenizer
  .tokenize(SplitMode.C, "これはテストです")
  .asScala
  .map(_.surface())

val inputIds: Array[Long] =
  words.flatMap(w => wp.tokenize(w).asScala).map(voc.getIndex).toArray
// We must also prepare the token type ids, attention mask, and padding
// (see the sketch below).
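A minimal sketch of those remaining inputs for a single, unpadded sentence (an assumption, not from the original slides): the attention mask is all ones, and for single-segment input the token type ids are all zeros. Depending on the model, special token ids such as [CLS] and [SEP] may also need to be added around inputIds.

val attentionMask: Array[Long] = Array.fill(inputIds.length)(1L) // 1 = real token, 0 = padding
val tokenTypeIds: Array[Long]  = Array.fill(inputIds.length)(0L) // single segment => all zeros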

Slide 57

Slide 57 text

Create OnnxTensor

import java.nio.LongBuffer

val input = Map(
  "input_ids" -> OnnxTensor.createTensor(
    env,
    LongBuffer.wrap(inputIds),
    Array(1L, inputIds.length.toLong)
  ),
  ...
)

Slide 58

Slide 58 text

Get Embedding

import scala.jdk.OptionConverters.* // for .toScala on java.util.Optional

val result = sess.run(
  Map(
    "input_ids" -> OnnxTensor.createTensor(
      env,
      LongBuffer.wrap(inputIds),
      Array(1L, inputIds.length.toLong)
    ),
    "attention_mask" -> ...
  ).asJava
)
val emb = result
  .get("sentence_embedding")
  .toScala
  .get
  .getValue()
  .asInstanceOf[Array[Array[Float]]]
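With emb in hand, comparing two sentences reduces to the cosineSimilarity sketch from earlier in this deck (embA and embB are hypothetical results of two such runs):

val similarity = cosineSimilarity(embA(0), embB(0)) // near 1.0 = semantically close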

Slide 59

Slide 59 text

Summary: How to Use Japanese Embedding in Scala?
1. find an appropriate model
2. convert the model to ONNX
3. load the ONNX model on ONNX Runtime
4. encode the input text to a tensor using HuggingFaceTokenizer from DJL, or manually implement a custom encoder
5. feed the tensor to the model

Slide 60

Slide 60 text

Summary: Tokenizer & Model Requirements The tokenizer should:
- be available from the JVM (pure JVM implementation or via FFI binding)
- use a morphological analyzer available from the JVM in combination with a WordPiece or SentencePiece tokenizer

The model should:
- return an embedding
- be convertible to ONNX

Slide 61

Slide 61 text

Summary: What We Need To Use Japanese Embedding Models
- Java ONNX Runtime
- Deep Java Library (or equivalent tokenizer bindings)
- classical NLP tooling (e.g. text normalizer, morphological analyzer)
- Python & optimum-cli

Slide 62

Slide 62 text

References & Learning Materials Deep Java Library supports various machine learning tasks as well as model training and tuning. https://docs.djl.ai/

Slide 63

Slide 63 text

References & Learning Materials ONNX
- https://onnx.ai/onnx/intro/
- https://onnxruntime.ai/docs/
- https://huggingface.co/docs/optimum/index

Slide 64

Slide 64 text

References & Learning Materials Articles focusing on Japanese NLP
- https://medium.com/axinc/bertjapanesetokenizer-日本語bert向けトークナイザ-7b54120aa245
- https://tech.yellowback.net/posts/sentence-transformers-japanese-models
- https://speakerdeck.com/nttcom/exploring-publicly-available-japanese-embedding-models

Slide 65

Slide 65 text

References & Learning Materials Code is available at https://github.com/i10416/embedding-demo