
ScalaMatsuri2024: How to Use Embedding Models from Scala

June 08, 2024

Recent natural language processing (NLP) work often relies on embedding models. In this area there are tons of code samples written in Python, but few, if any, in other programming languages.

Have you ever wondered what happens behind the scenes when you run those embedding model samples? How can we use these models, Japanese embedding models in particular, from Scala (or other JVM languages)?
In this session, I will give an overview of Japanese embedding models and explain how to call them from Scala (and other JVM languages).



  1. WHOAMI

    SWE @ [REDACTED]
    OSS Contributor
    Get In Touch
    GitHub: i10416
    X (Twitter): @by110416
    Speaker Deck: i10416
    You can call me "Ito-san"
  2. What You'll Learn

    Overview of Embedding in NLP
    How to Call Embedding Models from Scala (& other JVM languages)
  3. Overview: What Is Embedding

    "Embedding" refers to the process of extracting the characteristics of natural language into a representation that is easy to manipulate mathematically and programmatically. A typical representation is a real-valued vector.
  4. Note

    There are various ways of obtaining embeddings from text, but in this presentation, "embedding" refers to a context-aware feature vector obtained from large language models, more specifically, Hugging Face Transformers models.
  5. Overview: Embedding Use Cases

    For example, how different are Scala, Kotlin, Go and Python? We can naively compare words by Levenshtein distance, but it is not context-aware. One big advantage of LLMs is their context-awareness. For example, the vector of cats (Scala FP library) is distant from the vector of cats (animal). This is because the model weights in the network are adjusted to the knowledge in the training data.
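
The difference between surface-level and vector-based comparison can be sketched in a few lines of Python. The two `cats` vectors below are hypothetical numbers invented for this illustration, not the output of any real model; only the mechanics (edit distance versus cosine similarity) are the point.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Identical spelling gives distance 0, whatever the intended meaning is.
print(levenshtein("cats", "cats"))  # 0

# Hypothetical context-aware vectors (made up for illustration only).
cats_library = [0.9, 0.1, 0.0]
cats_animal = [0.0, 0.2, 0.9]
print(cosine_similarity(cats_library, cats_animal))  # close to 0: dissimilar
```

Edit distance only sees characters, so it cannot separate the two senses of "cats", while vectors produced in context can end up far apart.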
  6. Overview: Steps to Get Embedding

    1. Find a model from Hugging Face
    2. Pre-process
    3. Encode
    4. Model Application
        Utilize BERT(-like) Model Output Hidden State
        Use Sentence Transformer Models
  7. Example: How to Get Embeddings in Python with BERT

        from transformers import BertModel, BertTokenizer

        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        model = BertModel.from_pretrained("bert-base-uncased")
        encoded_input = tokenizer("...", return_tensors="pt")
        output = model(**encoded_input)
        # shape: [batch, tokens, hidden]; position 0 on the token axis is [CLS]
        output.last_hidden_state

    It is known that the last hidden state of the [CLS] token represents sentence characteristics (to some extent).
  8. Example: How to Get Embeddings in Python with Sentence Transformers

        from sentence_transformers import SentenceTransformer

        sentences = ["..."]
        model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        embeddings = model.encode(sentences)

    Sentence Transformers models map a sentence directly to a dense vector.
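
How a sentence model gets from per-token vectors to a single vector can be sketched as follows. Many Sentence Transformers models add a pooling step on top of a BERT-like encoder, and mean pooling under the attention mask is a common choice; whether a particular model uses it is an assumption to check against that model's configuration. The numbers are made up.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1; padding positions are ignored."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    return [x / count for x in total]

# Made-up 2-dimensional hidden states: 3 real tokens followed by 1 padding token.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
print(mean_pool(hidden, mask))  # [3.0, 4.0]
```

The padding vector does not affect the result because its mask value is 0.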
  9. In Python, the Hugging Face Transformers package encapsulates the transformation pipeline

    However, we have to strip the abstraction away to understand the internals.
  10. It goes through the following transformation pipeline under the hood.

        String
        // 1: pre-process
          => String // cleanup・normalization
        // 2: tokenize
          => Seq[String] | Seq[Byte] // (sub-)word segmentation・special tokens assignment
        // 3: encoding
          => Seq[Number] // map tokens to ids
          => Tensor // adapt input ids to expected tensor shape
        // 4: model application
          => Tensor
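
The stages before model application can be mimicked with plain functions. This is only a toy sketch: the vocabulary, the NFKC-plus-lowercase normalization, and the whitespace segmentation are assumptions chosen for brevity, and real tokenizers are considerably more involved.

```python
import unicodedata

# Toy vocabulary; a real model ships its own vocabulary file.
VOCAB = {"[UNK]": 0, "this": 1, "is": 2, "a": 3, "test": 4}

def preprocess(text: str) -> str:
    # 1: cleanup and normalization (NFKC + lowercasing, chosen for this sketch)
    return unicodedata.normalize("NFKC", text).strip().lower()

def tokenize(text: str) -> list[str]:
    # 2: naive whitespace segmentation
    return text.split()

def encode(tokens: list[str]) -> list[int]:
    # 3a: map tokens to ids, falling back to [UNK]
    return [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]

def to_batch(ids: list[int]) -> list[list[int]]:
    # 3b: adapt to the shape [batch_size=1, sequence_length]
    return [ids]

input_ids = to_batch(encode(tokenize(preprocess("  This is a TEST "))))
print(input_ids)  # [[1, 2, 3, 4]]
```

The nested list stands in for the tensor that would be handed to the model in step 4.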
  11. Overview: Tokenizer Is an Adaptor for Each Natural Language

    In general, BERT-like models take numerical tensors as input. We need to transform text into numerical values so that we can feed them into a model. This is where the tokenizer comes in.
  12. Overview: Tokenization

    1: original "This is a great example"
    2: word segmentation [This, is, a, great, example]
    3: subwording [This, is, a, gre, ##at, ex, ##ample]
    4: encoding [42, 24, 6, 8, 4, 16, 110]

    Tokenization splits a sentence into tokens. A token is either a word or a part of a word. Encoding assigns a unique number to each token.
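
Steps 3 and 4 can be reproduced with a greedy longest-match-first split, which is how WordPiece-style sub-wording is usually described. The vocabulary below is a toy one built to match the example above; a real vocabulary would be much larger and would likely split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first sub-word split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary mirroring the ids in the example above.
vocab = {"This": 42, "is": 24, "a": 6, "gre": 8, "##at": 4, "ex": 16, "##ample": 110}

words = "This is a great example".split()
tokens = [p for w in words for p in wordpiece(w, vocab)]
ids = [vocab[t] for t in tokens]
print(tokens)  # ['This', 'is', 'a', 'gre', '##at', 'ex', '##ample']
print(ids)     # [42, 24, 6, 8, 4, 16, 110]
```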
  13. How to Use Embedding Models in Scala?

    ... and what makes it difficult, especially in Japanese?
  14. Models Are Easier to Use from JVM than Tokenizers because

    The model interface is decoupled from natural language domain knowledge
    There is ONNX, a language-agnostic machine learning model format
    There is ONNX Runtime, which can run ONNX models from various languages
  15. What Is ONNX?

    ONNX is an open format built to represent machine learning models. ONNX includes
    ONNX serialization format (protobuf)
    ONNX runtime for C, Java, Python, JavaScript, C#, etc.
    ONNX language (operators) and extensions
  16. What Is ONNX Model?

    A model is a kind of graph, and its nodes are compositions of ONNX operators.
  17. What Is ONNX Model?

    ONNX Runtime is pluggable: developers can add custom operators to perform various tasks, including mathematical computation, tokenization for NLP, and even API calls. This is similar to the way the available JavaScript APIs differ between runtimes.
  18. ONNX Limitation

    We must use a Sentence Transformers model, which converts text directly into an embedding, because we cannot extract hidden states from the ONNX model result.
  19. Examples of SentenceTransformer supporting Japanese

    Some models do not work well with Japanese. We need models that support Japanese text.
    intfloat/multilingual-e5-large
    sentence-transformers/paraphrase-xlm-r-multilingual-v1
    cl-nagoya/sup-simcse-ja-base
    pkshatech/GLuCoSE-base-ja
  20. Show Me The Code!

    This is the outline of getting an embedding in Scala. There are some blanks to fill in in this example.

        //> using dep "com.microsoft.onnxruntime:onnxruntime:1.18.0"
        import ai.onnxruntime.*
        import scala.jdk.CollectionConverters.* // for .asJava

        val env: OrtEnvironment = OrtEnvironment.getEnvironment()
        val sess: OrtSession = env.createSession("path/to/model.onnx")
        val input: Map[String, OnnxTensor] = Map(
          "input_ids" -> ???,
          "attention_mask" -> ???,
          ...
        )
        val result = sess.run(input.asJava)
  21. Question: How to Get model.onnx?

        val sess: OrtSession = env.createSession("path/to/model.onnx")

    We need model files that are not tied to a training & inference framework such as torch.
  22. Convert a Hugging Face Model to ONNX

    There are several options for converting a model to ONNX, but optimum is the easiest one.

    Options
    optimum
    torch.onnx.export
    convert_graph_to_onnx
  23. Convert a Hugging Face Model into ONNX with optimum

    Hugging Face provides optimum, a tool to convert a model into ONNX.

        optimum-cli export onnx --model {model identifier} {out dir}

    optimum-cli generates {out dir}/model.onnx.
  24. Detour: Convert a Hugging Face Model into ONNX with torch.onnx.export

    We can use torch.onnx.export instead of optimum when we need to add some layers to the model.

        import torch.onnx

        model.eval()
        dummy_input = ???
        torch.onnx.export(
            model,
            dummy_input,
            "model.onnx",
            export_params=True,
            opset_version=10,
            do_constant_folding=True,
            input_names=["input_ids", "token_type_ids", "attention_mask"],
            output_names=["output"],
        )
  25. Detour: Convert a Hugging Face Model into ONNX (Legacy)

    With Hugging Face Transformers Version <= 4, there is a convert_graph_to_onnx module. Avoid this API if possible, as it is marked deprecated.

        from pathlib import Path
        from transformers.convert_graph_to_onnx import convert

        model_location: str = "path/to/model/dir"
        output_location: str = "path/to/output/dir"
        convert(
            pipeline_name="...",
            framework="pt",
            model=model_location,
            tokenizer=model_location,
            output=Path(output_location),
            opset=12,
        )
  26. Question: How to Get Model Input & Output?

        val input: Map[String, OnnxTensor] = Map(
          "input_ids" -> ???,
          "attention_mask" -> ???,
          ...
        )
  27. Question: How To Get Input Tensor?

        val input: Map[String, OnnxTensor] = Map(
          "input_ids" -> ???,
          "attention_mask" -> ???,
          ...
        )
  28. Use HF Tokenizers If Possible

    Deep Java Library has a Hugging Face Tokenizers module. It can load some tokenizers from Hugging Face.

        //> using dep "ai.djl.huggingface:tokenizers:0.28.0"
        import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

        val t = HuggingFaceTokenizer
          .newInstance("sentence-transformers/paraphrase-xlm-r-multilingual-v1")
        val encoding = t.encode("これはテストです")
        val (inputIds, typeIds, attentionMask) =
          (encoding.getIds(), encoding.getTypeIds(), encoding.getAttentionMask())
  29. What Does Tokenizer Actually Do?

    pre-process: (some tokenizers) perform cleanup and/or normalization
    tokenize: split text into words and sub-words, which are referred to as "tokens"
    special token handling: add special tokens such as [CLS], [SEP] or [PAD]
    encoding: map tokens: Seq[String] | Seq[Byte] to ids: Seq[Number]
    decoding: map ids: Seq[Number] back to tokens: Seq[String] | Seq[Byte]

    Some Japanese embedding models perform tasks specific to Japanese text in the pre-process and tokenize phases, which are difficult for HuggingFaceTokenizer to support.
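
Special token handling, encoding and decoding are simple lookups once the text has been split. The sketch below assumes a toy vocabulary and a BERT-style `[CLS] ... [SEP]` layout; the ids are arbitrary and do not correspond to any real model.

```python
# Toy vocabulary; the position in the list is the token id.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "これ", "は", "テスト", "です"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(tokens):
    """Wrap tokens in [CLS] ... [SEP] and map each token to its id."""
    with_special = ["[CLS]"] + tokens + ["[SEP]"]
    return [TOKEN_TO_ID.get(t, TOKEN_TO_ID["[UNK]"]) for t in with_special]

def decode(ids):
    """Map ids back to tokens."""
    return [VOCAB[i] for i in ids]

ids = encode(["これ", "は", "テスト", "です"])
print(ids)                # [2, 4, 5, 6, 7, 3]
print(decode(ids))        # ['[CLS]', 'これ', 'は', 'テスト', 'です', '[SEP]']
print(encode(["すもも"]))  # [2, 1, 3]: unknown tokens fall back to [UNK]
```

The hard part for Japanese is producing the token list that goes into `encode`, not the lookup itself.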
  30. Difficulties specific to Japanese Tokenization (& Embedding)

    Japanese tokenization is more complicated than English tokenization. It often requires special treatment, which makes a model less portable. The difficulties are
    ambiguity in word boundaries
    the variety of characters in use
  31. Ambiguity in Word Boundary

    すもももももももものうち => [すもも, も, もも, も, もも, の, うち]

    It is a well-known tongue twister. It means "Japanese plums and peaches are both kinds of peach". It is not easy even for native speakers to find the correct boundaries at a glance.
  32. Ambiguity in Word Boundary

    In Japanese, there is no explicit word boundary except punctuation ("、", "。"), in contrast to English, where whitespace acts as a word boundary. Therefore, BertTokenizer, which uses whitespace as a word boundary, does not work well with Japanese text.
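
The effect is easy to see with plain whitespace splitting. This is a simplified picture: a real BertTokenizer also handles punctuation and other details, but the lack of whitespace in Japanese text is the core of the problem.

```python
english = "This is a great example"
japanese = "これはテストです"

# Whitespace gives usable word boundaries for English...
print(english.split())   # ['This', 'is', 'a', 'great', 'example']

# ...but a Japanese sentence comes back as one undivided chunk.
print(japanese.split())  # ['これはテストです']
```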
  33. For example, these models do not work well with Japanese text because they use BertTokenizer.

    distiluse-base-multilingual-cased-v2
    quora-distilbert-multilingual
    LaBSE
  34. Ambiguity in Word Boundary

    We often rely on a dictionary-based morpheme analyzer to detect Japanese word boundaries (e.g. BertJapaneseTokenizer). Unfortunately, it brings additional dependencies, and it is not easy to incorporate into a portable model format in a programming-language-agnostic way.
  35. The Variety of Characters in Use

    日本語の文章にはひらがな・カタカナ・漢字や123などの半角英数字、１２３などの全角英数字、一、二、三などの漢数字、ABCなどのアルファベット、記号や絵文字などが含まれる。

    Japanese sentences include hiragana (ひらがな), katakana (カタカナ), kanji (漢字), half-width alphanumeric characters such as 123, full-width alphanumeric characters such as １２３, Chinese numerals such as "一", "二", and "三", and alphabetic characters such as ABC, as well as various other symbols and emojis. Tokenizers should take into account that some orders and combinations of character kinds are more likely than others. See https://github.com/tanreinama/Japanese-BPEEncoder_V2/blob/master/README.md for more details.
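
A small sketch of why normalization matters with this many character kinds. It uses Unicode NFKC normalization from the Python standard library; whether a given tokenizer applies NFKC (or some other normalization) is an assumption to verify per model. The helper `kind` is a hypothetical name used only here.

```python
import unicodedata

def kind(ch: str) -> str:
    """Rough character kind: the first word of the Unicode character name."""
    return unicodedata.name(ch).split()[0]

print(kind("ひ"), kind("カ"), kind("漢"), kind("1"), kind("１"))
# HIRAGANA KATAKANA CJK DIGIT FULLWIDTH

# Full-width letters and digits and half-width katakana fold into canonical forms.
text = "Ｓｃａｌａ １２３ ｶﾀｶﾅ"
normalized = unicodedata.normalize("NFKC", text)
print(normalized)  # Scala 123 カタカナ
```

Without such a step, "123" and "１２３" are different strings and would map to different tokens.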
  36. Must Choose Right Tokenizer & Model Combination for Preciseness

    For preciseness, we must choose the right tokenizer and model combination because...
    a model assumes its inputs come from the right tokenizer with the right configuration
    different tokenizers usually use different sub-wording algorithms (WordPiece, BPE, SentencePiece, etc.)
    the id of a token also differs depending on the tokenizer
  37. Examples of Model & Tokenizer Combination

    Model                                                   Tokenizer        Optimized for Japanese
    sentence-transformers/paraphrase-xlm-r-multilingual-v1  SentencePiece    No
    intfloat/multilingual-e5-large                          SentencePiece    No
    pkshatech/GLuCoSE-base-ja                               MLukeTokenizer   Yes
    cl-nagoya/sup-simcse-ja-base                            MeCab+wordpiece  Yes
  38. Must Choose Right Tokenizer & Model Combination for Portability

    For portability, we must choose the right tokenizer and model combination because...
    some tokenizers are available in multiple programming languages while others are not
    some tokenizers depend on less common implementations dedicated to Japanese
  39. Examples of Non-Python Tokenizers

    Rust: Hugging Face Tokenizers and its binding in Scala
    Java: DJL WordPiece Tokenizer and DJL SentencePiece Binding
    Java: DJL HuggingFace Tokenizers
    ONNX: NLP operators
  40. Some Tokenizers Are Available From JVM

        //> using dep "com.lihaoyi::pprint:0.9.0"
        //> using dep "ai.djl.huggingface:tokenizers:0.28.0"
        import pprint.*
        import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

        val modelName = "sentence-transformers/paraphrase-xlm-r-multilingual-v1"
        val tknz = HuggingFaceTokenizer.newInstance(modelName)
        val encoding = tknz.encode("これはテストです.")
        pprintln(encoding.getTokens())
        //=> Array("<s>", "▁これは", "テスト", "です", ".", "</s>")
        pprintln(tknz.decode(encoding.getIds()))
        //=> "<s> これはテストです.</s>"
  41. Some Japanese Tokenizers Are NOT Portable

        # Python
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/sup-simcse-ja-base")

    Here, it reads the tokenizer configuration from either remote or local files, using the tokenizer model and vocabulary files. It also sets up the MeCab morpheme analyzer for Japanese word segmentation, which is only available from Python.
  42. Hack: Implement Tokenization & Encoding

    For example, the cl-nagoya/sup-simcse-ja-base model uses the MeCab morpheme analyzer in combination with WordPiece.

        {
          ...
          "subword_tokenizer_type": "wordpiece",
          "sudachi_kwargs": null,
          "tokenizer_class": "BertJapaneseTokenizer",
          "unk_token": "[UNK]",
          "word_tokenizer_type": "mecab"
        }
  43. On JVM, MeCab is not available, but Sudachi is available.

    We can use the WordPiece implementation in Deep Java Library. Using Sudachi instead of MeCab increases the ratio of [UNK] tokens, because WordPiece expects its input to be segmented by MeCab.
  44. Hack: Setup Wordpiece

        //> using dep "ai.djl:api:0.28.0"
        //> using dep "com.worksap.nlp:sudachi:0.7.3"
        import java.nio.file.Path
        import ai.djl.modality.nlp.DefaultVocabulary
        import ai.djl.modality.nlp.bert.WordpieceTokenizer

        val voc = DefaultVocabulary
          .builder()
          // download vocab.txt from
          // https://huggingface.co/tohoku-nlp/bert-base-japanese-v3/blob/main/vocab.txt
          .addFromTextFile(Path.of("vocab.txt"))
          .optUnknownToken("[UNK]")
          .build()
        // Positional arguments: (vocabulary, unknown, maxInputChars).
        // Named arguments are avoided here because the constructor is defined in Java.
        val wp = WordpieceTokenizer(voc, "[UNK]", 512)
  45. Hack: Setup Morpheme Analyzer

        import com.worksap.nlp.sudachi
        import com.worksap.nlp.sudachi.DictionaryFactory

        // download system_core.dic from
        // http://sudachi.s3-website-ap-northeast-1.amazonaws.com/sudachidict/
        val settings =
          s"""|{
              |  "systemDict" : "system_core.dic",
              |  "oovProviderPlugin" : [
              |    {
              |      "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
              |      "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*"]
              |    }
              |  ]
              |}""".stripMargin
        val sudachiToknizer = (new DictionaryFactory)
          .create(
            sudachi.Config.fromJsonString(
              settings,
              sudachi.PathAnchor.none()
            )
          ).create()
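  46. Hack: Get Model Inputs

        import com.worksap.nlp.sudachi.Tokenizer.SplitMode
        import scala.jdk.CollectionConverters.* // for .asScala

        // Segment the text into words with Sudachi, then split each word with WordPiece.
        val words = sudachiToknizer
          .tokenize(SplitMode.C, "これはテストです")
          .asScala
          .map(_.surface())
        val inputIds = words.flatMap(wp.tokenize.andThen(_.asScala)).map(voc.getIndex)
        // We must prepare token type ids, attention mask and paddings.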
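
The remaining inputs follow mechanically from the id sequences. Here is a language-neutral sketch in Python of padding, attention mask and token type ids for single-sentence inputs; `pad_id=0` is an assumption, and the real value must come from the tokenizer's vocabulary.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every id sequence to the longest one and build the matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask, token_type_ids = [], [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token
        token_type_ids.append([0] * max_len)  # single sentence: all zeros
    return input_ids, attention_mask, token_type_ids

ids, mask, types = pad_batch([[2, 4, 5, 3], [2, 6, 3]])
print(ids)    # [[2, 4, 5, 3], [2, 6, 3, 0]]
print(mask)   # [[1, 1, 1, 1], [1, 1, 1, 0]]
print(types)  # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

Each of the three nested lists has the shape [batch_size, sequence_length] that the model input tensors are expected to have.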
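  47. Get Embedding

        import java.nio.LongBuffer
        import scala.jdk.OptionConverters.* // for .toScala

        // LongBuffer.wrap expects inputIds to be an Array[Long].
        val result = sess.run(
          Map(
            "input_ids" -> OnnxTensor.createTensor(
              env,
              LongBuffer.wrap(inputIds),
              Array(1, inputIds.length.toLong) // shape: [batch, sequence]
            ),
            "attention_mask" -> ...
          ).asJava
        )
        val emb = result
          .get("sentence_embedding")
          .toScala
          .get
          .getValue()
          .asInstanceOf[Array[Array[Float]]]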
  48. Summary: How to Use Japanese Embedding in Scala?

    1. find an appropriate model
    2. convert the model to ONNX
    3. load the ONNX model on ONNX Runtime
    4. encode the input text to a tensor using HuggingFaceTokenizer from DJL, or manually implement a custom encoder
    5. feed the tensor to the model
  49. Summary: Tokenizer & Model Requirements

    The tokenizer should
        be available from JVM (pure JVM language or via FFI binding)
        use a morpheme analyzer available from JVM in combination with a WordPiece or SentencePiece tokenizer
    The model should
        return an embedding
        be convertible to ONNX
  50. Summary: What We Need To Use Japanese Embedding Models

    Java ONNX Runtime
    Deep Java Library (or Equivalent Tokenizer Bindings)
    Classical NLP Toolings (e.g. Text Normalizer, Morpheme Analyzer)
    Python・optimum-cli
  51. References & Learning Materials

    Deep Java Library supports various machine learning tasks as well as model training and tuning.
    https://docs.djl.ai/
  52. References & Learning Materials

    Articles focusing on Japanese NLP
    https://medium.com/axinc/bertjapanesetokenizer-日本語bert向けトークナイザ-7b54120aa245
    https://tech.yellowback.net/posts/sentence-transformers-japanese-models
    https://speakerdeck.com/nttcom/exploring-publicly-available-japanese-embedding-models