The concept of developing a new kana-kanji conversion method in the era of LLM

Masahiko Hashimoto @ openSUSE.Asia Summit 2024, Tokyo

2/18 Who am I
Name: Masahiko Hashimoto (Shika)
Belongs to:
– Japan openSUSE User Group (maid for miscellaneous tasks?)
– Tokaido Linux User Group
Hobby: Writing novels
Work: Building AI in a research lab (federated learning)

3/18 Today's Table of Contents
1. Current status of Kana-Kanji conversion on openSUSE
2. Implementation of a conversion algorithm using an LLM
3. Creating an OSS Kana-Kanji conversion dictionary using word embedding
4. Collecting and preserving a corpus using open data
5. Future operation plans using federated learning
6. Development of an input method that runs on IBus on openSUSE
...in 15 minutes??? (Is that possible?)

4/18 What’s Kana-Kanji conversion?
Japanese is written with a mix of hiragana and kanji.
Example: "I am Masahiko Hashimoto"
Hiragana: わたしははしもとまさひこです ← keyboard input
Kanji & hiragana: 私は橋本雅彦です
Text is typed on the keyboard in hiragana and is then converted into a mix of hiragana and kanji.

5/18 Current status of Kana-Kanji conversion on openSUSE
IBus and Mozc are used for Kana-Kanji conversion in openSUSE.
• Mozc: Kana-Kanji conversion engine released by Google as OSS in 2010
• IBus: Accepts input from the keyboard and passes it to Mozc
Both are essential for kana-kanji conversion, but...

6/18 Current status of Kana-Kanji conversion on openSUSE
IBus and Mozc are used for Kana-Kanji conversion in openSUSE.
• Mozc: Kana-Kanji conversion engine released by Google as OSS in 2010
  → 2010?? 14 years ago???
• IBus: Accepts input from the keyboard and passes it to Mozc
  → … (I won't touch it today)
Both are essential for kana-kanji conversion, but...

7/18 Now is the era of Generative AI
The basic operation of a generative AI such as an LLM is to predict the next word.
Approach: Can this next-word prediction be used for kana-kanji conversion?

8/18 What’s Transformer? (BERT & GPT)
Current generative AI is based on the Transformer model.
https://heidloff.net/article/foundation-models-transformers-bert-and-gpt/
• Classification
• Question answering
• Summarization
• Named entity recognition
• Translation
• Generation

9/18 An approach to Kana-Kanji conversion using generative AI
① Generate the next word using the Transformer decoder model.
② Compare the generated word with the kana-kanji conversion candidates.
③ Output the closest match as the conversion result.

10/18 The old version of ANYA
I had already developed the Kana-Kanji conversion engine “ANYA” using a generative AI model, but it did not use the Transformer model.
I confirmed that it was reasonably accurate.
[Figure: AutoEncoder-type model. Input: current word → Z → Output: next word (predicted)]

11/18 Methods?
① Generate the next word using the Transformer decoder model.
② Compare the generated word with the kana-kanji conversion candidates.
③ Output the closest match as the conversion result.
How do we compare?

12/18 Methods?
Ex: 私のなまえはなかのです
Candidates: 名前　生絵　艶絵　菜真恵
[Figure labels: correct · Fixed (prompt) · Generate · Word embedding]
Compare the generated word one by one with the embedded vectors previously prepared in the dictionary.
Calculate the distance between the generated word and each candidate word and use it as the cost.
→ The costs are added up, and the route with the lowest cost is used as the final conversion result.
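
The cost accumulation described on this slide can be sketched roughly as follows. This is an illustration only, not ANYA's actual code: the Euclidean distance, the `best_route` function, the lattice layout, and the toy 2-dimensional vectors are all assumptions made for the example, and the slide does not say which distance ANYA uses. As a further simplification, the generated vector is looked up by start position instead of being produced by a decoder.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_route(reading, lattice, predicted):
    """Return (total_cost, surfaces) for the cheapest route, or None.

    lattice:   list of (start, end, surface, vector) candidate nodes
               covering reading[start:end].
    predicted: dict mapping a start position to the vector generated for
               the word beginning there (stand-in for the decoder output).
    """
    n = len(reading)
    # best[i] holds the cheapest (cost, surfaces) that covers reading[:i].
    best = {0: (0.0, [])}
    for pos in range(n):
        if pos not in best:
            continue  # no route reaches this position
        base_cost, base_path = best[pos]
        for start, end, surface, vector in lattice:
            if start != pos:
                continue
            # Node cost = distance between generated and candidate vectors.
            cost = base_cost + distance(predicted[pos], vector)
            if end not in best or cost < best[end][0]:
                best[end] = (cost, base_path + [surface])
    return best.get(n)

# Toy data: hypothetical 2-dimensional embeddings.
lattice = [
    (0, 3, "名前", [0.9, 0.1]),
    (0, 3, "生絵", [0.1, 0.9]),
    (3, 4, "は", [0.0, 1.0]),
    (3, 4, "葉", [1.0, 0.0]),
]
predicted = {0: [1.0, 0.0], 3: [0.0, 1.0]}

cost, surfaces = best_route("なまえは", lattice, predicted)
print("".join(surfaces), round(cost, 3))
```

Because every node ends after it starts, positions can be processed left to right and each partial route is final by the time it is extended.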

13/18 ANYA dictionary using word embedding
Ex: 私のなまえはなかのです
Candidates: 名前　生絵　艶絵　菜真恵
[Figure labels: Fixed (prompt) · Generate · Word embedding]
ANYA dictionary (64-bit word embedding stored as a decimal number):
なまえ, 名前, 11287
なまえ, 生絵, 2343201
なまえ, 艶絵, 11140832
なまえ, 菜真恵, 2344215
Decimal to binary conversion:
11287 (DEC)
　↓
10110000010111 (BIN)
　↓
[0,0...1,0,1,1,0,0,0,0,0,1,0,1,1,1]
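
The decimal-to-binary step can be sketched as below, using the four dictionary entries from this slide. The helper names `to_bits` and `hamming`, the most-significant-bit-first ordering, the use of Hamming distance as the comparison, and the "generated" value 11285 are assumptions for illustration; the slide only states that the decimal value is expanded into a 64-bit 0/1 vector.

```python
def to_bits(value, width=64):
    """Expand a non-negative integer into `width` bits, most significant first."""
    if value < 0 or value >= 1 << width:
        raise ValueError("value does not fit in the given width")
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]

def hamming(a, b):
    """Number of positions where two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# Entries copied from the slide: (reading, surface, embedding as decimal).
entries = [
    ("なまえ", "名前", 11287),
    ("なまえ", "生絵", 2343201),
    ("なまえ", "艶絵", 11140832),
    ("なまえ", "菜真恵", 2344215),
]

# Hypothetical model output: differs from 11287 in a single bit.
generated_bits = to_bits(11285)

closest = min(entries, key=lambda e: hamming(to_bits(e[2]), generated_bits))[1]
print(to_bits(11287)[-14:])
print(closest)
```

Storing one integer per word keeps the dictionary a plain text table, while the bit vector is rebuilt only when a comparison is needed.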

14/18 Issue
Conversion speed
わたしのなまえはなかのです → 1.5 sec to convert (!???)
Conversion accuracy
→ Not yet at a level I can show you
The LLM was kept small for the sake of conversion speed, but this reduced accuracy as well.
→ Continuing to investigate ways to achieve both speed and accuracy

15/18 Collecting and preserving corpus using open data
One of the problems is the lack of collected data for training.
Where to collect it from?
1. Aozora-Bunko *Contains archaic words
2. Wikipedia *Is the license okay? What is fair use?
3. Others

16/18 One of the solutions? : Federated Learning
A method of collecting and aggregating AI models on a server, rather than collecting training data on a server.
→ Learning is possible without collecting input history
Should we incorporate it into ANYA?
https://en.wikipedia.org/wiki/Federated_learning
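
As a rough sketch of what "aggregating models on a server" can mean, the snippet below averages client model weights, weighted by how many examples each client trained on (the common federated averaging scheme). Whether ANYA would use this scheme is not stated in the talk; the `federated_average` function and the flat weight lists are assumptions for illustration.

```python
def federated_average(client_updates):
    """Combine client models into one server model.

    client_updates: list of (weights, num_examples), where `weights` is a
    flat list of floats and all clients share the same model shape.
    Only weights reach the server; the input history stays on each client.
    """
    total = sum(n for _, n in client_updates)
    if total == 0:
        raise ValueError("no training examples reported")
    size = len(client_updates[0][0])
    averaged = [0.0] * size
    for weights, n in client_updates:
        if len(weights) != size:
            raise ValueError("clients must share the same model shape")
        for i, w in enumerate(weights):
            # Clients with more local examples contribute proportionally more.
            averaged[i] += w * n / total
    return averaged

# Two hypothetical clients with 1 and 3 local training examples.
server_weights = federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)])
print(server_weights)
```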

17/18 Development Status of ANYA
Server side (conversion engine server): https://github.com/anya-im/Anya
IBus input method: https://github.com/anya-im/ibus-anya
Thank you, Syuta Hashimoto!
Older versions (AutoEncoder model) are already available.
Please wait a little longer for the new version (Transformer model).
All developed on openSUSE.

18/18 Thank you for your attention.
