The concept of developing a new kana-kanji conversion method in the era of LLM

Masahiko Hashimoto

June 09, 2026

Transcript

  1. The concept of developing a new kana-kanji conversion method in the era of LLM

    Masahiko Hashimoto @ openSUSE.Asia Summit 2024 Tokyo
  2. Who am I?

    Name: Masahiko Hashimoto (Shika)
    Affiliations:
    – Japan openSUSE User Group (Maid for miscellaneous tasks?)
    – Tokaido Linux User Group
    Hobby: Writing novels
    Work: Building AI in a research lab (Federated Learning)
  3. Today's Table of Contents

    1. Current status of Kana-Kanji conversion on openSUSE
    2. Implementation of a conversion algorithm using an LLM
    3. Creating an OSS Kana-Kanji conversion dictionary using word embedding
    4. Collecting and preserving a corpus using open data
    5. Future operation plans using federated learning
    6. Development of input methods that run on IBus on openSUSE
    ...15 minutes??? (Is it possible?)
  4. What's Kana-Kanji conversion?

    Written Japanese mixes hiragana and kanji.
    Example: "I am Masahiko Hashimoto"
    Hiragana: わたしははしもとまさひこです ← keyboard input
    Kanji & hiragana: 私は橋本雅彦です
    Text is typed on the keyboard in hiragana and then converted into mixed hiragana and kanji.
  5. Current status of Kana-Kanji conversion on openSUSE

    IBus and Mozc are used for Kana-Kanji conversion in openSUSE.
    • Mozc: Kana-Kanji conversion engine released by Google as OSS in 2010
    • IBus: Accepts input from the keyboard and passes it to Mozc
    Both are essential for Kana-Kanji conversion, but...
  6. Current status of Kana-Kanji conversion on openSUSE

    IBus and Mozc are used for Kana-Kanji conversion in openSUSE.
    • Mozc: Kana-Kanji conversion engine released by Google as OSS in 2010
      → 2010?? 14 years ago???
    • IBus: Accepts input from the keyboard and passes it to Mozc
      → … (I won't touch it today)
    Both are essential for Kana-Kanji conversion, but...
  7. Now is the era of Generative AI

    The basic operation of a generative AI model such as an LLM is to predict the next word.
    Approach: Can this next-word prediction be used for Kana-Kanji conversion?
  8. What's Transformer? (BERT & GPT)

    Current generative AI is based on the Transformer model.
    https://heidloff.net/article/foundation-models-transformers-bert-and-gpt/
    • Classification
    • Questions and answers
    • Summarization
    • Named entity recognition
    • Translation
    • Generation
  9. An approach to Kana-Kanji conversion using generative AI

    ① Generate the next word using the Transformer decoder model.
    ② Compare the generated word with the Kana-Kanji conversion candidates.
    ③ Output the closest match as the conversion result.
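The three steps above can be sketched roughly as follows. This is a minimal illustration, not ANYA's actual code: the toy vectors, the `closest_candidate` helper, and the choice of Euclidean distance are all assumptions made for the example, and the "generated" vector stands in for whatever the Transformer decoder would produce.

```python
import math

def closest_candidate(generated_vec, candidates):
    """Return the candidate whose embedding is nearest to the generated one.

    `candidates` maps each candidate word to its embedding vector.
    Euclidean distance is an assumption; the slides do not name the metric.
    """
    return min(candidates, key=lambda word: math.dist(generated_vec, candidates[word]))

# Hypothetical 3-dimensional embeddings, invented for this example.
candidates = {
    "名前": [0.9, 0.1, 0.0],
    "生絵": [0.1, 0.8, 0.3],
    "艶絵": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of the word generated in step ①.
generated = [0.8, 0.2, 0.1]

best = closest_candidate(generated, candidates)
print(best)
```

With these toy vectors the generated embedding lies closest to 名前, so that candidate is selected.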
  10. The old version of ANYA

    I had already developed a Kana-Kanji conversion engine, "ANYA", using a generative AI model, but it did not use the Transformer model.
    I confirmed that it was reasonably accurate.
    [Diagram: AutoEncoder-type model. Input: current word → Z → Output: next word (predicted)]
  11. Methods?

    ① Generate the next word using the Transformer decoder model.
    ② Compare the generated word with the Kana-Kanji conversion candidates.
    ③ Output the closest match as the conversion result.
    How do we compare?
  12. Methods?

    Ex: 私のなまえはなかのです
    Candidates: 名前 (correct), 生絵, 艶絵, 菜真恵
    [Diagram labels: Fixed (prompt), Generate, Word embedding]
    Compare the generated word one by one with the embedded vectors prepared in advance in the dictionary.
    Calculate the distance between the generated word and each candidate word and use it as the cost.
    → The costs are added up, and the route with the lowest cost is used as the final conversion result.
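A rough sketch of the "add up the costs and take the cheapest route" idea is shown below. Everything in it is hypothetical: the toy embeddings, the `PREDICTED_NEXT` table standing in for the language model, the fixed segmentation of the input, and the use of Euclidean distance as the cost. It only illustrates a lowest-cost search over a candidate lattice, not how ANYA implements it.

```python
import math

# Hypothetical toy embeddings for candidate words (not from the ANYA dictionary).
EMBED = {
    "私": (1.0, 0.0), "渡し": (0.0, 1.0),
    "の": (0.5, 0.5),
    "名前": (0.9, 0.2), "生絵": (0.1, 0.9),
}

# Stand-in for the model: previous word -> predicted embedding of the next word.
PREDICTED_NEXT = {
    "<s>": (0.9, 0.1),
    "私": (0.5, 0.5), "渡し": (0.5, 0.5),
    "の": (1.0, 0.1),
}

def lowest_cost_route(lattice):
    """Dynamic-programming search for the cheapest route through the lattice.

    `lattice` is a list of candidate lists, one per segment of the input.
    The cost of a step is the distance between the embedding predicted
    from the previous word and the embedding of the chosen candidate.
    """
    best = {"<s>": (0.0, [])}  # last word -> (total cost, route so far)
    for candidates in lattice:
        nxt = {}
        for cand in candidates:
            options = []
            for prev, (cost, route) in best.items():
                step = math.dist(PREDICTED_NEXT[prev], EMBED[cand])
                options.append((cost + step, route + [cand]))
            nxt[cand] = min(options, key=lambda o: o[0])
        best = nxt
    return min(best.values(), key=lambda o: o[0])

# わたし / の / なまえ with two candidates for the first and last segments.
lattice = [["私", "渡し"], ["の"], ["名前", "生絵"]]
total_cost, route = lowest_cost_route(lattice)
print("".join(route))
```

With these invented numbers the cheapest route is 私 → の → 名前. A real engine would also have to search over different segmentations of the kana string, which this sketch leaves out.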
  13. ANYA dictionary using word embedding

    Ex: 私のなまえはなかのです
    Candidates: 名前, 生絵, 艶絵, 菜真恵
    [Diagram labels: Fixed (prompt), Generate, Word embedding]
    ANYA dictionary (reading, word, 64-bit word embedding stored as a decimal number):
    なまえ, 名前, 11287
    なまえ, 生絵, 2343201
    なまえ, 艶絵, 11140832
    なまえ, 菜真恵, 2344215
    Decimal to binary conversion:
    11287 (DEC)
     ↓
    10110000010111 (BIN)
     ↓
    [0,0...1,0,1,1,0,0,0,0,0,1,0,1,1,1]
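The number-to-bit-vector step on this slide can be illustrated with a few lines of Python. The helper names `id_to_bits` and `bits_to_id` are hypothetical, and the most-significant-bit-first, zero-padded 64-element layout is an assumption based on the slide's "64bit" label; the actual ANYA code may store the vector differently.

```python
def id_to_bits(value, width=64):
    """Expand a non-negative integer into a list of `width` bits, MSB first."""
    if not 0 <= value < 2 ** width:
        raise ValueError(f"value does not fit in {width} bits")
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]

def bits_to_id(bits):
    """Inverse of id_to_bits: fold a list of bits back into an integer."""
    result = 0
    for bit in bits:
        result = (result << 1) | bit
    return result

# Dictionary value for なまえ → 名前 taken from the slide.
vector = id_to_bits(11287)
print(len(vector), vector[-16:])
```

Storing the embedding as a single integer keeps each dictionary entry compact, and the bit list can be rebuilt on load with one shift-and-mask pass.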
  14. Issues

    Conversion speed: わたしのなまえはなかのです → 1.5 sec to convert (!???)
    Conversion accuracy: → not yet at a level I can show you
    A small LLM was intended to improve conversion speed, but it lowered accuracy as well.
    → Continuing to investigate ways to achieve both speed and accuracy
  15. Collecting and preserving corpus using open data

    One of the problems is the lack of collected training data.
    Where to collect from?
    1. Aozora-Bunko *Contains archaic words
    2. Wikipedia *Is that license okay? What is fair use?
    3. Others
  16. One of the solutions?: Federated Learning

    A method that collects and aggregates AI models on a server, rather than collecting training data on a server.
    → Learning is possible without collecting input history.
    Should we incorporate it into ANYA?
    https://en.wikipedia.org/wiki/Federated_learning
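One common way to aggregate models on a server is a weighted average of client weights, in the style of federated averaging (FedAvg). The sketch below is only an illustration of that general idea: the `federated_average` function and the toy numbers are hypothetical, and the slides do not say which aggregation scheme ANYA would use.

```python
def federated_average(client_updates):
    """Average client model weights, weighted by each client's example count.

    `client_updates` is a list of (weights, num_examples) pairs, where
    `weights` is a flat list of floats of the same length for every client.
    Only the weights reach the server; the raw input history stays local.
    """
    total_examples = sum(count for _, count in client_updates)
    size = len(client_updates[0][0])
    return [
        sum(weights[i] * count for weights, count in client_updates) / total_examples
        for i in range(size)
    ]

# Two hypothetical clients: one trained on 1 example, one on 3.
updates = [([0.0, 2.0], 1), ([4.0, 6.0], 3)]
global_weights = federated_average(updates)
print(global_weights)
```

The client with more local examples pulls the global weights further toward its own values, which is the usual motivation for weighting by example count.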
  17. Development Status of ANYA

    Server side (conversion engine server): https://github.com/anya-im/Anya
    IBus input method: https://github.com/anya-im/ibus-anya
    Thank you, Syuta Hashimoto!
    Older versions (AutoEncoder model) are already available.
    Please wait a little longer for the new version (Transformer model).
    All developed on openSUSE.