[Architecture diagram: each task (STS-MT, TTS, STT) is framed as token-to-token generation. Speech data and text data are converted into speech tokens and text tokens, each sequence is prefixed with a task-specific system prompt, and the LFM2 vocabulary is expanded to include the speech tokens.]
Parallel speech translation data is not readily available, so I created my own (a code sketch of the pipeline follows below).
Process:
1. Start with the Emilia dataset (a multilingual text-and-speech dataset).
2. Use LFM2-ENJP-MT to translate the transcripts (JP↔EN).
3. Use the Llasa-3B (EN) and Anime-Llasa-3B (JP) TTS models to synthesize audio from the translated text.
Resulting dataset:
• Original JP → Synthetic EN: 492.32 hours
• Original EN → Synthetic JP: 488.28 hours
• Total: ~980 hours of parallel speech data
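The construction pipeline boils down to a simple loop: take an Emilia utterance and its transcript, translate the transcript, and synthesize speech for the translation. The sketch below illustrates that loop in Python; `translate` and `synthesize` are hypothetical placeholders standing in for LFM2-ENJP-MT and Llasa-3B / Anime-Llasa-3B inference, and the Emilia split and field names are assumptions rather than the exact schema.

```python
from datasets import load_dataset

def translate(text: str, direction: str) -> str:
    """Placeholder for LFM2-ENJP-MT inference (JP<->EN text translation)."""
    raise NotImplementedError

def synthesize(text: str, language: str):
    """Placeholder for TTS inference with Llasa-3B (EN) / Anime-Llasa-3B (JP).

    Should return a waveform (or a discrete speech-token sequence) for `text`.
    """
    raise NotImplementedError

def build_parallel_pairs(src_lang: str, tgt_lang: str, split: str = "train"):
    """Pair each original Emilia utterance with synthetic speech in the other language."""
    # Split and field names below are assumptions; adjust to the actual
    # Emilia-Dataset schema on the Hugging Face Hub.
    ds = load_dataset("amphion/Emilia-Dataset", split=split, streaming=True)
    for sample in ds:
        src_audio = sample["audio"]   # original speech
        src_text = sample["text"]     # original transcript
        tgt_text = translate(src_text, direction=f"{src_lang}->{tgt_lang}")
        tgt_audio = synthesize(tgt_text, language=tgt_lang)
        yield {
            "source_audio": src_audio,
            "source_text": src_text,
            "target_text": tgt_text,
            "target_audio": tgt_audio,
        }
```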
The base LFM2 model has no prior knowledge of audio tokens: the STS task ([Audio] → [Audio]) is built entirely out of new "words".
Solution: Continual Pre-Training (CPT)
I performed unsupervised CPT on a large mixed corpus of audio and text to familiarize the model with the audio token space (a vocabulary-expansion sketch follows below).
Data used:
• Audio:
◦ Emilia-JA: ~2,800 hours
◦ Emilia-EN: ~5,000 hours (estimated)
• Text:
◦ Japanese Wikipedia (full)
◦ English Wikipedia (~20% of full)
W&B report: https://api.wandb.ai/links/aratako-lm/r5sb7xpe
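Before CPT can see audio, the text tokenizer and embedding matrix have to be expanded to cover the speech tokens (the vocabulary expansion shown in the architecture figure). Below is a minimal sketch using Hugging Face transformers; the checkpoint name, codebook size, and `<|speech_k|>` naming are assumptions, not the exact configuration used here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "LiquidAI/LFM2-1.2B"  # assumed base checkpoint
NUM_SPEECH_CODES = 65536           # assumed codec codebook size

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# One new token per discrete speech code, e.g. <|speech_0|> ... <|speech_65535|>.
speech_tokens = [f"<|speech_{i}|>" for i in range(NUM_SPEECH_CODES)]
num_added = tokenizer.add_tokens(speech_tokens)

# Grow the embedding (and tied LM head) so the new ids get trainable rows.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} speech tokens; vocabulary size is now {len(tokenizer)}")
```

The rows added for the new tokens start from random initialization, which is exactly why an unsupervised CPT stage on speech tokens is needed before any supervised task can work.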
After CPT, I fine-tuned the model on my supervised tasks, with STS-MT as the primary focus (the prompt formatting is sketched below).
Tasks & data:
• STS-MT (primary): ~980 hours of parallel data, trained in both directions (JA→EN and EN→JA), i.e. ~1,960 hours of effective training data.
• STT (auxiliary):
◦ JA: ~230 hours
◦ EN: ~255 hours
• TTS (auxiliary):
◦ JA: ~230 hours
◦ EN: ~255 hours
Trained for 1 epoch.
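All three tasks can be serialized into the same kind of sequence, differing only in the task-specific system prompt and in which side carries speech tokens versus text. The sketch below shows one plausible formatting; the prompt wording and the `<system>`/`<input>`/`<output>` markup are illustrative assumptions, not the templates actually used in training.

```python
# Hypothetical task-specific system prompts; the actual templates may differ.
SYSTEM_PROMPTS = {
    "sts_mt_ja_en": "Translate the Japanese speech into English speech.",
    "stt_ja": "Transcribe the Japanese speech into Japanese text.",
    "tts_en": "Synthesize English speech for the given English text.",
}

def speech_to_tokens(codes):
    """Render discrete codec ids as speech-token strings."""
    return "".join(f"<|speech_{c}|>" for c in codes)

def format_sample(task: str, source: str, target: str) -> str:
    """Serialize one supervised example as a single token sequence."""
    return (
        f"<system>{SYSTEM_PROMPTS[task]}</system>\n"
        f"<input>{source}</input>\n"
        f"<output>{target}</output>"
    )

# STS-MT: speech tokens in, speech tokens out (dummy code ids for illustration).
print(format_sample(
    "sts_mt_ja_en",
    source=speech_to_tokens([12, 873, 4051]),
    target=speech_to_tokens([7, 99, 2048, 31]),
))

# STT: speech tokens in, plain text out.
print(format_sample("stt_ja", source=speech_to_tokens([12, 873]), target="こんにちは。"))
```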
Unfortunately, the goal was not achieved.
• Audio generation (STS & TTS): The model generated audio tokens, but they immediately fell into a repetitive loop and did not form coherent speech.
• Task differentiation (STT): When prompted with the STT task, the model correctly produced text tokens instead of audio tokens. This indicates it learned to distinguish tasks based on the prompt, even though the transcribed text was gibberish (e.g. えっと、えっと、えっと、..., a repeated filler word roughly meaning "um").
Overall, the model learned a very basic aspect of task differentiation but failed completely at the core generation tasks.
The most likely cause was insufficient data.
• Scale mismatch: Modern LLM-based TTS models are often trained on over 100k hours of data from scratch. My ~2,000 hours of total supervised data was likely far too little for the model to learn the complex patterns of speech generation.
• Learning did occur: The model did learn to differentiate tasks (outputting audio tokens for STS/TTS vs. text tokens for STT). This suggests the multi-task setup was partially effective, but generation quality suffered from data scarcity.
• Better approach: Instead of fine-tuning from the base text-only LFM2, starting from a pre-trained audio-centric model (such as LFM2-Audio) might have provided a much better foundation and yielded more successful results.
• Goal: Build a direct Speech-to-Speech translation model by fine-tuning LFM2 on multiple tasks.
• Method: I created a ~1,000-hour synthetic parallel speech corpus and ran a two-step training process: continual pre-training followed by multi-task fine-tuning.
• Outcome: The model failed to generate coherent speech, likely due to a significant lack of training data; however, it demonstrated a basic ability to handle audio inputs for STT. For future work, starting from a pre-trained audio model is highly recommended.
• Datasets:
◦ CPT dataset: https://huggingface.co/datasets/Aratako/LiquidAI-Hackathon-Tokyo-CPT-Data
◦ SFT dataset: https://huggingface.co/datasets/Aratako/LiquidAI-Hackathon-Tokyo-SFT-Data
• Example of an STT result:
◦ Starting to learn the frequent filler words?
• Example of a TTS result:
◦ Starting to learn how to make sounds?
• axolotl: https://github.com/axolotl-ai-cloud/axolotl
• vLLM: https://github.com/vllm-project/vllm
• Emilia: https://huggingface.co/datasets/amphion/Emilia-Dataset
• Liquid AI, Weights and Biases, and Lambda
• ...and many other ML communities and projects!