Hayato Tsukagoshi
July 15, 2025
[Reading-group material] Moshi: a speech-text foundation model for real-time dialogue
Slides introducing the paper that proposes Moshi, a real-time spoken dialogue model.
Transcript
Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
https://arxiv.org/abs/2410.00037
Nagoya Univ. D3, Hayato Tsukagoshi
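Overview (slide 2)
• A paper proposing Moshi, a full-duplex real-time dialogue model
  • The model can produce output while it is still listening to the user's speech
  • Contrast with half-duplex: while one side is speaking, the other cannot speak
• Research from Kyutai, a non-profit research lab based in Paris, France
• Distinguished by an architecture designed around streaming processing throughout
  • User audio, model audio, and model text are all fed to the model at the same time
• Also develops and uses Mimi, a neural audio codec
  • Tokenizes 24,000 Hz audio into a 12.5 Hz token sequence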
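Aside (slide 3)
• Ohashi-kun of the Higashinaka Lab has released a Japanese version of the model
  • The original Moshi fine-tuned on Japanese dialogue data + synthetic data
• It is getting a huge amount of attention…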
Why this paper (slide 4)
• It sits at the cutting edge of deep learning on speech + language, and it is interesting!
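Structure of Moshi (slide 5)
• A 7B model based on an autoregressive Transformer, plus an audio tokenizer
• The audio tokenizer Mimi converts audio into ID sequences so that it can be handled discretely
  • Frame rate (amount of data per second): 12.5
• Inputs: the user's audio, the model's audio, and text (inner monologue)
  • The vectors corresponding to each are summed and fed to the Transformer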
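Mimi (slide 6)
• One of the foundation technologies behind Moshi; 96.2M parameters, built from Conv and Transformer layers (hf)
  • Treats 80 ms as one token; the input sampling rate is 24,000 Hz
• A Neural Audio Codec that converts audio waveforms into discrete audio tokens
  • Adopts the discrete bottleneck known from VQ-VAE
• Two kinds of audio tokens are output: Acoustic Tokens and Semantic Tokens
  • Semantic Token: captures the semantic and phonetic information of the speech
    • Distilled from WavLM embedding representations
  • Acoustic Token: captures fine-grained acoustic features
• The audio waveform is quantized in stages by a Residual Vector Quantizer (RVQ)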
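The figures on this slide (24,000 Hz input, 12.5 Hz frame rate, 80 ms per token) are mutually consistent, which a few lines of arithmetic can confirm. This is only a sanity-check sketch; the helper `num_frames` is a hypothetical name, and dropping a trailing partial frame is an assumption of this sketch rather than something the slides state.

```python
from fractions import Fraction

SAMPLE_RATE_HZ = 24_000          # input sampling rate stated on the slide
FRAME_RATE_HZ = Fraction(25, 2)  # 12.5 frames (tokens per codebook) per second

# Exact rational arithmetic avoids floating-point noise.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # 1920 samples per frame
frame_duration_ms = 1000 / FRAME_RATE_HZ            # 80 ms per frame


def num_frames(duration_s):
    """Whole frames in a clip (assumption: a trailing partial frame is dropped)."""
    return int(duration_s * FRAME_RATE_HZ)
```

For example, `num_frames(10)` gives the number of frames in a ten-second clip, each frame carrying one Semantic Token plus the Acoustic Tokens.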
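Residual Vector Quantization: RVQ (slide 8)
• Quantizes a vector into an ID sequence made of multiple IDs
• Quantization is carried out in stages
  • First, apply vector quantization
  • Next, quantize the difference between the input vector and the quantized vector in the same way
  • Repeat from there
• The model naturally learns to quantize the most important information first
  • A quantizer first picks a vector that represents the input vector as a whole
  • Intuitively, perhaps close to Matryoshka Representation Learning?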
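The staged procedure described above can be sketched in a few lines. This is a toy illustration with hand-written two-dimensional codebooks, not Mimi's implementation: the function names and the codebook values are assumptions made for the example, and real codebooks are learned (the deck's figure shows entries id=0 through id=2047).

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage quantizes the previous residual."""
    ids, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        # The residual (target minus chosen entry) becomes the next target.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return ids


def rvq_decode(codebooks, ids):
    """Reconstruct the vector by summing the chosen entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, ids):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
coarse = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
fine = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

ids = rvq_encode([coarse, fine], [4.9, 0.2])
recon = rvq_decode([coarse, fine], ids)
```

The first stage captures the bulk of the vector and later stages only refine it, which is why truncating the ID sequence still yields a usable, coarser reconstruction.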
[Figure, slides 9-19: RVQ illustration. A codebook with entries id=0, id=1, id=2, id=3, …, id=2047 and a quantization target. At each step the nearest codebook entry is found, its ID is appended to the output ID sequence, and the target minus that entry becomes the next target. The output ID sequence grows as [1, → [1, 3, → [1, 3, 2, and after n rounds is [1, 3, 2, 2047, …, 4].]
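Mimi training overview diagram: heavily simplified (slides 20-21)
[Figure: Mimi Encoder and Mimi Decoder, trained with a reconstruction loss + adversarial loss; cosine similarity against WavLM (❄).]
• Pulls the vectors toward those of a non-causal model while also raising audio quality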
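Mimi overview diagram (slides 22-27)
• Quite a lot of effort has gone into building it
• Raw audio is converted autoregressively into a sequence of vectors
• Acoustic Tokens use RVQ; the Semantic Token uses a linear layer + VQ
• Trained so that the Semantic Token moves close to the WavLM vectors
• The sum is fed to the Decoder, which outputs an audio waveform
• Trained so that the output waveform moves close to the input and sounds like the real thing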
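Moshi architecture overview (slide 28)
• First, build an ordinary autoregressive language model
  • Public English corpus of 2.1T tokens, sequence length 4096, model size 7B
  • The resulting 7B LLM is called Helium
  • At this stage it is simply text-in, text-out
• Next, train with Helium as the base and audio as input and output
  • Even so, it is trained to predict Mimi tokens, so this is not very different from ordinary language modeling (next-token prediction)
• Consists of a Temporal Transformer (initialized from Helium) and a Depth Transformer
  • Together these two are called the RQ-Transformer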
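Moshi architecture diagram (slide 30)
• The Temporal Transformer outputs text
• The Depth Transformer autoregressively outputs the Semantic Token and the Acoustic Tokens
→ Two autoregressive flows: along the time axis and along the codebook axis
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]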
RQ-Transformer architecture diagram (slide 31)
• Vectors are fed into the Temporal Transformer
• Trained with next-token prediction
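The two autoregressive flows (along time and along codebooks) can be pictured as a pair of nested loops. The sketch below is only a control-flow illustration under stated assumptions: `temporal_step` and `depth_step` are hypothetical stand-ins that return dummy integers, not the real Transformers, the text stream is omitted for brevity, and the eight tokens per frame follow the "1+7" count given elsewhere in the deck.

```python
NUM_CODEBOOKS = 8   # 1 Semantic + 7 Acoustic tokens per frame
VOCAB_SIZE = 2048   # codebook size suggested by the RVQ figure (id=0 … id=2047)


def temporal_step(history):
    """Stand-in for the Temporal Transformer: summarizes all past frames."""
    return sum(sum(frame) for frame in history) % 97  # dummy context value


def depth_step(context, tokens_so_far):
    """Stand-in for the Depth Transformer: next token inside the current frame."""
    return (context + 31 * len(tokens_so_far) + sum(tokens_so_far)) % VOCAB_SIZE


def generate(num_frames):
    history = []
    for _ in range(num_frames):          # outer loop: autoregression over time
        context = temporal_step(history)
        frame = []
        for _ in range(NUM_CODEBOOKS):   # inner loop: autoregression over codebooks
            frame.append(depth_step(context, frame))
        history.append(frame)
    return history


frames = generate(3)
```

Splitting the work this way keeps the large model's sequence length at one step per frame, while the smaller inner model handles the tokens within a frame.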
Moshi input overview (slides 32-33)
• At each time step:
  • User audio: 1+7 tokens
  • Model audio: 1+7 tokens
  • Model text: 1 token
→ Multi-stream Modeling
• At each time step everything is summed into a single vector, which is fed to the model
• Original implementation: https://github.com/kyutai-labs/moshi/blob/950e9771dc33d7aa48f80175a189c5c902016df2/moshi/moshi/models/lm.py#L381
  • Hard to believe, but it sums 17 embeddings into a single vector Σ(❛□❛✿)
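The "sum 17 embeddings into one vector" step can be illustrated as follows. This is a toy sketch, not the code at the linked line of `lm.py`: the embedding width, vocabulary size, random table values, and the ordering of streams inside `token_ids` are all assumptions made for the example.

```python
import random

DIM = 4                  # toy embedding width (hypothetical)
VOCAB = 16               # toy vocabulary size per stream (hypothetical)
NUM_STREAMS = 1 + 8 + 8  # model text + model audio (1+7) + user audio (1+7)

rng = random.Random(0)
# One embedding table per stream, filled with random toy values.
tables = [
    [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
    for _ in range(NUM_STREAMS)
]


def embed_frame(token_ids):
    """Look up one embedding per stream and sum them into a single vector."""
    assert len(token_ids) == NUM_STREAMS
    vec = [0.0] * DIM
    for table, tok in zip(tables, token_ids):
        vec = [v + e for v, e in zip(vec, table[tok])]
    return vec


# One time step: 1 text token, 8 model-audio tokens, 8 user-audio tokens.
frame_vec = embed_frame([3] + [5] * 8 + [7] * 8)
```

Because each stream has its own table, the sum keeps the sequence length at one position per time step regardless of how many streams are combined.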
Moshi input overview: the streaming case (slide 34) and the actual input at each time step (slides 35-38, t=2 to t=5)
[Figure: three streams: the model's output audio, the user's input audio, and the model's output text.]
• Input arrives at the model at fixed time intervals
• For streaming processing:
  • The model finishes processing within the fixed interval and emits its output
  • That output is fed back in as input, together with the user's audio for the next time step
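Repurposing for speech recognition (ASR) and speech synthesis (TTS) (slides 39-41)
[Figure: the text and audio streams for ASR and for TTS, with the leading stream marked in each.]
• Moshi's Multi-stream Modeling can easily be applied to ASR and TTS
• Both tasks can be expressed naturally just by changing the offset between streams
  • For ASR, listen to the audio to be transcribed, then output the text
  • For TTS, read the text to be spoken, then output the audio
• ASR (slide 40): the streams are the model's own output text and the user's audio every 80 ms; the next word is predicted from the output vector at that position
• TTS (slide 41): the streams are the text entered by the user and the model's output audio; the next "audio" token is predicted from the output vector at that position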
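The idea that ASR and TTS differ only in the offset between the text and audio streams can be sketched with a small alignment helper. This is an illustration under stated assumptions: the function names, the `PAD` marker, and the sign convention for `text_delay` are hypothetical, and the delay is measured in frames.

```python
PAD = None  # hypothetical marker for "nothing in this stream at this frame"


def delay_stream(stream, delay):
    """Shift a stream right by `delay` frames, padding the start."""
    return [PAD] * delay + list(stream)


def align(audio, text, text_delay):
    """Pair audio and text frame by frame.

    text_delay > 0: text lags audio (ASR-like, listen first then write).
    text_delay < 0: audio lags text (TTS-like, read first then speak).
    """
    audio, text = list(audio), list(text)
    if text_delay >= 0:
        text = delay_stream(text, text_delay)
    else:
        audio = delay_stream(audio, -text_delay)
    length = max(len(audio), len(text))
    audio += [PAD] * (length - len(audio))
    text += [PAD] * (length - len(text))
    return list(zip(audio, text))


asr_like = align(["a0", "a1", "a2"], ["t0", "t1", "t2"], text_delay=2)
tts_like = align(["a0", "a1", "a2"], ["t0", "t1", "t2"], text_delay=-1)
```

In `asr_like` each text token appears two frames after the matching audio, while in `tts_like` the audio trails the text by one frame; the same model layout serves both tasks.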
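Moshi training process (slide 43)
1. Helium pre-training: train a 7B LLM on 2.1T tokens
2. RQ-Transformer pre-training: train on 7 million hours with audio and text as input and output
3. Multi-stream dialogue training: separate the data above by speaker, then train on audio and text simultaneously
4. Fine-tuning on the Fisher dataset
5. Instruction tuning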
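Evaluation experiments (slide 44)
• Evaluation items
  • Helium's ability as an LLM
  • Audio tokenization
  • Ability as an audio LM
  • Spoken QA
  • Dialogue generation quality
  • Streaming ASR and TTS
  • Quantization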
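Evaluation experiments: evaluation as an LLM (slide 45)
• Performance that holds up in comparison with Llama2 and Mistral
  • → Looks good for an LLM trained with a comparable amount of compute
• Would continued training not have been good enough? 🤨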
Evaluation of audio quality (slides 46-47)
• ABX: an automatic metric that uses embedding representations of the audio
• MOSNet: reference-free prediction of audio quality (estimated by a deep learning model)
• MUSHRA: a subjective metric based on human ratings
• Not bad considering that it is causal and highly compressed
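Evaluation as a spoken language model (slide 48)
• sWUGGY: builds a fake word from a real word and measures which one receives the higher probability
  • The original WUGGY is at the text level; sWUGGY runs TTS and evaluates on audio
• sBLIMP: a task of choosing the syntactically correct text
  • This is also evaluated at the audio level
• sStoryCloze: four sentences are given, and the correct fifth sentence must be chosen from two candidates
  • Everything is converted to audio for evaluation
• sTopic-StoryCloze: an easier version of sStoryCloze
• MMLU: evaluates whether the model can solve problems in the usual way using the inner-monologue text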
Evaluation as a spoken language model (slide 49)
• High performance on average across all tasks; it can process text as well as audio
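Dialogue generation quality (slide 50)
• Generated audio is transcribed with Whisper, and the PPL of the dialogue is evaluated with DialoGPT
• Moshi has a low PPL, so its output is natural as dialogue text
• Silences between speakers (Gap, Pause) are also few, and turn-taking happens naturally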
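PPL (perplexity) is the exponential of the average negative log-likelihood per token, so a lower value means the scoring model found the transcript more predictable. The sketch below shows only that formula; it assumes the per-token natural-log probabilities have already been obtained from the scoring model, and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)), where N is the number of tokens.
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

As a quick check of the definition, a transcript whose every token is assigned probability 1/4 has a perplexity of 4.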
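Summary (slide 51)
• Proposes Moshi, a full-duplex spoken dialogue model
  • Built from the neural audio codec Mimi and the RQ-Transformer
• Multi-stream modeling processes user audio, model audio, and text simultaneously
  • Making the whole system causal is what enables streaming processing
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]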