Hayato Tsukagoshi
July 15, 2025
[Reading group slides] Moshi: a speech-text foundation model for real-time dialogue
Slides introducing the paper that proposes Moshi, a real-time spoken dialogue model.
Transcript
Slide 1: Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
https://arxiv.org/abs/2410.00037
Nagoya Univ. D3, Hayato Tsukagoshi
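Slide 2: Overview
• A paper proposing Moshi, a full-duplex real-time dialogue model
  • The model can produce output while it is still listening to the user's speech
  • Contrast with half-duplex: while one side is speaking, the other cannot speak
• Research from Kyutai, a non-profit research lab based in Paris, France
• Its distinguishing feature is an architecture designed around streaming throughout
  • User audio, model audio, and model text are fed into the model simultaneously
• A neural audio codec, Mimi, is also developed and used
  • It tokenizes 24000 Hz audio into a 12.5 Hz token sequence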
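Slide 3: Aside
• Ohashi-kun from the Higashinaka Lab has released a Japanese version of the model
  • The original Moshi fine-tuned on Japanese dialogue data + synthetic data
• It is getting a huge amount of attention...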
Slide 4: Why this paper
• It is at the cutting edge of speech + language deep learning, and it is interesting!
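Slide 5: Components of Moshi
• A 7B autoregressive Transformer-based model + an audio tokenizer
• The audio tokenizer Mimi converts audio into ID sequences so that it can be handled as discrete tokens
  • The frame rate (amount of data per second) is 12.5
• Inputs: the user's audio, the model's audio, and text (inner monologue)
  • The vectors corresponding to each are summed and fed into the Transformer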
Slides 6-7: Mimi
• One of the core technologies behind Moshi; 96.2M parameters, built from Conv and Transformer layers (hf)
  • 80 ms is treated as 1 token; the input sampling rate is 24000 Hz
• A Neural Audio Codec that converts audio waveforms into discrete audio tokens
  • Adopts the discrete bottleneck known from VQ-VAE
• Two kinds of audio token are output: Acoustic Tokens and Semantic Tokens
  • Semantic Token: captures the semantic and phonetic information in the audio
    • Distilled from WavLM embedding representations
  • Acoustic Token: captures fine-grained acoustic features
• The audio waveform is quantized in stages with a Residual Vector Quantizer (RVQ)
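The two rates quoted in the deck (24000 Hz input audio, 12.5 Hz token frames) fix how much audio each token frame covers. The arithmetic can be checked with a few lines of Python; the constant and function names here are illustrative, not taken from the Moshi code base.

```python
SAMPLE_RATE_HZ = 24_000  # input sampling rate stated on the slide
FRAME_RATE_HZ = 12.5     # token frames per second stated on the slide

# Audio samples and milliseconds covered by one token frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
frame_duration_ms = 1000 / FRAME_RATE_HZ


def num_frames(duration_s: float) -> int:
    """Number of whole token frames in `duration_s` seconds of audio."""
    return int(duration_s * FRAME_RATE_HZ)


print(samples_per_frame)  # 1920.0
print(frame_duration_ms)  # 80.0
print(num_frames(10.0))   # 125
```

So one frame compresses 1920 waveform samples, which matches the "80 ms is 1 token" figure on the slide.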
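Slide 8: Residual Vector Quantization: RVQ
• Quantizes a vector into an ID sequence made up of several IDs
• Quantization is carried out in stages
  • First, ordinary vector quantization is applied
  • Next, the difference between the input vector and the quantized vector is quantized in the same way
  • Repeat
• The model naturally learns to quantize the most important information first
  • The first quantizer picks a vector that represents the input vector as a whole
  • Intuitively, perhaps close to Matryoshka Representation Learning?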
Slides 9-19: RVQ illustration
[Figure sequence: a codebook with entries id=0, id=1, id=2, id=3, ..., id=2047 next to the quantization target. At each step the nearest codebook entry to the target is selected, its ID is appended to the output ID sequence, and the residual (target minus selected entry) becomes the next target. The output ID sequence grows as [1, → [1, 3, → [1, 3, 2, and after n steps is [1, 3, 2, 2047, …, 4].]
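The step-by-step picture above (pick the nearest codebook entry, record its ID, subtract, repeat) can be sketched in NumPy. This is a minimal illustration, not Mimi's implementation: the codebooks are random rather than learned, and `dim`, `num_stages`, and the function names are assumptions; only the codebook size of 2048 follows the id=0..id=2047 range in the figure.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize x into one ID per stage; also return the final residual."""
    residual = np.asarray(x, dtype=np.float64).copy()
    ids = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))    # nearest codebook entry
        ids.append(idx)
        residual = residual - cb[idx]  # the next stage quantizes what is left
    return ids, residual


def rvq_decode(ids, codebooks):
    """Reconstruct by summing the selected entry of every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, ids))


rng = np.random.default_rng(0)
dim, codebook_size, num_stages = 8, 2048, 8  # dim and num_stages are toy choices
# Hypothetical random codebooks with shrinking scale; real ones are learned.
codebooks = [
    rng.normal(scale=0.5 ** s, size=(codebook_size, dim))
    for s in range(num_stages)
]
x = rng.normal(size=dim)
ids, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(ids, codebooks)
print(ids)
print(np.linalg.norm(x - x_hat))  # reconstruction error left after all stages
```

By construction the reconstruction plus the final residual equals the input, so each added stage can only refine what the earlier stages left unexplained.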
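Slides 20-21: Mimi training overview diagram (heavily simplified)
[Figure: Mimi Encoder, Mimi Decoder, a frozen (❄) WavLM, cosine similarity, reconstruction loss + adversarial loss.]
Annotation (slide 21): the representation is pulled toward the vectors of a non-causal model while audio quality is also improved.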
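Slides 22-27: Mimi overview diagram
• A lot of effort has gone into the design
Annotations added slide by slide:
• Slide 23: raw audio is turned into a vector sequence autoregressively
• Slide 24: Acoustic Tokens use RVQ; Semantic Tokens use a linear layer + VQ
• Slide 25: trained so that the Semantic Token approaches the WavLM vectors
• Slide 26: the sum is fed into the Decoder, which outputs an audio waveform
• Slide 27: trained so that the output waveform approaches the input and sounds like real audio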
Slides 28-29: Moshi architecture overview
• First, an ordinary autoregressive language model is built
  • Public English corpus of 2.1T tokens, sequence length 4096, model size 7B
  • The resulting 7B LLM is called Helium
  • At this stage it is simply text-in, text-out
• Next, starting from Helium, the model is trained with audio as input and output
  • Even so, it is trained to predict Mimi tokens, so this is not very different from ordinary language modeling (next-token prediction)
• Consists of a Temporal Transformer (initialized from Helium) and a Depth Transformer
  • Together these two are called the RQ-Transformer
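Slide 30: Moshi architecture diagram
• The Temporal Transformer outputs text
• The Depth Transformer outputs the Semantic Token and Acoustic Tokens autoregressively
→ Two autoregressive flows: along the time axis and along the codebook axis
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]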
Slide 31: RQ-Transformer architecture diagram
• Vectors are fed into the Temporal Transformer
• Trained with next-token prediction
Slides 32-33: Moshi input overview diagram
• At each time step...
  • The user's audio is 1+7 tokens
  • The model's audio is 1+7 tokens
  • The model's text is 1 token
  → Multi-stream Modeling
• At each time step, everything is summed into a single vector and fed into the model
Annotation (slide 33): original implementation: https://github.com/kyutai-labs/moshi/blob/950e9771dc33d7aa48f80175a189c5c902016df2/moshi/moshi/models/lm.py#L381
Hard as it is to believe, 17 embeddings are summed into a single vector.
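The input construction described above, one embedding per stream summed into a single vector per time step, can be illustrated with a toy NumPy sketch. It is not the code from `lm.py`: the embedding tables are random, and `dim`, `text_vocab`, and all names are hypothetical. Only the stream count (1 text + 8 user audio + 8 model audio = 17) and the audio vocabulary of 2048 follow the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16                  # hypothetical model width (toy value)
text_vocab = 100          # hypothetical text vocabulary size
audio_vocab = 2048        # matches the id=0..id=2047 codebook in the figure
num_audio_streams = 16    # (1 + 7) user tokens + (1 + 7) model tokens

# One embedding table per stream: 1 text table + 16 audio tables = 17 tables.
text_emb = rng.normal(size=(text_vocab, dim))
audio_embs = [rng.normal(size=(audio_vocab, dim)) for _ in range(num_audio_streams)]


def embed_step(text_id, audio_ids):
    """Sum the 17 per-stream embeddings of one time step into one vector."""
    if len(audio_ids) != num_audio_streams:
        raise ValueError("expected one token per audio stream")
    vec = text_emb[text_id].copy()
    for table, tok in zip(audio_embs, audio_ids):
        vec += table[tok]
    return vec


step_vec = embed_step(3, list(range(num_audio_streams)))
print(step_vec.shape)  # (16,)
```

Summing keeps the sequence length at one position per 80 ms frame, rather than 17 positions, which is what makes the per-frame cost of the large Transformer independent of the number of streams.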
Slide 34: Moshi input overview diagram: the streaming case
[Figure rows: model output audio, user input audio, model output text]
• Inputs arrive at the model at fixed time intervals
• For streaming:
  • The model finishes processing within the fixed interval and emits its output
  • That output is fed back in as input, together with the user's audio for the next time step
Slides 35-38: Moshi input overview diagram: the actual input at each time step
[Same figure and bullets, stepping through t=2, t=3, t=4, t=5.]
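The streaming loop above (a new user frame arrives every fixed interval, the model answers within that interval, and its answer is fed back at the next step) can be written as a short control-flow sketch. `toy_step` is a hypothetical stand-in that does simple arithmetic; it is not the Moshi model, and all names are assumptions.

```python
def stream_dialogue(user_frames, step_fn, initial_model_frame, initial_text):
    """Feed one user frame per time step plus the model's own previous output."""
    prev_audio, prev_text = initial_model_frame, initial_text
    outputs = []
    for user_frame in user_frames:
        # In a real-time system step_fn must return within one frame period
        # (80 ms at 12.5 Hz), otherwise the model falls behind the user.
        prev_audio, prev_text = step_fn(user_frame, prev_audio, prev_text)
        outputs.append((prev_audio, prev_text))
    return outputs


def toy_step(user_frame, prev_audio, prev_text):
    """Hypothetical stand-in for the model: deterministic arithmetic only."""
    audio = tuple((u + a) % 2048 for u, a in zip(user_frame, prev_audio))
    text = (prev_text + sum(user_frame)) % 100
    return audio, text


outs = stream_dialogue([(1, 2), (3, 4), (5, 6)], toy_step, (0, 0), 0)
print(outs)  # [((1, 2), 3), ((4, 6), 10), ((9, 12), 21)]
```

The point of the sketch is the data flow: the user stream is never paused or buffered until a turn ends, so listening and speaking proceed in the same loop.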
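Slides 39-41: Repurposing for speech recognition (ASR) and speech synthesis (TTS)
[Figure: text and audio streams, once with audio first (ASR) and once with text first (TTS)]
• Moshi's Multi-stream Modeling can easily be applied to ASR and TTS
• Both tasks can be expressed naturally just by changing the offset between streams
  • For ASR, the model hears the audio to be transcribed first, then outputs text
  • For TTS, the model sees the text to be spoken first, then outputs audio
Annotations (slide 40, ASR): the model's own output text; the user's audio every 80 ms; the output vector here is used for next-word prediction.
Annotations (slide 41, TTS): the output vector here is used for next-"sound" prediction; the text entered by the user; the model's output audio.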
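The "change the offset" idea can be shown on two toy frame-aligned streams: delaying the text stream gives an ASR-like layout (audio first), and delaying the audio stream gives a TTS-like layout (text first). The padding symbol, the delay of 2 frames, and the function names are all hypothetical; the deck does not state the delays actually used.

```python
PAD = None  # hypothetical padding symbol for steps with no token yet


def delay_stream(stream, delay, pad=PAD):
    """Shift a per-frame stream `delay` steps later, keeping its length.

    Tokens pushed past the end are dropped in this toy version.
    """
    if delay <= 0:
        return list(stream)
    return ([pad] * delay + list(stream))[: len(stream)]


def align(text, audio, text_delay=0, audio_delay=0):
    """Pair text and audio frames after applying per-stream delays."""
    return list(zip(delay_stream(text, text_delay), delay_stream(audio, audio_delay)))


text = ["t0", "t1", "t2", "t3"]
audio = ["a0", "a1", "a2", "a3"]

asr_like = align(text, audio, text_delay=2)   # audio comes first, text lags
tts_like = align(text, audio, audio_delay=2)  # text comes first, audio lags
print(asr_like)  # [(None, 'a0'), (None, 'a1'), ('t0', 'a2'), ('t1', 'a3')]
print(tts_like)  # [('t0', None), ('t1', None), ('t2', 'a0'), ('t3', 'a1')]
```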
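Slide 42: Moshi architecture diagram (repeated)
• The Temporal Transformer outputs text
• The Depth Transformer outputs the Semantic Token and Acoustic Tokens autoregressively
→ Two autoregressive flows: along the time axis and along the codebook axis
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]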
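Slide 43: Moshi training process
1. Helium pre-training: a 7B LLM is trained on 2.1T tokens
2. RQ-Transformer pre-training: 7 million hours of training with audio and text as input and output
3. Multi-stream dialogue training: the data above is separated by speaker → audio and text are trained simultaneously
4. Fine-tuning on the Fisher dataset
5. Instruction tuning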
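Slide 44: Evaluation experiments
• Evaluation items
  • Helium's ability as an LLM
  • Audio tokenization
  • Ability as a speech LM
  • Spoken QA
  • Dialogue generation quality
  • Streaming ASR, TTS
  • Quantization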
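Slide 45: Evaluation experiments: evaluation as an LLM
• Performance is not bad compared with Llama2 and Mistral
  • → Looks good for an LLM trained with a comparable compute budget
Presenter's comment: would continued pre-training of an existing model not have worked? 🤨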
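Slides 46-47: Evaluation of audio quality
• ABX: an automatic metric based on embedding representations of the audio
• MOSNet: reference-free audio quality prediction (estimated by a deep learning model)
• MUSHRA: a subjective metric based on human listeners
Annotation (slide 47): not bad considering that Mimi is causal and highly compressed.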
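Slide 48: Evaluation as a speech language model
• sWUGGY: a fake word is made from a real word, and the test measures which one gets the higher probability
  • The original WUGGY is text-level, but sWUGGY is synthesized with TTS and evaluated on audio
• sBLIMP: a task of choosing the syntactically correct text
  • This is also evaluated at the audio level
• sStoryCloze: four sentences are given, and the correct fifth sentence must be chosen from two candidates
  • Everything is converted to audio for evaluation
• sTopic-StoryCloze: an easier version of sStoryCloze
• MMLU: evaluates whether the model can solve the questions in the usual way using the inner monologue text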
Slide 49: Evaluation as a speech language model
• High performance on average across all tasks; the model can handle text as well as audio
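Slide 50: Dialogue generation quality
• Generated audio is transcribed with Whisper, and the perplexity (PPL) of the dialogue is evaluated with DialoGPT
• Moshi has low PPL, so its output is natural as dialogue text
• The silences between speakers (Gap, Pause) are also short, so turn-taking is natural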
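For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities have already been obtained from a scoring language model (DialoGPT in the deck); the transcription and scoring steps are not shown, and the function name is illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Sanity check: if every token has probability 1/4, perplexity is about 4.
uniform4 = [math.log(0.25)] * 5
print(perplexity(uniform4))
```

Lower values mean the scoring model finds the transcribed dialogue more predictable, which is the sense in which the slide reads low PPL as "natural".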
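Slide 51: Summary
• Proposes Moshi, a full-duplex spoken dialogue model
  • Built from the neural audio codec Mimi and the RQ-Transformer
• Multi-stream modeling processes user audio, model audio, and text simultaneously
  • Making the whole system causal enables streaming
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]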