Hayato Tsukagoshi
July 15, 2025
[Reading group slides] Moshi: a speech-text foundation model for real-time dialogue
Slides introducing the paper that proposed Moshi, a real-time spoken dialogue model.
Transcript
Slide 1
Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
https://arxiv.org/abs/2410.00037
Nagoya Univ. D3, Hayato Tsukagoshi
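Slide 2: Overview
• A paper proposing Moshi, a full-duplex real-time dialogue model
  • The model can produce output while it is still listening to the user's speech
  • Contrast with half-duplex: while one side is speaking, the other cannot speak
• Research from Kyutai, a non-profit research lab based in Paris, France
• Its defining feature is an architecture designed for streaming from the ground up
  • User audio, model audio, and model text are all fed into the model at the same time
• The neural audio codec Mimi was also developed and used
  • It tokenizes 24,000 Hz audio into a 12.5 Hz token sequence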
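Slide 3: Aside
• Ohashi-kun of the Higashinaka Lab has released a Japanese version of the model
  • The original Moshi was fine-tuned on Japanese dialogue data plus synthetic data
• It is getting a huge amount of attention...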
Slide 4: Why this paper
• It sits at the cutting edge of speech + language deep learning, and it is interesting!
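Slide 5: Structure of Moshi
• A 7B model based on an autoregressive Transformer, plus an audio tokenizer
• The audio tokenizer Mimi converts audio into ID sequences so that it can be handled as discrete data
  • The frame rate (amount of data per second) is 12.5
• Input: the user's audio, the model's audio, and text (inner monologue)
  • The vectors corresponding to each are summed and fed into the Transformer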
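Slide 6: Mimi
• One of the core technologies behind Moshi: 96.2M parameters, built from Conv and Transformer layers (hf)
  • 80 ms of audio is treated as 1 token; the input sampling rate is 24,000 Hz
• A neural audio codec that converts audio waveforms into discrete audio tokens
  • Adopts the discrete bottleneck known from VQ-VAE
• Two kinds of audio token are output: acoustic tokens and semantic tokens
  • Semantic token: captures the semantic and phonetic information in speech
    • Distilled from WavLM embedding representations
  • Acoustic token: captures fine-grained acoustic features
• A Residual Vector Quantizer (RVQ) quantizes the audio waveform in stages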
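The numbers on this slide fit together with simple arithmetic. The sketch below only restates the figures quoted in the slides (24,000 Hz, 12.5 Hz, 80 ms); the count of 8 tokens per frame is an assumption taken from the later input-overview slide (1 + 7 tokens per speaker), and the helper name `num_frames` is made up for illustration.

```python
# Figures quoted on the slides.
SAMPLE_RATE_HZ = 24_000   # input sampling rate
FRAME_RATE_HZ = 12.5      # Mimi frames (tokens per codebook) per second

# Assumption from the input-overview slide: 1 semantic + 7 acoustic tokens per frame.
CODEBOOKS_PER_FRAME = 8

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ        # waveform samples covered by one frame
frame_duration_ms = 1000 / FRAME_RATE_HZ                  # duration of one frame
tokens_per_second = FRAME_RATE_HZ * CODEBOOKS_PER_FRAME   # tokens per second for one speaker


def num_frames(duration_s: float) -> int:
    """Number of whole Mimi frames in a clip of the given length (hypothetical helper)."""
    return int(duration_s * FRAME_RATE_HZ)


print(samples_per_frame, frame_duration_ms, tokens_per_second, num_frames(10.0))
```

Each frame therefore summarizes 1,920 waveform samples, which is where the "high compression" remark in the evaluation slides comes from.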
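Slide 8: Residual Vector Quantization (RVQ)
• Quantizes a vector into an ID sequence made up of multiple IDs
• Quantization is carried out in stages
  • First, apply vector quantization to the input
  • Next, quantize the difference between the input vector and the quantized vector in the same way
  • Repeat
• The quantizers naturally learn to encode the most important information first
  • The first quantizer picks a vector that represents the input vector as a whole
  • Intuitively, perhaps close to Matryoshka Representation Learning?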
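The staged procedure described on this slide can be sketched in a few lines of plain Python. This is a toy illustration, not Mimi's implementation: the codebooks are random rather than learned, the sizes (`DIM`, `STAGES`, `SIZE`) are arbitrary, and each toy codebook includes an all-zero entry purely so that a stage can never make the residual worse. The illustration slides show codebooks with IDs 0 to 2047; the toy uses 16 entries.

```python
import random


def squared_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i], vec))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage quantizes the previous stage's residual."""
    residual = list(vec)
    ids = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        # Subtract the chosen entry; what is left becomes the next quantization target.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return ids


def rvq_decode(ids, codebooks):
    """Reconstruct by summing the selected entry of every stage that has an ID."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(ids, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy, untrained codebooks (hypothetical sizes); entry 0 of each is the zero vector.
rng = random.Random(0)
DIM, STAGES, SIZE = 4, 3, 16
codebooks = [
    [[0.0] * DIM] + [[rng.gauss(0, 1.0 / 2 ** s) for _ in range(DIM)] for _ in range(SIZE - 1)]
    for s in range(STAGES)
]

x = [rng.gauss(0, 1) for _ in range(DIM)]
ids = rvq_encode(x, codebooks)
# Reconstruction error when only the first k IDs are used (k = 0 .. STAGES).
errors = [squared_error(x, rvq_decode(ids[:k], codebooks)) for k in range(STAGES + 1)]
print(ids, [round(e, 3) for e in errors])
```

Decoding from a prefix of the ID sequence still gives a coarse reconstruction, which is the property behind the comparison with Matryoshka Representation Learning.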
Slides 9-19: RVQ illustration
[Figure: a codebook with entries id=0, id=1, id=2, id=3, …, id=2047 shown next to the quantization target. At each step the nearest codebook entry is selected, its ID is appended to the output ID sequence, and the selected entry is subtracted from the target to give the next target. The output ID sequence grows from [1, to [1, 3, to [1, 3, 2, and after n rounds reads [1, 3, 2, 2047, …, 4].]
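Slides 20-21: Mimi training overview (heavily simplified)
[Figure: Mimi Encoder and Mimi Decoder trained with a reconstruction loss plus an adversarial loss; a frozen (❄) WavLM is compared against by cosine similarity.]
• The representations are pulled toward the vectors of a non-causal model while audio quality is improved at the same time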
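The cosine-similarity link to the frozen WavLM in the diagram can be illustrated as a distillation term. The sketch below is only an assumption about the general shape of such a loss (mean of one minus cosine similarity over frames); the slides do not give the exact formula or weighting, and the function names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(student_frames, teacher_frames):
    """Mean of (1 - cosine similarity) over frames; 0 when every pair points the same way."""
    losses = [1.0 - cosine_similarity(s, t) for s, t in zip(student_frames, teacher_frames)]
    return sum(losses) / len(losses)


# Toy vectors standing in for the semantic-token embedding (student) and WavLM (teacher).
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[2.0, 0.0], [1.0, 0.0]]
print(distillation_loss(student, teacher))
```

Only the direction of the vectors matters for this term, so the student does not need to match the teacher's scale.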
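Slides 22-27: Mimi overview diagram
• It is built with considerable care
[Figure annotations, added one per slide:]
• Raw audio is converted autoregressively into a sequence of vectors
• Acoustic tokens come from RVQ; semantic tokens come from a linear layer + VQ
• Trained so that the semantic token approaches the WavLM vectors
• The sum is fed into the decoder, which outputs the audio waveform
• Trained so that the output waveform approaches the input and sounds like the real thing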
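Slide 28: Moshi architecture overview
• First, an ordinary autoregressive language model is built
  • Public English corpus of 2.1T tokens, sequence length 4096, model size 7B
  • The resulting 7B LLM is called Helium
  • At this stage it is simply text-in, text-out
• Next, starting from Helium, the model is trained with audio as input and output
  • That said, it is trained to predict Mimi tokens, so it is not very different from ordinary language modeling (next-token prediction)
• It consists of a Temporal Transformer (initialized from Helium) and a Depth Transformer
  • Together, the two are called the RQ-Transformer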
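Slide 30: Moshi architecture diagram
• The Temporal Transformer outputs text
• The Depth Transformer outputs the semantic token and the acoustic tokens autoregressively
→ Two autoregressive flows: one along time and one along the codebooks
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]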
Slide 31: RQ-Transformer architecture diagram
• Vectors are fed into the Temporal Transformer
• Trained with next-token prediction
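The two autoregressive directions (time and codebook) amount to a nested loop. The sketch below shows only the control flow: `temporal_context` and `pick` are made-up stand-ins for the Temporal Transformer and the prediction heads, the text vocabulary size is a toy value, and the count of 8 audio codebooks is an assumption taken from the input-overview slide.

```python
import random

NUM_CODEBOOKS = 8      # 1 semantic + 7 acoustic tokens (count from the input-overview slide)
TEXT_VOCAB = 100       # toy value, hypothetical
AUDIO_VOCAB = 2048     # matches id=0..2047 in the RVQ illustration


def temporal_context(history):
    """Stand-in for the Temporal Transformer: one summary value for all past steps."""
    return sum(sum(step) for step in history)


def pick(context, earlier_tokens, vocab):
    """Stand-in for a prediction head: a deterministic pseudo-random token."""
    seed = context * 1009 + len(earlier_tokens) * 31 + sum(earlier_tokens)
    return random.Random(seed).randrange(vocab)


def generate(num_steps):
    """Generate [text, semantic, acoustic x7] tokens for each time step."""
    history = []
    for _ in range(num_steps):                      # autoregression along time
        context = temporal_context(history)
        step = [pick(context, [], TEXT_VOCAB)]      # text token comes from the temporal side
        for _ in range(NUM_CODEBOOKS):              # autoregression along codebooks
            step.append(pick(context, step, AUDIO_VOCAB))
        history.append(step)
    return history


frames = generate(3)
print(frames)
```

The outer loop runs once per 80 ms frame, while the inner loop is short (one pass per codebook), which keeps the expensive temporal model off the per-codebook path.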
Slides 32-33: Moshi input overview
• At every time step...
  • The user's audio is 1+7 tokens
  • The model's audio is 1+7 tokens
  • The model's text is 1 token
→ Multi-stream modeling
• At every time step everything is summed into a single vector and fed into the model
Original implementation: https://github.com/kyutai-labs/moshi/blob/950e9771dc33d7aa48f80175a189c5c902016df2/moshi/moshi/models/lm.py#L381
(Hard to believe, but) 17 embeddings are summed into one vector
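Summing 17 embeddings per time step can be sketched as follows. This is an illustration of the idea only and is not copied from the linked implementation: the table sizes, the ordering of the streams, and the name `embed_step` are hypothetical.

```python
import random

NUM_AUDIO_CODEBOOKS = 8                     # 1 semantic + 7 acoustic tokens per speaker (from the slide)
NUM_STREAMS = 1 + 2 * NUM_AUDIO_CODEBOOKS   # model text + model audio + user audio
VOCAB, DIM = 32, 6                          # toy sizes, hypothetical

rng = random.Random(0)
# One embedding table per token stream (random toy values instead of learned weights).
tables = [
    [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
    for _ in range(NUM_STREAMS)
]


def embed_step(token_ids):
    """Look up one embedding per stream and sum them into a single input vector."""
    assert len(token_ids) == NUM_STREAMS
    vec = [0.0] * DIM
    for table, tok in zip(tables, token_ids):
        vec = [v + e for v, e in zip(vec, table[tok])]
    return vec


step_tokens = [rng.randrange(VOCAB) for _ in range(NUM_STREAMS)]
x_t = embed_step(step_tokens)
print(NUM_STREAMS, x_t)
```

Because the streams are summed rather than concatenated, the sequence length seen by the Temporal Transformer stays at one position per frame no matter how many streams there are.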
Slides 34-38: Moshi input overview: streaming, and the actual input at each time step
[Figure: three streams (the model's output audio, the user's input audio, the model's output text), shown advancing through time steps t=2, t=3, t=4, t=5.]
• New input arrives at the model at fixed time intervals
• For streaming:
  • The model must finish processing and emit its output within that fixed interval
  • That output is fed back in as input, together with the user's audio for the next time step
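The feedback loop described here can be sketched as below. `dummy_model_step` is a made-up stand-in that just does arithmetic on token IDs, and the initial tokens are placeholders; the point is only which inputs are available at each 80 ms step.

```python
FRAME_MS = 80  # one Mimi frame, per the slides


def dummy_model_step(prev_model_audio, prev_model_text, user_audio):
    """Hypothetical stand-in for one Moshi step; returns (audio, text) tokens for this frame."""
    new_audio = (prev_model_audio + user_audio) % 2048
    new_text = (prev_model_text + 1) % 100
    return new_audio, new_text


def run_stream(user_frames):
    """Process user frames one at a time, feeding the model's own outputs back in."""
    model_audio, model_text = 0, 0          # placeholder initial tokens (hypothetical)
    log = []
    for t, user_audio in enumerate(user_frames, start=1):
        # Inputs at step t: the model's outputs from step t-1 plus the user's frame at step t.
        # In a real-time system this call has to return within FRAME_MS of wall-clock time.
        model_audio, model_text = dummy_model_step(model_audio, model_text, user_audio)
        log.append((t * FRAME_MS, user_audio, model_audio, model_text))
    return log


log = run_stream([5, 7, 11])
print(log)
```

Nothing in the loop looks ahead at future user frames, which is the constraint that lets the model talk and listen at the same time.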
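Slides 39-41: Repurposing for speech recognition (ASR) and speech synthesis (TTS)
[Figure: a text stream and an audio stream shown twice, once for ASR and once for TTS; in each case one stream is marked "this one waits".]
• Moshi's multi-stream modeling can easily be applied to ASR and TTS
• Just by changing the offset between streams, either task can be expressed naturally
  • For ASR, the model hears the audio to be transcribed and then outputs text
  • For TTS, the model sees the text to be spoken and then outputs audio
[ASR annotations: the text stream is the model's own output text; the audio stream is the user's audio in 80 ms steps; the next word is predicted from the output vector at this position.]
[TTS annotations: the text stream is text entered by the user; the audio stream is the model's output audio; the next "audio" token is predicted from the output vector at this position.]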
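The idea that only the offset between the two streams differs between ASR and TTS can be sketched with two aligned lists. The padding symbol, the delay of 2 frames, and the function names are all hypothetical; the slides do not state the delays actually used.

```python
PAD = "_"  # hypothetical padding symbol


def shift(stream, delay):
    """Delay a stream by `delay` frames, padding the start and keeping the length."""
    stream = list(stream)
    return ([PAD] * delay + stream)[:len(stream)]


def align(audio, text, text_delay):
    """Positive delay: text waits for audio (ASR-like). Negative: audio waits for text (TTS-like)."""
    if text_delay >= 0:
        return list(audio), shift(text, text_delay)
    return shift(audio, -text_delay), list(text)


audio = ["a1", "a2", "a3", "a4", "a5"]
text = ["t1", "t2", "t3", "t4", "t5"]

asr_audio, asr_text = align(audio, text, 2)    # text lags: heard first, written after
tts_audio, tts_text = align(audio, text, -2)   # audio lags: read first, spoken after

print(asr_audio, asr_text)
print(tts_audio, tts_text)
```

In both layouts each column is one time step; the stream that waits is the one the model predicts, conditioned on what the other stream has already revealed.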
Slide 42: Moshi architecture diagram (shown again; same content as slide 30)
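Slide 43: Moshi training process
1. Helium pre-training: train a 7B LLM on 2.1T tokens
2. RQ-Transformer pre-training: train on 7 million hours with audio and text as input and output
3. Multi-stream dialogue training: separate the above data by speaker, then train on audio and text simultaneously
4. Fine-tuning on the Fisher dataset
5. Instruction tuning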
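Slide 44: Evaluation experiments
• Evaluation items
  • Helium's ability as an LLM
  • Audio tokenization
  • Ability as a speech LM
  • Spoken QA
  • Dialogue generation quality
  • Streaming ASR and TTS
  • Quantization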
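Slide 45: Evaluation experiments: evaluation as an LLM
• Performance is not bad compared with Llama2 and Mistral
  • → Looks good for an LLM trained with a comparable compute budget
Would continued pre-training of an existing model not have worked? 🤨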
Slides 46-47: Evaluation of audio quality
• ABX: an automatic metric based on embedding representations of speech
• MOSNet: reference-free audio quality prediction (estimated by a deep learning model)
• MUSHRA: a subjective metric based on human ratings
Not bad considering that Mimi is causal and highly compressed
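Slide 48: Evaluation as a spoken language model
• sWUGGY: create a fake word from a real word and measure which one the model assigns the higher probability
  • The original WUGGY is text-level, but sWUGGY is synthesized with TTS and evaluated on audio
• sBLIMP: a task of choosing the syntactically correct text
  • This is also evaluated at the audio level
• sStoryCloze: given 4 sentences, choose the correct one of two candidate 5th sentences
  • The whole thing is converted to audio for evaluation
• sTopic-StoryCloze: an easier version of sStoryCloze
• MMLU: evaluates whether the model can solve the problems in the usual way using the inner-monologue text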
Slide 49: Evaluation as a spoken language model
• High performance on average across all tasks; the model can handle text as well as audio
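Slide 50: Dialogue generation quality
• Generated audio is transcribed with Whisper, and the perplexity (PPL) of the dialogue is evaluated with DialoGPT
• Moshi's PPL is low, so its output is natural as dialogue text
• Silences between speakers (Gap, Pause) are also short, so turn-taking is handled naturally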
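For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows only that formula on made-up log-probabilities; obtaining real per-token log-probabilities from DialoGPT on Whisper transcripts is not shown.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy inputs: a scorer that gives every token probability 1/4, and a more confident one.
uniform = [math.log(1 / 4)] * 10
confident = [math.log(0.9)] * 10

print(perplexity(uniform), perplexity(confident))
```

A lower value means the scoring model found the transcript less surprising, which the slide reads as more natural dialogue text.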
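Slide 51: Summary
• Proposes Moshi, a full-duplex spoken dialogue model
  • Built from the neural audio codec Mimi and the RQ-Transformer
• Multi-stream modeling processes user audio, model audio, and text simultaneously
  • Making the whole system causal is what enables streaming
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]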