[Reading group slides] Moshi: a speech-text foundation model for real-time dialogue
Slides introducing the paper that proposes Moshi, a real-time spoken dialogue model.
Hayato Tsukagoshi
July 15, 2025
Transcript
Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
https://arxiv.org/abs/2410.00037
Presenter: Hayato Tsukagoshi (Nagoya Univ., D3)
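Overview
• A paper proposing Moshi, a full-duplex real-time dialogue model
  • The model can produce output while simultaneously listening to the user's speech
  • In contrast, half-duplex: while one side is speaking, the other cannot speak
• Research from Kyutai, a non-profit research lab based in Paris, France
• Its hallmark is an architecture designed throughout for streaming processing
  • User audio, model audio, and model text are fed to the model simultaneously
• The authors also developed and use Mimi, a neural audio codec
  • It tokenizes 24000 Hz audio into a 12.5 Hz token sequence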
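Aside
• Ohashi-kun of the Higashinaka Lab released a Japanese version of the model
  • Fine-tuned from the original Moshi on Japanese dialogue data + synthetic data
• It is getting a huge amount of buzz…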
Reason for selection
• It is at the cutting edge of speech + language deep learning, and it is fun!
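Structure of Moshi
• A 7B autoregressive Transformer-based model + an audio tokenizer
• The audio tokenizer Mimi converts audio into ID sequences so that it can be handled discretely
  • The frame rate (frames of data per second) is 12.5
• Inputs: the user's audio, the model's audio, and text (inner monologue)
  • The vectors corresponding to each are summed and fed into the Transformer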
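Mimi
• One of the core technologies behind Moshi: 96.2M parameters, built from Conv and Transformer layers (hf)
  • Treats 80 ms as 1 token; the input sampling rate is 24000 Hz
• A Neural Audio Codec that converts audio waveforms into discrete audio tokens
  • Adopts the discrete bottleneck known from VQ-VAE
• Two kinds of audio tokens are output: Acoustic Tokens and Semantic Tokens
  • Semantic Token: captures the semantic and phonetic information of the audio
    • Distilled from WavLM embedding representations
  • Acoustic Token: captures fine-grained acoustic features
• A Residual Vector Quantizer (RVQ) quantizes the audio waveform in stages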
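As a quick check on the numbers quoted for Mimi, the following minimal Python sketch relates the 24000 Hz sampling rate to the 12.5 Hz frame rate. The function names are illustrative and do not come from the paper or the Moshi codebase.

```python
import math

SAMPLE_RATE_HZ = 24_000  # input sampling rate stated on the slide
FRAME_RATE_HZ = 12.5     # token frames per second stated on the slide


def samples_per_frame(sample_rate_hz: float, frame_rate_hz: float) -> float:
    """Number of waveform samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_hz


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one token frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Frames needed to cover `duration_s` seconds of audio (rounded up)."""
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    print(samples_per_frame(SAMPLE_RATE_HZ, FRAME_RATE_HZ))  # 1920.0
    print(frame_duration_ms(FRAME_RATE_HZ))                  # 80.0
    print(frames_for(10.0, FRAME_RATE_HZ))                   # 125
```

In other words, each 80 ms token frame stands in for 1920 raw samples, which is the compression the slide refers to.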
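Residual Vector Quantization: RVQ
• Quantizes a vector into an ID sequence made up of multiple IDs
• Quantization is performed in stages
  • First, apply vector quantization to the input vector
  • Next, quantize the difference between the input vector and its quantized vector in the same way
  • Repeat
• The model naturally learns to quantize information in order of importance
  • The first quantizer picks a vector that represents the input vector as a whole
  • Intuitively, perhaps close to Matryoshka Representation Learning?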
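The staged procedure described here can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the codebooks are hand-written rather than learned, nearest-neighbor search uses squared Euclidean distance, and all names (`rvq_encode`, `rvq_decode`, `TOY_CODEBOOKS`) are hypothetical rather than taken from Mimi.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = Sequence[Sequence[float]]


def nearest_index(codebook: Codebook, x: Sequence[float]) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((e - v) ** 2 for e, v in zip(entry, x))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def rvq_encode(x: Sequence[float], codebooks: Sequence[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage quantizes the previous residual."""
    residual = list(x)
    ids: List[int] = []
    for codebook in codebooks:
        index = nearest_index(codebook, residual)
        ids.append(index)
        # Subtract the chosen entry; the remainder is the next stage's target.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return ids, residual


def rvq_decode(ids: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct the vector as the sum of the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(ids, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Hand-written toy codebooks (assumption): coarse first, fine second.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.1, 0.1], [-0.1, 0.1], [0.1, -0.1]],
]

if __name__ == "__main__":
    ids, residual = rvq_encode([0.9, 0.1], TOY_CODEBOOKS)
    print(ids)                             # [1, 2]
    print(rvq_decode(ids, TOY_CODEBOOKS))  # approximately [0.9, 0.1]
```

The first stage captures the coarse shape of the vector and later stages refine it, which matches the "important information first" behaviour noted on the slide.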
[Figure, slides 9-19: RVQ illustration. A codebook with entries id=0, id=1, id=2, id=3, …, id=2047 and a quantization target. At each stage the nearest-neighbor entry is selected, its ID is appended to the output ID sequence, and the selected entry is subtracted from the target to give the next target. The output ID sequence grows from [1, to [1, 3, to [1, 3, 2, and after n stages is [1, 3, 2, 2047, …, 4].]
[Figure, slides 20-21: Mimi training overview (heavily simplified). Components: Mimi Encoder, Mimi Decoder, WavLM (❄), cosine similarity, reconstruction loss + adversarial loss. Annotation: bring the vectors close to those of the non-causal model while also improving audio quality.]
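Mimi overview figure
• It is quite elaborately built
[Figure, slides 22-27: Mimi overview, with annotations added one slide at a time:]
  • Raw audio is converted autoregressively into a sequence of vectors
  • Acoustic Tokens come from RVQ; Semantic Tokens come from a linear layer + VQ
  • Trained so that the Semantic Tokens approach the WavLM vectors
  • The sum is fed into the Decoder, which outputs an audio waveform
  • Trained so that the output waveform approaches the input and also sounds realistic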
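The slides only state that the Semantic Tokens are pulled toward the WavLM vectors using cosine similarity. As a rough illustration, the sketch below assumes the distillation loss is one minus the cosine similarity, averaged over frames. That exact form and the function names are assumptions, not details confirmed by the slides.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def distillation_loss(student_frames: Sequence[Sequence[float]],
                      teacher_frames: Sequence[Sequence[float]]) -> float:
    """Mean of (1 - cosine similarity) over aligned frames (assumed loss form)."""
    if len(student_frames) != len(teacher_frames) or not student_frames:
        raise ValueError("need the same non-zero number of frames on both sides")
    total = sum(1.0 - cosine_similarity(s, t)
                for s, t in zip(student_frames, teacher_frames))
    return total / len(student_frames)


if __name__ == "__main__":
    student = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for quantized semantic vectors
    teacher = [[1.0, 0.0], [1.0, 0.0]]  # toy stand-ins for frozen teacher vectors
    print(distillation_loss(student, teacher))  # 0.5
```

The loss is 0 when every student frame points in the same direction as its teacher frame, and it grows as the directions diverge.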
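Overview of the Moshi architecture
• First, build an ordinary autoregressive language model
  • Public English corpora, 2.1T tokens; sequence length 4096; model size 7B
  • The resulting 7B LLM is called Helium
  • At this stage it is simply text-in, text-out
• Next, starting from Helium, train with audio as input and output
  • That said, it is trained to predict Mimi tokens, so it differs little from ordinary language modeling (next-token prediction)
• Consists of a Temporal Transformer (initialized from Helium) and a Depth Transformer
  • Together, the two are called the RQ-Transformer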
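Moshi architecture figure
• The Temporal Transformer outputs text
• The Depth Transformer autoregressively outputs the Semantic Token and the Acoustic Tokens
→ Two autoregressive flows: one along the time direction and one along the codebook direction
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]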
RQ-Transformer architecture figure
• Vectors are fed into the Temporal Transformer
• Trained with next-token prediction
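The two nested autoregressive flows (time direction and codebook direction) can be pictured as a double loop. The sketch below is a structural illustration only: `temporal_step` and `depth_step` are hypothetical stand-ins for the two Transformers, and the toy versions use plain integers where the real model would pass vectors.

```python
from typing import Callable, List

Frame = List[int]


def generate(temporal_step: Callable[[List[Frame]], int],
             depth_step: Callable[[int, Frame], int],
             num_steps: int,
             num_codebooks: int) -> List[Frame]:
    """Outer loop over time steps, inner loop over codebooks within one step."""
    history: List[Frame] = []
    for _ in range(num_steps):
        # One call per time step, conditioned on all previous frames.
        context = temporal_step(history)
        frame: Frame = []
        for _ in range(num_codebooks):
            # Each token in the frame sees the tokens already emitted in that frame.
            frame.append(depth_step(context, frame))
        history.append(frame)
    return history


def toy_temporal_step(history: List[Frame]) -> int:
    """Toy stand-in (assumption): summarize the past as the sum of all tokens."""
    return sum(sum(frame) for frame in history)


def toy_depth_step(context: int, frame: Frame) -> int:
    """Toy stand-in (assumption): depends on context and position in the frame."""
    return (context + len(frame)) % 5


if __name__ == "__main__":
    print(generate(toy_temporal_step, toy_depth_step, num_steps=2, num_codebooks=3))
    # [[0, 1, 2], [3, 4, 0]]
```

The point of the split is that the expensive model runs once per time step, while a smaller model handles the short loop over codebooks.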
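Moshi input overview figure
• At each time step…
  • The user's audio is 1+7 tokens
  • The model's audio is 1+7 tokens
  • The model's text is 1 token
→ Multi-stream Modeling
• At each time step, everything is summed into a single vector and fed to the model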
• Original implementation: https://github.com/kyutai-labs/moshi/blob/950e9771dc33d7aa48f80175a189c5c902016df2/moshi/moshi/models/lm.py#L381
  • (Hard to believe, but) 17 embeddings are summed into a single vector Σ੧(❛□❛✿)
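To make the "17 embeddings summed into one vector" point concrete, here is a minimal pure-Python sketch. It is not the code at the linked line of `lm.py`. The tiny vocabulary size, the embedding dimension, and the names `make_table` and `embed_frame` are assumptions chosen for illustration.

```python
import random
from typing import List, Sequence

NUM_AUDIO_CODEBOOKS = 8                          # 1 semantic + 7 acoustic tokens per audio stream
NUM_STREAM_TOKENS = 1 + 2 * NUM_AUDIO_CODEBOOKS  # text + model audio + user audio = 17

Table = List[List[float]]


def make_table(vocab_size: int, dim: int, rng: random.Random) -> Table:
    """Random embedding table with one row per token ID (toy initialization)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed_frame(text_id: int,
                model_audio_ids: Sequence[int],
                user_audio_ids: Sequence[int],
                tables: Sequence[Table]) -> List[float]:
    """Look up one embedding per token and sum them into a single vector."""
    ids = [text_id, *model_audio_ids, *user_audio_ids]
    if len(ids) != len(tables):
        raise ValueError("expected one embedding table per token in the frame")
    dim = len(tables[0][0])
    summed = [0.0] * dim
    for table, token_id in zip(tables, ids):
        row = table[token_id]
        for d in range(dim):
            summed[d] += row[d]
    return summed


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sizes (assumption): 16 token IDs per table, 4-dimensional embeddings.
    tables = [make_table(16, 4, rng) for _ in range(NUM_STREAM_TOKENS)]
    vector = embed_frame(3, list(range(8)), list(range(8, 16)), tables)
    print(len(tables), len(vector))  # 17 4
```

However many streams are combined, the Transformer still sees exactly one vector per time step, so the sequence length stays tied to the 12.5 Hz frame rate.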
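Moshi input overview figure: the streaming case
[Figure labels: model's output audio, user's input audio, model's output text]
• Input arrives at the model at fixed time intervals
• For streaming processing:
  • The model finishes processing within the fixed time interval and emits its output
  • That output is fed back in as input, together with the user's audio for the next time step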
[Figure, slides 35-38: Moshi input overview, actual per-time-step inputs, shown for t=2, t=3, t=4, and t=5, with the same stream labels and the same notes on streaming processing as the preceding slide.]
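The streaming constraint described above (finish each step within one frame interval, then feed the output back in alongside the next user frame) can be sketched as a simple loop. The 80 ms budget follows from the frame length quoted for Mimi. The `step` callable and the toy arithmetic are hypothetical stand-ins for the real model.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

FRAME_SECONDS = 0.08  # one frame = 80 ms


def run_stream(user_frames: Sequence[int],
               step: Callable[[int, Optional[int]], int],
               budget_s: float = FRAME_SECONDS) -> Tuple[List[int], int]:
    """Process frames in arrival order, feeding each output back into the next step.

    Returns the outputs and the number of steps that exceeded the time budget.
    """
    previous_output: Optional[int] = None
    outputs: List[int] = []
    late_steps = 0
    for user_frame in user_frames:
        started = time.perf_counter()
        # The model sees the new user frame plus its own previous output.
        previous_output = step(user_frame, previous_output)
        if time.perf_counter() - started > budget_s:
            late_steps += 1  # a real-time system would start to lag here
        outputs.append(previous_output)
    return outputs, late_steps


def toy_step(user_frame: int, previous_output: Optional[int]) -> int:
    """Toy stand-in for the model (assumption): running sum of the inputs."""
    carried = 0 if previous_output is None else previous_output
    return user_frame + carried


if __name__ == "__main__":
    outputs, late = run_stream([1, 2, 3], toy_step)
    print(outputs)  # [1, 3, 6]
```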
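Repurposing for speech recognition (ASR) and speech synthesis (TTS)
• Moshi's Multi-stream Modeling can easily be applied to ASR and TTS
• Both tasks can be expressed naturally just by changing the offset between the streams
  • For ASR, output text after hearing the audio to be transcribed
  • For TTS, output audio after seeing the text to be spoken
[Figure: for ASR the audio stream comes first and the text follows; for TTS the text stream comes first and the audio follows.]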
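[Figure, slide 40 (ASR): the model's own output text; the user's audio in 80 ms frames; the output vector at this position is used for next-word prediction.]
[Figure, slide 41 (TTS): text entered by the user; the model's output audio; the next "audio" token is predicted from the output vector at this position.]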
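The idea that ASR and TTS differ only in which stream is delayed can be illustrated by shifting toy streams against each other. This is a minimal sketch: the helper names, the `None` padding, and the one-frame delays are assumptions for illustration, not the delays used in the paper.

```python
from typing import List, Optional, Sequence, Tuple

PAD = None  # placeholder for "nothing in this stream at this time step"


def shift(stream: Sequence, delay: int) -> List:
    """Delay a stream by `delay` frames by padding at the front."""
    return [PAD] * delay + list(stream)


def align(text: Sequence[str], audio: Sequence[int],
          text_delay: int, audio_delay: int) -> List[Tuple[Optional[str], Optional[int]]]:
    """Pair up the two delayed streams, one (text, audio) tuple per time step."""
    t = shift(text, text_delay)
    a = shift(audio, audio_delay)
    length = max(len(t), len(a))
    t += [PAD] * (length - len(t))
    a += [PAD] * (length - len(a))
    return list(zip(t, a))


if __name__ == "__main__":
    text, audio = ["a", "b"], [1, 2]
    # ASR-like layout: audio leads, text follows one frame later.
    print(align(text, audio, text_delay=1, audio_delay=0))
    # [(None, 1), ('a', 2), ('b', None)]
    # TTS-like layout: text leads, audio follows one frame later.
    print(align(text, audio, text_delay=0, audio_delay=1))
    # [('a', None), ('b', 1), (None, 2)]
```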
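Moshi training process
1. Helium pre-training: train a 7B LLM on 2.1T tokens
2. RQ-Transformer pre-training: train on 7 million hours with audio and text as input and output
3. Multi-stream dialogue training: apply speaker diarization to the above data, then train on audio and text simultaneously
4. Fine-tuning on the Fisher dataset
5. Instruction tuning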
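Evaluation experiments
• Evaluation items
  • Helium's ability as an LLM
  • Audio tokenization
  • Ability as an audio LM
  • Spoken QA
  • Dialogue generation quality
  • Streaming ASR and TTS
  • Quantization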
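Evaluation experiments: evaluation as an LLM
• Performance is not bad compared with Llama2 and Mistral
  • → Looks good for an LLM trained with a comparable compute budget
• Would continued pre-training not have worked? 🤨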
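Evaluation of audio quality
• ABX: an automatic metric that uses embedding representations of audio
• MOSNet: reference-free audio quality prediction (estimated by a deep learning model)
• MUSHRA: a subjective metric based on human ratings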
• Result: not bad, considering that Mimi is causal and highly compressed
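Evaluation as a spoken language model
• sWUGGY: create a fake word from a real word and measure which of the two gets the higher probability
  • The original WUGGY is text-level, but sWUGGY is synthesized with TTS and evaluated on audio
• sBLIMP: a task of choosing the syntactically correct text
  • Also evaluated at the audio level
• sStoryCloze: given 4 sentences, choose the correct one of two candidate 5th sentences
  • The whole story is converted to audio for evaluation
• sTopic-StoryCloze: an easier version of sStoryCloze
• MMLU: evaluates whether the model can solve the problems as usual using the inner-monologue text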
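Several of these benchmarks (sWUGGY, sBLIMP, sStoryCloze) share the same pairwise scoring scheme: the model is correct when it assigns the higher probability to the correct item of a pair. A minimal sketch of that scoring is below. The `log_prob` callable and the toy scores are hypothetical; in practice the scores would come from the model's likelihood of each audio sequence.

```python
from typing import Callable, Dict, Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[str, str]],
                      log_prob: Callable[[str], float]) -> float:
    """Fraction of (correct, incorrect) pairs where the correct item scores higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return wins / len(pairs)


# Toy scores (assumption): stand-ins for model log-likelihoods.
TOY_SCORES: Dict[str, float] = {
    "brick": -2.0, "blick": -5.0,   # real word preferred: counted as correct
    "table": -1.5, "tabre": -1.0,   # fake word preferred: counted as wrong
}

if __name__ == "__main__":
    pairs = [("brick", "blick"), ("table", "tabre")]
    print(pairwise_accuracy(pairs, TOY_SCORES.__getitem__))  # 0.5
```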
Evaluation as a spoken language model (results)
• High performance on average across all tasks; it can handle text as well as audio
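Dialogue generation quality
• Generated audio is transcribed with Whisper, and DialoGPT is used to evaluate the perplexity (PPL) of the dialogue
• Moshi has low PPL, so its output is natural as dialogue text
• Gaps and pauses between speakers are also few, and turn-taking is handled naturally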
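For reference, perplexity is the exponential of the average negative log-probability per token, so a lower value means the scoring model found the transcript more predictable. The sketch below uses the standard definition. The token log-probabilities are toy values, not outputs from DialoGPT.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # Toy example: every token is assigned probability 0.5, so PPL is about 2.
    print(perplexity([math.log(0.5)] * 4))
```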
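Summary
• Proposes Moshi, a full-duplex spoken dialogue model
  • Composed of the neural audio codec Mimi and the RQ-Transformer
• Multi-stream modeling processes user audio, model audio, and text simultaneously
  • Making the whole system causal enables streaming processing
[Figure labels: RQ-Transformer, Mimi Encoder, Mimi Decoder, Temporal Transformer, Helium, Depth Transformer]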