
Proposing a Retake feature for ACE-Step-1.5, a high-performance music-generation AI that runs locally!

asap
May 05, 2026


These are the slides exhibited at 生成AIなんでも展示会 Vol.5 (https://www.genai-expo.com/).

The exhibit is planned for booth 「島ア / ア-7」.


Transcript

  1. Slide 1: Text2Music (generating music from text), ACE-Step-1.5 theory

     Inputs:
     • Caption (style), e.g. "ethereal symphonic rock", "fast tempo"
     • Metadata (BPM, key, etc.; optional), e.g. BPM 180, Key D major
     • Reference audio (optional): wav, mp3, etc.
     • Lyric text with section tags: [Intro], [Verse], [Chorus], [Outro]
     • Task instruction: Text2Music, Repaint, etc.

     Step 1: Reasoning LLM (acestep-5Hz-lm-1.7B) produces a <think> chain of thought and then generates 5 Hz audio codes. (These are reused by the Retake feature.)

     Step 2: Embedding model (Qwen3-Embedding-0.6B). Qwen3 Tokenizer → vectorization (token → vector), [BatchSize, TokenLength, 1024] → projection (1024 → 2048). The conditioning [BatchSize, sum of all token lengths, 2048] is built by concatenating, along axis=1, the AceStepLyricEncoder output [BatchSize, TokenLength, 2048] and the optional AceStepTimbreEncoder output [BatchSize, NumberOfAudio, 2048].

     Step 3: Diffusion Transformer (acestep-v15-sft).
     • The 5 Hz audio codes pass through the AudioTokenDetokenizer (upscale) to become src_latents (64-dim, 25 Hz; also reused by the Retake feature). Reference audio, when given, goes through the 1D VAE Tiled Encoder to [BatchSize, Times(Seconds × 25), 64].
     • Main inputs [BatchSize, Times(Seconds × 25), 192]: x_t (noisy latent, 64-dim, 25 Hz), src_latents (64-dim, 25 Hz), and chunk_masks (all True, 64-dim, 25 Hz) are concatenated along axis=2 into 192 dims.
     • DiT body: Conv1D → [BatchSize, Times//2, 2048] → DiT Block ×12 (Sliding Window Attention, Full Attention (GQA), Cross-Attention against the conditioning) → Conv1D → [BatchSize, Times, 64]. Each t-step output feeds the t+1 step; T steps later the final latents are produced.

     Step 4: VAE decoder (1D VAE Tiled Decoder) turns the latents into 48 kHz audio.

     Outputs: LLM outputs (CoT and audio codes), DiT outputs (latents), and the 48 kHz audio.
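The shape bookkeeping in Step 3 above can be sketched with placeholder tensors. This is a minimal illustration of how the 192-dim main input would be assembled from the three 64-dim, 25 Hz streams named on the slide; the tensor names follow the diagram, but the random contents are stand-ins, not the real model's data.

```python
import numpy as np

batch, seconds = 1, 10
frames = seconds * 25                              # latents run at 25 Hz

x_t = np.random.randn(batch, frames, 64)           # noisy latent, 64-dim
src_latents = np.random.randn(batch, frames, 64)   # upscaled from 5 Hz codes
chunk_masks = np.ones((batch, frames, 64))         # "all True" -> 1.0

# Concatenate along the channel axis (axis=2) -> the 192-dim main input
main_inputs = np.concatenate([x_t, src_latents, chunk_masks], axis=2)
print(main_inputs.shape)  # (1, 250, 192)
```

A 10-second clip thus becomes 250 latent frames, matching the slide's [BatchSize, Times(Seconds × 25), 192] annotation.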
  2. Slide 2: Retake (proposed method: minor repair of glitches), ACE-Step-1.5 theory

     Inputs:
     • Lyric text with section tags: [Intro], [Verse], [Chorus], [Outro]
     • Edit spans (multiple selectable), e.g. 30.2 s to 33.4 s and 130.5 s to 135.6 s

     Step 1: Embedding model (Qwen3-Embedding-0.6B), same as in Text2Music: tokenization, vectorization [BatchSize, TokenLength, 1024], projection (1024 → 2048); the AceStepLyricEncoder output [BatchSize, TokenLength, 2048] is concatenated along axis=1 into the conditioning [BatchSize, sum of all token lengths, 2048].

     Step 2: The audio codes (<think> → 5 Hz audio codes) saved from the original generation pass through the AudioTokenDetokenizer (upscale) to become src_latents.

     Step 3: Diffusion Transformer (same as Text2Music, acestep-v15-sft). Main inputs [BatchSize, Times(Seconds × 25), 192]: x_t (noisy latent, 64-dim, 25 Hz), src_latents, and chunk_masks (all True) concatenated along axis=2. The original song's DiT latent [BatchSize, Times, 64], scaled ×0.3, is mixed into the initial noise (⊕ = overwriting). DiT body: Conv1D → [BatchSize, Times//2, 2048] → DiT Block ×12 (Sliding Window Attention, Full Attention (GQA), Cross-Attention) → Conv1D → [BatchSize, Times, 64]; t-step outputs feed the t+1 step until T steps have run.

     Step 4: VAE decoder (1D VAE Tiled Decoder) turns the latents into 48 kHz audio.

     Key points:
     ① The audio codes and latents saved when the original song was generated are reused directly.
     ② The same audio codes as in the original generation guide the DiT's generation.
     ③ The original song's latents are mixed into the DiT's initial noise to steer the generation direction.
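Point ③ above can be sketched as a simple linear blend. This is one plausible reading of the slide's "×0.3" annotation: fresh noise is pulled slightly toward the original song's saved DiT latents so the regeneration stays close to the original. The function name and exact blend formula are my own assumptions, not the project's actual code.

```python
import numpy as np

def retake_init_noise(orig_latents, mix=0.3, seed=None):
    """Blend the original song's saved DiT latents into fresh Gaussian
    noise with weight `mix`, steering the DiT's starting point."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(orig_latents.shape)
    # small pull toward the original song (slide annotation: x0.3)
    return (1.0 - mix) * noise + mix * orig_latents
```

With mix=0 this degenerates to ordinary Text2Music sampling; larger values keep the retake closer to the original take at the cost of less variation inside the edit spans.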
  3. Slide 3: Repaint (comparison method: regenerating a partial span), ACE-Step-1.5 theory

     Inputs:
     • Caption (style); metadata (BPM, key, etc.; optional)
     • Reference audio (optional): wav, mp3, etc.
     • Lyric text (whole song); task instruction
     • Edit span (region to regenerate), e.g. 130.5 s to 152.6 s
     • Source audio (the song being edited): wav, mp3, etc.

     Step 1: Embedding model (Qwen3-Embedding-0.6B). Qwen3 Tokenizer → vectorization [BatchSize, TokenLength, 1024] → projection (1024 → 2048); the conditioning concatenates, along axis=1, the AceStepLyricEncoder output [BatchSize, TokenLength, 2048] and the optional AceStepTimbreEncoder output [BatchSize, NumberOfAudio, 2048] into [BatchSize, sum of all token lengths, 2048].

     Step 2: The source audio goes through the 1D VAE Tiled Encoder to [BatchSize, Times(Seconds × 25), 64]; the features inside the edit span are overwritten with silence latents (⊕ = overwriting), yielding src_latents. Reference audio, when given, is encoded the same way.

     Step 3: Diffusion Transformer (acestep-v15-sft). Main inputs [BatchSize, Times(Seconds × 25), 192]: x_t (noisy latent, 64-dim, 25 Hz), src_latents, and chunk_masks concatenated along axis=2; chunk_masks are True inside the edit span and False elsewhere. DiT body: Conv1D → [BatchSize, Times//2, 2048] → DiT Block ×12 (Sliding Window Attention, Full Attention (GQA), Cross-Attention) → Conv1D → [BatchSize, Times, 64], iterated for T steps.

     Step 4: VAE decoder (1D VAE Tiled Decoder) turns the latents into 48 kHz audio.

     Key points:
     ① To handle arbitrary songs, it uses VAE-encoder latents instead of audio codes → a source of quality degradation.
     ② To support large edits, src_latents are overwritten inside the edit span → the result can drift far from the original song.
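The Repaint conditioning prep in Step 2 and Step 3 above can be sketched as follows. This is a minimal illustration, assuming the slide's description literally: the edit span's VAE features are replaced with a silence latent, and chunk_masks flag only that span for regeneration. Tensor names follow the diagram; the function itself is a hypothetical helper, not the project's code.

```python
import numpy as np

def repaint_conditioning(src_latents, edit_span, silence_latent, fps=25):
    """Blank out the edit span in the VAE-encoded source song and build
    the chunk_masks (True inside the span, False elsewhere)."""
    s = int(edit_span[0] * fps)            # span given in seconds
    e = int(edit_span[1] * fps)
    latents = src_latents.copy()
    latents[:, s:e, :] = silence_latent    # overwrite with silence latent
    chunk_masks = np.zeros(src_latents.shape, dtype=bool)
    chunk_masks[:, s:e, :] = True          # regenerate only this region
    return latents, chunk_masks
```

Everything outside the span keeps the source song's latents, which is why Repaint preserves the rest of the track while the masked region is resynthesized from scratch.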
  4. Slide 4: Retake (proposed method: minor repair of glitches), ACE-Step-1.5 theory. This slide repeats the diagram and content of slide 2: the audio codes and latents saved from the original generation are reused directly, the same audio codes guide the DiT's generation, and the original song's latents are mixed into the DiT's initial noise to steer the generation direction.