Audio and Video Processing with Generative AI

Issei Komori (小森一成, @icckx)

Is everyone here using generative AI? ✋ Today's talk is about generative AI.

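Generative AI lately

• Recognized by the general public
  ◦ TV commercials
    ▪ "Everyone's AI: ask Gemini" (みんなAI Gemini に相談だ)
  ◦ Bookstores
    ▪ "Leave the tedious work to ChatGPT" (面倒なことはChatGPTにやらせよう)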

More recently, people choose different generative AI tools for different tasks.

Web search: multi-agent search has appeared, and the era of generative AI search is arriving.

Combining several generative AI tools, with each one's characteristics in mind, is also becoming common.

ChatGPT(o1-preview) + ImageFX

With this combination you can now get reasonably good images.

Consumer AI chat map

• OpenAI: ChatGPT (Free, Plus, Team, Enterprise), GPTs
• Anthropic: Claude (Free, Pro, Team), Projects
• Google: Gemini (Free, Advanced), Gemini for Google Workspace (Business, Enterprise, ...), Gems, NotebookLM
• X: Grok
• Perplexity
• Genspark
• Microsoft

Commonly used clouds and model map

• OpenAI: GPT (4o, 4o-mini, o1), Whisper, DALL-E, TTS, Embeddings
• Anthropic: Claude (Haiku, Sonnet, Opus)
• Google: Gemini (Ultra, Pro, Flash, Nano), Gemma
• Google Cloud: Vertex AI
• Azure (Microsoft): AOAI
• AWS: Bedrock

Developer consoles and tools map

• OpenAI: Playground
• Anthropic: API Console
• Google: Google AI Studio
• Google Cloud: Vertex AI, Vertex AI Studio
• Azure: AOAI, Azure OpenAI Studio
• AWS: Bedrock, Amazon Bedrock Studio, Amazon Q Developer
• GitHub (Microsoft): GitHub Copilot

Text to Text? The request I feel I hear most often from non-engineers is...

Text to Text

Video vs Audio??

Video(mp4) >>> Audio(mp3)

XXX.mp4

Today we'll take a common requirement (a recorded conversation) as the example and look at how to implement it.

• Transcription
• Diarization
• TimeStamp
• File Size
• Expression

Basic pattern: MP4 → size reduction → MP3 → data store
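The size-reduction step above can be sketched as a small helper that builds an ffmpeg command line for turning an MP4 into an MP3. This is a minimal sketch, not something shown in the slides: the function name `build_extract_audio_command`, the mono downmix, and the 64k bitrate are all assumptions, and ffmpeg must be installed separately to actually run the command.

```python
from pathlib import Path


def build_extract_audio_command(video_path: str, bitrate: str = "64k") -> list[str]:
    """Build an ffmpeg command that drops the video stream and encodes MP3 audio.

    Hypothetical helper: the mono downmix and the default bitrate are
    illustrative assumptions, not values taken from the slides.
    """
    source = Path(video_path)
    target = source.with_suffix(".mp3")  # same name, .mp3 extension
    return [
        "ffmpeg",
        "-i", str(source),  # input video file
        "-vn",              # discard the video stream
        "-ac", "1",         # downmix to mono (assumes speech-only content)
        "-b:a", bitrate,    # target audio bitrate
        str(target),
    ]


command = build_extract_audio_command("XXX.mp4")
print(" ".join(command))

# To actually run the conversion (requires ffmpeg on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.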

ex) Whisper

    from openai import OpenAI

    client = OpenAI()

    # Open the audio file in binary mode and close it when done.
    with open("/path/to/file/german.mp3", "rb") as audio_file:
        transcriptions = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"],  # ← timestamps
        )

    print(transcriptions.text)  # ← transcription
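To make the timestamp side of the response concrete, here is a minimal sketch that turns segment data into readable lines. It assumes each segment has been converted to a plain dict with `start`, `end` (in seconds), and `text` keys, the shape used by `verbose_json` segments. The helper names and the sample data are hypothetical, and depending on the SDK version the segments may be objects with attributes rather than dicts.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments: list[dict]) -> list[str]:
    """Render each segment as '[start - end] text'."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] "
        f"{s['text'].strip()}"
        for s in segments
    ]


# Hypothetical sample data in the shape of verbose_json segments.
sample_segments = [
    {"start": 0.0, "end": 3.5, "text": " Hey, are we still meeting at the cafe this afternoon?"},
    {"start": 3.5, "end": 65.25, "text": " Yes, I'm good with that!"},
]

lines = format_segments(sample_segments)
for line in lines:
    print(line)
```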

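ex) Whisper
Whisper has a 25 MB file size limit. A WAV file can reach 25 MB with a recording of only about 4:30, so converting to MP3 saves space.
* This also helps reduce cloud storage usage.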
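The arithmetic behind the size limit can be checked with a short calculation. This is a rough sketch under stated assumptions: the limit is taken as 25 × 1024 × 1024 bytes, the WAV is uncompressed PCM with its small header ignored, and the MP3 is constant bitrate. The function names and the sample formats are illustrative choices, not values from the slides.

```python
LIMIT_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" counted as 25 MiB


def pcm_seconds_until_limit(sample_rate: int, channels: int, bytes_per_sample: int,
                            limit_bytes: int = LIMIT_BYTES) -> float:
    """Seconds of uncompressed PCM audio that fit under the limit (header ignored)."""
    bytes_per_second = sample_rate * channels * bytes_per_sample
    return limit_bytes / bytes_per_second


def mp3_seconds_until_limit(bitrate_kbps: int, limit_bytes: int = LIMIT_BYTES) -> float:
    """Seconds of constant-bitrate MP3 audio that fit under the limit."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_bytes / bytes_per_second


examples = {
    "WAV 48 kHz mono 16-bit": pcm_seconds_until_limit(48_000, 1, 2),
    "WAV 44.1 kHz stereo 16-bit": pcm_seconds_until_limit(44_100, 2, 2),
    "MP3 128 kbps": mp3_seconds_until_limit(128),
}

for label, seconds in examples.items():
    minutes, secs = divmod(int(seconds), 60)
    print(f"{label}: about {minutes}:{secs:02d}")
```

Under these assumptions, 48 kHz mono 16-bit WAV reaches the limit after roughly four and a half minutes, while 128 kbps MP3 lasts for more than 27 minutes.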

ex) Whisper
If the file is likely to exceed the limit, split it into chunks.

    from pydub import AudioSegment

    song = AudioSegment.from_mp3("good_morning.mp3")

    # PyDub handles time in milliseconds
    ten_minutes = 10 * 60 * 1000

    # Slice off the first ten minutes and write them to a new MP3 file.
    first_10_minutes = song[:ten_minutes]
    first_10_minutes.export("good_morning_10.mp3", format="mp3")
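The slide code exports only the first ten minutes. To cover a whole recording, one option is to compute every chunk boundary up front, as in the sketch below. The helper `chunk_ranges` is hypothetical, and fixed-length splitting is an assumption made for simplicity: it can cut a sentence in half, so splitting at silences may give better transcripts.

```python
def chunk_ranges(total_ms: int, chunk_ms: int) -> list[tuple[int, int]]:
    """Return (start_ms, end_ms) pairs that cover total_ms in fixed-size chunks."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


ten_minutes = 10 * 60 * 1000
ranges = chunk_ranges(total_ms=25 * 60 * 1000, chunk_ms=ten_minutes)
print(ranges)

# With pydub, len(song) is the duration in milliseconds, so the ranges could
# be applied like this (file names are illustrative):
# for index, (start, end) in enumerate(chunk_ranges(len(song), ten_minutes)):
#     song[start:end].export(f"good_morning_{index}.mp3", format="mp3")
```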

Expression

What is diarization? (A short detour in the middle of the Whisper example.)
It is a technique for estimating "who spoke when" from audio.
Reference: https://www.youtube.com/watch?v=37R_R82lfwA

Person A: Hey, are we still meeting at the cafe this afternoon? I’m thinking of working on our project there.
Person B: Yes, I’m good with that! What time are we meeting again?
Person C: I think we said around 3 PM, right? I’ll be there a bit earlier to grab a table.
Person A: Perfect! I’ll bring my laptop and notes. Do we have any updates from last time?
Person B: I worked on the new designs and made some progress. I’ll show them to you when we meet.
Person C: Great! I’ve been working on the backend, so we can discuss how to integrate everything later.
Person A: Awesome, looking forward to it. See you guys at 3!
Person B: See you!
Person C: See you soon!

Transcription
Diarization comes down to whether you need the A/B/C speaker information. That information can only be extracted from the audio itself.

ex) Whisper
Add a preprocessing step before transcription.

    from pyannote.audio import Audio, Pipeline

    # audio_file and no_silence_audio_file are prepared earlier (not shown on the slide).
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    diarization = pipeline(audio_file.name)

    audio = Audio(sample_rate=16000, mono=True)
    for segment, _, speaker in diarization.itertracks(yield_label=True):
        # Cut this speaker's segment out of the audio file
        waveform, sample_rate = audio.crop(no_silence_audio_file.name, segment)
        ...
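Diarization output often contains many short consecutive turns by the same speaker, and each one would become a separate crop and a separate transcription request. One possible refinement, which is not part of the slides, is to merge adjacent turns first. The sketch below assumes the turns have already been copied out of the diarization result into plain (start, end, speaker) tuples. The `Turn` type, the `merge_turns` helper, the 0.5 second gap threshold, and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    start: float   # seconds
    end: float     # seconds
    speaker: str   # speaker label, e.g. "SPEAKER_00"


def merge_turns(turns: list[Turn], max_gap: float = 0.5) -> list[Turn]:
    """Merge consecutive turns by the same speaker separated by at most max_gap seconds."""
    merged: list[Turn] = []
    for turn in sorted(turns, key=lambda t: t.start):
        previous = merged[-1] if merged else None
        if (previous is not None
                and previous.speaker == turn.speaker
                and turn.start - previous.end <= max_gap):
            # Extend the previous turn instead of starting a new one.
            merged[-1] = Turn(previous.start, max(previous.end, turn.end), previous.speaker)
        else:
            merged.append(turn)
    return merged


# Hypothetical diarization output.
turns = [
    Turn(0.0, 2.0, "SPEAKER_00"),
    Turn(2.2, 4.0, "SPEAKER_00"),
    Turn(4.5, 6.0, "SPEAKER_01"),
    Turn(9.0, 10.0, "SPEAKER_01"),
]

merged = merge_turns(turns)
for turn in merged:
    print(f"{turn.speaker}: {turn.start:.1f}s to {turn.end:.1f}s")
```

Here the first two turns merge because the pause between them is short, while the last two stay separate because the gap exceeds the threshold.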

ex) Whisper
MP3, MP3, MP3, MP3, MP3

Person A: Hey, are we still meeting at the cafe this afternoon? I’m thinking of working on our project there.
Person B: Yes, I’m good with that! What time are we meeting again?
Person C: I think we said around 3 PM, right? I’ll be there a bit earlier to grab a table.
Person A: Perfect! I’ll bring my laptop and notes. Do we have any updates from last time?
Person B: I worked on the new designs and made some progress. I’ll show them to you when we meet.
Person C: Great! I’ve been working on the backend, so we can discuss how to integrate everything later.
Person A: Awesome, looking forward to it. See you guys at 3!
Person B: See you!
Person C: See you soon!

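https://github.com/pyannote/pyannote-audio
• Requires Hugging Face
• Setting the GPU parameter makes it extremely fast, but that means you need a GPU machine
• Works only with NVIDIA GPUs; development was not possible through MPS on Apple Silicon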

Expression

Audio, Text
As for something to combine with Whisper, at present there seem to be only research-stage options.

So is the bar fairly high with Whisper? What about the alternatives?

Upload → Reference → Result
ex) Vertex AI (Gemini 1.5 Pro)

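ex) Vertex AI (Gemini 1.5 Pro)

    from google.cloud import storage
    from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

    # bucket_name, file_name, and file are prepared elsewhere (not shown on the slide).
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(file_name)
    blob.upload_from_file(file)  # ← save to storage
    gcs_uri = f"gs://{bucket_name}/{file_name}"

    video = Part.from_uri(mime_type="video/mp4", uri=gcs_uri)  # ← here
    prompt = "XXXXX"

    # Assumption: the slide does not show how `model` is created; this model ID is a guess.
    model = GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([video, prompt], stream=True)

    # Concatenate the streamed chunks into the final result.
    result = ""
    for chunk in response:
        result += chunk.text
    print(result)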

Expression ex) Vertex AI (Gemini 1.5 Pro)

ex) Vertex AI (Gemini 1.5 Pro)
• Transcription ⇒ prompt instruction
• Diarization ⇒ prompt instruction
• TimeStamp ⇒ prompt instruction
• File Size ⇒ files over 25 MB are accepted
• Expression ⇒ prompt instruction
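Since most of the requirements are handled by instructions in the prompt, the prompt can be assembled from the checklist. The sketch below is purely illustrative: the slides do not show the actual prompt (it appears only as "XXXXX"), so the instruction wording, the `REQUIREMENT_INSTRUCTIONS` table, and the `build_prompt` helper are all hypothetical.

```python
# Hypothetical instruction text for each requirement; not taken from the slides.
REQUIREMENT_INSTRUCTIONS = {
    "Transcription": "Transcribe everything that is said in the recording.",
    "Diarization": "Label each utterance with its speaker (Person A, Person B, ...).",
    "TimeStamp": "Prefix each utterance with its start time as MM:SS.",
    "Expression": "Add a short note on each speaker's tone or expression.",
}


def build_prompt(requirements: list[str]) -> str:
    """Assemble a numbered instruction prompt from the selected requirements."""
    lines = ["Process the attached recording as follows:"]
    for number, name in enumerate(requirements, start=1):
        lines.append(f"{number}. {REQUIREMENT_INSTRUCTIONS[name]}")
    return "\n".join(lines)


prompt = build_prompt(["Transcription", "Diarization", "TimeStamp", "Expression"])
print(prompt)
```

File Size is left out on purpose: in the list above it is covered by the upload path rather than by an instruction.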

With Vertex AI (Gemini 1.5 Pro), all of this works by default. For now it is the one clear leader.

San Francisco: October 1 / London: October 30 / Singapore: November 21
Will the situation change?
