Audio and Video Processing with Generative AI

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

小森一成(@icckx)

Slide 3

Slide 3 text

Is everyone here using generative AI? ✋ Today's talk is about generative AI.

Slide 4

Slide 4 text

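Generative AI lately
● Widely recognized by the general public
  ○ TV commercials
    ■ "みんなAI Gemini に相談だ" ("Everyone, ask the AI, Gemini")
  ○ Bookstores
    ■ "面倒なことはChatGPTにやらせよう" ("Let ChatGPT handle the tedious stuff")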

Slide 5

Slide 5 text

More recently, people choose different generative AI tools for different tasks.

Slide 6

Slide 6 text

Web search: multi-agent search has arrived, and the era of generative AI search is approaching.

Slide 7

Slide 7 text

Combining several generative AI tools, taking each one's characteristics into account, is also becoming common.

Slide 8

Slide 8 text

ChatGPT(o1-preview) + ImageFX

Slide 9

Slide 9 text

This kind of combination can now produce reasonably good images.

Slide 10

Slide 10 text

Consumer AI chat map
● OpenAI
  ○ ChatGPT: Free, Plus, Team, Enterprise
  ○ GPTs
● Anthropic
  ○ Claude: Free, Pro, Team
  ○ Projects
● Google
  ○ Gemini: Free, Advanced
  ○ Gemini for Google Workspace: Business, Enterprise, ...
  ○ Gems
  ○ NotebookLM
● X
  ○ Grok
● Perplexity
  ○ Perplexity
● Genspark
  ○ Genspark
● Microsoft

Slide 11

Slide 11 text

Commonly used clouds and model map
● OpenAI
  ○ GPT: 4o, 4o-mini, o1
  ○ Whisper
  ○ DALL-E
  ○ TTS
  ○ Embeddings
● Anthropic
  ○ Claude: Haiku, Sonnet, Opus
● Google
  ○ Gemini: Ultra, Pro, Flash, Nano
  ○ Gemma
● Clouds
  ○ Google Cloud: Vertex AI
  ○ Azure (Microsoft): AOAI
  ○ AWS: Bedrock

Slide 12

Slide 12 text

Developer console and tools map
● OpenAI
  ○ Playground
● Anthropic
  ○ API Console
● Google
  ○ Google AI Studio
● Google Cloud
  ○ Vertex AI
  ○ Vertex AI Studio
● Azure (Microsoft)
  ○ AOAI
  ○ Azure OpenAI Studio
● AWS
  ○ Bedrock
  ○ Amazon Bedrock Studio
  ○ Amazon Q Developer
● GitHub
  ○ GitHub Copilot

Slide 13

Slide 13 text

What I feel non-engineers ask for most is... Text to Text?

Slide 14

Slide 14 text

Text to Text

Slide 15

Slide 15 text

Video vs Audio??

Slide 16

Slide 16 text

Video(mp4) >>> Audio(mp3)

Slide 17

Slide 17 text

XXX.mp4

Slide 18

Slide 18 text

Today we will walk through how to implement a common requirement, using a recorded conversation as the example.

Slide 19

Slide 19 text

● Transcription ● Diarization ● TimeStamp ● File Size ● Expression

Slide 20

Slide 20 text

Basic pattern: MP4, MP3 (size reduction), data store
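The size-reduction step in this basic pattern could be sketched as follows. This is only an illustration, not the speaker's implementation: it assumes the ffmpeg CLI is installed, and the file names and the 64k mono setting are hypothetical values chosen for speech.

```python
from typing import List


def build_ffmpeg_command(src_mp4: str, dst_mp3: str, bitrate: str = "64k") -> List[str]:
    """Build an ffmpeg argument list that extracts a mono MP3 track from a video."""
    return [
        "ffmpeg",
        "-y",             # overwrite the output file if it already exists
        "-i", src_mp4,    # input video
        "-vn",            # drop the video stream, keep audio only
        "-ac", "1",       # downmix to mono (assumption: speech content)
        "-b:a", bitrate,  # target audio bitrate (assumed value)
        dst_mp3,
    ]


# Hypothetical file names, for illustration only.
cmd = build_ffmpeg_command("meeting.mp4", "meeting.mp3")
print(" ".join(cmd))

# To actually run the conversion (requires ffmpeg on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

The resulting MP3 is what would then be written to the data store.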

Slide 21

Slide 21 text

ex) Whisper

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
audio_file = open("/path/to/file/german.mp3", "rb")
transcriptions = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["segment"],  # ← timestamps
)
print(transcriptions.text)  # ← transcription
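With response_format="verbose_json" and segment granularity, the response also carries per-segment start and end times. A minimal sketch of turning those into timestamped lines is shown below. The sample dicts are hypothetical stand-ins for the returned segments; depending on the SDK version the real segments may be objects with attributes rather than dicts, so adapt the access accordingly.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments):
    """Turn segments with 'start', 'end' and 'text' keys into printable lines."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return lines


# Hypothetical sample data shaped like verbose_json segments.
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Hey, are we still meeting at the cafe?"},
    {"start": 75.5, "end": 79.0, "text": " Yes, I'm good with that!"},
]

for line in format_segments(sample_segments):
    print(line)
```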

Slide 22

Slide 22 text

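ex) Whisper
Whisper has a 25 MB file size limit. With WAVE, even a 4:30 recording is likely to reach 25 MB. Converting to MP3 saves space.
* This also helps reduce cloud storage usage.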
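The arithmetic behind this can be checked with a rough estimate. The sketch below assumes uncompressed 16-bit PCM at 44.1 kHz mono for WAVE and a constant 64 kbps for MP3, ignores file headers, and treats the 25 MB limit as 25 MiB; all of these are assumptions, not values from the slide.

```python
WHISPER_LIMIT_BYTES = 25 * 1024 * 1024  # 25 MB limit from the slide (MiB assumed)


def pcm_wav_bytes(duration_s, sample_rate=44_100, bits=16, channels=1):
    """Approximate size of uncompressed PCM audio, ignoring the WAV header."""
    return int(duration_s * sample_rate * (bits // 8) * channels)


def cbr_mp3_bytes(duration_s, bitrate_kbps=64):
    """Approximate size of a constant-bitrate MP3, ignoring metadata."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


duration = 4 * 60 + 30  # 4:30 in seconds
wav = pcm_wav_bytes(duration)
mp3 = cbr_mp3_bytes(duration)

print(f"WAV mono : {wav / 1024 / 1024:.1f} MiB")
print(f"MP3 64k  : {mp3 / 1024 / 1024:.1f} MiB")
print(f"WAV share of the limit: {wav / WHISPER_LIMIT_BYTES:.0%}")
```

Under these assumptions a 4:30 mono WAVE file already uses most of the limit, and a stereo file exceeds it, while the MP3 stays at a small fraction.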

Slide 23

Slide 23 text

ex) Whisper
If the file is likely to exceed the limit, split it into chunks

from pydub import AudioSegment

song = AudioSegment.from_mp3("good_morning.mp3")

# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000

first_10_minutes = song[:ten_minutes]
first_10_minutes.export("good_morning_10.mp3", format="mp3")
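The slide exports only the first ten minutes. To cover a whole recording, the same slicing can be repeated over consecutive ranges. The helper below is a hypothetical sketch that only computes the millisecond boundaries, so it runs without pydub; the pydub calls appear as comments.

```python
def chunk_ranges(total_ms: int, chunk_ms: int):
    """Return consecutive (start_ms, end_ms) pairs that cover total_ms."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


ten_minutes = 10 * 60 * 1000
ranges = chunk_ranges(25 * 60 * 1000, ten_minutes)  # a 25-minute recording
print(ranges)

# With pydub, len(song) is the duration in milliseconds, so the loop could be:
# for i, (start, end) in enumerate(chunk_ranges(len(song), ten_minutes)):
#     song[start:end].export(f"good_morning_{i}.mp3", format="mp3")
```

Cutting at fixed offsets can split a word in half; picking boundaries at silences would avoid that, at the cost of extra processing.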

Slide 24

Slide 24 text

● Transcription ● Diarization ● TimeStamp ● File Size ● Expression

Slide 25

Slide 25 text

What is diarization? (a detour in the middle of the Whisper example)
A technique that estimates "who spoke when" from audio.
Reference: https://www.youtube.com/watch?v=37R_R82lfwA

Slide 26

Slide 26 text

Transcription

Person A: Hey, are we still meeting at the cafe this afternoon? I’m thinking of working on our project there.
Person B: Yes, I’m good with that! What time are we meeting again?
Person C: I think we said around 3 PM, right? I’ll be there a bit earlier to grab a table.
Person A: Perfect! I’ll bring my laptop and notes. Do we have any updates from last time?
Person B: I worked on the new designs and made some progress. I’ll show them to you when we meet.
Person C: Great! I’ve been working on the backend, so we can discuss how to integrate everything later.
Person A: Awesome, looking forward to it. See you guys at 3!
Person B: See you!
Person C: See you soon!

Diarization comes down to whether you need the A/B/C (speaker) information. That information can only be extracted from the audio itself.

Slide 27

Slide 27 text

ex) Whisper
Add a preprocessing step before transcription

from pyannote.audio import Audio, Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline(audio_file.name)

audio = Audio(sample_rate=16000, mono=True)
for segment, _, speaker in diarization.itertracks(yield_label=True):
    # Cut the speaker's segment out of the audio file
    waveform, sample_rate = audio.crop(no_silence_audio_file.name, segment)
    ...
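Diarization often yields many short turns, and cropping each one means one transcription call per turn. One possible refinement, which is not shown in the slides, is to merge adjacent turns from the same speaker before cropping. The sketch below works on plain (start, end, speaker) tuples, which could be collected from segment.start, segment.end and speaker in the loop above; the 0.5 second gap threshold is an assumed value.

```python
def merge_turns(turns, max_gap=0.5):
    """Merge consecutive turns of the same speaker separated by at most max_gap seconds."""
    merged = []
    for start, end, speaker in sorted(turns):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            # Extend the previous turn instead of starting a new one
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


# Hypothetical diarization output as (start_s, end_s, speaker_label) tuples.
turns = [
    (0.0, 2.1, "SPEAKER_00"),
    (2.3, 4.0, "SPEAKER_00"),
    (4.5, 6.0, "SPEAKER_01"),
    (9.0, 10.0, "SPEAKER_01"),
]

for turn in merge_turns(turns):
    print(turn)
```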

Slide 28

Slide 28 text

ex) Whisper
MP3 MP3 MP3 MP3 MP3

Person A: Hey, are we still meeting at the cafe this afternoon? I’m thinking of working on our project there.
Person B: Yes, I’m good with that! What time are we meeting again?
Person C: I think we said around 3 PM, right? I’ll be there a bit earlier to grab a table.
Person A: Perfect! I’ll bring my laptop and notes. Do we have any updates from last time?
Person B: I worked on the new designs and made some progress. I’ll show them to you when we meet.
Person C: Great! I’ve been working on the backend, so we can discuss how to integrate everything later.
Person A: Awesome, looking forward to it. See you guys at 3!
Person B: See you!
Person C: See you soon!
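Once each per-speaker MP3 has been transcribed, the pieces still need to be stitched into a labeled transcript like the one above. A minimal sketch follows, assuming the diarization labels look like SPEAKER_00 and that the (speaker, text) pairs are already in time order; the function name and sample data are hypothetical.

```python
import string


def label_speakers(pieces):
    """Map raw speaker labels to 'Person A', 'Person B', ... in order of first appearance."""
    names = {}
    lines = []
    for speaker, text in pieces:
        if speaker not in names:
            # Supports up to 26 distinct speakers (A to Z)
            names[speaker] = f"Person {string.ascii_uppercase[len(names)]}"
        lines.append(f"{names[speaker]}: {text.strip()}")
    return lines


# Hypothetical (speaker_label, transcribed_text) pairs, one per cropped segment.
pieces = [
    ("SPEAKER_00", " Hey, are we still meeting at the cafe this afternoon?"),
    ("SPEAKER_01", " Yes, I'm good with that!"),
    ("SPEAKER_00", " Perfect! I'll bring my laptop and notes."),
]

print("\n".join(label_speakers(pieces)))
```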

Slide 29

Slide 29 text

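https://github.com/pyannote/pyannote-audio
● Requires using Hugging Face
● Setting the GPU parameter makes it extremely fast, but a GPU machine is then required
● Works only on NVIDIA. Going through MPS on Apple Silicon, development was not possible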

Slide 30

Slide 30 text

● Transcription ● Diarization ● TimeStamp ● File Size ● Expression

Slide 31

Slide 31 text

Audio Text
For something to combine with Whisper, there currently seem to be only research-stage options.

Slide 32

Slide 32 text

So the bar for Whisper is fairly high? What else is there?

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Upload, reference, result
ex) Vertex AI (Gemini 1.5 Pro)

Slide 35

Slide 35 text

ex) Vertex AI (Gemini 1.5 Pro)

from google.cloud import storage
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

# bucket_name, file_name and file are assumed to be defined by the caller.
# Assumption: the slide does not show how the model is created.
model = GenerativeModel("gemini-1.5-pro")

bucket = storage.Client().bucket(bucket_name)
blob = bucket.blob(file_name)
blob.upload_from_file(file)  # ← save to storage
gcs_uri = f"gs://{bucket_name}/{file_name}"

video = Part.from_uri(mime_type="video/mp4", uri=gcs_uri)  # ← here
prompt = "XXXXX"
response = model.generate_content([video, prompt], stream=True)

result = ""  # accumulate the streamed chunks
for chunk in response:
    result += chunk.text
print(result)
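The same flow applies to audio files if the mime_type passed to Part.from_uri matches the uploaded file. The helper below is a hypothetical sketch that derives the gs:// URI and a MIME type from the file name. The MIME strings are the standard ones; whether the model accepts each of them is an assumption to verify against the Vertex AI documentation.

```python
from pathlib import PurePosixPath

# Standard MIME types for the formats discussed in this talk.
MIME_TYPES = {
    ".mp4": "video/mp4",
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
}


def media_reference(bucket_name: str, file_name: str):
    """Return (gcs_uri, mime_type) for a media object stored in Cloud Storage."""
    suffix = PurePosixPath(file_name).suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"unsupported media type: {suffix or file_name}")
    return f"gs://{bucket_name}/{file_name}", MIME_TYPES[suffix]


# Hypothetical bucket and object names.
gcs_uri, mime_type = media_reference("my-bucket", "talks/XXX.mp4")
print(gcs_uri, mime_type)

# The pair would then be passed on as:
# video = Part.from_uri(mime_type=mime_type, uri=gcs_uri)
```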

Slide 36

Slide 36 text

● Transcription ● Diarization ● TimeStamp ● File Size ● Expression ex) Vertex AI (Gemini 1.5 Pro)

Slide 37

Slide 37 text

ex) Vertex AI (Gemini 1.5 Pro)
● Transcription ⇒ instruction prompt
● Diarization ⇒ instruction prompt
● TimeStamp ⇒ instruction prompt
● File Size ⇒ files over 25 MB are accepted
● Expression ⇒ instruction prompt
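Since four of the five requirements are handled by the instruction prompt, the prompt itself becomes the main piece of code. The slides leave the actual prompt as "XXXXX", so the wording below is purely hypothetical; it only illustrates how each requirement could map to one instruction.

```python
def build_instruction_prompt(transcription=True, diarization=True,
                             timestamps=True, expression=True) -> str:
    """Assemble a hypothetical instruction prompt, one line per requirement."""
    parts = []
    if transcription:
        parts.append("Transcribe the conversation in this file.")
    if diarization:
        parts.append("Label each utterance with its speaker (Person A, Person B, ...).")
    if timestamps:
        parts.append("Prefix each utterance with its start time as MM:SS.")
    if expression:
        parts.append("Add a short note on the speaker's tone or expression for each utterance.")
    return "\n".join(parts)


prompt = build_instruction_prompt()
print(prompt)

# The result would replace the "XXXXX" placeholder, for example:
# response = model.generate_content([video, prompt], stream=True)
```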

Slide 38

Slide 38 text

With Vertex AI (Gemini 1.5 Pro), all of this works by default.
For now it is the single strongest option.

Slide 39

Slide 39 text

San Francisco: October 1
London: October 30
Singapore: November 21
Will the situation change?

Slide 40

Slide 40 text

No content