Video Interpolation with Diffusion Models

Video Interpolation with Diffusion Models
Presenter: tomoaki_teshima
2024/Aug/3, 61st Computer Vision Study Group @ Kanto (第61回コンピュータビジョン勉強会＠関東), CVPR 2024 Reading Session (Part 2)
tomoaki_teshima tomoaki0705

Video Interpolation with Diffusion Models
Siddhant Jain1*, Daniel Watson2*, Eric Tabellion1, Aleksander Holynski1, Ben Poole2, Janne Kontkanen1
1Google Research 2Google DeepMind *Equal contribution

Interpolated video
Panels: GT, AMT, FILM, LDMVFI, RIFE, VIDIM (Ours)

Basic Architecture
• Figure taken from the Supplementary Website
• Cascaded Diffusion Model
• The input is the first and last frames of the sequence
• Stage 1 generates 7 low-resolution frames
• Stage 2 upscales them to high resolution
• A diffusion model is used for image generation

Thank you for your attention

Purpose / Summary
• Proposes a diffusion-model-based generative approach to frame interpolation
• VIDIM (Video Interpolation Diffusion Model) connects two models in series
• Avoids the classifier-based model used in prior work by conditioning instead
• Tested on DAVIS and UCF101
• Differences from prior methods are evaluated quantitatively and qualitatively

Ho et al.[16]
• Upscaling in a cascaded model
• Conditioned on class labels
• About 1000 classes are used, such as house, comic, and zebra
(Figure quoted from reference [16])

Comparison with prior work [16]
"We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the super resolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results"
• No classifier is used; the start and end frames are used instead
• start and end frame: conditioning
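The classifier-free guidance mentioned in the quote can be illustrated with the commonly used guidance rule, which blends a conditional and an unconditional noise prediction. This is a minimal sketch, not the paper's implementation: the function name `guided_prediction`, the `scale` values, and the toy numbers are assumptions for illustration, and the slides do not state the guidance weight used by VIDIM.

```python
def guided_prediction(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    eps_cond:   prediction when the start/end frames are given
    eps_uncond: prediction when the conditioning is dropped
    scale:      guidance scale (assumed value; 1.0 means no extra guidance)
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy per-pixel predictions (made-up numbers for illustration only).
eps_cond = [0.5, -0.25, 1.0]
eps_uncond = [0.25, -0.25, 0.5]

no_guidance = guided_prediction(eps_cond, eps_uncond, 1.0)  # equals eps_cond
stronger = guided_prediction(eps_cond, eps_uncond, 2.0)     # pushed further toward the condition
print(no_guidance)
print(stronger)
```

With `scale` at 1.0 the result is just the conditional prediction; larger values move the prediction further in the direction indicated by the start and end frames. No separate classifier network appears anywhere in this rule.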

Jiang et al.[21]
Super slomo: High quality estimation of multiple intermediate frames for video interpolation

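Comparison with prior work [21 etc.]
"Most works also agree that optical flow is best learned for the frame interpolation"
• A remarkable number of methods rely on optical flow
• Using optical flow implicitly relies on the following constraints:
  • The frames are not far apart (short-interval interpolation)
  • The motion can be approximated as a straight line (linearity)
  • There are no scenes where the viewpoint changes significantly, such as panning
• This method fills the gap with a diffusion model instead of optical flow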
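The linearity constraint can be made concrete with a small worked example. The sketch below is not from the paper: it assumes a hypothetical point moving along a quarter circle between the two input frames and compares its true middle position with the position a straight-line motion model would predict.

```python
import math


def lerp(p0, p1, t):
    """Linearly interpolate between two 2D points (straight-line motion)."""
    return (p0[0] + t * (p1[0] - p0[0]), p0[1] + t * (p1[1] - p0[1]))


def on_circle(angle, radius=1.0):
    """Position of a point on a circle centered at the origin."""
    return (radius * math.cos(angle), radius * math.sin(angle))


# Hypothetical motion: the point sweeps a quarter circle between the two frames.
start = on_circle(0.0)
end = on_circle(math.pi / 2)

true_mid = on_circle(math.pi / 4)   # where the point really is at the middle frame
linear_mid = lerp(start, end, 0.5)  # where a linear-motion model places it

error = math.dist(true_mid, linear_mid)
print(f"true midpoint:   ({true_mid[0]:.3f}, {true_mid[1]:.3f})")
print(f"linear midpoint: ({linear_mid[0]:.3f}, {linear_mid[1]:.3f})")
print(f"position error:  {error:.3f} (in units of the radius)")
```

For a swept angle θ on a circle of radius r, the midpoint error is r(1 − cos(θ/2)), so it is negligible for small motions and grows as the frames move further apart in time.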

DAVIS [35]: a dataset I have seen more often than my own parents' faces
• Flow-edge Guided Video Completion: extracts optical flow and restores missing regions from the preceding and following frames (ECCV 2020, 5th All-Japan Computer Vision Study Group)
• Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories: proposes Point Trajectories, an optical flow that spans multiple frames (ECCV 2022, 57th Computer Vision Study Group @ Kanto)

Cascaded Diffusion Model
• Base model
  • From the 64x64 start and end frames, generates 7 interpolated 64x64 images
• Super-resolution model
  • From the 256x256 start and end frames and the seven 64x64 images, generates seven 256x256 images
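The data flow through the two stages can be sketched at the level of tensor shapes. This is only shape bookkeeping under stated assumptions: `base_model` and `super_resolution_model` are hypothetical placeholders rather than the real networks, and the channel count of 3 (RGB) is an assumption; the frame count and resolutions come from the slide.

```python
NUM_MIDDLE_FRAMES = 7   # from the slides
LOW_RES = 64            # from the slides
HIGH_RES = 256          # from the slides
CHANNELS = 3            # assumption: RGB frames


def base_model(start_shape, end_shape):
    """Placeholder for the base diffusion model: returns the output shape only."""
    expected = (LOW_RES, LOW_RES, CHANNELS)
    assert start_shape == expected and end_shape == expected
    return (NUM_MIDDLE_FRAMES, LOW_RES, LOW_RES, CHANNELS)


def super_resolution_model(start_shape, end_shape, low_res_frames_shape):
    """Placeholder for the super-resolution diffusion model: shape only."""
    expected = (HIGH_RES, HIGH_RES, CHANNELS)
    assert start_shape == expected and end_shape == expected
    assert low_res_frames_shape == (NUM_MIDDLE_FRAMES, LOW_RES, LOW_RES, CHANNELS)
    return (NUM_MIDDLE_FRAMES, HIGH_RES, HIGH_RES, CHANNELS)


low = base_model((64, 64, 3), (64, 64, 3))
high = super_resolution_model((256, 256, 3), (256, 256, 3), low)
print("base output:", low)
print("super-resolution output:", high)
```

The second stage receives both the high-resolution start and end frames and the low-resolution output of the first stage, which is what makes the cascade a series connection rather than two independent models.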

Training the base model
Figure labels: start, end

Training the super-resolution model
• The 64x64 images are naively upsampled before being passed to the super-resolution model
• "concatenates each (naively upsampled) low-resolution conditioning frame to the noisy high-resolution frames along the channel axis"
Figure labels: start, end
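The upsample-then-concatenate step can be sketched with toy data. This is a minimal illustration, not the paper's code: nearest-neighbor upsampling is an assumption (the slides only say "naively"), the helper names are hypothetical, and the tiny 2x2 and 4x4 sizes stand in for 64x64 and 256x256.

```python
def upsample_nearest(channel, factor):
    """Nearest-neighbor upsampling of one 2D channel (a list of rows)."""
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in channel
        for _ in range(factor)
    ]


def concat_channels(noisy_high_res, low_res, factor):
    """Concatenate an upsampled low-res frame to a noisy high-res frame.

    Both frames are lists of channels, each channel a list of rows (C, H, W).
    """
    upsampled = [upsample_nearest(ch, factor) for ch in low_res]
    assert len(upsampled[0]) == len(noisy_high_res[0])        # heights match
    assert len(upsampled[0][0]) == len(noisy_high_res[0][0])  # widths match
    return noisy_high_res + upsampled  # stack along the channel axis


# Toy sizes: a 1-channel 2x2 low-res frame and a 1-channel 4x4 noisy frame.
low_res = [[[1, 2],
            [3, 4]]]
noisy_high_res = [[[0.0] * 4 for _ in range(4)]]

model_input = concat_channels(noisy_high_res, low_res, factor=2)
print(len(model_input))  # number of channels after concatenation
for row in model_input[1]:
    print(row)
```

With the sizes from the slides the factor would be 4, and with RGB frames (an assumption) each concatenated frame would carry 6 channels. Because the conditioning enters through extra input channels, it is consistent with the "without additional parameters" wording quoted earlier in the deck, apart from a wider first layer.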

Fig.1 Results (middle frame)
• Prior methods produce results that look like blended frames, or blurry results
• The proposed method does not always match the ground-truth image either

Results
Panels: GT, AMT, FILM, LDMVFI, RIFE, VIDIM (Ours)

Quantitative evaluation (1)

Quantitative evaluation (2)

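About the quantitative evaluation
• The performance is not the best among the compared methods
• It is known that evaluating generative-model outputs with reconstruction-based metrics does not necessarily yield good scores
• Blurry images tend to score very well on reconstruction metrics
• In qualitative evaluation, blurry images tend to receive the lowest ratings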
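The tendency of blurry images to score well on reconstruction metrics can be reproduced with a toy example. The sketch below is an illustration under made-up assumptions, not data from the paper: a single bright pixel on an 8-pixel line could plausibly sit at position 2, 3, or 4, the ground truth is position 3, and PSNR is compared for a sharp but misplaced guess versus the blurry average of all plausible guesses.

```python
import math


def mse(a, b):
    """Mean squared error between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with values in [0, peak]."""
    return 10.0 * math.log10(peak ** 2 / mse(a, b))


def one_hot(position, length=8):
    """A 1D 'image' with a single bright pixel."""
    return [1.0 if i == position else 0.0 for i in range(length)]


ground_truth = one_hot(3)
sharp_but_shifted = one_hot(2)               # crisp, but one pixel off
plausible = [one_hot(p) for p in (2, 3, 4)]  # hypothetical plausible outcomes
blurry_average = [sum(col) / len(plausible) for col in zip(*plausible)]

psnr_sharp = psnr(ground_truth, sharp_but_shifted)
psnr_blurry = psnr(ground_truth, blurry_average)
print(f"sharp but shifted: {psnr_sharp:.2f} dB")
print(f"blurry average:    {psnr_blurry:.2f} dB")
```

The blurry average scores roughly 10.8 dB against roughly 6.0 dB for the sharp guess, even though a human viewer would likely prefer the sharp one. A generative model that commits to one plausible sharp outcome is penalized in the same way.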

Qualitative evaluation
A landslide win for the proposed method

Contribution of super-resolution
Without conditioning during upscaling, the result still comes out blurry

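Summary
• Proposes VIDIM, a diffusion model for frame interpolation
• Does not use optical flow, which was common in prior methods
• Does not use a classifier either
• Realized as a series connection of a base model that produces low-resolution interpolated images and a subsequent super-resolution model
• Conditioning improves image quality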

Parts I could not fully read (confession)
• How noise is added in the diffusion model
• I could not follow the equations at all
• The super-resolution model has a different topology from the base model, but I could not work out how it differs

Aki Teshima