CVPR2024 Participation Report - Speaker Deck

Slide 1

Slide 1 text

CVPR2024 Participation Report: Focusing on Image/Video Generative Models
2024.07
Hiroki Kawauchi, Koichi Hamada
DeNA Co., Ltd.

Slide 2

Slide 2 text

Contents
00｜Self-introduction
01｜CVPR2024 Overview and Trends
02｜Workshops & Tutorials
03｜Main Conference

Slide 3

Slide 3 text

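Speaker: Hiroki Kawauchi (河内大輝)
AI engineer and data scientist (second year in industry)
@DeNA
▪ AI product development using LLMs and computer vision, for games and other services
@University / internships / personal projects
Urban planning and transportation × AI
▪ App development for urban planning using 3D city models and AI agents
▪ New business development of a logistics monitoring system using satellite imagery
Space utilization and decarbonization × AI
▪ Development of a forest and carbon-sequestration monitoring system using satellite imagery
▪ Research on object detection models for satellite imagery [Kawauchi & Fuse, 2022]
linkedin.com/in/hiroki-kawauchi
https://github.com/hiroki-kawauchi
https://x.com/kwchrk_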

Slide 4

Slide 4 text

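Speaker: Koichi Hamada (濱田晃一) (@hamadakoichi)
Launched DeNA's machine learning/AI team in 2010; in the 14 years since, research, development, and service delivery of machine learning/AI
▪ 2010-: Joined DeNA mid-career / launched the game machine learning team and applied ML to games
▪ 2011-: Mobage: implemented and delivered dozens of distributed machine learning services
▪ 2014-: All DeNA services: implemented and delivered machine learning across diverse services
▪ Ph.D.: quantum statistical field theory (theoretical physics)
▪ Organizer of TokyoWebmining
▪ QA4AI (AI product quality assurance guidelines): AI Content Generation Working Group Leader
▪ Book: Mobageを支える技術 (CEDEC 2014 Best Book Award)
[Examples of development and services] SNS, manga, games, platform, dialogue, news, fashion, anime generation
linkedin.com/in/hamadakoichi
twitter.com/hamadakoichi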

Slide 5

Slide 5 text

Contents
01｜CVPR2024 Overview and Trends
02｜Workshops & Tutorials
03｜Main Conference

Slide 6

Slide 6 text

01 CVPR2024 Overview and Trends

Slide 7

Slide 7 text

CVPR2024: Overview
▪ A top conference in computer vision and pattern recognition
▪ Ranked 2nd in the h5-index ranking, behind only Nature (440, as of 2024/08)
▪ Held in Seattle, USA
▪ 6/17-18: Workshops & Tutorials, 6/19-21: Main Conference
Logo (left) source: https://media.eventhosts.cc/Conferences/CVPR2024/OpeningRemarkSlides.pdf

Slide 8

Slide 8 text

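CVPR2024: Venue
▪ Venue: Seattle Convention Center (Arch and Summit buildings)
▪ About 20 minutes by car from Seattle-Tacoma airport
▪ The area around the venue is safe
▪ Moving from the Oral sessions to the Poster sessions requires changing buildings (about 10 minutes)
Image (right) source: https://media.eventhosts.cc/Conferences/CVPR2024/OpeningRemarkSlides.pdf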

Slide 9

Slide 9 text

CVPR2024: Stats
▪ 11,532 submissions
▪ 2,719 accepted papers
▪ 23.6% acceptance rate
▪ 3.3% orals
▪ 11.9% highlights
▪ 35,691 authors
▪ 28% of them affiliated with companies
▪ 9,872 reviewers
Source: https://media.eventhosts.cc/Conferences/CVPR2024/OpeningRemarkSlides.pdf
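The percentages on this slide can be cross-checked against the raw counts. The sketch below is only an arithmetic check; it assumes (this is not stated on the slide) that the oral and highlight percentages are shares of the accepted papers.

```python
# Figures quoted on the slide.
submissions = 11_532
accepted = 2_719

acceptance_rate = accepted / submissions

# Assumption: oral/highlight percentages are fractions of accepted papers.
oral_share = 0.033
highlight_share = 0.119
approx_orals = round(accepted * oral_share)
approx_highlights = round(accepted * highlight_share)

print(f"acceptance rate: {acceptance_rate:.1%}")
print(f"approx. orals: {approx_orals}, approx. highlights: {approx_highlights}")
```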

Slide 10

Slide 10 text

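CVPR2024: Attendees
▪ Record-high attendance, including virtual participants
▪ 347 attendees from Japan (about 3%)
▪ cf. slightly higher than the share of Japanese authors (2.43% ※1)
※1: "CVPR2024-速報-", ResearchPort トップカンファレンス定点観測 vol.13
Other figures source: https://media.eventhosts.cc/Conferences/CVPR2024/OpeningRemarkSlides.pdf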

Slide 11

Slide 11 text

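CVPR2024: Lunch / Reception
▪ Lunch was included in the registration fee; sandwiches and similar items were served (seating was sufficient)
▪ The reception required sign-up and filled up early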

Slide 12

Slide 12 text

CVPR2024: Company Booths
▪ Booths (adjacent to the poster hall)
Image (right) source: https://cvpr.thecvf.com/Conferences/2024/Sponsors

Slide 13

Slide 13 text

CVPR2024: Art Exhibitions / Demos
▪ Art exhibits and demo exhibits (adjacent to the venue)
▪ They mostly overlap with the poster sessions, so time has to be budgeted

Slide 14

Slide 14 text

CVPR2024: Trends
▪ Image / Video Generative Models
▪ 3D Gaussian Splatting
▪ XR-related demand, e.g. from Meta
▪ Event camera work also seemed to increase
▪ VLM, MLLM
Source: Workbook: CVPR 2024

Slide 15

Slide 15 text

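CVPR2024: Paper Search and Trend Analysis Tool
▪ An engineer at Voxel51 released a CVPR paper search tool that runs on Colab
▪ Besides browsing papers, it also supports operations such as clustering
Search tool URL: https://huggingface.co/blog/harpreetsahota/cvpr2024-survival-guide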

Slide 16

Slide 16 text

02 Workshops & Tutorials

Slide 17

Slide 17 text

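Workshops & Tutorials: Overview
▪ 123 workshops were accepted
▪ Some run all day; the program is rich, but it is impossible to attend everything
▪ Workshops assigned to smaller rooms sometimes had standing room only or admission limits
▪ They offer fresher information than the Main Conference
▪ Several companies (Luma AI, Runway) announced video generation models during the conference, and some sessions turned into product launches (with almost no technical detail given)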

Slide 18

Slide 18 text

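Workshops Pickup: Video Generation
▪ 23632 AI for Content Creation AI4CC
  ▪ Most Main Conference papers address image generation, whereas the most recent, industry-driven discussion centers on video generation
  ▪ Adobe's growing presence in content creation
▪ The Future of Video Generation: Beyond Data and Scale (Tali Dekel, Weizmann Institute)
  ▪ Challenges cited for video generation: physical consistency, plus poor controllability of camera pose, geometry, characters' identity, movements, emotions, and so on
  ▪ Foundation models vs task-specific models
    ▪ Strength of foundation models: space-time priors
    ▪ Strength of specialized models: computational cost, controllability
  ▪ Pointed out that combining the two is important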

Slide 19

Slide 19 text

Workshops Pickup: Video Generation
▪ 23632 AI for Content Creation AI4CC
▪ The Future of Video Generation: Beyond Data and Scale (Tali Dekel, Weizmann Institute)
  ▪ Space-Time Features for Text-driven Motion Transfer, CVPR2024
    ▪ Defines a feature descriptor (SMM) that is robust to changes in appearance and shape
    ▪ Introduces a space-time feature loss so that even temporally high-frequency motion is generated
    ▪ Extracts prior information from a pretrained text-to-video diffusion model and applies it to a specialized task
Source: https://diffusion-motion-transfer.github.io/

Slide 20

Slide 20 text

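Workshops Pickup: Video Generation
▪ 23632 AI for Content Creation AI4CC
▪ Sora: Video Generation Models as World Simulators (Tim Brooks, OpenAI)
  ▪ Recognizes that improving speed is also an important challenge for video generation
  ▪ Even when specializing for a particular task, the problem should be solved with data rather than architecture
  ▪ Q: Will physical failures and 3D inconsistencies be solved by scaling models and data?
    ▪ A: Scale may solve them to some extent, but architecture and physics-constrained loss functions should also be considered
    ▪ A: User evaluation is also important in deciding how rigorously to run that simulation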

Slide 21

Slide 21 text

Workshops Pickup: Video Generation
▪ 23675 First Workshop on Efficient and On Device Generation EDGE
▪ DREAM MACHINE (Jiaming Song, Luma AI)
  ▪ Luma AI was founded in 2021 → released a 3D scanning app (NeRF) → GENIE (text-to-3D), DREAM MACHINE (text-to-video)
  ▪ Mentioned that developing video generation models may also alleviate the shortage of 3D training data
  ▪ Challenges cited for video generation: physics and camera control, plus multi-face (a face appearing on both sides) and morphing (unnatural deformation)

Slide 22

Slide 22 text

Workshops Topics & Records
▪ List of publicly available workshop recordings
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
▪ Workshop home pages: CVPR 2024 Workshop List
▪ Track on 3D Vision
  ▪ 23612 2nd Workshop on Compositional 3D Vision
  ▪ 23611 - 3rd Monocular Depth Estimation Challenge
  ▪ 23609 7th International Workshop on Visual Odometry and Computer Vision Applications...
  ▪ 23610 Second Workshop for Learning 3D with Multi View Supervision
  ▪ 23608 ViLMa – Visual Localization and Mapping
▪ Track on Applications
  ▪ 23687 10th IEEE International Workshop on Computer Vision in Sports
  ▪ 23681 Agriculture Vision Challenges & Opportunities for Computer Vision in Agriculture
  ▪ 23685 GAZE 2024 6th Workshop on Gaze Estimation and Prediction in the Wild
  ▪ 23684 MetaFood Workshop MTF
  ▪ 23683 RetailVision Field Overview and Amazon Deep Dive
  ▪ 23688 - Workshop on Virtual Try-On

Slide 23

Slide 23 text

Workshops Topics & Records
▪ List of publicly available workshop recordings
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
▪ Track on Assistive Technology
  ▪ 23652 VizWiz Grand Challenge Describing Images and Videos Taken by Blind People
▪ Track on Assortment of Recognition Topics
  ▪ 23583 2nd Workshop on Scene Graphs and Graph Representation Learning
  ▪ 23584 Image Matching Local Features and Beyond
▪ Track on Autonomous Driving
  ▪ 23648 7th Workshop on Autonomous Driving WAD
  ▪ 23649 Data Driven Autonomous Driving Simulation DDASD
  ▪ 23651 Populating Empty Cities – Virtual Humans for Robotics and Autonomous Driving
  ▪ 23650 Vision and Language for Autonomous Driving and Robotics VLADR
▪ Track on Biometrics and Forensics
  ▪ 23641 6th Workshop and Competition on Affective Behavior Analysis in the wild
  ▪ 23637 The 5th Face Anti Spoofing Workshop

Slide 24

Slide 24 text

Workshops Topics & Records
▪ List of publicly available workshop recordings
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
▪ Track on Computational Photography
  ▪ 23626 20th Workshop on Perception Beyond the Visible Spectrum
  ▪ 23627 The 5th Omnidirectional Computer Vision Workshop
  ▪ 23624 The 7th Workshop and Challenge Bridging the Gap between Computational Photography and Visual...
▪ Track on Contemporary discussions and Community building
  ▪ 23622 LatinX in Computer Vision Research Workshop
  ▪ 23621 Women in Computer Vision
▪ Track on Content Creation
  ▪ 23633 AI for 3D Generation
  ▪ 23632 AI for Content Creation AI4CC
  ▪ 23631 The Future of Generative Visual Art
  ▪ 23635 Workshop on Computer Vision for Fashion, Art, and Design
  ▪ 23634 Workshop on Graphic Design Understanding and Generation GDUG

Slide 25

Slide 25 text

Workshops Topics & Records
▪ List of publicly available workshop recordings
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
▪ Track on Efficient Methods
  ▪ 23578 Efficient Large Vision Models
  ▪ 23576 Fifth Workshop on Neural Architecture Search
▪ Track on Egocentric & Embodied AI
  ▪ 23596 First Joint Egocentric Vision EgoVis Workshop
  ▪ 23598 The 5th Annual Embodied AI Workshop
▪ Track on Emerging Learning Paradigms
  ▪ 23591 1st Workshop on Dataset Distillation for Computer Vision
▪ Track on Emerging Topics
  ▪ 23572 Equivariant Vision From Theory to Practice
▪ Track on Foundation Models
  ▪ 23667 2nd Workshop on Foundation Models
  ▪ 23668 Foundation Models for Autonomous Systems
  ▪ 23670 Towards 3D Foundation Models Progress and Prospects

Slide 26

Slide 26 text

Workshops Topics & Records
▪ List of publicly available workshop recordings
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
▪ Track on Generative Models
  ▪ 23672 - 2nd Workshop on Generative Models for Computer Vision
  ▪ 23675 First Workshop on Efficient and On Device Generation EDGE
  ▪ 23676 GenAI Media Generation Challenge for Computer Vision Workshop
  ▪ 23674 ReGenAI First Workshop on Responsible Generative AI
  ▪ 23673 The First Workshop on the Evaluation of Generative Foundation Models
▪ Track on Human Understanding
  ▪ 23604 New Challenges in 3D Human Understanding
  ▪ 23606 Workshop on Human Motion Generation
▪ Track on Medical Vision
  ▪ 23664 9th Workshop on Computer Vision for Microscopy Image Analysis
  ▪ 23665 Data Curation and Augmentation in Enhancing Medical Imaging Applications
  ▪ 23663 Domain adaptation, Explainability and Fairness in AI for Medical Image Analysis

Slide 27

Slide 27 text

Workshops Topics & Records
▪ List of publicly available workshop recordings
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
▪ Track on Mobile and Embedded Vision
  ▪ 23628 Third Workshop of Mobile Intelligent Photography & Imaging
▪ Track on Multimodal Learning
  ▪ 23567 7th MUltimodal Learning and Applications
  ▪ 23568 Multimodal Algorithmic Reasoning Workshop
▪ Track on Open World Learning
  ▪ 23594 VAND 2 0 Visual Anomaly and Novelty Detection
  ▪ 23595 Visual Perception via Learning in an Open World
▪ Track on Physics, Graphics, Geometry, AR/VR/MR
  ▪ 23616 Computer Vision for Mixed Reality
  ▪ 23618 The Sixth Workshop on Deep Learning for Geometric Computing DLGC 2024

Slide 28

Slide 28 text

Workshops Topics & Records
▪ List of publicly available workshop recordings
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
▪ Track on Responsible and Explainable AI
  ▪ 23642 2nd Workshop on Multimodal Content Moderation mp4
  ▪ 23643 The 3rd Explainable AI for Computer Vision XAI4CV Workshop
  ▪ 23644 The Fifth Workshop on Fair, Data efficient, and Trusted Computer Vision
  ▪ 23645 Workshop on Responsible Data
▪ Track on Science Applications
  ▪ 23658 4th Workshop on CV4Animals Computer Vision for Animal Behavior Tracking and Modeling
  ▪ 23659 AI4Space 2024
  ▪ 23660 Computer Vision for Materials Science Workshop
  ▪ 23661 The Seventh International Workshop on Computer Vision for Physiological Measurement CVPM
▪ Track on Synthetic Data
  ▪ 23678 SyntaGen Harnessing Generative Models for Synthetic Visual Datasets
  ▪ 23677 Synthetic Data for Computer Vision

Slide 29

Slide 29 text

Workshops Topics & Records
▪ List of publicly available workshop recordings
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
▪ Track on Urban Environments
  ▪ 23654 1st Workshop on Urban Scene Modeling
  ▪ 23656 8th AI City Challenge
▪ Track on Video Understanding
  ▪ 23567 7th MUltimodal Learning and Applications
  ▪ 23568 Multimodal Algorithmic Reasoning Workshop
  ▪ 23603 Learning from Procedural Videos and Language What is Next?

Slide 30

Slide 30 text

Tutorials Topics & Records
▪ List of publicly available tutorial recordings
▪ Tutorial home pages: CVPR 2024 Tutorial List
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
  ▪ 23728 full Learning Deep Low dimensional Models mp4
  ▪ 23735 Efficient Homotopy full mp4
  ▪ 23730 Diffusion based Video
  ▪ 23726 Unifying Graph Neural Networks
  ▪ 23736 Computational Design of Diverse Morphologies and Sensors for Vision and Robotics
  ▪ 23725 All You Need To Know about Point Cloud Understanding
  ▪ 23724 Machine Unlearning in Computer Vision
  ▪ 23721 Deep Stereo Matching in the Twenties
  ▪ 23720 All You Need to Know about Self Driving
  ▪ 23713 Towards Building AGI in Autonomy and Robotics
  ▪ 23717 From Multimodal LLM to Human level AI
  ▪ 23715 Contactless AI Healthcare using Cameras and Wireless Sensors
  ▪ 23716 Disentanglement and Compositionality in Computer Vision
  ▪ 27319 Edge Optimized Deep Learning
  ▪ 23733 Full Stack, GPU based Acceleration

Slide 31

Slide 31 text

Tutorials Topics & Records
▪ List of publicly available tutorial recordings
▪ Tutorial home pages: CVPR 2024 Tutorial List
▪ Searching YouTube for the titles below finds them (ComputerVisionFoundation Videos channel)
  ▪ 23731 Robustness at Interference
  ▪ 23729 Object centric Representations in Computer Vision
  ▪ 23727 Geospatial Computer Vision and Machine Learning for Large Scale Earth Observation Data
  ▪ 23718 3D 4D Generation and Modeling with Generative Priors
  ▪ 23734 SCENIC An Open Source Probabilistic Programming System for Data Generation
  ▪ 23722 End to End Autonomy A New Era of Self Driving
  ▪ 23714 Edge AI in Action Practical Approaches to Developing and Deploying Optimized Models

Slide 32

Slide 32 text

03 Main Conference

Slide 33

Slide 33 text

03-01 Main Conference Overview / Notable Papers by Task (Pickup)

Slide 34

Slide 34 text

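Main Conference: Overview
▪ Oral / Highlight / Poster
  ▪ Changed from last year's poster-centered format
  ▪ Oral: presented in the talk venue, about 18 minutes each (90/5 ≈ 18)
  ▪ Highlight: marked with a sign above the poster
  ▪ Poster: about 400 posters per 90-minute session
Source: Workbook: CVPR 2024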

Slide 35

Slide 35 text

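Main Conference: Awarded Papers
▪ Best Papers
  ▪ Generative Image Dynamics (lower-left figure; covered in detail later)
    ▪ Google Research
    ▪ Generates seamlessly looping videos and interactive motion from a still image
  ▪ Rich Human Feedback for Text-to-Image Generation (lower-right figure; covered in detail later)
    ▪ Google Research and others
    ▪ A feedback dataset for improving text-to-image models, and a model that predicts that feedback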

Slide 36

Slide 36 text

Main Conference: Awarded Papers
▪ Best Paper Runners-Up
  ▪ EventPS: Real-Time Photometric Stereo Using an Event Camera
    ▪ Peking University and others
    ▪ Photometric stereo using an event camera (stereo from a single viewpoint under varying illumination)

Slide 37

Slide 37 text

Main Conference: Awarded Papers
▪ Best Paper Runners-Up
  ▪ pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
    ▪ MIT and others
    ▪ 3DGS from two input images

Slide 38

Slide 38 text

Main Conference: Awarded Papers
▪ Best Student Papers
  ▪ Mip-Splatting: Alias-free 3D Gaussian Splatting
    ▪ Suppresses errors when rendering 3DGS at different scales, using a 2D Mip filter among other techniques
  ▪ BioCLIP: A Vision Foundation Model for the Tree of Life
    ▪ A CLIP specialized for biology: builds a large-scale dataset and generalizes by accounting for the taxonomic hierarchy
▪ Best Student Paper Runners-Up
  ▪ SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency
    ▪ A 3D shape matching method that efficiently searches for the global optimum while preserving geometric consistency
  ▪ Image Processing GNN: Breaking Rigidity in Super-Resolution
    ▪ A super-resolution method that uses a GNN to aggregate information flexibly at the pixel level
  ▪ Objects as volumes: A stochastic geometry view of opaque solids
    ▪ A theoretical treatment of volumetric representations for 3D reconstruction of opaque solids
  ▪ Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
    ▪ Compares Transformers and CNNs using explanation methods (in particular, the importance of Layer Norm)

Slide 39

Slide 39 text

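Main Conference: Commercially Usable Licenses & Notable Papers
▪ Selection criteria: Orals/Highlights, many GitHub stars, MIT/Apache-2 license
  ▪ GitHub stars and licenses are as of 2024/08
  ▪ ※ Licenses of data and loaded models, beyond the main repository, have not been checked
▪ Image Generation
  ▪ Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
    ▪ GitHub Stars: 200 / License: Apache-2 / Oral
    ▪ Uses an LLM to generate an intermediate representation, controlling the diffusion model for complex edits and at the instruction level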

Slide 40

Slide 40 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Image Generation
  ▪ InstanceDiffusion: Instance-level Control for Image Generation
    ▪ GitHub Stars: 400 / License: Apache-2 / Poster
    ▪ Allows the location (e.g. a box) of each instance in the generated image to be specified

Slide 41

Slide 41 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Image Generation
  ▪ DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
    ▪ GitHub Stars: 400 / License: MIT / Highlight
    ▪ Parallelizes inference by splitting the image into patches; up to 6.1x speedup on eight A100 GPUs

Slide 42

Slide 42 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Image Generation
  ▪ Style Aligned Image Generation via Shared Attention
    ▪ GitHub Stars: 1100 / License: Apache-2 / Oral
    ▪ Proposes an attention-sharing mechanism for generating images with a consistent style

Slide 43

Slide 43 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Video Editing
  ▪ RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
    ▪ GitHub Stars: 200 / License: MIT / Highlight
    ▪ Zero-shot video editing (translation) that leverages diffusion-based text-to-image models

Slide 44

Slide 44 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Layout Generation
  ▪ Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
    ▪ GitHub Stars: 80 / License: Apache-2 / Oral
    ▪ Generates layouts by retrieving and referring to existing layout examples (retrieval augmentation)

Slide 45

Slide 45 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Image Restoration
  ▪ DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
    ▪ GitHub Stars: 230 / License: MIT / Poster
    ▪ Handles five document tasks in a unified way: dewarping, shadow removal, appearance enhancement, deblurring, and binarization

Slide 46

Slide 46 text

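Main Conference: Commercially Usable Licenses & Notable Papers
▪ 3D Gaussian Splatting
  ▪ What is 3D Gaussian Splatting in the first place?
    ▪ A new method for free-viewpoint 3D synthesis, proposed at ACM SIGGRAPH 2023
    ▪ Overview in Japanese: 3D Gaussian Splatting for Real-Time Radiance Field Rendering - Speaker Deck
    ▪ Represents color information as a set of 3D Gaussians
    ▪ Higher quality than NeRF, with rendering tens of times faster (real-time rendering)
    ▪ ※ The original 3DGS work requires confirmation for commercial use, so a license inquiry is needed
Figure: processing flow and gradient propagation flow of 3D Gaussian Splatting. Source: Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Trans. Graph. 42.4 (2023): 139-1.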
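To make "a set of 3D Gaussians carrying color" more concrete, the following is a minimal sketch of how depth-sorted splats covering one pixel can be alpha-blended front to back. It is not the authors' renderer: the isotropic 2D falloff in `gaussian_alpha`, the function names, and the toy values are all assumptions chosen for illustration.

```python
import math

def gaussian_alpha(opacity, dx, dy, sigma):
    """Opacity contributed by one splat at pixel offset (dx, dy).

    Simplification: an isotropic 2D Gaussian instead of a projected
    anisotropic covariance.
    """
    return opacity * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def composite(colors, alphas):
    """Front-to-back blending: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Two depth-sorted splats centered on the same pixel: red in front of blue.
splat_colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
splat_alphas = [gaussian_alpha(0.5, 0.0, 0.0, 1.0),
                gaussian_alpha(0.5, 0.0, 0.0, 1.0)]
pixel_color, remaining = composite(splat_colors, splat_alphas)
print(pixel_color, remaining)
```

Because every step is a simple weighted sum, the blend is differentiable with respect to colors and opacities, which is what allows the Gaussians to be optimized from images.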

Slide 47

Slide 47 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ 3D Gaussian Splatting
  ▪ SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering
    ▪ GitHub Stars: 1800 / License: MIT / Poster
    ▪ Efficiently extracts a high-quality mesh from 3DGS (a set of 3D Gaussians)
  ▪ Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
    ▪ GitHub Stars: 500 / License: MIT / Poster
    ▪ 3DGS for real-time rendering of dynamic scenes; 60 FPS at 8K resolution
  ▪ pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
    ▪ GitHub Stars: 80 / License: Apache-2 / Oral
    ▪ 3DGS from two input images

Slide 48

Slide 48 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ 3D Gaussian Splatting
  ▪ LangSplat: 3D Language Gaussian Splatting
    ▪ GitHub Stars: 400 / License: follows the original 3DGS license / Highlight
    ▪ Builds a 3D language field with CLIP + 3DGS, improving 3D object localization and segmentation accuracy
  ▪ Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields
    ▪ GitHub Stars: 200 / License: follows the original 3DGS license / Highlight
    ▪ A distillation framework that connects 3DGS with various 2D foundation models such as SAM and CLIP-LSeg and lifts them to 3D
  ▪ 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
    ▪ GitHub Stars: 300 / License: MIT / Highlight
    ▪ Uses 3DGS to create an animatable 3D rendering model of a human from monocular images
  ▪ Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
    ▪ GitHub Stars: 400 / License: MIT / Highlight
    ▪ Real-time novel view synthesis of humans; 25 FPS at 2K resolution

Slide 49

Slide 49 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Human Motion
  ▪ WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion
    ▪ GitHub Stars: 600 / License: MIT / Poster
    ▪ 3D human motion estimation grounded in a global coordinate system

Slide 50

Slide 50 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Foundation Models
  ▪ General Object Foundation Model for Images and Videos at Scale
    ▪ GitHub Stars: 1000 / License: MIT / Highlight
    ▪ A foundation model that enables zero-shot inference on a variety of image and video tasks

Slide 51

Slide 51 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Foundation Models
  ▪ Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
    ▪ Hugging Face Stars: 900 / License: MIT / Oral
    ▪ A foundation model for a variety of image tasks that takes text prompts as input

Slide 52

Slide 52 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Segmentation
  ▪ RobustSAM: Segment Anything Robustly on Degraded Images
    ▪ GitHub Stars: 100 / License: MIT / Highlight
    ▪ A SAM (Segment Anything Model) that handles degraded images
▪ Video Object Segmentation
  ▪ Putting the Object Back into Video Object Segmentation
    ▪ GitHub Stars: 500 / License: MIT / Highlight
    ▪ Segments the object given in a query frame throughout a video, and is robust even in noisy conditions
▪ Multi-Object Tracking
  ▪ DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
    ▪ GitHub Stars: 300 / License: MIT / Poster
    ▪ A diffusion-based MOT model
  ▪ Matching Anything by Segmenting Anything
    ▪ GitHub Stars: 800 / License: Apache-2 / Highlight
    ▪ A MOT model built on SAM

Slide 53

Slide 53 text

Main Conference: Commercially Usable Licenses & Notable Papers
▪ Object Detection
  ▪ RT-DETR: DETRs Beat YOLOs on Real-time Object Detection
    ▪ GitHub Stars: 1900 / License: Apache-2 / Poster
    ▪ A real-time object detector that outperforms YOLO without NMS post-processing
▪ Feature Matching
  ▪ OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
    ▪ GitHub Stars: 400 / License: Apache-2 / Poster
    ▪ A feature matching method that uses foundation-model guidance and generalizes across various tasks
▪ Depth Estimation
  ▪ Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
    ▪ GitHub Stars: 6000 / License: Apache-2 / Poster
    ▪ A depth estimation model trained with large-scale unlabeled data; v2 was released in June with higher accuracy
  ▪ Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
    ▪ GitHub Stars: 1900 / License: Apache-2 / Oral
    ▪ A depth estimation model built on Stable Diffusion; achieves zero-shot transfer by fine-tuning on synthetic data

Slide 54

Slide 54 text

03-02 Main Conference Dive Deep: Image & Video Synthesis

Slide 55

Slide 55 text

Main Conference: Image & Video Synthesis
▪ Best Papers
  ▪ Rich Human Feedback for Text-to-Image Generation
  ▪ Generative Image Dynamics
▪ Orals
  ▪ FreeU: Free Lunch in Diffusion U-Net
  ▪ Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
  ▪ Style Aligned Image Generation via Shared Attention
  ▪ Instruct-Imagen: Image Generation with Multi-modal Instruction
  ▪ Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
  ▪ Attention Calibration for Disentangled Text-to-Image Personalization
  ▪ Alchemist: Parametric Control of Material Properties with Diffusion Models
  ▪ Analyzing and Improving the Training Dynamics of Diffusion Models
  ▪ MonoHair: High-Fidelity Hair Modeling from a Monocular Video

Slide 57

Slide 57 text

Rich Human Feedback for Text-to-Image Generation (Best Paper)
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katherine M. Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam
paper: https://openaccess.thecvf.com/content/CVPR2024/html/Liang_Rich_Human_Feedback_for_Text-to-Image_Generation_CVPR_2024_paper.html
datasets: https://github.com/google-research-datasets/richhf-18k
scripts: https://github.com/google-research/google-research/tree/master/richhf_18k
arxiv: https://arxiv.org/abs/2312.10240

Slide 58

Slide 58 text

Summary
▪ Creation of a rich human feedback dataset for text-to-image generation (RichHF-18K)
▪ Proposal of a model that predicts rich human feedback on images
▪ Proposal of methods for refining text-to-image generation using rich human feedback
Figure panels: Dataset: RichHF-18K; Rich Automatic Human Feedback (RAHF) Model; Refining Text-to-Image models (Model Finetuning; Universal guidance, w/o guidance vs w/ score guidance; Region inpainting)
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.

Slide 59

Slide 59 text

Data Collection: RichHF-18K
Creation of a rich human feedback dataset for text-to-image generation (RichHF-18K)
▪ Diverse image selection: from the Pick-a-Pic dataset (NeurIPS 2023; a dataset of 35,000 text prompts, 500,000 generated images, and pairwise preferences between two images), 18K diverse images were selected using PaLI (ICLR 2023) attributes, and annotations were added (16K training, 1K validation, 1K test samples)
  [Figure 10. Histograms of the PaLI attributes of the images in the training set.]
▪ Annotations [Figure 1. An illustration of our annotation UI]
  ▪ Mark points on the image:
    1. Artifact / implausible regions (red points)
    2. Regions misaligned with the text prompt (blue points)
  ▪ Mark words (underline & shading):
    3. Misaligned keywords
  ▪ Select scores (underline):
    4. Plausibility
    5. Text-image alignment
    6. Aesthetic quality
    7. Overall quality
▪ Score distribution balanced between positive and negative ratings
  [Figure 2. Histograms of the average scores of image-text pairs in the training set.]
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.

Slide 60

Slide 60 text

Method: RAHF (Rich Automatic Human Feedback) model
Proposal of a model that predicts rich human feedback on images
[Figure 3. Architecture of our rich feedback model.]
▪ Two-stream architecture: a vision stream and a text stream
▪ Self-attention over the vision tokens output by the ViT and the text tokens output by the text-embedding module fuses image and text information
▪ Vision tokens are reshaped into a feature map and mapped to heatmaps and scores
▪ Vision and text tokens are sent to a Transformer decoder, which generates the sequence of misaligned keywords
▪ Model variants
  ▪ Multi-head: multiple prediction heads, one per type of score and heatmap
  ▪ Augmented prompt: one head each for heatmap, score, and text; the prompt is augmented according to the output type
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.

Slide 61

Slide 61 text

Results: RAHF model evaluation
Heatmap Prediction: metrics are computed on the test set between model predictions and human annotations (GT) for heatmaps/points
▪ Implausibility heatmap [Figure 5. Examples of implausibility heatmaps]
  Prompt: photo of a slim asian little girl ballerina with long hair wearing white tights running on a beach from behind nikon D5
▪ Text-misalignment heatmap [Figure 6. Examples of misalignment heatmaps]
  Prompt: A snake on a mushroom.
▪ [Table 3. Text misalignment heatmap prediction results on the test set.]
  * GT = 0: empty heatmap (good images: 144/995 (14%) of the test set)
  * GT > 0: non-empty heatmap (images with issues: 851/995 (86%) of the test set)
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.

Slide 62

Slide 62 text

Results: RAHF model evaluation
Metrics are computed on the test set between model predictions and human ratings
▪ Score Prediction [Table 1. Score prediction results on the test set.]
  * PLCC: Pearson linear correlation coefficient
  * SRCC: Spearman rank correlation coefficient
▪ Misaligned text prediction [Table 4. Text misalignment prediction results on the test set.]
▪ Examples of ratings [Figure 7. Examples of ratings. "GT" is the ground-truth score (average score from three annotators).]
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.
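For readers less familiar with the two score-prediction metrics, here is a small self-contained sketch of PLCC and SRCC. The score lists are made-up toy values, not results from the paper, and the helper names are chosen for this example only.

```python
from math import sqrt

def pearson(x, y):
    """PLCC: linear correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: predictions are monotonic in the human scores but not linear.
human_scores = [1.0, 2.0, 3.0, 4.0]
predicted_scores = [1.0, 4.0, 9.0, 16.0]
plcc = pearson(human_scores, predicted_scores)
srcc = spearman(human_scores, predicted_scores)
print(f"PLCC={plcc:.4f}, SRCC={srcc:.4f}")
```

The toy data shows why both are reported: SRCC only rewards getting the ordering right, while PLCC also penalizes departures from a linear relationship.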

Slide 63

Slide 63 text

Learning from rich human feedback (LHF)
Fine-tuning Muse [1] with predicted scores (via sample selection [2]): the predicted scores are used to select images, and the model is fine-tuned on the selection
▪ Human Evaluation Results [Table 5. Human Evaluation Results: Finetuned Muse vs original Muse model preference]
  ▪ Subjective evaluation (percentages): whether the fine-tuned Muse is much better (≫), slightly better (>), about the same (≈), slightly worse (<), or much worse (≪) than the original Muse
  ▪ Images generated by the fine-tuned Muse are rated higher than those generated by the original Muse
  ▪ Since Muse [1] differs from LDM [3], this indicates that the method can generalize
▪ [Top: Figure 8. Examples illustrating the impact of RAHF on generative models.] Prompt: A cat sleeping on the ground using a shoe as a pillow (before finetuning / after finetuning)
▪ [Bottom: Figure 15. More examples illustrating the impact of RAHF on generative models.] Prompt: Three zebras are standing together in a line
[1] Huiwen Chang, et al. "Muse: Text-to-image generation via masked generative transformers". In ICML 2023.
[2] Jiao Sun, et al. "DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback". arXiv.
[3] Robin Rombach, et al. "High-resolution image synthesis with latent diffusion models". In CVPR 2022.
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.
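The sample-selection step can be pictured as a simple filter over generated images. This is only a schematic sketch: the image identifiers, the scores, and the fixed threshold are hypothetical, and the slide does not say which selection rule the authors used.

```python
def select_samples(candidates, predicted_scores, threshold):
    """Keep the candidates whose predicted score reaches the threshold."""
    return [c for c, s in zip(candidates, predicted_scores) if s >= threshold]

# Hypothetical generated images with scores from a feedback-prediction model.
generated = ["img_a", "img_b", "img_c", "img_d"]
scores = [0.82, 0.35, 0.67, 0.91]

# The surviving images would then form the fine-tuning set.
finetune_set = select_samples(generated, scores, threshold=0.7)
print(finetune_set)
```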

Slide 64

Slide 64 text

Learning from rich human feedback (LHF)
Predicted scores as universal guidance [4] (with Latent DM [3]): the predicted scores are used to steer generation as universal guidance
▪ [Figure 8. Examples illustrating the impact of RAHF on generative models.] Prompt: a macro lens closeup of a paperclip (w/o guidance vs w/ score guidance)
▪ [Figure 15. More examples illustrating the impact of RAHF on generative models.] Prompt: Kitten sushi stained glass window sunset fog. (w/o guidance vs w/ score guidance)
▪ Panel labels: Aesthetic score, Overall score, universal guidance
[3] Robin Rombach, et al. "High-resolution image synthesis with latent diffusion models". In CVPR 2022.
[4] Arpit Bansal, et al. "Universal guidance for diffusion models". arXiv.
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.

Slide 65

Slide 65 text

Learning from rich human feedback (LHF)
Region inpainting with predicted heatmaps and score (via Muse inpainting)
[Figure 9. Region inpainting with Muse [1] generative model.]
[1] Huiwen Chang, et al. "Muse: Text-to-image generation via masked generative transformers". In ICML 2023.
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.

Slide 66

Slide 66 text

Summary
▪ Creation of a rich human feedback dataset for text-to-image generation (RichHF-18K)
▪ Proposal of a model that predicts rich human feedback on images
▪ Proposal of methods for refining text-to-image generation using rich human feedback
Figure panels: Dataset: RichHF-18K; Rich Automatic Human Feedback (RAHF) Model; Refining Text-to-Image models (Model Finetuning; Universal guidance, w/o guidance vs w/ score guidance; Region inpainting)
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.

Slide 67

Slide 67 text

Generative Image Dynamics (Best Paper)
Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski
paper: https://openaccess.thecvf.com/content/CVPR2024/html/Li_Generative_Image_Dynamics_CVPR_2024_paper.html
project page: https://generative-dynamics.github.io/
arxiv: https://arxiv.org/abs/2309.07906
demo: https://generative-dynamics.github.io/#demo

Slide 68

Slide 68 text

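Summary
▪ Addresses the high computational cost and the lack of temporal consistency in predicting long video sequences.
▪ Models a generative image-space prior on scene motion.
▪ Motion representation: from a single image, generates a spectral volume, which models dense, long-term pixel trajectories in the Fourier domain.
▪ Applications: the generated spectral volume can be used in various applications (turning a still image into a seamlessly looping video, interactive manipulation of objects in a real image).
[Figure 1]
Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. "Generative Image Dynamics". In CVPR 2024. https://generative-dynamics.github.io/ demo: https://generative-dynamics.github.io/#demo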

Slide 69

Slide 69 text

Method
▪ Motion representation
  ▪ Motion field: a sequence of time-varying 2D displacement maps; the value at each time step represents dense optical flow
  ▪ Spectral volume: the temporal Fourier transform (frequency domain) of per-pixel motion trajectories extracted from video
▪ Motion prediction with a diffusion model [Figure 3. Motion prediction module.]
  ▪ A denoising model in the frequency domain predicts the spectral volume S
▪ Image-based rendering [Figure 4. Rendering module.]
  ▪ Feature extractor: extracts multi-scale features from the input image
  ▪ Softmax splatting: warps the features with the motion field from time 0 to t
  ▪ Synthesis network: fills in missing content and refines the input image
Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. "Generative Image Dynamics". In CVPR 2024. https://generative-dynamics.github.io/
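The idea of describing a pixel trajectory in the frequency domain can be illustrated on a single synthetic trajectory. The sketch below is not the paper's pipeline: the oscillating trajectory, the 16-frame length, and the choice of keeping four frequencies are assumptions for illustration, and a naive DFT is used to keep the example dependency-free.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def low_frequency_reconstruction(coeffs, keep):
    """Inverse DFT using only frequencies below `keep` (plus their mirrors)."""
    n = len(coeffs)
    kept = [k for k in range(n) if min(k, n - k) < keep]
    return [
        (sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in kept) / n).real
        for t in range(n)
    ]

# Hypothetical horizontal displacement of one pixel over 16 frames:
# a slow sway plus a smaller, faster wobble.
num_frames = 16
trajectory = [
    2.0 * math.sin(2 * math.pi * t / num_frames)
    + 0.5 * math.sin(2 * math.pi * 3 * t / num_frames)
    for t in range(num_frames)
]

spectrum = dft(trajectory)  # one pixel's slice of a "spectral volume"
approx = low_frequency_reconstruction(spectrum, keep=4)
max_error = max(abs(a - b) for a, b in zip(approx, trajectory))
print(f"max reconstruction error with 4 frequencies: {max_error:.2e}")
```

Oscillating motion of this kind is concentrated in a few low-frequency coefficients, which is why a compact frequency-domain representation can stand in for a long sequence of per-frame displacements.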

Slide 70

Slide 70 text

Application
▪ Seamless looping video
  ▪ Equation terms: noise, classifier-free guidance, motion self-guidance
  ▪ Energy function: makes the pixel positions and velocities at the start frame and the end frame as close as possible
▪ Interactive dynamics from a single image
  ▪ The spectral volume is used as a basis of vibration modes to describe the physical response
  ▪ Equation terms: motion displacement of pixel p, spectral volume at pixel p, modal coordinates at time t
Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. "Generative Image Dynamics". In CVPR 2024. https://generative-dynamics.github.io/
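The three terms listed for the interactive-dynamics part (motion displacement of pixel p, spectral volume at pixel p, modal coordinates at time t) appear to annotate an equation that did not survive text extraction. A plausible reconstruction, assuming the usual modal-superposition form and with all symbol names chosen here rather than taken from the slide, is:

```latex
% F_t(p):      motion displacement of pixel p at time t
% S_{f_j}(p):  spectral volume at pixel p for frequency f_j
% q_{f_j}(t):  modal coordinate of frequency f_j at time t
F_t(p) = \sum_{f_j} S_{f_j}(p)\, q_{f_j}(t)
```

Read this way, the spectral volume supplies a fixed spatial basis and only the modal coordinates change over time in response to a user's interaction.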

Slide 71

Slide 71 text

Results
Improved generation quality and consistency
[Figure 5. X-t slices of videos generated by different approaches.]
[Figure 6. Sliding window FID and DT-FVD.]
Window size: 30 for sliding window FID, 16 for sliding window DT-FVD
Generated videos / interactive demos: https://generative-dynamics.github.io/
Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. "Generative Image Dynamics". In CVPR 2024. https://generative-dynamics.github.io/

Slide 72

Slide 72 text

FreeU: Free Lunch in Diffusion U-Net
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu
Oral
paper: https://openaccess.thecvf.com/content/CVPR2024/html/Si_FreeU_Free_Lunch_in_Diffusion_U-Net_CVPR_2024_paper.html
project page: https://chenyangsi.top/FreeU/
arxiv: https://arxiv.org/abs/2309.11497
code: https://github.com/ChenyangSi/FreeU

Slide 73

Slide 73 text

Summary
▪ Re-evaluates the contributions of the skip connections and the backbone in the diffusion U-Net, and proposes FreeU, which improves generation quality.
▪ FreeU can be integrated into existing diffusion models easily, with no additional training (e.g., Stable Diffusion, ModelScope, Dreambooth, ReVersion, Rerender, ScaleCrafter, Animatediff, and ControlNet).
▪ It takes effect simply by adjusting two scaling factors at inference time.
[Figure 1. FreeU.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 74

Slide 74 text

Investigation
▪ During the denoising process, an image is generated from noise.
▪ Low-frequency components: change slowly, with a gradual rate of change.
▪ High-frequency components: change markedly.
→ The denoising process has to remove noise while preserving important fine details.
[Figure 3. Denoising process visualization.]
[Figure 4. Relative log amplitudes of Fourier for denoising process.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 75

Slide 75 text

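Investigation
▪ Denoising in the diffusion U-Net
▪ Backbone: mainly contributes to denoising.
▪ Skip: introduces high-frequency features into the decoder.
→ The skip connections propagate high-frequency information, making it easier for the U-Net to recover the input data during training.
→ Unintended consequence: they weaken the backbone's inherent denoising capability during inference.
[Figure 5. Effect of backbone and skip connection scaling factors (b and s)]
[Figure 6. Relative log amplitudes of Fourier with variations of the backbone scaling factor b]
[Figure 7. Fourier relative log amplitudes of backbone, skip, and their fused feature maps.]
Findings:
▪ Increasing the backbone scaling factor b substantially improves image quality.
▪ Increasing the skip scaling factor s has only a limited effect on image synthesis quality.
▪ Increasing the backbone ratio suppresses the high-frequency components of the images generated by the diffusion model.
▪ Skip features contain a large amount of high-frequency information.
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/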

Slide 76

Slide 76 text

Method
FreeU: Free Lunch in Diffusion U-Net
・Backbone factor: strengthens the backbone features.
・Content-aware backbone enhancement
・Channel-selective backbone enhancement
・Skip factor: mitigates the problem of over-smoothed textures.
・Suppresses the low-frequency components of the skip features.
[Figure 2. FreeU Framework. (a) U-Net Skip Features and Backbone Features.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/
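
The two operations named on this slide can be sketched on plain arrays. This is a minimal NumPy illustration, not the official FreeU code: the function names, the (B, C, H, W) layout, the per-sample min-max normalization of the channel mean, the choice of the first half of the channels, and the symmetric low-frequency window with its `radius` parameter are all assumptions. Only the factors b (backbone) and s (skip) come from the slides.

```python
import numpy as np

def scale_backbone(x, b):
    """Content-aware, channel-selective amplification of backbone features.

    x: float array (B, C, H, W). The channel-averaged map is normalized to
    [0, 1] per sample and used to scale half of the channels by up to b.
    """
    mean = x.mean(axis=1, keepdims=True)                # (B, 1, H, W)
    lo = mean.min(axis=(2, 3), keepdims=True)
    hi = mean.max(axis=(2, 3), keepdims=True)
    alpha = (mean - lo) / (hi - lo + 1e-8)              # values in [0, 1]
    out = x.copy()
    half = x.shape[1] // 2
    out[:, :half] = x[:, :half] * ((b - 1.0) * alpha + 1.0)
    return out

def filter_skip(h, s, radius=1):
    """Scale the low-frequency Fourier coefficients of skip features by s."""
    spec = np.fft.fftshift(np.fft.fft2(h, axes=(-2, -1)), axes=(-2, -1))
    height, width = h.shape[-2:]
    cy, cx = height // 2, width // 2                    # zero frequency after the shift
    mask = np.ones((height, width))
    mask[cy - radius: cy + radius + 1, cx - radius: cx + radius + 1] = s
    spec = spec * mask
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(-2, -1)), axes=(-2, -1))
    return out.real

rng = np.random.default_rng(0)
backbone = rng.normal(size=(2, 8, 16, 16))
skip = rng.normal(size=(2, 8, 16, 16))
boosted = scale_backbone(backbone, b=1.4)
damped = filter_skip(skip, s=0.9)
```

Setting b = 1 and s = 1 leaves both feature maps unchanged, which makes it easy to check that a model behaves identically with the modification switched off.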

Slide 77

Slide 77 text

Results: Text to Image
[Table 3. Quantitative evaluation of text-to-image generation]
[Table 4. Quantitative Results of FID and CLIP-score.]
[Figure 11. Text-to-image generation results of SD-XL with or without FreeU.]
Panel label: SDXL + FreeU
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 78

Slide 78 text

Results: Text to Image
[Figure 22. 4096 × 4096 SD-XL Images generated by ScaleCrafter with or without FreeU.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 79

Slide 79 text

Results: Text to Image
[Figure 22. 4096 × 4096 SD-XL Images generated by ScaleCrafter with or without FreeU.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 80

Slide 80 text

Results: Text to Image
Personalized Text-to-Image
Panel labels: Input Images / Dreambooth / Dreambooth + FreeU
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 81

Slide 81 text

Results: Text to Image
[Figure 24. Generated images from ControlNet with and without FreeU enhancement.]
Panel labels: ControlNet / ControlNet + FreeU
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 82

Slide 82 text

Results: Text to Image
Latent Consistency Model
[Figure 25. Generated images from LCM [36] with and without FreeU enhancement.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 83

Slide 83 text

Results: Text to Image
Video: FreeU: Free Lunch in Diffusion U-Net https://www.youtube.com/watch?v=-CZ5uWxvX30&t=86s
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 84

Slide 84 text

Results: Text to Video
[Table 5. Quantitative evaluation of text-to-video generation.]
[Figure 12. Text-to-video generation results of ModelScope [37] with or without FreeU.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 85

Slide 85 text

Results: Text to Video
Video: FreeU: Free Lunch in Diffusion U-Net https://www.youtube.com/watch?v=-CZ5uWxvX30&t=86s
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 86

Slide 86 text

Results: Text to Video
[Figure 23. Generated videos from Animatediff with and without FreeU enhancement.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 87

Slide 87 text

Results: Text to Video
[Figure 23. Generated videos from Animatediff with and without FreeU enhancement.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 88

Slide 88 text

Results: Video to Video
Video: FreeU: Free Lunch in Diffusion U-Net https://www.youtube.com/watch?v=-CZ5uWxvX30&t=86s
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 89

Slide 89 text

Summary
▪ Re-evaluates the contributions of the skip connections and the backbone in the diffusion U-Net, and proposes a method called FreeU that improves generation quality.
▪ FreeU can be integrated into existing diffusion models easily, with no additional training (e.g., Stable Diffusion, ModelScope, Dreambooth, ReVersion, Rerender, ScaleCrafter, Animatediff, and ControlNet).
▪ It takes effect simply by adjusting two scaling factors at inference time.
[Figure 1. FreeU.]
Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/

Slide 90

Slide 90 text

Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
Daniel Geng, Inbum Park, and Andrew Owens
Oral
paper: https://openaccess.thecvf.com/content/CVPR2024/html/Geng_Visual_Anagrams_Generating_Multi-View_Optical_Illusions_with_Diffusion_Models_CVPR_2024_paper.html
project page: https://dangeng.github.io/visual_anagrams/
arxiv: https://arxiv.org/abs/2311.17919
code: https://github.com/dangeng/visual_anagrams

Slide 91

Slide 91 text

Summary
▪ Generates multi-view optical illusions with a diffusion model.
▪ A simple method with fun applications.
[Figure 1. Generating Multi-View Illusions.] (generated results)
Daniel Geng, Inbum Park, and Andrew Owens. "Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models". In CVPR 2024.

Slide 92

Slide 92 text

Summary
▪ Generates multi-view optical illusions with a diffusion model.
▪ A simple method with fun applications.
Examples: Rotations / Inner Circle / Patch permutation / Jigsaw / 3 view / 4 view
Daniel Geng, Inbum Park, and Andrew Owens. "Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models". In CVPR 2024.

Slide 93

Slide 93 text

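Method
▪ Denoises the image from multiple views simultaneously.
▪ For the same noisy image, estimates the noise separately under each view and its prompt.
▪ Aligns the views and takes the average of the estimated noise.
▪ Continues denoising with the averaged noise estimate.
[Figure 2. Algorithm Overview]
Daniel Geng, Inbum Park, and Andrew Owens. "Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models". In CVPR 2024.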
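
The noise-averaging step described on this slide can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: `combined_noise_estimate` and `toy_predict_noise` are hypothetical names, the toy predictor merely stands in for a real diffusion model, and the two views (identity and a 180-degree rotation) are chosen only because their inverses are easy to write down.

```python
import numpy as np

def combined_noise_estimate(x_t, views, inverse_views, prompts, predict_noise):
    """Average per-view noise estimates after mapping them back to one frame.

    x_t:           the shared noisy image, array (C, H, W)
    views:         functions that transform the image into each view
    inverse_views: functions that undo the corresponding view
    prompts:       one prompt per view
    predict_noise: callable (image, prompt) -> noise estimate (stand-in model)
    """
    estimates = []
    for view, inverse, prompt in zip(views, inverse_views, prompts):
        eps = predict_noise(view(x_t), prompt)   # estimate in the transformed view
        estimates.append(inverse(eps))           # align back to the original view
    return np.mean(estimates, axis=0)

def identity(x):
    return x

def rotate_180(x):
    # A 180-degree rotation is its own inverse.
    return np.rot90(x, 2, axes=(-2, -1))

def toy_predict_noise(image, prompt):
    # Hypothetical stand-in for a diffusion model: scales the input per prompt.
    scale = {"a duck": 1.0, "a rabbit": 3.0}[prompt]
    return scale * image

x_t = np.random.default_rng(0).normal(size=(3, 8, 8))
eps_hat = combined_noise_estimate(
    x_t,
    views=[identity, rotate_180],
    inverse_views=[identity, rotate_180],
    prompts=["a duck", "a rabbit"],
    predict_noise=toy_predict_noise,
)
```

In a real sampler, `eps_hat` would replace the single-prompt noise estimate at every denoising step, so the same image is pushed toward each prompt in its own view.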

Slide 94

Slide 94 text

Results
Generated results: https://dangeng.github.io/visual_anagrams/
Daniel Geng, Inbum Park, and Andrew Owens. "Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models". In CVPR 2024.

Slide 95

Slide 95 text

Slide 96

Slide 96 text

Main Conference: Image & Video Synthesis
▪ Best Papers
  ▪ Rich Human Feedback for Text-to-Image Generation
  ▪ Generative Image Dynamics
▪ Orals
  ▪ FreeU: Free Lunch in Diffusion U-Net
  ▪ Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
  ▪ Style Aligned Image Generation via Shared Attention
  ▪ Instruct-Imagen: Image Generation with Multi-modal Instruction
  ▪ Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
  ▪ Attention Calibration for Disentangled Text-to-Image Personalization
  ▪ Alchemist: Parametric Control of Material Properties with Diffusion Models
  ▪ Analyzing and Improving the Training Dynamics of Diffusion Models
  ▪ MonoHair: High-Fidelity Hair Modeling from a Monocular Video

Slide 97

Slide 97 text

End