
Masahiro Oda
December 14, 2025

How Image Generation AI and Multimodal AI Work and The Applications in CAD

Presented in the Education Exhibit at RSNA 2024.
If you refer to the contents of these slides, please use the following citation:
Masahiro Oda, Hirotsugu Takabatake, Masaki Mori, Hiroshi Natori, Kensaku Mori, "How Image Generation AI and Multimodal AI Work and The Applications in CAD," RSNA2024, PHEE-4, McCormick Place, Chicago (conference: 2024/12/01-12/05)



Transcript

  1. How Image Generation AI and Multimodal AI Work and Their Applications in CAD Masahiro Oda1, Hirotsugu Takabatake2, Masaki Mori3, Hiroshi Natori4, Kensaku Mori1,5 1Nagoya University, Nagoya, Japan 2Sapporo-Minami-Sanjo Hospital, Sapporo, Japan 3Sapporo-Kosei General Hospital, Sapporo, Japan 4Keiwakai Nishioka Hospital, Sapporo, Japan 5National Institute of Informatics, Tokyo, Japan
  2. Teaching Points • The purposes of our exhibit are: 1. To understand the differences between previous and recent image generation AIs (VAE, GAN, Diffusion Model (DM)) 2. To learn applications of DM-based image generation AI in CAD 3. To learn what multimodal AI is 4. To learn applications of multimodal AI in CAD [Figures: generated images by DM; BiomedCLIP (multimodal AI); real and generated tumor images (Abe et al., 2023)]
  3. Image Generation AI • Generates images based on given conditions • The Diffusion Model contributes to improving the quality of generated images – Diffusion models generate higher-quality images than previous methods – Many image generation services have emerged • DALL·E, DALL·E 2, DALL·E 3 (OpenAI) • Midjourney (Midjourney) • Stable Diffusion (Stability AI) [Figure: generated by DALL·E 2, https://labs.openai.com/, prompt: "A Shiba Inu dog wearing a beret and black turtleneck"]
  4. Advantages and Drawbacks of Image Generation Models • VAE (Variational Auto-Encoder) – Training is stable and easy to monitor – Difficult to build a highly expressive model • GAN (Generative Adversarial Network) – Training is unstable and difficult to monitor – Highly expressive model • Diffusion Model – Training is stable – Highly expressive model [Figures: generated images by StyleGAN2[I1]; generated images by DALL·E 3 (https://openai.com/dall-e-3)] [I1] Karras T., et al., Analyzing and Improving the Image Quality of StyleGAN, IEEE/CVF CVPR, 2020
  5. Generated Images using GAN and Diffusion Model • Generated colonoscopic images – GAN – Diffusion Model [Figure: blood-vessel-like textures and light reflections are realistic]
  6. Diffusion Model • Diffusion – Concentrations of molecules, temperature, and energy become uniform over time in a system – Molecules are randomly mixed due to thermal motion • Diffusion Model (DM) – Inspired by and developed based on the diffusion process – Diffusion Probabilistic Model (DPM)[I2] • Considers a process (the diffusion process) in which noise is gradually added to data until it is transformed into complete noise; a model approximates the data distribution through the reverse of this process (the reverse diffusion process) – Denoising Diffusion Probabilistic Model (DDPM)[I3] • Applies DPM to image generation [I2] Sohl-Dickstein J., et al., Deep Unsupervised Learning using Nonequilibrium Thermodynamics, ICML, 2015 [I3] Ho J., et al., Denoising Diffusion Probabilistic Models, NeurIPS, 33, 2020
  7. Diffusion Model (DDPM) • Generates an image using the reverse diffusion (generation) process – Diffusion process: gradually add noise to the image until it is converted into complete noise – DDPM generates images using the reverse diffusion process: noise is gradually removed from an image that contains only noise, reconstructing the image until the final image is obtained. Starting from pure noise, $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$, each denoising step samples $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))$ [Figure: diffusion process $\mathbf{x}_0 \to \mathbf{x}_{t-1} \to \mathbf{x}_t \to \mathbf{x}_T$ and reverse diffusion/generation process]
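The two distributions above can be checked numerically. A minimal sketch, assuming the linear β noise schedule from the DDPM paper; all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise schedule beta_1..beta_T (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.ones((8, 8))                 # toy "image" of constant intensity
x_mid = q_sample(x0, 10, rng)        # early step: mostly signal
x_end = q_sample(x0, T - 1, rng)     # final step: almost pure noise

# At t = T the signal coefficient sqrt(abar_T) is close to 0, so x_T is
# approximately N(0, I), matching the p(x_T) that starts the reverse process.
print(float(np.sqrt(alpha_bars[-1])))
```

With this schedule the signal coefficient decays to nearly zero by step T, which is exactly why the generation process can start from pure Gaussian noise.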
  8. Implementation of Diffusion Model (DM) • The image generation process is implemented using deep learning models – The entire generation process commonly contains 1000 steps (T=1000) of noise reduction using deep learning models (U-Nets) – One step of image conversion (from t to t−1) is implemented using a deep learning model • The condition of the generation process is provided to the models – Conditions are specified by text or image through an encoder – The model generates an image based on the condition while reducing noise [Figure: diffusion process and reverse diffusion/generation process, $\mathbf{x}_T \to \mathbf{x}_t \to \mathbf{x}_{t-1} \to \mathbf{x}_0$]
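The T=1000-step generation loop described above can be sketched as follows. A stub stands in for the trained U-Net noise predictor, and the `cond` argument is a placeholder for the text or image condition embedding; the update rule is the standard DDPM sampling step, not code from this exhibit:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t, cond):
    # Stub for the deep model (U-Net). In practice `cond` carries the
    # encoded text/image condition that steers generation at every step.
    return np.zeros_like(x_t)

def sample(shape, cond, rng):
    x = rng.standard_normal(shape)            # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):            # T steps of noise reduction
        eps = eps_theta(x, t, cond)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise  # draw x_{t-1}
    return x

img = sample((8, 8), cond=None, rng=rng)
print(img.shape)
```

With a real trained denoiser the loop converges to a clean image; with the zero stub it merely demonstrates that the per-step update runs end to end.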
  9. Applications of DM: Image Generation (1/3) • 2D brain image generation[I4] – The DM-based method generated better results than a GAN-based method • Evaluation criteria: FID, SSIM – A large dataset (31,740 images) was used for training [I4] Pinaya W.H.L., et al., Brain Imaging Generation with Latent Diffusion Models, DGM4MICCAI, 13609, pp.117-126, 2022 [Figures: generated images by DM[I4]; real images[I4]; generated images by LSGAN[I4]]
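As an illustration of the SSIM criterion named above, its global (single-window) form can be computed directly from the definition. Real evaluations use a sliding-window SSIM, and FID additionally requires a trained feature network, so both are simplified away in this sketch:

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Global SSIM with the common constants C1=(0.01 L)^2, C2=(0.03 L)^2."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
a = rng.random((32, 32))
print(round(float(ssim_global(a, a)), 4))   # identical images -> 1.0
print(ssim_global(a, 1.0 - a) < 0.5)        # inverted image scores low
```

A value of 1.0 means structural identity; generated-vs-real comparisons report how close the score gets to that bound.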
  10. Applications of DM: Image Generation (2/3) • 3D brain image generation[I5] – The DM-based method generated better results than a GAN-based method • Evaluation criterion: SSIM • Cardiac 4D MR image generation[I6] [Figures: 2D slices of generated images by DM[I5]; model structure of 3D image generation[I5]] [I5] Dorjsembe Z., et al., Three-Dimensional Medical Image Synthesis with Denoising Diffusion Probabilistic Models, MIDL, 2022 [I6] Kim B., Ye J.C., Diffusion Deformable Model for 4D Temporal Medical Image Generation, MICCAI, 13431, pp.539-548, 2022
  11. Applications of DM: What is the Use of Generated Images? • Image generation by DM – Realistic images can be generated – Generated images cannot be used in diagnosis • Value of images generated by DM – Applicable for increasing the training data of AI CAD (data augmentation) – Using generated images in training is effective, since the size of medical image datasets available for training AI models is commonly limited
  12. Applications of DM: Image Generation (3/3) • Use of generated images in AI training[I7] – Generate mammogram images including tumors using Stable Diffusion • Add variations of position and size by text prompting – Detection accuracy of AI CAD • Trained using real images (300): 83.7% • Trained using real (300) + generated images (5,280): 89.2% • Images generated by DM with appropriate prompting contributed to improving AI CAD systems [I7] Kazuya Abe, et al., Artificial Case Image Generation of Breast Cancer Mass by Stable Diffusion and Its Application to Differentiation between Benign and Malignant, JAMIT, OP7-1, pp.142-148, 2023 [Figure: real and generated images[I7], benign and malignant]
  13. Applications of DM: Segmentation (1/3) • Basic use of DM in the segmentation process: MedSegDiff[I8] – Generates the segmentation result using a DM – The medical image is used as the condition of the DM's generation process • Higher segmentation accuracies were reported for DM-based methods than for previous CNN- or ViT-based methods [I8] Wu J., et al., MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model, MIDL, 2023 [Figures: segmentation process in MedSegDiff[I8] (input image, segmentation result); segmentation results[I8]]
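One common way to make the medical image condition every denoising step in a MedSegDiff-style pipeline is to stack the image with the noisy mask as the denoiser input. A minimal sketch of that wiring only; this is not the authors' code, and the names are illustrative:

```python
import numpy as np

def denoiser_input(noisy_mask, image):
    """Stack the noisy mask x_t with the conditioning image channel-wise,
    so the denoising network sees the image at every reverse step."""
    return np.stack([noisy_mask, image], axis=0)   # (2, H, W)

rng = np.random.default_rng(0)
image = rng.random((64, 64))                  # conditioning medical image
noisy_mask = rng.standard_normal((64, 64))    # x_t of the mask being generated
net_in = denoiser_input(noisy_mask, image)
print(net_in.shape)
```

The reverse diffusion loop itself is unchanged; only the network input gains the extra condition channel, which is what lets the DM produce a mask for that specific image.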
  14. Applications of DM: Image Translation • Image modality translation – From CT to MR images, and between MR images with different weighting conditions[I9,I10] • Value of image translation – Applicable for developing AI CAD for modalities where training images are difficult to collect [Figure: MR to CT image translation by SynDiff[I10]] [I9] Lyu Q., Wang G., Conversion Between CT and MRI Images Using Diffusion and Score-Matching Models, arXiv:2209.12104, 2022 [I10] Özbey M., et al., Adversarial Diffusion Models for Unsupervised Medical Image Synthesis, NeurIPS, 2022
  15. Summary of Image Generation AI • Advancement in diffusion models improved image processing performance • Image generation – Images generated by diffusion models with appropriate prompting contributed to improving AI CAD systems • Image segmentation – Diffusion-model-based segmentation methods achieved higher accuracies than CNN- or ViT-based methods • Image translation – Applicable for developing AI CAD for modalities where training images are difficult to collect
  16. Advancement of Multimodal AI • Multimodal AI has been improving its performance recently – Text to image: DALL·E, Midjourney, Stable Diffusion – Multimodal data: image, text, sound, sensing data... • Keys to the advancement of multimodal AI – Previously proposed multimodal AIs had strong limitations in their application targets – Performance of multimodal AIs improved with the use of new models (Transformer, DM) and large datasets [Figure: generation of finding texts from chest X-ray images[M1]] [M1] Najdenkoska I., et al., Variational Topic Inference for Chest X-Ray Report Generation, MICCAI, LNCS 12903, pp.625-635, 2021
  17. Multimodal AI: DALL·E 2[M2] • Text encoder – CLIP[M3] is used – Extracts embedding representations useful for establishing correspondences between text and image – Trained using 400M pairs of text and image • Image generation – Converts the text embedding to an image embedding using a prior – Two diffusion upsampler models generate high-resolution images from the image embedding • 64×64→256×256 pixels, 256×256→1024×1024 pixels [Figures: pre-training of CLIP[M3] (Transformer text encoder, CNN image encoder); overview of the image generation model[M2]] [M2] Ramesh A., et al., Hierarchical Text-Conditional Image Generation with CLIP Latents, arXiv:2204.06125, 2022 [M3] Radford A., et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv:2103.00020, 2021
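CLIP's pre-training pairs each image with its text through a symmetric contrastive loss over a batch. A sketch with random features standing in for the CNN and Transformer encoders; the temperature value is a common choice, not taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 16
img_emb = rng.standard_normal((N, D))   # stand-in for the CNN image encoder
txt_emb = rng.standard_normal((N, D))   # stand-in for the Transformer text encoder

def l2norm(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: pair i's image should match pair i's text."""
    logits = l2norm(img) @ l2norm(txt).T / temperature  # N x N similarities
    labels = np.arange(N)

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)         # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(N), labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))        # image->text + text->image

# Perfectly aligned embeddings score much lower loss than random pairings
print(clip_loss(img_emb, img_emb) < clip_loss(img_emb, txt_emb))
```

Minimizing this loss is what makes the embedding space useful for establishing text-image correspondences: matching pairs end up close, mismatched pairs far apart.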
  18. Multimodal AI: Stable Diffusion[M4] • Image generation – The DM is applied to data in an embedding (latent) space • The image is projected to the embedding space based on its content • Advantage in content-based image generation • Reduces computation cost by reducing the amount of data processed by the DM • Text encoder – A Transformer model is used – Trained using 400M pairs of text and image (LAION-400M) [Figure: overview of Stable Diffusion[M4]] [M4] Rombach R., et al., High-Resolution Image Synthesis with Latent Diffusion Models, CVPR, 2022
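The computation-cost argument can be made concrete with the commonly reported Stable Diffusion configuration; these sizes are an assumption for illustration, not stated on the slide:

```python
# A 512x512 RGB image versus the typical 64x64x4 latent it is encoded into.
pixel_elems = 512 * 512 * 3    # values per image in pixel space
latent_elems = 64 * 64 * 4     # values per image in latent space
print(pixel_elems // latent_elems)   # the DM denoises 48x fewer values
```

Every one of the ~1000 denoising steps operates on the smaller latent, so the saving applies to the whole generation loop, not just once.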
  19. Multimodal AI: Segment Anything Model (SAM)[M5] • Large model with generalized performance – Performs segmentation and object detection – Accepts texts as prompts – Uses a vision transformer model consisting of 600M parameters – Trained using 11M images with 1B region annotations • By training on large-scale datasets, SAM performs automatic recognition of various subjects (e.g., natural images, illustrations, medical images) [Images from https://segment-anything.com/] [M5] Kirillov A., et al., Segment Anything, arXiv:2304.02643, 2023
  20. Multimodal AI: Segment Anything Model 2 (SAM 2)[M6] • Model compatible with videos, with better performance than SAM – Improved accuracy and speed compared to SAM – Trained on approximately 51,000 videos – Incorporates temporal attention into the model [M6] Ravi N., et al., SAM 2: Segment Anything in Images and Videos, https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/, 2024 [Figures: segmented images and processing flow of SAM 2, from https://github.com/facebookresearch/sam2]
  21. Applications of SAM in Medical Image Processing • SAM has been applied to medical image segmentation – Tumor segmentation from pathological images[M7] – Heart segmentation from ultrasound images[M8] – Segmentation from various medical images[M9] • SAM segmented the target in all images, but accuracy was lower than task-specific models [M7] Deng R., et al., Segment Anything Model (SAM) for Digital Pathology: Assess Zero-shot Segmentation on Whole Slide Imaging, arXiv:2304.04155, 2023 [M8] Chen F., et al., The ability of Segmenting Anything Model (SAM) to segment ultrasound images, BioScience Trends, 17(3), pp.211-218, 2023 [M9] Ma J., et al., Segment Anything in Medical Images, arXiv:2304.12306, 2023 [Figures: pathological image[M7]; ultrasound image[M8]; various medical images[M9]]
  22. Multimodal AI: Medical Image Multimodal AI • MedSAM[M10] – Developed by fine-tuning SAM using medical images • Used 1.6M medical images – Performs multi-modal medical image processing • MedSAM2[M11] – Developed by fine-tuning SAM 2 using medical images – Applicable to 3D CT volumes and time-series image processing [M10] Jun Ma, et al., Segment anything in medical images, Nature Communications, 15, 654, 2024 [M11] Jiayuan Zhu, et al., Medical SAM 2: Segment medical images as video via Segment Anything Model 2, arXiv:2408.00874, 2024 [Figures: segmentation results by MedSAM[M10] and MedSAM2[M11]]
  23. Change of AI Development in the Age of Large Models • Development of AI in the conventional style ✓ Build a dataset and develop an AI for each task (task-specific dataset → small model → task-specific AI) • Development of AI using large datasets (large cross-task dataset → large model or foundation model → AI having generalized performance) ✓ The AI model can be applied to multiple tasks ✓ The model has generalized performance ➢ GPT-3.5 and SAM are examples of foundation models
  24. AI Development using a Foundation Model • Task-specific AI can be quickly developed using a foundation model – AIs can be developed using a small dataset or no dataset – Fine-tuning: adapt the foundation model with task-specific data – Few-shot learning: use a small number of task-specific examples – Zero-shot learning: change the state of the model by prompts – Special fine-tuning: reinforcement learning / parameter-efficient learning /… [Figure: a foundation model adapted into task-specific models]
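One of the cheap adaptation routes listed above, fitting only a small task-specific head on a frozen foundation model (a linear probe, one form of parameter-efficient learning), can be sketched with a fixed random projection standing in for the frozen model; every name and number here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

W_frozen = rng.standard_normal((32, 8))   # stand-in for the frozen foundation model

def features(x):
    """Frozen forward pass: W_frozen is never updated during adaptation."""
    return np.tanh(x @ W_frozen)

# Few-shot task data: 20 labeled samples from a simple synthetic rule
X = rng.standard_normal((20, 32))
y = (X[:, 0] > 0).astype(float)

F = np.c_[features(X), np.ones(len(X))]   # features plus a bias column
# Fit ONLY the small head (9 parameters) by least squares
head, *_ = np.linalg.lstsq(F, y, rcond=None)
pred = (F @ head > 0.5).astype(float)
print((pred == y).mean())                 # training accuracy of the probe
```

Only the 9-parameter head is trained, which is why a handful of labeled examples can suffice; full fine-tuning would instead update the frozen weights as well.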
  25. Medical AI Development using a Foundation Model • Use a foundation model to develop numerous AIs in downstream tasks[M12] [Figure, from general to task-specific: Medical Foundation Model → organ-level FMs (Lung, Liver, Brain, Intestine) → disease-level FMs (COVID-19, COPD, Tumor, liver cancer, aneurysm, Alzheimer's disease)] [M12] Shuo Li, Integrating vision and language: Revolutionary foundations in medical imaging AI, Keynote Speech at SPIE Medical Imaging, 2024
  26. Foundation Models for Medical Images (1/2) • Prov-GigaPath[M13] – Foundation model for pathological images – Developed on over 170K images • Applicable task: image classification • PathAsst[M14] – Multimodal foundation model for pathological images – Developed on over 200K image-text pairs • Applicable tasks: image classification, text generation [M13] GigaPath: Whole-Slide Foundation Model for Digital Pathology, https://www.microsoft.com/en-us/research/blog/gigapath-whole-slide-foundation-model-for-digital-pathology/ [M14] Yuxuan Sun, et al., PathAsst: A Generative Foundation AI Assistant Towards Artificial General Intelligence of Pathology, arXiv:2305.15072, 2024
  27. Foundation Models for Medical Images (2/2) • BiomedCLIP[M15] – Multimodal foundation model (including CT) – Developed on over 15M image-text pairs • Applicable tasks: image classification, VQA (Visual Question Answering) • BioViL-T[M16] – Multimodal foundation model for X-ray images – Developed on over 170K image-text pairs • Applicable tasks: image classification, text generation [M15] Sheng Zhang, et al., BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs, arXiv:2303.00915, 2024 [M16] Shruthi Bannur, et al., Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing, arXiv:2301.04558, 2023
  28. Summary of Multimodal AI • AI models with generalized performance across tasks are being developed • Changes in AI development – Large datasets, high-performance computers, and large models are essential for developing high-performance AI – AIs with generalized performance are called foundation models • Foundation models – Many foundation models have been developed in the medical image processing field – Foundation models will be an essential component of medical AI development