
Masahiro Oda
December 14, 2025

How Image Generation AI and Multimodal AI Work and The Applications in CAD

Presented in the Education Exhibit at RSNA 2024.
If you refer to the contents of these slides, please use the following citation:
Masahiro Oda, Hirotsugu Takabatake, Masaki Mori, Hiroshi Natori, Kensaku Mori, "How Image Generation AI and Multimodal AI Work and The Applications in CAD," RSNA2024, PHEE-4, McCormick Place, Chicago (conference: 2024/12/01-12/05)



Transcript

  1. How Image Generation AI and Multimodal AI Work and Their Applications in CAD Masahiro Oda1, Hirotsugu Takabatake2, Masaki Mori3, Hiroshi Natori4, Kensaku Mori1,5 1Nagoya University, Nagoya, Japan 2Sapporo-Minami-Sanjo Hospital, Sapporo, Japan 3Sapporo-Kosei General Hospital, Sapporo, Japan 4Keiwakai Nishioka Hospital, Sapporo, Japan 5National Institute of Informatics, Tokyo, Japan
  2. Teaching Points • The purposes of our exhibit are: 1. To understand the differences between previous and recent image generation AIs (VAE, GAN, Diffusion Model (DM)) 2. To learn applications of DM-based image generation AI in CAD 3. To learn what multimodal AI is 4. To learn applications of multimodal AI in CAD [Figures: generated images by DM; BiomedCLIP (multimodal AI); real and generated tumor images (Abe et al., 2023)]
  3. Image Generation AI • Generates images based on given conditions • The Diffusion Model contributes to improving the quality of generated images – Diffusion models generate higher-quality images than previous methods – Many image generation services have emerged • DALL·E, DALL·E 2, DALL·E 3 (OpenAI) • Midjourney (Midjourney) • Stable Diffusion (Stability AI) [Figure: generated by DALL·E 2, https://labs.openai.com/, prompt: "A Shiba Inu dog wearing a beret and black turtleneck"]
  4. Advantages and Drawbacks of Image Generation Models • VAE (Variational Auto-Encoder) – Training is stable and easy to monitor – Difficult to build a highly expressive model • GAN (Generative Adversarial Network) – Training is unstable and difficult to monitor – Highly expressive model • Diffusion Model – Training is stable – Highly expressive model [Figures: generated images by StyleGAN2[I1]; generated images by DALL·E 3 (https://openai.com/dall-e-3)] [I1] Karras T., et al., Analyzing and Improving the Image Quality of StyleGAN, IEEE/CVF CVPR, 2020
  5. Generated Images using GAN and Diffusion Model • Generated colonoscopic images – GAN – Diffusion Model [Figure: blood-vessel-like textures and light reflections are realistic]
  6. Diffusion Model • Diffusion – Concentrations of molecules, temperature, and energy become uniform over time in a system – Molecules are randomly mixed due to thermal motion • Diffusion Model (DM) – Inspired by and developed based on the diffusion process – Diffusion Probabilistic Model (DPM)[I2] • Considers a process (the diffusion process) in which noise is gradually added to data until it is transformed into complete noise; a model approximates the data distribution through the reverse of this process (the reverse diffusion process) – Denoising Diffusion Probabilistic Model (DDPM)[I3] • Applies DPM to image generation [I2] Sohl-Dickstein J., et al., Deep Unsupervised Learning using Nonequilibrium Thermodynamics, ICML, 2015 [I3] Ho J., et al., Denoising Diffusion Probabilistic Models, NeurIPS, 33, 2020
  7. Diffusion Model (DDPM) • Generates an image using the reverse diffusion (generation) process – Diffusion process: gradually add noise to the image until it is converted into complete noise – DDPM generates images using the reverse diffusion process: noise is gradually removed from an image that contains only noise, reconstructing the image until the final image is obtained. Starting from pure noise, $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$, each denoising step samples $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))$ [Figure: diffusion process $\mathbf{x}_0 \to \mathbf{x}_{t-1} \to \mathbf{x}_t \to \mathbf{x}_T$ and reverse diffusion/generation process]
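The two distributions above can be checked numerically. A minimal sketch, assuming the linear β noise schedule from the DDPM paper; all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise schedule beta_1..beta_T (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.ones((8, 8))                 # toy "image" of constant intensity
x_mid = q_sample(x0, 10, rng)        # early step: mostly signal
x_end = q_sample(x0, T - 1, rng)     # final step: almost pure noise

# At t = T the signal coefficient sqrt(abar_T) is close to 0, so x_T is
# approximately N(0, I), matching the p(x_T) that starts the reverse process.
print(float(np.sqrt(alpha_bars[-1])))
```

With this schedule the signal coefficient decays to nearly zero by step T, which is exactly why the generation process can start from pure Gaussian noise.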
  8. Implementation of Diffusion Model (DM) • The image generation process is implemented using deep learning models – The entire generation process commonly contains 1000 steps (T=1000) of noise reduction using deep learning models (U-Nets) – One step of image conversion (from t to t−1) is implemented using a deep learning model • The condition of the generation process is provided to the models – Conditions are specified by text or image through an encoder – The model generates an image based on the condition while reducing noise [Figure: diffusion process and reverse diffusion/generation process, $\mathbf{x}_T \to \mathbf{x}_t \to \mathbf{x}_{t-1} \to \mathbf{x}_0$]
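The T=1000-step generation loop described above can be sketched as follows. A stub stands in for the trained U-Net noise predictor, and the `cond` argument is a placeholder for the text or image condition embedding; the update rule is the standard DDPM sampling step, not code from this exhibit:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t, cond):
    # Stub for the deep model (U-Net). In practice `cond` carries the
    # encoded text/image condition that steers generation at every step.
    return np.zeros_like(x_t)

def sample(shape, cond, rng):
    x = rng.standard_normal(shape)            # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):            # T steps of noise reduction
        eps = eps_theta(x, t, cond)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise  # draw x_{t-1}
    return x

img = sample((8, 8), cond=None, rng=rng)
print(img.shape)
```

With a real trained denoiser the loop converges to a clean image; with the zero stub it merely demonstrates that the per-step update runs end to end.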
  9. Applications of DM: Image Generation (1/3) • 2D brain image generation[I4] – The DM-based method generated better results than a GAN-based method • Evaluation criteria: FID, SSIM – A large dataset (31,740 images) was used for training [I4] Pinaya W.H.L., et al., Brain Imaging Generation with Latent Diffusion Models, DGM4MICCAI, 13609, pp.117-126, 2022 [Figures: generated images by DM[I4]; real images[I4]; generated images by LSGAN[I4]]
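As an illustration of the SSIM criterion named above, its global (single-window) form can be computed directly from the definition. Real evaluations use a sliding-window SSIM, and FID additionally requires a trained feature network, so both are simplified away in this sketch:

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Global SSIM with the common constants C1=(0.01 L)^2, C2=(0.03 L)^2."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
a = rng.random((32, 32))
print(round(float(ssim_global(a, a)), 4))   # identical images -> 1.0
print(ssim_global(a, 1.0 - a) < 0.5)        # inverted image scores low
```

A value of 1.0 means structural identity; generated-vs-real comparisons report how close the score gets to that bound.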
  10. Applications of DM: Image Generation (2/3) • 3D brain image generation[I5] – The DM-based method generated better results than a GAN-based method • Evaluation criterion: SSIM • Cardiac 4D MR image generation[I6] [Figures: 2D slices of generated images by DM[I5]; model structure of 3D image generation[I5]] [I5] Dorjsembe Z., et al., Three-Dimensional Medical Image Synthesis with Denoising Diffusion Probabilistic Models, MIDL, 2022 [I6] Kim B., Ye J.C., Diffusion Deformable Model for 4D Temporal Medical Image Generation, MICCAI, 13431, pp.539-548, 2022
  11. Applications of DM: What is the Use of Generated Images? • Image generation by DM – Realistic images can be generated – Generated images cannot be used in diagnosis • Value of images generated by DM – Applicable for increasing the training data of AI CAD (data augmentation) – Using generated images in training is effective, since the size of medical image datasets available for training AI models is commonly limited
  12. Applications of DM: Image Generation (3/3) • Use of generated images in AI training[I7] – Generate mammogram images including tumors using Stable Diffusion • Add variations of position and size by text prompting – Detection accuracy of AI CAD • Trained using real images (300): 83.7% • Trained using real (300) + generated images (5,280): 89.2% • Images generated by DM with appropriate prompting contributed to improving AI CAD systems [I7] Kazuya Abe, et al., Artificial Case Image Generation of Breast Cancer Mass by Stable Diffusion and Its Application to Differentiation between Benign and Malignant, JAMIT, OP7-1, pp.142-148, 2023 [Figure: real and generated images[I7], benign and malignant]
  13. Applications of DM: Segmentation (1/3) • Basic use of DM in the segmentation process: MedSegDiff[I8] – Generates the segmentation result using a DM – The medical image is used as the condition of the DM's generation process • Higher segmentation accuracies were reported for DM-based methods than for previous CNN- or ViT-based methods [I8] Wu J., et al., MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model, MIDL, 2023 [Figures: segmentation process in MedSegDiff[I8] (input image, segmentation result); segmentation results[I8]]
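One common way to make the medical image condition every denoising step in a MedSegDiff-style pipeline is to stack the image with the noisy mask as the denoiser input. A minimal sketch of that wiring only; this is not the authors' code, and the names are illustrative:

```python
import numpy as np

def denoiser_input(noisy_mask, image):
    """Stack the noisy mask x_t with the conditioning image channel-wise,
    so the denoising network sees the image at every reverse step."""
    return np.stack([noisy_mask, image], axis=0)   # (2, H, W)

rng = np.random.default_rng(0)
image = rng.random((64, 64))                  # conditioning medical image
noisy_mask = rng.standard_normal((64, 64))    # x_t of the mask being generated
net_in = denoiser_input(noisy_mask, image)
print(net_in.shape)
```

The reverse diffusion loop itself is unchanged; only the network input gains the extra condition channel, which is what lets the DM produce a mask for that specific image.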
  14. Applications of DM: Image Translation • Image modality translation – From CT to MR images, and between MR images with different weighting conditions[I9,I10] • Value of image translation – Applicable for developing AI CAD for modalities where training images are difficult to collect [Figure: MR to CT image translation by SynDiff[I10]] [I9] Lyu Q., Wang G., Conversion Between CT and MRI Images Using Diffusion and Score-Matching Models, arXiv:2209.12104, 2022 [I10] Özbey M., et al., Adversarial Diffusion Models for Unsupervised Medical Image Synthesis, NeurIPS, 2022
  15. Summary of Image Generation AI • Advancement in diffusion models improved image processing performance • Image generation – Images generated by diffusion models with appropriate prompting contributed to improving AI CAD systems • Image segmentation – Diffusion-model-based segmentation methods achieved higher accuracies than CNN- or ViT-based methods • Image translation – Applicable for developing AI CAD for modalities where training images are difficult to collect
  16. Advancement of Multimodal AI • Multimodal AI has been improving its performance recently – Text to image: DALL·E, Midjourney, Stable Diffusion – Multimodal data: image, text, sound, sensing data... • Keys to the advancement of multimodal AI – Previously proposed multimodal AIs had strong limitations in their application targets – Performance of multimodal AIs improved with the use of new models (Transformer, DM) and large datasets [Figure: generation of finding texts from chest X-ray images[M1]] [M1] Najdenkoska I., et al., Variational Topic Inference for Chest X-Ray Report Generation, MICCAI, LNCS 12903, pp.625-635, 2021
  17. Multimodal AI: DALL·E 2[M2] • Text encoder – CLIP[M3] is used – Extracts embedding representations useful for establishing correspondences between text and image – Trained using 400M pairs of text and image • Image generation – Converts the text embedding to an image embedding using a prior – Two diffusion upsampler models generate high-resolution images from the image embedding • 64×64→256×256 pixels, 256×256→1024×1024 pixels [Figures: pre-training of CLIP[M3] (Transformer text encoder, CNN image encoder); overview of the image generation model[M2]] [M2] Ramesh A., et al., Hierarchical Text-Conditional Image Generation with CLIP Latents, arXiv:2204.06125, 2022 [M3] Radford A., et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv:2103.00020, 2021
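CLIP's pre-training pairs each image with its text through a symmetric contrastive loss over a batch. A sketch with random features standing in for the CNN and Transformer encoders; the temperature value is a common choice, not taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 16
img_emb = rng.standard_normal((N, D))   # stand-in for the CNN image encoder
txt_emb = rng.standard_normal((N, D))   # stand-in for the Transformer text encoder

def l2norm(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: pair i's image should match pair i's text."""
    logits = l2norm(img) @ l2norm(txt).T / temperature  # N x N similarities
    labels = np.arange(N)

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)         # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(N), labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))        # image->text + text->image

# Perfectly aligned embeddings score much lower loss than random pairings
print(clip_loss(img_emb, img_emb) < clip_loss(img_emb, txt_emb))
```

Minimizing this loss is what makes the embedding space useful for establishing text-image correspondences: matching pairs end up close, mismatched pairs far apart.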
  18. Multimodal AI: Stable Diffusion[M4] • Image generation – The DM is applied to data in an embedding (latent) space • The image is projected to the embedding space based on its content • Advantage in content-based image generation • Reduces computation cost by reducing the amount of data processed by the DM • Text encoder – A Transformer model is used – Trained using 400M pairs of text and image (LAION-400M) [Figure: overview of Stable Diffusion[M4]] [M4] Rombach R., et al., High-Resolution Image Synthesis with Latent Diffusion Models, CVPR, 2022
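The computation-cost argument can be made concrete with the commonly reported Stable Diffusion configuration; these sizes are an assumption for illustration, not stated on the slide:

```python
# A 512x512 RGB image versus the typical 64x64x4 latent it is encoded into.
pixel_elems = 512 * 512 * 3    # values per image in pixel space
latent_elems = 64 * 64 * 4     # values per image in latent space
print(pixel_elems // latent_elems)   # the DM denoises 48x fewer values
```

Every one of the ~1000 denoising steps operates on the smaller latent, so the saving applies to the whole generation loop, not just once.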
  19. Multimodal AI: Segment Anything Model (SAM)[M5] • Large model with generalized performance – Performs segmentation and object detection – Accepts texts as prompts – Uses a vision transformer model consisting of 600M parameters – Trained using 11M images with 1B region annotations • By training on large-scale datasets, SAM performs automatic recognition of various subjects (e.g., natural images, illustrations, medical images) [Images from https://segment-anything.com/] [M5] Kirillov A., et al., Segment Anything, arXiv:2304.02643, 2023
  20. Multimodal AI: Segment Anything Model 2 (SAM 2)[M6] • Model compatible with videos, with better performance than SAM – Improved accuracy and speed compared to SAM – Trained on approximately 51,000 videos – Incorporates temporal attention into the model [M6] Ravi N., et al., SAM 2: Segment Anything in Images and Videos, https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/, 2024 [Figures: segmented images and processing flow of SAM 2, from https://github.com/facebookresearch/sam2]
  21. Applications of SAM in Medical Image Processing • SAM has been applied to medical image segmentation – Tumor segmentation from pathological images[M7] – Heart segmentation from ultrasound images[M8] – Segmentation from various medical images[M9] • SAM segmented the target in all images, but accuracy was lower than task-specific models [M7] Deng R., et al., Segment Anything Model (SAM) for Digital Pathology: Assess Zero-shot Segmentation on Whole Slide Imaging, arXiv:2304.04155, 2023 [M8] Chen F., et al., The ability of Segmenting Anything Model (SAM) to segment ultrasound images, BioScience Trends, 17(3), pp.211-218, 2023 [M9] Ma J., et al., Segment Anything in Medical Images, arXiv:2304.12306, 2023 [Figures: pathological image[M7]; ultrasound image[M8]; various medical images[M9]]
  22. Multimodal AI: Medical Image Multimodal AI • MedSAM[M10] – Developed by fine-tuning SAM using medical images • Used 1.6M medical images – Performs multi-modal medical image processing • MedSAM2[M11] – Developed by fine-tuning SAM 2 using medical images – Applicable to 3D CT volumes and time-series image processing [M10] Jun Ma, et al., Segment anything in medical images, Nature Communications, 15, 654, 2024 [M11] Jiayuan Zhu, et al., Medical SAM 2: Segment medical images as video via Segment Anything Model 2, arXiv:2408.00874, 2024 [Figures: segmentation results by MedSAM[M10] and MedSAM2[M11]]
  23. Change of AI Development in the Age of Large Models • Development of AI in the conventional style ✓ Build a dataset and develop an AI for each task (task-specific dataset → small model → task-specific AI) • Development of AI using large datasets (large cross-task dataset → large model or foundation model → AI having generalized performance) ✓ The AI model can be applied to multiple tasks ✓ The model has generalized performance ➢ GPT-3.5 and SAM are examples of foundation models
  24. AI Development using a Foundation Model • Task-specific AI can be quickly developed using a foundation model – AIs can be developed using a small dataset or no dataset – Fine-tuning: adapt the foundation model with task-specific data – Few-shot learning: use a small number of task-specific examples – Zero-shot learning: change the state of the model by prompts – Special fine-tuning: reinforcement learning / parameter-efficient learning /… [Figure: a foundation model adapted into task-specific models]
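One of the cheap adaptation routes listed above, fitting only a small task-specific head on a frozen foundation model (a linear probe, one form of parameter-efficient learning), can be sketched with a fixed random projection standing in for the frozen model; every name and number here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

W_frozen = rng.standard_normal((32, 8))   # stand-in for the frozen foundation model

def features(x):
    """Frozen forward pass: W_frozen is never updated during adaptation."""
    return np.tanh(x @ W_frozen)

# Few-shot task data: 20 labeled samples from a simple synthetic rule
X = rng.standard_normal((20, 32))
y = (X[:, 0] > 0).astype(float)

F = np.c_[features(X), np.ones(len(X))]   # features plus a bias column
# Fit ONLY the small head (9 parameters) by least squares
head, *_ = np.linalg.lstsq(F, y, rcond=None)
pred = (F @ head > 0.5).astype(float)
print((pred == y).mean())                 # training accuracy of the probe
```

Only the 9-parameter head is trained, which is why a handful of labeled examples can suffice; full fine-tuning would instead update the frozen weights as well.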
  25. Medical AI Development using a Foundation Model • Use a foundation model to develop numerous AIs in downstream tasks[M12] [Figure, from general to task-specific: Medical Foundation Model → organ-level FMs (Lung, Liver, Brain, Intestine) → disease-level FMs (COVID-19, COPD, Tumor, liver cancer, aneurysm, Alzheimer's disease)] [M12] Shuo Li, Integrating vision and language: Revolutionary foundations in medical imaging AI, Keynote Speech at SPIE Medical Imaging, 2024
  26. Foundation Models for Medical Images (1/2) • Prov-GigaPath[M13] – Foundation model for pathological images – Developed on over 170K images • Applicable task: image classification • PathAsst[M14] – Multimodal foundation model for pathological images – Developed on over 200K image-text pairs • Applicable tasks: image classification, text generation [M13] GigaPath: Whole-Slide Foundation Model for Digital Pathology, https://www.microsoft.com/en-us/research/blog/gigapath-whole-slide-foundation-model-for-digital-pathology/ [M14] Yuxuan Sun, et al., PathAsst: A Generative Foundation AI Assistant Towards Artificial General Intelligence of Pathology, arXiv:2305.15072, 2024
  27. Foundation Models for Medical Images (2/2) • BiomedCLIP[M15] – Multimodal foundation model (including CT) – Developed on over 15M image-text pairs • Applicable tasks: image classification, VQA (Visual Question Answering) • BioViL-T[M16] – Multimodal foundation model for X-ray images – Developed on over 170K image-text pairs • Applicable tasks: image classification, text generation [M15] Sheng Zhang, et al., BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs, arXiv:2303.00915, 2024 [M16] Shruthi Bannur, et al., Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing, arXiv:2301.04558, 2023
  28. Summary of Multimodal AI • AI models with generalized performance across tasks are being developed • Changes in AI development – Large datasets, high-performance computers, and large models are essential for developing high-performance AI – AIs with generalized performance are called foundation models • Foundation models – Many foundation models have been developed in the medical image processing field – Foundation models will be an essential component of medical AI development