What Is Vision Foundation Model and How It Used to Develop CAD

Presented in the Education Exhibit at RSNA 2025.
If you refer to the contents of this slide deck, please use the following citation:
Masahiro Oda, Hirotsugu Takabatake, Masaki Mori, Hiroshi Natori, Kensaku Mori, "What Is Vision Foundation Model and How It Used to Develop CAD," RSNA 2025, PHEE-17, McCormick Place, Chicago (conference: 2025/11/30-12/04)

Masahiro Oda
December 17, 2025

Transcript

  1. What Is Vision Foundation Model and How It Used to Develop CAD
     Masahiro Oda 1, Hirotsugu Takabatake 2, Masaki Mori 3, Hiroshi Natori 4, Kensaku Mori 1,5
     1 Nagoya University, Nagoya, Japan; 2 Sapporo-Minami-Sanjo Hospital, Sapporo, Japan; 3 Sapporo-Kosei General Hospital, Sapporo, Japan; 4 Keiwakai Nishioka Hospital, Sapporo, Japan; 5 National Institute of Informatics, Tokyo, Japan
  2. Teaching Points
     The purposes of our exhibit are:
     1. To learn what a Vision Foundation Model is and how it differs from conventional AI models
     2. To learn how Vision Foundation Models change AI development
     3. To learn how to develop Vision Foundation Models
     4. To learn applications of Vision Foundation Models
  3. What is a Vision Foundation Model (VFM)?
     • Traditional AI models
       – Task-specific, with limited generalizability to other tasks.
     • Vision Foundation Model (VFM)
       – Deep learning models with generalized image-processing capability.
       – Applicable to many tasks.
     (Figures: task-specific AI model for tumor detection; MedSAM, a foundation model for medical images [Ma24]; foundation model concept [Bommasani22])
     [Bommasani22] Rishi Bommasani, et al., On the Opportunities and Risks of Foundation Models, arXiv:2108.07258, 2022
     [Ma24] Jun Ma, et al., Segment anything in medical images, Nature Communications, 15, 654, 2024
  4. How Foundation Models Change AI Development
     • Traditional style of AI development: annotated dataset → supervised learning → task-specific model
       ⚫ Data collection, annotation, and training are required for each task.
       ⚫ The burden of annotation is heavy.
     • AI development using a VFM: dataset (annotation is not necessary) → self-supervised learning → VFM → fine-tuning with small annotated datasets → AI models for lung CAD, colon CAD, …
       ⚫ AI models for many downstream tasks can be built from one VFM.
       ⚫ Datasets without annotation can be used for pre-training.
  5. Advantages of AI Development using VFM
     • AI models for various tasks can be created from a VFM by fine-tuning.
     • Fine-tuning can be performed even with small datasets (fine-tuning, few-shot learning, zero-shot learning).
     (Figure: VFM plus small task-specific datasets → task-specific AI models)
  6. (Ideal) AI Development using a Medical FM
     • Develop various task-specific AI models from a general Medical FM [Li24]
     (Figure, from general to task-specific: Medical FM → Lung FM, Eye FM, Brain FM, Colon FM → AI for COVID-19 CAD, COPD CAD, corneal disease CAD, retinal disease CAD, aneurysm CAD, Alzheimer's disease CAD)
     [Li24] Shuo Li, Integrating vision and language: Revolutionary foundations in medical imaging AI, Keynote Speech at SPIE Medical Imaging, 2024
  7. VFM for General Images: Segment Anything Model (SAM) [SAM]
     • Large model with generalized performance.
       – Performs segmentation and object detection.
       – Accepts text as a prompt.
       – Uses a vision transformer model consisting of about 600M parameters.
       – Trained on 11M images with 1B region annotations.
     • By training on large-scale datasets, SAM performs automatic recognition of various subjects (e.g., natural images, illustrations, medical images). A minimal usage sketch follows this slide.
     [SAM] Kirillov A., et al., Segment Anything, arXiv:2304.02643, 2023
     (Images from https://segment-anything.com/)
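
For readers who want to try SAM, here is a minimal, hedged sketch using the official segment-anything package; the checkpoint filename, image path, and prompt coordinates are placeholders, not part of this exhibit.

```python
# Minimal sketch: point-prompted segmentation with SAM (facebookresearch/segment-anything).
# Assumes the package is installed and a ViT-H checkpoint has been downloaded separately (placeholder path below).
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder checkpoint file
predictor = SamPredictor(sam)

image = np.array(Image.open("example.jpg").convert("RGB"))  # HxWx3 uint8 RGB, placeholder image
predictor.set_image(image)

# One foreground point prompt at (x, y); label 1 = foreground, 0 = background.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks with confidence scores
)
best_mask = masks[scores.argmax()]  # boolean HxW mask for the highest-scoring candidate
```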
  8. VFM for General Images: Segment Anything Model 2 (SAM 2) [SAM2]
     • Model compatible with videos, with better performance than SAM
       – Improved accuracy and speed compared to SAM.
       – Trained on approximately 51,000 videos.
       – Incorporates temporal attention into the model.
     [SAM2] Ravi N., et al., SAM 2: Segment Anything in Images and Videos, https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/, 2024
     (Segmented images and processing flow of SAM 2 from https://github.com/facebookresearch/sam2?tab=readme-ov-file)
  9. Medical VFMs for Multi-modal Images
     • MedSAM [Ma24]
       – For 2D medical image segmentation.
       – Built from approximately 1.57 million images of various medical imaging modalities.
         • CT, MR, X-ray, US, mammography, OCT, endoscopy, dermatology, fundus, pathology
       – Constructed from SAM, with only box prompts available.
     • MedSAM-2 [Zhu24]
       – Developed for segmentation of 3D medical images (CT) and videos.
       – Constructed from SAM 2.
     [Ma24] Jun Ma, et al., Segment anything in medical images, Nature Communications, 15, 654, 2024
     [Zhu24] Jiayuan Zhu, et al., Medical SAM 2: Segment medical images as video via Segment Anything Model 2, arXiv:2408.00874, 2024
     (Figures: segmentation results of MedSAM [Ma24] and MedSAM-2 [Zhu24] on multi-modal images)
  10. Medical VFMs for Multi-modal Images
     • BiomedCLIP [Zhang24]
       – Applicable to image classification tasks, VQA tasks, etc.
       – Built with multimodal data of images and text (a CLIP-style training loss is sketched after this slide).
         • Constructed from PMC-15M, consisting of about 15 million figure-caption pairs.
     [Zhang24] Sheng Zhang, et al., BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs, arXiv:2303.00915, 2024
     (Figure: BiomedCLIP [Zhang24], multi-modal image–text)
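
Since BiomedCLIP-style models are pre-trained on paired images and captions, the following is a minimal sketch of a CLIP-style symmetric image-text contrastive loss. The encoders are omitted and the embeddings below are random placeholders; this is not BiomedCLIP's actual implementation.

```python
# Minimal sketch of a CLIP-style symmetric image-text contrastive loss (illustrative, not BiomedCLIP's code).
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """image_emb, text_emb: (N, D) embeddings of N paired images and captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)   # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                    # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)                # caption -> matching image
    return (loss_i2t + loss_t2i) / 2

# Random embeddings standing in for encoder outputs of 8 image-caption pairs.
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```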
  11. Medical VFMs for Single-modal Images
     • BioViL-T [Bannur23]
       – Applicable to image classification tasks, text generation tasks, etc.
       – Built with X-ray images and radiology reports.
         • Built from MIMIC-CXR v2, consisting of more than 170,000 X-ray image–text pairs.
     • RETFound [Zhou23]
       – Developed for fundus image and OCT image processing.
       – Built with approximately 1.6 million images.
     [Bannur23] Shruthi Bannur, et al., Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing, arXiv:2301.04558, 2023
     [Zhou23] Yukun Zhou, et al., A foundation model for generalizable disease detection from retinal images, Nature, 622, 156-163, 2023
     (Figures: BioViL-T [Bannur23], single-modal image–text; RETFound [Zhou23], single-modal images)
  12. Medical VFMs for Single-modal Images
     • Prov-GigaPath [GigaPath]
       – Developed for pathological image processing.
       – Built with 1.3 billion patch images obtained from more than 170,000 WSIs (whole-slide images).
     • PathAsst [Sun24]
       – Applicable to pathology image classification, text generation, etc.
       – Built with multimodal data of pathological images and text.
         • Developed from over 200,000 figure-caption pairs in PubMed.
     [GigaPath] GigaPath: Whole-Slide Foundation Model for Digital Pathology, https://www.microsoft.com/en-us/research/blog/gigapath-whole-slide-foundation-model-for-digital-pathology/
     [Sun24] Yuxuan Sun, et al., PathAsst: A Generative Foundation AI Assistant Towards Artificial General Intelligence of Pathology, arXiv:2305.15072, 2024
     (Figures: Prov-GigaPath [GigaPath], single-modal images; PathAsst [Sun24], single-modal image–text)
  13. AI Development using VFM
     (Figure: dataset w/o annotation → self-supervised learning (pre-training) → VFM → fine-tuning with a dataset w/ annotation → AI models for lung CAD, colon CAD, …)
     Pre-training
     • Self-supervised learning is conducted using a dataset without annotation.
     • A large amount of data is utilized.
     • Substantial computational resources are required.
     Fine-tuning
     • An AI model is created for each downstream task.
     • Transfer learning is performed on the foundation model using annotated data.
     • AI models can be built even from small datasets.
  14. Self-supervised Learning Methods for VFM
     • Contrastive Learning (CL)
       – Examples: SimCLR, DINO
       – Multimodal version: CLIP
     • Masked Image Modeling (MIM)
       – Based on BERT's self-supervised learning through masked language modeling.
       – Example: Masked Autoencoder
     • CL + MIM
       – Example: Contrastive Masked Autoencoder
  15. Self-supervised Learning Methods for VFM
     • Contrastive Learning (CL)
       – Examples: SimCLR, DINO
       – Multimodal version: CLIP
     • Masked Image Modeling (MIM)
       – Based on BERT's self-supervised learning through masked language modeling.
       – Example: Masked Autoencoder
     • CL + MIM
       – Example: Contrastive Masked Autoencoder
  16. Contrastive Learning (CL)
     • Learns image features that represent similarities and differences between images modified through data augmentation.
     (Figure: data augmentation; positive pair: increase similarity of features; negative pair: decrease similarity of features)
  17. Learning from Positive Pairs in CL
     • Learns image features of objects (an example augmentation pipeline is sketched after this slide)
       – Associates features between local and global parts of an object (crop, scale).
       – Associates features among the components of an object (crop, scale).
       – Acquires features that are robust to color variations (color jittering).
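
The crop/scale and color-jittering operations above can be written with standard torchvision transforms; the sizes and jitter strengths below are illustrative assumptions, not settings reported in this exhibit.

```python
# Sketch of a SimCLR-style augmentation pipeline producing two views (a positive pair) of one image.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),                  # crop + scale
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # color jittering
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

def two_views(pil_image):
    """Return a positive pair: two independently augmented views of the same image."""
    return augment(pil_image), augment(pil_image)
```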
  18. Example of CL: SimCLR [Chen20]
     • Images augmented through data augmentation are input into an encoder to compute image features.
     • The encoder is trained so that image features originating from the same source image are close, while those from different images are distant (a simplified loss is sketched after this slide).
     (Figure: data augmentation → encoder → image features; positive pairs are pulled closer and negative pairs pushed farther among the images within a minibatch, via the loss function)
     [Chen20] Ting Chen, et al., "A simple framework for contrastive learning of visual representations," ICML 2020, pp.1597-1607, 2020
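
Below is a compact, simplified sketch of the NT-Xent loss used in SimCLR-style training (not the authors' implementation): the two views of each image are treated as a positive pair and all other images in the minibatch act as negatives.

```python
# Simplified NT-Xent (normalized temperature-scaled cross-entropy) loss for SimCLR-style training.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5):
    """z1, z2: (N, D) projected features of two augmented views of the same N images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)            # (2N, D)
    sim = z @ z.t() / temperature                                   # (2N, 2N) cosine similarities
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))  # drop self-pairs
    # The positive for sample i is its other view: i+N for the first half, i-N for the second half.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Example with random projections standing in for encoder outputs of a minibatch of 16 images.
print(nt_xent_loss(torch.randn(16, 128), torch.randn(16, 128)).item())
```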
  19. CL: Relationship Between SimCLR Pre-training Settings and Performance [Chen20]
     • Minibatch size and number of epochs
       – Larger batch sizes lead to higher accuracy.
       – Increasing the number of epochs also improves accuracy.
     (Figure: minibatch size, number of epochs, and performance [Chen20])
     [Chen20] Ting Chen, et al., "A simple framework for contrastive learning of visual representations," ICML 2020, pp.1597-1607, 2020
  20. Self-supervised Learning Methods for VFM
     • Contrastive Learning (CL)
       – Examples: SimCLR, DINO
       – Multimodal version: CLIP
     • Masked Image Modeling (MIM)
       – Based on BERT's self-supervised learning through masked language modeling.
       – Example: Masked Autoencoder
     • CL + MIM
       – Example: Contrastive Masked Autoencoder
  21. Masked Image Modeling (MIM)
     • Parts of the image are masked, and the model predicts the masked regions from the remaining visible parts.
     • Through this process, the model learns the shapes of objects and their spatial relationships within the image.
       – MIM works well with Vision Transformers (ViT), which divide images into patches.
     (Figure: input image → partially masked image → MIM → reconstruction result of the masked regions)
  22. Patch Division in Vision Transformer (ViT) [Dosovitskiy21]
     • The image is divided into 16×16-pixel patches for processing (a minimal patch-embedding sketch follows this slide).
     • The model evaluates correlations between patches and assigns higher weights to those most useful for decision-making (Self-Attention).
     (Figure: input image → patch division → position embedding → feature extraction by weighting based on correlations (Multi-Head Self-Attention) and an MLP, repeated multiple times → output)
     [Dosovitskiy21] Alexey Dosovitskiy, et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR, 2021
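
The patch-division step can be implemented as a strided convolution that maps each 16×16 patch to a token embedding. This is a minimal sketch of that step only (illustrative dimensions), not a full ViT.

```python
# Sketch of ViT-style patch embedding: split a 224x224 image into 16x16 patches and embed each as a token.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)             # one RGB image (placeholder)
tokens = patch_embed(x)                      # (1, 768, 14, 14): one embedding per 16x16 patch
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): sequence of 196 patch tokens
pos_embed = nn.Parameter(torch.zeros(1, tokens.size(1), embed_dim))
tokens = tokens + pos_embed                  # add position embeddings before the Transformer blocks
```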
  23. MIM: Masked Autoencoder (MAE) [He22]
     • Learning with an encoder-decoder architecture (a schematic sketch follows this slide)
       – The encoder is a ViT and processes only the unmasked patches, improving computational efficiency.
       – The decoder is a small ViT that reconstructs the image from the tokens.
     (Figure: Masked Autoencoder [He22]; only unmasked patches are fed into the encoder (ViT); the decoder (small ViT) reconstructs the image from both patch tokens and mask tokens; the loss is computed between the reconstructed image and the target image)
     [He22] Kaiming He, et al., "Masked autoencoders are scalable vision learners," CVPR, 2022
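
The following is a schematic MAE-style forward pass, heavily simplified (tiny Transformer stacks, random 75% masking, placeholder patch embeddings); it only illustrates where the encoder sees visible patches and where mask tokens enter the decoder, and is not the official MAE code.

```python
# Schematic MAE-style masking: encode only visible patches, reconstruct all patches with a small decoder.
import torch
import torch.nn as nn

num_patches, dim, mask_ratio = 196, 256, 0.75
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))
to_pixels = nn.Linear(dim, 16 * 16 * 3)       # predict the pixel values of each patch

patches = torch.randn(1, num_patches, dim)    # patch embeddings of one image (placeholder)
n_keep = int(num_patches * (1 - mask_ratio))
perm = torch.randperm(num_patches)
keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]

visible = patches[:, keep_idx]                # the encoder sees only ~25% of the patches
encoded = encoder(visible)

full = torch.zeros(1, num_patches, dim)
full[:, keep_idx] = encoded                   # put encoded visible tokens back in place
full[:, mask_idx] = mask_token                # mask tokens fill the removed positions
reconstruction = to_pixels(decoder(full))     # loss is typically MSE on the masked patches only
```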
  24. MIM: Relationship Between MAE Pre-training Settings and Performance [He22]
     • Patch masking ratio
       – Highest performance achieved with a 75% masking ratio.
     • Data augmentation during pre-training
       – Using only random cropping and scaling yields the best performance.
       – Adding color jittering reduces performance.
     • Number of decoder blocks
       – High performance is maintained even with a small number of blocks (e.g., 2, 4, or 8).
     (Figures: patch masking ratio and performance; use of data augmentation and performance; number of decoder blocks and performance [He22])
     [He22] Kaiming He, et al., "Masked autoencoders are scalable vision learners," CVPR, 2022
  25. Self-supervised Learning Methods for VFM
     • Contrastive Learning (CL)
       – Examples: SimCLR, DINO
       – Multimodal version: CLIP
     • Masked Image Modeling (MIM)
       – Based on BERT's self-supervised learning through masked language modeling.
       – Example: Masked Autoencoder
     • CL + MIM
       – Example: Contrastive Masked Autoencoder
  26. CL + MIM
     • Combining the two pre-training methods
       – Contrastive learning primarily captures object-level image features and, when using ViT, focuses on intra-patch features.
       – MIM, on the other hand, mainly learns relationships between patches in ViT.
     • Method: Contrastive Masked Autoencoder (CMAE) [Huang24]
       – Performs contrastive learning while masking parts of the image.
       – Achieves higher performance than conventional contrastive learning and MIM methods.
     [Huang24] Zhicheng Huang, et al., "Contrastive Masked Autoencoders are Stronger Vision Learners," IEEE TPAMI, 46(4), pp.2506-2517, 2024
     (Figure: Contrastive Masked Autoencoder [Huang24])
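
As a rough illustration of the CL + MIM idea only (not CMAE's actual formulation), the two objectives can be combined as a weighted sum during pre-training:

```python
# Illustrative CL + MIM training objective: reconstruction (MIM) loss plus a weighted contrastive (CL) loss.
import torch

def combined_pretraining_loss(reconstruction_loss: torch.Tensor,
                              contrastive_loss: torch.Tensor,
                              contrastive_weight: float = 1.0) -> torch.Tensor:
    """The weight balancing the two terms is a hyperparameter (the value here is a placeholder)."""
    return reconstruction_loss + contrastive_weight * contrastive_loss
```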
  27. Application of VFM
     • A pre-trained VFM is applicable to developing various AI models for downstream tasks (CAD systems).
     • We evaluated how a VFM contributes to improving the performance of AI models on downstream tasks.
     (Figure: dataset w/o annotation → self-supervised learning → VFM → fine-tuning with a dataset w/ annotation → AI models for lung CAD, colon CAD, … (downstream tasks))
  28. Development of VFMs for Evaluation
     • Developed VFMs for fundus image processing using the JOIR (Japan Ocular Imaging Registry) dataset (a minimal fine-tuning sketch follows this slide).
     (Figure: JOIR fundus image dataset → pre-training using contrastive learning (SimCLR) → pre-trained EfficientNet-B1 → fine-tuning → gender classification model; JOIR fundus image dataset → pre-training using masked image modeling (MAE) → pre-trained ViT-Large → fine-tuning → gender classification model)
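
In the spirit of this setup, the sketch below loads a hypothetically SimCLR-pre-trained EfficientNet-B1 backbone and replaces its head for 2-class gender classification. The checkpoint path, hyperparameters, and data handling are placeholders, not the exhibit's actual configuration.

```python
# Sketch: fine-tune an EfficientNet-B1 backbone (assumed pre-trained with SimCLR) for 2-class classification.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b1

model = efficientnet_b1(weights=None)
state = torch.load("simclr_pretrained_efficientnet_b1.pth", map_location="cpu")  # placeholder checkpoint
model.load_state_dict(state, strict=False)                                       # load backbone weights only

# Replace the classification head; EfficientNet's classifier is (Dropout, Linear).
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # illustrative hyperparameters
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """images: (B, 3, H, W) fundus images; labels: (B,) 0/1 gender labels."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```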
  29. About the Japan Ocular Imaging Registry (JOIR)
     • Database constructed by the Japanese Ophthalmological Society.
     • Multimodal ophthalmology dataset collected from multiple medical institutions in Japan.
       – Includes fundus images, OCT images, anterior segment images, etc.
       – Contains 3,558,696 data entries.
     • A very useful large-scale dataset for AI development.
     [JOIR website] http://www.joir.jp/index.html
  30. Evaluation of VFMs on a Downstream Task
     • Downstream task
       – Gender classification from fundus images.
     • EfficientNet-B1 model pre-trained with SimCLR
       – Classification accuracy was 92.3%.
     • ViT-Large model pre-trained with MAE
       – Classification accuracy was 87.5%.
     (Figure: fundus image → classification model → male or female)
  31. Performance Changes Depending on Amount of Training Data for the Downstream Task
     • With / without using the SimCLR pre-trained model.
     (Figure: classification accuracy vs. number of training images for the downstream task, from 130 images (0.1%) to 131,022 images (100%), with pre-training (VFM) and without pre-training. Accuracy is higher when more data is available, and higher when the VFM is used.)
  32. Performance Changes Depending on Amount of Training Data for the Downstream Task
     • With / without using the MAE pre-trained model.
     (Figure: classification accuracy vs. number of training images for the downstream task, from 130 images (0.1%) to 131,022 images (100%), with pre-training (VFM) and without pre-training. Accuracy is maintained even when data is limited, and is higher when the VFM is used.)
  33. Applications of VFM for Fundus Images
     • Diagnosis assistance
       – Diabetic retinopathy, age-related maculopathy, retinal detachment, …
     • Patient-related parameter estimation
       – Age, gender, …
     • Metabolic syndrome-related parameter estimation
       – Abdominal circumference, systolic/diastolic blood pressure, blood glucose, BMI, …
     (Figure: VFM for fundus images → CAD for diabetic retinopathy, CAD for maculopathy, age, gender, abdominal circumference, systolic/diastolic blood pressure, blood glucose, BMI)
  34. Applications of VFM (More General)
     • AI development for many applications
       – Reduces the burden of data collection and annotation for each task.
     • Vision-language model (VLM) development
       – Automated disease assessment
       – Improved explainability of AI
     (Figure: VFM → AI for lung CAD, AI for mammo CAD, AI for …; VFM → vision-language model for retinal images [Holland24])
     [Holland24] Robbie Holland, et al., Specialist vision-language models for clinical ophthalmology, arXiv:2407.08410, 2024
  35. To Develop More Generalized VFMs
     • Large-scale datasets are required to build VFMs
       – Due to privacy issues, cross-institutional data collection is difficult.
     • Use of Federated Learning (FL)
       – Enables FM development without moving data outside medical institutions, by exchanging model weights instead (a FedAvg-style sketch follows this slide).
     • FL-based development of FMs
       – Federated EndoViT: FM for surgical scene recognition [Kirchner25]
       – FEDKIM: an approach for injecting knowledge from clients into an FM [Wang25]
     [Kirchner25] Max Kirchner, et al., "Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections," IEEE TMI, 2025 (in print)
     [Wang25] Xiaochen Wang, et al., "FEDKIM: Adaptive Federated Knowledge Injection into Medical Foundation Models," https://arxiv.org/abs/2408.10276, 2025
     (Figures: Federated EndoViT [Kirchner25]; FEDKIM [Wang25])
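
The sketch below illustrates the weight-exchange idea with a generic FedAvg-style aggregation: each institution trains locally and only model weights are sent to the server for averaging. It is an illustration of the general principle, not the Federated EndoViT or FEDKIM implementation.

```python
# Generic FedAvg-style aggregation: institutions train locally; only model weights are averaged centrally.
from copy import deepcopy
import torch

def local_update(global_model, local_dataloader, epochs: int = 1):
    """Each institution fine-tunes a copy of the global model on its own private data (training loop omitted)."""
    model = deepcopy(global_model)
    # ... run (self-supervised) training on local_dataloader for `epochs` epochs ...
    return model.state_dict()

def federated_average(client_states):
    """Average client weights parameter by parameter; raw images never leave the institutions."""
    avg = deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([state[key].float() for state in client_states]).mean(dim=0)
    return avg

# One communication round (sketch): clients compute local updates, the server averages them.
# client_states = [local_update(global_model, dl) for dl in client_dataloaders]
# global_model.load_state_dict(federated_average(client_states))
```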
  36. Summary
     • What is a Vision Foundation Model (VFM)
       – A VFM is a deep learning model with generalized image-processing capability.
       – AI models for various tasks can be created from a VFM by fine-tuning it.
       – Many medical VFMs have been proposed.
     • How to develop a VFM
       – Self-supervised learning on a large dataset is performed for development.
     • Applications of VFM
       – Image-based CAD systems.
       – Multimodal (vision-language) CAD systems.