
The thin line between reconstruction, classification, and hallucination in brain decoding

CRC Workshop at Rauischholzhausen Castle (Germany)
2024.7.2

Preprint:
https://arxiv.org/abs/2405.10078

Abstract:
Visual image reconstruction aims to recover arbitrary stimulus/perceived images from brain activity. To achieve this, especially with limited training data, it is crucial that the model leverages a compositional representation that spans the image space, with each feature effectively mapped from brain activity. In light of these considerations, we critically assessed recent “photorealistic” reconstructions based on generative AIs applied to a large-scale fMRI/stimulus dataset (Natural Scene Dataset, NSD). We found a notable decrease in reconstruction performance on a different dataset specifically designed to prevent train–test overlaps (Deeprecon). The target features of the NSD images revealed strikingly limited diversity, with a small number of semantic clusters shared between the training and test sets. Simulations also showed a lack of generalizability with a small number of clusters. This can be explained by “rank-deficient prediction,” where any input is mapped into the subspace spanned by the training features. By diversifying the training set so that the number of clusters scales linearly with the feature dimension, the decoders exhibited improved generalizability beyond the trained clusters, achieving compositional prediction. It is also important to note that text/semantic features alone are insufficient for a complete mapping to the visual space, even if they are perfectly predicted from brain activity. Building on these observations, we argue that recent “photorealistic” reconstructions may predominantly be a blend of classification into trained categories and the generation of convincing yet inauthentic images (hallucinations) through text-to-image diffusion. To avoid such spurious reconstructions, we offer guidelines for developing generalizable methods and conducting reliable evaluations.

Yuki Kamitani

July 02, 2024


Transcript

  1. The thin line between reconstruction, classification and hallucination in brain decoding. Yuki Kamitani, Kyoto University & ATR. http://kamitani-lab.ist.i.kyoto-u.ac.jp @ykamit. Pierre Huyghe, ‘Uumwelt’ (2018).
  2. Acknowledgements. Kyoto University and ATR: Ken Shirakawa, Yoshihiro Nagano, Shuntaro Aoki, Misato Tanaka, Yusuke Muraki, Tomoyasu Horikawa (NTT), Guohua Shen (UEC), Kei Majima (NIRS). Grants: JSPS KAKENHI, JST CREST, NEDO.
  3. Reconstruction examples. [Figure: test images and reconstructions from Takagi and Nishimoto (2023) and Ozcelik and VanRullen (2023)] • Reconstruction from visual features + text feature-guided diffusion • Natural Scene Dataset (NSD; Allen et al., 2022).
  4. World, Brain, and Mind. [Figures: Treatise on Man (Descartes, 1677); Ernst Mach’s drawing of his own visual scene (Mach, 1900)]
  5. Fechner’s inner and outer psychophysics. Richer contents revealed? Brain decoding as psychological measurement. Not a parlor trick!
  6. Brain decoding: let the machine recognize brain activity patterns that humans cannot recognize (Kamitani & Tong, Nature Neuroscience 2005). Machine learning prediction. Neural mind-reading via shared representation.
  7. Visual image reconstruction (Miyawaki, Uchida, Yamashita, Sato, Morito, Tanabe, Sadato, Kamitani, Neuron 2008). [Figure: presented vs. reconstructed images]
  8. 10 × 10 binary “pixels”: 2^100 ≈ 10^30 possible images (a 1 followed by about 30 zeros). Brain data can be measured for only a tiny subset of these images.
  9. Modular (compositional) decoding with local contrast features. [Figure: fMRI signals are decoded into multi-scale image bases, whose weighted sum reconstructs the presented image (contrast)] Training: ~400 random images. Test: images not used in training (arbitrary images from the 2^100 space).
  10. Classification vs. reconstruction. Classification: • Classes are predefined and shared between train and test. Reconstruction: • Ability to predict arbitrary instances in the space of interest • Zero-shot prediction: beyond training outputs (in contrast to “double dipping”). How to build a reconstruction model with limited data? • Compositional representation: instances are represented by a combination of elemental features (e.g., pixels, wavelets) • Effective mapping from brain activity to each elemental feature.
  11. Visual image reconstruction by decoding local contrasts (Neuron, 2008). Decoding dream contents in semantic categories (Science, 2013). [Figure: representations from low to high levels, with the intermediate levels in question; DNN]
  12. Nameless, faceless features of DNN (Krizhevsky et al., 2012). [Figure: the AlexNet architecture, layers DNN1–DNN8: convolutional layers followed by fully-connected layers] • Won the object recognition challenge in 2012 • 60 million parameters and 650,000 neurons (units) • Trained with 1.2 million annotated images to classify 1,000 object categories.
  13. Brain-to-DNN decoding (translation) and hierarchical correspondence (Horikawa and Kamitani, 2015; Nature Communications 2017; Nonaka et al., 2019). [Figure: true vs. predicted feature values for unit #562 of CNN8, predicted from VC]
  14. [Figure: true images and reconstructions decoded from VC under different training-set conditions (GODtrain, FMD, Scene, and their combinations)]
  15. (Chen, Horikawa, Majima, Aoki, Abdelhack, Tanaka, Kamitani, Science Advances 2023). [Figure: training with natural images: a decoder maps fMRI activity to stimulus DNN features; test with illusory images: decoded DNN features are fed to a generator for reconstruction]
  16. Translator–generator pipeline (Shirakawa, Nagano, Tanaka, Aoki, Majima, Muraki, Kamitani, arXiv 2024) • Translator: align brain activity to a machine’s latent features (neural-to-machine latent translation) • Generator: recover an image from the latent features. Latent features: DNN features of an image, a compositional representation that spans the image space of interest. (A minimal code sketch of this pipeline follows below.)
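As a rough illustration of the two stages, here is a minimal sketch assuming ridge regression for the translator, a common choice for this kind of neural-to-latent mapping; the data shapes, the regularization value, and the generator placeholder are all hypothetical and not taken from the paper.

import numpy as np
from sklearn.linear_model import Ridge

# Illustrative shapes: n training samples, d voxels, k latent (DNN feature) dimensions
n, d, k = 200, 1000, 512
rng = np.random.default_rng(0)
X_train = rng.standard_normal((n, d))   # fMRI activity (stand-in data)
Y_train = rng.standard_normal((n, k))   # stimulus DNN features (stand-in data)

# Translator: neural-to-machine latent translation
translator = Ridge(alpha=100.0)         # hypothetical regularization strength
translator.fit(X_train, Y_train)

def generator(latent_features):
    # Placeholder for an image generator that inverts DNN features back to pixels;
    # in an actual pipeline this would be a trained generator network.
    raise NotImplementedError

# Test time: decode latent features from new brain activity, then generate an image
X_test = rng.standard_normal((1, d))
decoded_latents = translator.predict(X_test)
# image = generator(decoded_latents)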
  17. Realistic output using text-guided methods • Reconstruction from visual features + text feature (CLIP)-guided diffusion • Natural Scene Dataset (NSD; Allen et al., 2022). [Figure: test images and reconstructions by Takagi & Nishimoto (2023), Ozcelik & VanRullen (2023), and Shen et al. (2019)]
  18. Cherry-picking from multiple generations • Takagi & Nishimoto (2023) generated multiple images and selected the best one, a highly questionable procedure • Plausible reconstructions even from random brain data • Something more than cherry-picking? [Figure: Takagi & Nishimoto (2023)]
  19. Failure of replication with a different dataset • The text-guided methods fail to generalize • The Shen et al. (2019) method shows consistent performance across datasets. [Figure: test images and reconstructions on the replication dataset]
  20. Issues with NSD • Only ~40 semantic clusters, described by single words • Significant overlap between training and test sets • For each test image, visually similar images are found in the training set.
  21. Train/test splits in our previous studies. Miyawaki et al. (2008): train = 440 random binary images; test = simple shapes (+ an independent set of random images). Shen et al. (2019): train = 1,200 natural images; test = natural images from different categories + simple shapes. • Designed to avoid visual and semantic overlaps, testing out-of-distribution/domain generalization • Why artificial images in the test set? Pitfalls of naturalistic approaches • The brain has evolved with “natural” images, but we can perceive artificial images too; models should account for this • Risk of unintended shortcuts with the increasing scale/complexity of data analysis; use artificial images as a control.
  22. Failure of zero-shot prediction • CLIP predictions almost always fail to identify the true sample against training samples • The predictions for a cluster excluded from training fall on the other clusters • Severely limited prediction outside the training set ≒ classification.
  23. Failed recovery from true features • The text-guided methods cannot recover the original image from the true features • But the outputs have realistic appearances (hallucination). “Realistic reconstructions” may primarily be a blend of (1) classification into trained categories and (2) hallucination: the generation of convincing yet inauthentic images through text-to-image diffusion. [Figure: recovery from true features by Takagi & Nishimoto (2023), Ozcelik & VanRullen (2023), and Shen et al. (2019)]
  24. Output dimension collapse: a regression model’s output collapses to the subspace spanned by the training latent features. Brain data: X (n samples × d voxels); latent features: Y (n samples × k dimensions); ridge regression weights: W = Xᵀ(XXᵀ + λI)⁻¹Y; prediction from new brain data x: ŷ = xW = xXᵀ(XXᵀ + λI)⁻¹Y, i.e., a linear weighted summation of the training latent features (if all input dimensions are used for the prediction of each latent feature). (A numerical check follows below.)
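To make the collapse concrete, here is a small numerical check with plain NumPy ridge regression; the dimensions and regularization value are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 200, 100             # training samples, brain dimensions, latent dimensions
lam = 1.0                          # ridge regularization (arbitrary)

X = rng.standard_normal((n, d))    # training brain data
Y = rng.standard_normal((n, k))    # training latent features

# Ridge weights in kernel form: W = X^T (X X^T + lam I)^(-1) Y
W = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

# Prediction for arbitrary new brain data
x_new = rng.standard_normal((1, d))
y_hat = x_new @ W

# The prediction is a weighted sum of the training latent features (rows of Y)
alpha = x_new @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))
print(np.allclose(y_hat, alpha @ Y))   # True: any output lies in the span of Y

However large the brain input dimension, the predictions are confined to the (at most rank-n) subspace spanned by the n training latent features.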
  25. Is out-of-distribution prediction possible? Simulation with clustered data. [Figure: prediction accuracy and cluster identification as a function of the number of training clusters (10^1–10^3, keeping the total sample size fixed), for in-distribution and out-of-distribution clusters, with chance level indicated] • Diverse training features enable effective out-of-distribution prediction • Compositional prediction: the ability to predict in unseen domains by a combination of predicted features. (A toy version of this simulation is sketched below.)
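The following toy simulation is my own reading of the setup, not the paper’s code: latent features are drawn from tight Gaussian clusters, brain-like data are a noisy linear mixture of those latents, a ridge decoder is trained on varying numbers of clusters with the total sample size held fixed, and out-of-cluster prediction is scored by pattern correlation. All dimensions, cluster counts, and noise levels are assumptions.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
dim_latent, dim_brain, total_samples = 64, 256, 960
within_sd, brain_noise_sd = 0.1, 1.0
# latent -> brain mixing (normalized so each brain dimension mixes all latents)
mixing = rng.standard_normal((dim_latent, dim_brain)) / np.sqrt(dim_latent)

def make_cluster_data(n_clusters, n_samples):
    centers = rng.standard_normal((n_clusters, dim_latent))
    per = n_samples // n_clusters
    Y = np.vstack([c + within_sd * rng.standard_normal((per, dim_latent)) for c in centers])
    X = Y @ mixing + brain_noise_sd * rng.standard_normal((Y.shape[0], dim_brain))
    return X, Y

for n_train_clusters in (2, 8, 64, 480):            # total sample size kept fixed
    X_tr, Y_tr = make_cluster_data(n_train_clusters, total_samples)
    X_te, Y_te = make_cluster_data(16, 160)          # clusters never seen in training
    decoder = Ridge(alpha=1.0).fit(X_tr, Y_tr)
    Y_pred = decoder.predict(X_te)
    r = np.mean([np.corrcoef(t, p)[0, 1] for t, p in zip(Y_te, Y_pred)])
    print(f"{n_train_clusters:4d} training clusters -> out-of-cluster pattern corr {r:.2f}")

In this toy setup, out-of-cluster prediction is expected to improve once the number of training clusters approaches the latent dimensionality, consistent with the linear-scaling point on the next slide.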
  26. How much diversity is necessary? Should the training latent features cover the whole space, or only the effective axes? The training data (latent features, Y) should be diverse enough, but • It does not need to be exponentially diverse • The necessary diversity scales linearly with the dimension of the latent features, since spanning a k-dimensional feature space requires on the order of k linearly independent training directions (clusters).
  27. Questionable train/test splits. Nishimoto et al. (2011) • (Note: retrieval via an encoding model rather than reconstruction) • 37/48 scenes in the test set contain nearly identical frames in the training set • Presumably temporally adjacent frames were split into train and test. Shared categories between training and test • “EEG image reconstruction” (e.g., Kavasidis et al., 2017): 2,000 images from 40 object categories in ImageNet, shared between train and test • “Music reconstruction” (Denk et al., 2023): 540 music pieces from 10 music genres, shared between train and test.
  28. Is image-level information thrown away in hierarchical processing? • Original images can be recovered from the true features of higher CNN layers by pixel optimization with a weak prior (see the sketch below) • Large receptive fields do not necessarily impair neural coding if the number/density of units is fixed (Zhang and Sejnowski, 1999; Majima et al., 2017).
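For illustration, here is a minimal sketch of feature inversion by pixel optimization with a weak total-variation prior. The network (an off-the-shelf AlexNet), the layer, the prior weight, and the step count are assumptions for the sketch, not the settings used in the talk.

import torch
import torchvision.models as models

# Recover an image whose CNN-layer activations match a set of target ("true") features.
device = "cpu"
cnn = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).features.to(device).eval()

def total_variation(img):
    # Weak image prior: penalize differences between neighboring pixels
    return ((img[..., 1:, :] - img[..., :-1, :]).abs().mean()
            + (img[..., :, 1:] - img[..., :, :-1]).abs().mean())

with torch.no_grad():
    target_img = torch.rand(1, 3, 224, 224, device=device)   # stand-in for a real stimulus
    target_feat = cnn(target_img)                             # "true" features of the top conv layer

recon = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([recon], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(cnn(recon), target_feat) + 1e-2 * total_variation(recon)
    loss.backward()
    optimizer.step()
    recon.data.clamp_(0.0, 1.0)    # keep pixels in a valid range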
  29. Caveats with evaluation by identification • Given a prediction, identify the most similar one among candidates • Even if the prediction only captures two broad categories (e.g., dark vs. bright), pairwise identification accuracy can reach 75%: cross-category pairs are identified almost perfectly while same-category pairs are at chance, and the two cases average to about 75% (see the toy simulation below) • High identification accuracy does not imply good reconstruction • Use multiple evaluations, including visual inspection.
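A toy simulation of that arithmetic, with an assumed two-dimensional feature (a category signal the decoder captures plus item-specific detail it misses); all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
category = rng.integers(0, 2, n)                    # dark (0) vs. bright (1)
detail = rng.standard_normal(n)                     # item-specific detail, missed by the decoder

true_feat = np.stack([5.0 * category, detail], axis=1)       # true image features
pred_feat = np.stack([5.0 * category, np.zeros(n)], axis=1)  # prediction carries only the category

correct, trials = 0.0, 20000
for _ in range(trials):
    i, j = rng.choice(n, size=2, replace=False)
    d_target = np.linalg.norm(pred_feat[i] - true_feat[i])      # distance to the true item
    d_distractor = np.linalg.norm(pred_feat[i] - true_feat[j])  # distance to the other candidate
    correct += 1.0 if d_target < d_distractor else 0.5 if d_target == d_distractor else 0.0

print(correct / trials)   # about 0.75: ~100% for cross-category pairs, ~50% for same-category pairs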
  30. How are we fooled by generative AIs? Introducing generative AIs: • Our (old) intuition: only truthful things look realistic • But now, generative AIs produce a lot of realistic-looking but untruthful things • Researchers need to update this intuition to better estimate Pr[T|R] (T: truthful, R: realistic-looking).
  31. Illusions in AI-driven scientific research (Messeri & Crockett, 2024). Alchemist: "It's shining golden... I’ve found how to make gold!” Illusion of explanatory depth; illusion of explanatory breadth; illusion of objectivity.
  32. Summary • Brain decoding as psychological measurement • Reconstruction: zero-shot prediction of arbitrary instances using compositional latent representation • Classification + hallucination by text-guided methods? • Cherry-picking • Low-diversity data with train/test overlap • Recovery failure: misspecification of latent features • Output dimension collapse and linear scaling of diversity • Image information preserved across the visual hierarchy • How are we fooled by generative AIs?