Slide 1

Slide 1 text

The thin line between reconstruction, classification and hallucination in brain decoding Yuki Kamitani Kyoto University & ATR http://kamitani-lab.ist.i.kyoto-u.ac.jp @ykamit Pierre Huyghe ‘Uumwelt’ (2018)

Slide 2

Slide 2 text

Kyoto University and ATR Ken Shirakawa Yoshihiro Nagano Shuntaro Aoki Misato Tanaka Yusuke Muraki Tomoyasu Horikawa (NTT) Guohua Shen (UEC) Kei Majima (NIRS) Grants JSPS KAKENHI, JST CREST, NEDO Acknowledgements

Slide 3

Slide 3 text

Reconstruct Test image Takagi and Nishimoto, 2023 Ozcelik and VanRullen, 2023 Reconstruction • Reconstruction from visual features + text feature-guided diffusion • Natural Scene Dataset (NSD; Allen et al., 2022)

Slide 4

Slide 4 text

Treatise on Man (Descartes, 1677) World, Brain, and Mind Ernst Mach’s drawing of his own visual scene (Mach, 1900)

Slide 5

Slide 5 text

Fechner’s inner and outer psychophysics Slide: Cheng Fan

Slide 6

Slide 6 text

Button press Fechner’s inner and outer psychophysics

Slide 7

Slide 7 text

Fechner’s inner and outer psychophysics Richer contents revealed? Brain decoding as psychological measurement Not a parlor trick!

Slide 8

Slide 8 text

Brain decoding Let the machine recognize brain activity patterns that humans cannot (Kamitani & Tong, Nature Neuroscience 2005) Machine learning prediction Neural mind-reading via shared representation

Slide 9

Slide 9 text

Visual image reconstruction (Miyawaki, Uchida, Yamashita, Sato, Morito, Tanabe, Sadato, Kamitani, Neuron 2008) [Figure: presented vs. reconstructed images]

Slide 10

Slide 10 text

10 × 10 binary “pixels” → 2^100 (≈ 10^30, a number with ~30 zeros) possible images. Brain data can be measured for only a tiny subset of these images.

Slide 11

Slide 11 text

Multi-scale Image bases + + + Presented image (contrast) Reconstructed image (contrast) fMRI signals Modular (compositional) decoding with local contrast features Training: ~400 random images Test: Images not used in training (arbitrary images; 2^100)
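To make the modular decoding idea concrete, here is a minimal sketch with simulated data and hypothetical variable names (not the original implementation): each elemental feature, i.e., the contrast of a local multi-scale image basis, gets its own linear decoder from fMRI activity, and the reconstruction is the combination of the decoded basis contrasts.

```python
# Minimal sketch of modular (compositional) decoding with simulated data;
# the real model uses local multi-scale image bases rather than single pixels.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_voxels, img_size = 400, 1000, 10               # ~400 random training images, 10x10 "pixels"
bases = np.eye(img_size * img_size)                        # placeholder: one basis per pixel

X_train = rng.standard_normal((n_train, n_voxels))         # fMRI patterns (simulated)
C_train = rng.integers(0, 2, (n_train, bases.shape[0])).astype(float)  # basis contrasts of presented images

# One linear decoder per elemental feature (basis contrast)
decoders = [Ridge(alpha=1.0).fit(X_train, C_train[:, j]) for j in range(bases.shape[0])]

def reconstruct(x):
    """Decode each basis contrast from brain activity, then combine the weighted bases."""
    contrasts = np.array([d.predict(x[None])[0] for d in decoders])
    return (contrasts @ bases).reshape(img_size, img_size)

print(reconstruct(rng.standard_normal(n_voxels)).shape)    # (10, 10) reconstructed contrast image
```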

Slide 12

Slide 12 text

Classification vs. reconstruction Classification: • Classes are predefined and shared between train and test Reconstruction: • Ability to predict arbitrary instances in the space of interest • Zero-shot prediction: beyond training outputs (as opposed to “double dipping”) How to build a reconstruction model with limited data? • Compositional representation: instances are represented by a combination of elemental features (e.g., pixels, wavelets) • Effective mapping from brain activity to each elemental feature

Slide 13

Slide 13 text

Visual image reconstruction by decoding local contrasts (Neuron, 2008) Decoding dream contents in semantic categories (Science, 2013) [Diagram: low-level to high-level representations; DNN]

Slide 14

Slide 14 text

Nameless, faceless features of DNN [Figure: CNN architecture from Krizhevsky et al., 2012 (AlexNet), DNN1–DNN8: convolutional layers followed by fully-connected layers] • Won the object recognition challenge in 2012 • 60 million parameters and 650,000 neurons (units) • Trained with 1.2 million annotated images to classify 1,000 object categories
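As an illustration of what these “nameless, faceless” features are, the sketch below pulls intermediate-layer activations from a pretrained AlexNet in torchvision (the same family of network as Krizhevsky et al., 2012, though not necessarily the exact weights used in the studies discussed here); the image path is a hypothetical placeholder.

```python
# Sketch: extracting intermediate unit activations from a pretrained AlexNet-family CNN,
# assuming torch/torchvision are available.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

activations = {}
def hook(name):
    def fn(module, inp, out):
        activations[name] = out.detach().flatten(1)   # units of this layer, flattened
    return fn

# Register hooks on the convolutional and fully-connected stages
for i, layer in enumerate(model.features):
    layer.register_forward_hook(hook(f"features.{i}"))
for i, layer in enumerate(model.classifier):
    layer.register_forward_hook(hook(f"classifier.{i}"))

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # hypothetical image path
with torch.no_grad():
    model(img)
print({k: v.shape for k, v in activations.items()})
```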

Slide 15

Slide 15 text

Brain-to-DNN decoding (translation) and hierarchical correspondence [Plot: true vs. predicted feature values of unit #562 of CNN8, predicted from VC] (Horikawa and Kamitani, 2015; Nature Communications 2017; Nonaka et al., 2019)
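Conceptually, the decoding behind a plot like this is per-unit linear regression from visual-cortex voxels to DNN feature values, evaluated by correlating predicted and true values over held-out images. A sketch with simulated placeholder data, using ridge regression for simplicity (the original studies used their own linear decoders); the layer size and unit index are illustrative.

```python
# Sketch of brain-to-DNN feature decoding: predict DNN unit activations from fMRI
# voxels with a linear model and correlate predictions with the true values
# on held-out images. Data here are simulated placeholders.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_test, n_voxels, n_units = 1200, 50, 1000, 4096

X_train = rng.standard_normal((n_train, n_voxels))   # VC activity, training images
X_test = rng.standard_normal((n_test, n_voxels))
Y_train = rng.standard_normal((n_train, n_units))    # true DNN features (e.g., a CNN8/fc layer)
Y_test = rng.standard_normal((n_test, n_units))

decoder = Ridge(alpha=100.0).fit(X_train, Y_train)   # one linear map, all units at once
Y_pred = decoder.predict(X_test)

# Per-unit correlation between predicted and true feature values
unit = 562
r = np.corrcoef(Y_pred[:, unit], Y_test[:, unit])[0, 1]
print(f"unit #{unit}: r = {r:.3f}")                  # near 0 here because the data are random
```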

Slide 16

Slide 16 text

Deep image reconstruction (Shen, Horikawa, Majima, Kamitani, bioRxiv 2017; PLoS CB 2019)

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

[Figure: true test images and reconstructions from VC with decoders trained on different training sets (GOD, FMD, Scene image sets and their combinations)]

Slide 19

Slide 19 text

Imagined Reconstruction (Shen, Horikawa, Majima, Kamitani, bioRxiv 2017; PLoS CB 2019; movie by M. Tanaka) Mental imagery

Slide 20

Slide 20 text

Illusions (Cf., Shimojo, Kamitani, Nishida, Science 2001)

Slide 21

Slide 21 text

(Cheng, Horikawa, Majima, Aoki, Abdelhack, Tanaka, Kamitani, Science Advances 2023) Training on natural images: fMRI activity → decoder → stimulus DNN features. Test on illusory images: decoded DNN features → generator → reconstruction.

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Translator–generator pipeline Shirakawa, Nagano, Tanaka, Aoki, Majima, Muraki, Kamitani, arXiv 2024 • Translator: aligns brain activity to a machine’s latent features (neural-to-machine latent translation) • Generator: recovers an image from the latent features Latent features: DNN features of an image; a compositional representation that spans the image space of interest.
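A skeleton of the two-stage pipeline as described here, under the assumption that the translator is implemented as a ridge regression and with the generator left as a placeholder interface (the class and function names are hypothetical, not the authors’ code):

```python
# Skeleton of the translator–generator pipeline (a sketch, not the original implementation).
# Stage 1 (translator): regress the machine's latent features from brain activity.
# Stage 2 (generator): recover an image from the predicted latent features.
import numpy as np
from sklearn.linear_model import Ridge

class Translator:
    """Neural-to-machine latent translation via ridge regression."""
    def __init__(self, alpha=100.0):
        self.model = Ridge(alpha=alpha)
    def fit(self, brain_train, latent_train):
        self.model.fit(brain_train, latent_train)
        return self
    def predict(self, brain_test):
        return self.model.predict(brain_test)

def generate(latent_features):
    """Placeholder for a generator (e.g., feature inversion or a generative model)
    that maps predicted latent features back to pixel space."""
    raise NotImplementedError

# Usage with simulated data:
rng = np.random.default_rng(0)
brain_train, latent_train = rng.standard_normal((1200, 1000)), rng.standard_normal((1200, 4096))
brain_test = rng.standard_normal((50, 1000))
latent_pred = Translator().fit(brain_train, latent_train).predict(brain_test)
# images = generate(latent_pred)   # reconstruction step
```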

Slide 24

Slide 24 text

Realistic output using text-guided methods Reconstruct Test image Takagi and Nishimoto, 2023 Ozcelik and VanRullen, 2023 Reconstruction • Reconstruction from visual features + text feature (CLIP)-guided diffusion • Natural Scene Dataset (NSD; Allen et al., 2022) Takagi & Nishimoto (2023) Ozcelik & VanRullen (2023) Shen et al. (2019)

Slide 25

Slide 25 text

Cherry-picking from multiple generations • Takagi & Nishimoto (2023) generated multiple images and selected the best one, a highly questionable procedure • Plausible reconstructions even from random brain data • Something more than cherry-picking? Takagi & Nishimoto (2023)

Slide 26

Slide 26 text

Failure of replication with a different dataset Test image • The text-guided methods fail to generalize • The Shen et al. (2019) method shows consistent performance across datasets Test image Replication with a different dataset

Slide 27

Slide 27 text

Issues with NSD • Only ~40 semantic clusters, described by single words • Significant overlap between training and test sets • For each test image, visually similar images are found in the training set

Slide 28

Slide 28 text

Miyawaki et al. (2008) Train 440 random binary images 1200 natural images Test Simple shapes (+ independent set of random images) Train/test splits in our previous studies Shen et al. (2019) Natural images from different categories + simple shapes • Designed to avoid visual and semantic overlaps. Testing out-of- distribution/domain generalization • Why artificial images in test? Pitfalls of naturalistic approaches • The brain has evolved with “natural” images but we can perceive artificial images, too. Models should account for this • Risk of unintended shortcuts with an increasing scale/complexity of data analysis. Use artificial images for control

Slide 29

Slide 29 text

Failure of zero-shot prediction • CLIP almost always fails to identify the true sample against training samples (see the identification sketch below) • The predictions for the cluster excluded from training fall on the other clusters • Severely limited ability to predict outside the training set ≈ classification Excluded from training
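For reference, identification against candidates is typically done by matching the decoded feature vector to each candidate’s true features, e.g. by cosine similarity; below is a small sketch with simulated data (the feature dimension, candidate count, and the use of cosine similarity are illustrative assumptions).

```python
# Sketch of identification against candidate samples: pick the candidate whose
# true feature vector is most similar to the predicted features (simulated data).
import numpy as np

def identify(pred, candidates):
    """Return the index of the candidate with the highest cosine similarity to pred."""
    pred = pred / np.linalg.norm(pred)
    cand = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(cand @ pred))

rng = np.random.default_rng(0)
pred_features = rng.standard_normal(512)                 # decoded (e.g., CLIP-like) features of a test sample
candidate_features = rng.standard_normal((1000, 512))    # true features: test sample + training samples
hit = identify(pred_features, candidate_features) == 0   # assume index 0 is the true test sample
print("identified true sample:", hit)
```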

Slide 30

Slide 30 text

Failed recovery from true features • The text-guided methods cannot recover the original image from the true features • But the outputs have realistic appearances (hallucination) “Realistic reconstructions” may primarily be a blend of 1. Classification into trained categories 2. Hallucination: the generation of convincing yet inauthentic images through text-to-image diffusion Takagi & Nishimoto (2023) Ozcelik & VanRullen (2023) Shen et al. (2019)

Slide 31

Slide 31 text

Output dimension collapse A regression model’s output collapses to the subspace spanned by the training latent features Brain data: $X \in \mathbb{R}^{n \times d}$; latent features: $Y \in \mathbb{R}^{n \times k}$; ridge regression weights: $W = (X^{\top}X + \lambda I)^{-1} X^{\top} Y$; prediction from new brain data $x$: $\hat{y} = W^{\top} x = Y^{\top} X (X^{\top}X + \lambda I)^{-1} x = \sum_i \alpha_i\, y_i$ The prediction is a linear weighted summation of the training latent features $y_i$ (if all input dimensions are used for the prediction of each latent feature)
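The collapse can be checked numerically: with ridge regression, the prediction for any new brain pattern is an exact linear combination of the training latent features, so it lies in their span. A small simulation with random data and illustrative dimensions:

```python
# Numerical check of output dimension collapse: ridge predictions lie in the
# subspace spanned by the training latent features (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lam = 100, 500, 300, 1.0
X = rng.standard_normal((n, d))          # training brain data
Y = rng.standard_normal((n, k))          # training latent features
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # ridge weights

x_new = rng.standard_normal(d)           # new brain pattern (e.g., out-of-distribution)
y_hat = x_new @ W

# Project y_hat onto the span of the training latent features; the residual is
# numerically zero, i.e., y_hat = sum_i alpha_i * y_i for some weights alpha.
alpha, *_ = np.linalg.lstsq(Y.T, y_hat, rcond=None)
print(np.linalg.norm(y_hat - Y.T @ alpha))   # ~1e-13: prediction collapses onto span(Y)
```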

Slide 32

Slide 32 text

Is out-of-distribution prediction possible? Simulation with clustered data [Plot: prediction accuracy and cluster identification vs. number of training clusters (10^1–10^3, total sample size fixed), for clusters inside vs. outside the training set; chance level indicated] Latent feature space (Y) • Diverse training features enable effective out-of-distribution prediction • Compositional prediction: the ability to predict in unseen domains by a combination of predicted features
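A compact sketch of this kind of simulation, with illustrative dimensions and noise levels rather than those of the original study: latent features are drawn from clusters, a ridge decoder is trained with a varying number of clusters at fixed total sample size, and prediction is evaluated on a cluster never seen in training.

```python
# Clustered-data simulation sketch: keep total training size fixed, vary the number
# of clusters seen in training, and test on a held-out cluster (illustrative settings).
import numpy as np

rng = np.random.default_rng(0)
d, k, n_total = 200, 50, 2000                  # brain dim, latent dim, total training samples
true_map = rng.standard_normal((d, k))         # ground-truth latent-to-brain mapping

def sample_cluster(center, n):
    y = center + 0.1 * rng.standard_normal((n, k))           # latents around a cluster center
    x = y @ true_map.T + 0.1 * rng.standard_normal((n, d))   # brain activity generated from latents
    return x, y

for n_clusters in (2, 20, 200):
    centers = rng.standard_normal((n_clusters, k))
    per = n_total // n_clusters
    X, Y = map(np.vstack, zip(*(sample_cluster(c, per) for c in centers)))
    W = np.linalg.solve(X.T @ X + np.eye(d), X.T @ Y)        # ridge decoder (brain -> latent)

    X_out, Y_out = sample_cluster(rng.standard_normal(k), 100)  # cluster never seen in training
    r = np.mean([np.corrcoef(x @ W, y)[0, 1] for x, y in zip(X_out, Y_out)])
    print(f"{n_clusters:4d} training clusters: out-of-distribution pattern correlation = {r:.2f}")
```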

Slide 33

Slide 33 text

How much diversity is necessary? Should the training latent features cover the whole space, or only the effective axes? Latent feature space (Y) Training data (latent features, Y) should be diverse enough, but • It need not be exponentially diverse • The necessary diversity scales linearly with the dimension of the latent features

Slide 34

Slide 34 text

Questionable train/test split Nishimoto et al. 2011 • (Note: Retrieval via an encoding model rather than reconstruction) • 37/48 scenes in test contain nearly identical frames in train • Presumably temporally adjacent frames were split into train and test Shared categories between training and test • “EEG image reconstruction” (e.g., Kavasidis et al., 2017): 2000 images of 40 object categories in ImageNet, shared between train and test • “Music reconstruction” (Denk et al., 2023): 540 music pieces from 10 music genres shared between train and test

Slide 35

Slide 35 text

• Original images can be recovered from the true features of higher CNN layers by pixel optimization with a weak prior, as in the sketch below • Large receptive fields do not necessarily impair neural coding if the number/density of units is fixed (Zhang and Sejnowski, 1999; Majima et al., 2017) Image-level information is thrown away in hierarchical processing?
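The recovery referred to here is generic feature inversion: optimize pixels by gradient descent so that their CNN features match the target features, with a weak image prior (total variation in this sketch) for regularization. A hedged sketch assuming torch/torchvision and an AlexNet-style network, not the exact network, layer, or prior used in the original work:

```python
# Sketch: recover an image from higher-layer CNN features by pixel optimization
# with a weak total-variation prior (generic feature inversion).
import torch
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)

def layer_features(img, layer=10):                      # index 10 = conv5 in torchvision's AlexNet
    out = img
    for i, m in enumerate(model.features):
        out = m(out)
        if i == layer:
            break
    return out

target_img = torch.rand(1, 3, 224, 224)                 # stand-in for the original image
target_feat = layer_features(target_img)

x = torch.rand(1, 3, 224, 224, requires_grad=True)      # pixels to optimize, starting from noise
opt = torch.optim.Adam([x], lr=0.05)
for step in range(500):
    opt.zero_grad()
    feat_loss = torch.nn.functional.mse_loss(layer_features(x), target_feat)
    tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    (feat_loss + 1e-2 * tv).backward()                  # weak TV prior keeps pixels smooth
    opt.step()
# x now approximates the original image given only its higher-layer features
```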

Slide 36

Slide 36 text

Caveats with evaluation by identification • Given a prediction, identify the most similar one among candidates • Even if a reconstruction only distinguishes two broad categories (e.g., dark vs. bright), the pairwise identification accuracy can reach 75% (see the worked example below) • High identification accuracy does not imply good reconstruction • Use multiple evaluations, including visual inspection
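A worked example behind the 75% figure, assuming the reconstruction carries only one bit of information (which of two equally likely broad categories the image belongs to) and the pairwise candidate is drawn at random:

```latex
\[
\Pr[\text{correct pairwise identification}]
= \underbrace{\tfrac{1}{2} \times 1}_{\text{pair from different categories}}
+ \underbrace{\tfrac{1}{2} \times \tfrac{1}{2}}_{\text{pair from the same category (chance)}}
= 0.75
\]
```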

Slide 37

Slide 37 text

How are we fooled by generative AIs? Introducing generative AIs • Our (old) intuition: only truthful things look realistic • But now, generative AIs produce many realistic-looking yet untruthful things • Researchers need to update their intuition to better estimate Pr[T|R] T: truthful R: realistic-looking
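The required update can be written as a Bayes calculation; the numbers below are made-up illustrations, not estimates from the talk:

```latex
\[
\Pr[T \mid R] = \frac{\Pr[R \mid T]\,\Pr[T]}
                     {\Pr[R \mid T]\,\Pr[T] + \Pr[R \mid \lnot T]\,\Pr[\lnot T]}
\]
% Old intuition: untruthful things rarely look realistic, e.g.
%   Pr[R|T] = 0.9, Pr[R|~T] = 0.01, Pr[T] = 0.5  =>  Pr[T|R] ~ 0.99
% With generative AIs producing realistic fakes:
%   Pr[R|~T] = 0.8 at the same prior             =>  Pr[T|R] ~ 0.53
```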

Slide 38

Slide 38 text

Illusions in AI-driven scientific research Messeri & Crockett (2024) Alchemist: "It's shining golden... I’ve found how to make gold!” Illusion of explanatory depth Illusion of explanatory breadth Illusion of objectivity

Slide 39

Slide 39 text

Summary • Brain decoding as psychological measurement • Reconstruction: zero-shot prediction of arbitrary instances using compositional latent representation • Classification + hallucination by text-guided methods? • Cherry-picking • Low-diversity data with train/test overlap • Recovery failure: misspecification of latent features • Output dimension collapse and linear scaling of diversity • Image information preserved across the visual hierarchy • How are we fooled by generative AIs?