Slide 1

Slide 1 text

1 Geometric Modeling of Crystal Structures using Transformers Tatsunori Taniai Senior Researcher OMRON SINIC X Corporation Seminar at RIKEN Center for AIP June 5th, 2025, RIKEN AIP Nihonbashi Office 15:00 – 16:00

Slide 2

Slide 2 text

2 Tatsunori Taniai ―Short Bio― • 2017: Ph.D. from UTokyo (Advisor: Prof. Yoichi Sato) • 2017-2019: PD at RIKEN AIP (Discrete Optimization Unit led by Dr. Maehara) • Since 2019: Senior Researcher at OMRON SINIC X Stereo & flow & motion seg [CVPR 17] Semantic correspondence [CVPR 16] Stereo depth estimation [CVPR 14, TPAMI 18] Binary MRF optimization [BMVC 12, CVPR 15] My PhD thesis was about discrete optimization for low-level computer vision tasks such as stereo and segmentation, without any “learning”

Slide 3

Slide 3 text

3 Tatsunori Taniai ―Short Bio― • 2017: Ph.D. from UTokyo (Advisor: Prof. Yoichi Sato) • 2017-2019: PD at RIKEN AIP (Discrete Optimization Unit led by Dr. Maehara) • Since 2019: Senior Researcher at OMRON SINIC X In the deep learning era, I have been seeking methodologies for integrating physical or algorithmic principles into deep learning-based methods. Physics-based self-supervised learning [Taniai & Maehara, ICML 18] Neural A* for learning path planning problems [Yonetani & Taniai+, ICML 21] Transformer encoders for crystals [Taniai+, ICLR 24] [Ito & Taniai+, ICLR 25]

Slide 4

Slide 4 text

4 Substances as 3D point clouds of atoms. Substances are atoms forming stable structures in 3D space. • Molecules: 3D structures of up to 100s of atoms. • Proteins: huge molecules with 1k to 10k atoms, coded as 1D amino-acid sequences. • Crystals: infinite number of atoms with periodicity; the focus in this talk.

Slide 5

Slide 5 text

5 Transformer encoders for understanding contexts. Example: "He runs a company." Initially, "runs" is ambiguous (jog or manage?). Through self-attention with 1D sequential positions, each token state is refined in context: "He" is resolved as the subject, "company" as the object (a business org), and "runs" as the verb "manage". Self-attention can estimate context-dependent meanings of words in text.

Slide 6

Slide 6 text

6 Transformer encoders for understanding contexts. Example: "He runs a dog." The same ambiguous verb "runs" now takes "dog" (an animal, a pet) as its object, so self-attention with 1D sequential positions resolves it to "train/exercise". Self-attention can estimate context-dependent meanings of words in text.

Slide 7

Slide 7 text

7 Transformer encoders for analyzing substance structures. Structure of the H2O molecule: one oxygen (O) and two hydrogen (H) atoms. "H2O Molecule" (https://skfb.ly/6QWvZ) by Mehdi Mirzaie is licensed under Creative Commons Attribution (http://creativecommons.org/licenses/by/4.0/).

Slide 8

Slide 8 text

8 Transformer encoders for analyzing substance structures. Each atom (H, O, H in H2O) is an atomic token carrying an abstract state. Use self-attention with 3D spatial positions to evolve these atomic states in the given spatial configuration, yielding task-related states of H, O, and H in H2O.

Slide 9

Slide 9 text

9 Geometric deep learning for materials science. Property prediction (today's main topic): high-throughput screening of materials; basic benchmark tasks for material encoders; needs invariant encoders. Structure prediction: generate novel structures (e.g., inverse design); find stabler structures; predict chemical reactions; needs equivariant decoders. Foundation models (our recent results are briefly introduced at the end): predict high-level functionalities of materials; map the materials space; material encoders as an interface to multimodal FMs.

Slide 10

Slide 10 text

10 Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding Tatsunori Taniai OMRON SINIC X Corporation Ryo Igarashi OMRON SINIC X Corporation Yuta Suzuki Toyota Motor Corporation Naoya Chiba Tohoku University Kotaro Saito Randeft Inc. Osaka University Yoshitaka Ushiku OMRON SINIC X Corporation Kanta Ono Osaka University The Twelfth International Conference on Learning Representations May 7th through 11th, 2024 at Messe Wien Exhibition and Congress Center Vienna, Austria 2024

Slide 11

Slide 11 text

11 Materials science and crystal structures. Materials science: explore and develop new materials with useful properties and functionalities, such as superconductors and battery materials. Crystal structure: • The "source code" of a material. • An infinitely repeating, periodic arrangement of atoms in 3D space. • Described by a minimum repeatable pattern called the unit cell. Crystal structure of NaCl (unit cell).

Slide 12

Slide 12 text

12 Material property prediction Crystal structure Material properties • Formation energy • Total energy • Bandgap • Energy above hull • etc. Neural network • High-throughput alternative to physics simulation. • Material screening to accelerate material discovery and development. • Interfaces to multimodal foundation models via crystal encoders. (unit cell)

Slide 13

Slide 13 text

13 Periodic SE(3)-invariant prediction with interatomic message passing layers. A crystal structure is mapped to material properties (formation energy, total energy, bandgap, energy above hull, etc.). • Evolve the state feature of each unit-cell atom via interatomic interactions. • For property prediction, networks need to be invariant under periodic SE(3) transformations of atomic positions (rotation, translation, and periodic boundary shifts). • While not our focus, force prediction requires networks to be SE(3) equivariant.
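As a quick sanity check of these invariances, one can apply a random rotation, translation, and periodic boundary shift and verify that a trained model's prediction is unchanged; a hypothetical sketch (the `model` API and structure format are placeholders, not any specific library):

```python
import numpy as np

def random_rotation():
    # QR of a random matrix gives a random orthogonal matrix;
    # flip one column if needed so det(R) = +1 (a proper rotation).
    q, _ = np.linalg.qr(np.random.randn(3, 3))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

def check_invariance(model, cell, frac_coords, species):
    """model(cell, frac_coords, species) -> scalar property (placeholder API).
    cell: (3, 3) rows are lattice vectors; frac_coords: (N, 3) fractional positions."""
    y_ref = model(cell, frac_coords, species)

    R = random_rotation()
    cell_rot = cell @ R.T                              # rotate every lattice vector
    # A rigid translation plus a periodic boundary shift only changes the
    # fractional coordinates by a constant offset modulo 1.
    frac_shifted = (frac_coords + np.random.rand(3)) % 1.0

    y_new = model(cell_rot, frac_shifted, species)
    return np.allclose(y_ref, y_new, atol=1e-5)        # should be True for an invariant model
```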

Slide 14

Slide 14 text

14 Advances in material representation learning. Advances in ML: CNNs (2011-), e.g., ResNet [He+ 2015]; GNNs (2015-), e.g., PointNet [Qi+ 2016], DeepSets [Zaheer+ 2017], GCN [Kipf & Welling 2017], GIN [Xu+ 2018]; Transformers (2017-), e.g., Transformer [Vaswani+ 2017], BERT [Devlin+ 2018], image generation [Parmar+ 2018], ViT [Dosovitskiy+ 2020]. Molecules (3D arrangements of finite atoms): many geometric GNNs (Duvenaud+ 2015; Kearnes+ 2016; Gilmer+ 2017), followed by the success of Transformers (Graphormer [Ying+ 2021], Equiformer [Liao+ 2023]). Crystals (3D arrangements of infinite atoms): many geometric GNNs (CGCNN [Xie & Grossman, 2018], SchNet [Schütt+ 2018], MEGNet [Chen+ 2019]) and the emergence of Transformers (Matformer [Yan+ 2022]). Graphormer (2021) demonstrated the effectiveness of fully connected self-attention for molecules, but its applicability to infinitely periodic crystal structures remains an open question.

Slide 15

Slide 15 text

15 Atomic state evolution by self-attention (molecule). Fully connected self-attention for finite elements: y_i = (1/Z_i) Σ_j exp(q_i·k_j/√d + φ_ij)(v_j + ψ_ij), with Z_i = Σ_j exp(q_i·k_j/√d + φ_ij). Relative position representations φ_ij and ψ_ij: • Encode the relative position r_ij = p_j - p_i between atoms i and j. • φ_ij: scalar bias for the softmax logits. • ψ_ij: vector bias for the value features. • Distance-based representations ensure SE(3) invariance. Atom-wise state features: • q_i, k_j, and v_j: linear projections of the input atom-wise state features x_i and x_j. • y_i: output atom-wise state feature.
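For concreteness, a minimal NumPy sketch of this kind of distance-biased self-attention (single head; the Gaussian bias for φ_ij and the trivial distance-based ψ_ij are illustrative choices, not the paper's exact parameterization):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def finite_attention(x, pos, Wq, Wk, Wv, sigma=1.0):
    """x: (N, d_in) atom features, pos: (N, 3) Cartesian positions."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    dk = q.shape[-1]
    rel = pos[None, :, :] - pos[:, None, :]          # r_ij = p_j - p_i, shape (N, N, 3)
    dist2 = (rel ** 2).sum(-1)
    phi = -dist2 / (2.0 * sigma ** 2)                # scalar bias: Gaussian distance decay
    # psi_ij: value bias derived from the distance (a trivial stand-in here)
    psi = np.exp(phi)[..., None] * np.ones(v.shape[-1])
    attn = softmax(q @ k.T / np.sqrt(dk) + phi, axis=-1)
    return (attn[..., None] * (v[None, :, :] + psi)).sum(axis=1)   # (N, d_v)
```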

Slide 16

Slide 16 text

16 Atomic state evolution by self-attention (molecule vs crystal structure). Fully connected self-attention for finite elements extends to infinitely connected self-attention for periodic elements: y_i = (1/Z_i) Σ_j Σ_n exp(q_i·k_j/√d + φ_ij(n))(v_j + ψ_ij(n)), where n ranges over all 3D integer translations of the unit cell. φ_ij(n) and ψ_ij(n) encode the relative position r_ij(n) = p_j + n1·l1 + n2·l2 + n3·l3 - p_i to reflect periodic unit cell shifts n.
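In practice the infinite sum over unit-cell shifts n must be truncated; a rough sketch (the truncation radius R and the Gaussian φ are illustrative assumptions) of enumerating shifts and accumulating exp(φ_ij(n)):

```python
import itertools
import numpy as np

def periodic_distance_decay(pos, lattice, sigma=1.0, R=2):
    """pos: (N, 3) Cartesian positions in the unit cell.
    lattice: (3, 3) rows are the lattice vectors l1, l2, l3.
    Returns sum_n exp(phi_ij(n)) = sum_n exp(-|r_ij(n)|^2 / 2 sigma^2)."""
    N = pos.shape[0]
    acc = np.zeros((N, N))
    for n in itertools.product(range(-R, R + 1), repeat=3):
        shift = np.asarray(n, dtype=float) @ lattice          # n1*l1 + n2*l2 + n3*l3
        r = pos[None, :, :] + shift - pos[:, None, :]         # r_ij(n) = p_j + shift - p_i
        acc += np.exp(-(r ** 2).sum(-1) / (2.0 * sigma ** 2))
    return acc   # (N, N): contributions decay with interatomic distance
```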

Slide 17

Slide 17 text

17 Interpretation as neural potential summation. The attention weight exp(q_i·k_j/√d + φ_ij(n)) acts as a distance-decay function of the interatomic distance ||r_ij(n)||, where φ_ij(n) = -||r_ij(n)||²/2σ_i² gives Gaussian distance-decay attention over the unit cell and its periodic images.

Slide 18

Slide 18 text

18 Interpretation as neural potential summation. With distance-decay attention, the update is interpreted as interatomic energy calculations in an abstract feature space: • exp(φ_ij(n)), a function of the distance ||r_ij(n)||, acts as an abstract interatomic potential between atoms i and j(n). • v_j + ψ_ij(n) acts as an abstract influence on atom i from atom j(n). Analogy to potential summation in physics simulations: for example, the electric potential energy between one particle i and many particles j, with electric charges q_i and q_j, is calculated as J_i = Σ_j (1/4πε0) · q_i q_j / ||p_j - p_i||.

Slide 19

Slide 19 text

19 Performed as finite-element self-attention. Infinitely connected attention can be performed just like standard self-attention for finite elements with new position encodings α_ij and β_ij: y_i = (1/Z_i) Σ_j exp(q_i·k_j/√d + α_ij)(v_j + β_ij), where α_ij = log Σ_n exp(φ_ij(n)) is the periodic spatial encoding and β_ij = Σ_n exp(φ_ij(n)) ψ_ij(n) / Σ_n exp(φ_ij(n)) is the periodic edge encoding.
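A sketch of how such encodings could be computed with a truncated lattice sum, with α_ij as a log-sum-exp over shifts and β_ij as the matching weighted average of ψ_ij(n) (the `psi_fn`, truncation radius, and Gaussian φ are placeholder assumptions, not the paper's exact implementation):

```python
import itertools
import numpy as np

def periodic_encodings(pos, lattice, psi_fn, sigma=1.0, R=2):
    """psi_fn(dist) maps an (N, N) array of distances to (N, N, dv) edge features,
    e.g., an RBF expansion. Returns alpha: (N, N) and beta: (N, N, dv)."""
    N = pos.shape[0]
    exp_phi_sum = np.zeros((N, N))
    weighted_psi = 0.0
    for n in itertools.product(range(-R, R + 1), repeat=3):
        shift = np.asarray(n, dtype=float) @ lattice
        r = pos[None, :, :] + shift - pos[:, None, :]
        dist = np.sqrt((r ** 2).sum(-1))
        w = np.exp(-dist ** 2 / (2.0 * sigma ** 2))            # exp(phi_ij(n))
        exp_phi_sum += w
        weighted_psi = weighted_psi + w[..., None] * psi_fn(dist)
    alpha = np.log(exp_phi_sum + 1e-12)                        # alpha_ij = log sum_n exp(phi_ij(n))
    beta = weighted_psi / (exp_phi_sum[..., None] + 1e-12)     # weighted average of psi_ij(n)
    return alpha, beta
```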

Slide 20

Slide 20 text

20 Evaluations on the Materials Project dataset (MAE; lower is better).
                                    Form E.          Bandgap          Bulk mod.     Shear mod.
Train/Val/Test                      60000/5000/4239  60000/5000/4239  4664/393/393  4664/392/393
MAE unit                            eV/atom          eV               log(GPa)      log(GPa)
CGCNN [Xie & Grossman, 2018]        0.031            0.292            0.047         0.077
SchNet [Schütt+, 2018]              0.033            0.345            0.066         0.099
MEGNet [Chen+, 2019]                0.03             0.307            0.06          0.099
GATGNN [Louis+, 2020]               0.033            0.28             0.045         0.075
M3GNet [Chen & Ong, 2022]           0.024            0.247            0.05          0.087
ALIGNN [Choudhary & DeCost, 2021]   0.022            0.218            0.051         0.078
Matformer [Yan+, 2022]              0.021            0.211            0.043         0.073
PotNet [Lin+, 2023]                 0.0188           0.204            0.04          0.065
Crystalformer                       0.0198           0.201            0.0399        0.0692
Crystalformer consistently outperforms most of the existing methods in various property prediction tasks, while remaining competitive with the GNN-based SOTA, PotNet [Lin+, 2023].

Slide 21

Slide 21 text

21 Evaluations on the JARVIS-DFT 3D 2021 dataset (MAE; lower is better).
                                    Form E.          Total E.         Bandgap (OPT)   Bandgap (MBJ)    E hull
Train/Val/Test                      44578/5572/5572  44578/5572/5572  44578/5572/5572 14537/1817/1817  44296/5537/5537
MAE unit                            eV/atom          eV/atom          eV              eV               eV
CGCNN [Xie & Grossman, 2018]        0.063            0.078            0.2             0.41             0.17
SchNet [Schütt+, 2018]              0.045            0.047            0.19            0.43             0.14
MEGNet [Chen+, 2019]                0.047            0.058            0.145           0.34             0.084
GATGNN [Louis+, 2020]               0.047            0.056            0.17            0.51             0.12
M3GNet [Chen & Ong, 2022]           0.039            0.041            0.145           0.362            0.095
ALIGNN [Choudhary & DeCost, 2021]   0.0331           0.037            0.142           0.31             0.076
Matformer [Yan+, 2022]              0.0325           0.035            0.137           0.3              0.064
PotNet [Lin+, 2023]                 0.0294           0.032            0.127           0.27             0.055
Crystalformer                       0.0319           0.0342           0.131           0.275            0.0482
Crystalformer consistently outperforms most of the existing methods in various property prediction tasks, while remaining competitive with the GNN-based SOTA, PotNet [Lin+, 2023].

Slide 22

Slide 22 text

22 Model efficiency comparison. • Our model achieves higher efficiency than SOTA methods, such as PotNet (GNN-based) and Matformer (transformer-based). • Our architecture remains simple and closely follows the original transformer encoder, unlike Matformer, which involves many architectural modifications.
                          Arch. type    Train/Epoch  Total train  Test/Mater.  # Params  # Params/Block
PotNet [Lin+, 2023]       GNN           43 s         5.9 h        313 ms       1.8 M     527 K
Matformer [Yan+, 2022]    Transformer   60 s         8.3 h        20.4 ms      2.9 M     544 K
Crystalformer             Transformer   32 s         7.2 h        6.6 ms       853 K     206 K
(Architecture diagram: stacked self-attention blocks, each with multi-head attention, concatenation, a linear layer, and a feed-forward layer, followed by pooling and a feed-forward head.) Train and test times are evaluated on the JARVIS-DFT 3D (formation energy) dataset.

Slide 23

Slide 23 text

23 Fourier-space attention for long-range interactions. Under the spatial Fourier transform, long-tail Gaussians with large σ in real space become short-tail Gaussians in Fourier space (or reciprocal space), enabling long-range interatomic interactions via self-attention. The real-space sum decays slowly with increasing |n| when σ is large, so it is instead evaluated in reciprocal space, yielding a huge improvement over the SOTA (0.055).
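For reference, the standard Gaussian Fourier-transform pair and Poisson summation identity behind this argument, written in generic notation (the symbols here are not necessarily the paper's; L is the matrix whose columns are the lattice vectors):

```latex
% 3D Gaussian Fourier-transform pair:
\int_{\mathbb{R}^3} e^{-\|\mathbf{r}\|^2/2\sigma^2}\, e^{-i\mathbf{k}\cdot\mathbf{r}}\, d\mathbf{r}
  = (2\pi\sigma^2)^{3/2}\, e^{-\sigma^2\|\mathbf{k}\|^2/2},
% so a long-tailed Gaussian (large sigma) in real space is short-tailed in reciprocal space.
% By Poisson summation, the slowly converging real-space lattice sum becomes a
% rapidly converging reciprocal-space sum when sigma is large:
\sum_{\mathbf{n}\in\mathbb{Z}^3} e^{-\|\mathbf{r}+L\mathbf{n}\|^2/2\sigma^2}
  = \frac{(2\pi\sigma^2)^{3/2}}{\det L}
    \sum_{\mathbf{m}\in\mathbb{Z}^3} e^{-\sigma^2\|\mathbf{k}_\mathbf{m}\|^2/2}\,
    e^{\,i\mathbf{k}_\mathbf{m}\cdot\mathbf{r}},
\qquad \mathbf{k}_\mathbf{m} = 2\pi L^{-T}\mathbf{m}.
```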

Slide 24

Slide 24 text

24 Rethinking the role of frames for SE(3)-invariant crystal structure modeling Tatsunori Taniai* OMRON SINIC X Corporation Ryo Igarashi OMRON SINIC X Corporation Yusei Ito* OMRON SINIC X Intern Osaka University (D1) Yoshitaka Ushiku OMRON SINIC X Corporation Kanta Ono Osaka University The Thirteenth International Conference on Learning Representations April 24th through 28th, 2025 at Singapore EXPO Singapore 2025

Slide 25

Slide 25 text

25 Atomic state evolution by self-attention: infinitely connected self-attention for crystals (Crystalformer [Taniai+, ICLR 24]). Two types of position encodings: φ_ij(n) gives distance-decay attention, and ψ_ij(n) is a linear projection of radial basis functions (RBF) that encode a distance into a soft one-hot vector. Distance-based models ensure invariance under SE(3) transformations (rotation and translation) but have limited expressive power.
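A small sketch of this kind of RBF distance encoding (the number of bases, range, and Gaussian width are illustrative choices):

```python
import numpy as np

def rbf_expand(dist, n_basis=64, r_max=8.0, gamma=None):
    """Encode a distance (or array of distances) into a soft one-hot vector
    using Gaussian radial basis functions centered on a uniform grid."""
    centers = np.linspace(0.0, r_max, n_basis)
    if gamma is None:
        gamma = 1.0 / (centers[1] - centers[0]) ** 2
    d = np.asarray(dist)[..., None]                  # (..., 1)
    return np.exp(-gamma * (d - centers) ** 2)       # (..., n_basis)

# Example: rbf_expand(2.3).argmax() picks the basis whose center is nearest to 2.3 Å.
```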

Slide 26

Slide 26 text

26 Enhancing model expressivity under rotation invariance. Invariant features: use distances between pairs (limited expressivity) or angles between triplets (requires modeling many combinations of 3-body interactions). Frames (our focus): standardize the orientations by finding structure-aligned coordinate systems e_1, e_2, e_3 (e.g., using PCA) to obtain a canonical representation; no restriction on the architectural design. Equivariant features: use spherical tensors in SO(3)-equivariant nets; restricted nonlinearity, heavy computation with limited angular resolution, and mathematically difficult. (Image from torch-harmonics [Bonev+, 2023].)

Slide 27

Slide 27 text

27 Challenges and questions in frame-based crystal modeling. Crystals are infinite: how can we define a standard orientation for such structures? Unit cells are artificial: apparently different slices can represent the same crystal; should we rely on such arbitrary representations for a canonical representation? What are frames for? There are many possible ways to construct frames; what makes a good frame? Is orientation normalization alone sufficient?

Slide 28

Slide 28 text

28 Rethinking the role of frames. Frames are ultimately used in GNNs' message passing layers to derive richer yet invariant information than distance for the message function m_{j→i}. In a message passing layer x_i ← Σ_j w_ij m_{j→i}: • r_ij = p_j - p_i: relative position. • m_{j→i}: message from atom j to atom i. • w_ij: scalar weight. Using the raw vector r_ij in the message is not rotation invariant, but applying a frame transformation F_i = [e_1, e_2, e_3]^T (e.g., eigenvectors of PCA) and using F_i r_ij makes it rotation invariant.
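A tiny numeric check of this point, using a PCA-style frame as an example: the frame rotates together with the structure, so the frame-projected relative positions F_i r_ij stay (up to axis signs) unchanged under rotation. A sketch with arbitrary random data:

```python
import numpy as np

def pca_frame(rel_pos, weights):
    """Eigenvectors of the weighted covariance of relative positions; rows = frame axes."""
    cov = (weights[:, None, None] * rel_pos[:, :, None] * rel_pos[:, None, :]).sum(0)
    _, vecs = np.linalg.eigh(cov)          # ascending eigenvalues
    return vecs[:, ::-1].T                 # (3, 3): e1, e2, e3 as rows

rng = np.random.default_rng(0)
rel = rng.normal(size=(5, 3))              # r_ij = p_j - p_i for 5 neighbors
w = rng.random(5)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R) < 0:                   # make R a proper rotation
    R[:, 0] *= -1

F = pca_frame(rel, w)
F_rot = pca_frame(rel @ R.T, w)            # frame recomputed after rotating the structure
# Frame-projected coordinates are (up to axis sign) unchanged by the rotation:
print(np.allclose(np.abs(F @ rel.T), np.abs(F_rot @ (rel @ R.T).T), atol=1e-8))
```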

Slide 29

Slide 29 text

29 Dynamic frames. Let's dynamically construct a frame F_i for each target atom i and each layer, such that it normalizes the orientation of the local structure represented by the attention weights. • Self-attention weights w_ij show which atoms actively interact with the target atom i. • Use w_ij as a mask on the structure; the dynamic frame (e_1, e_2, e_3) is defined on this masked structure viewed from atom i.

Slide 30

Slide 30 text

30 Dynamic frames: some analogies. Rotation-invariant local features (image from https://www.vlfeat.org/overview/sift.html): SIFT-like descriptors normalize the orientation of a local patch before describing it. Normalization layers: Batch Norm or Layer Norm normalize features before a linear layer. Likewise, dynamic frames are expected to better normalize the structural information before passing it to the message function in a message passing layer.

Slide 31

Slide 31 text

31 Dynamic frames: definitions. Weighted PCA frames: • Compute a weighted covariance matrix for each target atom i: Σ_i = Σ_j w_ij r_ij r_ij^T. • Compute orthonormal eigenvectors e_1, e_2, e_3 of Σ_i, corresponding to λ_1 ≥ λ_2 ≥ λ_3, as the frame axes. Max frames (weight-prioritized point selection with orthogonalization): • Select the relative position r_ij1 with the maximum weight and set e_1 ← r_ij1 / ||r_ij1||. • Compute e_2 similarly while ensuring orthogonality (i.e., e_1 · e_2 = 0). • Set e_3 ← e_1 × e_2. To ensure SE(3) invariance, we constrain F_i to be a rotation matrix.
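A NumPy sketch of both constructions (simplified; tie-breaking, degenerate eigenvalues, and fully collinear neighborhoods are not handled as in the actual method):

```python
import numpy as np

def weighted_pca_frame(rel_pos, w):
    """rel_pos: (M, 3) relative positions r_ij, w: (M,) attention weights.
    Frame axes = eigenvectors of the weighted covariance, sorted by eigenvalue."""
    cov = np.einsum('m,mi,mj->ij', w, rel_pos, rel_pos)
    _, vecs = np.linalg.eigh(cov)                # ascending eigenvalues
    F = vecs[:, ::-1].T                          # rows e1, e2, e3 (lambda1 >= lambda2 >= lambda3)
    if np.linalg.det(F) < 0:                     # keep F a proper rotation matrix
        F[2] *= -1
    return F

def max_frame(rel_pos, w):
    """Weight-prioritized point selection with Gram-Schmidt orthogonalization.
    Assumes at least two non-collinear neighbor directions."""
    order = np.argsort(-w)
    e1 = rel_pos[order[0]] / np.linalg.norm(rel_pos[order[0]])
    for idx in order[1:]:
        v = rel_pos[idx] - (rel_pos[idx] @ e1) * e1   # remove the e1 component
        if np.linalg.norm(v) > 1e-8:                  # skip points collinear with e1
            e2 = v / np.linalg.norm(v)
            break
    e3 = np.cross(e1, e2)                             # right-handed, so F is a rotation
    return np.stack([e1, e2, e3])
```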

Slide 32

Slide 32 text

32 CrystalFramer: Crystalformer + dynamic frames. Extend the distance-based edge feature term ψ_ij(n) of Crystalformer by incorporating 3D direction vectors via dynamic frames: the invariant edge features now encode both the distance and the frame-projected 3D direction vector of r_ij(n).

Slide 33

Slide 33 text

33 Evaluations on the JARVIS-DFT 3D 2021 dataset. Comparisons between dynamic frames and their static counterparts (weighted PCA vs PCA; max vs static local) show that dynamic frames outperform conventional static frames.
                                          E form.  E total  BG (OPT)  BG (MBJ)  E hull
Matformer (Yan et al., 2022)              0.0325   0.035    0.137     0.30      0.064
PotNet (Lin et al., 2023)                 0.0294   0.032    0.127     0.27      0.055
eComFormer (Yan et al., 2024)             0.0284   0.032    0.124     0.28      0.044
iComFormer (Yan et al., 2024)             0.0272   0.0288   0.122     0.26      0.047
Crystalformer (Taniai et al., 2024)       0.0306   0.0320   0.128     0.274     0.0463
─ w/ PCA frames (Duval et al., 2023)      0.0325   0.0334   0.144     0.292     0.0568
─ w/ lattice frames (Yan et al., 2024)    0.0302   0.0323   0.125     0.274     0.0531
─ w/ static local frames                  0.0285   0.0292   0.122     0.261     0.0444
─ w/ weighted PCA frames (proposed)       0.0287   0.0305   0.126     0.279     0.0444
─ w/ max frames (proposed)                0.0263   0.0279   0.117     0.242     0.0471

Slide 34

Slide 34 text

34 Other work in our group

Slide 35

Slide 35 text

35 Neural structure fields (NeSF) for material decoding • Unlike point clouds of 3D surfaces in CV, decoding atomic systems is challenging due to their unknown and variable number of atoms. • We propose to represent point-based structures as continuous vector fields: given the latent code of a structure and an arbitrary 3D query point, the field outputs information about the nearest atom, namely a 3D displacement vector and a categorical distribution over atomic species (e.g., H, He, ...). Naoya Chiba*, Yuta Suzuki*, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Yoshitaka Ushiku, Kanta Ono. Neural structure fields with application to crystal structure autoencoders. Communications Materials (2023).
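Conceptually, decoding then amounts to querying the field at many points, shifting each query by its predicted displacement, and merging the results into atoms; a hypothetical sketch (the `field` network, its API, and the merging heuristic are placeholders, not the paper's exact algorithm):

```python
import numpy as np

def decode_structure(field, z, cell, grid=8, tol=0.3):
    """field(z, q) -> (displacement (3,), species_probs (S,)) is a placeholder API.
    Query points on a grid, shift them by the predicted displacement,
    and merge the shifted points into atom positions."""
    fracs = (np.stack(np.meshgrid(*[np.arange(grid)] * 3, indexing='ij'), -1)
             .reshape(-1, 3) + 0.5) / grid
    queries = fracs @ cell                         # Cartesian query points in the cell
    atoms = []
    for q in queries:
        disp, probs = field(z, q)
        p = q + disp                               # predicted nearest-atom position
        # Merge with an existing atom if closer than tol; otherwise add a new one.
        if not any(np.linalg.norm(p - a[0]) < tol for a in atoms):
            atoms.append((p, int(np.argmax(probs))))
    return atoms                                   # list of (position, species index)
```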

Slide 36

Slide 36 text

36 Field-based crystal autoencoders. Achieved better reconstructions compared to conventional voxel-based decoding: voxel-based decoding often fails to reconstruct some atoms (producing too many or too few atoms), whereas field-based decoding stays faithful to the input. Naoya Chiba*, Yuta Suzuki*, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Yoshitaka Ushiku, Kanta Ono. Neural structure fields with application to crystal structure autoencoders. Communications Materials (2023).

Slide 37

Slide 37 text

37 CLaSP: CLIP-like multimodal learning for materials science • CV has fostered large-scale datasets of images with textual annotations (e.g., ImageNet, MS-COCO), enabling multimodal learning between text and images (CLIP, 2021). • Materials science lacks such resources, mainly due to the difficulty of crowdsourcing. • Instead, we leverage a public database of 400k materials with publication metadata (titles and abstracts) to enable contrastive learning between text and structure. Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also appeared at NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat.
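The contrastive objective itself can be the standard CLIP-style symmetric InfoNCE loss; a generic NumPy sketch for reference (the encoders producing the embeddings are placeholders, not the paper's exact models):

```python
import numpy as np

def clip_style_loss(struct_emb, text_emb, temperature=0.07):
    """struct_emb, text_emb: (B, D) paired embeddings from a crystal encoder
    and a text encoder (placeholders). Returns the symmetric InfoNCE loss."""
    s = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature                 # (B, B) similarity matrix
    labels = np.arange(len(s))                     # matching pairs lie on the diagonal

    def xent(lg):                                  # cross-entropy of each row vs its diagonal label
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))   # structure-to-text + text-to-structure
```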

Slide 38

Slide 38 text

38 Application: text-based retrieval of crystal structures Key finding: literature-driven learning enables models to predict high-level functionalities of crystal structures. Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also appeared at NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat.

Slide 39

Slide 39 text

39 Application: visualization of materials space In the resulting embedding space, structures with similar properties automatically form clusters (t-SNE visualization). Yuta Suzuki, Tatsunori Taniai, Ryo Igarashi, Kotaro Saito, Naoya Chiba, Yoshitaka Ushiku, Kanta Ono. Bridging Text and Crystal Structures: Literature-driven Contrastive Learning for Materials Science. Machine Learning: Science and Technology (2025). Also appeared at NeurIPS 2024 AI4Mat and CVPR 2025 MM4Mat.

Slide 40

Slide 40 text

40 Summary of this talk Crystalformer [Taniai+, ICLR 2024] – A natural extension of standard transformers for periodic crystal structures – Distance-decay attention abstractly mimics energy calculations in physics – Fourier-space attention captures long-range interatomic interactions CrystalFramer [Ito & Taniai+, ICLR 2025] – Introduces dynamic frames derived from attention mechanisms to enhance expressive power Applications – Crystal encoders are immediately applicable to high-throughput property prediction – Serve as core components in embedding learning and multimodal foundation models Future directions – Extend to equivariant networks for structure and force-field prediction – Such networks are essential for generative modeling (e.g., diffusion models)

Slide 41

Slide 41 text

41