Machine Learning for Materials (Lecture 4)

Aron Walsh
January 30, 2024



  1. Machine Learning for Materials: 4. Materials Data and Representations

    Aron Walsh, Department of Materials, Centre for Processable Electronics
  2. Course Contents

    1. Course Introduction
    2. Materials Modelling
    3. Machine Learning Basics
    4. Materials Data and Representations
    5. Classical Learning
    6. Artificial Neural Networks
    7. Building a Model from Scratch
    8. Recent Advances in AI
    9. and 10. Research Challenge
  3. Data-Driven Materials Research

    Pettifor maps: a series of work on structural classification of compounds and alloys. D. G. Pettifor, Materials Science and Technology 4, 675 (1988)
  4. Data-Driven Materials Research

    Hand-built features: the Mendeleev number is used for efficient grouping of structure types (to capture periodic trends). D. G. Pettifor, Materials Science and Technology 4, 675 (1988)
  5. Data-Driven Materials Research

    Structure-property correlations: connect crystal structure with measurable properties (mechanical, electronic, etc.). Early analysis was manual and often focused on linear relations with physics-informed features. J. C. Phillips, Rev. Mod. Phys. 42, 317 (1970)
  6. Data Representation

    Choice of units or coordinate system can greatly impact model performance. Changes in data scale and distribution will impact the ability of a given model to make predictions. Image from https://www.deeplearningbook.org
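One common way to reduce sensitivity to the choice of units is to standardise each feature before training. The sketch below is illustrative only: the `standardise` helper and the feature values are assumptions, not part of the lecture.

```python
import numpy as np

def standardise(X):
    """Rescale each feature column to zero mean and unit variance (z-score)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns unchanged instead of dividing by zero
    return (X - mean) / std

# Hypothetical feature: the same length expressed in angstrom and in picometres
X_raw = np.array([[4.2, 420.0],
                  [5.6, 560.0],
                  [3.9, 390.0]])
X_scaled = standardise(X_raw)
```

After scaling, the two columns are identical, so a model sees the same input whichever unit was used to record the data.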
  7. Where to Find Materials Data?

    • Manual collection – go through papers, extract data and tabulate (takes time)
    • Accelerated collection – use of natural language processing (requires model and workflow)
    • Pre-built databases – excellent when they exist in your area (may require access fees)
    • Automated experiments – generate your own data over a given parameter space (expensive)
  8. Why Share Materials Data?

    • Reproducibility – allow direct comparison with published literature beyond static tables and figures, e.g. raw spectra and diffraction patterns
    • Reuse – facilitate meta-studies comparing results from multiple experiments, e.g. variation in UV-vis spectra for different samples
    • Statistical models – power of machine learning depends on the quantity, quality, and diversity of training data
  9. Common Forms of Data Sharing

    • Supporting information with publications – often in the form of static pdf files (increasingly obsolete)
    • Data repositories – most institutions offer data upload portals, but often lack guidelines and metadata, e.g. zip or tar files
    • Community-specific repositories – best option if available, usually in a common format and searchable, with error detection
  10. FAIR Data Standards

    https://www.howtofair.dk/what-is-fair
    • Findable: discoverable by humans & machines with metadata & persistent identifiers (e.g. DOI)
    • Accessible: archived in long-term storage with clear access terms (e.g. CC open license)
    • Interoperable: exchangeable between different applications and systems using open file formats
    • Reusable: well documented and curated with clear terms and conditions on usage
  11. Database Access

    Mode | Advantage | Disadvantage
    Web browser | No knowledge of database software is required | Often one material at a time – slow for large datasets
    Data file | All data is downloaded as one (e.g. zip or tar) file | Specialist software often needed; data is not up-to-date
    API* (e.g. Python) | Access latest data with advanced queries | Some programming knowledge required

    *API = Application Programming Interface
    Tip: Keep a record of the database version you are using; data can change
  12. Data Security

    • Privacy: protection of personal data, e.g. General Data Protection Regulation (GDPR)
    • Encryption: protocols for data storage and transfer, e.g. public key encryption, hashing
    • Access control: limiting users or specific computers, e.g. passwords, firewalls
    • Data integrity: avoiding corruption or modification, e.g. data provenance tracking, regular versioning
    • Data recovery: robust backup system, e.g. RAID array for distributed local storage, cloud storage
    Not all databases are public, e.g. companies and academic-industrial collaborations
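As a minimal sketch of how hashing supports data integrity, a checksum can be stored alongside a record of where and when the data was obtained. The record fields, the database name, and the version string below are hypothetical placeholders; only the standard-library `hashlib` call is real.

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 checksum of a block of bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Toy file contents; in practice these would be read from a downloaded data file
original = b"a,b,c\n4.0,5.0,6.0\n"
modified = b"a,b,c\n4.0,5.0,6.1\n"

# Hypothetical provenance record kept next to the data
record = {
    "source": "example-database",   # placeholder name
    "version": "hypothetical-v1",   # placeholder version label
    "sha256": sha256_digest(original),
}

# A later integrity check compares the stored and recomputed checksums
is_unchanged = sha256_digest(original) == record["sha256"]
is_modified = sha256_digest(modified) != record["sha256"]
```

Any change to the file, however small, produces a different digest, so silent corruption or modification is detected.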
  13. Data Provenance

    Projects can combine data from many sources. Provenance graphs are one way to link them, showing connections between structures, calculations, and data. Example: graph for a project on 324 covalent organic frameworks, https://www.aiida.net/sections/graph_gallery.html
  14. Class Outline

    Materials Data and Representations
    A. Data sources and formats
    B. Crystal representations
    • Compositional
    • Structural
    • Graphs
  15. Representation of Materials

    Minimal representation: ab initio quantum mechanics (QM). Input: atomic number, Z; coordinates, R. Output: properties. $\hat{H}|\Psi\rangle = E|\Psi\rangle$, where $|\Psi\rangle$ is the electronic wavefunction.
    Effective representation: machine learning (ML). Input: feature vector, X. Output: properties. $y = f(\mathbf{X}, \boldsymbol{\Theta})$, where $\boldsymbol{\Theta}$ are the learned weights.
    Model performance depends on the choice of compositional and structural features.
  16. How to Best Represent a Molecule?

    Networks of atoms (nodes) connected by bonds (edges). J. J. Sylvester, Am. J. Math. 1, 64 (1878)
  17. How to Best Represent a Material?

    Many possible materials features from micro to macroscopic length scales:
    • Wavefunctions or electron density
    • Local atomic connectivity
    • Grain size and orientation
    • Shape
    Image: Courtesy of Taylor Sparks (University of Utah)
  18. One-Hot Encoding

    We can use an n-dimensional vector to categorise the atomic number of the elements in a compound. '1' indicates the presence of that specific element and '0' for others.
    Positions: H He Li Be B C N O F …
    Element: [100000000...]
    Compound: [000001010...]
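The encoding above can be sketched in a few lines. The element list is truncated at F to match the slide, and the `encode` helper is a hypothetical name used for illustration.

```python
import numpy as np

# First nine elements only, matching the positions shown on the slide
ELEMENTS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F"]

def encode(symbols, elements=ELEMENTS):
    """Return a vector with 1 for each element present and 0 elsewhere."""
    vec = np.zeros(len(elements), dtype=int)
    for symbol in set(symbols):
        vec[elements.index(symbol)] = 1
    return vec

h_vector = encode(["H"])         # single element
co_vector = encode(["C", "O"])   # compound containing C and O
```

Note that this encoding records only which elements are present, not their stoichiometry; fractional amounts could be stored instead of 1 if composition matters.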
  19. Hand-Built (Local) Vectors

    We can define elemental feature vectors based on standard properties of the elements. Example: the 22 dimensional Magpie representation from L. Ward et al, npj Comp. Mater. 2, 16028 (2016). https://github.com/WMD-group/ElementEmbeddings
  20. Hand-Built (Local) Vectors

    We can define elemental feature vectors based on standard properties of the elements. The compound vector is a pooled combination of the element vectors, e.g. the weighted mean X(Fe2O3) = [2X(Fe) + 3X(O)]/5

    Vector | X1 | X2 | X3 | … | Xn
    Fe | 0.5 | 0.1 | 0.0 | … | 0.8
    O | 0.3 | 0.3 | 0.1 | … | 0.6
    Fe2O3 | 0.4 | 0.2 | 0.1 | … | 0.7

    Different types of pooling are possible (e.g. max, min, mean). https://github.com/WMD-group/ElementEmbeddings
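The pooling step can be checked numerically. This is a minimal sketch using the four feature values shown on the slide; the `pool` function is a hypothetical helper, not the ElementEmbeddings API.

```python
import numpy as np

# Element vectors taken from the slide (columns X1, X2, X3, Xn)
element_vectors = {
    "Fe": np.array([0.5, 0.1, 0.0, 0.8]),
    "O": np.array([0.3, 0.3, 0.1, 0.6]),
}

def pool(composition, vectors, mode="mean"):
    """Combine element vectors into a single compound vector."""
    if mode == "mean":
        # Stoichiometry-weighted mean, e.g. [2X(Fe) + 3X(O)]/5 for Fe2O3
        total = sum(composition.values())
        return sum(n * vectors[el] for el, n in composition.items()) / total
    stacked = np.stack([vectors[el] for el in composition])
    if mode == "max":
        return stacked.max(axis=0)
    if mode == "min":
        return stacked.min(axis=0)
    raise ValueError(f"Unknown pooling mode: {mode}")

fe2o3_mean = pool({"Fe": 2, "O": 3}, element_vectors)          # [0.38, 0.22, 0.06, 0.68]
fe2o3_max = pool({"Fe": 2, "O": 3}, element_vectors, "max")    # element-wise maximum
```

Rounded to one decimal place, the mean-pooled vector reproduces the Fe2O3 row of the table.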
  21. Learned (Distributed) Vectors

    We can learn elemental feature vectors to encode information as part of model training.
    • SkipAtom (200 D): structure graph pooling
    • Mat2Vec (200 D): literature word embedding
    https://github.com/WMD-group/ElementEmbeddings
  22. Element Embeddings

    Toolkit by Anthony Onwuli to access and modify elemental and compositional representations. https://github.com/WMD-group/ElementEmbeddings
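  23. Learned Chemical Similarity

    Quantify with distance (e.g. Chebyshev), similarity (e.g. Cosine), or correlation (e.g. Pearson) metrics.
    Cosine similarity between vectors A and B: $\cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}$
    Figure: pairwise element similarity map with axes running from H to Bi.
    Anthony Onwuli et al, Digital Discovery 2, 1558 (2023)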
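The three metric families named above can be written with NumPy as follows. This is a sketch with made-up vectors; the function names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = A.B / (|A||B|); 1 for parallel vectors, 0 for orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def chebyshev_distance(a, b):
    """Largest absolute difference along any single dimension."""
    return float(np.max(np.abs(a - b)))

def pearson_correlation(a, b):
    """Linear correlation coefficient between the components of two vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# Toy vectors, not real element embeddings
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 5.0, 2.0])

cos_ab = cosine_similarity(a, b)
cheb_ab = chebyshev_distance(a, b)
pearson_ab = pearson_correlation(a, b)
```

Cosine similarity ignores vector length, which is useful when embeddings from different sources have different overall scales.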
  24. Learned Chemical Similarity

    Dimensionality reduction, here Principal Component Analysis (PCA), confirms a natural clustering of elements into “groups”. Anthony Onwuli et al, Digital Discovery 2, 1558 (2023)
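A minimal PCA can be written with a singular value decomposition. The example below uses randomly generated toy vectors in two artificial clusters, not real element embeddings, and the `pca` helper is an assumption made for illustration.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the leading principal components via SVD."""
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # fraction of variance per component
    return X_centred @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Toy 5-dimensional "embeddings" drawn around two different centres
group_a = rng.normal(loc=0.0, scale=0.1, size=(4, 5))
group_b = rng.normal(loc=1.0, scale=0.1, size=(4, 5))

coords, explained = pca(np.vstack([group_a, group_b]))
```

In this toy case the first component separates the two clusters: the four points of each group share the same sign along it.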
  25. Class Outline

    Materials Data and Representations
    A. Data sources and formats
    B. Crystal representations
    • Compositional
    • Structural
    • Graphs
  26. Crystallography

    High symmetry crystal: MgO, cubic, 8 atom unit cell, a = b = c
    Low symmetry crystal: BiVO4, monoclinic, 24 atom unit cell, a ≠ b ≠ c
  27. Crystallography

    7 crystal systems, 14 Bravais lattices, 230 space groups, 103 prototype structures
    Conventional crystallographic description:
    • Unit cell size & shape: a, b, c, α, β, γ
    • Atomic positions: x, y, z
    Problem for ML: standard representations are not invariant*
    *with respect to atomic permutation, unit cell rotations, and translations
  28. Unit Cell Transformations

    The same structure is described in each case. Each row lists the cell lengths followed by the fractional coordinates of the two atoms.

    Description | a b c | x1 y1 z1 | x2 y2 z2
    Two-atom orthorhombic unit cell | 4 5 6 | 0 0 0 | 0.5 0.5 0.5
    Atomic permutation | 4 5 6 | 0.5 0.5 0.5 | 0 0 0
    Crystal rotation | 5 4 6 | 0.5 0.5 0.5 | 0 0 0
    Unit cell translation | 4 5 6 | 0.0 0.5 0.5 | 0.5 0 0

    ML models based on variant representations may struggle to generalise
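The four descriptions above give different raw input vectors, yet a simple derived quantity is unchanged. The sketch below compares the flattened (cell, coordinates) vector with the minimum-image interatomic distance; it assumes an orthorhombic cell, and the helper names are illustrative.

```python
import numpy as np

def raw_vector(cell, frac):
    """Naive representation: cell lengths followed by flattened coordinates."""
    return np.concatenate([np.asarray(cell, float), np.ravel(frac)])

def min_image_distance(cell, frac_i, frac_j):
    """Shortest distance between two atoms under periodic boundaries.

    Valid for orthorhombic cells only (cell = [a, b, c]).
    """
    d = np.asarray(frac_j, float) - np.asarray(frac_i, float)
    d -= np.round(d)  # wrap each fractional difference into [-0.5, 0.5]
    return float(np.linalg.norm(d * np.asarray(cell, float)))

# The four equivalent descriptions from the slide
descriptions = {
    "original":    ([4, 5, 6], [[0, 0, 0], [0.5, 0.5, 0.5]]),
    "permutation": ([4, 5, 6], [[0.5, 0.5, 0.5], [0, 0, 0]]),
    "rotation":    ([5, 4, 6], [[0.5, 0.5, 0.5], [0, 0, 0]]),
    "translation": ([4, 5, 6], [[0.0, 0.5, 0.5], [0.5, 0, 0]]),
}

raw = {k: raw_vector(c, f) for k, (c, f) in descriptions.items()}
dist = {k: min_image_distance(c, f[0], f[1]) for k, (c, f) in descriptions.items()}
```

All entries of `dist` are equal while the entries of `raw` differ, which is the motivation for building descriptors from invariant quantities such as distances.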
  29. Structural Representations

    Many structural descriptors have been developed. Several are implemented in https://singroup.github.io/dscribe
    • Smooth Overlap of Atomic Positions (Bartók et al, 2013) - radial expansion in spherical harmonics
    • Coulomb Matrix (Rupp et al, 2012) - mimics electrostatic interactions ($q_i q_j / r_{ij}$)
    • Many Body Tensor Representation (Huo et al, 2017) - distribution of structural motifs
    • Atomic Cluster Expansion (Drautz, 2019) - body-order expansion (two-body distances, angles…)
  30. Real Space Grid

    Voxels (three-dimensional pixels) used in computer graphics can describe a unit cell. Used in early materials ML, but not recommended for structure. Image courtesy of Taylor Sparks (University of Utah)
  31. Pairwise Interatomic Distances

    Coulomb matrix is a global descriptor that mimics the electrostatic interaction between nuclei. Sine matrix is a modification that accounts for periodicity. Implemented in https://singroup.github.io/dscribe
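A plain NumPy sketch of the Coulomb matrix follows, using the definition from Rupp et al (2012): diagonal elements 0.5 Z^2.4 and off-diagonal elements Z_i Z_j / r_ij. It does not use the dscribe API, and the water-like coordinates are illustrative values rather than a reference geometry.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix for a finite (non-periodic) set of atoms.

    Z: atomic numbers, shape (n,); R: Cartesian positions, shape (n, 3).
    """
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4  # self-interaction term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def sorted_eigenvalues(M):
    """Eigenvalue spectrum in descending order; unchanged by atom reordering."""
    return np.sort(np.linalg.eigvalsh(M))[::-1]

# Illustrative water-like geometry (O, H, H)
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
M = coulomb_matrix(Z, R)

# Reordering the atoms changes the matrix but not its sorted eigenvalues
order = [2, 0, 1]
M_permuted = coulomb_matrix([Z[k] for k in order], [R[k] for k in order])
```

The matrix itself depends on atom ordering, so the sorted eigenvalues (or a sorted-row variant) are a common way to obtain a permutation-invariant input.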
  32. Invariant Structural Representations

    Atomic Cluster Expansion (ACE) provides a systematic representation of atomic environments through radial (R) and angular (Y) terms.
    Site basis function: $\phi(r) = R_{n} Y_{l}^{m}$
    Permutation invariance: $\mathbf{A}_i = \sum_{\mathrm{neighbours}} \phi(r)$
    Rotation (Q) invariance: $\mathbf{B}_i = \int \mathbf{A}_i \, dQ$
    Product basis B forms a body-order expansion: Property = $f(\mathbf{B}_i, \boldsymbol{\Theta})$, where $\boldsymbol{\Theta}$ are the weights.
    ACE has been used in linear and deep learning models for materials.
    R. Drautz, Phys. Rev. B. 99, 014104 (2019); arXiv:2311.16326 (2023)
  33. ML Powered Molecular Dynamics

    Classical models are being complemented by machine learning force fields (MLFF). Two state-of-the-art implementations are MACE and Allegro, based on equivariant neural network regression. Figure label: octahedral tilt correlation. J. D. Morrow, J. L. A. Gardner and V. Deringer, J. Chem. Phys. 158, 121501 (2023)
  34. ML Powered Molecular Dynamics

    Enable large-scale simulations of complex materials such as organic-inorganic solids. Example: simulation of the metal halide perovskite CsPbI3, a 69,120 atom molecular dynamics simulation within the atomic cluster expansion (ACE) formalism. Figure label: octahedral tilt correlation. W. J. Baldwin et al, Small 2303565 (2023)
    Ask Xia (GTA)
  35. Class Outline

    Materials Data and Representations
    A. Data sources and formats
    B. Crystal representations
    • Compositional
    • Structural
    • Graphs
  36. Graphs

    Graphs are a representation common to many domains and problems. Image courtesy of Michael Bronstein (University of Oxford)
  37. Graph Components

    Nodes (or Vertices), Edges, Global Attributes
    Chemical systems:
    • N: atoms
    • E: bonds
    • G: unit cell or material properties
    Vectors can be associated with each component to encode & exchange information
  38. Graph Components

    Nodes (or Vertices), Edges, Global Attributes
    Graphs can be fully connected (every node connected to every other node), but sparse connections are often used. Image from https://distill.pub/2021/gnn-intro
  39. Graph Components

    Nodes (or Vertices), Edges, Global Attributes
    For chemical problems, nearest-neighbour connectivity is common, as used in “ball and stick” representations. Image from https://distill.pub/2021/gnn-intro
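A sparse, nearest-neighbour graph can be built from a distance cutoff. The sketch below stores nodes, edges, and global attributes in a plain dictionary; the layout, the `build_graph` name, the cutoff, and the water-like coordinates are all assumptions made for illustration.

```python
import numpy as np

def build_graph(symbols, positions, cutoff):
    """Connect atom pairs closer than `cutoff`; return nodes, edges, and globals."""
    positions = np.asarray(positions, dtype=float)
    edges = []
    for i in range(len(symbols)):
        for j in range(i + 1, len(symbols)):
            distance = float(np.linalg.norm(positions[i] - positions[j]))
            if distance <= cutoff:
                edges.append((i, j, distance))  # edge attribute: bond length
    return {
        "nodes": list(symbols),                 # node attribute: element symbol
        "edges": edges,
        "global": {"n_atoms": len(symbols)},    # graph-level attribute
    }

# Illustrative water-like geometry: the two O-H pairs fall inside the cutoff,
# while the H-H pair does not
graph = build_graph(
    ["O", "H", "H"],
    [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
    cutoff=1.2,
)
```

With this cutoff the graph has two edges rather than the three of a fully connected graph, mirroring a ball and stick picture.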
  40. Crystal Graphs

    Standard crystallographic representation of materials: fractional positions xyz of atoms within a unit cell formed of lattice vectors abc. Effective for humans.
    Crystal graph representation: nodes (atoms) connected by edges (bonds). Multiple edges can describe periodicity. Effective for ML models.
    T. Xie and J. C. Grossman, Phys. Rev. Lett. 120, 145301 (2018)
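One way to see how multiple edges describe periodicity is to connect each atom to neighbours in adjacent periodic images. This is a simplified sketch, not the method of Xie and Grossman: it searches only the 27 nearest image cells, so it assumes the cutoff is not much larger than the cell, and the function name is hypothetical.

```python
import itertools
import numpy as np

def crystal_graph_edges(lattice, frac_coords, cutoff):
    """Return directed edges (i, j, image, distance) within `cutoff`.

    lattice: 3x3 array whose rows are the lattice vectors.
    image: integer offset of the periodic copy of atom j.
    """
    lattice = np.asarray(lattice, dtype=float)
    frac = np.asarray(frac_coords, dtype=float)
    edges = []
    for i, j in itertools.product(range(len(frac)), repeat=2):
        for image in itertools.product((-1, 0, 1), repeat=3):
            if i == j and image == (0, 0, 0):
                continue  # skip the atom paired with itself
            separation = (frac[j] + np.array(image) - frac[i]) @ lattice
            distance = float(np.linalg.norm(separation))
            if distance <= cutoff:
                edges.append((i, j, image, distance))
    return edges

# Simple cubic cell with one atom and lattice parameter 3 (arbitrary units)
edges = crystal_graph_edges(np.eye(3) * 3.0, [[0.0, 0.0, 0.0]], cutoff=3.1)
```

The single node ends up with six edges to its own periodic images, one per nearest neighbour, which is how one atom in the unit cell can encode an infinite lattice.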
  41. Materials Graphs

    Nodes can be used to represent larger structural units of a crystal or even entire grains. M. Dai et al, npj Comp. Mater. 7, 103 (2021)
  42. Multi-Scale Representations

    Ongoing efforts to combine features that bridge from the micro to macroscale; from atoms to devices. S. B. Torrisi et al, APL Machine Learning 1, 020901 (2023)
  43. Class Outcomes

    1. Describe the importance of materials data for research and development
    2. Demonstrate an understanding of the types of data that are shared in the materials community
    3. Explain the ways that the composition and structure of a material can be featurised
    Activity: Chemical space