Machine Learning for Materials (Lecture 7)

Aron Walsh
February 11, 2024

Transcript

  1. Aron Walsh
    Department of Materials
    Centre for Processable Electronics
    Machine Learning for Materials
    7. Building a Model from Scratch


  2. Course Contents
    1. Course Introduction
    2. Materials Modelling
    3. Machine Learning Basics
    4. Materials Data and Representations
    5. Classical Learning
    6. Artificial Neural Networks
    7. Building a Model from Scratch
    8. Recent Advances in AI
    9. and 10. Research Challenge


  3. Class Outline
    Building a Model from Scratch
    A. Data Preparation
    B. Model Choice
    C. Training and Testing


  4. Data Preparation
    Data sets
    in the wild
    Data sets
    in tutorials
    Gremlins (1984); Adapted from @TowardsAI


  5. Data Preparation
    • Data sources
    • Data cleaning and pre-processing
    • Feature engineering
    • Feature scaling and normalisation
    Materials data must be refined and structured
    to build effective and robust statistical models


  6. Data Sources
    Data sets can be static (majority of cases)
    Data collection → Model training
    Primary choices are: (i) literature collection;
    (ii) databases; (iii) experiments or simulations
    Data sets can be dynamic (e.g. active learning)
    Data collection → Model training → Data collection…


  7. Data Sources
    Data should be representative of your problem
    but does not need to be all-encompassing
    Effective local features are transferable
    Primary choices are: (i) literature collection;
    (ii) databases; (iii) experiments or simulations
    Data size required depends on model complexity
    Rule of thumb: 100-1000 data points per feature
    for classical ML (10 features = 10³-10⁴ training set)
    Reminder: data sources covered in Lecture 4


  8. Data Cleaning and Pre-processing
    Beware of bias. Visualise data distributions as
    summary statistics don’t tell the full story
    “Same Stats, Different Graphs”; http://dx.doi.org/10.1145/3025453.3025912
    Each 2D dataset has the same summary statistics to two decimal places:
    x̄ = 54.26, ȳ = 47.83, σx = 16.76, σy = 26.93, Pearson r = -0.06


  9. Data Cleaning and Pre-processing
    Beware of bias. Visualise data distributions as
    summary statistics don’t tell the full story
    “Same Stats, Different Graphs”; http://dx.doi.org/10.1145/3025453.3025912
    Six data distributions, each with the same 1st quartile,
    median, and 3rd quartile values (and the same box plot)
    [Box plot schematic: the box spans Q1 to Q3 (the central 50% of the
    data), with whiskers extending to the minimum and maximum]


  10. Data Cleaning and Pre-processing
    Materials datasets are often heavily biased
    with skewed property distributions
    Calculated band gaps from density functional theory (PBE functional)


  11. Data Cleaning and Pre-processing
    Check for missing, outlier, and noisy data
    Identify: use data exploration techniques
    e.g. summary statistics, visualisation, profiling
    Impute: fill in missing values
    e.g. using mean imputation or regression
    Cap: set thresholds for extreme values (“winsorising”)
    Remove: drop instances with missing or
    erroneous values from your dataset
    Other data transformations may be suitable (e.g. log, cube root)
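
    A minimal sketch of the identify/impute/cap/remove steps above, assuming
    pandas and scikit-learn (the tiny band_gap column is a hypothetical
    stand-in for real materials data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with one missing value and one extreme outlier
df = pd.DataFrame({"band_gap": [1.1, 0.9, np.nan, 1.3, 25.0]})

# Identify: summary statistics and profiling reveal the gap and the outlier
print(df["band_gap"].describe())

# Impute: fill missing entries with the column mean
df["band_gap"] = SimpleImputer(strategy="mean").fit_transform(df[["band_gap"]]).ravel()

# Cap (winsorise): clip values to the 5th-95th percentile range
low, high = df["band_gap"].quantile([0.05, 0.95])
df["band_gap"] = df["band_gap"].clip(low, high)

# Remove: alternatively, drop any rows that still contain missing values
df = df.dropna(subset=["band_gap"])
```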


  12. Feature Engineering
    I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning”
    Classical ML
    Crafting of features
    tailored to domain
    knowledge and
    problem specifics due
    to limitations of
    simpler models


  13. Feature Engineering
    Deep Learning
    Use simple inputs and
    automatically learn
    features, benefiting from
    complex architectures
    (e.g. CNNs)
    I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning”


  14. Feature Engineering
    K. V. Chuang and M. J. Keiser, Science 362, 6416 (2018)


  15. Feature Engineering
    Choice of many compositional, structural,
    and property features for materials
    Feature selection: choose the most relevant
    features to improve performance (iterative approach)
    Dimensionality reduction: useful for high-dimensional
    data, e.g. principal component analysis (PCA)
    Aggregation: combine data over dimension(s),
    e.g. mean value over space (r), time (t), wavelength (λ)
    Reminder: you have domain (expert) knowledge to engineer meaningful features
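
    A minimal sketch of PCA-based dimensionality reduction with scikit-learn;
    the feature matrix X here is random numbers standing in for real
    compositional or structural descriptors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix: 200 materials x 20 engineered descriptors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

# PCA is sensitive to feature scale, so standardise the features first
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```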


  16. Feature Scaling and Normalisation
    Uniformity in feature scales may enhance
    model stability and convergence
    Standardisation: centre the distribution around 0 with
    unit variance, e.g. x_standard = (x - mean(x)) / std(x)
    Min-max scaling: rescale to a range (usually 0-1),
    e.g. x_scaled = (x - min(x)) / (max(x) - min(x))
    Robust scaling: adjust for outliers using median &
    interquartile range, e.g. x_rscaled = (x - median(x)) / IQR(x)
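
    A minimal sketch of these three scalings using scikit-learn preprocessing
    (the tiny array X is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Small illustrative feature matrix with one obvious outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardisation: zero mean, unit variance
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: rescale to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Robust scaling: centre on the median and divide by the IQR,
# so the outlier has less influence on the bulk of the data
X_robust = RobustScaler().fit_transform(X)
```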


  17. Class Outline
    Building a Model from Scratch
    A. Data Preparation
    B. Model Choice
    C. Training and Testing


  18. Model Choice
    https://xkcd.com/1838/


  19. https://scikit-learn.org/stable/modules/clustering.html


  20. Model Choice
    Balance accuracy, generalisation, and
    transparency for materials predictions
    Goal: ensure the model is suitable for your task
    e.g. property prediction, classification, clustering
    Data size: for small datasets, simpler models with
    fewer parameters are preferable
    Complexity: simpler models are more transparent;
    don’t rush to the latest deep learning if not needed


  21. Complexity Trade-off
    [Schematic: model accuracy (low to high) against model interpretability
    (high to low). Linear regression sits at the linear, low-cost, most
    interpretable end; classical ML models lie in between; neural networks
    sit at the non-linear, high-cost, least interpretable end]
    “Use the simplest model that solves your problem” ANON
    There are many model variants and exceptions to the schematic


  22. Complexity Trade-off
    A. Mignan and M. Broccardo, Nature 574, E1 (2019)
    [Comparison: a published deep learning model with 13,451 parameters
    versus a single neuron with 2 parameters]


  23. Model Architecture
    The structure of a model influences
    its learning capability and complexity
    Deep learning choices
    Layers: input, hidden, output
    Activation functions: sigmoid, ReLU…
    Topology: feedforward, convolutional…
    Optimal architecture should enhance
    feature extraction, model capacity, task suitability
    Best practice is to compare to a baseline model, e.g.
    most frequent class (classification) or mean value (regression)
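
    A minimal sketch of such baseline comparisons using scikit-learn's dummy
    estimators (the random data and labels are placeholders, not from the
    lecture):

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y_class = rng.integers(0, 2, size=100)   # stand-in class labels
y_reg = rng.normal(size=100)             # stand-in regression targets

# Classification baseline: always predict the most frequent class
Xtr, Xte, ytr, yte = train_test_split(X, y_class, random_state=0)
baseline_clf = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
print("Baseline accuracy:", baseline_clf.score(Xte, yte))

# Regression baseline: always predict the mean of the training targets
Xtr, Xte, ytr, yte = train_test_split(X, y_reg, random_state=0)
baseline_reg = DummyRegressor(strategy="mean").fit(Xtr, ytr)
print("Baseline R^2:", baseline_reg.score(Xte, yte))
```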


  24. Class Outline
    Building a Model from Scratch
    A. Data Preparation
    B. Model Choice
    C. Training and Testing


  25. Training and Testing
    https://www.instagram.com/redpenblackpen


  26. Supervised ML Model Workflow
    Initial dataset (x, y) → data cleaning and feature engineering
    (human time intensive) → split into train (80%: xtrain, ytrain) and
    test (20%: xtest, ytest) sets → model training and validation
    (computer time intensive) → model assessment → final model →
    production (xnew → ypredict)
    The exact workflow depends on the type of problem and available data
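
    A minimal sketch of the 80/20 split-train-assess workflow in scikit-learn
    (the random forest and synthetic data are placeholders, not the model
    used in the course):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in dataset: 500 materials, 10 features, one target property
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

# 80/20 train/test split; the test set is held out until the end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train the model on the training set only
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Final assessment on the held-out test set
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```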


  27. Model Training
    Iteratively optimise, validate, and fine-tune
    models for reliable and robust predictions
    Key training choices
    Loss function: quantify the difference between
    model predictions and target values, e.g. MSE
    Optimisation algorithm: update model parameters
    to minimise the loss function, e.g. stochastic gradient
    descent (SGD), adaptive moment estimation (ADAM)
    ADAM: D. P. Kingma and J. Ba, arXiv:1412.6980 (2014)
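
    As a minimal sketch of these choices, assuming PyTorch (the framework is
    not specified in the slides), a training loop pairing an MSE loss with
    the Adam optimiser might look like:

```python
import torch

# Stand-in data: 100 samples, 5 features, one target
X = torch.randn(100, 5)
y = torch.randn(100, 1)

# A single-layer model; MSE loss function and Adam optimiser
model = torch.nn.Linear(5, 1)
loss_fn = torch.nn.MSELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    optimiser.zero_grad()           # reset gradients
    loss = loss_fn(model(X), y)     # quantify prediction error (MSE)
    loss.backward()                 # backpropagate
    optimiser.step()                # update parameters to reduce the loss
```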


  28. Model Evaluation
    Validation set: Subset of training data used to
    fine-tune hyperparameters and prevent overfitting
    Cross-validation (CV): Divide training data into
    multiple subsets for training and validation
    Test set: Separate “holdout” dataset used to
    evaluate final performance and predictive power
    Warning: “test” and “validation” can be used interchangeably by some communities
    Evaluate models through data splitting for
    training (validation) & testing (final assessment)


  29. Cross-Validation (CV)
    https://scikit-learn.org/stable/modules/cross_validation.html
    Assess performance on multiple portions
    of the dataset. Choice in how the data is split
    k-fold CV
    Iteratively train on
    k-1 folds
    Stratified k-fold CV
    Ensure even class
    distribution
    Leave-one-out CV
    For small datasets
    Monte Carlo CV
    Random sampling
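
    A minimal sketch of these four splitting strategies via scikit-learn's
    cross-validation utilities (logistic regression and random labels are
    placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold, StratifiedKFold, LeaveOneOut, ShuffleSplit, cross_val_score
)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)  # stand-in binary labels
model = LogisticRegression()

for name, cv in [
    ("k-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("stratified k-fold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
    ("leave-one-out", LeaveOneOut()),
    ("Monte Carlo", ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```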


  30. Cross-Validation (CV)
    S. Verma, M. Rivera, D. O. Scanlon and A. Walsh, J. Chem. Phys. 156, 134116 (2022)
    For heterogeneous data, random splits are not ideal.
    An alternative is to cluster the data first
    [Figure: molecular datasets visualised in global chemical space using UMAP]


  31. Hyperparameter Tuning
    Optimal choice of settings that impact model
    performance and learning during training
    Well-tuned hyperparameters prevent overfitting,
    improve convergence, and enhance model generalisation
    Tuning strategies
    Grid search: exhaustive (within grid), but expensive
    Random search: efficient, but may miss solutions
    Optimisation: evolutionary, Bayesian… efficient, but
    complex (introduce their own parameters)
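
    A minimal sketch of grid and random search with scikit-learn (the
    random-forest hyperparameters and synthetic data are illustrative
    assumptions):

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.normal(size=200)

# Grid search: exhaustive over a fixed grid of settings
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=5,
).fit(X, y)
print("Grid search best:", grid.best_params_)

# Random search: sample a fixed budget of settings from distributions
rand = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(50, 200), "max_depth": randint(2, 10)},
    n_iter=10, cv=5, random_state=0,
).fit(X, y)
print("Random search best:", rand.best_params_)
```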


  32. Training Data Dependence
    T. Viering and M. Loog, arXiv:2103.10948 (2021)
    Learning curves provide a visual representation
    of model performance with dataset size
    [Schematic learning curves for Model A and Model B]
    Single-number model comparisons overlook data size dependence
    Convergence of training and validation errors indicates a balanced model
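
    A minimal sketch of computing a learning curve with scikit-learn (ridge
    regression on synthetic data as a stand-in):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)

# Evaluate the model at increasing training-set sizes with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    Ridge(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train R^2={tr:.2f}, validation R^2={va:.2f}")
```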


  33. Avoid “p-hacking” (Data Dredging)
    “Big little lies” A. M. Stefan and F. D. Schönbrodt, R. Soc. Open Sci. 10, 220346 (2023)
    Manipulation of data and analysis techniques
    to achieve statistically significant results
    Term comes from hacking the p-value:
    p-value = P(observed or more extreme result | null hypothesis is true)
    Routes to p-hacked results when there is no statistical relationship
    between variables: misuse of ML methods, selective train & test sets,
    data leakage, deliberate outlier exclusion, improper rounding
    [Chart: p-hacked results at the <0.05 and <0.005 significance
    thresholds compared with the baseline]


  34. Checklist for ML Research Reports
    N. Artrith et al, Nature Chemistry 13, 505 (2021)
    Useful for project planning too: https://www.nature.com/articles/s41557-021-00716-z


  35. Beyond (Average) Supervised Models
    J. Schrier et al, J. Am. Chem. Soc. 145, 21699 (2023)
    The most interesting materials, and the emergence of
    unexpected properties, can be outliers


  36. Class Outcomes
    1. Knowledge of ML model development process
    2. Identify and mitigate overfit and
    underfit regimes in model training
    3. Selection of appropriate model performance
    evaluation techniques
    Activity:
    Crystal hardness revisited
