Semi-supervised Learning Approaches For Microstructure Classification

Daniel Wheeler

July 22, 2022

Transcript

  1. 1
    Materials Science & Engineering Department Computational Materials Sci. Lab.
    Semi-supervised Learning Approaches For
    Microstructure Classification
    Courtney Kunselman¹, Vahid Attari¹, Levi McClenny², Ulisses Braga-Neto², Raymundo Arroyave¹,³
    ¹Department of Materials Science and Engineering, Texas A&M University
    ²Department of Electrical Engineering, Texas A&M University
    ³Department of Mechanical Engineering, Texas A&M University
    CHiMaD Workshop
    April 21, 2020

  2. 2
    It is all about exploration!
    Mars 2020 is a Mars rover mission
    by NASA's Mars Exploration
    Program that includes
    the Perseverance rover with a
    planned launch on 17 July 2020,
    and touch down in Jezero crater on
    Mars on 18 February 2021.

  3. 3
    Why?
    • 1st element: Acceleration of materials development and deployment
    • 2nd element: The creation and curation of large-scale materials databases
    has been widely cited as a critical component for accelerating materials
    development and deployment.
    – [Niezgoda et al. 2013]
    • This has been a recognized need of the materials community since the
    1970s.
    – [Materials Science and Engineering -- Volume II, The Needs, Priorities, and Opportunities for
    Materials Research]
    • 3rd element: Forging the process-structure-property links requires quantitative
    analysis. Processing parameters and property observations are generally easy to
    quantify, since they tend to be represented as objects in a relatively
    low-dimensional space.
    – [Kunselman et al. 2019]
    The need
    Solution
    Develop
    Improve

  4. 4
    The current ongoing work
    • In particular, we seek to address the following questions:
    1. How do we generate microstructure databases?
    2. What is the appropriate way to represent the generated high-dimensional
    microstructure space for inclusion into the database?
    3. How do we forge the links between the process-structure-property paradigm in
    different ensembles of materials and the generated microstructure databases?

  5. 5
    The Strategy for Propagation of Uncertainty
    Output
    Input
    Multi-scale phase-field framework
    Attari et al Acta Materialia (183), 2020, 452-470

  6. 6
    The Strategy for Propagation of Uncertainty
    Output
    Input
    Multi-scale phase-field framework
    Attari et al Acta Materialia (183), 2020, 452-470

  7. 7
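    Quantification of uncertainty in a CALPHAD model
    CALPHAD
    $F^{tot} = \int_V \left( f_{bulk} + f_{interfacial} + f_{elastic} \right) dV$
    Phase diagram
    Attari et al Acta Materialia (183), 2020, 452-470
    Input parameters
    $f_{bulk}^{\phi}(c_i, T) = \sum_i c_i \, {}^{0}G_i^{\phi} + RT \sum_i c_i \ln(c_i) + \sum_i \sum_{j>i} c_i c_j \sum_{\nu} {}^{\nu}L_{ij}^{\phi} (c_i - c_j)^{\nu}$
    where
    ${}^{\nu}L_{ij}^{\phi} = {}^{\nu}a_{ij}^{\phi} + {}^{\nu}b_{ij}^{\phi} \cdot T$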
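As a rough illustration of the bulk free energy on this slide, the sketch below evaluates a binary Redlich-Kister expression with interaction parameters that are linear in temperature, L_v = a_v + b_v T. The function name `bulk_free_energy` and all parameter values are hypothetical and are not taken from Attari et al.; the reference energies are set to zero purely for convenience.

```python
import math

R = 8.314462618  # gas constant, J/(mol K)

def bulk_free_energy(c, T, G0, L_params):
    """Binary Redlich-Kister free energy (J/mol).

    c        : mole fraction of component 1 (0 < c < 1)
    T        : temperature in K
    G0       : (G0_1, G0_2) reference energies of the pure components
    L_params : list of (a_v, b_v) pairs, with L_v = a_v + b_v * T
    """
    c1, c2 = c, 1.0 - c
    reference = c1 * G0[0] + c2 * G0[1]
    ideal = R * T * (c1 * math.log(c1) + c2 * math.log(c2))
    excess = c1 * c2 * sum(
        (a + b * T) * (c1 - c2) ** v for v, (a, b) in enumerate(L_params)
    )
    return reference + ideal + excess

# Hypothetical regular-solution parameters (illustrative only).
params = [(20000.0, 0.0)]
f_mid = bulk_free_energy(0.5, 500.0, (0.0, 0.0), params)
f_left = bulk_free_energy(0.1, 500.0, (0.0, 0.0), params)
f_right = bulk_free_energy(0.9, 500.0, (0.0, 0.0), params)
```

With a sufficiently positive interaction parameter, the energy at the equiatomic composition lies above the energy near the terminal compositions, which is the double-well shape associated with a miscibility gap.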

  8. 8
    Propagation of uncertainty in chain of models: CALPHAD-Microelasticity-Phase-field
    $F^{tot} = \int_V \left( f_{bulk} + f_{interfacial} + f_{elastic} \right) dV$
    [Diagram labels: Input parameters (Composition, Kinetic, CALPHAD, Microelasticity); Phase diagram; Phase-field inputs; Elasto-chemical microstructure space]
    Microelasticity inputs include the elastic constants $C_{11}$, $C_{12}$, $C_{44}$ of each phase.
    Attari et al Acta Materialia (183), 2020, 452-470

  9. 9
    Propagation of uncertainty in chain of models: Quantities of interest
    $F^{tot} = \int_V \left( f_{bulk} + f_{interfacial} + f_{elastic} \right) dV$
    [Diagram labels: Input parameters (Composition, Kinetic, CALPHAD, Microelasticity); Phase diagram; Phase-field inputs; Elasto-chemical microstructure space; Features]
    Microelasticity inputs include the elastic constants $C_{11}$, $C_{12}$, $C_{44}$ of each phase.
    Attari et al Acta Materialia (183), 2020, 452-470

  10. 10
    Microstructure quantification and analysis
    Input space (CALPHAD, composition, kinetic, microelasticity)
    Structure space: characterization with physical descriptors
    Property space
    [Figure labels: mass scattering; interface scattering; more anisotropic; finer domain]

  11. 11
    General overview of what we did
    • Forging the process/structure/property links requires quantitative
    analysis.
    – This has been a recognized need of the materials community since the
    1970s [Materials Science and Engineering -- Volume II, The Needs,
    Priorities, and Opportunities for Materials Research].
    – The creation and curation of large-scale materials databases has been
    widely cited as a critical component for accelerating materials
    development and deployment [Niezgoda et al. 2013].
    • Pattern recognition and data analysis using the physical descriptors.
    Open Phase-field Microstructure Database: http://microstructures.net

  12. 12
    Motivation
    • Uncovering links between processing conditions, microstructure, and
    properties
    Process Structure Property
    Forward Propagation
    Inverse problem
    Cause and Effect/Trial and Error
    Goal-Oriented Design
    Structure space is high-dimensional and difficult to navigate, making its
    efficient characterization a pressing research interest.
    Discrete classes sharing structural features are identified, and automated
    classifiers are trained over a feature space that provides adequate
    discrimination between these classes.

  13. 13
    Miscibility boundary engineering
    Increasing composition at constant temperature, T:
    Spherical precipitates → Bicontinuous → Spherical precipitates

  14. 14
    Current status
    • Supervised learning uses a sample of inputs with known desired outputs to
    learn a function that approximates the mapping from inputs to outputs.
    – Need an expert for each specific material system
    – Expensive (both time and money)
    – Subjective
    – Manual labeling
    • Semi-supervised learning aims to label unlabeled data points using
    knowledge learned from a small number of labeled data points.
    • Unsupervised learning does not have (or need) any labeled outputs, so its
    goal is to infer the natural structure present within a set of data points.
    • Almost all microstructure classification in the literature comes from:
    – Well-studied material systems with widely accepted classes (e.g., steels)
    – Simulations concerned with building a specific structure (each batch of data is
    created with a specific class label)
    • So, there is very little guidance as to how to classify when the ground truth
    labeling is partially or completely unknown.
    Classification
    – Supervised: image-level, pixel-level, object-level
    – Semi-supervised: image-level, pixel-level
    – Unsupervised: image-level, pixel-level
    Segmentation

  15. 15
    Dataset
    • 10,000 phase-field simulations taken from the Open Phase-field Microstructure Database (OPMD)
    Distribution of microstructure types:
    – Bicontinuous: 20%
    – Precipitate: 59%
    – Ambiguous (unlabeled): 21%

  16. 16
    Image Processing
    • Binarization through the Otsu method
    – Iterates through all possible thresholds and chooses the one that minimizes the
    weighted sum of background and foreground variances (the within-class variance)
    [Figure panels: raw image; binarized version; black-phase autocorrelation function]
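The threshold search described on this slide can be sketched in a few lines of NumPy. This is an illustrative implementation of Otsu's method under simple assumptions (a fixed number of histogram bins, a synthetic two-phase image), not the authors' code; the function name `otsu_threshold` and the test image are hypothetical.

```python
import numpy as np

def otsu_threshold(image, nbins=256):
    """Return the threshold that minimizes the weighted within-class variance."""
    hist, edges = np.histogram(np.ravel(image), bins=nbins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    prob = hist / hist.sum()
    best_threshold, best_within = edges[1], np.inf
    for k in range(1, nbins):
        w0, w1 = prob[:k].sum(), prob[k:].sum()
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so this split is not a valid candidate
        mu0 = (prob[:k] * centers[:k]).sum() / w0
        mu1 = (prob[k:] * centers[k:]).sum() / w1
        var0 = (prob[:k] * (centers[:k] - mu0) ** 2).sum() / w0
        var1 = (prob[k:] * (centers[k:] - mu1) ** 2).sum() / w1
        within = w0 * var0 + w1 * var1
        if within < best_within:
            best_within, best_threshold = within, edges[k]
    return best_threshold

# Synthetic two-phase "image": half dark pixels, half bright pixels.
rng = np.random.default_rng(0)
image = np.concatenate(
    [rng.normal(0.2, 0.03, 2000), rng.normal(0.8, 0.03, 2000)]
).reshape(40, 100)
t = otsu_threshold(image)
binary = image > t
```

Library routines such as `skimage.filters.threshold_otsu` perform the same search more efficiently; the explicit loop is kept here only to mirror the description on the slide.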

  17. 17
    Featurization, The microstructure function $m(x, h)$
    • For each image, the microstructure function $m(x, h)$ is defined as a wide-sense stationary
    stochastic process in which $h$ is a random variable associated with the probability of finding a
    specific local state at spatial position $x$.
    – $h \in H$, the complete set of local states of interest
    – $x$ is an index, similar to time for a signal
    • The two-point correlation function is then defined as:
    $f^{h,h'}(x_1, x_2) = E[m(x_1, h) \, m(x_2, h')]$
    • But stationarity of $m$ allows us to reduce $f^{h,h'}$ to a function of one spatial variable $r = x_2 - x_1$:
    $f^{h,h'}(r) = E[m(x, h) \, m(x + r, h')]$
    • Discretizing the microstructure function with a primitive basis and assuming periodic boundary conditions gives
    (summing over all $S$ spatial cells $s$):
    $f^{h,h'}_r = \frac{1}{S} \sum_{s} m_s^{h} \, m_{s+r}^{h'}$
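The discretized autocorrelation with periodic boundary conditions is a circular correlation, so it can be evaluated with FFTs. The sketch below assumes a binary image whose entries are 1 where the local state of interest is present; the function name `two_point_autocorrelation` and the random test image are illustrative assumptions.

```python
import numpy as np

def two_point_autocorrelation(m):
    """f_r = (1/S) * sum_s m_s * m_{s+r} with periodic boundaries, via FFT."""
    m = np.asarray(m, dtype=float)
    spectrum = np.fft.fftn(m)
    # conj(F) * F is the Fourier transform of the circular autocorrelation.
    return np.real(np.fft.ifftn(np.conj(spectrum) * spectrum)) / m.size

# Hypothetical binarized microstructure (1 = black phase).
rng = np.random.default_rng(0)
binary = (rng.random((64, 64)) < 0.3).astype(float)
f_bb = two_point_autocorrelation(binary)
```

The value at zero shift equals the volume fraction of the phase, and the output has one entry per pixel, which is why a 512 x 512 image yields a 262,144-dimensional feature vector.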

  18. 18
    Featurization, The Two-Point Correlation Function, $f^{h,h'}(r)$
    • $f^{h,h'}(r)$ is a stochastic microstructure descriptor.
    • Black-phase autocorrelations ($f^{b,b}(r)$) were calculated for each image
    • This results in a 262,144-dimensional feature space
    – Dimensionality reduction is necessary: Incremental PCA (IPCA)

  19. 19
    Featurization
    Normalized Two-Point Correlation Function, $\mathrm{corr}^{h,h'}(r)$
    • Why normalized?
    – Normalizing $f^{b,b}(r)$ removes the strong relationship with volume fraction, which allows a PCA decomposition
    of $\mathrm{corr}^{h,h'}(r)$ to be used as a discriminative feature space based on structural information.
    • We define the normalized two-point correlation function as:
    $\mathrm{corr}^{h,h'}(r) = \frac{E[m(x, h) \, m(x + r, h')] - E[m(x, h)] \, E[m(x + r, h')]}{\sqrt{\mathrm{Var}[m(x, h)] \, \mathrm{Var}[m(x + r, h')]}}$
    • For local state $b$ at both endpoints, this reduces to:
    $\mathrm{corr}^{b,b}(r) = \frac{f^{b,b}(r) - v_b^2}{v_b - v_b^2}$
    where $v_b$ is the black phase volume fraction.
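For a binary indicator image, the normalization amounts to subtracting the squared volume fraction from the autocorrelation and dividing by the indicator variance. The sketch below is a minimal version under that assumption; `normalized_autocorrelation` and the striped test image are hypothetical names and data chosen for illustration.

```python
import numpy as np

def normalized_autocorrelation(binary):
    """Normalized black-phase autocorrelation of a periodic binary image."""
    m = np.asarray(binary, dtype=float)
    v_b = m.mean()  # black-phase volume fraction
    if not 0.0 < v_b < 1.0:
        raise ValueError("image must contain both phases")
    spectrum = np.fft.fftn(m)
    f_bb = np.real(np.fft.ifftn(np.conj(spectrum) * spectrum)) / m.size
    # For an indicator variable, Var = v_b - v_b**2.
    return (f_bb - v_b**2) / (v_b - v_b**2)

# 8 x 8 image of alternating black and white columns.
stripes = np.tile([1.0, 0.0], (8, 4))
corr = normalized_autocorrelation(stripes)
```

In this striped example a shift of one pixel across the stripes is perfectly anti-correlated and a shift of two pixels is perfectly correlated, regardless of how much of the image is black.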

  20. 20
    Dimension Reduction
    • High-confidence data split into:
    – Training set (1,536, ~80%)
    – Validation set (384, ~20%)
    • IPCA performed on a combination of labeled
    training and ambiguous sets
    – Validation set then projected into this
    space
    [Figure: explained variance of the first five principal components (PC) for
    decompositions of $f^{b,b}(r)$ and $\mathrm{corr}^{b,b}(r)$]
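The dimension-reduction step on this slide (fit on the labeled training and ambiguous sets, then project the validation set) could look roughly like the following with scikit-learn's `IncrementalPCA`. The array sizes and random data are placeholders, and the batch size of 50 is an arbitrary assumption; only the fit-then-project pattern is taken from the slide.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
n_features = 256  # stand-in for the 262,144 correlation values per image
X_train = rng.normal(size=(160, n_features))      # placeholder labeled training set
X_ambiguous = rng.normal(size=(40, n_features))   # placeholder ambiguous set
X_val = rng.normal(size=(40, n_features))         # placeholder validation set

# Fit the decomposition batch by batch so the full matrix never has to be
# held in memory at once.
ipca = IncrementalPCA(n_components=5)
X_fit = np.vstack([X_train, X_ambiguous])
for start in range(0, len(X_fit), 50):
    ipca.partial_fit(X_fit[start:start + 50])

# The validation set is only projected into the learned space.
Z_val = ipca.transform(X_val)
explained = ipca.explained_variance_ratio_
```

Keeping the validation set out of the fit avoids leaking information from held-out images into the feature space.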

  21. 21
    Supervised Classification, Baseline SVM
    • Data were mean-centered at zero and scaled to unit variance.
    • Hyperparameters ($C > 0$ for all kernels and $\gamma$ for the Gaussian kernel) optimized through exhaustive
    grid search with 5-fold cross-validation on the labeled training set.
    – Soft-margin SVM: the larger $C$ gets, the stricter the boundary becomes.
    • Provides a performance baseline on high-confidence data and a method of visualizing
    high-dimensional data.
    Selected hyperparameters and errors:
    $C$ = 10
    $\gamma$ = 0.01
    Training Error = 0.0358
    Test Error = 0.0547
    Workflow for classification:
    1. Collect training data. Assemble features with a property that stores the known class label.
    2. Instantiate a classifier. Set its parameters if necessary.
    3. Train the classifier using the training data.
    4. Classify an image or feature collection.
    5. Estimate classification error.
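The baseline workflow on this slide (standardize, grid-search the SVM hyperparameters with 5-fold cross-validation, then estimate error on held-out data) might be sketched with scikit-learn as follows. The synthetic dataset and the grid values are assumptions made for illustration; the slide does not list the grid that was searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder two-class data standing in for the five-component feature space.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scaling lives inside the pipeline so each CV fold is standardized on its own
# training portion.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10, 100]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10, 100],
     "svc__gamma": [0.001, 0.01, 0.1, 1]},
]
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

train_error = 1.0 - search.score(X_train, y_train)
test_error = 1.0 - search.score(X_val, y_val)
```

`search.best_params_` then holds the selected kernel, `C`, and (for the Gaussian kernel) `gamma`.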

  22. 22
    Thoughts
    • While the Baseline SVM works well on the high-confidence data, the
    boundary is uninformed by most of the data points closest to it
    – Probably cannot be trusted to label the ambiguous set…
    • We need to assign labels to some of the ambiguous microstructures.

  23. 23
    Semi-Supervised Classification, Transductive SVMs
    • Uses both labeled and unlabeled data
    – Unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable
    improvement in learning accuracy.
    • There is always the danger that the wrong semi-supervised method will deteriorate classification
    performance instead of improving it
    – To alleviate this concern, a collection of methods with very different mathematical structures are used and the
    subset identified through their consensus is added to the training set
    • Supervised method modified to be semi-supervised
    • Now the problem is to find a boundary in the low density area of the labeled and unlabeled data while still
    maximizing training accuracy
    $\min \; \frac{1}{2} \|w\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{j=1}^{u} \hat{\xi}_j$
    such that $\hat{y}_j \in \{-1, 1\}$ for $j = 1, \dots, u$
    • Unfortunately, this problem is NP-hard
    – But, many methods have been proposed which approximate it

  24. 24
    Semi-Supervised Classifiers
    • Method 1: Safe Semi-Supervised Support Vector Machine (S4VM)
    – Transductive SVM approximation algorithm which simultaneously considers
    multiple low-density separators instead of chasing one local minimum.
    • Method 2: Label Propagation (LP)
    – Graph-based method created for the semi-supervised problem
    • Method 3: COP-KMEANS Clustering (CKM)
    – Unsupervised method modified to be semi-supervised
    • Method 4: Modified Yarowsky Algorithm (MY)
    – Very popular self-training approach because it is easy to understand and can be
    a wrapper for any existing classifier.
    • Method 5: Updated training set

  25. 25
    Semi-supervised Consensus Results

  26. 26
    Updated SVM
    • The four methods agreed on 301/519 of the initially unlabeled data (~58%)
    • This subset was added to the initially labeled training set, and an Updated
    SVM was optimized/trained over this set in the same fashion as the
    baseline
    Is there a significant difference?
    Classifier   C    γ      Training Error (initially labeled)   Test Error
    Baseline     10   0.01   0.0358                               0.0547
    Updated      10   0.01   0.0397                               0.0599
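The consensus step behind the updated training set (keep only the initially unlabeled points on which every semi-supervised method assigns the same label) can be sketched as below. The function name `consensus_subset` and the toy predictions are hypothetical; in the talk the rows would be the outputs of S4VM, LP, CKM, and MY.

```python
import numpy as np

def consensus_subset(predictions):
    """Return (indices, labels) of samples on which all methods agree.

    predictions: array of shape (n_methods, n_samples) holding class labels.
    """
    predictions = np.asarray(predictions)
    agree = np.all(predictions == predictions[0], axis=0)
    indices = np.flatnonzero(agree)
    return indices, predictions[0, indices]

# Toy labels from four methods on five unlabeled samples.
preds = np.array([
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1],
])
idx, labels = consensus_subset(preds)
```

Only the agreed-upon samples and their labels would be appended to the training set; the remaining points stay unlabeled.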

  27. 27
    McNemar’s Test
    • Non-parametric statistical hypothesis test used to compare dependent
    categorical outputs
    – $H_0: p_b = p_c$
    • If $b + c \geq 10$: $\chi^2 = \frac{(b - c)^2}{b + c}$
    • If $b + c < 10$: $b \sim \mathrm{Binomial}(b + c, 0.5)$
    • Per popular convention, looking for significance at the 95% confidence level
    Conclusion: The difference in performance of baseline and updated
    SVMs on the high-confidence data is not statistically significant.
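A small sketch of McNemar's test using only the standard library is given below. It assumes the usual convention that b and c count the test samples misclassified by exactly one of the two classifiers (the slide does not define them), uses the chi-square statistic without continuity correction when b + c >= 10, and falls back to an exact two-sided binomial test otherwise. The function name `mcnemar_p_value` is hypothetical.

```python
import math

def mcnemar_p_value(b, c):
    """Two-sided p-value for McNemar's test on discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree
    if n >= 10:
        statistic = (b - c) ** 2 / n
        # Survival function of a chi-square variable with 1 degree of freedom.
        return math.erfc(math.sqrt(statistic / 2.0))
    # Exact test: under H0 the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

p_large = mcnemar_p_value(20, 5)   # chi-square branch
p_small = mcnemar_p_value(1, 7)    # exact binomial branch
p_equal = mcnemar_p_value(10, 10)  # identical discordant counts
```

A p-value below 0.05 would indicate a significant difference at the 95% confidence level used on this slide.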

  28. 28
    Semi-Supervised Error Estimation
    • Since there is a set of data for which we do not know the labels, we cannot
    use traditional error estimation techniques for model validation. A tedious
    derivation leads to the conclusion:
    $\varepsilon = P(X \in S_U) \, \varepsilon_U + P(X \in S_L) \, \varepsilon_L \;\Rightarrow\; \hat{\varepsilon} = \hat{P}(X \in S_U) \, \hat{\varepsilon}_U + \hat{P}(X \in S_L) \, \hat{\varepsilon}_L$
    • This means that we can get a semi-supervised error estimate through
    separate supervised and unsupervised error estimates (with probabilities
    estimated by total numbers of labeled and unlabeled sample points).
    • $\hat{\varepsilon}_L$ can be estimated through a traditional test set, but what about $\hat{\varepsilon}_U$?
    • Labeled subpopulation: $S_L$
    • Unlabeled subpopulation: $S_U$
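The combined estimate is a weighted average of the labeled and unlabeled error estimates, with weights given by the fraction of sample points in each subpopulation. In the sketch below the counts (1,920 labeled and 519 initially unlabeled points) come from earlier slides, while the two error values are made-up placeholders; the function name is hypothetical.

```python
def semi_supervised_error(err_labeled, err_unlabeled, n_labeled, n_unlabeled):
    """Weighted combination of labeled and unlabeled error estimates."""
    n_total = n_labeled + n_unlabeled
    p_labeled = n_labeled / n_total      # estimate of P(X in labeled subpopulation)
    p_unlabeled = n_unlabeled / n_total  # estimate of P(X in unlabeled subpopulation)
    return p_labeled * err_labeled + p_unlabeled * err_unlabeled

# Placeholder error estimates, for illustration only.
estimate = semi_supervised_error(
    err_labeled=0.05, err_unlabeled=0.10, n_labeled=1920, n_unlabeled=519
)
```

Because the weights sum to one, the overall estimate always lies between the labeled and unlabeled estimates.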

  29. 29
    Unsupervised Error Estimation
    A recent paper by Platanios et al. outlines a method of unsupervised
    error estimation through constrained optimization:
    • Let $A$ be a set of classifiers and let $a_A$ and $e_A$ be the agreement rate
    and error rate of $A$, respectively
    – $a_A$: probability that all classifiers in $A$ assign the same label
    – $e_A$: probability that all classifiers in $A$ assign the wrong label
    • Say $A = \{f_1, f_2, f_3\}$. Then the solution vector that we are solving for is
    $e = [e_{1}, e_{2}, e_{3}, e_{1,2}, e_{1,3}, e_{2,3}, e_{1,2,3}]$
    E. A. Platanios et al., “Estimating accuracy from unlabeled data,” in Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 682– 691, AUAI Press, 2014.

  30. 30
    Unsupervised Error Estimation
    • Some set theory and combinatorics lead to the conclusion
    $a_A = e_A + 1 + \sum_{k=1}^{|A|} (-1)^k \sum_{I \subseteq A, \, |I| = k} e_I$,
    • which gives equality constraints. Inequality constraints include
    $e_A \leq \min_{i \in A} e_{A \setminus i}$ for all $A$ with $|A| \geq 2$
    $\min_{i \in A} e_i \leq 0.5$
    • Alternatively, a recommended objective function for minimization is:
    $c_1(e) = \sum_{A: |A| \geq 2} \left( e_A - \prod_{i \in A} e_i \right)^2$
    • which minimizes dependence between individual error rates. We also added
    $c_2(e) = \sum_{i \in A} e_i$,
    • which can be optimistic, but takes the dependence of error rates completely out of
    the problem.
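The equality constraint linking agreement rates to joint error rates follows from inclusion-exclusion: for binary labels, classifiers agree exactly when they are all right or all wrong. The sketch below checks this numerically on synthetic data where the true labels are known; the classifiers, their flip probabilities, and the helper names are all hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000

# Synthetic binary ground truth and three noisy classifiers that flip the
# true label with probability 0.1, 0.2, and 0.3 respectively.
y = rng.integers(0, 2, n_samples)
flip = rng.random((3, n_samples)) < np.array([[0.1], [0.2], [0.3]])
preds = np.where(flip, 1 - y, y)
wrong = preds != y

def joint_error(subset):
    """e_I: fraction of samples that every classifier in `subset` gets wrong."""
    return np.mean(np.all(wrong[list(subset)], axis=0))

def agreement_from_errors(A):
    """a_A computed from joint error rates via the inclusion-exclusion identity."""
    total = joint_error(A) + 1.0
    for k in range(1, len(A) + 1):
        total += (-1) ** k * sum(
            joint_error(I) for I in itertools.combinations(A, k)
        )
    return total

A = (0, 1, 2)
empirical_agreement = np.mean(np.all(preds[list(A)] == preds[A[0]], axis=0))
formula_agreement = agreement_from_errors(A)
```

In the unsupervised setting only the agreement rates are observable, so the identity is used in the other direction, as a constraint on the unknown error rates.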

  31. 31
    Semi-supervised Error Estimation
    • The labeled, unlabeled, and overall error estimates for all five classifiers:

  32. 32
    Summary and Conclusions
    • Microstructure characterization and classification have been
    identified as an important step in building processing-structure-property
    linkages for the ultimate goal of materials by design.
    • Stochastic descriptors over physical descriptors for
    microstructure representation.
    • We
    – Developed a semi-supervised classification framework
    which identifies the largest “safe” subset to add to the
    training set
    • Data-driven discovery
    – Defined and utilized a novel semi-supervised error
    estimation technique
    • As we extend our work to the more general microstructure classification
    problem, we must recognize that class taxonomy will be both ambiguous and
    dynamic.
    [Equations shown on slide: the objective $c_1(e) = \sum_{A: |A| \geq 2} \left( e_A - \prod_{i \in A} e_i \right)^2$ and the phase-field evolution equations driven by $F^{tot}$]

  33. 33
    Thanks
    http://microstructures.net
