Slide 1

Slide 1 text

Semi-supervised Learning Approaches for Microstructure Classification
Courtney Kunselman1, Vahid Attari1, Levi McClenny2, Ulisses Braga-Neto2, Raymundo Arroyave1,3
1 Department of Materials Science and Engineering, Texas A&M University
2 Department of Electrical Engineering, Texas A&M University
3 Department of Mechanical Engineering, Texas A&M University
CHiMaD Workshop, April 21, 2020

Slide 2

Slide 2 text

It is all about exploration! Mars 2020 is a Mars rover mission by NASA's Mars Exploration Program that includes the Perseverance rover, with a planned launch on 17 July 2020 and touchdown in Jezero crater on Mars on 18 February 2021.

Slide 3

Slide 3 text

Why?
• 1st element: Acceleration of materials development and deployment.
• 2nd element: The creation and curation of large-scale materials databases has been widely cited as a critical required component for the acceleration of materials development and deployment [Niezgoda et al. 2013].
– This has been a recognized need in the materials community since the 1970s [Materials Science and Engineering, Volume II: The Needs, Priorities, and Opportunities for Materials Research].
• 3rd element: Forging the process/structure/property links requires quantitative analysis. While processing parameters and property observations are generally easy to quantify, they tend to be represented as objects that exist in a relatively low-dimensional space [Kunselman et al. 2019].

Slide 4

Slide 4 text

The current ongoing work
• In particular, we seek to address the following questions:
1. How do we generate microstructure databases?
2. What is the appropriate way to represent the generated high-dimensional microstructure space for inclusion in the database?
3. How do we forge the links of the process-structure-property paradigm between different ensembles of materials and the generated microstructure databases?

Slide 5

Slide 5 text

The Strategy for Propagation of Uncertainty
Multi-scale phase-field framework: input → output.
Attari et al., Acta Materialia 183 (2020), 452-470.

Slide 6

Slide 6 text

The Strategy for Propagation of Uncertainty
Multi-scale phase-field framework: input → output.
Attari et al., Acta Materialia 183 (2020), 452-470.

Slide 7

Slide 7 text

Quantification of uncertainty in a CALPHAD model
The phase-field free energy functional is
$F_{\mathrm{total}} = \int_V \left( f_{\mathrm{bulk}} + f_{\mathrm{interfacial}} + f_{\mathrm{elastic}} \right) dV$.
CALPHAD input parameters → phase diagram.
The bulk chemical free energy follows a solution model with Redlich-Kister excess terms:
$f_{\mathrm{bulk}}(x_i, T) = \sum_i x_i \, {}^{0}G_i(T) + RT \sum_i x_i \ln(x_i) + \sum_i \sum_{j \neq i} x_i x_j \sum_{\nu} {}^{\nu}L_{ij}(T) \, (x_i - x_j)^{\nu}$,
where ${}^{\nu}L_{ij}^{\phi} = {}^{\nu}A_{ij}^{\phi} + {}^{\nu}B_{ij}^{\phi} \, T$.
Attari et al., Acta Materialia 183 (2020), 452-470.
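As a concrete illustration, here is a minimal Python sketch evaluating this bulk free energy for a hypothetical binary A-B solution phase. The interaction parameters and end-member energies below are made up for illustration, not values from the paper:

```python
import numpy as np

R = 8.314  # gas constant, J/(mol K)

def f_bulk_binary(x, T, G0_A, G0_B, L_params):
    """Molar Gibbs energy of a binary A-B solution phase:
    reference + ideal-mixing + Redlich-Kister excess terms.
    x is the mole fraction of B; L_params[nu] = (A_nu, B_nu),
    so that L_nu(T) = A_nu + B_nu * T."""
    xA, xB = 1.0 - x, x
    g_ref = xA * G0_A + xB * G0_B
    g_ideal = R * T * (xA * np.log(xA) + xB * np.log(xB))
    g_excess = sum((A + B * T) * xA * xB * (xA - xB) ** nu
                   for nu, (A, B) in enumerate(L_params))
    return g_ref + g_ideal + g_excess

# Hypothetical parameters, for illustration only
g = f_bulk_binary(x=0.3, T=900.0, G0_A=-8000.0, G0_B=-9500.0,
                  L_params=[(-15000.0, 4.0), (2500.0, -1.0)])
```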

Slide 8

Slide 8 text

Propagation of uncertainty in a chain of models: CALPHAD → Microelasticity → Phase-field
$F_{\mathrm{total}} = \int_V \left( f_{\mathrm{bulk}} + f_{\mathrm{interfacial}} + f_{\mathrm{elastic}} \right) dV$
CALPHAD input parameters → phase diagram; composition and kinetic CALPHAD inputs → elasto-chemical microstructure space.
Phase-field inputs: $C_{11}^{\alpha}, C_{12}^{\alpha}, C_{44}^{\alpha}, C_{11}^{\beta}, C_{12}^{\beta}, C_{44}^{\beta}$, $V_m^{\alpha}, V_m^{\beta}$, $\epsilon_{ii}$.
Attari et al., Acta Materialia 183 (2020), 452-470.
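A minimal sketch of propagating input uncertainty through such a model chain by Monte Carlo sampling; the distributions, parameter names, and the toy quantity of interest below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical uncertain CALPHAD interaction parameters (J/mol)
L0 = rng.normal(-15000.0, 500.0, size=1000)
L1 = rng.normal(2500.0, 200.0, size=1000)

def quantity_of_interest(L0_i, L1_i, x=0.3):
    """Toy stand-in for the CALPHAD -> microelasticity -> phase-field
    chain: the Redlich-Kister excess energy at a single state point."""
    xA, xB = 1.0 - x, x
    return xA * xB * (L0_i + L1_i * (xA - xB))

qoi = quantity_of_interest(L0, L1)
print(f"QoI mean = {qoi.mean():.1f} J/mol, std = {qoi.std():.1f} J/mol")
```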

Slide 9

Slide 9 text

Propagation of uncertainty in a chain of models: quantities of interest
$F_{\mathrm{total}} = \int_V \left( f_{\mathrm{bulk}} + f_{\mathrm{interfacial}} + f_{\mathrm{elastic}} \right) dV$
CALPHAD input parameters → phase diagram; composition and kinetic CALPHAD inputs → elasto-chemical microstructure space → features.
Phase-field inputs: $C_{11}^{\alpha}, C_{12}^{\alpha}, C_{44}^{\alpha}, C_{11}^{\beta}, C_{12}^{\beta}, C_{44}^{\beta}$, $V_m^{\alpha}, V_m^{\beta}$, $\epsilon_{ii}$.
Attari et al., Acta Materialia 183 (2020), 452-470.

Slide 10

Slide 10 text

Microstructure quantification and analysis
Input space (composition, kinetic CALPHAD inputs, microelasticity) → structure space (characterization with physical descriptors) → property space.
[Figure annotations: mass scattering vs. interface scattering; more anisotropic; finer domain.]

Slide 11

Slide 11 text

General overview of what we did
• Forging the process/structure/property links requires quantitative analysis.
– This has been a recognized need in the materials community since the 1970s [Materials Science and Engineering, Volume II: The Needs, Priorities, and Opportunities for Materials Research].
– The creation and curation of large-scale materials databases has been widely cited as a critical required component for the acceleration of materials development and deployment [Niezgoda et al. 2013].
• Pattern recognition and data analysis using the physical descriptors.
Open Phase-field Microstructure Database: http://microstructures.net

Slide 12

Slide 12 text

Motivation
• Uncovering links between processing conditions, microstructure, and properties.
Process → Structure → Property: forward propagation (cause and effect / trial and error) versus the inverse problem (goal-oriented design).
• The structure space is high-dimensional and difficult to navigate, making its efficient characterization a pressing research interest.
• Discrete classes sharing structural features are identified, and automated classifiers are trained over a feature space that provides adequate discrimination between these classes.

Slide 13

Slide 13 text

Miscibility boundary engineering
Increasing composition at a constant temperature T: spherical precipitates → bicontinuous → spherical precipitates.

Slide 14

Slide 14 text

Current status
• Supervised learning aims to learn, from a sample of inputs and desired outputs, a function that maps inputs to outputs.
– Needs an expert for each specific material system
– Expensive (in both time and money)
– Subjective
– Manual labeling
• Semi-supervised learning aims to label unlabeled data points using knowledge learned from a small number of labeled data points.
• Unsupervised learning does not have (or need) any labeled outputs, so its goal is to infer the natural structure present within a set of data points.
• Almost all microstructure classification in the literature comes from:
– Well-studied material systems with widely accepted classes (e.g., steels)
– Simulations concerned with building a specific structure (each batch of data is created with a specific class label)
• So there is very little guidance on how to classify when the ground-truth labeling is partially or completely unknown.
[Taxonomy diagram: classification (supervised: image-, pixel-, object-level; semi-supervised: image-, pixel-level; unsupervised: image-, pixel-level) and segmentation.]

Slide 15

Slide 15 text

Dataset
• 10,000 phase-field simulations taken from the OPMD database
Distribution of microstructure types: precipitate 59%, ambiguous 21%, bicontinuous 20%.
Classes: bicontinuous, precipitates, ambiguous (unlabeled).

Slide 16

Slide 16 text

Image Processing
• Binarization through the Otsu method
– Iterates through all possible thresholds and chooses the one where the weighted sum of the background and foreground variances (the intra-class variance) is at a minimum; a sketch of this search follows below.
[Panels: the raw image, its binarized version, and the black autocorrelation function.]
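A minimal numpy sketch of this exhaustive search; the random 512×512 image is a stand-in for an actual micrograph:

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Exhaustive Otsu search: choose the threshold that minimizes the
    weighted sum of background and foreground variances (the intra-class
    variance)."""
    hist, edges = np.histogram(img.ravel(), bins=nbins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_score = edges[1], np.inf
    for i in range(1, nbins):
        n_bg, n_fg = hist[:i].sum(), hist[i:].sum()
        if n_bg == 0 or n_fg == 0:
            continue  # no valid background/foreground split here
        mu_bg = np.average(centers[:i], weights=hist[:i])
        mu_fg = np.average(centers[i:], weights=hist[i:])
        var_bg = np.average((centers[:i] - mu_bg) ** 2, weights=hist[:i])
        var_fg = np.average((centers[i:] - mu_fg) ** 2, weights=hist[i:])
        score = n_bg * var_bg + n_fg * var_fg  # weighted intra-class variance
        if score < best_score:
            best_t, best_score = edges[i], score
    return best_t

img = np.random.rand(512, 512)      # stand-in for a phase-field image
binary = img > otsu_threshold(img)  # boolean mask of one phase
```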

Slide 17

Slide 17 text

Featurization: the microstructure function m(x, h)
• For each image, the microstructure function m(x, h) is defined as a wide-sense stationary stochastic process in which h is a random variable associated with the probability of finding a specific local state at spatial position x.
– h ∈ H, the complete set of local states of interest
– x is an index, similar to time for a signal
• The two-point correlation function is then defined as
$f_{h,h'}(x_1, x_2) = E\left[ m(x_1, h)\, m(x_2, h') \right]$.
• But stationarity of m allows us to reduce $f_{h,h'}$ to a function of one spatial variable $r = x_2 - x_1$:
$f_{h,h'}(r) = E\left[ m(x, h)\, m(x + r, h') \right]$.
• Discretizing the microstructure function with a primitive basis and periodic boundary conditions gives
$f_{h,h'}(r) = \frac{1}{S} \sum_{s} m_s^{h}\, m_{s+r}^{h'}$.
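Because the boundary conditions are periodic, this discrete sum is a circular correlation and can be computed with FFTs. A minimal numpy sketch for the autocorrelation of one local state (the function and array names are ours, not from the talk):

```python
import numpy as np

def two_point_autocorr(indicator):
    """Periodic two-point autocorrelation f_{h,h}(r) of a binary indicator
    array m_s^h, computed as a circular correlation via FFT and normalized
    by the number of cells S."""
    S = indicator.size
    F = np.fft.fftn(indicator)
    return np.fft.ifftn(F * np.conj(F)).real / S

m_black = (np.random.rand(512, 512) > 0.5).astype(float)  # stand-in image
f_bb = two_point_autocorr(m_black)
# f_bb[0, 0] equals the black-phase volume fraction
```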

Slide 18

Slide 18 text

Featurization: the two-point correlation function $f_{h,h'}(r)$
• $f_{h,h'}(r)$ is a stochastic microstructure descriptor.
• Black-phase autocorrelations $f_{b,b}(r)$ were calculated for each image.
• This results in a 262,144-dimensional (512 × 512) feature space.
– Dimensionality reduction is necessary: incremental PCA analysis.

Slide 19

Slide 19 text

Featurization: the normalized two-point correlation function $\hat{f}_{h,h'}(r)$
• Why normalized?
– Normalizing $f_{b,b}(r)$ removes the strong relationship with volume fraction, which allows a PCA decomposition of $\hat{f}_{h,h'}(r)$ to be used as a discriminative feature space based on structural information.
• We define the normalized two-point correlation function as
$\hat{f}_{h,h'}(r) = \dfrac{E\left[ m(x, h)\, m(x+r, h') \right] - E\left[ m(x, h) \right] E\left[ m(x+r, h') \right]}{\sqrt{\mathrm{Var}\left[ m(x, h) \right]}\, \sqrt{\mathrm{Var}\left[ m(x+r, h') \right]}}$.
• For local state b at both endpoints, this reduces to
$\hat{f}_{b,b}(r) = \dfrac{f_{b,b}(r) - v_b^2}{v_b - v_b^2}$,
where $v_b$ is the black-phase volume fraction.
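In code, this reduction is one step on top of the autocorrelation sketch above (reusing the hypothetical two_point_autocorr helper from the earlier slide):

```python
def normalized_autocorr(indicator):
    """Normalized autocorrelation (f_bb(r) - v^2) / (v - v^2), where v is
    the volume fraction of the indicated phase; removes the volume-fraction
    signal so only structural information remains."""
    f_bb = two_point_autocorr(indicator)
    v = indicator.mean()
    return (f_bb - v ** 2) / (v - v ** 2)
```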

Slide 20

Slide 20 text

Dimension Reduction
• High-confidence data split into:
– Training set (1,536, ~80%)
– Validation set (384, ~20%)
• IPCA performed on a combination of the labeled training and ambiguous sets
– The validation set is then projected into this space
[Figure: explained variance of the first five principal components (PCs) for decompositions of $f_{b,b}(r)$ and $\hat{f}_{b,b}(r)$.]
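A sketch of this step with scikit-learn's IncrementalPCA; the array sizes below are shrunken stand-ins so the example runs quickly (real rows are 262,144-dimensional):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

X_fit = np.random.rand(200, 4096)   # labeled training + ambiguous sets
X_val = np.random.rand(50, 4096)    # validation set

ipca = IncrementalPCA(n_components=5, batch_size=50)
Z_fit = ipca.fit_transform(X_fit)   # fit the decomposition incrementally
Z_val = ipca.transform(X_val)       # project the validation set
print(ipca.explained_variance_ratio_)
```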

Slide 21

Slide 21 text

Supervised Classification: Baseline SVM
• Data was mean-centered at zero and scaled to unit variance.
• Hyperparameters ($C > 0$ and $\gamma$ for the Gaussian kernel) were optimized through exhaustive grid search with 5-fold cross-validation on the labeled training set.
– Soft-margin SVM: the larger $C$ gets, the stricter the boundary becomes.
• Provides a performance baseline on the high-confidence data and a method of visualizing high-dimensional data.
Result: $C = 10$, $\gamma = 0.01$; training error 0.0358; test error 0.0547.
Workflow for classification:
1. Collect training data. Assemble features with a property that stores the known class label.
2. Instantiate a classifier. Set its parameters if necessary.
3. Train the classifier using the training data.
4. Classify an image or feature collection.
5. Estimate the classification error.
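This grid search can be reproduced in outline with scikit-learn; the PCA scores and labels below are random stand-ins, and the parameter grids are illustrative choices, not the exact ones used in the talk:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in PCA scores and labels (1 = precipitate, -1 = bicontinuous)
Z_train = np.random.rand(1536, 5)
y_train = np.random.choice([-1, 1], size=1536)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10, 100],
                "svc__gamma": [1e-3, 1e-2, 1e-1]},
    cv=5,  # 5-fold cross-validation on the labeled training set
)
grid.fit(Z_train, y_train)
print(grid.best_params_, 1 - grid.best_score_)  # best (C, gamma), CV error
```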

Slide 22

Slide 22 text

Thoughts
• While the baseline SVM works well on the high-confidence data, the boundary is uninformed by most of the data points closest to it.
– It probably cannot be trusted to label the ambiguous set…
• We need to assign labels to some of the ambiguous microstructures.

Slide 23

Slide 23 text

Semi-Supervised Classification: Transductive SVMs
• Uses both labeled and unlabeled data.
– Unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.
• There is always the danger that the wrong semi-supervised method will deteriorate classification performance instead of improving it.
– To alleviate this concern, a collection of methods with very different mathematical structures is used, and the subset identified through their consensus is added to the training set.
• The transductive SVM is a supervised method modified to be semi-supervised: the problem is now to find a boundary in the low-density area of the labeled and unlabeled data while still maximizing training accuracy,
$\min\; \frac{1}{2}\|w\|^2 + C_l \sum_{i=1}^{l} \xi_i + C_u \sum_{j=1}^{u} \hat{\xi}_j \quad \text{s.t.}\quad \hat{y}_j \in \{-1, 1\} \text{ for } j = 1, \dots, u$.
• Unfortunately, this problem is NP-hard.
– But many methods have been proposed which approximate it.

Slide 24

Slide 24 text

Semi-Supervised Classifiers
• Method 1: Safe Semi-Supervised Support Vector Machine (S4VM)
– A transductive SVM approximation algorithm which simultaneously considers multiple low-density separators instead of chasing one local minimum.
• Method 2: Label Propagation (LP)
– A graph-based method created for the semi-supervised problem.
• Method 3: COP-KMEANS Clustering (CKM)
– An unsupervised method modified to be semi-supervised.
• Method 4: Modified Yarowsky Algorithm (MY)
– A very popular self-training approach because it is easy to understand and can be a wrapper for any existing classifier; a sketch follows below.
• Method 5: An SVM retrained on the updated training set.
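A minimal self-training loop in the spirit of the Modified Yarowsky wrapper, here wrapped around a probabilistic SVM; the confidence threshold, iteration count, and data are illustrative assumptions, not the exact settings used in this work:

```python
import numpy as np
from sklearn.svm import SVC

def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Yarowsky-style self-training: repeatedly fit on the labeled pool,
    then move high-confidence unlabeled points into it."""
    X_pool, y_pool = X_lab.copy(), y_lab.copy()
    unlab = X_unlab.copy()
    for _ in range(max_iter):
        if len(unlab) == 0:
            break
        clf.fit(X_pool, y_pool)
        proba = clf.predict_proba(unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing confident enough to pseudo-label
        new_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_pool = np.vstack([X_pool, unlab[confident]])
        y_pool = np.concatenate([y_pool, new_labels])
        unlab = unlab[~confident]
    return clf, X_pool, y_pool

clf, X_aug, y_aug = self_train(
    SVC(kernel="rbf", C=10, gamma=0.01, probability=True),
    np.random.rand(100, 5), np.random.choice([0, 1], 100),  # stand-ins
    np.random.rand(50, 5))
```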

Slide 25

Slide 25 text

Semi-supervised Consensus Results

Slide 26

Slide 26 text

Updated SVM
• The four methods agreed on 301/519 of the initially unlabeled data (~58%); a sketch of this consensus step follows below.
• This subset was added to the initially labeled training set, and an updated SVM was optimized/trained over this set in the same fashion as the baseline.
Is there a significant difference?

Classifier | C | γ | Training error (initially labeled) | Test error
Baseline | 10 | 0.01 | 0.0358 | 0.0547
Updated | 10 | 0.01 | 0.0397 | 0.0599
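A sketch of the consensus step, keeping only the ambiguous points on which all four semi-supervised methods agree (the prediction arrays are random placeholders):

```python
import numpy as np

# Predicted labels for the 519 ambiguous points from each method (stand-ins)
preds = np.random.choice([-1, 1], size=(4, 519))  # rows: S4VM, LP, CKM, MY

consensus = (preds == preds[0]).all(axis=0)  # True where all four agree
pseudo_labels = preds[0, consensus]
print(f"{consensus.sum()}/{preds.shape[1]} points added to the training set")
```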

Slide 27

Slide 27 text

McNemar's Test
• A non-parametric statistical hypothesis test used to compare dependent categorical outputs.
– $H_0$: the two classifiers have the same error rate.
• Let $b$ and $c$ count the test points misclassified by only the first or only the second classifier, respectively.
• If $b + c \geq 10$: $\chi^2 = \dfrac{(b - c)^2}{b + c}$.
• If $b + c < 10$: $b \sim \mathrm{Binomial}(b + c,\ 0.5)$.
• Per popular convention, we look for significance at the 95% confidence level.
Conclusion: the difference in performance of the baseline and updated SVMs on the high-confidence data is not statistically significant.
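A small helper implementing both branches of the test with scipy; the counts in the usage line are illustrative, not the ones from this study:

```python
from scipy.stats import binom, chi2

def mcnemar(b, c, alpha=0.05):
    """McNemar's test. b and c count the test points misclassified by
    only the first or only the second classifier, respectively."""
    n = b + c
    if n >= 10:  # chi-squared approximation, 1 degree of freedom
        stat = (b - c) ** 2 / n
        p = chi2.sf(stat, df=1)
    else:        # exact two-sided binomial test, b ~ Binomial(n, 0.5)
        p = min(1.0, 2 * binom.cdf(min(b, c), n, 0.5))
    return p, p < alpha

p_value, significant = mcnemar(b=7, c=11)  # illustrative counts
```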

Slide 28

Slide 28 text

Semi-Supervised Error Estimation
• Since there is a set of data for which we do not know the labels, we cannot use traditional error estimation techniques for model validation.
• Split the population into a labeled subpopulation $S_L$ and an unlabeled subpopulation $S_U$. A tedious derivation leads to the conclusion
$\varepsilon = P(x \in S_U)\, \varepsilon_U + P(x \in S_L)\, \varepsilon_L \;\Rightarrow\; \hat{\varepsilon} = \hat{P}(x \in S_U)\, \hat{\varepsilon}_U + \hat{P}(x \in S_L)\, \hat{\varepsilon}_L$.
• This means that we can get a semi-supervised error estimate through separate supervised and unsupervised error estimates (with the probabilities estimated by the total numbers of labeled and unlabeled sample points).
• $\hat{\varepsilon}_L$ can be estimated through a traditional test set, but what about $\hat{\varepsilon}_U$?
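The decomposition translates directly into a one-line estimator; the numbers in the usage line are illustrative placeholders:

```python
def semi_supervised_error(err_labeled, err_unlabeled, n_labeled, n_unlabeled):
    """Combine supervised and unsupervised error estimates, weighting by the
    estimated probabilities of falling in each subpopulation."""
    n = n_labeled + n_unlabeled
    return (n_labeled / n) * err_labeled + (n_unlabeled / n) * err_unlabeled

err = semi_supervised_error(err_labeled=0.05, err_unlabeled=0.12,
                            n_labeled=1920, n_unlabeled=519)  # illustrative
```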

Slide 29

Slide 29 text

Unsupervised Error Estimation
A recent paper by Platanios et al. outlines a method of unsupervised error estimation through constrained optimization:
• Let $F$ be a set of classifiers, and for each subset $A \subseteq F$ let $a_A$ and $e_A$ be its agreement rate and error rate, respectively.
– $a_A$: probability that all classifiers in $A$ assign the same label
– $e_A$: probability that all classifiers in $A$ assign the wrong label
• Say $F = \{f_1, f_2, f_3\}$. Then the solution vector that we are solving for is
$e = [e_1, e_2, e_3, e_{1,2}, e_{1,3}, e_{2,3}, e_{1,2,3}]$.
E. A. Platanios et al., "Estimating accuracy from unlabeled data," in Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 682-691, AUAI Press, 2014.

Slide 30

Slide 30 text

Unsupervised Error Estimation
• Some set theory and combinatorics lead to the conclusion
$a_A = e_A + 1 + \sum_{k=1}^{|A|} (-1)^k \sum_{I \subseteq A,\, |I| = k} e_I$,
which gives the equality constraints. Inequality constraints include
$e_A \leq \min_{i \in A} e_{A \setminus \{i\}}$ for $|A| \geq 2$, and $\min_{i \in A} e_i \leq 0.5$.
• Alternatively, a recommended objective function for minimization is
$c_1(e) = \sum_{A :\, |A| \geq 2} \Big( e_A - \prod_{i \in A} e_i \Big)^2$,
which minimizes dependence between individual error rates. We also added
$c_2(e) = \sum_{i \in F} e_i$,
which can be optimistic, but takes the dependence of error rates completely out of the problem.
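A sketch of how this constrained program can be set up for three classifiers with scipy, using only the pairwise agreement equalities for brevity (the pairwise case of the formula above reduces to $a_{ij} = 1 - e_i - e_j + 2e_{ij}$); the agreement rates below are made-up placeholders, not measured values:

```python
import numpy as np
from scipy.optimize import minimize

# Solution vector e = [e1, e2, e3, e12, e13, e23, e123]
pair_idx = {(0, 1): 3, (0, 2): 4, (1, 2): 5}
a_hat = {(0, 1): 0.91, (0, 2): 0.88, (1, 2): 0.90}  # placeholder rates

# Equality constraints: a_ij = 1 - e_i - e_j + 2*e_ij (both right or both wrong)
cons = [{"type": "eq",
         "fun": (lambda e, i=i, j=j, ij=ij:
                 1 - e[i] - e[j] + 2 * e[ij] - a_hat[(i, j)])}
        for (i, j), ij in pair_idx.items()]

# Inequality constraints: e_ij <= e_i and e_ij <= e_j
for (i, j), ij in pair_idx.items():
    cons.append({"type": "ineq", "fun": lambda e, k=i, ij=ij: e[k] - e[ij]})
    cons.append({"type": "ineq", "fun": lambda e, k=j, ij=ij: e[k] - e[ij]})

def c1(e):
    """Objective: keep joint error rates close to the independent products."""
    pair_term = sum((e[ij] - e[i] * e[j]) ** 2
                    for (i, j), ij in pair_idx.items())
    return pair_term + (e[6] - e[0] * e[1] * e[2]) ** 2

res = minimize(c1, x0=np.full(7, 0.1), bounds=[(0.0, 0.5)] * 7,
               constraints=cons, method="SLSQP")
print(res.x[:3])  # estimated individual error rates e1, e2, e3
```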

Slide 31

Slide 31 text

Semi-supervised Error Estimation
The labeled, unlabeled, and overall error estimates for all five classifiers:

Slide 32

Slide 32 text

Summary and Conclusions
• Microstructure characterization and classification has been identified as an important step in building processing-structure-property linkages for the ultimate goal of materials by design.
• Stochastic descriptors were chosen over physical descriptors for microstructure representation.
• We:
– Developed a semi-supervised classification framework which identifies the largest "safe" subset to add to the training set (data-driven discovery)
– Defined and utilized a novel semi-supervised error estimation technique
• As we generalize our work to the broader microstructure classification problem, we must recognize that class taxonomy will be both ambiguous and dynamic.
$c_1(e) = \sum_{A :\, |A| \geq 2} \Big( e_A - \prod_{i \in A} e_i \Big)^2$
$\frac{\partial c}{\partial t} = \nabla \cdot M \nabla \frac{\delta F_{\mathrm{total}}}{\delta c}, \qquad \frac{\partial \eta}{\partial t} = -L \frac{\delta F_{\mathrm{total}}}{\delta \eta}$

Slide 33

Slide 33 text

Thanks
http://microstructures.net
