Feinstein et al. (2020): Finding Stellar Flares in TESS Data
The light curves used for training and testing the CNN are taken from Günther et al. (2020), who searched for flares in the first two sectors of the TESS mission. The light curves consist of integrated flux measurements taken at two-minute cadence over roughly 27 days; they were made publicly available with the first TESS data releases through the Mikulski Archive for Space Telescopes (MAST). Similarly to Günther et al. (2020), we split each light curve into individual orbits and normalized the Simple Aperture Photometry flux (SAP flux) separately for each orbit.
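Per-orbit normalization of this kind can be sketched as follows; the function name `normalize_per_orbit` and the gap threshold are our own illustrative choices, not stella's API:

```python
import numpy as np

def normalize_per_orbit(time, flux, gap_threshold=0.5):
    """Split a light curve at large data gaps (e.g. the TESS
    orbit gap) and divide the SAP flux in each orbit by its
    median, so every orbit is normalized around 1."""
    # A gap larger than the threshold (in days) marks an orbit boundary.
    breaks = np.where(np.diff(time) > gap_threshold)[0] + 1
    out = np.empty_like(flux, dtype=float)
    for chunk in np.split(np.arange(len(time)), breaks):
        out[chunk] = flux[chunk] / np.median(flux[chunk])
    return out
```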
For supervised learning tasks, neural networks require
input data that are uniformly sampled to train prop-
erly. For the inputs to the CNN implemented here, we
used a data set of one-dimensional time series where all
elements have the same number of 2-minute cadences.
We found that a length of 200 cadences provided enough
information about the baseline flux surrounding a given
flare. Longer baselines often predicted high probabilities
for both rotational signatures and flares instead of just
flares. This baseline also provided ample flare and non-
flare sets to train, validate, and test on. Following the
methods of Pearson et al. (2018), we ensured all known flare peak times from the Günther et al. (2020) catalog were centered at the 100th cadence. Each of these light curve snippets is hereafter referred to as a "sample." All of the steps discussed in this section (e.g. training and ensembling a series of CNN models) are incorporated into the open-source Python package stella. stella and the CNN architecture described here are specifically tailored for finding flares in TESS short-cadence light curves and should not be applied to other photometric time-series data.
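The windowing described above (200-cadence samples with the flare peak at index 100) can be sketched as follows; the function name and the edge handling are illustrative, not stella's:

```python
import numpy as np

def extract_sample(flux, peak_idx, length=200, center=100):
    """Cut a fixed-length window from a light curve so that the
    flare peak falls at the center cadence (index 100 of 200)."""
    start = peak_idx - center
    end = start + length
    if start < 0 or end > len(flux):
        return None  # peak too close to an edge for a full window
    return flux[start:end]
```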
2.2. Labels
We used a binary labeling scheme of “flare” and “non-
flare” for the samples (see Figure 1 for examples of the
samples). For the flare examples, we used the peak times
of flares identified by Günther et al. (2020). Non-flare
samples were centered on locations in the light curves
at least 100 cadences from a flare. Our final training set
contains 5389 hand-labeled flare examples and 17684 non-flare examples, for a 30% positive class data
set. We then randomly divided the data set into train-
ing (80%), validation (10%), and test (10%) sets. We
used the validation set to tune the network and training parameters. In doing so, we identified two shortcomings of the labels. First, some flares, particularly those at low energy, were not identified in the original catalog and therefore have a "non-flare" label in the training set (Figure 4; false negatives). Second, we found the catalog is off in peak flare time for some cases; these flares have been classified as false positives when evaluating the validation set, because the flare was not at the center cadence of the sample.
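An 80%/10%/10% random split of sample indices can be sketched as follows; the function name and seed are our own choices, not stella's implementation:

```python
import numpy as np

def split_dataset(n_samples, seed=42):
    """Randomly assign sample indices to 80% training,
    10% validation, and 10% test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.8 * n_samples)
    n_val = int(0.1 * n_samples)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])
```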
Figure 1. Samples in the training set. Using flares iden-
tified in Günther et al. (2020), we created a training set of
non-flares (top) and flares (bottom), each of equal 200 ca-
dence length. The light curves were not normalized. We
include within the non-flare cases some examples of obvious
spot modulation (upper right) so the CNN will ignore this
variability and focus on the characteristic flare shape.
2.3. Network Architecture & Training
Our CNN architecture, shown in Figure 2, is im-
plemented in tf.keras, which is TensorFlow’s (Abadi
et al. 2016) open source, high-level implementation of
the Keras API specification (Chollet & others 2018).
The network consists of a one-dimensional convolutional
column with global max pooling and dropout, the results
of which are flattened and fed into a series of fully con-
nected (or “dense”) layers ending in a sigmoid function
that produces an output in the range [0,1]. This out-
put loosely represents the “score” of how likely a given
Young stellar activity
number of model parameters while increasing general-
ization (e.g., Lin et al. 2013). Dropout helps prevent
model over-fitting by randomly “dropping” (or setting
to zero) some fraction of the output neurons in a given
layer during training to prevent the model from becom-
ing overly dependent on any of its features (Srivastava
et al. 2014).
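As an illustration, the "inverted" dropout variant used by modern frameworks can be written in a few lines; this is our own sketch, not the tf.keras implementation:

```python
import numpy as np

def dropout(x, rate=0.1, training=True, rng=None):
    """Inverted dropout: zero a random fraction `rate` of the
    activations during training and rescale the survivors so the
    expected value is unchanged; do nothing at inference time."""
    if not training or rate == 0.0:
        return x
    rng = rng or np.random.default_rng()
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)
```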
Training neural networks involves inputting samples
and then minimizing a cost function that measures how
far off the network's predictions are from the truth. This
is done through back propagation, which updates the
model parameters to reduce the value of the cost func-
tion. For model training, we used the Adam optimiza-
tion algorithm (Kingma & Ba 2014) to minimize the bi-
nary cross-entropy error function. The Adam optimizer
was run with a learning rate of α = 10⁻³ (this controls the degree to which the weights are updated with each iteration), exponential decay rates of β₁ = 0.9 and β₂ = 0.999 (for the first and second moment estimates), and ε = 10⁻⁸ (a small number to prevent any division by zero in the implementation).
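For reference, a single Adam update with these hyperparameters can be written out explicitly. This numpy sketch follows Kingma & Ba (2014); it is not the TensorFlow implementation:

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the
    gradient (m) and squared gradient (v), bias-corrected,
    then a scaled step on the parameters."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```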
2.4. Model Evaluation
The exact model architecture, kernel sizes, etc. were
chosen based on a trial and error approach to avoid over-
fitting the model. Over-fitting was evaluated using four
standard machine learning metrics: accuracy, precision,
recall, and average precision. Accuracy is the fraction
of correct classifications by the model for both classes
(flares and non-flares), at a given threshold for deciding the class.
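Accuracy, precision, and recall at a fixed decision threshold can be computed as below (average precision, the area under the precision-recall curve, is omitted); the function is our own sketch:

```python
import numpy as np

def accuracy_precision_recall(y_true, scores, threshold=0.5):
    """Binary classification metrics at a fixed decision threshold.
    y_true holds 0/1 labels; scores are the CNN sigmoid outputs."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))   # flares correctly flagged
    fp = np.sum(y_pred & (y_true == 0))   # non-flares flagged as flares
    fn = np.sum(~y_pred & (y_true == 1))  # flares that were missed
    acc = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return acc, precision, recall
```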
Figure 2. The architecture of the stella CNN: LIGHT CURVES → CONV-7-16 → MAXPOOL-2 → DROPOUT 0.1 → CONV-3-64 → MAXPOOL-2 → DROPOUT 0.1 → FLATTEN → DENSE-32 → DROPOUT 0.1 → SIGMOID OUTPUT in (0, 1).
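One plausible reading of the Figure 2 layer stack in tf.keras, compiled with the Adam settings of Section 2.3, is sketched below. The CONV-&lt;kernel&gt;-&lt;filters&gt; interpretation, the ReLU activations, and the padding are our assumptions, not the stella source:

```python
import tensorflow as tf

def build_model():
    """CNN matching the Figure 2 layer stack, reading
    e.g. CONV-7-16 as kernel size 7 with 16 filters."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(16, 7, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Conv1D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # score in (0, 1)
    ])
    # Adam hyperparameters quoted in Section 2.3.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3,
                                           beta_1=0.9, beta_2=0.999,
                                           epsilon=1e-8),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model
```

The input shape is inferred on the first call, e.g. a batch of 200-cadence samples with shape `(batch, 200, 1)`.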
Figure 14. Flare rates for our sample broken down by age and colored by effective temperature, where purple bins represent
• stellar flares affect the early stages of exoplanet evolution
• study flare rates in young stars
• current methods remove low-amplitude flares
• find stellar flares with an ensemble CNN
See also:
* Rusticus: A Transit Detection Algorithm Based on
Recurrent Neural Networks