2.1. TESS Data

Data used to train and test the CNN are taken from Günther et al. (2020), who searched for flares in the first two sectors of the TESS mission. The light curves consist of integrated flux measurements taken at two-minute cadence over roughly 27 days; they were made publicly available with the first TESS data releases through the Mikulski Archive for Space Telescopes (MAST). Similarly to Günther et al. (2020), we split each light curve into individual orbits and normalized the Simple Aperture Photometry (SAP) flux separately for each orbit.

For supervised learning tasks, neural networks require uniformly sampled input data to train properly. For the inputs to the CNN implemented here, we used a data set of one-dimensional time series in which all elements have the same number of 2-minute cadences. We found that a length of 200 cadences provided enough information about the baseline flux surrounding a given flare; longer baselines often yielded high probabilities for both rotational signatures and flares, rather than flares alone. This length also provided ample flare and non-flare sets to train, validate, and test on. Following the methods of Pearson et al. (2018), we ensured all known flare peak times from the Günther et al. (2020) catalog were centered at the 100th cadence. Each of these light curve snippets is hereafter referred to as a "sample."

All of the steps discussed in this section (e.g., training and ensembling a series of CNN models) are incorporated into the open-source Python package stella.2 stella, and the CNN architecture described here, is specifically tailored to finding flares in TESS short-cadence light curves and should not be applied to other photometric time-series data.

2.2. Labels

We used a binary labeling scheme of "flare" and "non-flare" for the samples (see Figure 1 for examples of the samples). For the flare examples, we used the peak times of flares identified by Günther et al. (2020).
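The windowing described above can be sketched as follows. This is an illustration in plain NumPy, not stella's actual API; the function name and toy light curve are ours:

```python
import numpy as np

def extract_sample(flux, peak_index, length=200):
    """Cut a fixed-length window from a light curve so that the flare
    peak lands at the window's central cadence (cadence 100 of 200).
    Returns None if the window would run off either end of the orbit."""
    half = length // 2
    start = peak_index - half
    if start < 0 or start + length > flux.size:
        return None
    return flux[start:start + length]

# Toy example: a flat, orbit-normalized light curve with a flare-like
# spike at cadence 500.
flux = np.ones(1000)
flux[500] = 2.0
sample = extract_sample(flux, peak_index=500)
assert sample.size == 200 and sample[100] == 2.0  # peak at center
```

Windows too close to an orbit edge are skipped here rather than padded, since padding would break the uniform-sampling requirement described above.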
Non-flare samples were centered on locations in the light curves at least 100 cadences from a flare. Our final training set contains 5389 hand-labeled flare examples and 17684 non-flare examples, giving a data set with a 30% positive class. We then randomly divided the data set into training (80%), validation (10%), and test (10%) sets, and used the validation set to tune the network during training. Two caveats apply to the labels. First, not all flares, particularly those at low energy, were identified in the original catalog, and such flares therefore have a "non-flare" label in the training set (Figure 4; false negatives). Second, we found the catalog is off in peak flare time in some cases; those flares have been classified as false positives when evaluating the validation set, because the flare was not at the center cadence of the sample.

Figure 1. Samples in the training set. Using flares identified in Günther et al. (2020), we created a training set of non-flares (top) and flares (bottom), each of equal 200-cadence length. The light curves were not normalized. We include within the non-flare cases some examples of obvious spot modulation (upper right) so the CNN will ignore this variability and focus on the characteristic flare shape.

2.3. Network Architecture & Training

Our CNN architecture, shown in Figure 2, is implemented in tf.keras, which is TensorFlow's (Abadi et al. 2016) open-source, high-level implementation of the Keras API specification (Chollet et al. 2018). The network consists of a one-dimensional convolutional column with max pooling and dropout, the results of which are flattened and fed into a series of fully connected (or "dense") layers ending in a sigmoid function that produces an output in the range [0, 1]. This output loosely represents the "score" of how likely a given sample is to contain a flare. Pooling reduces the number of model parameters while increasing generalization (e.g., Lin et al. 2013).
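A minimal tf.keras sketch of an architecture of this kind, with layer sizes read off the Figure 2 labels (CONV-kernel-filters, MAXPOOL-2, DROPOUT 0.1). This is an illustration under our reading of the figure, not stella's exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_length=200):
    """One-dimensional convolutional column ending in a sigmoid score."""
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),   # one 200-cadence sample
        layers.Conv1D(16, 7, activation='relu', padding='same'),
        layers.MaxPooling1D(2),
        layers.Dropout(0.1),
        layers.Conv1D(64, 3, activation='relu', padding='same'),
        layers.MaxPooling1D(2),
        layers.Dropout(0.1),
        layers.Flatten(),
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.1),
        layers.Dense(1, activation='sigmoid'),   # score in (0, 1)
    ])
    return model

model = build_model()
```

The activation functions and `padding='same'` choice are our assumptions; the text specifies only the layer types, the dropout fraction, and the sigmoid output.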
Dropout helps prevent model over-fitting by randomly "dropping" (setting to zero) some fraction of the output neurons in a given layer during training, preventing the model from becoming overly dependent on any of its features (Srivastava et al. 2014).

Training a neural network involves inputting samples and then minimizing a cost function that measures how far off the network's predictions are from the truth. This is done through back propagation, which updates the model parameters to reduce the value of the cost function. For model training, we used the Adam optimization algorithm (Kingma & Ba 2014) to minimize the binary cross-entropy error function. The Adam optimizer was run with a learning rate of α = 10⁻³ (this controls the degree to which the weights are updated with each iteration), exponential decay rates of β₁ = 0.9 and β₂ = 0.999 (for the first and second moment estimates), and ε = 10⁻⁸ (a small number to prevent any division by zero in the implementation).

2.4. Model Evaluation

The exact model architecture, kernel sizes, etc. were chosen through trial and error to avoid over-fitting the model. Over-fitting was evaluated using four standard machine learning metrics: accuracy, precision, recall, and average precision. Accuracy is the fraction of correct classifications by the model for both classes (flares and non-flares) at a given threshold for deciding between the two classes.

Figure 2. The architecture of the stella CNN: light curves pass through CONV-7-16, MAXPOOL-2, DROPOUT 0.1, CONV-3-64, MAXPOOL-2, DROPOUT 0.1, FLATTEN, DENSE-32, and DROPOUT 0.1 layers to a sigmoid output in (0, 1).

Figure 14.
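For illustration, the threshold-dependent metrics can be computed as below. This is a toy NumPy re-implementation under our own function name, not stella's code; average precision is omitted since it summarizes performance over all thresholds rather than at one:

```python
import numpy as np

def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, and recall at a fixed decision threshold
    applied to the network's sigmoid scores."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    accuracy = np.mean(y_pred == y_true)         # both classes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels and scores, not real validation-set values.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.2, 0.7, 0.8, 0.1]
acc, prec, rec = classification_metrics(y_true, y_score)
```

Precision penalizes false positives and recall penalizes false negatives, so together they expose both of the label-noise failure modes noted in Section 2.2.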
Flare rates for our sample broken down by age and colored by effective temperature.