Three concepts of causal structure ● Generative vs discriminative models. ○ (cf S. Villar) ● Enforcing symmetries with graph structure. ○ The “convolutional” in CNN, or the “recurrent” in RNN, for example. ● Building models that represent our strong causal beliefs. ○ As in “the image is blurred by the seeing, pixelized, and Poisson-sampled at the device”. ○ (cf Lanusse, or Green)

But unitarity? ● If the laws of physics are unitary, then there is no concept of intervention. ● Therefore only certain meanings of the word “causal” are appropriate here. I think of the causal structure as being the physical dependencies in the data-generating process, representable by a directed graph.

Generative vs discriminative models ● Are you asking what function of your labels makes your data? ○ x = A(y) ○ Labels y generate the data x. ● Or are you asking what function of your data makes your labels? ○ y = B(x) ○ Data x are transformed into labels y.

Generative vs discriminative models Generative: ● x = A(y) + noise . ● Train with A := argmin_A || x - A(y) || . ● Test with y := argmin_y || x - A(y) || . ● The test step is like a pseudo-inverse of the forward model. Or an inference! ● Can deal with missing data and non-trivial likelihood functions (heteroskedastic, for example). But the test step is an inference, effectively. Discriminative: ● y = B(x) + prediction error ● Train with B := argmin_B || y - B(x) || . ○ plus regularization! . ● Test with y := B(x) . ● There is no inverse, not even a pseudo-inverse. ● Test step is generally very fast!

Example: Linear models ● Let the data x be D-dimensional and the labels y be K-dimensional. ● Generative: x = A y + noise, where A is D x K ○ A has pseudo-inverse (ATA)-1 AT or something like that. ○ Training data size N must be N > K. ● Discriminative y = B x + prediction error, where B is K x D ○ Training requires a regularization if N < D. ● Conjecture: The generative model is always more accurate. ○ This is even at optimal regularization amplitude. ○ I can demonstrate this in a simple sandbox.

Example: Linear models ● Both models (generative and discriminative) do well. ● But the generative model does better. ○ It saturates some bounds on inference. ○ And the discriminative model required a tuned regularization. ● Conjecture: No matter what the training-set size, the generative model is always more accurate. ○ This is even at optimal regularization amplitude.

Graph structure to enforce symmetries ● There are results from math that say that (on graph NNs, anyway), any compact symmetry can be enforced on the model. ○ Bruna, LeCun, others; see also Charnock: http://bit.ly/NeuralBiasModel ● The C in CNN is about translational symmetry. ● In many of our problems (cosmology, turbulence, galaxy images), the symmetries are exact. ○ Would you believe a cosmological parameter inference that depends on how you translate or rotate the large-scale structure? ○ Would you believe a cosmic shear estimate that isn’t covariant under rotations? ● Because these issues are hard, many practitioners resort to data augmentation. ○ But this only enforces the symmetries in the limit. And it’s far away!

Representing our beliefs about the physics ● The stellar spectrum depends on Teff, log g, and element abundances. ● The color of the star depends on its temperature and interstellar reddening, which in turn depends on its location in the Galaxy. ● The galaxy image is sheared by the cosmological gravitational field, blurred by the Earth’s atmosphere, and pixelized by the detector.

Extreme-precision radial velocities ● It is now routine to measure stellar Doppler shifts at the m/s level. ● Even at resolving power 100,000, this is 1/1000 of a pixel in the spectrograph. ● Used to ﬁnd or conﬁrm many hundreds of extra-solar planets. ○ And thousands more coming very soon. ● Measured RVs are limited by our ability to model the atmosphere and star. ○ (Not everyone would agree with this statement, but it’s a hill I’ll die on.) ● The total signal-to-noise in typical data sets is immense. ○ 100s of 100,000-pixel observations over many years with SNR of 100s each. ● It was awarded the 2019 Nobel Prize. ○ Mayor and Queloz

wobble ● Megan Bedell (Flatiron), Dan Foreman-Mackey (Flatiron), Ben Montet (Chicago), Rodrigo Luger (Flatiron). ● All tested and operating on real HARPS data. ● arXiv:1901.00503

wobble ● A spectrum has lines or shape from star, atmosphere, and spectrograph. ○ This is causal structure because different spectra are taken at different relative velocities. ● These lines have different rest frames. ○ Doppler shift is a hard-coded symmetry of the model, because Duh. ● And some or all of these components can vary with time. ○ And the relative velocities of star, atmosphere, and spectrograph do too. ● Linearized model for tractability (convexity). ● Justiﬁable likelihood function to account properly for noise. ○ We have a good noise model and the data are heteroskedastic.

The stellar color-magnitude diagram ● In the space of luminosity and temperature, stars lie in an amazingly structured and simple distribution. ○ See, eg, the last 150 years of astronomy. ○ Main sequence, red-giant branch, white dwarfs, horizontal branch, red clump, binary sequences, and so on. ○ Almost one-dimensional! (with thickness) ● Theory does a great job! But small, systematic deviations. ○ These lead to biases if you want to use the theory to measure stellar properties, for example. ● If we can understand the CMD well, we can infer distances to all the stars! ● Gaia changed the world. ○ Data now “outweigh” theory in many respects.

De-noising Gaia ● Lauren Anderson (Flatiron), Boris Leistedt (NYU), Axel Widmark (Stockholm), Keith Hawkins (Texas), and others. ● ESA Gaia DR1 data ○ This is out of date now, of course. ● arXiv:1706.05055, arXiv:1705.08988, arXiv:1703.08112

Model structure ● Extremely ﬂexible model for the true color-magnitude diagram. ○ (The word “true” has many possible meanings here.) ○ We could have used a deep-learning model here. ● Correct use of the Gaia likelihood function (noise model). ○ We didn’t have to cut out noisy or bad objects. ○ Every star has its own individual noise properties (heteroskedastic). ● The model knows that parallax and brightness both depend on distance! ○ Didn’t have to learn that from the data. ○ This is causal structure in my sense of this term. ○ Technically, it is the embodiment of the symmetries of relativity and electromagnetism.

Precision Our data-driven model was trained on and ﬁt to the Gaia data but produced more precise results than the Gaia data. What gives? ● Stationarity assumption. ○ Stars are similar to one another; related to statistical shrinkage. ● Use of high-quality noise model. ● Enforcing physical symmetries. ○ Lorentz invariance (the symmetries of electromagnetism and spacetime). ○ (But no other use of physical models of stars, which we don’t fully believe.) ○ There is much more we could do!

Stellar spectroscopy ● Stars (as observed) have only a few ﬁrst-order parameters. ○ Effective temperature, surface gravity, surface abundances (a few dominate). ● These parameters reveal themselves in absorption-line strengths. ● Stellar interior models and atmosphere models are amazingly detailed. ○ Models predict spectra at the few-percent level or even better, depending on stuhh. ○ And yet, the data are so incredible that we can see their failures at immense signiﬁcance. ● Spectroscopy at resolving power 20,000 to 100,000 is the standard tool. ○ (wavelength over delta-wavelength)

The Cannon* ● Melissa Ness (Columbia), Anna Ho (Caltech), Andy Casey (Monash), Jessica Birky (UW), Hans-Walter Rix (MPIA), Soledad Villar (NYU), and many others. ● SDSS-III and SDSS-IV APOGEE spectroscopy data. ○ These are remarkable projects of which I am very fortunate to be a part. ○ We have also run on other kinds of data from other projects. ● arXiv:1501.07604, arXiv:1603.03040, arXiv:1609.03195, arXiv:1609.02914, arXiv:1602.00303, arXiv:1511.08204 … * It’s named after the person, not the weapon!

The Cannon ● Training-set framework. ○ Some stars have good labels, from somewhere! ○ I’m going to call parameters “labels”. ● Every spectral pixel brightness (expectation) is a simple function of labels. ● Righteous likelihood function for the spectral pixel brightnesses. ○ Fully heteroskedastic. ○ Can deal with missing data and low SNR spectra. ● Model training is maximum-likelihood. ● Model execution on new data is also maximum-likelihood.

Conclusions ● Generative vs discriminative models. ○ (cf S. Villar) ○ Generative models are more accurate, in at least some settings. ○ Generative models better represent our beliefs. ● Enforcing symmetries with graph structure. ● Building models that represent our strong causal beliefs. ○ (cf Lanusse, or Green) ○ Creating state-of-the-art radial-velocity measurements for exoplanet discovery. ○ De-noising Gaia data through hierarchical inference. ○ Labeling stellar spectra more accurately than with physical models.