David W Hogg
December 13, 2019

# Causal structure and machine learning (in astronomy)

Presented on 2019-12-13 in Ringberg Castle at #MLringberg2019.

## Transcript

1. Causal structure and machine learning
David W Hogg
(NYU) (MPIA) (Flatiron)

2. Three concepts of causal structure
● Generative vs discriminative models.
○ (cf S. Villar)
● Enforcing symmetries with graph structure.
○ The “convolutional” in CNN, or the “recurrent” in RNN, for example.
● Building models that represent our strong causal beliefs.
○ As in “the image is blurred by the seeing, pixelized, and Poisson-sampled at the device”.
○ (cf Lanusse, or Green)

3. But unitarity?
● If the laws of physics are unitary, then there is no concept of intervention.
● Therefore only certain meanings of the word “causal” are appropriate here.
I think of the causal structure as being the physical dependencies in the
data-generating process, representable by a directed graph.

4. Generative vs discriminative models
● Generative: x = A(y)
○ Labels y generate the data x.
● Discriminative: y = B(x)
○ Data x are transformed into labels y.

5. Generative vs discriminative models
Generative:
● x = A(y) + noise.
● Train with A := argmin_A || x - A(y) ||.
● Test with y := argmin_y || x - A(y) ||.
● The test step is like a pseudo-inverse of
the forward model. Or an inference!
● Can deal with missing data and non-trivial
likelihood functions (heteroskedastic, for
example). But the test step is an inference,
effectively.
Discriminative:
● y = B(x) + prediction error.
● Train with B := argmin_B || y - B(x) ||.
○ plus regularization!
● Test with y := B(x).
● There is no inverse, not even a
pseudo-inverse.
● Test step is generally very fast!
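
The remark about missing data and heteroskedastic likelihoods can be written out. As a sketch, assume independent Gaussian noise with a known variance \sigma_d^2 for each of the D data dimensions (the slide only says "non-trivial likelihood functions", so the Gaussian form is an assumption). The generative test step is then a weighted least-squares problem, and a missing datum is handled by setting its weight 1/\sigma_d^2 to zero:

```latex
\hat{y} = \arg\min_{y} \sum_{d=1}^{D} \frac{\left[x_d - A_d(y)\right]^2}{\sigma_d^2}
```

The discriminative model y = B(x) has no equivalent move: B needs every input x_d, so missing or very noisy inputs have to be dealt with before the model is applied.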

6. Example: Linear models
● Let the data x be D-dimensional and the labels y be K-dimensional.
● Generative: x = A y + noise, where A is D x K
○ A has pseudo-inverse (A^T A)^{-1} A^T or something like that.
○ Training data size N must be N > K.
● Discriminative: y = B x + prediction error, where B is K x D
○ Training requires a regularization if N < D.
● Conjecture: The generative model is always more accurate.
○ This is even at optimal regularization amplitude.
○ I can demonstrate this in a simple sandbox.
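
A sandbox of this kind might look like the following. This is a minimal sketch, not the sandbox used for the talk: the dimensions, the noise level, the random seed, and the ridge (L2) regularization are all assumptions chosen for illustration. It trains both linear models on the same data and reports test error. The ridge amplitude is picked by looking at the test set, which is the most favorable possible tuning for the discriminative model.

```python
import numpy as np

rng = np.random.default_rng(42)
D, K, N, N_test = 50, 3, 30, 500   # data dim, label dim, train size, test size
sigma = 0.1                         # homoskedastic noise level (toy choice)

A_true = rng.normal(size=(D, K))    # hypothetical true forward model

def make_data(n):
    """Draw labels y and data x = A y + noise."""
    y = rng.normal(size=(n, K))
    x = y @ A_true.T + sigma * rng.normal(size=(n, D))
    return x, y

def rmse(y_hat, y):
    """Root-mean-square label error."""
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

x_train, y_train = make_data(N)
x_test, y_test = make_data(N_test)

# Generative: train A by least squares on x ~ A y (needs N >= K).
A_hat = np.linalg.lstsq(y_train, x_train, rcond=None)[0].T     # shape (D, K)
# Test step: least-squares solve, i.e. y = (A^T A)^{-1} A^T x.
y_gen = np.linalg.lstsq(A_hat, x_test.T, rcond=None)[0].T      # shape (N_test, K)

def ridge_fit(x, y, lam):
    """Discriminative: B (K, D) minimizing ||y - x B^T||^2 + lam ||B||^2."""
    gram = x.T @ x + lam * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y).T

# Scan regularization amplitudes; N < D here, so some regularization is needed.
lams = 10.0 ** np.arange(-3, 4)
rmse_dis_by_lam = [rmse(x_test @ ridge_fit(x_train, y_train, lam).T, y_test)
                   for lam in lams]

rmse_gen = rmse(y_gen, y_test)
rmse_dis = min(rmse_dis_by_lam)
print(f"generative RMSE:               {rmse_gen:.4f}")
print(f"discriminative RMSE (best lam): {rmse_dis:.4f}")
```

One run of a toy like this cannot establish the conjecture. It only makes the comparison reproducible, and changing N, D, K, and sigma shows how the two errors move.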

7. Example: Linear models
● Both models (generative and discriminative) do well.
● But the generative model does better.
○ It saturates some bounds on inference.
○ And the discriminative model required a tuned regularization.
● Conjecture: No matter what the training-set size, the generative model is
always more accurate.
○ This is even at optimal regularization amplitude.

8. Graph structure to enforce symmetries
● There are results from math that say that (on graph NNs, anyway), any
compact symmetry can be enforced on the model.
● The C in CNN is about translational symmetry.
● In many of our problems (cosmology, turbulence, galaxy images), the
symmetries are exact.
○ Would you believe a cosmological parameter inference that depends on how you translate or
rotate the large-scale structure?
○ Would you believe a cosmic shear estimate that isn’t covariant under rotations?
● Because these issues are hard, many practitioners resort to data
augmentation.
○ But this only enforces the symmetries in the limit. And it’s far away!
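
The difference between building a symmetry in and learning it from augmented data can be seen in one dimension. The sketch below assumes periodic boundaries (so "translation" means a cyclic shift) and uses a random kernel as a stand-in for learned weights. It checks that a circular convolution is translation-equivariant, and that pooling over all positions makes the output exactly translation-invariant, with no training and no augmentation.

```python
import numpy as np

def circular_conv(signal, kernel):
    """Circular (periodic) convolution of a 1-D signal with a short kernel."""
    n = len(signal)
    return np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel, n=n)))

def invariant_summary(signal, kernel):
    """Convolve, apply a pointwise nonlinearity (ReLU), then sum over positions."""
    features = np.maximum(circular_conv(signal, kernel), 0.0)
    return float(features.sum())

rng = np.random.default_rng(0)
signal = rng.normal(size=64)
kernel = rng.normal(size=5)      # stand-in for learned convolution weights
shift = 17

# Equivariance: shifting the input shifts the feature map by the same amount.
conv_then_shift = np.roll(circular_conv(signal, kernel), shift)
shift_then_conv = circular_conv(np.roll(signal, shift), kernel)

# Invariance: after pooling over positions, the shift has no effect at all.
summary_original = invariant_summary(signal, kernel)
summary_shifted = invariant_summary(np.roll(signal, shift), kernel)

print(np.allclose(conv_then_shift, shift_then_conv))
print(np.isclose(summary_original, summary_shifted))
```

Here the symmetry holds to floating-point precision for any kernel. A model trained with augmentation would only approach it as the number of augmented examples grows.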

9. Representing our beliefs about the physics
● The stellar spectrum depends on Teff, log g, and element abundances.
● The color of the star depends on its temperature and interstellar reddening,
which in turn depends on its location in the Galaxy.
● The galaxy image is sheared by the cosmological gravitational field, blurred
by the Earth’s atmosphere, and pixelized by the detector.
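
A causal belief like the last one can be written down directly as a forward model. The following is a minimal sketch under several assumptions that are not in the slides: a Gaussian point-spread function, periodic image boundaries, pixelization by summing square blocks, and Poisson sampling at the detector (as in the earlier slide). Shear is left out to keep it short. All function names are hypothetical.

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Blur with a Gaussian kernel of width sigma (pixels), periodic boundaries."""
    ny, nx = image.shape
    fy = np.fft.fftfreq(ny)[:, None]
    fx = np.fft.fftfreq(nx)[None, :]
    # Fourier transform of a unit-integral Gaussian, so total flux is preserved.
    kernel_ft = np.exp(-2.0 * np.pi**2 * sigma**2 * (fx**2 + fy**2))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * kernel_ft))

def pixelize(image, factor):
    """Sum factor x factor blocks of fine pixels into detector pixels."""
    ny, nx = image.shape
    blocks = image.reshape(ny // factor, factor, nx // factor, factor)
    return blocks.sum(axis=(1, 3))

def forward_model(scene, sigma, factor, rng):
    """Scene -> blurred by seeing -> pixelized -> Poisson-sampled counts."""
    blurred = gaussian_blur(scene, sigma)
    expected = np.clip(pixelize(blurred, factor), 0.0, None)  # guard tiny negatives
    counts = rng.poisson(expected)
    return expected, counts

rng = np.random.default_rng(5)
scene = np.full((64, 64), 1.0)    # flat sky background
scene[30, 33] += 5000.0           # one point source
expected, counts = forward_model(scene, sigma=2.0, factor=4, rng=rng)
print(expected.shape, counts.sum())
```

Each function corresponds to one arrow in the directed graph. Inference runs this model forward many times and compares the expected counts with the data through the Poisson likelihood.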

● It is now routine to measure stellar Doppler shifts at the m/s level.
● Even at resolving power 100,000, this is 1/1000 of a pixel in the spectrograph.
● Used to find or confirm many hundreds of extra-solar planets.
○ And thousands more coming very soon.
● Measured RVs are limited by our ability to model the atmosphere and star.
○ (Not everyone would agree with this statement, but it’s a hill I’ll die on.)
● The total signal-to-noise in typical data sets is immense.
○ 100s of 100,000-pixel observations over many years with SNR of 100s each.
● The radial-velocity method led to the 2019 Nobel Prize in Physics.
○ Mayor and Queloz
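
The "1/1000 of a pixel" figure above follows from simple arithmetic. The sketch below assumes a sampling of 3 pixels per resolution element; that number is not in the slides and is chosen as a typical value.

```python
C_M_PER_S = 299_792_458.0             # speed of light in m/s
resolving_power = 100_000.0           # R = wavelength / delta-wavelength
pixels_per_resolution_element = 3.0   # assumed detector sampling

# One resolution element spans a velocity width of c / R.
velocity_per_resolution_element = C_M_PER_S / resolving_power
velocity_per_pixel = velocity_per_resolution_element / pixels_per_resolution_element

# A 1 m/s Doppler shift, expressed as a fraction of a pixel.
shift_in_pixels = 1.0 / velocity_per_pixel

print(f"{velocity_per_resolution_element:.0f} m/s per resolution element")
print(f"{velocity_per_pixel:.0f} m/s per pixel")
print(f"1 m/s is {shift_in_pixels:.5f} pixels")
```

With these numbers a resolution element is about 3 km/s wide and a pixel about 1 km/s wide, so a 1 m/s shift is about one thousandth of a pixel.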

11. wobble
● Megan Bedell (Flatiron), Dan Foreman-Mackey (Flatiron), Ben Montet
(Chicago), Rodrigo Luger (Flatiron).
● All tested and operating on real HARPS data.
● arXiv:1901.00503

12. wobble
● A spectrum has lines or shape from star, atmosphere, and spectrograph.
○ This is causal structure because different spectra are taken at different relative velocities.
● These lines have different rest frames.
○ Doppler shift is a hard-coded symmetry of the model, because Duh.
● And some or all of these components can vary with time.
○ And the relative velocities of star, atmosphere, and spectrograph do too.
● Linearized model for tractability (convexity).
● Justifiable likelihood function to account properly for noise.
○ We have a good noise model and the data are heteroskedastic.
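
The structure on this slide can be sketched in a few lines. This is a toy illustration and not the wobble code or its API: the templates are single Gaussian lines, the spectrograph component is omitted, and the components are assumed to add in log flux (one way to get a model that is linear in the templates). The Doppler shift is hard-coded as a translation in ln(wavelength), and the fit uses an inverse-variance-weighted chi-squared.

```python
import numpy as np

C_KM_PER_S = 299_792.458  # speed of light

def doppler_log_shift(v_km_per_s):
    """Shift in ln(wavelength) for a radial velocity (positive = receding)."""
    beta = v_km_per_s / C_KM_PER_S
    return 0.5 * np.log((1.0 + beta) / (1.0 - beta))

def model_log_flux(log_wave, grid, star_template, telluric_template, v_star):
    """Star template Doppler-shifted to v_star, plus tellurics in the observer frame."""
    star = np.interp(log_wave - doppler_log_shift(v_star), grid, star_template)
    tellurics = np.interp(log_wave, grid, telluric_template)
    return star + tellurics   # log fluxes add because the fluxes multiply

def chi_squared(data, ivar, model):
    """Heteroskedastic Gaussian objective: per-pixel inverse-variance weights."""
    return float(np.sum(ivar * (data - model) ** 2))

def gaussian_line(grid, center, depth, width):
    """One absorption line in log flux (toy template)."""
    return -depth * np.exp(-0.5 * ((grid - center) / width) ** 2)

# Toy templates on a fine ln(wavelength) grid near 5000 Angstroms.
grid = np.linspace(np.log(5000.0), np.log(5010.0), 20001)
star_template = gaussian_line(grid, np.log(5004.0), 0.5, 2e-5)
telluric_template = gaussian_line(grid, np.log(5007.0), 0.3, 2e-5)

# Simulate one observation at 30 km/s with a different noise level in every pixel.
rng = np.random.default_rng(3)
v_true = 30.0
sigma = rng.uniform(0.005, 0.02, size=grid.size)
data = model_log_flux(grid, grid, star_template, telluric_template, v_true)
data = data + sigma * rng.normal(size=grid.size)
ivar = 1.0 / sigma**2

# Scan trial velocities; only the stellar component moves.
trial_v = np.linspace(29.0, 31.0, 21)
chi2 = [chi_squared(data, ivar,
                    model_log_flux(grid, grid, star_template, telluric_template, v))
        for v in trial_v]
v_best = float(trial_v[int(np.argmin(chi2))])
print(f"best-fit velocity: {v_best:.1f} km/s")
```

In the toy the templates are known. In the setting described on the slide they are learned from many epochs, which is possible because the star and the atmosphere sit at different relative velocities in different exposures.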

13. The stellar color-magnitude diagram
● In the space of luminosity and temperature, stars lie in an amazingly
structured and simple distribution.
○ See, e.g., the last 150 years of astronomy.
○ Main sequence, red-giant branch, white dwarfs, horizontal branch, red clump, binary
sequences, and so on.
○ Almost one-dimensional! (with thickness)
● Theory does a great job! But small, systematic deviations.
○ These lead to biases if you want to use the theory to measure stellar properties, for example.
● If we can understand the CMD well, we can infer distances to all the stars!
● Gaia changed the world.
○ Data now “outweigh” theory in many respects.

14. De-noising Gaia
● Lauren Anderson (Flatiron), Boris Leistedt (NYU), Axel Widmark (Stockholm),
Keith Hawkins (Texas), and others.
● ESA Gaia DR1 data
○ This is out of date now, of course.
● arXiv:1706.05055, arXiv:1705.08988, arXiv:1703.08112

15. Model structure
● Extremely flexible model for the true color-magnitude diagram.
○ (The word “true” has many possible meanings here.)
○ We could have used a deep-learning model here.
● Correct use of the Gaia likelihood function (noise model).
○ We didn’t have to cut out noisy or bad objects.
○ Every star has its own individual noise properties (heteroskedastic).
● The model knows that parallax and brightness both depend on distance!
○ Didn’t have to learn that from the data.
○ This is causal structure in my sense of this term.
○ Technically, it is the embodiment of the symmetries of relativity and electromagnetism.
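
The statement that parallax and brightness both depend on distance fits in a few lines. This is a minimal sketch and not the model from the papers: it ignores dust extinction, assumes Gaussian noise on both parallax and apparent magnitude, and uses hypothetical function names.

```python
import numpy as np

def predicted_observables(distance_pc, absolute_mag):
    """Both observables are deterministic functions of the same distance."""
    parallax_mas = 1000.0 / distance_pc
    apparent_mag = absolute_mag + 5.0 * np.log10(distance_pc / 10.0)
    return parallax_mas, apparent_mag

def log_likelihood(distance_pc, absolute_mag,
                   obs_parallax, sigma_parallax, obs_mag, sigma_mag):
    """Gaussian log-likelihood (up to a constant) with per-star noise levels."""
    parallax, mag = predicted_observables(distance_pc, absolute_mag)
    return (-0.5 * ((obs_parallax - parallax) / sigma_parallax) ** 2
            - 0.5 * ((obs_mag - mag) / sigma_mag) ** 2)

# A star at 100 pc with absolute magnitude 4.8 (noiseless toy observation).
parallax_100, mag_100 = predicted_observables(100.0, 4.8)

# Maximum-likelihood distance on a grid, given the absolute magnitude.
distances = np.linspace(50.0, 200.0, 1501)
log_like = log_likelihood(distances, 4.8,
                          obs_parallax=parallax_100, sigma_parallax=0.3,
                          obs_mag=mag_100, sigma_mag=0.02)
best_distance = float(distances[int(np.argmax(log_like))])
print(parallax_100, mag_100, best_distance)
```

The distance modulus and the parallax relation are built in and never learned. What the data-driven model learns is the distribution of absolute magnitudes and colors.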

16. Precision
Our data-driven model was trained on and fit to the Gaia data but produced more
precise results than the Gaia data. What gives?
● Stationarity assumption.
○ Stars are similar to one another; related to statistical shrinkage.
● Use of high-quality noise model.
● Enforcing physical symmetries.
○ Lorentz invariance (the symmetries of electromagnetism and spacetime).
○ (But no other use of physical models of stars, which we don’t fully believe.)
○ There is much more we could do!
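
The shrinkage part of this answer can be shown with a one-dimensional toy. The sketch assumes a Gaussian population with known mean and width and Gaussian noise that differs from star to star. In a real hierarchical inference the population would be learned from the data, and none of the numbers below come from the papers.

```python
import numpy as np

rng = np.random.default_rng(1)
n_stars = 2000
pop_mean, pop_sigma = 0.0, 0.2      # hypothetical population of true values

truth = rng.normal(pop_mean, pop_sigma, n_stars)
noise_sigma = rng.uniform(0.1, 1.0, n_stars)        # heteroskedastic noise
observed = truth + noise_sigma * rng.normal(size=n_stars)

# Gaussian prior times Gaussian likelihood gives a Gaussian posterior.
post_var = 1.0 / (1.0 / pop_sigma**2 + 1.0 / noise_sigma**2)
post_mean = post_var * (pop_mean / pop_sigma**2 + observed / noise_sigma**2)

rmse_raw = float(np.sqrt(np.mean((observed - truth) ** 2)))
rmse_shrunk = float(np.sqrt(np.mean((post_mean - truth) ** 2)))
print(f"raw measurements: RMSE = {rmse_raw:.3f}")
print(f"shrunk estimates: RMSE = {rmse_shrunk:.3f}")
```

Every posterior variance is smaller than the corresponding measurement variance. In that sense the output can be more precise than the input: each star borrows information from all the others.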

17. Stellar spectroscopy
● Stars (as observed) have only a few first-order parameters.
○ Effective temperature, surface gravity, surface abundances (a few dominate).
● These parameters reveal themselves in absorption-line strengths.
● Stellar interior models and atmosphere models are amazingly detailed.
○ Models predict spectra at the few-percent level or even better, depending on stuff.
○ And yet, the data are so incredible that we can see their failures at immense significance.
● Spectroscopy at resolving power 20,000 to 100,000 is the standard tool.
○ (wavelength over delta-wavelength)

18. The Cannon*
● Melissa Ness (Columbia), Anna Ho (Caltech), Andy Casey (Monash), Jessica
Birky (UW), Hans-Walter Rix (MPIA), Soledad Villar (NYU), and many others.
● SDSS-III and SDSS-IV APOGEE spectroscopy data.
○ These are remarkable projects of which I am very fortunate to be a part.
○ We have also run on other kinds of data from other projects.
● arXiv:1501.07604, arXiv:1603.03040, arXiv:1609.03195, arXiv:1609.02914,
arXiv:1602.00303, arXiv:1511.08204 …
* It’s named after the person, not the weapon!

19. The Cannon
● Training-set framework.
○ Some stars have good labels, from somewhere!
○ I’m going to call parameters “labels”.
● Every spectral pixel brightness (expectation) is a simple function of labels.
● Righteous likelihood function for the spectral pixel brightnesses.
○ Fully heteroskedastic.
○ Can deal with missing data and low SNR spectra.
● Model training is maximum-likelihood.
● Model execution on new data is also maximum-likelihood.
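
The two maximum-likelihood steps can be sketched as follows. This is an illustration of the structure and not The Cannon's code: the "simple function" is taken to be linear in the labels (an assumption made so that both steps have closed forms), the spectra are simulated, and all names are hypothetical. Missing pixels are handled by setting their inverse variance to zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n_train, n_labels, n_pix = 200, 2, 100
theta_true = rng.normal(size=(n_pix, n_labels + 1))   # toy per-pixel coefficients

def design(labels):
    """Design matrix: a constant term plus the labels."""
    return np.hstack([np.ones((labels.shape[0], 1)), labels])

def simulate(labels):
    """Toy spectra with a different noise level in every pixel of every star."""
    sigma = rng.uniform(0.01, 0.03, size=(labels.shape[0], n_pix))
    flux = design(labels) @ theta_true.T + sigma * rng.normal(size=sigma.shape)
    return flux, 1.0 / sigma**2

def train(labels, flux, ivar):
    """Training step: per-pixel weighted least squares for the coefficients."""
    M = design(labels)
    theta = np.empty((flux.shape[1], M.shape[1]))
    for p in range(flux.shape[1]):
        w = ivar[:, p]
        theta[p] = np.linalg.solve(M.T @ (w[:, None] * M), M.T @ (w * flux[:, p]))
    return theta

def infer_labels(theta, flux, ivar):
    """Test step: weighted least squares for the labels of one star."""
    offset, slopes = theta[:, 0], theta[:, 1:]
    lhs = slopes.T @ (ivar[:, None] * slopes)
    rhs = slopes.T @ (ivar * (flux - offset))
    return np.linalg.solve(lhs, rhs)

# Train on stars with known labels.
train_labels = rng.normal(size=(n_train, n_labels))
train_flux, train_ivar = simulate(train_labels)
theta_hat = train(train_labels, train_flux, train_ivar)

# Label a new star whose first 30 pixels are missing (zero inverse variance).
true_labels = rng.normal(size=(1, n_labels))
test_flux, test_ivar = simulate(true_labels)
test_ivar[0, :30] = 0.0
inferred = infer_labels(theta_hat, test_flux[0], test_ivar[0])
print(true_labels[0], inferred)
```

With a nonlinear function of the labels the training step stays a linear solve per pixel as long as the model is linear in its coefficients, but the test step becomes a nonlinear optimization over the labels.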

20. Conclusions
● Generative vs discriminative models.
○ (cf S. Villar)
○ Generative models are more accurate, in at least some settings.
○ Generative models better represent our beliefs.
● Enforcing symmetries with graph structure.
● Building models that represent our strong causal beliefs.
○ (cf Lanusse, or Green)
○ Creating state-of-the-art radial-velocity measurements for exoplanet discovery.
○ De-noising Gaia data through hierarchical inference.
○ Labeling stellar spectra more accurately than with physical models.