Causal structure and machine learning (in astronomy)

David W Hogg
December 13, 2019

Presented on 2019-12-13 in Ringberg Castle at #MLringberg2019.


Transcript

  1. Causal structure and machine learning
    David W Hogg (NYU) (MPIA) (Flatiron)
  2. Three concepts of causal structure
    • Generative vs discriminative models.
      ◦ (cf S. Villar)
    • Enforcing symmetries with graph structure.
      ◦ The “convolutional” in CNN, or the “recurrent” in RNN, for example.
    • Building models that represent our strong causal beliefs.
      ◦ As in “the image is blurred by the seeing, pixelized, and Poisson-sampled at the device”.
      ◦ (cf Lanusse, or Green)
  3. None
  4. But unitarity?
    • If the laws of physics are unitary, then there is no concept of intervention.
    • Therefore only certain meanings of the word “causal” are appropriate here.
    I think of the causal structure as being the physical dependencies in the data-generating process, representable by a directed graph.
  5. Generative vs discriminative models
    • Are you asking what function of your labels makes your data?
      ◦ x = A(y)
      ◦ Labels y generate the data x.
    • Or are you asking what function of your data makes your labels?
      ◦ y = B(x)
      ◦ Data x are transformed into labels y.
  6. Generative vs discriminative models
    Generative:
    • x = A(y) + noise.
    • Train with A := argmin_A || x - A(y) ||.
    • Test with y := argmin_y || x - A(y) ||.
    • The test step is like a pseudo-inverse of the forward model. Or an inference!
    • Can deal with missing data and non-trivial likelihood functions (heteroskedastic, for example). But the test step is an inference, effectively.
    Discriminative:
    • y = B(x) + prediction error.
    • Train with B := argmin_B || y - B(x) ||.
      ◦ plus regularization!
    • Test with y := B(x).
    • There is no inverse, not even a pseudo-inverse.
    • Test step is generally very fast!
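The remark about missing data and heteroskedastic likelihoods can be made concrete. The block below is a sketch under one assumption that is not on the slide: independent Gaussian noise with a known variance \sigma_{nd}^2 for datum d of training object n. Under that assumption the norms in the train and test steps become chi-squared sums, and a missing datum is handled by setting its inverse variance to zero so that it drops out of the sum.

```latex
% Assumption: independent Gaussian noise with known variances \sigma_{nd}^2.
% Training: fit the forward model A over the N training objects.
\hat{A} = \arg\min_{A} \sum_{n=1}^{N} \sum_{d=1}^{D}
          \frac{\left[x_{nd} - A(y_n)_d\right]^2}{\sigma_{nd}^2}

% Test: infer the labels of a new object (index *) with the trained model.
\hat{y}_{*} = \arg\min_{y} \sum_{d=1}^{D}
          \frac{\left[x_{*d} - \hat{A}(y)_d\right]^2}{\sigma_{*d}^2}

% Missing datum: set 1/\sigma_{nd}^2 = 0 for that (n, d).
```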
  7. Example: Linear models
    • Let the data x be D-dimensional and the labels y be K-dimensional.
    • Generative: x = A y + noise, where A is D × K.
      ◦ A has pseudo-inverse (A^T A)^{-1} A^T or something like that.
      ◦ Training data size N must be N > K.
    • Discriminative: y = B x + prediction error, where B is K × D.
      ◦ Training requires a regularization if N < D.
    • Conjecture: The generative model is always more accurate.
      ◦ This is even at optimal regularization amplitude.
      ◦ I can demonstrate this in a simple sandbox.
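A sandbox of the kind mentioned on this slide might look like the following. It is a minimal sketch, not the sandbox used in the talk: the function name `linear_sandbox`, the default sizes, the noise level, and the use of ridge regression as the regularized discriminative model are all assumptions made for illustration. The ridge amplitude is chosen to minimize the test error directly, which hands the discriminative model its optimal regularization.

```python
import numpy as np

def linear_sandbox(N=64, D=32, K=3, sigma=0.5, n_test=512, seed=17):
    """Compare generative and discriminative linear models on synthetic data.

    Returns (generative_rmse, discriminative_rmse) on held-out labels.
    All sizes and the noise level are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    A_true = rng.normal(size=(D, K))  # hypothetical true forward model

    def draw(n):
        y = rng.normal(size=(n, K))
        x = y @ A_true.T + sigma * rng.normal(size=(n, D))
        return x, y

    x_train, y_train = draw(N)
    x_test, y_test = draw(n_test)

    def rmse(y_hat):
        return float(np.sqrt(np.mean((y_hat - y_test) ** 2)))

    # Generative: train A by least squares on x = A y ...
    A_hat = np.linalg.lstsq(y_train, x_train, rcond=None)[0].T  # (D, K)
    # ... then test by applying the pseudo-inverse of the trained A.
    y_gen = x_test @ np.linalg.pinv(A_hat).T

    # Discriminative: ridge regression y = B x, scanning the amplitude and
    # keeping the best test error (i.e. optimal regularization).
    gram = x_train.T @ x_train
    cross = x_train.T @ y_train
    dis_rmse = min(
        rmse(x_test @ np.linalg.solve(gram + lam * np.eye(D), cross))
        for lam in np.logspace(-3, 3, 25)
    )
    return rmse(y_gen), dis_rmse

gen_rmse, dis_rmse = linear_sandbox()
print(f"generative RMSE {gen_rmse:.4f}, discriminative RMSE {dis_rmse:.4f}")
```

Changing `N` relative to `D` and `K` is the interesting experiment here; a single run like this illustrates the conjecture but does not prove it.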
  8. None
  9. None
  10. None
  11. None
  12. Example: Linear models
    • Both models (generative and discriminative) do well.
    • But the generative model does better.
      ◦ It saturates some bounds on inference.
      ◦ And the discriminative model required a tuned regularization.
    • Conjecture: No matter what the training-set size, the generative model is always more accurate.
      ◦ This is even at optimal regularization amplitude.
  13. Graph structure to enforce symmetries
    • There are results from math that say that (on graph NNs, anyway), any compact symmetry can be enforced on the model.
      ◦ Bruna, LeCun, others; see also Charnock: http://bit.ly/NeuralBiasModel
    • The C in CNN is about translational symmetry.
    • In many of our problems (cosmology, turbulence, galaxy images), the symmetries are exact.
      ◦ Would you believe a cosmological parameter inference that depends on how you translate or rotate the large-scale structure?
      ◦ Would you believe a cosmic shear estimate that isn’t covariant under rotations?
    • Because these issues are hard, many practitioners resort to data augmentation.
      ◦ But this only enforces the symmetries in the limit. And it’s far away!
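The point about the C in CNN can be checked numerically. The sketch below is an illustration with made-up data, not code from the talk: it assumes periodic boundary conditions (so that translation is an exact symmetry), and the helper names `circular_conv` and `pooled_feature` are hypothetical. A convolution commutes with translation by construction, a pooled convolutional feature is exactly invariant, and an unconstrained dense layer has neither property and would have to learn it from (augmented) data.

```python
import numpy as np

def circular_conv(signal, kernel):
    """Periodic convolution via the FFT (kernel is zero-padded to signal length)."""
    n = signal.size
    return np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel, n)))

def pooled_feature(signal, kernel):
    """Convolve, apply a ReLU, and sum over positions: a translation-invariant scalar."""
    return np.maximum(circular_conv(signal, kernel), 0.0).sum()

rng = np.random.default_rng(0)
signal = rng.normal(size=64)
kernel = rng.normal(size=5)
shift = 7

# Equivariance: convolving then translating equals translating then convolving.
conv_equivariant = np.allclose(
    np.roll(circular_conv(signal, kernel), shift),
    circular_conv(np.roll(signal, shift), kernel),
)

# Invariance: the pooled feature does not change under translation.
pool_invariant = np.isclose(
    pooled_feature(signal, kernel),
    pooled_feature(np.roll(signal, shift), kernel),
)

# A generic dense layer has no such structure built in.
dense_weights = rng.normal(size=(64, 64))
dense_equivariant = np.allclose(
    np.roll(dense_weights @ signal, shift),
    dense_weights @ np.roll(signal, shift),
)

print(conv_equivariant, pool_invariant, dense_equivariant)
```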
  14. Representing our beliefs about the physics
    • The stellar spectrum depends on Teff, log g, and element abundances.
    • The color of the star depends on its temperature and interstellar reddening, which in turn depends on its location in the Galaxy.
    • The galaxy image is sheared by the cosmological gravitational field, blurred by the Earth’s atmosphere, and pixelized by the detector.
  15. Extreme-precision radial velocities
    • It is now routine to measure stellar Doppler shifts at the m/s level.
    • Even at resolving power 100,000, this is 1/1000 of a pixel in the spectrograph.
    • Used to find or confirm many hundreds of extra-solar planets.
      ◦ And thousands more coming very soon.
    • Measured RVs are limited by our ability to model the atmosphere and star.
      ◦ (Not everyone would agree with this statement, but it’s a hill I’ll die on.)
    • The total signal-to-noise in typical data sets is immense.
      ◦ 100s of 100,000-pixel observations over many years with SNR of 100s each.
    • It was awarded the 2019 Nobel Prize.
      ◦ Mayor and Queloz
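The “1/1000 of a pixel” figure follows from a short calculation. The sketch below assumes about 3 pixels per resolution element, a sampling that is not stated on the slide; the function name is hypothetical. One resolution element spans a velocity of c / R, so at R = 100,000 it is about 3 km/s wide, and each pixel covers about 1 km/s.

```python
C_M_PER_S = 299_792_458.0  # speed of light in m/s

def rv_as_pixel_fraction(rv_m_per_s, resolving_power, pixels_per_resolution_element=3.0):
    """Express a radial velocity as a fraction of one spectrograph pixel.

    pixels_per_resolution_element is an assumed sampling, not a value from the talk.
    """
    resolution_velocity = C_M_PER_S / resolving_power  # velocity width of one resolution element
    pixel_velocity = resolution_velocity / pixels_per_resolution_element
    return rv_m_per_s / pixel_velocity

fraction = rv_as_pixel_fraction(1.0, 100_000)
print(f"1 m/s is about {fraction:.1e} of a pixel")
```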
  16. None
  17. wobble
    • Megan Bedell (Flatiron), Dan Foreman-Mackey (Flatiron), Ben Montet (Chicago), Rodrigo Luger (Flatiron).
    • All tested and operating on real HARPS data.
    • arXiv:1901.00503
  18. wobble
    • A spectrum has lines or shape from star, atmosphere, and spectrograph.
      ◦ This is causal structure because different spectra are taken at different relative velocities.
    • These lines have different rest frames.
      ◦ Doppler shift is a hard-coded symmetry of the model, because Duh.
    • And some or all of these components can vary with time.
      ◦ And the relative velocities of star, atmosphere, and spectrograph do too.
    • Linearized model for tractability (convexity).
    • Justifiable likelihood function to account properly for noise.
      ◦ We have a good noise model and the data are heteroskedastic.
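Two of these ingredients, the hard-coded Doppler symmetry and the heteroskedastic likelihood, can be illustrated with a toy fit. This is emphatically not the wobble implementation: the template, line list, noise levels, pixel spacing, and the brute-force velocity grid are all assumptions made for the sketch, and the template is taken as known rather than learned from the data. The one physical fact used is that a Doppler shift is a pure translation in log wavelength.

```python
import numpy as np

C_M_PER_S = 299_792_458.0  # speed of light in m/s

def doppler_shift(rv):
    """Shift in ln(wavelength) produced by radial velocity rv in m/s (relativistic)."""
    beta = rv / C_M_PER_S
    return 0.5 * np.log((1.0 + beta) / (1.0 - beta))

def make_template(rng, n_lines=20, width=2.0e-5):
    """Hypothetical rest-frame spectrum: unit continuum minus Gaussian absorption lines."""
    centers = rng.uniform(0.0005, 0.0060, n_lines)  # positions in ln(wavelength) offset
    depths = rng.uniform(0.2, 0.6, n_lines)

    def template(x):
        lines = depths * np.exp(-0.5 * ((x[:, None] - centers) / width) ** 2)
        return 1.0 - lines.sum(axis=1)

    return template

def fit_rv(x, flux, sigma, template, rv_grid):
    """Chi-squared over a velocity grid, with per-pixel uncertainties sigma."""
    chi2 = np.array([
        np.sum(((flux - template(x - doppler_shift(rv))) / sigma) ** 2)
        for rv in rv_grid
    ])
    return rv_grid[np.argmin(chi2)], chi2

rng = np.random.default_rng(42)
template = make_template(rng)
x = np.arange(2000) * 3.3e-6                 # ln(wavelength) offsets, ~1 km/s per pixel
true_rv = 120.0                              # m/s, a small fraction of a pixel
sigma = rng.uniform(0.001, 0.003, x.size)    # heteroskedastic per-pixel noise
flux = template(x - doppler_shift(true_rv)) + sigma * rng.normal(size=x.size)

rv_grid = np.arange(-500.0, 501.0, 2.0)
best_rv, chi2 = fit_rv(x, flux, sigma, template, rv_grid)
print(f"true RV {true_rv:.1f} m/s, best-fit RV {best_rv:.1f} m/s")
```

The fit recovers a shift that is a small fraction of a pixel because every line in the spectrum contributes information, each weighted by its own noise.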
  19. None
  20. None
  21. The stellar color-magnitude diagram
    • In the space of luminosity and temperature, stars lie in an amazingly structured and simple distribution.
      ◦ See, e.g., the last 150 years of astronomy.
      ◦ Main sequence, red-giant branch, white dwarfs, horizontal branch, red clump, binary sequences, and so on.
      ◦ Almost one-dimensional! (with thickness)
    • Theory does a great job! But small, systematic deviations.
      ◦ These lead to biases if you want to use the theory to measure stellar properties, for example.
    • If we can understand the CMD well, we can infer distances to all the stars!
    • Gaia changed the world.
      ◦ Data now “outweigh” theory in many respects.
  22. None
  23. De-noising Gaia
    • Lauren Anderson (Flatiron), Boris Leistedt (NYU), Axel Widmark (Stockholm), Keith Hawkins (Texas), and others.
    • ESA Gaia DR1 data
      ◦ This is out of date now, of course.
    • arXiv:1706.05055, arXiv:1705.08988, arXiv:1703.08112
  24. None
  25. None
  26. None
  27. None
  28. Model structure
    • Extremely flexible model for the true color-magnitude diagram.
      ◦ (The word “true” has many possible meanings here.)
      ◦ We could have used a deep-learning model here.
    • Correct use of the Gaia likelihood function (noise model).
      ◦ We didn’t have to cut out noisy or bad objects.
      ◦ Every star has its own individual noise properties (heteroskedastic).
    • The model knows that parallax and brightness both depend on distance!
      ◦ Didn’t have to learn that from the data.
      ◦ This is causal structure in my sense of this term.
      ◦ Technically, it is the embodiment of the symmetries of relativity and electromagnetism.
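The statement that parallax and brightness both depend on distance can be written down in a few lines. The sketch below uses the standard distance modulus and the parallax-distance relation, with a Gaussian likelihood for the observed parallax. It ignores interstellar dust and is not the model from the papers cited above; the function names are hypothetical. Note that the likelihood is perfectly well defined for noisy or even negative observed parallaxes, which is why no quality cuts are needed.

```python
import numpy as np

def distance_pc(apparent_mag, absolute_mag):
    """Distance in parsecs from the distance modulus m - M = 5 log10(d / 10 pc).

    Assumption: no dust extinction.
    """
    return 10.0 ** (0.2 * (apparent_mag - absolute_mag) + 1.0)

def predicted_parallax_mas(apparent_mag, absolute_mag):
    """Parallax in milliarcseconds implied by the same distance."""
    return 1000.0 / distance_pc(apparent_mag, absolute_mag)

def parallax_log_likelihood(observed_mas, sigma_mas, apparent_mag, absolute_mag):
    """Gaussian log-likelihood of a noisy parallax given a proposed absolute magnitude."""
    resid = observed_mas - predicted_parallax_mas(apparent_mag, absolute_mag)
    return -0.5 * (resid / sigma_mas) ** 2 - np.log(sigma_mas * np.sqrt(2.0 * np.pi))

# A star with m = 10 and a proposed M = 0 sits at 1000 pc, so its parallax is 1 mas.
print(predicted_parallax_mas(10.0, 0.0))

# A noisy (even negative) parallax still constrains M through the likelihood.
for proposed_M in (0.0, 2.0, 5.0):
    print(proposed_M, parallax_log_likelihood(-0.2, 0.8, 10.0, proposed_M))
```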
  29. Precision
    Our data-driven model was trained on and fit to the Gaia data but produced more precise results than the Gaia data. What gives?
    • Stationarity assumption.
      ◦ Stars are similar to one another; related to statistical shrinkage.
    • Use of high-quality noise model.
    • Enforcing physical symmetries.
      ◦ Lorentz invariance (the symmetries of electromagnetism and spacetime).
      ◦ (But no other use of physical models of stars, which we don’t fully believe.)
      ◦ There is much more we could do!
  30. Stellar spectroscopy
    • Stars (as observed) have only a few first-order parameters.
      ◦ Effective temperature, surface gravity, surface abundances (a few dominate).
    • These parameters reveal themselves in absorption-line strengths.
    • Stellar interior models and atmosphere models are amazingly detailed.
      ◦ Models predict spectra at the few-percent level or even better, depending on stuff.
      ◦ And yet, the data are so incredible that we can see their failures at immense significance.
    • Spectroscopy at resolving power 20,000 to 100,000 is the standard tool.
      ◦ (wavelength over delta-wavelength)
  31. None
  32. None
  33. The Cannon*
    • Melissa Ness (Columbia), Anna Ho (Caltech), Andy Casey (Monash), Jessica Birky (UW), Hans-Walter Rix (MPIA), Soledad Villar (NYU), and many others.
    • SDSS-III and SDSS-IV APOGEE spectroscopy data.
      ◦ These are remarkable projects of which I am very fortunate to be a part.
      ◦ We have also run on other kinds of data from other projects.
    • arXiv:1501.07604, arXiv:1603.03040, arXiv:1609.03195, arXiv:1609.02914, arXiv:1602.00303, arXiv:1511.08204 …
    * It’s named after the person, not the weapon!
  34. The Cannon
    • Training-set framework.
      ◦ Some stars have good labels, from somewhere!
      ◦ I’m going to call parameters “labels”.
    • Every spectral pixel brightness (expectation) is a simple function of labels.
    • Righteous likelihood function for the spectral pixel brightnesses.
      ◦ Fully heteroskedastic.
      ◦ Can deal with missing data and low SNR spectra.
    • Model training is maximum-likelihood.
    • Model execution on new data is also maximum-likelihood.
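The train and test steps listed on this slide can be sketched on synthetic data. This is not the code of The Cannon: a quadratic polynomial is assumed here as the “simple function” of the labels, the labels are assumed to be pre-scaled to order unity, and the helper names (`design`, `train_model`, `infer_labels`) and all data are made up. Training is a weighted least-squares fit per pixel, and the test step is a Gauss-Newton maximization of the same Gaussian likelihood over the labels. Missing pixels are handled by setting their inverse variance to zero.

```python
import numpy as np

def design(labels):
    """Quadratic features [1, y_k, y_j * y_k for j <= k] for each row of labels."""
    labels = np.atleast_2d(labels)
    n, K = labels.shape
    cols = [np.ones(n)] + [labels[:, k] for k in range(K)]
    cols += [labels[:, j] * labels[:, k] for j in range(K) for k in range(j, K)]
    return np.stack(cols, axis=1)

def train_model(flux, ivar, labels):
    """Maximum-likelihood coefficients per pixel (weighted least squares)."""
    X = design(labels)
    theta = np.empty((flux.shape[1], X.shape[1]))
    for p in range(flux.shape[1]):
        w = ivar[:, p]
        theta[p] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * flux[:, p]))
    return theta

def infer_labels(flux, ivar, theta, start, n_iter=20, h=1e-4):
    """Maximum-likelihood labels for one spectrum via Gauss-Newton iterations."""
    y = np.array(start, dtype=float)
    for _ in range(n_iter):
        resid = flux - design(y)[0] @ theta.T
        # Central differences are exact for a quadratic model.
        J = np.stack([
            (design(y + h * e)[0] - design(y - h * e)[0]) @ theta.T / (2.0 * h)
            for e in np.eye(y.size)
        ], axis=1)
        y = y + np.linalg.solve(J.T @ (ivar[:, None] * J), J.T @ (ivar * resid))
    return y

# Synthetic training set: 300 stars, 80 pixels, 2 labels (all values made up).
rng = np.random.default_rng(3)
n_train, n_pix, K = 300, 80, 2
theta_true = np.hstack([
    np.ones((n_pix, 1)),
    0.10 * rng.normal(size=(n_pix, K)),
    0.01 * rng.normal(size=(n_pix, 3)),
])
train_labels = rng.normal(size=(n_train, K))
train_sigma = rng.uniform(0.005, 0.02, size=(n_train, n_pix))
train_flux = design(train_labels) @ theta_true.T + train_sigma * rng.normal(size=(n_train, n_pix))
theta = train_model(train_flux, 1.0 / train_sigma**2, train_labels)

# Test star with five missing pixels (inverse variance set to zero).
true_labels = np.array([0.5, -0.3])
test_sigma = rng.uniform(0.005, 0.02, size=n_pix)
test_flux = design(true_labels)[0] @ theta_true.T + test_sigma * rng.normal(size=n_pix)
test_ivar = 1.0 / test_sigma**2
test_ivar[:5] = 0.0
inferred = infer_labels(test_flux, test_ivar, theta, start=np.zeros(K))
print(true_labels, inferred)
```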
  35. None
  36. None
  37. None
  38. None
  39. Conclusions
    • Generative vs discriminative models.
      ◦ (cf S. Villar)
      ◦ Generative models are more accurate, in at least some settings.
      ◦ Generative models better represent our beliefs.
    • Enforcing symmetries with graph structure.
    • Building models that represent our strong causal beliefs.
      ◦ (cf Lanusse, or Green)
      ◦ Creating state-of-the-art radial-velocity measurements for exoplanet discovery.
      ◦ De-noising Gaia data through hierarchical inference.
      ◦ Labeling stellar spectra more accurately than with physical models.