Save 37% off PRO during our Black Friday Sale! »

Data-driven models in the Gaia era (at #LennartFest)

7feb7bbc3605d995c6099de0e25b4b99?s=47 David W Hogg
September 01, 2017

Data-driven models in the Gaia era (at #LennartFest)


David W Hogg

September 01, 2017


  1. Data-driven models in the era of Gaia David W. Hogg

    (NYU) (Flatiron) (MPIA), and Lauren Anderson (Flatiron), Keith Hawkins (Columbia), Boris Leistedt (NYU), Melissa Ness (MPIA), Hans-Walter Rix (MPIA)
  2. Thank you, Gaia • Thank you for the early data

    release (DR1) and steady data releases. • Impact will be huge (it already is). • We recognize and appreciate how much work these early releases are. ◦ (But can we also get trial data to, say, train new models? cf. Steinmetz)
  3. Gaia Sprints • Hack for one intense week on the

    project of your choosing. • Enforced policy of openness. • Already produced 12 refereed papers! ◦ (including all Gaia results in this talk) • Next one is the week of 2018 June 03 in New York City. ◦ We will pay travel expenses for Gaia team members. ◦
  4. (my) Gaia Mission • My vision: A precise parallax for

    every star of the billion! • But: Gaia parallaxes are only precise for nearby stars. • But: Gaia delivers amazingly precise spectrophotometry.
  5. (my) Gaia Mission • Calibrate stellar models at close distances?

    • Use those models for photometric parallaxes at all distances? • But: I don’t trust the numerical simulations!
  6. The astrometrist’s view of the world • Geometry > Physics

    • Physics > Numerical simulations of stars ◦ (even spectroscopic radial velocity measurements are suspect!)
  7. What can I contribute? • You don’t have to use

    physics to build an accurate stellar model. • Data > Numerical simulations of stars!
  8. Statistical shrinkage • If you observe a billion related objects,

    every object can contribute some kind of information to your beliefs about every other one.
  9. Causal structure • To capitalize on shrinkage, you must impose

    the causal structure in which you strongly believe. • For example: Geometry & relativity. • For example: Gaia noise model.
  10. Graphical models

  11. Anderson et al 2017 arXiv:1706.05055 • Flexible mixture-of-Gaussian model for

    the noise-deconvolved color–magnitude diagram. • Using Gaia TGAS parallax and 2MASS photometric noise (uncertainties) responsibly. • Using rigid dust model (from Green et al). • ...Then use the CMD model to get improved parallaxes.
  12. None
  13. None
  14. None
  15. None
  16. None
  17. Hawkins et al 2017 arXiv:1705.08988 • How precise are red-clump

    stars as standard candles? • Build a mixture model for RC stars and contaminants. • Fit for mean and dispersion of RC absolute magnitudes, taking account of the TGAS and photometric uncertainties. • ...Find 0.17 mag dispersion.
  18. Hawkins et al 2017 arXiv:1705.08988

  19. None
  20. Leistedt et al 2017 arXiv:1703.08112 • Similar to Anderson et

    al, but fully Bayesian. • Model is less flexible, but it is tractable as a sampling problem. • ...Now distance posteriors are fully marginalized with respect to CMD models!
  21. None
  22. None
  23. So: Just throw machine learning at the problem? • No!

    ◦ missing data. ◦ heteroskedasticity. ◦ generalizability. • Every good data-driven model will be bespoke.
  24. Statistical shrinkage • A data-driven model can be far more

    precise than the data on which it was trained. • (But not more accurate.)
  25. Statistical philosophy • Pragmatism reigns. ◦ Full Bayes (eg, Leistedt

    et al). ◦ Maximum marginalized likelihood (eg, Anderson et al). ◦ Maximum likelihood (eg, Ness et al). • The important thing is the causal structure, not the statistical philosophy.
  26. Ness et al 2017 arXiv:1701.07829 • Use high-SNR APOGEE spectra

    as training set. • Train The Cannon (Ness et al 2015) to get detailed chemical abundances. • Apply to low-SNR APOGEE spectra. • ...Find far more precise chemical homogeneity among cluster stars than in the training data. ◦ (also: better results at lower SNR)
  27. None
  28. None
  29. None
  30. Aside: Proper motions are like parallaxes • Proper motions decrease

    with distance like parallaxes. • With a position–velocity model for the MW, they can be combined. ◦ cf. Floor’s talk; cf. “reduced proper motion” ◦ At large distances (and 10-year mission) we expect proper motions might dominate information.
  31. Fundamental assumption of data-driven models • Stationarity. • ie: The

    causal structure is correct. • ie: All non-trivial dependencies are represented in the graphical model.
  32. Assumptions can be tested • By construction, data-driven models are

    easy to validate. • When the causal structure is insufficient, the failures appear in simple validations or visualizations.
  33. Example: Halo stars are different from Disk stars • Different

    distributions of metallicity -> different color–magnitude diagrams. • Solution: Add kinematics and Galactocentric distance into the graphical model, and permit the model to discover this.
  34. Summary • There is no longer any reason to use

    numerical stellar models to generate photometric parallaxes. • The billion-star catalog plus statistical shrinkage will deliver enormous precision (and accuracy), better than any physics models. • Data > Numerical models of stars.