Machine-assisted discovery in astronomy

UMN journal club presentation given by K. Willett
Discussing results of Graham et al. (2013)

Kyle Willett

March 07, 2013

Transcript

  1. Machine-assisted discovery of relationships in astronomy. Matthew J. Graham, S. G. Djorgovski, Ashish A. Mahabal, Ciro Donalek, and Andrew J. Drake (arXiv:1302.5129v1). Kyle Willett, MIfA Journal Club, March 7, 2013.
  2. [image-only slide]

  3. Their conclusions
     • Discovering scientifically significant relationships in large, high-dimensional datasets is non-trivial
     • New algorithms and software have now successfully reproduced known astronomical results
     • Interpretation and evaluation of physical significance are still roles for human scientists
  4. Why is this important?
     • New surveys (LSST, SKA) will be producing data with:
       • many thousands of parameters per object
       • millions to billions of objects
       • time-domain observations
       • tens of TB of data NIGHTLY
     • Discovering the full set of physical relationships in such large catalogs is not possible for humans alone - we need help.
  5. What’s wrong with current techniques? Anscombe’s quartet: four datasets with identical mean of x, mean of y, variance of x, variance of y, correlation(x,y), and linear regression fit. Image: wikimedia.org
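
To make the point concrete, here is a minimal Python sketch (not part of the original slides) that computes the summary statistics for the four published Anscombe datasets; all four come out essentially identical even though the scatterplots look nothing alike.

```python
import numpy as np

# Anscombe's quartet: four (x, y) sets with nearly identical summary statistics
# but very different structure when plotted.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

for i, (x, y) in enumerate([(x123, y1), (x123, y2), (x123, y3), (x4, y4)], start=1):
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line
    print(f"set {i}: mean_x={x.mean():.2f} mean_y={y.mean():.2f} "
          f"var_x={x.var(ddof=1):.2f} var_y={y.var(ddof=1):.2f} "
          f"r={np.corrcoef(x, y)[0, 1]:.3f} fit: y={slope:.2f}x+{intercept:.2f}")
```

All four sets return roughly mean_x = 9, mean_y = 7.5, var_x = 11, var_y = 4.1, r = 0.82, and the fit y = 0.5x + 3, which is exactly the failure mode the slide illustrates.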
  6. “Maximal information coefficient” (MIC): a non-parametric method to measure relationships between variables (Reshef et al. 2011)
     • MIC = 0 -> statistically independent (pure noise)
     • MIC = 1 -> perfect dependence (no noise)
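
As an illustration of the two limiting cases, the sketch below uses the open-source minepy package (a third-party implementation of the MINE statistics of Reshef et al. 2011, not code from the paper) to estimate MIC for pure noise and for a noiseless nonlinear dependence; the exact numbers depend on sample size and estimator settings.

```python
import numpy as np
from minepy import MINE  # third-party MINE/MIC implementation

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 1000)

mine = MINE(alpha=0.6, c=15)

# Independent samples: MIC should be small (near 0 for large samples)
mine.compute_score(x, rng.uniform(-1, 1, 1000))
print("independent:", mine.mic())

# Noiseless nonlinear dependence: MIC should be close to 1,
# even though the Pearson correlation of x and x**2 is ~0 here
mine.compute_score(x, x**2)
print("dependent:  ", mine.mic())
```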
  7. If there is a relationship present, what form does it take? Schmidt & Lipson (2009)
  8. An improved method of choosing your model: symbolic regression
     1. Load a data set (split into “training” and “validation” subsets)
     2. Choose the functional form of the relationship to model, e.g. z = f(x,y)
     3. Select possible mathematical building blocks: +-, */, sin(), log(), etc.
     4. Use genetic algorithms to explore the metric space and find invariant quantities
     5. Output a list of potential fits, ranked according to both accuracy and complexity
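
Eureqa itself is a closed tool, so the following sketch walks through the same five steps with gplearn's SymbolicRegressor as a stand-in, on synthetic data; the data set, functional form, and parameter choices are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor  # stand-in for Eureqa

rng = np.random.default_rng(0)

# 1. "Load" a data set: a synthetic z = sin(x) + 0.5*y with a little noise
X = rng.uniform(-3, 3, size=(500, 2))
z = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.05, 500)

# 2-3. Model z = f(x, y) and choose the allowed building blocks
est = SymbolicRegressor(
    population_size=2000,
    generations=20,
    function_set=('add', 'sub', 'mul', 'div', 'sin'),
    parsimony_coefficient=0.01,   # penalizes complexity (the accuracy/complexity trade-off)
    random_state=0,
)

# 4. Genetic programming explores candidate expressions on a training subset
est.fit(X[:400], z[:400])

# 5. Inspect the best expression and check it on held-out data
print(est._program)
print("validation R^2:", est.score(X[400:], z[400:]))
```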
  9. http://creativemachines.cornell.edu/eureqa

  10. Hertzsprung-Russell diagram

  11. Fundamental plane of elliptical galaxies
      • Djorgovski & Davis (1987):
      • Symbolic regression: SDSS
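
As a rough illustration of what recovering the fundamental plane from a catalog involves, here is a hedged sketch that fits a plane log Re = a*log sigma + b*log Ie + c to synthetic columns by ordinary least squares; the coefficients, scatter, and column values are made up for the toy data and are not the Djorgovski & Davis (1987) or SDSS numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
log_sigma = rng.normal(2.2, 0.15, n)      # toy log10 velocity dispersion
log_Ie = rng.normal(2.5, 0.3, n)          # toy log10 mean effective surface brightness
a_true, b_true, c_true = 1.4, -0.8, -2.0  # assumed toy coefficients, not published values
log_Re = a_true * log_sigma + b_true * log_Ie + c_true + rng.normal(0, 0.05, n)

# Least-squares fit of the plane in log space
A = np.column_stack([log_sigma, log_Ie, np.ones(n)])
coeffs, *_ = np.linalg.lstsq(A, log_Re, rcond=None)
print("recovered (a, b, c):", np.round(coeffs, 2))
```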
  12. Binary classifications
      • Similar techniques can be used to determine if an object falls in a particular class (or not)
        • star vs. galaxy (SDSS pipeline)
        • RR Lyrae vs. W UMa stars
        • CV vs. blazar
        • Supernovae: Type Ia vs. core-collapse
      • Searching for relationships among 30-60 dimensions (no. of parameters)
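
A minimal sketch of this kind of binary classification, assuming a scikit-learn random forest on a synthetic 40-parameter feature catalog; the features and class labels are placeholders, not the actual survey pipelines named above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a light-curve feature catalog:
# 40 features per object, two classes (e.g. RR Lyrae vs. W UMa)
X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Feature importances give a first look at which of the 40 parameters
# actually drive the classification
top = np.argsort(clf.feature_importances_)[::-1][:5]
print("top features:", top)
```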
  13. Limitations
      • Different techniques (MIC, Eureqa, SBR, mRMR) can yield different results for the same input.
      • Efficient discovery requires prepared datasets, choices of variables, weightings and fit metrics, and guesses at functional forms
      • Methods only apply for invariant quantities expressible as partial differential equations. Will not work for:
        • fractal behavior
        • chaotic or stochastic activity
      • Current statistics are still limited to bivariate relationships
  14. Analyze the consistency of the features (data parameters) selected by these various techniques
      • sequential backwards ranking (SBR)
      • minimum-redundancy-maximum-relevance (mRMR)
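
For concreteness, here is a small sketch of sequential backward ranking on a toy feature set: at each step, the feature whose removal costs the least cross-validated accuracy is discarded, which yields an ordering of the features by relevance. The classifier and data are illustrative stand-ins; mRMR would additionally penalize features that are redundant with ones already kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy feature catalog: 15 parameters, only a few of which matter
X, y = make_classification(n_samples=2000, n_features=15, n_informative=4,
                           random_state=0)

remaining = list(range(X.shape[1]))
ranking = []  # features in the order they are discarded (least useful first)

# Sequential backward ranking: repeatedly drop the feature whose removal
# hurts cross-validated accuracy the least
while len(remaining) > 1:
    scores = []
    for f in remaining:
        subset = [g for g in remaining if g != f]
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        scores.append((cross_val_score(clf, X[:, subset], y, cv=3).mean(), f))
    best_score, least_useful = max(scores)
    remaining.remove(least_useful)
    ranking.append(least_useful)

ranking.append(remaining[0])  # last survivor = most relevant feature
print("features from least to most relevant:", ranking)
```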
  15. Conclusions (embellished)
      • Discovering scientifically significant relationships in large, high-dimensional datasets is non-trivial
        • These tools will be necessary to fully analyze the data from LSST and SKA
      • New algorithms and software have now successfully reproduced known astronomical results
        • MIC: a non-parametric measure of dependence for pairs of variables
        • Symbolic regression: ranks candidate functional forms and fits their coefficients
      • Interpretation and evaluation of physical significance are still roles for human scientists
        • The discovery process is becoming machine-assisted, but we still need astrophysicists to supply the data, analyze results, and provide context.