Machine-assisted discovery in astronomy

UMN journal club presentation given by K. Willett
Discussing results of Graham et al. (2013)

Kyle Willett

March 07, 2013

Transcript

  1. Machine-assisted discovery of relationships in astronomy
    Matthew J. Graham, S. G. Djorgovski, Ashish A. Mahabal, Ciro Donalek, and Andrew J. Drake
    arXiv:1302.5129v1
    Kyle Willett, MIfA Journal Club, March 7, 2013
  2. 2

  3. Their conclusions
    • Discovering scientifically significant relationships in large, high-dimensional datasets is non-trivial
    • New algorithms and software have now successfully reproduced known astronomical results
    • Interpretation and evaluation of physical significance are still roles for human scientists
  4. Why is this important?
    • New surveys (LSST, SKA) will be producing data with:
      • many thousands of parameters per object
      • millions to billions of objects
      • time-domain observations
      • tens of TB of data NIGHTLY
    • Discovering the full set of physical relationships in such large catalogs is not possible for humans alone - we need help.
  5. What’s wrong with current techniques?
    Anscombe’s quartet (image: wikimedia.org)
    Identical:
    • mean of x
    • mean of y
    • variance of x
    • variance of y
    • correlation(x,y)
    • linear regression
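
The point of the Anscombe's quartet slide can be checked numerically. The sketch below uses the standard published quartet values (Anscombe 1973); the names `QUARTET`, `summarize`, and `SUMMARIES` are illustrative and not from the talk. It computes each quantity the slide lists as identical, using sample variances (dividing by n - 1).

```python
import math

# Anscombe's quartet (Anscombe 1973). Sets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed on the slide for one dataset."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx  # ordinary least-squares fit of y on x
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),  # sample variance
        "var_y": syy / (n - 1),
        "corr": sxy / math.sqrt(sxx * syy),  # Pearson correlation
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four sets agree on every statistic to about two decimal places, even though plotting them shows a noisy line, a parabola, a line with one outlier, and a vertical cluster with one leverage point.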
  6. A non-parametric method to measure relationships between variables
    “Maximal information coefficient” (MIC), Reshef et al. (2011)
    MIC = 0 -> statistically independent (pure noise)
    MIC = 1 -> perfect dependence (no noise)
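
To make the 0-to-1 scale concrete, here is a rough sketch of the idea behind MIC: bin the two variables on a grid, compute the mutual information, normalize it by log2 of the smaller grid dimension, and keep the best score over many grid shapes. This is a simplification and not the algorithm of Reshef et al. (2011): it only tries equal-frequency bins, whereas the published method also optimizes where the bin edges go, so it can only underestimate the true MIC. The grid budget of n**0.6 is the default I recall from that paper and should be treated as an assumption; all function names are hypothetical.

```python
import bisect
import math
import random
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k rank-based bins; tied values share a bin."""
    ordered = sorted(values)
    n = len(values)
    return [bisect.bisect_left(ordered, v) * k // n for v in values]


def mutual_information(bx, by):
    """Mutual information (in bits) between two lists of discrete labels."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum((c / n) * math.log2(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


def mic_sketch(x, y, exponent=0.6):
    """Max normalized mutual information over equal-frequency grids.

    Simplified stand-in for MIC: grids satisfy nx * ny <= n**exponent
    (assumed default), and bin edges are NOT optimized as in the real method.
    """
    budget = len(x) ** exponent
    max_bins = int(budget // 2)
    bins_x = {k: equal_frequency_bins(x, k) for k in range(2, max_bins + 1)}
    bins_y = {k: equal_frequency_bins(y, k) for k in range(2, max_bins + 1)}
    best = 0.0
    for nx in range(2, max_bins + 1):
        for ny in range(2, int(budget // nx) + 1):
            mi = mutual_information(bins_x[nx], bins_y[ny])
            best = max(best, mi / math.log2(min(nx, ny)))
    return best


rng = random.Random(0)  # fixed seed so the demo is repeatable
xs = [rng.uniform(-1, 1) for _ in range(500)]
parabola = [v * v for v in xs]                      # noiseless, non-linear
noise = [rng.uniform(-1, 1) for _ in range(500)]    # independent of xs

score_parabola = mic_sketch(xs, parabola)
score_noise = mic_sketch(xs, noise)
print("parabola:", round(score_parabola, 2), "noise:", round(score_noise, 2))
```

The parabola is a useful test case because its Pearson correlation with x is close to zero on a symmetric interval, yet a grid-based score still detects the dependence.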
  7. An improved method of choosing your model: symbolic regression
    1. Load a data set (split into “training” and “validation” subsets)
    2. Choose the functional form of the relationship to model, e.g. z = f(x,y)
    3. Select possible mathematical building blocks: +, -, *, /, sin(), log(), etc.
    4. Use genetic algorithms to explore the metric space and find invariant quantities
    5. Output a list of potential fits, ranked according to both accuracy and complexity
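
The five steps can be illustrated with a toy example. Note the substitution: instead of a genetic algorithm (as used by tools such as Eureqa), this sketch exhaustively enumerates every small expression tree, which is only feasible for tiny search spaces but keeps the example deterministic. The building blocks are limited to +, -, and *, the target z = x*y + x is synthetic, and all names are hypothetical.

```python
import itertools
import operator
import random

# Step 3: mathematical building blocks (a deliberately small set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = {"x": lambda x, y: x, "y": lambda x, y: y, "1": lambda x, y: 1.0}


def combine(op, fa, fb):
    """Build the function that evaluates `fa op fb`."""
    return lambda x, y: op(fa(x, y), fb(x, y))


def enumerate_expressions(depth):
    """Step 4 (brute force, not genetic): all expression trees up to `depth`."""
    exprs = {text: (fn, 1) for text, fn in TERMINALS.items()}
    for _ in range(depth):
        grown = dict(exprs)
        for (ta, (fa, na)), (tb, (fb, nb)) in itertools.product(exprs.items(), repeat=2):
            for symbol, op in OPS.items():
                grown[f"({ta} {symbol} {tb})"] = (combine(op, fa, fb), na + nb + 1)
        exprs = grown
    return [(text, fn, size) for text, (fn, size) in exprs.items()]


def mse(fn, rows):
    """Mean squared error of z = fn(x, y) over (x, y, z) rows."""
    return sum((fn(x, y) - z) ** 2 for x, y, z in rows) / len(rows)


def symbolic_regression(rows, depth=2, top_k=5, train_fraction=2 / 3):
    """Rank candidates by training error, then complexity (node count)."""
    split = int(len(rows) * train_fraction)          # step 1: train/validation
    train, validation = rows[:split], rows[split:]
    scored = [(round(mse(fn, train), 9), size, text, fn)
              for text, fn, size in enumerate_expressions(depth)]
    scored.sort(key=lambda c: (c[0], c[1]))          # step 5: accuracy, complexity
    return [(text, size, err, mse(fn, validation))
            for err, size, text, fn in scored[:top_k]]


rng = random.Random(0)
points = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(60)]
rows = [(x, y, x * y + x) for x, y in points]        # step 2: z = f(x, y)

ranked = symbolic_regression(rows)
for text, size, train_err, val_err in ranked:
    print(f"{text:20s} nodes={size} train={train_err:.2e} validation={val_err:.2e}")
```

Several algebraically equivalent forms tie for first place, which echoes the point in the talk that the software ranks candidates but a person still has to decide which form is physically meaningful.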
  8. Binary classifications
    • Similar techniques can be used to determine if an object falls in a particular class (or not)
      • star vs. galaxy (SDSS pipeline)
      • RR Lyrae vs. W UMa stars
      • CV vs. blazar
      • Supernovae: Type Ia vs. core-collapse
    • Searching for relationships among 30-60 dimensions (no. of parameters)
    Figure panels: RR Lyrae, W UMa
  9. Limitations
    • Different techniques (MIC, Eureqa, SBR, mRMR) can yield different results for the same input.
    • Efficient discovery requires prepared datasets; choices of variables, weighting, and fit metrics; and guesses at functional forms
    • Methods only apply to invariant quantities expressible as partial differential equations. Will not work for:
      • fractal behavior
      • chaotic or stochastic activity
    • Current statistics are still limited to bivariate relationships
  10. • Analyze the consistency of the features (data parameters) from these various techniques
    • sequential backwards ranking
    • minimum-redundancy-maximum-relevance
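
As an illustration of the second technique, here is a minimal sketch of greedy minimum-redundancy-maximum-relevance (mRMR) ranking for discrete features. It uses the common "difference" criterion (mutual information with the class label, minus the mean mutual information with the features already chosen). The synthetic data, the feature names, and the helper functions are all assumptions made for the example and are not taken from Graham et al. (2013).

```python
import math
import random
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in bits) between two lists of discrete values."""
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    return sum((c / n) * math.log2(c * n / (pa[i] * pb[j]))
               for (i, j), c in joint.items())


def mrmr_rank(features, labels):
    """Greedy mRMR ranking; `features` maps a name to a list of discrete values."""
    relevance = {name: mutual_information(column, labels)
                 for name, column in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        best_name, best_score = None, -math.inf
        for name in remaining:
            redundancy = 0.0
            if selected:
                # Mean overlap with features that were already picked.
                redundancy = sum(mutual_information(features[name], features[s])
                                 for s in selected) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:  # strict '>' keeps the earlier name on ties
                best_name, best_score = name, score
        selected.append(best_name)
        remaining.remove(best_name)
    return selected


rng = random.Random(1)  # fixed seed so the demo is repeatable
n = 2000
labels = [rng.randrange(2) for _ in range(n)]  # hypothetical binary classes


def noisy_copy(values, flip_probability):
    """Copy binary values, flipping each with the given probability."""
    return [v ^ (rng.random() < flip_probability) for v in values]


strong = noisy_copy(labels, 0.10)
features = {
    "strong": strong,                      # agrees with the label ~90% of the time
    "strong_copy": list(strong),           # exact duplicate: fully redundant
    "weak": noisy_copy(labels, 0.20),      # less informative but independent noise
    "noise": [rng.randrange(2) for _ in range(n)],
}

ranking = mrmr_rank(features, labels)
print(ranking)
```

Ranking by relevance alone would place the duplicate second; penalizing redundancy promotes the weaker but independent feature instead, which is the behavior that makes comparing rankings across techniques informative.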
  11. Conclusions (embellished)
    • Discovering scientifically significant relationships in large, high-dimensional datasets is non-trivial
    • These tools will be necessary to fully analyze the data from LSST and SKA
    • New algorithms and software have now successfully reproduced known astronomical results
      • MIC: a non-parametric measure of dependence for pairs of variables
      • Symbolic regression: ranks their functional forms and fit coefficients
    • Interpretation and evaluation of physical significance are still roles for human scientists
    • The discovery process is becoming machine-assisted, but we still need astrophysics to supply the data, analyze results, and provide context.