Slide 1

Slide 1 text

Machine-assisted discovery of relationships in astronomy
Matthew J. Graham, S. G. Djorgovski, Ashish A. Mahabal, Ciro Donalek, and Andrew J. Drake
arXiv:1302.5129v1
Kyle Willett
MIfA Journal Club
March 7, 2013

Slide 2

Slide 2 text


Slide 3

Slide 3 text

Their conclusions
• Discovering scientifically significant relationships in large, high-dimensional datasets is non-trivial
• New algorithms and software have now successfully reproduced known astronomical results
• Interpretation and evaluation of physical significance are still roles for human scientists

Slide 4

Slide 4 text

Why is this important?
• New surveys (LSST, SKA) will produce data with:
  • many thousands of parameters per object
  • millions to billions of objects
  • time-domain observations
  • tens of TB of data NIGHTLY
• Discovering the full set of physical relationships in such large catalogs is not possible for humans alone - we need help.

Slide 5

Slide 5 text

What’s wrong with current techniques?
Anscombe’s quartet
Identical across all four datasets:
• mean of x
• mean of y
• variance of x
• variance of y
• correlation(x,y)
• linear regression
Image: wikimedia.org
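A minimal sketch of why the quartet is a warning: the summary statistics listed on this slide can be computed directly for the four published Anscombe datasets. The helper name `summarize` and the dictionary layout are illustrative choices, not part of the talk.

```python
from statistics import mean

# Anscombe's quartet (Anscombe 1973): four small (x, y) datasets.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics named on the slide (hypothetical helper)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx  # ordinary least-squares slope of y on x
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),  # sample variance
        "var_y": syy / (n - 1),
        "corr": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows agree to about two decimal places, even though the scatter plots look nothing alike. Summary statistics alone cannot reveal the form of a relationship.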

Slide 6

Slide 6 text

A non-parametric method to measure relationships between variables
“maximal information coefficient” (MIC), Reshef et al. (2011)
• MIC = 0 -> statistically independent (pure noise)
• MIC = 1 -> perfect dependence (no noise)
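To give a feel for the idea, here is a simplified, MIC-like score. It is not the Reshef et al. algorithm: the real MIC optimizes where the grid lines are placed, whereas this sketch assumes equal-frequency bins. It keeps the core recipe, which is to bin the data on grids of limited size, compute the mutual information, normalize by log(min(bins)), and take the maximum. The function names and the grid-size exponent of 0.6 are assumptions for illustration.

```python
import math
import random
from collections import Counter


def rank_bins(values, k):
    """Assign each value to one of k equal-frequency bins, by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def mutual_information(bx, by):
    """Mutual information (in nats) between two binned variables."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def mic_like(x, y, exponent=0.6):
    """Maximum normalized mutual information over grids with kx*ky <= n**exponent."""
    max_cells = len(x) ** exponent
    best = 0.0
    kx = 2
    while kx * 2 <= max_cells:
        ky = 2
        while kx * ky <= max_cells:
            mi = mutual_information(rank_bins(x, kx), rank_bins(y, ky))
            best = max(best, mi / math.log(min(kx, ky)))
            ky += 1
        kx += 1
    return best


rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(500)]
print("linear   :", round(mic_like(xs, xs), 2))
print("parabola :", round(mic_like(xs, [v * v for v in xs]), 2))
print("noise    :", round(mic_like(xs, [rng.uniform(-1, 1) for _ in xs]), 2))
```

The parabola has near-zero linear correlation with x, yet it scores close to the linear case here, while independent noise scores near zero.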

Slide 7

Slide 7 text

If there is a relationship present, what form does it take?
Schmidt & Lipson (2009)

Slide 8

Slide 8 text

An improved method of choosing your model: symbolic regression
1. Load a data set (split into “training” and “validation” subsets)
2. Choose the functional form of the relationship to model, e.g. z = f(x,y)
3. Select possible mathematical building blocks: +, -, *, /, sin(), log(), etc.
4. Use genetic algorithms to explore the metric space and find invariant quantities
5. Output a list of potential fits, ranked according to both accuracy and complexity
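The steps above can be sketched in a toy form. This sketch swaps the genetic algorithm for an exhaustive search over small expression trees, which is only feasible because the search space here is tiny, and it is not how Eureqa works internally. The hidden relationship z = x*y + x, the building blocks, and all function names are made up for the demonstration.

```python
import itertools
import operator
import random

# Step 3: mathematical building blocks (kept small for this toy example).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr, x, y):
    """Evaluate an expression tree: a variable name or (op, left, right)."""
    if expr == "x":
        return x
    if expr == "y":
        return y
    op, left, right = expr
    return OPS[op](evaluate(left, x, y), evaluate(right, x, y))


def size(expr):
    """Complexity measure: number of nodes in the expression tree."""
    return 1 if isinstance(expr, str) else 1 + size(expr[1]) + size(expr[2])


def show(expr):
    if isinstance(expr, str):
        return expr
    return f"({show(expr[1])} {expr[0]} {show(expr[2])})"


def candidates(depth):
    """All expression trees up to the given depth (exhaustive, not genetic)."""
    exprs = ["x", "y"]
    for _ in range(depth):
        grown = [(op, a, b) for op in OPS
                 for a, b in itertools.product(exprs, repeat=2)]
        exprs = list(dict.fromkeys(exprs + grown))  # de-duplicate, keep order
    return exprs


def mse(expr, rows):
    return sum((evaluate(expr, x, y) - z) ** 2 for x, y, z in rows) / len(rows)


def search(rows, depth=2, train_fraction=0.75):
    """Steps 1, 4, 5: split the data, fit on training, report validation error."""
    cut = int(len(rows) * train_fraction)
    train, valid = rows[:cut], rows[cut:]
    best = {}  # complexity -> (training error, expression)
    for expr in candidates(depth):
        err = mse(expr, train)
        k = size(expr)
        if k not in best or err < best[k][0]:
            best[k] = (err, expr)
    return [(k, show(e), mse(e, valid), e) for k, (_, e) in sorted(best.items())]


# Step 2: model z = f(x, y); the data come from a made-up hidden relationship.
rng = random.Random(0)
ROWS = []
for _ in range(40):
    x, y = rng.uniform(-2, 2), rng.uniform(-2, 2)
    ROWS.append((x, y, x * y + x))

for k, text, val_err, _ in search(ROWS):
    print(f"complexity {k}: {text}   validation MSE = {val_err:.3g}")
```

The output lists the most accurate candidate at each complexity level, which mirrors the accuracy-versus-complexity ranking in step 5. Choosing among them is left to the user.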

Slide 9

Slide 9 text

http://creativemachines.cornell.edu/eureqa

Slide 10

Slide 10 text

Hertzsprung-Russell diagram

Slide 11

Slide 11 text

Fundamental plane of elliptical galaxies
Djorgovski & Davis (1987):
Symbolic regression: SDSS

Slide 12

Slide 12 text

Binary classifications
• Similar techniques can be used to determine whether an object falls in a particular class (or not)
  • star vs. galaxy (SDSS pipeline)
  • RR Lyrae vs. W UMa stars
  • CV vs. blazar
  • Supernovae: Type Ia vs. core-collapse
• Searching for relationships among 30-60 dimensions (no. of parameters)
Figure panels: RR Lyrae, W UMa

Slide 13

Slide 13 text

Limitations
• Different techniques (MIC, Eureqa, SBR, mRMR) can yield different results for the same input.
• Efficient discovery requires prepared datasets; choices of variables, weighting, and fit metrics; and guesses at functional forms.
• Methods apply only to invariant quantities expressible as partial differential equations. Will not work for:
  • fractal behavior
  • chaotic or stochastic activity
• Current statistics are still limited to bivariate relationships

Slide 14

Slide 14 text

• Analyze the consistency of the features (data parameters) from these various techniques
  • sequential backwards ranking (SBR)
  • minimum-redundancy-maximum-relevance (mRMR)
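As a rough illustration of the mRMR idea, the sketch below greedily picks features that are relevant to the target but not redundant with features already chosen. Two things are assumptions here: it uses absolute Pearson correlation as a stand-in for the mutual information that mRMR normally uses, and the feature names and synthetic data are invented for the demo.

```python
import random


def abs_corr(a, b):
    """Absolute Pearson correlation (a stand-in for mutual information)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return abs(sab) / (saa * sbb) ** 0.5


def mrmr(features, target, k):
    """Greedy selection: maximize relevance minus mean redundancy."""
    relevance = {name: abs_corr(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(abs_corr(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic demo data: "a_copy" nearly duplicates "a"; "noise" is irrelevant.
rng = random.Random(0)
n = 500
a = [rng.gauss(0, 1) for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
FEATURES = {
    "a": a,
    "a_copy": [v + 0.01 * rng.gauss(0, 1) for v in a],
    "b": b,
    "noise": [rng.gauss(0, 1) for _ in range(n)],
}
TARGET = [u + 0.8 * v + 0.1 * rng.gauss(0, 1) for u, v in zip(a, b)]

print(mrmr(FEATURES, TARGET, k=2))
```

A ranking by relevance alone would pick both "a" and "a_copy". The redundancy penalty pushes the second choice toward "b", which carries new information.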

Slide 15

Slide 15 text

Conclusions (embellished)
• Discovering scientifically significant relationships in large, high-dimensional datasets is non-trivial
  • These tools will be necessary to fully analyze the data from LSST and SKA
• New algorithms and software have now successfully reproduced known astronomical results
  • MIC: a non-parametric measure of dependence for pairs of variables
  • Symbolic regression: ranks candidate functional forms and fits their coefficients
• Interpretation and evaluation of physical significance are still roles for human scientists
  • The discovery process is becoming machine-assisted, but we still need astrophysicists to supply the data, analyze results, and provide context.