Statistics, Data Mining, and Machine Learning (mostly don't work) in Astronomy

Plenary talk at the 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017; https://indico.cern.ch/event/567550/)

Jake VanderPlas

August 23, 2017

Transcript

  1. Statistics, Data Mining, and Machine Learning in Astronomy

    Jake VanderPlas, @jakevdp, ACAT 2017
  3. Straightforward application of common techniques often fails. In Astronomy:

    - Data are often quite noisy
  4. Straightforward application of common techniques often fails. In Astronomy:

    - Data are often quite noisy
    - Pre-labeled objects are often biased toward easy-to-observe (bright and/or nearby) objects
  5. Straightforward application of common techniques often fails. In Astronomy:

    - Data are often quite noisy
    - Pre-labeled objects are often biased toward easy-to-observe (bright and/or nearby) objects
    - Data are fundamentally image-based, which doesn’t play well with tabular database architectures
  6. We make progress in Astronomy by adapting and extending methods developed in other fields
  7. Case Studies: Statistics, Data Mining, and Machine Learning

  8. Case Studies: Statistics, Data Mining, and Machine Learning

    Case 1: Generalizing the Lomb-Scargle Periodogram*
    * J. VanderPlas et al. 2015
  9. Periodic Analysis

    Robust detection of periodic variability is important in many areas of Astronomy.
    Images: Large-Scale Structure (Sesar et al. 2010); Exoplanets (European Space Agency)
  10. Lomb-Scargle Periodogram

    cf. Lomb (1976), Scargle (1982). Figure: VanderPlas & Ivezic 2015
    - Generalization of a Fourier Spectrogram
    - Effectively assumes a sinusoidal model
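As a rough illustration of what "assumes a sinusoidal model" means in practice, the sketch below evaluates the classical Lomb-Scargle formula on unevenly sampled data. It is a minimal NumPy version, not the code behind the talk: the function name `lomb_scargle`, the synthetic data, the frequency grid, and the choice to normalize by the sample variance are all assumptions made for this example, and measurement uncertainties are ignored.

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle power at each trial frequency (illustrative sketch)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()  # the classical formula assumes zero-mean data
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        omega = 2 * np.pi * f
        # Time offset tau that makes the sine and cosine terms orthogonal
        tau = np.arctan2(np.sum(np.sin(2 * omega * t)),
                         np.sum(np.cos(2 * omega * t))) / (2 * omega)
        c = np.cos(omega * (t - tau))
        s = np.sin(omega * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power / y.var()

# Synthetic, irregularly sampled sinusoid with noise (made-up values)
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 150))
true_freq = 0.2
y = np.sin(2 * np.pi * true_freq * t) + 0.3 * rng.normal(size=t.size)

freqs = np.linspace(0.01, 1.0, 2000)
power = lomb_scargle(t, y, freqs)
best_freq = freqs[np.argmax(power)]
print(f"peak frequency: {best_freq:.3f}")
```

Each trial frequency amounts to a least-squares fit of a single sinusoid, which is why the method works on irregular sampling but has no notion of separate bands.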
  11. Problem: Lomb-Scargle is not designed for heterogeneous data. For example, stars observed in multiple bands (i.e. wavelength regions)
  12. Two Naive Multiband Approaches (5 bands/night)

    1. Ignore band distinction and fit a single periodogram to all bands. (model is highly biased: under-fits the data)
    2. Fit an independent periodogram within each band; combine the χ² of all K bands. (model is too flexible: over-fits the data)
  13. Two Naive Multiband Approaches (1 band/night)

    1. Ignore band distinction and fit a single periodogram to all bands. (model is highly biased: under-fits the data)
    2. Fit an independent periodogram within each band; combine the χ² of all K bands. (model is too flexible: over-fits the data)
  14. Idea: Generalize the model “Multiband Periodogram”

  15. Idea: Generalize the model “Multiband Periodogram”

    - define a base component which contributes equally to all bands.
  16. Idea: Generalize the model “Multiband Periodogram”

    - for each band, add a band component to describe deviation from base model
  17. Putting it all together: The Multiband Periodogram

    Base component + band component = multiband model.
    Regularize the band component to drive common variation to the base model.
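The base-plus-band construction can be sketched as a regularized linear least-squares fit at each trial frequency. This is only an illustration under stated assumptions, not the published implementation: the function name `multiband_power`, the penalty strength `reg`, the use of a single Fourier term per component, equal measurement uncertainties, and the synthetic three-band data are all hypothetical choices made here.

```python
import numpy as np

def multiband_power(t, y, band, freq, reg=1.0):
    """Power of a shared sinusoid plus penalized per-band deviations (sketch)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    band = np.asarray(band)
    omega = 2 * np.pi * freq
    bands = np.unique(band)

    # Reference model: a constant offset per band (remove each band's mean)
    yc = y.copy()
    for b in bands:
        yc[band == b] -= yc[band == b].mean()

    # Base component: one sinusoid that contributes equally to all bands
    base = np.column_stack([np.sin(omega * t), np.cos(omega * t)])
    columns = [base]
    # Band components: offset + sinusoid, active only for points in that band
    full = np.column_stack([np.ones_like(t), base])
    for b in bands:
        mask = (band == b)[:, None]
        columns.append(np.where(mask, full, 0.0))
    X = np.hstack(columns)

    # Penalize only the band columns so shared variation goes to the base model
    penalty = np.full(X.shape[1], reg)
    penalty[:2] = 0.0
    theta = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ yc)

    resid = yc - X @ theta
    return 1.0 - (resid @ resid) / (yc @ yc)

# Synthetic data: three bands, same period, different amplitudes and offsets
rng = np.random.default_rng(1)
n = 120
t = np.sort(rng.uniform(0, 60, n))
band = rng.integers(0, 3, n)
true_freq = 0.35
amp = np.array([1.0, 0.7, 0.4])[band]
offset = np.array([0.0, 2.0, -1.0])[band]
y = offset + amp * np.sin(2 * np.pi * true_freq * t) + 0.2 * rng.normal(size=n)

freqs = np.linspace(0.05, 1.0, 1500)
power = np.array([multiband_power(t, y, band, f) for f in freqs])
best_freq = freqs[np.argmax(power)]
print(f"peak frequency: {best_freq:.3f}")
```

Without the penalty the design matrix is rank-deficient, because the base sinusoid equals the sum of the band sinusoids. The penalty on the band columns resolves that ambiguity in favor of the base component, which is the role regularization plays on the slide.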
  18. Multiband Periodogram on realistic survey data

    . . . Detects period with high significance when single-band approaches fail!
  19. Statistics: We make progress by “opening the black box” and specializing or extending standard statistical methods
  20. Case Studies: Statistics, Data Mining, and Machine Learning

    Case 2: A Database for Images*
    * P. Mehta et al. 2017
  22. Key question: can scientific image analysis be done at scale on existing systems?

    Standard data-mining tools are not built for typical scientific data:
    - Typical databases are optimized for tabular data
    - Typical astronomy data consists of arrays of pixels
  23. We Explored Five Systems:

    - SciDB: Database architecture purpose-built for computation on multi-dim arrays.
    - Dask: Python package aimed at parallelization of scientific workflows
    - Myria: Shared-nothing DBMS developed by members of our UW team
    - Spark: Popular in-memory big data system with wide adoption & Python interface
    - Tensorflow: System optimized for operations on N-dimensional tensors.
  24. Key Takeaways: Dask Myria SciDB Spark Tensorflow

  25. Key Takeaways: Scientific pipelines are complex enough that they rarely map onto built-in primitives for existing big data systems.

    Comparison table (columns: Dask, Myria, SciDB, Spark, Tensorflow). Rows: Sufficient Primitives. One cell is marked N/A.
  26. Key Takeaways: In the meantime, seamless support for user-defined functions (UDFs) is absolutely essential for scientific use-cases

    Comparison table (columns: Dask, Myria, SciDB, Spark, Tensorflow). Rows: Sufficient Primitives, Python UDF Support. One cell is marked N/A.
  27. Key Takeaways: Support for flexible domain-specific data formats in pipelines is very important for any nontrivial computational task

    Comparison table (columns: Dask, Myria, SciDB, Spark, Tensorflow). Rows: Sufficient Primitives, Python UDF Support, Flexible data formats. One cell is marked N/A.
  28. Key Takeaways: Ideally, parallel computations & memory usage should be tuned automatically by the systems. None of the explored systems do this particularly well.

    Comparison table (columns: Dask, Myria, SciDB, Spark, Tensorflow). Rows: Sufficient Primitives, Python UDF Support, Flexible data formats, Automatic tuning. One cell is marked N/A.
  29. Key Takeaways: Installation headaches are the easiest way to drive frustration. Streamlined installation, particularly on the cloud, is a must

    Comparison table (columns: Dask, Myria, SciDB, Spark, Tensorflow). Rows: Sufficient Primitives, Python UDF Support, Flexible data formats, Streamlined Installation, Automatic tuning. One cell is marked N/A.
  30. Key Takeaways: A large and active user & developer community makes solving problems & getting questions answered much easier.

    Comparison table (columns: Dask, Myria, SciDB, Spark, Tensorflow). Rows: Sufficient Primitives, Python UDF Support, Flexible data formats, Streamlined Installation, Large User Community, Automatic tuning. One cell is marked N/A.
  31. See our paper for more detailed quantitative breakdown & discussion

    https://arxiv.org/abs/1612.02485
  32. Data Mining: There is room for research in development of databases purpose-built for analysis of scientific imagery.
  33. Case Studies: Statistics, Data Mining, and Machine Learning

    Case 3: The Cannon*
    * M. Ness et al., 2015
  34. Challenge: given spectra, determine labels (e.g. temperature, surface gravity, metal content, etc.)

    Image: APOGEE project
  35. Textbook Machine Learning Training Data

  36. Textbook Machine Learning Training Data Model

  37. Textbook Machine Learning Training Data Model + Unknown data

  38. Textbook Machine Learning Training Data Model + Unknown data Predictions

  39. Reality: ML (often) doesn’t work in Astronomy

    - Most algorithms don’t suitably handle noise or measurement errors
  40. Reality: ML (often) doesn’t work in Astronomy

    - Most algorithms don’t suitably handle noise or measurement errors
    - Unlabeled data is often statistically distinct from training data (e.g. fainter)
  41. The Cannon: Turning ML Around

    Hard: Observed Data (with noise) -> ML Model -> Multiple Labels (no noise)
    Easier: Multiple Labels (no noise) -> ML Model -> Observed Data (with noise)
    Key insight: predict data from labels to create a data-driven generative model & treat label prediction as a least squares inference problem.
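The "turned around" workflow can be sketched in two steps: fit a model that predicts every pixel of a spectrum from the labels, then recover the labels of a new spectrum by least squares against that model. The sketch below is a simplified stand-in, not The Cannon itself: it assumes a model that is linear in the labels so that inference reduces to one linear solve, it ignores per-pixel noise weights, and the function names `train` and `infer_labels` and all synthetic numbers are hypothetical.

```python
import numpy as np

def train(labels, spectra):
    """Fit flux[pixel] = offset + slopes . labels at every pixel (labels -> data)."""
    A = np.column_stack([np.ones(len(labels)), labels])
    coeffs, *_ = np.linalg.lstsq(A, spectra, rcond=None)
    return coeffs  # shape: (1 + n_labels, n_pixels)

def infer_labels(coeffs, spectrum):
    """Find the labels whose predicted spectrum best matches the observed one."""
    offset, slopes = coeffs[0], coeffs[1:]
    labels, *_ = np.linalg.lstsq(slopes.T, spectrum - offset, rcond=None)
    return labels

# Synthetic training set: well-measured spectra with known labels
rng = np.random.default_rng(2)
n_train, n_pix, n_lab = 200, 50, 3
true_coeffs = rng.normal(size=(1 + n_lab, n_pix))
train_labels = rng.normal(size=(n_train, n_lab))
train_spectra = (true_coeffs[0] + train_labels @ true_coeffs[1:]
                 + 0.01 * rng.normal(size=(n_train, n_pix)))
coeffs = train(train_labels, train_spectra)

# A noisier "survey" spectrum with unknown labels
true_labels = np.array([0.5, -1.0, 0.2])
noisy_spectrum = (true_coeffs[0] + true_labels @ true_coeffs[1:]
                  + 0.1 * rng.normal(size=n_pix))
estimate = infer_labels(coeffs, noisy_spectrum)
print(np.round(estimate, 2))
```

Because the model predicts data rather than labels, the noise enters only in the inference step, where every pixel contributes to the estimate. With a model that is nonlinear in the labels, the same inference step would need an iterative optimizer instead of a single `lstsq` call.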
  42. Stellar spectra generated from 6 spectra-derived labels (temperature, surface gravity, metallicity, etc.)

    Figure legend: true spectra, model spectra
  43. Results: much more accurate labels, even for much fainter objects.

  44. Machine Learning: We make progress by thinking outside the box to adapt existing ML methods to new classes of data
  45. Statistics: Generalizing Lomb-Scargle

    Data Mining: Image-specific databases
    Machine Learning: Turning ML around with The Cannon
  46. Statistics, Data Mining, and Machine Learning methods, applied naively, are often not well-suited to astronomy. But with some tweaks and some new insights, they can be!
  47. Thank You!

    Email: jakevdp@uw.edu
    Twitter: @jakevdp
    Github: jakevdp
    Web: http://vanderplas.com/
    Blog: http://jakevdp.github.io/