Statistics, Data Mining, and Machine Learning (mostly don't work) in Astronomy

Plenary talk at the 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017; https://indico.cern.ch/event/567550/)

Jake VanderPlas

August 23, 2017

Transcript

  1. Straightforward application of common techniques often fails. In Astronomy:

    - Data are often quite noisy
    - Pre-labeled objects are often biased toward easy-to-observe (bright and/or nearby) objects
  2. Straightforward application of common techniques often fails. In Astronomy:

    - Data are often quite noisy
    - Pre-labeled objects are often biased toward easy-to-observe (bright and/or nearby) objects
    - Data are fundamentally image-based, which doesn’t play well with tabular database architectures
  3. Case Studies: Statistics, Data Mining, and Machine Learning

    Case 1: Generalizing the Lomb-Scargle Periodogram* (* J. VanderPlas et al. 2015)
  4. Periodic Analysis

    Robust detection of periodic variability is important in many areas of Astronomy.
    [Figures: Large-Scale Structure (Sesar et al. 2010); Exoplanets (European Space Agency)]
  5. Lomb-Scargle Periodogram, cf. Lomb (1976), Scargle (1982)

    - Generalization of a Fourier Spectrogram
    - Effectively assumes a sinusoidal model
    [Figure: VanderPlas & Ivezic 2015]
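The "sinusoidal model" reading of the periodogram can be made concrete. The following is a minimal sketch, not the code behind the talk: at each trial frequency f it fits y(t) = θ0 + θ1 sin(2πft) + θ2 cos(2πft) by weighted least squares and reports how much the fit improves on a constant model. The fitted offset θ0, the function name `sinusoid_periodogram`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def sinusoid_periodogram(t, y, dy, freqs):
    """Least-squares power of a single-sinusoid model at each trial frequency."""
    t, y, dy = (np.asarray(a, dtype=float) for a in (t, y, dy))
    w = 1.0 / dy  # inverse-error weights
    # Reference chi-square: best-fit constant (weighted mean) model
    mean = np.sum(y * w**2) / np.sum(w**2)
    chi2_ref = np.sum(((y - mean) * w) ** 2)
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        phase = 2 * np.pi * f * t
        # Design matrix: offset, sine, and cosine terms
        X = np.column_stack([np.ones_like(t), np.sin(phase), np.cos(phase)])
        theta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
        chi2 = np.sum(((y - X @ theta) * w) ** 2)
        power[i] = 1.0 - chi2 / chi2_ref
    return power

# Synthetic, irregularly sampled light curve (illustrative values only)
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 80))
true_freq = 0.37
dy = np.full_like(t, 0.3)
y = 10 + np.sin(2 * np.pi * true_freq * t) + dy * rng.normal(size=t.size)

freqs = np.linspace(0.01, 1.0, 2000)
power = sinusoid_periodogram(t, y, dy, freqs)
best_freq = freqs[np.argmax(power)]
```

Writing the periodogram as a model fit is what makes it extensible: changing the design matrix changes the model, which is the route the multiband generalization takes.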
  6. Problem: Lomb-Scargle is not designed for heterogeneous data.

    For example, stars observed in multiple bands (i.e. wavelength regions)
  7. Two Naive Multiband Approaches (5 bands/night)

    1. Ignore band distinction and fit a single periodogram to all bands. (model is highly biased: under-fits the data)
    2. Fit an independent periodogram within each band; combine the χ² of all K bands (model is too flexible: over-fits the data)
  8. Two Naive Multiband Approaches (1 band/night)

    1. Ignore band distinction and fit a single periodogram to all bands. (model is highly biased: under-fits the data)
    2. Fit an independent periodogram within each band; combine the χ² of all K bands (model is too flexible: over-fits the data)
  9. Idea: Generalize the model (“Multiband Periodogram”)

    - define a base component which contributes equally to all bands.
  10. Idea: Generalize the model (“Multiband Periodogram”)

    - for each band, add a band component to describe deviation from base model
  11. Putting it all together: The Multiband Periodogram

    Regularize the band component to drive common variation to the base model.
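The base-plus-band construction can be sketched as a regularized linear least-squares problem. This is an illustrative sketch under stated assumptions, not the published implementation: it uses one sine/cosine pair for the shared base component and one pair per band, removes each band's weighted mean instead of fitting offsets, and applies a ridge penalty only to the band coefficients. The function name `multiband_power`, the penalty value `reg_band`, and the synthetic survey data are hypothetical.

```python
import numpy as np

def multiband_power(t, y, dy, band, freq, reg_band=1e-3):
    """Power of a shared-plus-per-band sinusoid model at one trial frequency."""
    t, y, dy = (np.asarray(a, dtype=float) for a in (t, y, dy))
    band = np.asarray(band)
    labels = np.unique(band)
    w = 1.0 / dy
    # Remove each band's weighted mean so offsets need no extra parameters
    yc = y.copy()
    for b in labels:
        m = band == b
        yc[m] -= np.sum(y[m] * w[m] ** 2) / np.sum(w[m] ** 2)
    phase = 2 * np.pi * freq * t
    sincos = np.column_stack([np.sin(phase), np.cos(phase)])
    # Base component: the same two columns for every observation
    cols = [sincos]
    # Band components: columns that are nonzero only within one band
    for b in labels:
        cols.append(sincos * (band == b)[:, None])
    X = np.hstack(cols) * w[:, None]
    # Penalize only band coefficients, pushing shared variation into the base
    penalty = np.r_[np.zeros(2), np.full(2 * len(labels), reg_band)]
    theta = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ (yc * w))
    chi2_ref = np.sum((yc * w) ** 2)
    chi2 = np.sum((yc * w - X @ theta) ** 2)
    return 1.0 - chi2 / chi2_ref

# Synthetic sparse survey: one randomly chosen band per observation
rng = np.random.default_rng(1)
n = 60
t = np.sort(rng.uniform(0, 100, n))
band = rng.integers(0, 5, n)  # hypothetical band labels 0..4
true_freq = 0.42
amp = np.array([1.0, 0.9, 0.8, 0.7, 0.6])[band]
offset = np.array([15.0, 14.5, 14.2, 14.0, 13.9])[band]
dy = np.full(n, 0.2)
y = offset + amp * np.sin(2 * np.pi * true_freq * t) + dy * rng.normal(size=n)

freqs = np.linspace(0.01, 1.0, 2000)
power = np.array([multiband_power(t, y, dy, band, f) for f in freqs])
best_freq = freqs[np.argmax(power)]
```

Without the penalty the base and band columns are exactly degenerate. The penalty breaks the tie in favor of the base component, so the band terms only absorb variation that genuinely differs between bands.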
  12. Multiband Periodogram on realistic survey data

    Detects period with high significance when single-band approaches fail!
  13. Statistics

    We make progress by “opening the black box” and specializing or extending standard statistical methods
  14. Case Studies: Statistics, Data Mining, and Machine Learning

    Case 2: A Database for Images* (* P. Mehta et al. 2017)
  15. Key question: can scientific image analysis be done at scale on existing systems?

    Standard data-mining tools are not built for typical scientific data:
    - Typical databases are optimized for tabular data.
    - Typical astronomy data consists of arrays of pixels.
  16. We Explored Five Systems:

    - SciDB: database architecture purpose-built for computation on multi-dimensional arrays.
    - Dask: Python package aimed at parallelization of scientific workflows.
    - Myria: shared-nothing DBMS developed by members of our UW team.
    - Spark: popular in-memory big data system with wide adoption and a Python interface.
    - TensorFlow: system optimized for operations on N-dimensional tensors.
  17. Key Takeaways: Sufficient Primitives

    Scientific pipelines are complex enough that they rarely map onto built-in primitives for existing big data systems.
    [Table: systems (Dask, Myria, SciDB, Spark, TensorFlow) vs. criteria (Sufficient Primitives); one entry N/A]
  18. Key Takeaways: Python UDF Support

    In the meantime, seamless support for user-defined functions (UDFs) is absolutely essential for scientific use-cases.
    [Table: systems (Dask, Myria, SciDB, Spark, TensorFlow) vs. criteria (Sufficient Primitives, Python UDF Support); one entry N/A]
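To make the UDF requirement concrete, the sketch below applies an arbitrary Python function to tiles of an image array in parallel. It uses only the Python standard library and NumPy and does not reproduce the API of any of the five systems. The function names `background_subtract` and `map_tiles`, the tile size, and the synthetic image are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def background_subtract(tile):
    """Hypothetical UDF: remove a tile's median sky level."""
    return tile - np.median(tile)

def map_tiles(image, tile_size, udf, max_workers=4):
    """Apply `udf` to square tiles of a 2-D image and reassemble the result."""
    ny, nx = image.shape
    slices = [
        (slice(i, min(i + tile_size, ny)), slice(j, min(j + tile_size, nx)))
        for i in range(0, ny, tile_size)
        for j in range(0, nx, tile_size)
    ]
    out = np.empty_like(image, dtype=float)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The system only needs to know how to split, schedule, and merge;
        # the science lives entirely inside the user-supplied function.
        results = pool.map(lambda s: udf(image[s]), slices)
        for s, tile in zip(slices, results):
            out[s] = tile
    return out

rng = np.random.default_rng(3)
image = 100.0 + rng.normal(size=(256, 256))  # synthetic sky image
cleaned = map_tiles(image, 64, background_subtract)
```

The design point is the separation of concerns: a pipeline step written as an ordinary array function can be reused unchanged, whereas a system without UDF support forces it to be rewritten in terms of built-in primitives.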
  19. Key Takeaways: Flexible data formats

    Support for flexible domain-specific data formats in pipelines is very important for any nontrivial computational task.
    [Table: systems (Dask, Myria, SciDB, Spark, TensorFlow) vs. criteria (Sufficient Primitives, Python UDF Support, Flexible data formats); one entry N/A]
  20. Key Takeaways: Automatic tuning

    Ideally, parallel computations & memory usage should be tuned automatically by the systems. None of the explored systems do this particularly well.
    [Table: systems (Dask, Myria, SciDB, Spark, TensorFlow) vs. criteria (Sufficient Primitives, Python UDF Support, Flexible data formats, Automatic tuning); one entry N/A]
  21. Key Takeaways: Streamlined Installation

    Installation headaches are the easiest way to drive frustration. Streamlined installation, particularly on the cloud, is a must.
    [Table: systems (Dask, Myria, SciDB, Spark, TensorFlow) vs. criteria (Sufficient Primitives, Python UDF Support, Flexible data formats, Streamlined Installation, Automatic tuning); one entry N/A]
  22. Key Takeaways: Large User Community

    A large and active user & developer community makes solving problems & getting questions answered much easier.
    [Table: systems (Dask, Myria, SciDB, Spark, TensorFlow) vs. criteria (Sufficient Primitives, Python UDF Support, Flexible data formats, Streamlined Installation, Large User Community, Automatic tuning); one entry N/A]
  23. Data Mining

    There is room for research in development of databases purpose-built for analysis of scientific imagery.
  24. Reality: ML (often) doesn’t work in Astronomy

    - Most algorithms don’t suitably handle noise or measurement errors
  25. Reality: ML (often) doesn’t work in Astronomy

    - Most algorithms don’t suitably handle noise or measurement errors
    - Unlabeled data is often statistically distinct from training data (e.g. fainter)
  26. The Cannon: Turning ML Around

    - Hard: Observed Data (with noise) → ML Model → Multiple Labels (no noise)
    - Easier: Observed Data (with noise) ← ML Model ← Multiple Labels (no noise)
    Key insight: predict data from labels to create a data-driven generative model & treat label prediction as a least squares inference problem.
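The "predict data from labels" idea can be sketched in a few lines. The example below is a deliberately simplified linear stand-in, not the published method: the training step regresses each pixel's flux on the labels, and the test step finds the labels whose predicted spectrum best matches a noisy observation, weighting each pixel by its measurement error. All array sizes, noise levels, and names such as `infer_labels` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_pix, n_labels = 200, 50, 2

# Synthetic "true" linear relation between labels and flux (illustration only)
true_coeffs = rng.normal(size=(n_labels + 1, n_pix))
train_labels = rng.uniform(-1, 1, size=(n_train, n_labels))
A_train = np.column_stack([np.ones(n_train), train_labels])
train_flux = A_train @ true_coeffs + 0.01 * rng.normal(size=(n_train, n_pix))

# Training step: for every pixel, regress flux on labels (labels -> data)
coeffs, *_ = np.linalg.lstsq(A_train, train_flux, rcond=None)

def infer_labels(flux, sigma, coeffs):
    """Weighted least-squares labels that best reproduce one noisy spectrum."""
    offset, slopes = coeffs[0], coeffs[1:]  # shapes (n_pix,), (n_labels, n_pix)
    design = (slopes / sigma).T             # shape (n_pix, n_labels)
    target = (flux - offset) / sigma
    labels, *_ = np.linalg.lstsq(design, target, rcond=None)
    return labels

# Test step: a new, noisier spectrum with known per-pixel errors
new_labels = np.array([0.3, -0.5])
sigma = np.full(n_pix, 0.1)
new_flux = np.r_[1.0, new_labels] @ true_coeffs + sigma * rng.normal(size=n_pix)
estimate = infer_labels(new_flux, sigma, coeffs)
```

Because the noise enters on the data side of the model, per-pixel errors appear naturally as weights in the inference step, which is exactly what a direct data-to-labels regressor struggles to accommodate.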
  27. Machine Learning

    We make progress by thinking outside the box to adapt existing ML methods to new classes of data
  28. Statistics, Data Mining, and Machine Learning methods, applied naively, are often not well-suited to astronomy.

    But with some tweaks and some new insights, they can be!