Statistics, Data Mining, and Machine Learning (mostly don't work) in Astronomy
Plenary talk at the 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017; https://indico.cern.ch/event/567550/)
Straightforward application of common techniques often fails. In Astronomy:
- Data are often quite noisy
- Pre-labeled objects are often biased toward easy-to-observe (bright and/or nearby) objects
- Data are fundamentally image-based, which doesn't play well with tabular database architectures
Jake VanderPlas
Periodic Analysis
Robust detection of periodic variability is important in many areas of Astronomy.
[Figures: Large-Scale Structure (Sesar et al. 2010); Exoplanets (European Space Agency)]
Two Naive Multiband Approaches
[Figure panels: 5 bands/night; 1 band/night]
1. Ignore band distinction and fit a single periodogram to all bands. (model is highly biased: under-fits the data)
2. Fit an independent periodogram within each band; combine the χ² of all K bands. (model is too flexible: over-fits the data)
Idea: Generalize the model (the "Multiband Periodogram")
- for each band, add a band component to describe deviation from the base model
Putting it all together: The Multiband Periodogram
[Figure: base model + band component = multiband model]
Regularize the band component to drive common variation to the base model.
Multiband Periodogram on realistic survey data: detects period with high significance when single-band approaches fail!
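The base-plus-band model with a regularized band component can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the talk: the function name `multiband_power`, the single-sinusoid base and band terms, the per-band weighted-mean reference model, and the default `reg_band=1.0` are all assumptions made for the example.

```python
import numpy as np

def multiband_power(t, y, dy, band, freqs, reg_band=1.0):
    """Periodogram power for a shared base sinusoid plus regularized per-band terms.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    t, y, dy, band = (np.asarray(a) for a in (t, y, dy, band))
    band_ids = np.unique(band)
    masks = [(band == b).astype(float) for b in band_ids]
    w = 1.0 / dy  # whitening weights (inverse measurement errors)

    # Reference model: an independent weighted mean in each band.
    chi2_ref = 0.0
    for m in masks:
        mean_b = np.sum(m * y * w**2) / np.sum(m * w**2)
        chi2_ref += np.sum(m * ((y - mean_b) * w) ** 2)

    # Penalty: none on the base terms or band offsets, reg_band on band sinusoids.
    # Penalizing only the band sinusoids pushes variation common to all
    # bands into the base model.
    penalty = np.diag([0.0, 0.0] + [0.0, reg_band, reg_band] * len(band_ids))

    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        s, c = np.sin(2 * np.pi * f * t), np.cos(2 * np.pi * f * t)
        cols = [s, c]                  # base model, shared by all bands
        for m in masks:
            cols += [m, m * s, m * c]  # offset + deviation for one band
        Xw = np.column_stack(cols) * w[:, None]
        theta = np.linalg.solve(Xw.T @ Xw + penalty, Xw.T @ (y * w))
        chi2 = np.sum((y * w - Xw @ theta) ** 2)
        power[i] = 1.0 - chi2 / chi2_ref  # fractional chi-squared reduction
    return power
```

Without the penalty the base sinusoid is an exact sum of the band sinusoids, so the normal equations are singular; the regularization term is what makes the combined model well-posed.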
Standard data-mining tools are not built for typical scientific data:
- Typical databases are optimized for tabular data.
- Typical astronomy data consists of arrays of pixels.
Key question: can scientific image analysis be done at scale on existing systems?
We Explored Five Systems:
- SciDB: Database architecture purpose-built for computation on multi-dimensional arrays.
- Dask: Python package aimed at parallelization of scientific workflows.
- Myria: Shared-nothing DBMS developed by members of our UW team.
- Spark: Popular in-memory big data system with wide adoption and a Python interface.
- TensorFlow: System optimized for operations on N-dimensional tensors.
Key Takeaways:
- Sufficient Primitives: Scientific pipelines are complex enough that they rarely map onto built-in primitives for existing big data systems.
- Python UDF Support: In the meantime, seamless support for user-defined functions (UDFs) is absolutely essential for scientific use-cases.
- Flexible data formats: Support for flexible domain-specific data formats in pipelines is very important for any nontrivial computational task.
- Automatic tuning: Ideally, parallel computations and memory usage should be tuned automatically by the systems. None of the explored systems do this particularly well.
- Streamlined Installation: Installation headaches are the easiest way to drive frustration. Streamlined installation, particularly on the cloud, is a must.
- Large User Community: A large and active user and developer community makes solving problems and getting questions answered much easier.
[Table: Dask, Myria, SciDB, Spark, and TensorFlow rated against each of the criteria above]
Reality: ML (often) doesn't work in Astronomy
- Most algorithms don't suitably handle noise or measurement errors
- Unlabeled data is often statistically distinct from training data (e.g. fainter)
The Cannon: Turning ML Around
Hard: Observed Data (with noise) -> ML Model -> Multiple Labels (no noise)
Easier: Multiple Labels (no noise) -> ML Model -> Observed Data (with noise)
Key insight: predict data from labels to create a data-driven generative model, and treat label prediction as a least squares inference problem.
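The "predict data from labels, then invert" idea can be illustrated with a toy NumPy sketch. It is not The Cannon itself: the model here is assumed to be linear in the labels so that both steps reduce to ordinary least squares, and the function names `train_generative_model` and `infer_labels` are hypothetical.

```python
import numpy as np

def train_generative_model(labels, flux):
    """Fit, for every pixel, flux = offset + slopes . labels (labels -> data).

    labels: (n_stars, n_labels) trusted training labels
    flux:   (n_stars, n_pixels) training spectra
    Returns coefficients of shape (n_labels + 1, n_pixels).
    """
    design = np.hstack([np.ones((labels.shape[0], 1)), labels])
    coeffs, *_ = np.linalg.lstsq(design, flux, rcond=None)
    return coeffs

def infer_labels(coeffs, flux, flux_err):
    """Estimate labels for one noisy spectrum by weighted least squares.

    The noise enters only through the per-pixel weights, so measurement
    errors are handled explicitly at inference time.
    """
    offset, slopes = coeffs[0], coeffs[1:]  # slopes: (n_labels, n_pixels)
    w = 1.0 / flux_err
    design = (slopes * w).T                 # (n_pixels, n_labels)
    target = (flux - offset) * w
    estimate, *_ = np.linalg.lstsq(design, target, rcond=None)
    return estimate
```

Because the generative direction maps noise-free labels to noisy data, the per-pixel uncertainties of a new observation can be used directly as weights, which a standard data-to-label regressor does not offer.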
- Statistics: Generalizing Lomb-Scargle
- Data Mining: Image-specific databases
- Machine Learning: Turning ML around with The Cannon
Statistics, Data Mining, and Machine Learning methods, applied naively, are often not well-suited to astronomy. But with some tweaks and some new insights, they can be!