Academic Data Science: conducting research at the interface of different disciplines

Data Club @ UGA first seminar. I presented my work on image analysis and machine learning for astronomy, and the study of extra-solar planets via direct imaging. Then, I discussed the many challenges that we face as academic data scientists working at the interface of several disciplines. Finally, I concluded by listing some of the exciting opportunities that data science opens for us, young researchers, and our future (perhaps academic) careers.

https://data-institute.univ-grenoble-alpes.fr/data-institute/news-and-events/data-club-1st-seminar-735317.htm?RH=10277933017461015

Carlos Alberto Gomez Gonzalez

February 28, 2018

Transcript

  1. Academic Data Science: conducting research at the interface of different disciplines
    Carlos Alberto Gomez Gonzalez
    Data Club @ UGA, 28/02/2018
  2. 3

  3. [Figure. Labelled planets: PSR B1257+12 b, c; 51 Peg b; HD 209458 b; HR8799 b, c, d; HR8799 e; beta Pic b; 51 Eri b]
    Annotation: Too few of these!!!
    Source: http://exoplanetarchive.ipac.caltech.edu, 25 Jan 2018
  4. Planet hunter pipeline
    Raw astronomical images
    Basic calibration and “cosmetics”
      • Dark/bias subtraction
      • Flat fielding
      • Sky (thermal background) subtraction
      • Bad pixel correction
    Image recentering
      • Center of mass
      • 2d Gaussian fit
      • DFT cross-correlation
    Bad frames removal
      • Image correlation
      • Pixel statistics (specific image regions)
    Reference PSF creation
      • Pairwise
      • Median
      • PCA, NMF
      • LOCI
      • LLSG
    PSF reference subtraction
    De-rotation (for ADI) or rescaling (for mSDI)
    Image combination
      • Mean, median, trimmed mean
    Final residual image
    Characterization of detected companions
  5. Image post-processing
    Raw astronomical images
    Basic calibration and “cosmetics”
      • Dark/bias subtraction
      • Flat fielding
      • Sky (thermal background) subtraction
      • Bad pixel correction
    Image recentering
      • Center of mass
      • 2d Gaussian fit
      • DFT cross-correlation
    Bad frames removal
      • Image correlation
      • Pixel statistics (specific image regions)
    Reference PSF creation
      • Pairwise
      • Median
      • PCA, NMF
      • LOCI
      • LLSG
    PSF reference subtraction
    De-rotation (for ADI) or rescaling (for mSDI)
    Image combination
      • Mean, median, trimmed mean
    Final residual image
    Characterization of detected companions
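The "Reference PSF creation", "PSF reference subtraction" and "Image combination" stages listed above can be illustrated with a minimal NumPy sketch. This is not VIP's implementation: the function names `pca_psf_subtraction` and `combine_frames`, the synthetic cube, and the choice of a full-frame PCA are assumptions made for illustration only, and the de-rotation step needed for real ADI data is deliberately left out.

```python
import numpy as np


def pca_psf_subtraction(cube, n_components):
    """Subtract a rank-`n_components` PCA model of the PSF from every frame.

    `cube` has shape (n_frames, ny, nx). Hypothetical helper, not a VIP function.
    """
    n_frames, ny, nx = cube.shape
    matrix = cube.reshape(n_frames, ny * nx)
    matrix = matrix - matrix.mean(axis=0)  # center each pixel over the sequence
    # Rows of vt are the principal components (eigen-images) of the sequence.
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    basis = vt[:n_components]
    reference = matrix @ basis.T @ basis  # low-rank reference PSF for each frame
    return (matrix - reference).reshape(n_frames, ny, nx)


def combine_frames(residuals, mode="median"):
    """Collapse the residual cube into one final image (mean or median)."""
    if mode == "median":
        return np.median(residuals, axis=0)
    if mode == "mean":
        return residuals.mean(axis=0)
    raise ValueError(f"unknown combination mode: {mode}")


# Synthetic stand-in for a calibrated, recentered cube: one static speckle
# pattern whose brightness varies from frame to frame, plus faint noise.
rng = np.random.default_rng(0)
static_pattern = rng.normal(size=(16, 16))
amplitudes = 1.0 + 0.1 * rng.normal(size=20)
cube = amplitudes[:, None, None] * static_pattern
cube = cube + 0.01 * rng.normal(size=(20, 16, 16))

residuals = pca_psf_subtraction(cube, n_components=1)
# For ADI, each residual frame would be de-rotated here before combining.
final_image = combine_frames(residuals, mode="median")
```

In this toy case a single principal component captures the quasi-static pattern, so the residual cube is left with little more than the injected noise.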
  6. • https://github.com/vortex-exoplanet/VIP
    • 50k+ lines of code: author + 7 contributors
    • 337 commits, 77 pull requests, 49 closed issues, 16 releases
    • Growing community of users: > 12 papers published using/citing VIP
    • Documentation: http://vip.readthedocs.io/
    Gomez Gonzalez et al. 2017
  7. “Essentially, all models are wrong, but some are useful.”
    George Box
    “…if the model is going to be wrong anyway, why not see if you can get the computer to ‘quickly’ learn a model from the data, rather than have a human laboriously derive a model from a lot of thought.”
    Peter Norvig
  8. Supervised detection (Gomez Gonzalez et al. 2018)
    [Figure with panels (a), (b), (c). Recoverable labels:]
    • Input cube, N frames; N x Pann; k SVD low-rank approximation levels; k residuals, back to image space; MLAR patches
    • X: MLAR samples; y: Labels (0 1); X and y to train/test/validation sets
    • Convolutional LSTM layer, kernel=(3x3), filters=40; Convolutional LSTM layer, kernel=(2x2), filters=80; 3d Max pooling, size=(2x2x2) (twice); Dense layer, units=128; ReLU activation + dropout; Output dense layer, units=1; Sigmoid activation; Probability of positive class
    • Input cube; PSF; Trained classifier; Binary map, probability threshold = 0.9
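The "k SVD low-rank approximation levels" and "k residuals" labels on this slide can be read as follows: the same data matrix is approximated at several ranks, and each approximation is subtracted to give one residual per rank. The sketch below is a hedged illustration of that idea only. The function name `multilevel_residuals`, the random stand-in for the N x Pann matrix, and the ranks chosen are assumptions; the actual MLAR sample construction in Gomez Gonzalez et al. 2018 involves further steps (returning to image space, cropping patches) that are not reproduced here.

```python
import numpy as np


def multilevel_residuals(matrix, ranks):
    """Return one residual matrix per rank, stacked along the first axis.

    For each k in `ranks`, the rank-k truncated-SVD approximation of `matrix`
    is subtracted from it. Hypothetical helper written for illustration.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    residuals = []
    for k in ranks:
        low_rank = (u[:, :k] * s[:k]) @ vt[:k]  # rank-k approximation
        residuals.append(matrix - low_rank)
    return np.stack(residuals)


# Random stand-in for an N x Pann matrix (12 frames, 50 pixels per frame).
rng = np.random.default_rng(1)
frames = rng.normal(size=(12, 50))
ranks = [1, 2, 4]

residual_levels = multilevel_residuals(frames, ranks)  # shape (3, 12, 50)
residual_norms = np.linalg.norm(residual_levels, axis=(1, 2))  # Frobenius norms
singular_values = np.linalg.svd(frames, compute_uv=False)
```

Each additional level removes more of the dominant structure, so the residual norm shrinks as the rank grows and equals the root sum of squares of the discarded singular values.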
  9. Performance assessment
    [Figure. Panels: Good classifier, Bad classifier. Labels: Observations, Threshold, True positive, True negative, False positive, False negative]
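The quantities named on this slide follow from comparing classifier scores against a threshold. The sketch below counts true/false positives and negatives for two made-up score vectors. The scores, labels, and function names are invented for illustration and are not results from the talk; the 0.9 threshold used at the end echoes the value shown on the supervised detection slide.

```python
import numpy as np


def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, FN, TN) for scores thresholded at `threshold`."""
    predicted = scores >= threshold
    actual = labels.astype(bool)
    tp = int(np.sum(predicted & actual))
    fp = int(np.sum(predicted & ~actual))
    fn = int(np.sum(~predicted & actual))
    tn = int(np.sum(~predicted & ~actual))
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """True positive rate and false positive rate (0.0 if undefined)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Invented example: 4 negatives followed by 4 positives.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
good_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.70, 0.80, 0.95, 0.99])
bad_scores = np.array([0.40, 0.95, 0.60, 0.20, 0.50, 0.92, 0.30, 0.70])

good_at_half = confusion_counts(good_scores, labels, 0.5)  # (4, 0, 0, 4)
bad_at_half = confusion_counts(bad_scores, labels, 0.5)    # (3, 2, 1, 2)
good_at_strict = confusion_counts(good_scores, labels, 0.9)  # (2, 0, 2, 4)
```

Raising the threshold trades missed detections for fewer false alarms; sweeping it and plotting the true positive rate against the false positive rate gives a ROC curve.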
  10. Image (model PSF) subtraction
    Supervised detection (SODINN)
    noisy and unlabelled images
    data transformation + adequate (DL) model
    astonishing results
  11. What is data science?
    https://datascience.nyu.edu/what-is-data-science/
    • Over-abused buzzword that can mean anything and everything
    • Does it deal only with “big” data?
    • At its core, DS involves using automated methods to analyze massive amounts of data and to extract knowledge from them.
    • Is DS really that new? How is it different from statistics?
    • DS offers a powerful new approach to making discoveries. By combining aspects of statistics, CS, applied mathematics, and visualization, DS can turn the vast amounts of data the digital age generates into new insights and new knowledge.
  12. Interdisciplinarity: challenges
    • Interdisciplinary expertise isn’t properly recognized
    • Inadequate metrics and assessment mechanisms for promotion
    • No clear paths/protocols for establishing collaborations (multidisciplinarity)
    • It is often not trivial to navigate and integrate knowledge from different disciplines
    • Never-ending impostor syndrome
    https://www.nature.com/articles/s41599-017-0039-7
    http://blog.fperez.org/2013/11/an-ambitious-experiment-in-data-science.html
    https://www.space.com/39420-becoming-astrophysicist-keeps-getting-tougher.html
  13. Opportunities
    In academia:
    • Change modern science in a positive way
    • Freedom of research
    • Exciting cross- and interdisciplinary projects
    Out:
    • Industry rewards the skills and behavior that are not properly valued in academia (The Sexiest Job of the 21st Century)
    • Building up transferable skills
    • Opportunity to work on “interesting” topics
    http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
    https://danielahuppenkothen.wordpress.com/2018/01/13/lessons-learned-in-data-science/
  14. Academic data science
    Transform science:
    • Cross and inter-disciplinary research (collaboration with CS, ML, AI fields)
    • Ensuring the use of robust statistical approaches and well-suited metrics
    • Integrating cutting-edge AI developments
    http://jakevdp.github.io/blog/2014/08/22/hacking-academia/
  15. Academic data science
    Transforming science:
    • Code release (open-source development)
    • Knowledge sharing
    • Non-refereed publications
    • The Journal of Open Source Software: https://joss.theoj.org/
    • Data challenges (benchmark datasets)