Academic Data Science: conducting research at the interface of different disciplines

First seminar of the Data Club @ UGA. I presented my work on image analysis and machine learning for astronomy, applied to the study of extra-solar planets via direct imaging. I then discussed the many challenges we face as academic data scientists working at the interface of several disciplines. Finally, I listed some of the exciting opportunities that data science opens up for us as young researchers, and for our future (perhaps academic) careers.

https://data-institute.univ-grenoble-alpes.fr/data-institute/news-and-events/data-club-1st-seminar-735317.htm?RH=10277933017461015

Carlos Alberto Gomez Gonzalez

February 28, 2018

Transcript

  1. Academic Data Science: conducting research at the interface of different

    disciplines Carlos Alberto Gomez Gonzalez Data Club @ UGA, 28/02/2018
  2. Exoplanets

  3. 3

  4. Mostly, we rely on indirect methods for detecting exoplanets, because

    it’s very hard to see them!
  5. PSRB1257+12 b,c 51 Peg b HD 209458 b HR8799 b,c,d

    HR8799 e, beta Pic b 51 Eri b http://exoplanetarchive.ipac.caltech.edu, 25 Jan 2018 Too few of these!!!
  6. HR8799 bcde (Marois et al. 2008-2010) This is how we

    see exoplanets
  7. Sea of speckles :(

  8. Planet hunter pipeline, from raw astronomical images to the final residual image:

    • Basic calibration and “cosmetics”: dark/bias subtraction, flat fielding, sky (thermal background) subtraction, bad pixel correction
    • Image recentering: center of mass, 2d Gaussian fit, DFT cross-correlation
    • Bad frames removal: image correlation, pixel statistics (specific image regions)
    • Reference PSF creation: pairwise, median, PCA, NMF, LOCI, LLSG
    • PSF reference subtraction
    • De-rotation (for ADI) or rescaling (for mSDI)
    • Image combination: mean, median, trimmed mean
    • Characterization of detected companions
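The first stages named on this slide (dark subtraction, flat fielding, centroiding by center of mass, and bad frame removal by image correlation) can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the VIP implementation: the function names `calibrate`, `center_of_mass` and `reject_bad_frames`, the correlation cut of 0.9 and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def calibrate(raw, dark, flat):
    """Dark subtraction followed by flat fielding."""
    flat_norm = flat / np.median(flat)   # normalize the flat to unit median
    return (raw - dark) / flat_norm

def center_of_mass(frame):
    """Intensity-weighted centroid (y, x) of a frame."""
    yy, xx = np.indices(frame.shape)
    total = frame.sum()
    return (yy * frame).sum() / total, (xx * frame).sum() / total

def reject_bad_frames(cube, min_corr=0.9):
    """Keep frames whose correlation with the median frame is >= min_corr."""
    ref = np.median(cube, axis=0).ravel()
    corr = np.array([np.corrcoef(f.ravel(), ref)[0, 1] for f in cube])
    return cube[corr >= min_corr], corr

# Synthetic data (assumed for illustration): a Gaussian "star" seen through
# a known dark level and flat field, with one corrupted frame.
rng = np.random.default_rng(0)
yy, xx = np.indices((21, 21))
star = 100.0 * np.exp(-((yy - 10) ** 2 + (xx - 10) ** 2) / (2 * 2.0 ** 2))
dark = np.full((21, 21), 5.0)
flat = 1.0 + 0.05 * rng.standard_normal((21, 21))
raw = np.stack([star * flat / np.median(flat) + dark for _ in range(6)])
raw[3] = 50.0 * rng.standard_normal((21, 21)) + dark   # corrupted frame

calibrated = calibrate(raw, dark, flat)
good, corr = reject_bad_frames(calibrated)
cy, cx = center_of_mass(good[0])
print(good.shape, round(cy, 3), round(cx, 3))
```

The centroid returned by `center_of_mass` would then feed the recentering step; sub-pixel shifting needs an interpolation routine and is left out of this sketch.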
  9. Image post-processing, from raw astronomical images to the final residual image:

    • Basic calibration and “cosmetics”: dark/bias subtraction, flat fielding, sky (thermal background) subtraction, bad pixel correction
    • Image recentering: center of mass, 2d Gaussian fit, DFT cross-correlation
    • Bad frames removal: image correlation, pixel statistics (specific image regions)
    • Reference PSF creation: pairwise, median, PCA, NMF, LOCI, LLSG
    • PSF reference subtraction
    • De-rotation (for ADI) or rescaling (for mSDI)
    • Image combination: mean, median, trimmed mean
    • Characterization of detected companions
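As a rough illustration of the post-processing stages, the sketch below builds a PCA reference PSF, subtracts it from every frame, and combines the residuals. It is a simplified example rather than the pipeline's actual code: `pca_psf_subtraction` and `combine` are hypothetical names, the data are synthetic, and the de-rotation step is omitted because it needs an image interpolation routine.

```python
import numpy as np

def pca_psf_subtraction(cube, ncomp):
    """Subtract from each frame its projection onto the first `ncomp`
    principal components of the mean-centered image sequence."""
    n, ny, nx = cube.shape
    mat = cube.reshape(n, ny * nx)
    mat = mat - mat.mean(axis=0)                 # center each pixel over the sequence
    _, _, vt = np.linalg.svd(mat, full_matrices=False)
    pcs = vt[:ncomp]                             # (ncomp, n_pixels)
    reference = mat @ pcs.T @ pcs                # low-rank reference PSF per frame
    return (mat - reference).reshape(n, ny, nx)

def combine(frames, mode="median"):
    """Collapse a residual cube into a single image."""
    if mode == "median":
        return np.median(frames, axis=0)
    if mode == "mean":
        return np.mean(frames, axis=0)
    raise ValueError(f"unknown mode: {mode}")

# Synthetic cube (assumed): one quasi-static speckle pattern whose brightness
# varies from frame to frame, plus faint noise.
rng = np.random.default_rng(1)
pattern = rng.standard_normal((16, 16))
scales = rng.uniform(0.5, 1.5, size=10)
cube = scales[:, None, None] * pattern + 0.01 * rng.standard_normal((10, 16, 16))

residuals = pca_psf_subtraction(cube, ncomp=1)
final = combine(residuals, mode="median")
print(cube.std(), residuals.std(), final.shape)
```

In a real ADI sequence the residual frames would be de-rotated to a common orientation before `combine` is called, so that a companion adds up coherently while the leftover speckles average down.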
  10. Gomez Gonzalez et al. 2016 Algo-ZOO

  11. • https://github.com/vortex-exoplanet/VIP

    • 50k+ lines of code: author + 7 contributors
    • 337 commits, 77 pull requests, 49 closed issues, 16 releases
    • Growing community of users: > 12 papers published using/citing VIP
    • Documentation: http://vip.readthedocs.io/
    Gomez Gonzalez et al. 2017
  12. Open source → Open Science

  13. Detection [figure: image sequence, final residual image]
  14. “Essentially, all models are wrong, but some are useful.” George

    Box “…if the model is going to be wrong anyway, why not see if you can get the computer to ‘quickly’ learn a model from the data, rather than have a human laboriously derive a model from a lot of thought.” Peter Norvig
  15. Textbook ML

    • Supervised: regression, classification
    • Unsupervised: dimensionality reduction (PC 1, PC 2), clustering
  16. Supervised detection (Gomez Gonzalez et al. 2018), panels (a), (b), (c)

    • Sample generation: input cube (N frames) and PSF; N x Pann matrix; SVD low-rank approximation at levels k; residuals, back to image space; MLAR patches. X: MLAR samples, y: labels (0, 1).
    • Training: X and y split into train/test/validation sets. Convolutional LSTM layer (kernel=(3x3), filters=40), 3d max pooling (size=(2x2x2)), convolutional LSTM layer (kernel=(2x2), filters=80), 3d max pooling (size=(2x2x2)), dense layer (units=128, ReLU activation + dropout), output dense layer (units=1, sigmoid activation).
    • Prediction: input cube, trained classifier, probability of positive class, binary map (probability threshold = 0.9).
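The "SVD low-rank approximation, levels k, residuals, back to image space" step of this slide can be sketched as follows. This is only an illustration of the idea: `low_rank_residuals` is a hypothetical name, the cube is random noise, the levels (1, 2, 4, 8) are arbitrary, and the whole frame is used where the slide works on an N x Pann matrix of annulus pixels.

```python
import numpy as np

def low_rank_residuals(mat, levels):
    """Residuals of `mat` after rank-k truncated-SVD approximations.

    mat has shape (n_frames, n_pixels); the result has shape
    (len(levels), n_frames, n_pixels), one residual matrix per level k.
    """
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    residuals = []
    for k in levels:
        approx = (u[:, :k] * s[:k]) @ vt[:k]     # rank-k approximation
        residuals.append(mat - approx)
    return np.stack(residuals)

# Synthetic input cube (assumed): 8 frames of 10 x 10 pixels.
rng = np.random.default_rng(2)
n_frames, ny, nx = 8, 10, 10
cube = rng.standard_normal((n_frames, ny, nx))
mat = cube.reshape(n_frames, ny * nx)

levels = (1, 2, 4, 8)
res = low_rank_residuals(mat, levels)
res_cubes = res.reshape(len(levels), n_frames, ny, nx)   # back to image space
norms = [float(np.linalg.norm(r)) for r in res]
print(res_cubes.shape, norms)
```

Patches cut from the same position in each level's residual image would then be stacked into one multi-level sample for the classifier; that cropping step is not shown here.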
  17. Performance assessment: good classifier vs. bad classifier

    [figure labels: observations, threshold, true positive, true negative, false positive, false negative]
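The quantities on this slide follow from comparing classifier scores against a threshold. The sketch below counts true/false positives and negatives and derives the true and false positive rates; the scores and labels are made-up numbers, and `confusion_counts` and `rates` are hypothetical helper names.

```python
import numpy as np

def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, FN, TN) when scores >= threshold count as detections."""
    pred = scores >= threshold
    truth = labels.astype(bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    return tp, fp, fn, tn

def rates(scores, labels, threshold):
    """True positive rate and false positive rate at one threshold."""
    tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), fp / (fp + tn)

# Made-up classifier outputs: 4 observations with a planet, 4 without.
scores = np.array([0.95, 0.92, 0.85, 0.40, 0.91, 0.30, 0.10, 0.05])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Sweeping the threshold traces out a ROC curve, one (FPR, TPR) point each.
for thr in (0.9, 0.5, 0.2):
    tpr, fpr = rates(scores, labels, thr)
    print(f"threshold={thr}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

A good classifier keeps the true positive rate high while the false positive rate stays low across thresholds; a bad one trades them off almost one for one.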
  18. Performance assessment

  19. Image (model PSF) subtraction vs. supervised detection (SODINN)

    noisy and unlabelled images → data transformation + adequate (DL) model → astonishing results
  20. None
  21. My research: machine learning & stats, computer science, exoplanets direct

    imaging
  22. What is data science? https://datascience.nyu.edu/what-is-data-science/

    • Over-abused buzzword that can mean anything and everything
    • Does it deal only with “big” data?
    • At its core, DS involves using automated methods to analyze massive amounts of data and to extract knowledge from them.
    • Is DS really that new? How is it different from statistics?
    • DS offers a powerful new approach to making discoveries. By combining aspects of statistics, CS, applied mathematics, and visualization, DS can turn the vast amounts of data the digital age generates into new insights and new knowledge.
  23. Disciplinarities Credit: Alexander Refsum Jensenius http://www.arj.no/2012/03/12/disciplinarities-2/ Not so easy to

    get here!
  24. Interdisciplinarity: challenges

    • Interdisciplinary expertise isn’t properly recognized
    • Inadequate metrics and assessment mechanisms for promotion
    • No clear paths/protocols for establishing collaborations (multidisciplinarity)
    • It is often not trivial to navigate and integrate knowledge from different disciplines
    • Never-ending impostor syndrome
    https://www.nature.com/articles/s41599-017-0039-7
    http://blog.fperez.org/2013/11/an-ambitious-experiment-in-data-science.html
    https://www.space.com/39420-becoming-astrophysicist-keeps-getting-tougher.html
  25. Opportunities

    In academia:
    • Change modern science in a positive way
    • Freedom of research
    • Exciting cross- and interdisciplinary projects
    Outside academia:
    • Industry rewards the skills and behavior that are not properly valued in academia (The Sexiest Job of the 21st Century)
    • Building up transferable skills
    • Opportunity to work on “interesting” topics
    http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
    https://danielahuppenkothen.wordpress.com/2018/01/13/lessons-learned-in-data-science/
  26. http://brohrer.github.io/get_data_science_job.html https://www.datanami.com/2016/08/29/9-paths-data-science-interview/ https://www.datacamp.com/community/tutorials/data-science-industry-infographic Data analyst, data scientist, data engineer, machine

    learning engineer… or stay in academia and…
  27. Academic data science. Transform science:

    • Cross- and interdisciplinary research (collaboration with CS, ML, AI fields)
    • Ensuring the use of robust statistical approaches and well-suited metrics
    • Integrating cutting-edge AI developments
    http://jakevdp.github.io/blog/2014/08/22/hacking-academia/
  28. Transforming science: • Code release (open-source development) • Knowledge sharing

    • non-refereed publications • The Journal of Open Source Software: https://joss.theoj.org/ • Data challenges (benchmark datasets) Academic data science
  29. carlos.gomez@univ-grenoble-alpes.fr carlgogo carlosalbertogomezgonzalez https://carlgogo.github.io/ ¡Gracias!