Resources for Data Science in Astronomy

gully
November 30, 2015

An overview and tutorial of resources for software engineering and data literacy as applicable to astronomy research software. The 1-hour talk was given to the graduate students and postdocs at the Kavli Institute for Astronomy and Astrophysics at Peking University in Beijing, China on November 30, 2015, during the Grad Dinner Talk Series.

Transcript

  1. Resources for data science in astronomy

    Michael Gully-Santiago
    Postdoctoral Researcher
    Kavli Institute for Astronomy and Astrophysics
    Created by Lisa Oregioni from the Noun Project
    Created by OliM from the Noun Project
  2. About me: past

    • PhD May 2015, University of Texas at Austin Department of Astronomy; Austin, Texas, USA.
    • Advisor: Daniel T. Jaffe, PI of IGRINS, GMTNIRS, now VP for Research.
    • Diffractive optics development for near-IR spectroscopy: JWST, IGRINS, iSHELL, etc.
    • NASA Fellow at Jet Propulsion Lab, Pasadena.
    • Discovery and characterization of young brown dwarfs in nearby star forming regions
    • Thesis topic: "Innovative technologies for---and Observational Studies of--- Star and Planet Formation."
    • BA Astronomy & Physics 2007, Boston University Department of Astronomy; Boston, MA, USA
  3. About me: present

    • Postdoc, Kavli Institute for Astronomy and Astrophysics
    • Collaborator: Greg Herczeg
    • Research topic: Inference of high grasp near-IR spectra of young stars.
    • Interests in the future of research software in astronomy, large surveys, and leveraging modern statistical and computational tools.
    [Figure: plot panels with axes [Fe/H], T_eff [K], and v sin i [km s^-1]]
  4. About this talk. Basically,

    1. Software and statistics tools ease the analysis, interpretation, and sharing of astronomical datasets.
    2. Most of these tools have only recently been developed, or have only recently come into widespread use.
    3. Many of these tools were not developed specifically for astronomy, but they are useful for astronomy.
    4. Current and future astronomy datasets will require knowledge of these tools, especially algorithms that easily "scale" with the number of data points.
  5. Disclaimer

    • Some of the material presented here is controversial.
    • The mere use of the term Data Science in academic environments can be controversial, because it is mistaken for business analytics. In this talk, I will use Data Science as equal to Data Literacy.
    • My point of view is mostly that of an American familiar with academic data science and, to a lesser extent, tech startup companies. I know very little about data science in China.
    • There is still a lot I don't know. I hope merely to provide information that will facilitate your own exploration, if this interests you.
    • I'm probably biased toward Mac OS X.
  6. • Package management
        • shell, conda, homebrew, texlive, dotfiles
    • Version control/reproducibility
        • git, GitHub, Jupyter Notebooks, LaTeX/Makefiles
    • Python:
        • pandas, astropy, seaborn, aplpy, astroML, scipy, emcee, Jupyter, bokeh, astroquery
    • Statistics
        • Fitting a straight line to data, Gaussian process regression, Machine Learning
    • Astronomy
        • astroML book, ApJ data tables, .Astronomy, AstroHackWeek, DSE's,
  7. You first need to install the software in order to use it. The main challenge is satisfying dependencies.
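One way to see which dependencies are still unsatisfied is to ask Python which packages it can find. The sketch below is only an illustration: the function name `missing_packages` is hypothetical, and the list assumes the usual import names of the Python packages mentioned in the outline.

```python
import importlib.util

# Assumed import names of the Python packages listed in the talk outline.
TALK_PACKAGES = ["pandas", "astropy", "seaborn", "aplpy", "astroML",
                 "scipy", "emcee", "bokeh", "astroquery"]


def missing_packages(names):
    """Return the top-level package names that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(TALK_PACKAGES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All packages found.")
```

Anything reported as missing can then be installed with a package manager such as conda, which also resolves the dependencies of each package.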
  8. Version control systems solve this problem. git is a version control system (also: CVS, SVN, Mercurial). How about scripts for making plots? Which code made which figure with what data? Jupyter Notebooks solve this problem.
  9. A directory that is version-controlled with git is called a repository. When you edit files, the repository notices.
  10. Incremental changes to the repository are saved as commits. Each commit has a unique identifier. In this case: 15a918e.
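Where do identifiers like this come from? By default, git names every stored object by the SHA-1 hash of a short header plus the object's contents, and a seven-character identifier is an abbreviation of that 40-character hash. The sketch below reproduces the scheme for a file ("blob") object. The function name `git_object_id` is hypothetical, and a real commit object additionally contains the tree, parent, author, and message, so this does not reproduce the identifier on the slide.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 of a git object: the header '<kind> <size>' plus a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A "blob" is how git stores file contents; commit objects use the kind "commit".
full_id = git_object_id("blob", b"hello world\n")
short_id = full_id[:7]  # abbreviated form, the same length as 15a918e
print(full_id, short_id)
```

Because the identifier is derived from the content, changing a single byte of a file or commit yields a different identifier.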
  11. A git repository saves the entire revision history of a project. Where? In a hidden ".git" directory.
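Since the history lives in that hidden directory, a script can tell whether it is running inside a repository by searching upward for ".git". This is a minimal sketch using only the standard library; the function name `find_repo_root` is hypothetical.

```python
from pathlib import Path


def find_repo_root(start):
    """Walk upward from `start` and return the first directory that holds '.git'."""
    current = Path(start).resolve()
    for directory in [current, *current.parents]:
        if (directory / ".git").exists():
            return directory
    return None  # not inside a git repository


if __name__ == "__main__":
    print(find_repo_root("."))
```

The check uses `exists()` rather than `is_dir()` because in some setups, such as submodules, ".git" is a small file that points to the real directory.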
  12. You can, but don't have to, sync your local changes to a remote repository: pull commits from the remote, and push commits to the remote.
  13. Git is very useful for collaboration. Contributors can make "pull requests" to someone else's repository. Project maintainers can decide to merge the pull requests.
  14. Projects with many contributors can get complicated. git has many features for dealing with merge conflicts. If you are working alone locally, you don't have to worry about merge conflicts.
  15. There are many useful aspects to using git. One example is going back to previous versions in your repository by "checking out" commits or branches.
  16. This way of thinking makes it much easier to be "experimental", because you need not be afraid to screw up your code. If you don't like the changes, you can go back easily. This way of thinking is encoded into git with branches. If you are interested in branching and more advanced features, there are many good resources for learning git. Try git here: try.github.io. My favorite is codeschool.com, but unfortunately an account costs money.
  17. GitHub is a website that hosts remote git repositories. GitHub also provides features to "socially code". You need a GitHub account to store your repositories on GitHub. GitHub is free for public repositories. For students: GitHub is free for up to 5 (or 20) private repositories. Private repositories are repositories that no one else can see.
  18. How to learn more astrophysics-specific statistics

    1. Statistics, Data Mining, and Machine Learning in Astronomy Textbook (astroml.org)
    2. David Hogg videos and blog
    3. An astronomer's introduction to Gaussian processes
    4. Effective Computation in Physics Book
    5. Astro Data Hack Week (astrohackweek.github.io/)
  19. There are so many resources if you want more broad/general data literacy.

    6. Open Source Data Science masters (datasciencemasters.org/)
    7. Python for Data Analysis book (pandas.pydata.org/)
    8. Hacker Rank (hackerrank.com/)
    9. Kaggle (www.kaggle.com)
    10. SQL Zoo (sqlzoo.net/)
    11. Code School (www.codeschool.com)
    12. Codecademy (www.codecademy.com)
  20. • An Astronomer's Introduction to Gaussian Processes (https://speakerdeck.com/dfm/an-astronomers-introduction-to-gaussian-processes-v2)
    • dotfiles (https://github.com/gully/dotfiles)
    • Awesome Astronomy (https://github.com/jonathansick/awesome-astronomy)
    • Astro Hack Week Resources (https://github.com/AstroHackWeek)
  21. • Large Synoptic Survey Telescope Data Management Team (http://dm.lsst.org/)
    • Pandas (http://pandas.pydata.org/)
    • astropy (http://astropy.org/)
    • astroquery (https://github.com/astropy/astroquery)
    • Apache Spark (http://spark.apache.org/)
    • Jake Vanderplas (https://jakevdp.github.io/)
    • Continuum Analytics (https://www.continuum.io/)
    • Anaconda Academic subscriptions (https://www.continuum.io/anaconda-academic-subscriptions-available)
    • GitHub Education (https://education.github.com/)
    • SciPy conference with video tutorials (http://scipy2015.scipy.org/)
    • Beijing Python Meetup (http://www.meetup.com/Beijing-Python/)
    • Docker (https://www.docker.com/)
  22. • Institutional Hack Days are becoming more common. I started one at the UT Austin Department of Astronomy. I'll probably do one here.
    • Best: attend Astro Hack Week, .Astronomy, or a hack day as part of an astronomy conference.