Slide 1

Slide 1 text

Resources for data science in astronomy
Michael Gully-Santiago
Postdoctoral Researcher
Kavli Institute for Astronomy and Astrophysics
Created by Lisa Oregioni from the Noun Project
Created by OliM from the Noun Project

Slide 2

Slide 2 text

About me: past
• PhD May 2015, University of Texas at Austin Department of Astronomy; Austin, Texas, USA.
• Advisor: Daniel T. Jaffe, PI of IGRINS, GMTNIRS, now VP for Research.
• Diffractive optics development for near-IR spectroscopy: JWST, IGRINS, iSHELL, etc.
• NASA Fellow at Jet Propulsion Lab, Pasadena.
• Discovery and characterization of young brown dwarfs in nearby star forming regions.
• Thesis topic: "Innovative technologies for---and Observational Studies of--- Star and Planet Formation."
• BA Astronomy & Physics 2007, Boston University Department of Astronomy; Boston, MA, USA.

Slide 3

Slide 3 text

About me: present
• Postdoc, Kavli Institute for Astronomy and Astrophysics
• Collaborator: Greg Herczeg
• Research topic: Inference of high grasp near-IR spectra of young stars.
• Interests in the future of research software in astronomy, large surveys, and leveraging modern statistical and computational tools.
[Figure: panels with axes [Fe/H], Teff [K], and v sin i [km s^-1], plus a diagram of model parameters]

Slide 4

Slide 4 text

About this talk. Basically:
1. Software and statistics tools ease the analysis, interpretation, and sharing of astronomical datasets.
2. Most of these tools have only recently been developed, or have only recently come into widespread use.
3. Many of these tools were not developed specifically for astronomy, but they are useful for astronomy.
4. Current and future astronomy datasets will require knowledge of these tools, especially algorithms that easily "scale" with the number of data points.

Slide 5

Slide 5 text

Disclaimer
• Some of the material presented here is controversial.
• The mere use of the term Data Science in academic environments can be controversial, because it is mistaken for business analytics. In this talk, I will use Data Science to mean Data Literacy.
• My point of view is mostly that of an American familiar with academic data science and, to a lesser extent, tech startup companies. I know very little about data science in China.
• There is still a lot I don't know. I hope merely to provide information that will facilitate your own exploration, if this interests you.
• I'm probably biased toward Mac OS X.

Slide 6

Slide 6 text

• Package management
  • shell, conda, homebrew, texlive, dotfiles
• Version control/reproducibility
  • git, GitHub, Jupyter Notebooks, LaTeX/Makefiles
• Python:
  • pandas, astropy, seaborn, aplpy, astroML, scipy, emcee, Jupyter, bokeh, astroquery
• Statistics
  • Fitting a straight line to data, Gaussian process regression, Machine Learning
• Astronomy
  • astroML book, ApJ data tables, .Astronomy, AstroHackWeek, DSE's

Slide 7

Slide 7 text

Installing code

Slide 8

Slide 8 text

You need to install software before you can use it. The main challenge is satisfying dependencies.
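One quick way to see which dependencies are already satisfied is to ask Python which modules it can locate. The sketch below is only an illustration using the standard library; the helper name `missing_modules` and the example module list are assumptions for this example, not something shown in the talk.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that Python cannot locate for import."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed.
        if find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical dependency list, drawn from packages named in this talk.
    wanted = ["numpy", "pandas", "astropy", "emcee"]
    print("Missing:", missing_modules(wanted) or "none")
```

A check like this only reports whether a module can be found; it does not verify versions, which is the part a package manager handles for you.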

Slide 9

Slide 9 text

I wrote a blog post on this. http://bit.ly/1XAv5VO http://gully.github.io/2015/11/15/brews_gems_dotfiles_conda_installing_on_mac/

Slide 10

Slide 10 text

Conda is the best way to install and manage Python on your computer.

Slide 11

Slide 11 text

Conda is the best way to install and manage Python on your computer.

Slide 12

Slide 12 text

Conda is the best way to install and manage Python on your computer.

Slide 13

Slide 13 text

Conda is the best way to install and manage Python on your computer.

Slide 14

Slide 14 text

Conda is the best way to install and manage Python on your computer.

Slide 15

Slide 15 text

Conda is the best way to install and manage Python on your computer.

Slide 16

Slide 16 text

You can install a lot of other tools with Homebrew if you have a Mac.

Slide 17

Slide 17 text

You can install a lot of other tools with Homebrew if you have a Mac.

Slide 18

Slide 18 text

You can install a lot of other tools with Homebrew if you have a Mac.

Slide 19

Slide 19 text

You can install a lot of other tools with Homebrew if you have a Mac.

Slide 20

Slide 20 text

There's also the TeX Live Manager (tlmgr) for installing and updating LaTeX packages.

Slide 21

Slide 21 text

Version control

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Version control systems solve this problem. git is a version control system (others include CVS, SVN, and Mercurial). How about scripts for making plots? Which code made which figure with what data? Jupyter Notebooks solve this problem.

Slide 25

Slide 25 text

A directory that is version-controlled with git is called a repository. When you edit files, the repository notices.

Slide 26

Slide 26 text

You can add and remove files from being tracked in version control.

Slide 27

Slide 27 text

Incremental changes to the repository are saved as commits. Each commit has a unique identifier. In this case: 15a918e.
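Identifiers like 15a918e are abbreviated hexadecimal hashes of an object's contents. As a rough sketch, assuming git's default SHA-1 object format, the snippet below reproduces the identifier git assigns to a file's contents (a "blob"). Commit identifiers are built on the same scheme, applied to a commit object that records the snapshot, parent commit, author, and message. The function name `git_blob_id` is a hypothetical helper, not part of git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Hash file contents the way `git hash-object` does for a blob (SHA-1 format)."""
    # Git prefixes the data with a header: object type, size in bytes, and a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    full_id = git_blob_id(b"hello world\n")
    print(full_id)      # full 40-character identifier
    print(full_id[:7])  # abbreviated form, the same length as the ID on the slide
```

Because the identifier depends only on the content, any change to a file or to a commit's history produces a different ID.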

Slide 28

Slide 28 text

A git repository saves the entire revision history of a project. Where? In a hidden ".git" directory.
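Since the history lives in that hidden directory, you can tell whether a folder belongs to a repository by looking for ".git" in it or in any parent folder. The following is a minimal sketch of that idea; `find_repo_root` is a hypothetical helper written for this example, not a git command.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `.git`."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        # `.git` is normally a directory, but it can be a file (e.g. in submodules).
        if (directory / ".git").exists():
            return directory
    return None


if __name__ == "__main__":
    root = find_repo_root(Path.cwd())
    print(root if root else "Not inside a git repository")
```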

Slide 29

Slide 29 text

You can, but don't have to, sync your local changes to a remote repository:
• pull commits from the remote
• push commits to the remote

Slide 30

Slide 30 text

Git is very useful for collaboration. Contributors can make "pull requests" to someone else's repository. Project maintainers can decide to merge the pull requests.

Slide 31

Slide 31 text

Projects with many contributors can get complicated. git has many features for dealing with merge conflicts. If you are working alone locally, you don't have to worry about merge conflicts.

Slide 32

Slide 32 text

There are many useful aspects to using git. One example is going back to previous versions in your repository, by "checking out" commits or branches.

Slide 33

Slide 33 text

This way of thinking makes it much easier to be "experimental": you need not be afraid to screw up your code, because if you don't like the changes, you can easily go back. git encodes this way of thinking with branches. If you are interested in branching and other advanced features, there are many good resources for learning git. Try git here: try.github.io. My favorite is codeschool.com, but unfortunately an account costs money.

Slide 34

Slide 34 text

Remote repositories can be hosted on GitHub.

Slide 35

Slide 35 text

GitHub is a website that hosts remote git repositories. GitHub also provides features to "socially code". You need a GitHub account to store your repositories on GitHub. GitHub is free for public repositories. For students: GitHub is free for up to 5 (or 20) private repositories. Private repositories are repositories that no one else can see.

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Many astronomers are on GitHub

Slide 38

Slide 38 text

GitHub is increasingly popular in astronomy. [Figure: number of papers on astro-ph that mention GitHub.]

Slide 39

Slide 39 text

There are many astronomy projects on GitHub.

Slide 40

Slide 40 text

Python

Slide 41

Slide 41 text

Python is increasingly popular in astronomy.

Slide 42

Slide 42 text

PyData ecosystem + astronomy-specific libraries

Slide 43

Slide 43 text

Here are the modules I will demonstrate today.

Slide 44

Slide 44 text

Statistics

Slide 45

Slide 45 text

How to learn more astrophysics-specific statistics:
1. Statistics, Data Mining, and Machine Learning in Astronomy textbook (astroml.org)
2. David Hogg videos and blog
3. An Astronomer's Introduction to Gaussian Processes
4. Effective Computation in Physics book
5. Astro Data Hack Week (astrohackweek.github.io/)

Slide 46

Slide 46 text

There are many resources if you want broader, more general data literacy:
6. Open Source Data Science Masters (datasciencemasters.org/)
7. Python for Data Analysis book (pandas.pydata.org/)
8. Hacker Rank (hackerrank.com/)
9. Kaggle (www.kaggle.com)
10. SQL Zoo (sqlzoo.net/)
11. Code School (www.codeschool.com)
12. Codecademy (www.codecademy.com)

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

• An Astronomer's Introduction to Gaussian Processes (https://speakerdeck.com/dfm/an-astronomers-introduction-to-gaussian-processes-v2)
• dotfiles (https://github.com/gully/dotfiles)
• Awesome Astronomy (https://github.com/jonathansick/awesome-astronomy)
• Astro Hack Week Resources (https://github.com/AstroHackWeek)

Slide 49

Slide 49 text

• Large Synoptic Survey Telescope Data Management Team (http://dm.lsst.org/)
• Pandas (http://pandas.pydata.org/)
• astropy (http://astropy.org/)
• astroquery (https://github.com/astropy/astroquery)
• Apache Spark (http://spark.apache.org/)
• Jake Vanderplas (https://jakevdp.github.io/)
• Continuum Analytics (https://www.continuum.io/)
• Anaconda Academic subscriptions (https://www.continuum.io/anaconda-academic-subscriptions-available)
• GitHub Education (https://education.github.com/)
• SciPy conference with video tutorials (http://scipy2015.scipy.org/)
• Beijing Python Meetup (http://www.meetup.com/Beijing-Python/)
• Docker (https://www.docker.com/)

Slide 50

Slide 50 text

• Institutional Hack Days are becoming more common. I started one at the UT Austin Department of Astronomy. I'll probably do one here.
• Best: attend Astro Hack Week, .Astronomy, or a hack day as part of an astronomy conference.