Slide 1

Slide 1 text

#SciPy2015
The eScience Institute: Data Science at UW
Jake VanderPlas; @jakevdp
PyData Seattle; July 24, 2015

Slide 2

Slide 2 text

UW eScience is a cross-disciplinary institute at the University of Washington (Seattle) that provides training and support in modern data-intensive research.

Slide 3

Slide 3 text

Major Support:
Gordon and Betty Moore Foundation & Alfred P. Sloan Foundation
- $38 million over 5 years, split between UW, NYU, and UC Berkeley
Washington Research Foundation
- $9.3 million over 5 years for faculty & postdocs
- Also $7.1 million to the closely aligned Institute for Neuroengineering
University of Washington
- $550,000/year for staff support
- $600,000/year for faculty support
National Science Foundation
- $2.8 million over 5 years for graduate program development and Ph.D. student funding (IGERT)

Slide 4

Slide 4 text

Case Study: Astronomy

Slide 5

Slide 5 text

My Main Research: Astrostatistics
- How to extend current methods for use on sparse, noisy, heterogeneous data
- How to scale current methods for large datasets from future surveys
- Development of new statistical methods to answer relevant scientific questions

Slide 6

Slide 6 text

Image: Jeff Seivert; http://lcas-astronomy.org

Slide 7

Slide 7 text

The Pleiades cluster . . .
Image: Jeff Seivert; http://lcas-astronomy.org

Slide 8

Slide 8 text

The Pleiades cluster . . .
Question: How do we determine the distance to far-away stars?
Image: Jeff Seivert; http://lcas-astronomy.org

Slide 9

Slide 9 text

Science depending on these distances:
- Big Bang & expansion
- Dark Matter & Dark Energy
- Stellar Evolution
- Spiral Structure of our Galaxy
- Characterization of Extrasolar Planets
Image: Jeff Seivert; http://lcas-astronomy.org

Slide 10

Slide 10 text

The Pleiades cluster . . .
Question: How do we determine the distance to far-away stars?
Image: Jeff Seivert; http://lcas-astronomy.org

Slide 11

Slide 11 text

Distances via Parallax
Images: Jeff Seivert; http://lcas-astronomy.org and Ricky Patterson; http://www.astro.virginia.edu/
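A minimal sketch of the parallax idea, assuming the standard small-angle relation d [parsec] = 1 / p [arcsec]. The function name is hypothetical and the input value is illustrative, not a measurement from the talk.

```python
def parallax_to_distance_pc(parallax_arcsec):
    """Convert an annual parallax angle (arcseconds) to a distance (parsecs).

    Uses d = 1 / p, which holds for the small angles seen in stellar parallax.
    """
    if parallax_arcsec <= 0:
        raise ValueError("parallax must be positive")
    return 1.0 / parallax_arcsec


# Illustrative input only: a star with a parallax of 0.1 arcsec lies at 10 pc.
distance = parallax_to_distance_pc(0.1)
print(f"{distance:.1f} pc")
```

Because the measured angle shrinks as 1/d, its relative uncertainty grows with distance, which is why parallax alone cannot reach far-away stars.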

Slide 12

Slide 12 text

Distances via “Standard Candles”
Image: Jeff Seivert; http://lcas-astronomy.org
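A minimal sketch of how a standard candle yields a distance, assuming the usual distance-modulus relation m - M = 5 log10(d / 10 pc) and ignoring dust extinction. The function name and the magnitudes are hypothetical, chosen only to illustrate the arithmetic.

```python
def distance_from_magnitudes_pc(apparent_mag, absolute_mag):
    """Distance in parsecs from apparent (m) and absolute (M) magnitude.

    Inverts m - M = 5 * log10(d / 10 pc); extinction is ignored here.
    """
    return 10.0 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)


# Illustrative values: an object 5 magnitudes fainter than its absolute
# magnitude sits at 100 pc.
print(distance_from_magnitudes_pc(apparent_mag=5.0, absolute_mag=0.0))
```

The key requirement is knowing the absolute magnitude independently of distance, which is what makes an object a "standard candle".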

Slide 13

Slide 13 text

Variable Stars as Standard Candles
Image: Jeff Seivert; http://lcas-astronomy.org

Slide 14

Slide 14 text

Variable Stars as Standard Candles
Image: Jeff Seivert; http://lcas-astronomy.org

Slide 15

Slide 15 text

Period-Luminosity Relationship for Cepheid Variable Stars
Henrietta Leavitt
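A minimal sketch of a period-luminosity relation, assuming the common linear form M = a (log10 P - 1) + b with P in days. The function name is hypothetical, and the default coefficients are placeholders for illustration rather than values from the talk; a real analysis would substitute a published calibration for the relevant band.

```python
import math


def cepheid_absolute_magnitude(period_days, slope=-2.43, zero_point=-4.05):
    """Absolute magnitude from pulsation period via a linear P-L relation.

    M = slope * (log10(P) - 1) + zero_point
    The default slope and zero_point are illustrative assumptions only.
    """
    if period_days <= 0:
        raise ValueError("period must be positive")
    return slope * (math.log10(period_days) - 1.0) + zero_point


# Longer periods map to more negative magnitudes, i.e. more luminous stars.
for period in (3.0, 10.0, 30.0):
    print(period, round(cepheid_absolute_magnitude(period), 2))
```

Measuring the period gives the absolute magnitude; comparing that with the observed brightness then gives the distance.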

Slide 16

Slide 16 text

Edwin Hubble
Discovered the Distance-Redshift trend (i.e. The Big Bang) using Cepheid variables!

Slide 17

Slide 17 text

But this type of analysis is going to get much harder in the future . . .

Slide 18

Slide 18 text

- 8.4 m Primary Mirror
- 3 Gigapixel camera
- Survey mode: 2 exposures every ~30 seconds
- Covers full southern sky every three nights
- 30,000 GB/night!
- Final catalog: 100s of Petabytes

Slide 19

Slide 19 text

Challenges in the LSST Era
- Scaling: many classic Astrostatistical methods are difficult to apply at scale
- Noise: Fully utilizing LSST data means pushing methods to low signal-to-noise
- Heterogeneous Data: each exposure isolates one wavelength range

Slide 20

Slide 20 text


Slide 21

Slide 21 text


Slide 22

Slide 22 text

Figure: Fraction Recovered versus brightness (Brighter to Fainter), shown for 6 months, 1 year, 2 years, and 5 years.

Slide 23

Slide 23 text

Important Features
- Simple, ridge-regularized linear model: Fast & scalable!
- Well-documented & tested open Python implementation (pip install gatspy)
- Code on GitHub to reproduce the entire paper from scratch (key feature!)
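A minimal sketch of what a "ridge-regularized linear model" for a periodic light curve can look like, assuming a truncated Fourier series fit at a known trial period. This uses NumPy only and is not gatspy's API or the paper's exact model; the function names, the regularization strength, and the simulated data are all hypothetical.

```python
import numpy as np


def fourier_design(t, period, n_terms=1):
    """Design matrix: a constant column, then sin/cos pairs per harmonic."""
    omega = 2 * np.pi / period
    cols = [np.ones_like(t)]
    for k in range(1, n_terms + 1):
        cols.append(np.sin(k * omega * t))
        cols.append(np.cos(k * omega * t))
    return np.column_stack(cols)


def ridge_fit(X, y, lam=1e-3):
    """Solve the ridge normal equations (X^T X + lam * I) theta = X^T y.

    For simplicity the penalty is applied to every coefficient,
    including the constant offset.
    """
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


# Simulated, irregularly sampled light curve (illustrative values only).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 60))            # observation times in days
true_period = 2.5
y = 15.0 + 0.4 * np.sin(2 * np.pi * t / true_period) + rng.normal(0, 0.05, t.size)

X = fourier_design(t, true_period, n_terms=2)
theta = ridge_fit(X, y)
rms = np.sqrt(np.mean((y - X @ theta) ** 2))
print(theta.round(3), round(rms, 3))
```

Because the model is linear in its coefficients, each trial period costs one small linear solve, which is what keeps this kind of fit fast when scanning many periods or many stars.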

Slide 24

Slide 24 text

Nearly Every Field is entering a Data-Rich era
- Astronomy: LSST
- Physics: LHC
- Neuroscience: EEG, fMRI
- Sociology: Web Data
- Biology: Sequencing
- Economics: POS terminals
- Oceanography: OOI

Slide 25

Slide 25 text

- How do we encourage and support good software practice across fields?
- How do we keep good software engineers in academia?
- How do we facilitate interdisciplinary collaboration?
- How do we train academics to be good computational researchers?

Slide 26

Slide 26 text

Feb. 2014 Kickoff Event: 137 posters from 30 UW departments!

Slide 27

Slide 27 text

Original Core Faculty Team
Areas: Data Science Methodology; Biological Sciences; Environmental Sciences; Social Sciences; Physical Sciences
- Cecilia Aragon, Human Centered Design & Engr.
- Magda Balazinska, CSE
- Emily Fox, Statistics
- Carlos Guestrin, CSE
- Bill Howe, CSE
- Jeff Heer, CSE
- Ed Lazowska, CSE
- David Beck, Chem. Engr.
- Tom Daniel, Biology
- Bill Noble, Genome Sciences
- Josh Blumenstock, iSchool
- Mark Ellis, Geography
- Tyler McCormick, Sociology, CSSS
- Ginger Armbrust, Oceanography
- Randy LeVeque, Applied Math
- Thom Richardson, Statistics, CSSS
- Werner Stuetzle, Statistics
- Andy Connolly, Astronomy
- John Vidale, Earth & Space Sciences

Slide 28

Slide 28 text


Slide 29

Slide 29 text


Slide 30

Slide 30 text


Slide 31

Slide 31 text


Slide 32

Slide 32 text

UW eScience Highlights
Promoting Interdisciplinary Careers
- Data Science PhD program
- Interdisciplinary Postdocs
- Interdisciplinary research scientists
- Interdisciplinary Faculty
- Short- and long-term “Senior Research Fellows”
Most have one foot in eScience, one foot in their domain department.

Slide 33

Slide 33 text

Example: Mario Juric, Astronomy
- Data Management Lead for LSST
- Professor of Astronomy, UW
- Sr. Fellow at UW eScience
Working on scalable software infrastructure for the LSST project, especially regarding the formation, structure, and evolution of the Milky Way.
Faculty position half-funded through eScience, half through Astronomy.

Slide 34

Slide 34 text

UW eScience Highlights
Data Science Incubator Program
- Students/postdocs/faculty paired with a resident data scientist for one quarter
- Researcher works in the Data Science Studio two days per week
- Lightweight 1-page proposals, emphasis on quick & impactful results

Slide 35

Slide 35 text

Example – Fall 2014 Incubator: Unlocking Kenya’s Health Data
Results
- Generalizable method to process HIS-like data
- Clean, integrate, and make available this data for research
- Preliminary analysis of HIS data – known spikes in malaria visible in the data, finally
“Much of the material remains unprocessed, or, if processed, unanalyzed, or, if analyzed, not read, or, if read, not used or acted upon”

Slide 36

Slide 36 text

UW eScience Highlights
Interdisciplinary Education
- Data Science “tracks” in several graduate departments (IGERT fellowship program)
- Data Science certificates in several undergraduate departments (soon)
- Formal and informal workshops on relevant topics

Slide 37

Slide 37 text

2014 cohort IGERT PhD student fellows
- Cecilia Noecker: Genome Sc. & ML
- Matt Murbach: ChemE & ML
- Ryan Maas: CS & Astro
- Alex Tank: Stats & Allen Inst. for Brain Science
- Grace Telford: Astro & Stats
- Will Gagne-Maynard: Oceanography & MSR

Slide 38

Slide 38 text

UW eScience Highlights
Data Science Studio
- Permanent desks for eScience fellows
- Open collaboration space
- Conference & Meeting rooms
- Drop-in “office hours”
- Seminars, sponsored lunches, summer programs, etc.

Slide 39

Slide 39 text

WRF Data Science Studio

Slide 40

Slide 40 text

WRF Data Science Studio: full 6th floor

Slide 41

Slide 41 text

Example: Summer 2015 Data Science for Social Good
16 student interns matched with 6 data scientists working on data-oriented social-good projects with 4 non-profit groups
- Assessing Community Well-Being Through Open Data
- King County Metro Paratransit
- Open Sidewalk Graph for Accessible Trip Planning
- Predictors of Permanent Housing for Homeless Families

Slide 42

Slide 42 text


Slide 43

Slide 43 text

~ Thank You! ~
Email: [email protected]
Twitter: @jakevdp
Github: jakevdp
Web: http://vanderplas.com/
Blog: http://jakevdp.github.io/