The eScience Institute: Data Science at UW

The eScience Institute is an interdisciplinary institute at the University of Washington, focusing on enabling data-intensive research on campus. In this talk, I will go over some of the essential components of the effort, and dive deeper into an example of my own research on statistical methods for Astronomy.


Jake VanderPlas

July 25, 2015


  1. #SciPy2015 Jake VanderPlas The eScience Institute: Data Science at UW

    Jake VanderPlas; @jakevdp PyData Seattle; July 24, 2015
  2. UW eScience is a cross-disciplinary institute at the University of Washington (Seattle) providing training and support in modern data-intensive research.
  3. Major Support:

    Gordon and Betty Moore Foundation & Alfred P. Sloan Foundation
    - $38 million over 5 years, split between UW, NYU, and UC Berkeley

    Washington Research Foundation
    - $9.3 million over 5 years for faculty & postdocs
    - Also $7.1 million to the closely-aligned Institute for Neuroengineering

    University of Washington
    - $550,000/year for staff support
    - $600,000/year for faculty support

    National Science Foundation
    - $2.8 million over 5 years for graduate program development and Ph.D. student funding (IGERT)
  4. Case Study: Astronomy

  5. My Main Research: Astrostatistics
    - How to extend current methods for use on sparse, noisy, heterogeneous data
    - How to scale current methods for large datasets from future surveys
    - Development of new statistical methods to answer relevant scientific questions
  6. Image: Jeff Seivert

  7. The Pleiades cluster . . . Image: Jeff Seivert

  8. The Pleiades cluster . . . Image: Jeff Seivert; Question: How do we determine the distance to far-away stars?
  9. Image: Jeff Seivert; Science depending on these distances:
    - Big Bang & expansion
    - Dark Matter & Dark Energy
    - Stellar Evolution
    - Spiral Structure of our Galaxy
    - Characterization of Extrasolar Planets
  10. The Pleiades cluster . . . Image: Jeff Seivert; Question: How do we determine the distance to far-away stars?
  11. Image: Jeff Seivert; Ricky Patterson; Distances via Parallax
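The parallax method reduces to a one-line relation: if p is a star's annual parallax angle in arcseconds (the angle subtended by the Earth-Sun distance as seen from the star), its distance is d = 1/p parsecs. A minimal Python sketch of that relation follows. The function name and the sample parallax value are illustrative assumptions, not taken from the talk.

```python
PC_TO_LY = 3.2616  # approximate number of light-years in one parsec


def parallax_to_distance_pc(parallax_arcsec):
    """Return the distance in parsecs for a parallax angle in arcseconds (d = 1 / p)."""
    if parallax_arcsec <= 0:
        raise ValueError("parallax must be a positive angle in arcseconds")
    return 1.0 / parallax_arcsec


# Illustrative value only: a parallax of 0.0074 arcsec (7.4 milliarcseconds)
example_parallax = 0.0074
distance_pc = parallax_to_distance_pc(example_parallax)
print(f"{distance_pc:.0f} pc, about {distance_pc * PC_TO_LY:.0f} light-years")
```

Because the angle shrinks as 1/d, a fixed measurement error on p turns into a rapidly growing relative error on d, which limits parallax to comparatively nearby stars.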
  12. Image: Jeff Seivert; Distances via “Standard Candles”

  13. Image: Jeff Seivert; Variable Stars as Standard Candles

  14. Image: Jeff Seivert; Variable Stars as Standard Candles

  15. Period-Luminosity Relationship for Cepheid Variable Stars (Henrietta Leavitt)
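Once a period-luminosity relationship supplies a Cepheid's intrinsic brightness, the distance follows from the standard distance modulus, m - M = 5 log10(d / 10 pc). The sketch below assumes the absolute magnitude M has already been obtained from a calibrated period-luminosity relation. The talk gives no calibration coefficients, so none are used here. The function name and the magnitudes are hypothetical, and interstellar dust extinction is ignored.

```python
def distance_from_magnitudes(apparent_mag, absolute_mag):
    """Distance in parsecs from the distance modulus m - M = 5 * log10(d / 10 pc)."""
    return 10 ** ((apparent_mag - absolute_mag + 5) / 5)


# Hypothetical magnitudes, for illustration only
m_observed = 11.0    # apparent magnitude measured at the telescope
M_intrinsic = -4.0   # absolute magnitude, assumed to come from a period-luminosity relation
print(f"{distance_from_magnitudes(m_observed, M_intrinsic):.0f} pc")
```

Every 5 magnitudes of difference between m and M corresponds to a factor of 100 in brightness, and therefore a factor of 10 in distance.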

  16. Edwin Hubble discovered the Distance-Redshift trend (i.e. the Big Bang) using Cepheid variables!
  17. But this type of analysis is going to get much harder in the future . . .
  18. - 8.4m Primary Mirror
      - 3 Gigapixel camera
      - Survey mode: 2 exposures every ~30 seconds
      - Covers full southern sky every three nights
      - 30,000 GB/night!
      - Final catalog: 100s of Petabytes
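To get a feel for these numbers, the nightly rate can be scaled up to a full survey. The sketch below takes the 30,000 GB/night figure from the slide. The survey length and the number of observing nights per year are assumptions chosen for illustration, and the result covers raw data only, not the processed catalog.

```python
GB_PER_NIGHT = 30_000            # from the slide
ASSUMED_NIGHTS_PER_YEAR = 300    # assumption: not stated in the talk
ASSUMED_SURVEY_YEARS = 10        # assumption: not stated in the talk


def raw_volume_pb(gb_per_night, nights_per_year, years):
    """Total raw data volume in petabytes, using decimal units (1 PB = 1,000,000 GB)."""
    total_gb = gb_per_night * nights_per_year * years
    return total_gb / 1_000_000


raw_pb = raw_volume_pb(GB_PER_NIGHT, ASSUMED_NIGHTS_PER_YEAR, ASSUMED_SURVEY_YEARS)
print(f"raw data under these assumptions: {raw_pb:.0f} PB")
```

Under these assumptions the raw images alone come to 90 PB, so a final catalog in the hundreds of petabytes is the right order of magnitude once processed data products are included.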
  19. Challenges in the LSST Era
    - Scaling: many classic Astrostatistical methods are difficult to apply at scale
    - Noise: Fully utilizing LSST data means pushing methods to low signal-to-noise
    - Heterogeneous Data: each exposure isolates one wavelength range
  20. Jake VanderPlas

  21. Jake VanderPlas

  22. [Figure: fraction recovered vs. brightness (brighter to fainter); labels: 6 months, 1 year, 2 years, 5 years]
  23. Important Features
    - Simple, ridge-regularized linear model: Fast & scalable!
    - Well-documented & tested open Python implementation (pip install gatspy)
    - Code on GitHub to reproduce the entire paper from scratch (key feature!)
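As a rough illustration of what a ridge-regularized linear model can look like for periodic data, the sketch below fits a constant plus one sine/cosine pair at a trial period and keeps the trial period with the smallest squared residual. It is a standard-library-only toy, not the gatspy implementation or the method from the paper. The function names, the penalty value `lam`, and the synthetic light curve are all assumptions made for this example.

```python
import math


def design_row(t, period):
    """Basis functions [1, sin(wt), cos(wt)] for one observation time."""
    omega = 2 * math.pi / period
    return [1.0, math.sin(omega * t), math.cos(omega * t)]


def solve_linear(A, b):
    """Solve A x = b with Gaussian elimination and partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_fit(times, values, period, lam=1e-3):
    """Ridge solution theta = (X^T X + lam * I)^-1 X^T y.

    For simplicity the penalty is applied to every coefficient,
    including the constant offset.
    """
    X = [design_row(t, period) for t in times]
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(row[i] * y for row, y in zip(X, values)) for i in range(k)]
    return solve_linear(A, b)


def squared_residual(times, values, period, lam=1e-3):
    """Sum of squared differences between the data and the ridge fit."""
    theta = ridge_fit(times, values, period, lam)
    total = 0.0
    for t, y in zip(times, values):
        model = sum(c * f for c, f in zip(theta, design_row(t, period)))
        total += (y - model) ** 2
    return total


# Synthetic, noiseless light curve with slightly irregular sampling (illustration only)
true_period = 0.6
times = [0.37 * i + 0.1 * math.sin(1.7 * i) for i in range(60)]
values = [15.0 + 0.4 * math.sin(2 * math.pi * t / true_period) for t in times]

trial_periods = [0.4, 0.5, 0.6, 0.7, 0.8]
best_period = min(trial_periods, key=lambda p: squared_residual(times, values, p))
theta = ridge_fit(times, values, best_period)
print("best trial period:", best_period)
print("fitted amplitude:", math.hypot(theta[1], theta[2]))
```

Each trial period costs only one small linear solve, which is the sense in which a linear model of this kind is fast and scalable. The penalty keeps the solve well-behaved when observations are sparse.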
  24. Nearly Every Field is entering a Data-Rich era
    - Astronomy: LSST
    - Physics: LHC
    - Neuroscience: EEG, fMRI
    - Sociology: Web Data
    - Biology: Sequencing
    - Economics: POS terminals
    - Oceanography: OOI
  25. - How do we encourage and support good software practice across fields?
      - How do we keep good software engineers in academia?
      - How do we facilitate interdisciplinary collaboration?
      - How do we train academics to be good computational researchers?
  26. Feb. 2014 Kickoff Event: 137 posters from 30 UW departments!
  27. Original Core Faculty Team

    Areas: Data Science Methodology, Biological Sciences, Environmental Sciences, Social Sciences, Physical Sciences

    - Cecilia Aragon (Human Centered Design & Engr.)
    - Magda Balazinska (CSE)
    - Emily Fox (Statistics)
    - Carlos Guestrin (CSE)
    - Bill Howe (CSE)
    - Jeff Heer (CSE)
    - Ed Lazowska (CSE)
    - David Beck (Chem. Engr.)
    - Tom Daniel (Biology)
    - Bill Noble (Genome Sciences)
    - Josh Blumenstock (iSchool)
    - Mark Ellis (Geography)
    - Tyler McCormick (Sociology, CSSS)
    - Ginger Armbrust (Oceanography)
    - Randy LeVeque (Applied Math)
    - Thom Richardson (Statistics, CSSS)
    - Werner Stuetzle (Statistics)
    - Andy Connolly (Astronomy)
    - John Vidale (Earth & Space Sciences)
  28. Jake VanderPlas

  29. Jake VanderPlas

  30. Jake VanderPlas

  31. Jake VanderPlas

  32. UW eScience Highlights: Promoting Interdisciplinary Careers
    - Data Science PhD program
    - Interdisciplinary Postdocs
    - Interdisciplinary research scientists
    - Interdisciplinary Faculty
    - Short and long-term “Senior Research Fellows”

    Most have one foot in eScience, one foot in their domain department.
  33. Example: Mario Juric, Astronomy
    - Data Management Lead for LSST
    - Professor of Astronomy, UW
    - Sr. Fellow at UW eScience

    Working on scalable software infrastructure for the LSST project, especially regarding the formation, structure, and evolution of the Milky Way.

    Faculty position half-funded through eScience, half through Astronomy.
  34. UW eScience Highlights: Data Science Incubator Program
    - Students/postdocs/faculty paired with a resident data scientist for one quarter
    - Researcher works in the Data Science Studio two days per week
    - Lightweight 1-page proposals, emphasis on quick & impactful results
  35. Example – Fall 2014 Incubator: Unlocking Kenya’s Health Data

    Results
    - Generalizable method to process HIS-like data
    - Clean, integrate, and make available this data for research
    - Preliminary analysis of HIS data – known spikes in malaria visible in the data, finally

    “Much of the material remains unprocessed, or, if processed, unanalyzed, or, if analyzed, not read, or, if read, not used or acted upon”
  36. UW eScience Highlights: Interdisciplinary Education
    - Data Science “tracks” in several graduate departments (IGERT fellowship program)
    - Data Science certificates in several undergraduate departments (soon)
    - Formal and informal workshops on relevant topics
  37. 2014 cohort IGERT PhD student fellows
    - Cecilia Noecker (Genome Sc. & ML)
    - Matt Murbach (ChemE & ML)
    - Ryan Maas (CS & Astro)
    - Alex Tank (Stats & Allen Inst. for Brain Science)
    - Grace Telford (Astro & Stats)
    - Will Gagne-Maynard (Oceanography & MSR)
  38. UW eScience Highlights: Data Science Studio
    - Permanent desks for eScience fellows
    - Open collaboration space
    - Conference & Meeting rooms
    - Drop-in “office hours”
    - Seminars, sponsored lunches, summer programs, etc.
  39. WRF Data Science Studio

  40. WRF Data Science Studio full 6th floor

  41. Example: Summer 2015 Data Science for Social Good

    16 student interns matched with 6 data scientists working on data-oriented social-good projects with 4 non-profit groups
    - Assessing Community Well-Being Through Open Data
    - King County Metro Paratransit
    - Open Sidewalk Graph for Accessible Trip Planning
    - Predictors of Permanent Housing for Homeless Families
  42. Jake VanderPlas

  43. Jake VanderPlas ~ Thank You! ~ Email: Twitter: @jakevdp

    Github: jakevdp Web: Blog: