The eScience Institute: Data Science at UW

The eScience Institute is an interdisciplinary institute at the University of Washington, focusing on enabling data-intensive research on campus. In this talk, I will go over some of the essential components of the effort, and dive deeper into an example of my own research on statistical methods for Astronomy.

Jake VanderPlas

July 25, 2015

Transcript

  1. #SciPy2015
    Jake VanderPlas
    The eScience Institute:
    Data Science at UW
    Jake VanderPlas; @jakevdp
    PyData Seattle; July 24, 2015

  2. Jake VanderPlas
    UW eScience is a cross-disciplinary institute
    at University of Washington (Seattle)
    providing training and support
    in modern data-intensive
    research.

  3. Jake VanderPlas
    Major Support:
    Gordon and Betty Moore Foundation &
    Alfred P. Sloan Foundation
    - $38 million over 5 years, split between UW, NYU,
    and UC Berkeley
    Washington Research Foundation
    - $9.3 million over 5 years for faculty & postdocs
    - Also $7.1 million to the closely-aligned Institute
    for Neuroengineering
    University of Washington
    - $550,000/year for staff support
    - $600,000/year for faculty support
    National Science Foundation
    - $2.8 million over 5 years for graduate program
    development and Ph.D. student funding (IGERT)

  4. Jake VanderPlas
    Case Study: Astronomy

  5. Jake VanderPlas
    My Main Research: Astrostatistics
    - How to extend current methods for use on
    sparse, noisy, heterogeneous data.
    - How to scale current methods for large
    datasets from future surveys
    - Development of new statistical methods
    to answer relevant scientific questions

  6. Jake VanderPlas
    Image: Jeff Seivert; http://lcas-astronomy.org

  7. Jake VanderPlas
    The Pleiades cluster . . .
    Image: Jeff Seivert; http://lcas-astronomy.org

  8. Jake VanderPlas
    The Pleiades cluster . . .
    Image: Jeff Seivert; http://lcas-astronomy.org
    Question: How do we determine the
    distance to far-away stars?

  9. Jake VanderPlas
    Image: Jeff Seivert; http://lcas-astronomy.org
    Science depending on these distances:
    - Big Bang & expansion
    - Dark Matter & Dark
    Energy
    - Stellar Evolution
    - Spiral Structure of
    our Galaxy
    - Characterization of
    Extrasolar Planets

  10. Jake VanderPlas
    The Pleiades cluster . . .
    Image: Jeff Seivert; http://lcas-astronomy.org
    Question: How do we determine the
    distance to far-away stars?

  11. Jake VanderPlas
    Image: Jeff Seivert; http://lcas-astronomy.org
    Ricky Patterson; http://www.astro.virginia.edu/
    Distances via Parallax
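
The parallax method can be sketched in a few lines. This is a minimal illustration, assuming the standard small-angle relation (distance in parsecs is the reciprocal of the annual parallax in arcseconds); the function name and the 0.1 arcsecond input are illustrative and do not come from the talk.

```python
def parallax_distance_pc(parallax_arcsec: float) -> float:
    """Distance in parsecs from an annual parallax angle in arcseconds.

    Uses d [pc] = 1 / p [arcsec], which holds for the tiny angles
    involved in stellar parallax.
    """
    if parallax_arcsec <= 0:
        raise ValueError("parallax must be positive")
    return 1.0 / parallax_arcsec


# Approximate conversion factor from parsecs to light-years.
PARSEC_IN_LIGHT_YEARS = 3.2616

# Illustrative input only: a star with a 0.1 arcsecond parallax.
d_pc = parallax_distance_pc(0.1)
print(f"{d_pc:.1f} pc = {d_pc * PARSEC_IN_LIGHT_YEARS:.1f} light-years")
```

Because the angle shrinks in proportion to distance, measurement noise limits this method to relatively nearby stars, which is why other distance indicators are needed farther out.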

  12. Jake VanderPlas
    Image: Jeff Seivert; http://lcas-astronomy.org
    Distances via
    “Standard Candles”
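
The idea behind a standard candle can be sketched as follows: if the intrinsic brightness (absolute magnitude M) of an object is known, comparing it with the observed brightness (apparent magnitude m) gives the distance through the distance modulus, m - M = 5 log10(d / 10 pc). The function name and example magnitudes are illustrative assumptions, and the sketch ignores dust extinction.

```python
def distance_from_magnitudes(apparent_mag: float, absolute_mag: float) -> float:
    """Distance in parsecs from the distance modulus.

    Inverts m - M = 5 * log10(d / 10 pc); extinction is ignored.
    """
    return 10.0 ** ((apparent_mag - absolute_mag) / 5.0 + 1.0)


# Illustrative values only: an object that appears 5 magnitudes fainter
# than it would at the reference distance of 10 pc.
d_pc = distance_from_magnitudes(apparent_mag=5.0, absolute_mag=0.0)
print(f"distance: {d_pc:.0f} pc")
```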

  13. Jake VanderPlas
    Image: Jeff Seivert; http://lcas-astronomy.org
    Variable Stars as Standard
    Candles

  14. Jake VanderPlas
    Image: Jeff Seivert; http://lcas-astronomy.org
    Variable Stars as Standard
    Candles

  15. Jake VanderPlas
    Period-Luminosity
    Relationship for Cepheid
    Variable stars
    Henrietta Leavitt
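
A rough sketch of how a period-luminosity relationship turns a measured pulsation period into a distance. It assumes a linear relation between absolute magnitude and log10 of the period; the slope and intercept below are placeholder values chosen for round numbers, not a calibrated relation, and all names are hypothetical.

```python
import math


def absolute_magnitude_from_period(period_days, slope, intercept):
    """Assumed linear period-luminosity relation: M = slope * log10(P) + intercept."""
    return slope * math.log10(period_days) + intercept


def cepheid_distance_pc(period_days, apparent_mag, slope, intercept):
    """Distance in parsecs: period -> absolute magnitude -> distance modulus."""
    absolute_mag = absolute_magnitude_from_period(period_days, slope, intercept)
    return 10.0 ** ((apparent_mag - absolute_mag) / 5.0 + 1.0)


# Placeholder coefficients for illustration only (NOT a measured calibration).
ILLUSTRATIVE_SLOPE = -2.5
ILLUSTRATIVE_INTERCEPT = -1.5

# Illustrative star: 10-day period, observed at apparent magnitude 11.
d_pc = cepheid_distance_pc(10.0, 11.0, ILLUSTRATIVE_SLOPE, ILLUSTRATIVE_INTERCEPT)
print(f"distance: {d_pc:.0f} pc")
```

The key point is that the period is measurable from brightness variations alone, so a well-determined period stands in for the otherwise unknown intrinsic luminosity.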

  16. Jake VanderPlas
    Edwin Hubble
    Discovered Distance-Redshift
    trend (i.e. The Big Bang) using
    Cepheid variables!

  17. Jake VanderPlas
    But this type of analysis is going to get
    much harder in the future . . .

  18. Jake VanderPlas
    - 8.4m Primary Mirror
    - 3 Gigapixel camera
    - Survey mode: 2 exposures
    every ~30 seconds
    - Covers full southern sky
    every three nights
    - 30,000 GB/night!
    - Final catalog:
    100s of Petabytes
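
To get a feel for these numbers, the nightly rate can be scaled up to a whole survey. This is back-of-the-envelope arithmetic only: the 30,000 GB/night figure is from the slide, while the 10-year survey length and 300 observing nights per year are assumptions added here.

```python
NIGHTLY_GB = 30_000               # from the slide
SURVEY_YEARS = 10                 # assumption: planned survey duration
OBSERVING_NIGHTS_PER_YEAR = 300   # assumption: rough allowance for weather and downtime


def cumulative_petabytes(nightly_gb, nights_per_year, years):
    """Total raw data volume in petabytes (decimal units, 1 PB = 10**6 GB)."""
    total_gb = nightly_gb * nights_per_year * years
    return total_gb / 1_000_000


total_pb = cumulative_petabytes(NIGHTLY_GB, OBSERVING_NIGHTS_PER_YEAR, SURVEY_YEARS)
print(f"raw data over the survey: about {total_pb:.0f} PB")
```

Under these assumptions the raw images alone approach a hundred petabytes, before counting any processed data products.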

  19. Jake VanderPlas
    Challenges in the LSST Era
    - Scaling: many classic Astrostatistical
    methods are difficult to apply at scale
    - Noise: Fully utilizing LSST data means
    pushing methods to low signal-to-noise
    - Heterogeneous Data: each exposure
    isolates one wavelength range

  20. Jake VanderPlas

  21. Jake VanderPlas

  22. Jake VanderPlas
    [Figure: fraction recovered vs. brightness (brighter to fainter);
    series labeled 6 months, 1 year, 2 years, 5 years]

  23. Jake VanderPlas
    Important Features
    - Simple, ridge-regularized linear model:
    Fast & scalable!
    - Well-documented & tested open Python
    implementation (pip install gatspy)
    - Code on GitHub to reproduce the entire
    paper from scratch (key feature!)
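
To illustrate why a ridge-regularized linear model is fast, here is a minimal sketch of period finding with one. It is a simplified single-band, single-sinusoid stand-in written for this transcript, not gatspy's implementation or API: for each trial period the model is linear in its coefficients, so the fit is a small regularized least-squares solve. All function names, the penalty value, and the synthetic data are assumptions.

```python
import math
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_sinusoid_fit(t, y, period, lam=1e-3):
    """Fit y ~ a + b*sin(wt) + c*cos(wt) with ridge penalty lam.

    Solves (X^T X + lam*I) beta = X^T y and returns (beta, residual sum of squares).
    """
    w = 2.0 * math.pi / period
    X = [[1.0, math.sin(w * ti), math.cos(w * ti)] for ti in t]
    p = 3
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve_linear(XtX, Xty)
    rss = sum((yi - sum(bk * xk for bk, xk in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    return beta, rss


def best_period(t, y, trial_periods, lam=1e-3):
    """Return the trial period whose ridge fit leaves the smallest residual."""
    mean_y = sum(y) / len(y)
    centered = [yi - mean_y for yi in y]  # keep the penalty off the mean level
    return min(trial_periods,
               key=lambda p: ridge_sinusoid_fit(t, centered, p, lam)[1])


# Synthetic, irregularly sampled light curve (illustrative data only).
rng = random.Random(0)
true_period = 0.6  # days
t = sorted(rng.uniform(0.0, 30.0) for _ in range(60))
y = [15.0 + 0.5 * math.sin(2.0 * math.pi * ti / true_period) + rng.gauss(0.0, 0.05)
     for ti in t]

trial_periods = [0.4 + 0.005 * k for k in range(121)]  # 0.4 to 1.0 days
p_hat = best_period(t, y, trial_periods)
print(f"recovered period: {p_hat:.3f} days")
```

Each trial period costs one 3x3 solve, so the search scales linearly with the number of observations and trial periods; the ridge term keeps the solve well-conditioned when sampling is sparse.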

  24. Jake VanderPlas
    Nearly Every Field is entering
    a Data-Rich era
    - Astronomy: LSST
    - Physics: LHC
    - Neuroscience: EEG, fMRI
    - Sociology: Web Data
    - Biology: Sequencing
    - Economics: POS terminals
    - Oceanography: OOI

  25. Jake VanderPlas
    - How do we encourage and support
    good software practice across fields?
    - How do we keep good software
    engineers in academia?
    - How do we facilitate interdisciplinary
    collaboration?
    - How do we train academics to be
    good computational researchers?

  26. Jake VanderPlas
    Feb. 2014 Kickoff Event:
    137 posters from
    30 UW departments!

  27. Jake VanderPlas
    Original Core Faculty Team
    Areas: Data Science Methodology, Biological Sciences,
    Environmental Sciences, Social Sciences, Physical Sciences
    - Cecilia Aragon: Human Centered Design & Engr.
    - Magda Balazinska: CSE
    - Emily Fox: Statistics
    - Carlos Guestrin: CSE
    - Bill Howe: CSE
    - Jeff Heer: CSE
    - Ed Lazowska: CSE
    - David Beck: Chem. Engr.
    - Tom Daniel: Biology
    - Bill Noble: Genome Sciences
    - Josh Blumenstock: iSchool
    - Mark Ellis: Geography
    - Tyler McCormick: Sociology, CSSS
    - Ginger Armbrust: Oceanography
    - Randy LeVeque: Applied Math
    - Thom Richardson: Statistics, CSSS
    - Werner Stuetzle: Statistics
    - Andy Connolly: Astronomy
    - John Vidale: Earth & Space Sciences

  28. Jake VanderPlas

  29. Jake VanderPlas

  30. Jake VanderPlas

  31. Jake VanderPlas

  32. Jake VanderPlas
    UW eScience Highlights
    Promoting Interdisciplinary Careers
    - Data Science PhD program
    - Interdisciplinary Postdocs
    - Interdisciplinary research scientists
    - Interdisciplinary Faculty
    - Short and long-term “Senior Research
    Fellows”
    Most have one foot in eScience, one foot in
    their domain department.

  33. Jake VanderPlas
    Example: Mario Juric, Astronomy
    - Data Management Lead for LSST
    - Professor of Astronomy, UW
    - Sr. Fellow at UW eScience
    Working on scalable software infrastructure
    for the LSST project, especially regarding
    the formation, structure, and evolution of the
    Milky Way.
    Faculty position half-funded through
    eScience, half through Astronomy.

  34. Jake VanderPlas
    UW eScience Highlights
    Data Science Incubator Program
    - Students/postdocs/faculty paired with a
    resident data scientist for one quarter.
    - Researcher works in the Data Science
    Studio two days per week
    - Lightweight 1-page proposals, emphasis
    on quick & impactful results

  35. Jake VanderPlas
    Example – Fall 2014 Incubator:
    Unlocking Kenya’s Health Data
    Results
    - Generalizable method to process HIS-like data
    - Clean, integrate, and make available this data
    for research
    - Preliminary analysis of HIS data – known
    spikes in malaria visible in the data, finally
    “Much of the material remains unprocessed,
    or, if processed, unanalyzed, or, if analyzed,
    not read, or, if read, not used or acted upon”

  36. Jake VanderPlas
    UW eScience Highlights
    Interdisciplinary Education
    - Data Science “tracks” in several graduate
    departments (IGERT fellowship program)
    - Data Science certificates in several
    undergraduate departments (soon)
    - Formal and informal workshops on
    relevant topics

  37. Jake VanderPlas
    2014 cohort IGERT PhD student fellows
    - Cecilia Noecker: Genome Sc. & ML
    - Matt Murbach: ChemE & ML
    - Ryan Maas: CS & Astro
    - Alex Tank: Stats & Allen Inst. for Brain Science
    - Grace Telford: Astro & Stats
    - Will Gagne-Maynard: Oceanography & MSR

  38. Jake VanderPlas
    UW eScience Highlights
    Data Science Studio
    - Permanent desks for eScience fellows
    - Open collaboration space
    - Conference & Meeting rooms
    - Drop-in “office hours”
    - Seminars, sponsored lunches, summer
    programs, etc.

  39. Jake VanderPlas
    WRF Data Science Studio

  40. Jake VanderPlas
    WRF Data Science Studio
    full 6th floor

  41. Jake VanderPlas
    Example: Summer 2015
    Data Science for Social Good
    16 student interns matched with
    6 data scientists working on data-oriented
    social-good projects with 4 non-profit groups
    - Assessing Community Well-
    Being Through Open Data
    - King County Metro Paratransit
    - Open Sidewalk Graph for
    Accessible Trip Planning
    - Predictors of Permanent
    Housing for Homeless Families

  42. Jake VanderPlas

  43. Jake VanderPlas
    ~ Thank You! ~
    Email: [email protected]
    Twitter: @jakevdp
    Github: jakevdp
    Web: http://vanderplas.com/
    Blog: http://jakevdp.github.io/
