Slide 1

Slide 1 text

The accurate Data Management Plan (if such exists) in the presence of "Big Data" Matthew Garcia | Dept. of Forest & Wildlife Ecology | UW

Slide 2

Slide 2 text

The accurate Data Management Plan (if such exists) in the presence of "Big Data" Matthew Garcia Ph.D. Candidate Dept. of Forest & Wildlife Ecology University of Wisconsin – Madison

Slide 3

Slide 3 text

Defining “Big Data”
• 3V’s (Douglas Laney, ~2001)
  • Volume (pure scale, often TB through EB)
  • Velocity (streaming data volume vs. analytical speed)
  • Variety (forms: text, images, maps, social media entries)
• 4V’s (IBM, ~2012)
  • Veracity (authenticity, uncertainty, completeness)
• 6V’s (Microsoft, ~2013)
  • Variability (number of variables, complexity)
  • Visibility (visualization, comprehension, accessibility)
• 5, 7, 32, 9, 11V’s

Slide 4

Slide 4 text

What do we do with “Big Data”? from Big Data: Principles and Paradigms, Elsevier, 2016

Slide 5

Slide 5 text

Working with “Big Data” from Big Data: Principles and Paradigms, Elsevier, 2016

Slide 6

Slide 6 text

Doing science from phdcomics.com/comics/archive.php?comicid=1431

Slide 7

Slide 7 text

Open Science = Open Access + Open Source + Open Data by Peter Brewer, published 14 Sep 2017 at www.eos.org

Slide 8

Slide 8 text

Experience: my dissertation research
Pilot study: 1 area, ~5 years, <1 TB data
Dissertation Proposal: 5 areas, 30 years, ~5 TB expected
Conceptual Modeling
Code Development & Refinement
Distributed Computational Modeling: ~15 M hours on CHTC and OSG
Results: ~40 TB working storage → ~2 TB analytical products
Publication: 1 in 2016 + 3 more in progress
Code Availability: via GitHub
Data Availability: part via GitHub, much on GDrive, soon via USFS

Slide 9

Slide 9 text

Proposing science
• Academic proposals
  • Ph.D. proposal
  • Institutional (e.g. Hatch) grants
• Foundations, NGOs
• State agencies
• Federal funding sources
  • National Science Foundation (NSF)
  • National Institutes of Health (NIH)
  • NASA and numerous other agencies
→ Many require a Data Management Plan
→ Examples abound, but…

Slide 10

Slide 10 text

The “acceptable” Data Management Plan from www.lib.umn.edu/datamanagement/DMP

Slide 11

Slide 11 text

The “acceptable” Data Management Plan from grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm

Slide 12

Slide 12 text

What’s wrong with those plans?
• They’re often vague, even with past experience
  • What are the expected data volumes and archive needs?
  • How will the data be published? In raw or analyzed form?
  • What about metadata and conformance to community standards?
  • How (and for how long) will the data be served to the public?
• They’re often isolated from the proposal budget
  • Data management (incl. time) and archive capital costs
  • Data and associated publication costs
  • Data server and website costs
→ If the DMP required a budget line, many potential problems would be discovered at the time of proposal

Slide 13

Slide 13 text

The “ideal” DMP: the basics DMPTool.org DMP builder from scratch or templates Many examples from dmptool.org

Slide 14

Slide 14 text

The “ideal” DMP: more detail DMPTool.org DMP builder from scratch or templates Many examples DataONE’s “Example Data Management Plan (NSF General)” from dataone.org

Slide 15

Slide 15 text

The “ideal” DMP: factoring costs DMPTool.org DMP builder from scratch or templates Many examples DataONE’s “Example Data Management Plan (NSF General)” Stanford University’s “Including IT Costs in Research Grants” from web.stanford.edu/group/hpc/cgi-bin/rait/

Slide 16

Slide 16 text

Recent developments in storage
• Massive data storage and service capability
  • Federal agencies (2013 Federal Open Data Policy)
  • Individual institutions (free, up to a point)
  • Cloud storage via Google, Amazon S3, others ($$$)
• Publication requirements (depends on publisher)
  • Code: GitHub, BitBucket
  • Datasets
    • Small datasets (<50 MB) with code, or on FigShare
    • Medium datasets (unlimited but $ on Dryad, <50 GB on Zenodo)
    • Large datasets
      • University (research lab) website?
      • Institutional repository? (for DSpace@MIT, >200 GB is “extraordinary”)
      • Domain collection? (Registry of Research Data Repositories: re3data.org)
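The size tiers on this slide can be read as a simple lookup. The sketch below is illustrative only: the function name `suggest_outlets` and the constants are hypothetical, the thresholds are the ones quoted on the slide (<50 MB, <50 GB), and repository limits may have changed since the talk, so check each repository's current policy before relying on them.

```python
# Decimal units, as commonly used for storage quotas (an assumption).
MB = 10**6
GB = 10**9

def suggest_outlets(size_bytes: float) -> list[str]:
    """Map a dataset size to the publication outlets named on the slide."""
    if size_bytes < 50 * MB:
        # Small datasets: publish with the code, or on FigShare.
        return ["alongside the code", "FigShare"]
    if size_bytes < 50 * GB:
        # Medium datasets: Dryad (unlimited, but with a fee) or Zenodo.
        return ["Dryad (fee applies)", "Zenodo"]
    # Large datasets: the slide poses these as open questions.
    return [
        "university (research lab) website",
        "institutional repository",
        "domain collection (see re3data.org)",
    ]

small = suggest_outlets(10 * MB)
medium = suggest_outlets(5 * GB)
large = suggest_outlets(2000 * GB)  # e.g. ~2 TB of analytical products
```

A dataset on the order of the ~2 TB of analytical products mentioned earlier in the talk falls in the last tier, where none of the options is a sure thing.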

Slide 17

Slide 17 text

Open Science = Open Access + Open Source + Open Data by Peter Brewer, published 14 Sep 2017 at www.eos.org

Slide 18

Slide 18 text

Takeaway messages
• Write a detailed Data Management Plan
• Write a budget to support the DMP
  • Working data storage (archive, even if not published)
  • Data management (personnel, transfer time)
  • Archive (location, data volume, longevity, accessibility)
  • Publication (outlets, requirements, costs)
  • Web site (location, ownership, longevity)
→ Consider contingencies, and who pays for them
  • If resulting data volume is larger than expected
  • If federal/university/domain data service is not available
  • If/when student moves
  • If/when PI moves
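One way to make the budget and the volume contingency concrete is a small calculation like the one below. It is a minimal sketch, not a costing method from the talk: the class `DmpBudget`, its fields, and every dollar figure are hypothetical placeholders. Only the volumes come from the slides (~5 TB expected in the proposal versus ~40 TB of working storage in practice, an overrun factor of about 8).

```python
from dataclasses import dataclass

@dataclass
class DmpBudget:
    expected_tb: float            # data volume stated in the proposal
    years: int                    # how long storage and website must last
    storage_rate: float           # $ per TB per year (placeholder value)
    personnel_cost: float         # data management time (placeholder value)
    publication_cost: float       # data publication fees (placeholder value)
    website_cost_per_year: float  # hosting (placeholder value)

    def total(self, overrun_factor: float = 1.0) -> float:
        """Total cost if the data volume is overrun_factor times the estimate."""
        storage = self.expected_tb * overrun_factor * self.storage_rate * self.years
        website = self.website_cost_per_year * self.years
        return storage + self.personnel_cost + self.publication_cost + website

# All dollar amounts below are invented for illustration.
budget = DmpBudget(
    expected_tb=5.0, years=5, storage_rate=100.0,
    personnel_cost=10_000.0, publication_cost=2_000.0,
    website_cost_per_year=500.0,
)
baseline = budget.total()                    # 17000.0
worst_case = budget.total(overrun_factor=8)  # 34500.0 (40 TB instead of 5 TB)
contingency = worst_case - baseline          # 17500.0
```

With these placeholder rates, the overrun alone roughly doubles the total, which is the kind of gap the "who pays for them" question is meant to surface at proposal time.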