The accurate Data Management Plan (if such exists) in the presence of "Big Data"

Research Data Services
October 13, 2017

Presentation given as part of the RDS Holz Brown Bag series, October 2017.

Transcript

  1. The accurate Data Management Plan (if such exists) in the presence of "Big Data"

    Matthew Garcia | Dept. of Forest & Wildlife Ecology | UW-Madison

  2. The accurate Data Management Plan (if such exists) in the presence of "Big Data"

    Matthew Garcia, Ph.D. Candidate, Dept. of Forest & Wildlife Ecology, University of Wisconsin – Madison
  3. Defining “Big Data”

    • 3V’s (Douglas Laney, ~2001)
      • Volume (pure scale, often TB through EB)
      • Velocity (streaming data volume vs. analytical speed)
      • Variety (forms: text, images, maps, social media entries)
    • 4V’s (IBM, ~2012)
      • Veracity (authenticity, uncertainty, completeness)
    • 6V’s (Microsoft, ~2013)
      • Variability (number of variables, complexity)
      • Visibility (visualization, comprehension, accessibility)
    • 5, 7, 32, 9, 11V’s
  4. What do we do with “Big Data”?

    from Big Data: Principles and Paradigms, Elsevier, 2016
  5. Working with “Big Data”

    from Big Data: Principles and Paradigms, Elsevier, 2016
  6. Doing science from phdcomics.com/comics/archive.php?comicid=1431

  7. Open Science = Open Access + Open Source + Open Data

    by Peter Brewer, published 14 Sep 2017 at www.eos.org
  8. Experience: my dissertation research

    • Pilot study: 1 area, ~5 years, <1 TB data
    • Dissertation Proposal: 5 areas, 30 years, ~5 TB expected
    • Conceptual Modeling
    • Code Development & Refinement
    • Distributed Computational Modeling: ~15 M hours on CHTC and OSG
    • Results: ~40 TB working storage → ~2 TB analytical products
    • Publication: 1 in 2016 + 3 more in progress
    • Code Availability: via GitHub
    • Data Availability: part via GitHub, much on GDrive, soon via USFS
  9. Proposing science

    • Academic proposals
      • Ph.D. proposal
      • Institutional (e.g. Hatch) grants
    • Foundations, NGOs
    • State agencies
    • Federal funding sources
      • National Science Foundation (NSF)
      • National Institutes of Health (NIH)
      • NASA and numerous other agencies
    → Many require a Data Management Plan
    → Examples abound, but…
  10. The “acceptable” Data Management Plan from www.lib.umn.edu/datamanagement/DMP

  11. The “acceptable” Data Management Plan from grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm

  12. What’s wrong with those plans?

    • They’re often vague, even with past experience
      • What are the expected data volumes and archive needs?
      • How will the data be published? In raw or analyzed form?
      • What about metadata and conformance to community standards?
      • How (and for how long) will the data be served to the public?
    • They’re often isolated from the proposal budget
      • Data management (incl. time) and archive capital costs
      • Data and associated publication costs
      • Data server and website costs
    → If the DMP required a budget line, many potential problems would be discovered at the time of proposal
  13. The “ideal” DMP: the basics

    • DMPTool.org DMP builder, from scratch or templates
    • Many examples

    from dmptool.org
  14. The “ideal” DMP: more detail

    • DMPTool.org DMP builder, from scratch or templates
    • Many examples
    • DataONE’s “Example Data Management Plan (NSF General)”

    from dataone.org
  15. The “ideal” DMP: factoring costs

    • DMPTool.org DMP builder, from scratch or templates
    • Many examples
    • DataONE’s “Example Data Management Plan (NSF General)”
    • Stanford University’s “Including IT Costs in Research Grants”

    from web.stanford.edu/group/hpc/cgi-bin/rait/
  16. Recent developments in storage

    • Massive data storage and service capability
      • Federal agencies (2013 Federal Open Data Policy)
      • Individual institutions (free, up to a point)
      • Cloud storage via Google, Amazon S3, others ($$$)
    • Publication requirements (depends on publisher)
      • Code: GitHub, BitBucket
      • Datasets
        • Small datasets (<50 MB): with code, or on FigShare
        • Medium datasets: unlimited but $ on Dryad, <50 GB on Zenodo
        • Large datasets
          • University (research lab) website?
          • Institutional repository? (for DSpace@MIT, >200 GB is “extraordinary”)
          • Domain collection? (Registry of Research Data Repositories: re3data.org)
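The size tiers on this slide can be read as a simple lookup. The following is a minimal sketch, not part of the talk: the function name `suggest_outlets` is hypothetical, the thresholds are the ones quoted on the slide as of October 2017, and repository limits may well have changed since then.

```python
# Illustrative only: thresholds are taken from the slide (2017) and
# should be checked against each repository's current policy.
MB = 10**6
GB = 10**9


def suggest_outlets(size_bytes):
    """Return candidate publication outlets for a dataset of the given size."""
    if size_bytes < 50 * MB:
        # Small datasets: publish alongside the code, or on FigShare.
        return ["code repository (GitHub, BitBucket)", "FigShare"]
    if size_bytes < 50 * GB:
        # Medium datasets: Dryad (fee-based) or Zenodo.
        return ["Dryad (fee-based)", "Zenodo"]
    # Large datasets: the slide lists these as open questions, not answers.
    return [
        "university or research lab website",
        "institutional repository",
        "domain repository (search re3data.org)",
    ]


if __name__ == "__main__":
    for label, size in [("10 MB", 10 * MB), ("5 GB", 5 * GB), ("2 TB", 2000 * GB)]:
        print(label, "->", suggest_outlets(size))
```

For the ~2 TB of analytical products mentioned in slide 8, this sketch falls through to the "large" tier, where every option on the slide is followed by a question mark.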
  17. Open Science = Open Access + Open Source + Open Data

    by Peter Brewer, published 14 Sep 2017 at www.eos.org
  18. Takeaway messages

    • Write a detailed Data Management Plan
    • Write a budget to support the DMP
      • Working data storage (archive, even if not published)
      • Data management (personnel, transfer time)
      • Archive (location, data volume, longevity, accessibility)
      • Publication (outlets, requirements, costs)
      • Web site (location, ownership, longevity)
    → Consider contingencies, and who pays for them
      • If resulting data volume is larger than expected
      • If federal/university/domain data service is not available
      • If/when student moves
      • If/when PI moves
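One way to make the "budget line" and "contingencies" points concrete is a small calculation. This is a rough sketch under stated assumptions, not a method from the talk: the class and function names are hypothetical, and the storage rate, personnel hours, and hourly rate are placeholder values rather than real prices. Only the data volumes (about 5 TB proposed, about 40 TB of working storage in practice, about 2 TB of analytical products) come from slide 8. Publication and web site costs are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class DmpPlan:
    working_tb: float        # working storage expected at proposal time
    archive_tb: float        # long-term archive volume
    years: int               # how long storage must be paid for
    cost_per_tb_year: float  # ASSUMED placeholder rate, not a real price
    personnel_hours: float   # ASSUMED data management effort
    hourly_rate: float       # ASSUMED placeholder rate


def dmp_budget(plan, overrun_factor=1.0):
    """Estimate storage and personnel costs.

    overrun_factor scales working storage only, modeling the contingency
    "resulting data volume is larger than expected".
    """
    stored_tb = plan.working_tb * overrun_factor + plan.archive_tb
    storage = stored_tb * plan.cost_per_tb_year * plan.years
    personnel = plan.personnel_hours * plan.hourly_rate
    return {"storage": storage, "personnel": personnel, "total": storage + personnel}


if __name__ == "__main__":
    plan = DmpPlan(working_tb=5.0, archive_tb=2.0, years=5,
                   cost_per_tb_year=100.0, personnel_hours=200.0, hourly_rate=30.0)
    print("as proposed:", dmp_budget(plan))
    # Slide 8: ~5 TB expected became ~40 TB of working storage, i.e. 8x.
    print("with 8x overrun:", dmp_budget(plan, overrun_factor=8.0))
```

Even with made-up rates, running the two cases side by side shows how far a volume overrun moves the storage line, which is the kind of problem a DMP budget would surface at proposal time.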