<1 TB data
Dissertation Proposal: 5 areas, 30 years, ~5 TB expected
Conceptual Modeling
Code Development & Refinement
Distributed Computational Modeling: ~15 M hours on CHTC and OSG
Results: ~40 TB working storage → ~2 TB analytical products
Publication: 1 in 2016 + 3 more in progress
Code Availability: via GitHub
Data Availability: part via GitHub, much on GDrive, soon via USFS
(e.g. Hatch) grants
• Foundations, NGOs
• State agencies
• Federal funding sources
  • National Science Foundation (NSF)
  • National Institutes of Health (NIH)
  • NASA and numerous other agencies
→ Many require a Data Management Plan
→ Examples abound, but…
with past experience
  • What are the expected data volumes and archive needs?
  • How will the data be published? In raw or analyzed form?
  • What about metadata and conformance to community standards?
  • How (and for how long) will the data be served to the public?
• They’re often isolated from the proposal budget
  • Data management (incl. time) and archive capital costs
  • Data and associated publication costs
  • Data server and website costs
→ If the DMP required a budget line, many potential problems would be discovered at the time of proposal
or templates
Many examples
• DataONE’s “Example Data Management Plan (NSF General)”
• Stanford University’s “Including IT Costs in Research Grants” from web.stanford.edu/group/hpc/cgi-bin/rait/
capability
• Federal agencies (2013 Federal Open Data Policy)
• Individual institutions (free, up to a point)
• Cloud storage via Google, Amazon S3, others ($$$)
• Publication requirements (depends on publisher)
• Code: GitHub, BitBucket
• Datasets
  • Small datasets (<50 MB) with code, or on FigShare
  • Medium datasets (unlimited but $ on Dryad, <50 GB on Zenodo)
  • Large datasets
    • University (research lab) website?
    • Institutional repository? (for DSpace@MIT, >200 GB is “extraordinary”)
    • Domain collection? (Registry of Research Data Repositories: re3data.org)
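The size thresholds in the list above can be turned into a quick lookup when planning where a dataset might go. The sketch below is illustrative only: the function name `suggest_outlets`, the use of decimal units, and the choice to treat anything of 50 GB or more as "large" are assumptions, not part of the original slide, and repository limits change over time.

```python
# Assumption: decimal units (1 MB = 10**6 bytes); the slide does not specify.
MB = 10**6
GB = 10**9


def suggest_outlets(size_bytes):
    """Return (size class, candidate outlets, note) for a dataset size.

    Thresholds follow the slide: <50 MB small, <50 GB medium.
    Treating everything else as "large" is an assumption.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 50 * MB:
        return "small", ["with the code (GitHub, BitBucket)", "FigShare"], ""
    if size_bytes < 50 * GB:
        return "medium", ["Zenodo", "Dryad (fee-based)"], ""
    note = ""
    if size_bytes > 200 * GB:
        # The slide quotes DSpace@MIT as calling >200 GB "extraordinary".
        note = 'over 200 GB: "extraordinary" for DSpace@MIT'
    outlets = [
        "university (research lab) website",
        "institutional repository",
        "domain collection (see re3data.org)",
    ]
    return "large", outlets, note


if __name__ == "__main__":
    for size in (10 * MB, 5 * GB, 2000 * GB):
        print(size, suggest_outlets(size))
```

A lookup like this is only a starting point; the question marks on the large-dataset options in the slide still apply.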
Write a budget to support the DMP
• Working data storage (archive, even if not published)
• Data management (personnel, transfer time)
• Archive (location, data volume, longevity, accessibility)
• Publication (outlets, requirements, costs)
• Web site (location, ownership, longevity)
→ Consider contingencies, and who pays for them
• If resulting data volume is larger than expected
• If federal/university/domain data service is not available
• If/when student moves
• If/when PI moves
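The budget lines and the "larger than expected" contingency above can be sketched as a small calculation. Everything here is hypothetical: the class name `DmpBudget`, its fields, and every rate and fee are placeholders to be replaced with real quotes. Only the 40 TB working and 2 TB product volumes echo figures mentioned earlier in these slides.

```python
from dataclasses import dataclass


@dataclass
class DmpBudget:
    """Hypothetical DMP budget; all rates are placeholders, not real prices."""
    working_storage_tb: float
    project_years: float
    archive_tb: float
    archive_years: float
    storage_rate_per_tb_year: float  # placeholder cost per TB per year
    personnel_hours: float
    hourly_rate: float
    publication_fees: float
    website_cost_per_year: float
    website_years: float

    def line_items(self, volume_factor=1.0):
        """Return one cost per budget line; volume_factor scales data volumes."""
        rate = self.storage_rate_per_tb_year
        return {
            "working data storage":
                self.working_storage_tb * volume_factor * self.project_years * rate,
            "data management": self.personnel_hours * self.hourly_rate,
            "archive":
                self.archive_tb * volume_factor * self.archive_years * rate,
            "publication": self.publication_fees,
            "web site": self.website_cost_per_year * self.website_years,
        }

    def total(self, volume_factor=1.0):
        return sum(self.line_items(volume_factor).values())


budget = DmpBudget(
    working_storage_tb=40, project_years=5,
    archive_tb=2, archive_years=10,
    storage_rate_per_tb_year=100.0,   # placeholder
    personnel_hours=200, hourly_rate=30.0,  # placeholders
    publication_fees=3000.0,          # placeholder
    website_cost_per_year=120.0, website_years=10,  # placeholders
)

if __name__ == "__main__":
    print("planned total:", budget.total())
    # Contingency: data volume turns out twice as large as expected.
    print("total at 2x volume:", budget.total(volume_factor=2.0))
```

Comparing the planned total with the scaled total shows how much of the budget depends on the volume estimate. The other contingencies (service unavailable, student or PI moves) need their own line items and a named payer.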