Opening up Research and Data Day1 - FORCE11 Scholarly Communication Institute (FSCI)

Opening up Research and Data
FORCE11 FSCI | University of California, San Diego
Slides Link: http://tiny.cc/fsci-mt6-1
Gaurav Godhwani | Handle: @gggodhwani
Technical Lead - Open Budgets India - CBGA | Chapter Lead - DataKind Bangalore

A BIT ABOUT ME

Image Source: http://www.govtech.com/budget-finance/6-9-Billion-to-be-Spent-on-Civic-Tech-in-2015-Report-Says.html

Think about the top 3 things you want to
discuss and/or get out of this course

Course Outline
- Introduction & Setting the scene
- Session 1: Planning Open Data Pipelines
- Session 2: Developing Key Components of Open Data Pipelines
- Session 3: Scaling Up
- Session 4: Learning, Sharing and Iterating

“A piece of knowledge, unlike a piece of physical property,
can be shared by large groups of people without making anybody poorer.” ― Aaron Swartz, The Boy Who Could Change the World: The Writings of Aaron Swartz Image Source: Aaron Swartz 3 at Boston Wikipedia Meetup | CC-BY-SA 3.0 Sage Ross https://commons.wikimedia.org/wiki/File:Aaron_Swartz_3_at_Boston_Wikipedia_Meetup,_2009-08-18.jpg

DEFINITIONS

What is Open? “Open means anyone can freely access, use,
modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).” - OpenDefinition.org

What is Open Research? Video Source: CC0 https://en.wikipedia.org/wiki/File:Open_research.ogv

What is Open Research? “Open research is concerned with making
scientific research more transparent, more collaborative and more efficient. A central aspect to it is to provide open access to scientific information, especially to the research published in scholarly journals and to the underlying data, much of which traditional science tends to hide away. Other aspects are more open forms of collaboration and engagement with a wider audience, including citizen scientists and the public at large.” - Wikipedia, https://en.wikipedia.org/wiki/Open_research

Why Open Research? Image Source: Journal of Open Research Software
benefits for authors | CC-BY 3.0 Copyright Ubiquity Press.

What is Open Data? “Open data is data that can
be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike.” - OpenDefinition.org

Implementation of open data policies would thus boost cumulative G20
GDP by around 1.1 percentage points (almost 55% of the G20’s 2% growth target) over five years. Source: The G20 and Open Data: Open for Business https://www.omidyar.com/sites/default/files/file.../ON%20Report_061114_FNL.pdf Why Open Data?

Combining all G20 economies, output could increase by USD 13
trillion cumulatively over the next five years. Source: The G20 and Open Data: Open for Business https://www.omidyar.com/sites/default/files/file.../ON%20Report_061114_FNL.pdf Why Open Data?

Open Research Data “By open data in science we mean
that it is freely available on the public Internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.” - https://pantonprinciples.org/

Open Research Data Image Source: OpenAIRE CC-BY, https://www.openaire.eu/images/FAIR_data.JPG

Session 1: Planning Open Data Pipelines

Session Outline
- The FAIR Data Guiding Principles
- Key Data Types in Research
- Open Data life cycle and its Key Elements
- Data Search Techniques
- Open Data Management Plans and Policies
- Exploring similar projects in your field [Brief Discussion]

Image Source: OpenAIRE CC-BY, https://www.openaire.eu/images/FAIR_data.JPG The FAIR Data Guiding Principles

The FAIR Data Guiding Principles To be Findable: F1. (meta)data
are assigned a globally Unique and Persistent Identifier F2. data are described with rich metadata (defined by R1 below) F3. metadata clearly and explicitly include the identifier of the data it describes F4. (meta)data are registered or indexed in a searchable resource Source: The FAIR Guiding Principles for scientific data management and stewardship CC-BY 4.0 https://www.nature.com/articles/sdata201618

The FAIR Data Guiding Principles To be Accessible: A1. (meta)data
are retrievable by their identifier using a standardized communications protocol A1.1 the protocol is open, free, and universally implementable A1.2 the protocol allows for an authentication and authorization procedure, where necessary A2. metadata are accessible, even when the data are no longer available Source: The FAIR Guiding Principles for scientific data management and stewardship CC-BY 4.0 https://www.nature.com/articles/sdata201618

The FAIR Data Guiding Principles To be Interoperable: I1. (meta)data
use a formal, accessible, shared, and broadly applicable language for knowledge representation. I2. (meta)data use vocabularies that follow FAIR principles I3. (meta)data include qualified references to other (meta)data Source: The FAIR Guiding Principles for scientific data management and stewardship CC-BY 4.0 https://www.nature.com/articles/sdata201618

The FAIR Data Guiding Principles To be Reusable: R1. (meta)data
are richly described with a plurality of accurate and relevant attributes R1.1. (meta)data are released with a clear and accessible data usage license R1.2. (meta)data are associated with detailed provenance R1.3. (meta)data meet domain-relevant community standards Source: The FAIR Guiding Principles for scientific data management and stewardship CC-BY 4.0 https://www.nature.com/articles/sdata201618
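
As a rough illustration of how the FAIR principles above might be turned into a checklist, the sketch below tests a metadata record for a handful of fields. The field names and their mapping to individual principles are assumptions made for this example, not an official FAIR schema, and the DOI and URL are placeholders.

```python
# Hypothetical field names, loosely mapped to FAIR principles for illustration;
# this is not an official FAIR schema.
REQUIRED_FIELDS = {
    "identifier": "F1: globally unique and persistent identifier",
    "title": "F2: rich descriptive metadata",
    "access_url": "A1: retrievable using a standardized protocol",
    "format": "I1: shared, broadly applicable representation",
    "license": "R1.1: clear and accessible data usage license",
    "provenance": "R1.2: detailed provenance",
}


def missing_fair_fields(record):
    """Return the required fields that are absent or blank in a metadata record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Placeholder record: the DOI and URL below are made up for the example.
example_record = {
    "identifier": "doi:10.0000/example",
    "title": "Example survey dataset",
    "access_url": "https://example.org/data.csv",
    "format": "text/csv",
    "license": "CC-BY-4.0",
}

if __name__ == "__main__":
    for name in missing_fair_fields(example_record):
        print(f"Missing {name} ({REQUIRED_FIELDS[name]})")
```

A check like this only confirms that fields are present; whether the metadata are actually rich, accurate, and community-standard still needs human review.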

Key Data Types in Research - Observational - Experimental -
Simulation - Derived or compiled - Reference or canonical Source: University of Virginia, Research Data Services + Sciences - Data Types & File Formats http://data.library.virginia.edu/data-management/plan/format-types/

Observational Data - Captured in real-time - Cannot be reproduced
or recaptured. Sometimes called 'unique data' - Examples include sensor data, human observation, and survey results Source: University of Virginia, Research Data Services + Sciences - Data Types & File Formats http://data.library.virginia.edu/data-management/plan/format-types/ Key Data Types in Research

Experimental Data - Data from lab equipment and under controlled
conditions - Usually reproducible, but expensive to do so - Examples include gene sequences, chromatograms, spectroscopy Source: University of Virginia, Research Data Services + Sciences - Data Types & File Formats http://data.library.virginia.edu/data-management/plan/format-types/ Key Data Types in Research

Simulation Data - Data generated from test models studying actual
or theoretical systems - Models and metadata where the input may be of greater importance than the output - Examples include climate models, economic models, systems engineering. Source: University of Virginia, Research Data Services + Sciences - Data Types & File Formats http://data.library.virginia.edu/data-management/plan/format-types/ Key Data Types in Research

Derived or Compiled Data - The results of data analysis,
or aggregated from multiple sources - Reproducible, but very expensive - Examples include text and data mining, compiled databases, 3D models. Source: University of Virginia, Research Data Services + Sciences - Data Types & File Formats http://data.library.virginia.edu/data-management/plan/format-types/ Key Data Types in Research

Reference or Canonical Data - Fixed or organic collection datasets,
usually peer-reviewed, and often published and curated. - Examples include gene sequence databanks, census data, chemical structures. Source: University of Virginia, Research Data Services + Sciences - Data Types & File Formats http://data.library.virginia.edu/data-management/plan/format-types/ Key Data Types in Research

Research Data Lifecycle Image Source: UCSC CC-BY 3.0 http://guides.library.ucsc.edu/datamanagement/

Data Search Techniques Identify the Unit of Analysis 1) Who
or What? ◦ Social Unit: This is the population that you want to study 2) When? ◦ Time: This is the period of time you want to study 3) Where? ◦ Space: Geography or place Source: Michigan State University, How to find Data & Statistics http://libguides.lib.msu.edu/c.php?g=96631&p=626754

Search Strategy #1: Search in a Data Archive • Data
Repositories (Open Access Directory) ◦ Open data repositories from multiple academic disciplines. ◦ URL: http://oad.simmons.edu/oadwiki/Data_repositories • re3data.org: Registry of Research Data Repositories ◦ re3data.org is a global registry of research data repositories that covers research data repositories from different academic disciplines. Data Search Techniques Source: Michigan State University, How to find Data & Statistics http://libguides.lib.msu.edu/c.php?g=96631&p=626754

Search Strategy #1: Search in a Data Archive • Figshare
◦ figshare.com is a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner • Open Science Framework ◦ A scholarly commons to connect and publish the entire research cycle ◦ URL: https://osf.io/ Data Search Techniques Source: CC-BY 3.0 How to be a Modern Scientist http://leanpub.com/modernscientist

Search Strategy #1: Search in a Data Archive • Harvard
Dataverse ◦ Find and cite data across all research fields ◦ URL: https://dataverse.harvard.edu/ • Find data from a variety of field-specific data repositories Data Search Techniques Source: CC-BY 3.0 How to be a Modern Scientist http://leanpub.com/modernscientist

Search Strategy #1: Search in a Data Archive • When
working with a new data repository, ensure that it: ◦ has been around for a while ◦ has a track record of managing data ◦ makes its pricing structure clear if it charges Data Search Techniques Source: CC-BY 3.0 How to be a Modern Scientist http://leanpub.com/modernscientist

Search Strategy #2: Identify Potential Producers • Who might collect
and publish this type of data? ◦ Government Agencies ◦ Non-Government Organizations ◦ Academic Institutions ◦ Private Sector Data Search Techniques Source: Michigan State University, How to find Data & Statistics http://libguides.lib.msu.edu/c.php?g=96631&p=626754

Search Strategy #3: Turn to the literature • Data Archive
Bibliographies • Library Indexes • Library Catalog Data Search Techniques Source: Michigan State University, How to find Data & Statistics http://libguides.lib.msu.edu/c.php?g=96631&p=626754

Elements of an Open Data Management Plan (S. No. Element: Description)
1. Data description: A description of the information to be gathered; the nature and scale of the data that will be generated or collected.
2. Existing data: A survey of existing data relevant to the project and a discussion of whether and how these data will be integrated.
3. Format: Formats in which the data will be generated, maintained, and made available, including a justification for the procedural and archival appropriateness of those formats.
4. Metadata: A description of the metadata to be provided along with the generated data, and a discussion of the metadata standards used.
5. Storage and backup: Storage methods and backup procedures for the data, including the physical and cyber resources and facilities that will be used for the effective preservation and storage of the research data.
6. Security: A description of technical and procedural protections for information, including confidential information, and how permissions, restrictions, and embargoes will be enforced.
7. Responsibility: Names of the individuals responsible for data management in the research project.
8. License: Select appropriate Open Data License(s) for various data elements of the project.
9. Access and sharing: A description of how data will be shared, including access procedures, embargo periods, technical mechanisms for dissemination and whether access will be open or granted only to specific user groups. A timeframe for data sharing and publishing should also be provided.
10. Audience: The potential secondary users of the data.
11. Selection and retention periods: A description of how data will be selected for archiving, how long the data will be held, and plans for eventual transition or termination of the data collection in the future.
12. Archiving and preservation: The procedures in place or envisioned for long-term archiving and preservation of the data, including succession plans for the data should the expected archiving entity go out of existence.
13. Ethics and privacy: A discussion of how informed consent will be handled and how privacy will be protected, including any exceptional arrangements that might be needed to protect participant confidentiality, and other ethical issues that may arise.
14. Budget: The costs of preparing data and documentation for archiving and how these costs will be paid. Requests for funding may be included.
15. Data organization: How the data will be managed during the project, with information about version control, naming conventions, etc.
16. Quality Assurance: Procedures for ensuring data quality during the project.
17. Legal requirements: A listing of all relevant federal or funder requirements for data management and data sharing.
Source: Inter-university Consortium for Political and Social Research (ICPSR), Institute for Social Research University of Michigan http://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/elements.html
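
One possible way to track progress on a plan is to treat the 17 elements as a checklist. The sketch below assumes a draft plan is stored as a plain dictionary keyed by element name; that storage format and the function name are illustrative choices, not part of the ICPSR guidance.

```python
# The 17 element names from the list above.
DMP_ELEMENTS = [
    "Data description",
    "Existing data",
    "Format",
    "Metadata",
    "Storage and backup",
    "Security",
    "Responsibility",
    "License",
    "Access and sharing",
    "Audience",
    "Selection and retention periods",
    "Archiving and preservation",
    "Ethics and privacy",
    "Budget",
    "Data organization",
    "Quality Assurance",
    "Legal requirements",
]


def dmp_progress(plan):
    """Split the elements into (written, still to write) for a draft plan.

    `plan` is assumed to map element names to free-text sections;
    blank sections count as not yet written.
    """
    done = [name for name in DMP_ELEMENTS if plan.get(name, "").strip()]
    todo = [name for name in DMP_ELEMENTS if name not in done]
    return done, todo


if __name__ == "__main__":
    draft = {"Format": "CSV files with a README", "License": ""}
    done, todo = dmp_progress(draft)
    print(f"{len(done)} of {len(DMP_ELEMENTS)} elements drafted")
    for name in todo:
        print(f"TODO: {name}")
```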

Exploring interesting projects in your field

Session 2: Developing key components of the data pipeline

Session Outline
- Documentation
- Formatting your Data
- Raw Data
- Tidy Data
- Code Book
- Instruction list/Script
- Open Discussions

Documentation Source: https://www.explainxkcd.com/wiki/index.php/1459:_Documents

Documentation Source: https://xkcd.com/1481/

Documentation Source: CC-BY 3.0 How to be a Modern Scientist
http://leanpub.com/modernscientist To maximize both the value of your data and your impact • Post both raw and tidy versions of your data • Post relevant metadata about experiments you performed in a README • README should ideally be a TXT or a Markup file • Post related code that can be used to analyze the data as you did in your paper.

Formatting your Data Source: CC-BY 3.0 How to be a
Modern Scientist http://leanpub.com/modernscientist To maximize both the value of your data and your impact 1. The raw data 2. A tidy dataset 3. A code book describing each variable and its values in the tidy data set. 4. An explicit and exact recipe you used to go from 1 -> 2,3

Raw Data Source: CC-BY 3.0 How to be a Modern
Scientist http://leanpub.com/modernscientist Image Source: CC 3.0 BY-SA https://en.wikipedia.org/wiki/Binary_file#/media/File:Wikipedia_favicon_hexdump.svg Examples of Raw Data • The strange binary file your measurement machine spits out

Raw Data Source: CC-BY 3.0 How to be a Modern Scientist http://leanpub.com/modernscientist Image Source: CC 4.0 SA https://en.wikipedia.org/wiki/File:Excel-application_Wiki_client_v03.png Examples of Raw Data • The unformatted Excel file with 10 worksheets the org you contracted with sent you

Raw Data Source: CC-BY 3.0 How to be a Modern Scientist http://leanpub.com/modernscientist Image Source: https://github.com/ServiceStackV3/mythz_blog/blob/master/pages/811.md Examples of Raw Data • The complicated JSON data you got from scraping the Twitter API

Raw Data Source: CC-BY 3.0 How to be a Modern Scientist http://leanpub.com/modernscientist Image Source: https://www.flickr.com/photos/nasacommons/9467782468 Examples of Raw Data • The hand-entered numbers you collected looking through a microscope

Raw Data Source: CC-BY 3.0 How to be a Modern Scientist http://leanpub.com/modernscientist You know the raw data is in the right format if you: 1. Ran no software on the data 2. Did not manipulate any of the numbers in the data 3. Did not remove any data from the data set 4. Did not summarize the data in any way
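
One way to back up the claim that raw data has not been touched is to record a checksum when the file is first received and compare it later. This is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and a checksum only shows that the bytes are unchanged since it was recorded, not that the file was raw to begin with.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

Publishing the recorded checksum alongside the raw file lets others confirm they downloaded the same bytes.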

Tidy Data Source: CC-BY 3.0 How to be a Modern
Scientist http://leanpub.com/modernscientist Principles of Tidy Data: 1. Each variable you measure should be in one column 2. Each different observation of that variable should be in a different row 3. There should be one table for each “kind” of variable 4. If you have multiple tables, they should include a column in the table that allows them to be linked 5. Share the data in a CSV or TAB-delimited text file
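
To make the tidy data principles concrete, the sketch below reshapes a small "wide" table, where each month is its own column, into one row per observation and writes it back out as CSV. The sample values, column names, and function names are made up for illustration; only Python's standard csv module is used.

```python
import csv
import io

# Made-up wide table: one column per month, so the "month" variable
# is hidden in the header row instead of having its own column.
WIDE_CSV = "site,2017-01,2017-02\nA,20.5,21.0\nB,18.2,19.4\n"


def wide_to_tidy(wide_text, id_column, variable_name, value_name):
    """Turn every non-id column into its own row: (id, variable, value)."""
    rows = []
    for record in csv.DictReader(io.StringIO(wide_text)):
        for column, value in record.items():
            if column == id_column:
                continue
            rows.append({
                id_column: record[id_column],
                variable_name: column,
                value_name: value,
            })
    return rows


def to_csv(rows, fieldnames):
    """Serialize tidy rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    tidy = wide_to_tidy(WIDE_CSV, "site", "month", "temperature_c")
    print(to_csv(tidy, ["site", "month", "temperature_c"]))
```

After reshaping, each variable (site, month, temperature) has one column and each measurement has one row, which matches principles 1, 2, and 5 above.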

Code Book Source: CC-BY 3.0 How to be a Modern
Scientist http://leanpub.com/modernscientist A code book describes: 1. Study Design - Thorough description of how you collected the data 2. Information about the variables (including units!) in the data set not contained in the tidy data 3. Information about the summary choices you made 4. Information about the experimental study design you used 5. Should ideally be a TXT or a Markup file
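
A code book can be written by hand, but keeping the variable descriptions in a small structure and rendering them to a text file helps keep them in sync with the tidy data. The sketch below is one possible layout; the variables, section headings, and function name are hypothetical.

```python
# Hypothetical variables for a tidy temperature data set.
VARIABLES = [
    {"name": "site", "type": "string", "units": "",
     "description": "Code of the measurement site"},
    {"name": "month", "type": "string", "units": "",
     "description": "Month of measurement, formatted YYYY-MM"},
    {"name": "temperature_c", "type": "float", "units": "degrees Celsius",
     "description": "Monthly mean air temperature"},
]


def render_code_book(study_design, variables, summary_choices):
    """Render a plain-text code book with the sections listed above."""
    lines = ["CODE BOOK", "", "Study design", study_design, "", "Variables"]
    for variable in variables:
        units = variable["units"] or "none"
        lines.append(
            f"- {variable['name']} ({variable['type']}, units: {units}): "
            f"{variable['description']}"
        )
    lines += ["", "Summary choices", summary_choices]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_code_book(
        "Describe here how the data were collected.",
        VARIABLES,
        "Describe here any averaging or aggregation applied.",
    ))
```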

This includes: • Computer Scripts (in R, Python, or something
else) that take the raw data as input and produce the tidy data you are sharing as output. • Installation Notes (a TXT or a Markup file) • Contribution Guidelines (a TXT or a Markup file) • Pseudo Code explaining your process to non-programmers (a TXT or a Markup file) Instruction list/Script Source: CC-BY 3.0 How to be a Modern Scientist http://leanpub.com/modernscientist

Pseudo Code Source: CC-BY 3.0 How to be a Modern
Scientist http://leanpub.com/modernscientist Example: • Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2,c=3 • Step 2 - run the software separately for each sample • Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set
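
Step 3 of the pseudo code above could be scripted roughly as follows. The sketch assumes each sample's output file is whitespace-delimited text and that its contents have already been read into a dictionary keyed by sample name; both are assumptions for illustration, and steps 1 and 2 (running the summarize software) are not shown.

```python
def third_column(output_text):
    """Return the third whitespace-separated field of every line that has one."""
    values = []
    for line in output_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            values.append(fields[2])
    return values


def build_rows(outputs_by_sample):
    """Build one output row per sample: the sample name, then its column-three values."""
    return [
        [sample] + third_column(text)
        for sample, text in sorted(outputs_by_sample.items())
    ]


if __name__ == "__main__":
    # Made-up file contents standing in for each sample's outputfile.out.
    outputs = {
        "sample2": "gene1 x 3.5\ngene2 x 4.1\n",
        "sample1": "gene1 x 1.2\ngene2 x 2.8\n",
    }
    for row in build_rows(outputs):
        print(row)
```

Sorting by sample name keeps the row order stable between runs, which makes the output easier to compare across versions.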

Other Key Topics - Pick a few - Open Data Acquisition
Methodologies - Metadata standards - Open Data Ontology - Open Data Storage Structures & Schemas - Data Analysis and Outcomes - Open Source Codebase - Open Data Visualization

Other Key Topics - Pick a few - Open Data ethics,
privacy and security - Open Data Quality Checks - Publishing Platforms - Open Data Licences - Open Issues and Bug Tracking - Indexing, Searching and Reusing Open Data - Changelog and Version Controlling

Open Discussions
