Slide 1

Slide 1 text

Data Citation: From Principles to Implementation

Slide 2

Slide 2 text

Sarah Callaghan, Centre for Environmental Data Analysis, http://orcid.org/0000-0002-0517-1031
Martin Fenner, DataCite Technical Director, http://orcid.org/0000-0003-1419-2405

Slide 3

Slide 3 text

Joint Declaration of Data Citation Principles https://www.force11.org/datacitation

Slide 4

Slide 4 text

1. Importance Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

Slide 5

Slide 5 text

1. Importance - Examples

Slide 6

Slide 6 text

1. Importance - Examples http://publications.agu.org/author-resource-center/text-requirements/#refs

Slide 7

Slide 7 text

1. Importance - Issues!
● Reference limits:
  ○ From Nature’s Manuscript formatting guide (http://www.nature.com/nature/authors/gta/#a5.4): “The maximum number of references, strictly enforced, is 50 for Articles and 30 for Letters. Only one publication can be listed for each number.”
● Information about data citation not getting into the generic author guidelines for important publishers
● Still need the culture change that data is as important as articles

Slide 8

Slide 8 text

2. Credit and Attribution Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.

Slide 9

Slide 9 text

Fun Fact Credit comes before evidence in Joint Declaration of Data Citation Principles.

Slide 10

Slide 10 text

Does CC0 require others who use my work to give me attribution? No, and that's a big difference between CC0 and our licenses. Unlike our licenses, there are no conditions contained in CC0. Just like anything in the public domain, it will be possible for others to use or adapt it however they wish without attribution. However, this does not mean that you cannot request attribution in accordance with community or professional norms and standards. https://wiki.creativecommons.org/wiki/CC0_FAQ

Slide 11

Slide 11 text

http://www.phdcomics.com/comics/archive.php?comicid=562

Slide 12

Slide 12 text

http://search.labs.datacite.org/orcid

Slide 13

Slide 13 text

http://casrai.org/CRediT http://www.gigasciencejournal.com/content/3/1/18/about#open-badges Project CRediT

Slide 14

Slide 14 text

3. Evidence In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited. http://xkcd.com/882/

Slide 15

Slide 15 text

3. Evidence - “Data behind the Graph”

Slide 16

Slide 16 text

3. Evidence - Issues
● Granularity
  ○ Is it appropriate to assign a citation to just the subset of a larger dataset that underlies a particular graph?
    ■ Results in lots of citations to lots of ever-so-slightly-different things
  ○ “Cite the book, not the paragraph” == “Cite the dataset, not the cell number” ??
  ○ How do we generate citations automatically?
● Common sense required here
  ○ Communities need to develop their own guidance for what is “common sense”

Slide 17

Slide 17 text

4. Unique Identification A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.

Slide 18

Slide 18 text

Figure 3. Multiple Alignment of Ten Conserved Motifs in the RAG1 Core Proteins and Transib TPases The motifs are underlined and numbered from 1 to 10. Starting positions of the motifs immediately follow the corresponding protein names. Distances between the motifs are indicated in numbers of aa residues. Black circles denote conserved residues that form the RAG1/Transib catalytic DDE triad. The RAG1 proteins are as follows: RAG1_XL (GenBank GI no. 2501723, Xenopus laevis, frog), RAG1_HS (4557841, Homo sapiens, human), RAG1_GG (131826, Gallus gallus, chicken), RAG1_CL (1470117, Carcharhinus leucas, bull shark), RAG1_FR (4426834, Fugu rubripes, fugu fish).
http://doi.org/10.1371/journal.pbio.0030181
not machine actionable, not globally unique

Slide 19

Slide 19 text

Antibodies. The antibodies used in this study included the following: rabbit polyclonal antibodies to GABA A receptor α2 (catalog #600-401-D45 RRID:AB_11182018; Rockland Immunochemicals), α5 (catalog #AB9678 RRID:AB_570435; Millipore), β3 (catalog #ab4046 RRID:AB_2109564; Abcam), γ2 (extracellular epitope, catalog #224 003 RRID:AB_2263066; Synaptic Systems), and AMPA receptor GluA1 (catalog #AB1504 RRID:AB_2113602; Millipore; and extracellular epitope, catalog #PC246-100UG RRID:AB_564636; Millipore) …
http://doi.org/10.1523/JNEUROSCI.4415-13.2014
not machine actionable without context, not globally unique

Slide 20

Slide 20 text

17. Yim KM, Ng HW, Chan CK, Yip G, Lau FL. Sibutramine-induced acute myocardial infarction in a young lady. Clin Toxicol (Phila). 2008; 46(9):877-879.
18. Waszkiewicz N, Zalewska-Szajda B, Szajda SD, Simonienko K, Zalewska A, Szulc A et al. Sibutramine-induced mania as the first manifestation of bipolar disorder. BMC Psychiatry. 2012; 12:43.
19. Yet Another DataTables Column Filter. https://github.com/vedmack/yadcf
http://doi.org/10.1186/s13321-015-0077-3
not persistent

Slide 21

Slide 21 text

Recommendation Use a persistent identifier expressed as a URI, e.g. http://doi.org/10.1186/s13321-015-0077-3. Always include basic metadata, e.g. authors, title, publication date and publication venue.
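One way to follow this recommendation in tooling is to normalise whatever form of DOI an author supplies into a single URI form. The sketch below is illustrative only: the function name `doi_to_uri`, the regular expression, and the choice of `https://doi.org/` as the default resolver are assumptions made for this example, and the pattern covers only the common modern DOI shape.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This covers common modern DOIs, not every DOI ever registered.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_to_uri(text: str, resolver: str = "https://doi.org/") -> str:
    """Return the first DOI found in `text`, expressed as a resolvable URI."""
    match = DOI_PATTERN.search(text.strip())
    if match is None:
        raise ValueError(f"no DOI found in {text!r}")
    # Drop punctuation that often trails a DOI at the end of a sentence.
    doi = match.group(0).rstrip(".,;")
    return resolver + doi


if __name__ == "__main__":
    print(doi_to_uri("doi:10.1186/s13321-015-0077-3"))
    print(doi_to_uri("http://dx.doi.org/10.1594/PANGAEA.726855"))
```

The URI alone is not a citation: the basic metadata (authors, title, publication date, venue) still has to travel with it.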

Slide 22

Slide 22 text

5. Access Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.

Slide 23

Slide 23 text

5. Access http://xkcd.com/1592/

Slide 24

Slide 24 text

5. Access - the citation string
DataCite’s suggested formats for the citation string are:
● Creator (PublicationYear): Title. Publisher. Identifier
● Creator (PublicationYear): Title. Version. Publisher. ResourceType. Identifier
Example: Irino, T; Tada, R (2009): Chemical and mineral compositions of sediments from ODP Site 127‐797. Geological Institute, University of Tokyo. http://dx.doi.org/10.1594/PANGAEA.726855
● The identifier is the clickable link that facilitates access to the data itself.
● The rest of the string gives enough information that, if the link doesn’t work, a web search might be able to find the resource.
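The two suggested formats can be generated from a metadata record, which is one possible answer to "how do we generate citations automatically?". The following is a minimal sketch: the `DatasetRecord` class and `format_citation` function are hypothetical names invented for this example, not part of any DataCite library, and the field names simply mirror the placeholders in the formats above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DatasetRecord:
    """Hypothetical record holding the fields named in the citation formats."""
    creators: Sequence[str]
    publication_year: int
    title: str
    publisher: str
    identifier: str
    version: Optional[str] = None
    resource_type: Optional[str] = None


def format_citation(record: DatasetRecord) -> str:
    """Build 'Creator (PublicationYear): Title. [Version.] Publisher. [ResourceType.] Identifier'."""
    creators = "; ".join(record.creators)
    parts = [record.title]
    if record.version:
        parts.append(record.version)
    parts.append(record.publisher)
    if record.resource_type:
        parts.append(record.resource_type)
    # Strip trailing full stops so joining never produces "..".
    body = ". ".join(part.rstrip(".") for part in parts)
    return f"{creators} ({record.publication_year}): {body}. {record.identifier}"


if __name__ == "__main__":
    record = DatasetRecord(
        creators=["Irino, T", "Tada, R"],
        publication_year=2009,
        title="Chemical and mineral compositions of sediments from ODP Site 127-797",
        publisher="Geological Institute, University of Tokyo",
        identifier="http://dx.doi.org/10.1594/PANGAEA.726855",
    )
    print(format_citation(record))
```

If only one of the optional fields is set, this sketch emits a hybrid of the two formats; a stricter implementation might require both or neither.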

Slide 25

Slide 25 text

5. Access - the role of landing pages

Slide 26

Slide 26 text

6. Persistence Unique identifiers, and metadata describing the data, and its disposition, should persist -- even beyond the lifespan of the data they describe.

Slide 27

Slide 27 text

Fun Fact The Joint Declaration of Data Citation Principles doesn’t use a persistent method for identification.

Slide 28

Slide 28 text

Metadata for data that have been cited should persist. Not all research data and their metadata can or should persist. Metadata for most data that have been published should persist.

Slide 29

Slide 29 text

7. Specificity and Verifiability Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.

Slide 30

Slide 30 text

7. Specificity and Verifiability - “Frozen” Data
We can meet the requirements of Principle 7 if the data (and corresponding metadata) is “frozen”, i.e. complete and finalised, and not going to be modified or updated. This is also known as “fixity”.
Data isn’t that simple! For example, long-running data collections keep changing.
Proper version control can help here.
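For frozen data, one common way to make fixity checkable is to record a cryptographic digest alongside the citation metadata and recompute it on retrieval. This is a minimal sketch under that assumption; the function names `fixity_digest` and `verify_fixity` and the choice of SHA-256 are illustrative, not something the principles prescribe.

```python
import hashlib


def fixity_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest to store with the citation metadata."""
    return hashlib.sha256(data).hexdigest()


def verify_fixity(data: bytes, recorded_digest: str) -> bool:
    """Check that retrieved bytes match the digest recorded at citation time."""
    return fixity_digest(data) == recorded_digest


if __name__ == "__main__":
    frozen = b"station,temp\nA,12.5\n"
    digest = fixity_digest(frozen)
    print(verify_fixity(frozen, digest))                # unchanged data
    print(verify_fixity(frozen + b"A,13.1\n", digest))  # data appended later
```

A byte-level digest only works when the cited object really is frozen: any update, even a harmless reformat, changes the digest, which is why evolving data needs versioning on top.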

Slide 31

Slide 31 text

7. Specificity and Verifiability - Dynamic Data
● Special cases:
  ○ Timeslicing
  ○ Append-only datasets
● RDA Working Group on Data Citation (https://rd-alliance.org/groups/data-citation-wg.html)
  ○ Recommendations to enable data citation of evolving data (https://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_150924.pdf)
  ○ Instead of static data exports or textual descriptions of data subsets, support a query-centric view of data sets.
  ○ Proposed solution enables precise identification of the very subset and version of data used, supporting reproducibility of processes, sharing and reuse of data.
  ○ The set of recommendations is undergoing evaluation in a series of pilots in different domains.
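The idea of citing a query rather than an export can be sketched for the append-only case. This is a toy illustration loosely inspired by the approach described above, not an implementation of the RDA recommendations: the class names, the logical clock used as a timestamp, the single `station` filter, and the `example-pid/` identifier scheme are all assumptions invented for this example.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Tuple


def _digest(rows: List[dict]) -> str:
    """Hash a result set in a stable serialisation so it can be verified later."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class CitedQuery:
    pid: str            # hypothetical identifier assigned to the stored query
    station: str        # the query parameter that selects the subset
    as_of: int          # logical timestamp at which the query was executed
    result_digest: str  # digest of the result set, for later verification


class AppendOnlyDataset:
    """Toy append-only store in which every row carries a logical timestamp."""

    def __init__(self) -> None:
        self._rows: List[Tuple[int, dict]] = []
        self._clock = 0

    def append(self, row: dict) -> int:
        self._clock += 1
        self._rows.append((self._clock, dict(row)))
        return self._clock

    def select(self, station: str, as_of: int) -> List[dict]:
        """Return the rows for `station` as the dataset stood at time `as_of`."""
        return [row for ts, row in self._rows
                if ts <= as_of and row["station"] == station]

    def cite(self, station: str) -> CitedQuery:
        """Store the query, its timestamp and a digest of its current result."""
        as_of = self._clock
        digest = _digest(self.select(station, as_of))
        return CitedQuery(f"example-pid/{digest[:12]}", station, as_of, digest)

    def resolve(self, citation: CitedQuery) -> List[dict]:
        """Re-run the cited query against the cited timeslice and verify it."""
        rows = self.select(citation.station, citation.as_of)
        if _digest(rows) != citation.result_digest:
            raise ValueError("retrieved subset differs from the cited one")
        return rows


if __name__ == "__main__":
    ds = AppendOnlyDataset()
    ds.append({"station": "A", "temp": 12.5})
    citation = ds.cite("A")
    ds.append({"station": "A", "temp": 13.1})  # dataset keeps growing
    print(ds.resolve(citation))                # still the subset that was cited
```

Because rows are never modified or deleted, re-running the stored query against the stored timestamp reproduces the cited subset even after the dataset has grown.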

Slide 32

Slide 32 text

8. Interoperability and Flexibility Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.

Slide 33

Slide 33 text

Not so happy with this principle (but might be the wording).

Slide 34

Slide 34 text

8. Interoperability and Flexibility (modified) Data citation methods should follow users’ expectations. They should not be different from citation methods for journal articles or other scholarly content, unless there is a very compelling reason to do so. Data citation methods should be generic rather than specific to a particular community.

Slide 35

Slide 35 text

Conclusions?
Principles are great, but we need to implement them.
The devil is in the details: there will be no “one size fits all” solution.
“Common sense” solutions will work, but will vary across communities, so it is important to collaborate to keep things moving in (roughly) the same direction.
Don’t let the perfect be the enemy of the good!