from a campus research lab contacted RDS and SLIS about managing their data
• The group ran out of room on available servers
  – Typical experiments generate 100-200 GB of raw data
  – Created an idealized workflow to justify purchasing a new data storage server (55 TB), but
  – In the process, they realized they were not sure what or where their old data was
• Summer practicum project goals
  – Understand the lab’s current practices
  – Help the group establish better practices for the future
your published articles to be retracted
• You don’t want to lose your funding for non-compliance with policies
• You don’t want to waste time finding your old data
• You want to be able to share data with your groupmates (including future group members)
Watch (The Center for Scientific Integrity) estimates 500 to 600 per year
• Some are clear fraud
  – Andrew Wakefield
  – Hwang Woo Suk
But papers can also be retracted if researchers cannot provide the raw data in response to questions!
apoptosis involves the angiotensin II type 2 receptor, GATA-6 activation, and FasL-Fas engagement
  – As we reported in December, UNSW cleared Levon Khachigian of misconduct, concluding that his previous issues stemmed from “genuine error or honest oversight.” Now, Circulation Research is retracting one of his papers after an investigation commissioned by UNSW was unable to find electronic records for two similar images from a 2009 paper, nor records of the images in original lab books. http://retractionwatch.com/2016/02/05/investigation-prompts-5th-retraction-for-cancer-researcher-for-unresolvable-concerns/#more-36568
• Docosahexaenoic acid in combination with celecoxib modulates HSP70 and p53 proteins in prostate cancer cells
  – A 2006 paper investigating the effects of docosahexaenoic acid (DHA) and celecoxib on prostate cancer cells has been retracted because it appears to contain panels that were duplicated, and the authors could not provide the raw data to show otherwise. http://retractionwatch.com/2016/02/23/we-are-living-in-hell-authors-retract-2nd-paper-due-to-missing-raw-data/#more-36865
• Low sodium versus normal sodium diets in systolic heart failure: systematic review and meta-analysis
  – The Committee considered that without sight of the raw data on which the two papers containing the duplicate data were based, their reliability could not be substantiated. Following inquiries, it turns out that the raw data are no longer available having been lost as a result of computer failure. Under the circumstances, it was the Committee’s recommendation that the Heart meta-analysis should be retracted on the ground that the reliability of the data on which it is based cannot be substantiated. http://retractionwatch.com/2013/05/02/heart-pulls-sodium-meta-analysis-over-duplicated-and-now-missing-data/#more-13986
Policy: UW-Madison must retain research data in sufficient detail and for an adequate period of time to enable appropriate responses to questions about accuracy, authenticity, primacy and compliance with laws and regulations governing the conduct of the research. It is the responsibility of the Principal Investigator to determine what needs to be retained under this policy.
Stewardship and Retention: Principal Investigators should adopt an orderly system of Data organization, access, and retention and should communicate the chosen system to all members of a research group and to the appropriate administrative personnel, where applicable. Particularly for long-term research projects, PIs should establish and maintain procedures for the protection of essential records in the event of a natural disaster or other emergency. Research Data must be archived for a minimum of seven years after the final project close-out, with original Data retained wherever possible.
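The seven-year minimum above can be turned into a concrete date for each project. The following is a minimal sketch, not an official calculation: the function name `earliest_disposal_date` is hypothetical, and rolling a 29 February close-out forward to 1 March is an assumption chosen to err on the side of keeping data longer.

```python
from datetime import date

# Minimum retention stated in the policy text above; sponsors or the PI may require longer.
RETENTION_YEARS = 7


def earliest_disposal_date(closeout: date, years: int = RETENTION_YEARS) -> date:
    """Return the first date on which the minimum retention period has elapsed."""
    try:
        return closeout.replace(year=closeout.year + years)
    except ValueError:
        # Close-out fell on 29 February and the target year is not a leap year.
        # Assumption: roll forward to 1 March rather than back to 28 February.
        return date(closeout.year + years, 3, 1)


if __name__ == "__main__":
    # Hypothetical close-out date, for illustration only.
    print(earliest_disposal_date(date(2016, 6, 30)))
```

The date returned is only a floor: the policy asks for original Data to be retained wherever possible, so reaching it does not by itself mean the data should be discarded.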
Roles and Responsibilities: The Principal Investigator is responsible for the stewardship and retention of research Data as well as for determinations concerning access to and appropriate use of Data. Other Research Contributors are responsible to cooperate with the PI in carrying out the requirements of this policy.
February 2013, OSTP released a memo on “Increasing Access to the Results of Federally Funded Scientific Research”
  – A formal statement on the importance of sharing data obtained from federally funded research
  – Requires agencies that invest in research and development to have clear and coordinated data access policies, including policies related to data management
  – Requires granting agencies to evaluate the merits of proposed data management plans and ensure that researchers comply with these plans
  – Also requires that agencies allow proposals to include costs for data management and access
• To date, no coordinated policy has been released
that, under the NIH Grants Policy Statement, they are required to keep the data for 3 years following closeout of a grant or contract agreement...Thus, the grantee institution may have additional policies and procedures regarding the custody, distribution, and required retention period for data produced under research awards.
Misconduct • The Regulation imposes a 6-year time limitation for occurrences of research misconduct to be brought to the attention of an institution or the Department of Health and Human Services (HHS) (see § 93.105)
are expected to be able to explain and defend their results. Doing so usually entails maintaining complete records of how data were collected. The manner in which one maintains such records and makes them available to others will vary from project to project. What constitutes reasonable procedures will be determined by the community of interest through the process of peer review and program management. These standards are likely to evolve as new technologies and resources become available. (http://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp)
management plan, here is how I plan to deal with research data in the future: I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is. C. Titus Brown (2010). http://ivory.idyll.org/blog/data-management.html
Figures 2, 3, 5; Table 1 Figure 4 plots S1 2013 Figures 2, 3, 4B, 6, and 7 Figures 4A, 5 (raw or analyzed) 2015 Figures 3 through 6 Figure 2 (various) Figure 7 (Bliss model) Most raw images BEC data (S1, Figure 2) β-actin data (S2) 2016 Possible raw image data Possibly related code All 2016 SigmaPlot versions of Figures 2, 3, 4, 6, 8, and 11 Raw data (confirmation) Figures 4, 9, 10, 12, 13 Tables 1, 2, and 3 2016 All analyzed data Raw images for Figure 2 Part A
names
  – And they are logical and correspond to each other
• Ties between paper lab notebook and digital files
  – Print-outs pasted into lab notebook
• Descriptive file and directory names that provide information about the data they contain
• Logical storage locations
• Scanned copies of lab notebooks are easier to search through
  – And provide a backup in case of disaster
directories and/or on various servers with no clear links between them
• Data located on lab computers, not one of the servers (i.e., not backed up)
• Copies of figures in PowerPoint slides, but the underlying data is not linked
• Lack of dates in file and directory names
• Lack of link between lab notebook and digital data
• Non-descriptive file and directory names
• Incomplete lab notebooks (e.g., a notebook indicates an experiment was rerun in response to reviewer comments on October 20, 2013, but there are no entries after this date)
• Analysis files (code, output) not stored on one of the shared servers or not with the data
• Code not documented sufficiently
• In Excel files, much of the data was on sheets named “Sheet1” and the columns were not labeled
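Two of the problems listed above, missing dates and non-descriptive names, can be spotted mechanically. The sketch below is illustrative only: the accepted date formats (`YYYY-MM-DD` or `YYYYMMDD`), the list of generic names, and the function `audit_name` are all assumptions, not the procedure used in the practicum.

```python
import re
from pathlib import PurePath

# Assumed convention: a date written as YYYY-MM-DD or YYYYMMDD somewhere in the name.
DATE_PATTERN = re.compile(
    r"(?<!\d)(19|20)\d{2}-?(0[1-9]|1[0-2])-?(0[1-9]|[12]\d|3[01])(?!\d)"
)

# Hypothetical list of file stems that say nothing about the content.
GENERIC_STEMS = {"data", "untitled", "new", "test", "final", "book1", "sheet1", "copy"}


def audit_name(path: str) -> list[str]:
    """Return a list of naming problems found for one file or directory path."""
    problems = []
    name = PurePath(path).name
    stem = PurePath(path).stem.lower()
    if not DATE_PATTERN.search(name):
        problems.append("no date in name")
    if stem in GENERIC_STEMS or len(stem) < 4:
        problems.append("non-descriptive name")
    return problems


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for example in ["2013-10-20_plaque-counts_rerun.xlsx", "data.xlsx"]:
        print(example, audit_name(example))
```

A check like this only flags candidates for review; it cannot tell whether a name that passes is actually meaningful to other group members.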
Some journals (e.g., PLoS, Nature) are beginning to require publishing of raw data along with the paper
• Submitting the raw data to the journal is not sufficient data management
  – Data files do not include enough documentation
  – Without the code used to analyze the data, the authors will not be able to respond to questions about their work
• Don’t be like Sam: https://youtu.be/N2zK3sAtr-4
(Ed.), Data information literacy: librarians, data, and the education of a new generation of researchers. West Lafayette, Indiana: Purdue University Press.
• Akmon, D., Zimmerman, A., Daniels, M., & Hedstrom, M. (2011). The application of archival concepts to a data-intensive environment: Working with scientists to understand data management and preservation needs. Archival Science, 11(3), 329-348.
• Briney, K. (2015). Data management for researchers: organize, maintain and share your data for research success. Exeter, UK: Pelagic Publishing.
for yourself and for the PI
• You must name the files in a way that current and future group members will understand
• You must save all relevant data in consistent, logical locations
• You must identify exactly what data is used in each publication
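One way to identify exactly what data is used in a publication is a manifest that lists the files behind each figure or table together with a checksum. This is a minimal sketch under stated assumptions: the manifest layout and the names `sha256_of`, `build_manifest`, and `write_manifest` are hypothetical, not an established lab or campus standard.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paper_id: str, figure_files: dict[str, list[str]], root: Path) -> dict:
    """Map each figure or table label to its data files and their checksums.

    Paths in figure_files are relative to root. A missing file raises
    FileNotFoundError, which is the point: the gap surfaces before publication.
    """
    manifest = {"paper": paper_id, "items": {}}
    for label, relative_paths in figure_files.items():
        manifest["items"][label] = [
            {"path": rel, "sha256": sha256_of(root / rel)} for rel in relative_paths
        ]
    return manifest


def write_manifest(manifest: dict, destination: Path) -> None:
    """Save the manifest as human-readable JSON next to the data."""
    destination.write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
```

Calling `build_manifest` with a dictionary such as `{"Figure 2": ["raw/images_a.tif"]}` (a hypothetical path) produces one entry per figure. Storing the checksum lets a later reader confirm that the file on the server is still the one used for the paper.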
scientists doing publicly funded and/or published research at the UW
  – Librarians can remind researchers of these obligations
• Researchers can be really bad at RDM
  – Librarians can teach very basic skills
• Start small and build from there
  – Librarians can help a lab get started
cell sheets into recipient mice
  – “After learning of concerns that two figures are “very similar” and “some of the error bars look unevenly positioned,” the rest of the authors were unable to locate the raw data, according to the note.” http://retractionwatch.com/2016/02/26/stap-stem-cell-researcher-obokata-loses-another-paper/#more-37272
• Experimental evidence that maternal corticosterone controls adaptive offspring sex ratios
  – “But after questions about the data were raised, the authors were unable to address the “mismatch” between the experimental data and those that were published.” http://retractionwatch.com/2015/07/23/data-mismatch-and-authors-illness-pluck-finch-study-from-literature/#more-29839
• Eleven papers by one author
  – “Following an investigation by Nanyang Technological University, primary data are no longer available to be authenticated and we have been informed that there are serious concerns about the ethical environment in which the data were collected.” http://retractionwatch.com/2016/06/14/journal-to-retract-all-yes-all-articles-by-education-researcher-after-investigation/#more-40947
• Three “expressions of concern from two journals” for one author (with seven other retractions)
  – “In the past, Walumbwa has said he only keeps data until his papers are published, but a lack of raw data has become a common theme in his notices, which now also include four corrections, and one other EOC (making a new total of four). There are no standard rules about how long to store raw data, but one journal that issued two of the new EOCs has since updated its submission policy to require that authors keep data for at least five years.” http://retractionwatch.com/2016/04/01/concerns-attached-to-three-more-papers-by-retraction-laden-management-researcher/#more-36700
• PI recognized that the group’s data management practices were poor, but did not feel that she had the expertise to put in place and enforce standards
• RDM is generally worse in “little science” labs than “big science” labs (Borgman, Wallis & Enyedy, 2007)
• In the absence of formal procedures, students will create their own data management and documentation systems
• Inconsistent data management and documentation systems prevent data from being used by other researchers in the group
• Descriptive file names are essential for allowing researchers to understand their own data in the future as well as to share data
ecology lab
• Developed a “one-shot” training session that addressed good lab notebook practices, file naming, data structure, sharing data (e.g., through repositories in this field), metadata, and data ownership and preservation
• Presented policies and guidelines and asked the students to reflect on how their own practices aligned, using materials from DataONE
  – Explained the policies underlying why data management is important
• RDM practices must be aligned with research workflow and publication practices
• Training of this type works best with faculty who have “bought into” the concept of research data management
graduate students
• Created an online course to help students understand and track data quality in published research (not for credit)
• Assignments for each module helped students develop a data management plan for their research
• The first step in developing a data management plan is to inventory the types of data and data storage options
• Files need to have descriptive names that indicate content and need to be consistent across team members so that data can be used when one teammate leaves
• Directory structures should be predictable and have shared terminology across the team
• There needs to be documentation for how data moved from raw to processed states
• The content of data files should identify the creator and date(s) generated, and data should be clearly labeled (e.g., axis labels on charts, column headings on spreadsheets)
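Descriptive, consistent file names and a predictable directory structure are easier to keep when paths are generated rather than typed by hand. The following is a sketch of one possible convention, stated as an assumption rather than a recommendation from the course: the layout `project/experiment/stage/` and the name pattern `date_initials_experiment_description` are hypothetical.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Assumed shared vocabulary for where data sits on its way from raw to results.
STAGES = {"raw", "processed", "results"}


def slug(text: str) -> str:
    """Lowercase the text and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


def build_path(project: str, experiment: str, stage: str, run_date: date,
               initials: str, description: str, extension: str) -> PurePosixPath:
    """Build project/experiment/stage/date_initials_experiment_description.ext."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {sorted(STAGES)}")
    filename = "_".join(
        [run_date.isoformat(), slug(initials), slug(experiment), slug(description)]
    ) + "." + extension.lstrip(".")
    return PurePosixPath(slug(project), slug(experiment), stage, filename)


if __name__ == "__main__":
    # Hypothetical project, experiment, and researcher, for illustration only.
    print(build_path("Virus Growth", "EXP 012", "raw", date(2016, 3, 4),
                     "JD", "plaque counts", ".csv"))
```

Putting the ISO date first makes files sort chronologically, and including the creator's initials and the experiment name in every file name keeps the file interpretable even if it is copied out of its directory.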
biological engineering lab
• A series of three workshops with homework assignments between
  – Presented basic data information literacy skills and had students discuss the procedures their faculty advisor had developed
  – Students revised procedures so they fit better with the experimental workflows
  – Students explored metadata standards by searching for existing data that could be of use in their research
  – Students analyzed their own data and described them using metadata standards from an appropriate online repository
• Standards need to be detailed and unambiguous
• Standards need to be realistic with respect to the laboratory’s research methods
• Standards must be developed collaboratively so that students will follow them
notebook with your name and a number indicating the sequential order of notebooks over time
• Label each page with the date
• Do not skip pages
  – If you accidentally skip a page, mark an “X” over the page in pen
• Record all data
  – If errors are made, cross the entry out with a single line and note the reason
  – Do not remove pages or use whiteout to cover entries
• Tape or paste all external materials to the page
  – Do NOT insert notes on separate pieces of paper (e.g., notes recorded on paper towels) without pasting them to the page
• Follow naming conventions for experiments
• If you need to annotate a previous entry, use a different color of ink to mark the change and initial and date the revision
experiment name (follow naming conventions)
• Researcher name/initials
• Rationale behind experiment (i.e., why you are doing the experiment)
• Date(s) of experiment
• Type(s) of data collected (e.g., microscopy images, plaque counts)
• Cell type/line(s) and provenance (e.g., source, passage)
• Virus type(s) and provenance (e.g., source)
• Conditions (e.g., temperatures, pressures, growth media, dilutions and stock sources)
• Protocols and methods
  – This can be a reference to a standard protocol with all deviations, planned or accidental, recorded
  – This also includes names of the files used in processing the data (e.g., JEX scripts used for processing microscope images)
• Instruments used
• Results, which may include
  – Hand-recorded data
  – Processed data, if of reasonable size to paste into the notebook
• Location (directory path) of raw data on Raw Data Storage server
• Location (directory path) of processed data and results on the Document Storage server (or other server)
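The same checklist can double as a small machine-readable record saved alongside the digital data, which strengthens the tie between notebook and files. This is a sketch under assumptions: the class `ExperimentRecord`, its field names, and the choice of which fields count as required are hypothetical and would need to match the lab's own conventions.

```python
import json
from dataclasses import asdict, dataclass

# Assumption: these checklist items must be filled in for every experiment.
REQUIRED = (
    "experiment_name", "researcher", "rationale", "dates", "data_types",
    "conditions", "protocols", "instruments", "raw_data_path", "processed_data_path",
)


@dataclass
class ExperimentRecord:
    experiment_name: str            # follows the lab's naming conventions
    researcher: str                 # name or initials
    rationale: str                  # why the experiment is being done
    dates: list[str]                # date(s) of the experiment
    data_types: list[str]           # e.g., microscopy images, plaque counts
    conditions: dict[str, str]      # temperatures, growth media, dilutions, ...
    protocols: list[str]            # protocol references, deviations, processing scripts
    instruments: list[str]
    raw_data_path: str              # directory path on the raw data server
    processed_data_path: str        # directory path for processed data and results
    cells: str = ""                 # cell type/line(s) and provenance, if applicable
    viruses: str = ""               # virus type(s) and provenance, if applicable
    results_summary: str = ""

    def missing_fields(self) -> list[str]:
        """List required fields that are still empty."""
        return [name for name in REQUIRED if not getattr(self, name)]

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the data files."""
        return json.dumps(asdict(self), indent=2)
```

Calling `missing_fields()` before an experiment is filed gives a quick completeness check, for example catching an entry where the raw data location was never recorded.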