Slide 1

Slide 1 text

Research data management (RDM): Convincing a lab group that they really need to manage their data Allison Langham | School of Library & Information Studies | UW-Madison

Slide 2

Slide 2 text

Research data management (RDM): Convincing a lab group that they really need to manage their data August 25, 2016

Slide 3

Slide 3 text

Outline for today • What the project was • Why RDM is important • How a research group currently manages their data • What the group can do to improve their practices • What comes next

Slide 4

Slide 4 text

Exploring RDM practices in a campus lab • The PI from a campus research lab contacted RDS and SLIS about managing their data • The group ran out of room on available servers – Typical experiments generate 100-200GB of raw data – Created an idealized workflow to justify purchasing a new data storage server (55TB), but – In the process, they realized they were not sure what or where their old data was • Summer practicum project goals – Understand the lab’s current practices – Help the group establish better practices for the future
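As a back-of-the-envelope check on the figures above, the sketch below estimates how many experiments' worth of raw data a 55TB server could hold at 100-200GB each. It assumes decimal units (1 TB = 1,000 GB) and counts raw data only; backups, processed data, and file-system overhead are not specified on the slide and are ignored here.

```python
# Rough capacity estimate; assumes 1 TB = 1,000 GB and raw data only.
SERVER_TB = 55
GB_PER_TB = 1_000  # assumption: decimal units


def experiments_that_fit(server_tb: float, gb_per_experiment: float) -> int:
    """Whole number of experiments whose raw data fit on the server."""
    return int(server_tb * GB_PER_TB // gb_per_experiment)


low = experiments_that_fit(SERVER_TB, 200)   # every experiment at the large end
high = experiments_that_fit(SERVER_TB, 100)  # every experiment at the small end
print(f"Roughly {low} to {high} experiments")
```

Under these assumptions the server holds a few hundred experiments, which is why knowing what the old data is, and where, matters before buying more storage.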

Slide 6

Slide 6 text

Why researchers should care about RDM • You don’t want your published articles to be retracted • You don’t want to lose your funding for non-compliance with policies • You don’t want to waste time finding your old data • You want to be able to share data with your groupmates (including future group members)

Slide 7

Slide 7 text

You really don’t want one, but bad RDM could cause one RETRACTIONS

Slide 8

Slide 8 text

Journal article retractions are not just for fraud • Retraction Watch (The Center for Scientific Integrity) estimates 500 to 600 retractions per year • Some are clear fraud – Andrew Wakefield – Hwang Woo Suk • But papers can also be retracted if researchers cannot provide the raw data in response to questions!

Slide 9

Slide 9 text

Retractions for missing data • Angiotensin II-inducible smooth muscle cell apoptosis involves the angiotensin II type 2 receptor, GATA-6 activation, and FasL-Fas engagement – As we reported in December, UNSW cleared Levon Khachigian of misconduct, concluding that his previous issues stemmed from “genuine error or honest oversight.” Now, Circulation Research is retracting one of his papers after an investigation commissioned by UNSW was unable to find electronic records for two similar images from a 2009 paper, nor records of the images in original lab books. http://retractionwatch.com/2016/02/05/investigation-prompts-5th-retraction-for-cancer-researcher-for-unresolvable-concerns/#more-36568 • Docosahexaenoic acid in combination with celecoxib modulates HSP70 and p53 proteins in prostate cancer cells – A 2006 paper investigating the effects of docosahexaenoic acid (DHA) and celecoxib on prostate cancer cells has been retracted because it appears to contain panels that were duplicated, and the authors could not provide the raw data to show otherwise. http://retractionwatch.com/2016/02/23/we-are-living-in-hell-authors-retract-2nd-paper-due-to-missing-raw-data/#more-36865 • Low sodium versus normal sodium diets in systolic heart failure: systematic review and meta-analysis – The Committee considered that without sight of the raw data on which the two papers containing the duplicate data were based, their reliability could not be substantiated. Following inquiries, it turns out that the raw data are no longer available having been lost as a result of computer failure. Under the circumstances, it was the Committee’s recommendation that the Heart meta-analysis should be retracted on the ground that the reliability of the data on which it is based cannot be substantiated http://retractionwatch.com/2013/05/02/heart-pulls-sodium-meta-analysis-over-duplicated-and-now-missing-data/#more-13986

Slide 10

Slide 10 text

Several policies require you to manage your data POLICIES

Slide 11

Slide 11 text

UW-Madison Policy on Data Stewardship, Access, and Retention • 4.0 Policy: UW-Madison must retain research data in sufficient detail and for an adequate period of time to enable appropriate responses to questions about accuracy, authenticity, primacy and compliance with laws and regulations governing the conduct of the research. It is the responsibility of the Principal Investigator to determine what needs to be retained under this policy.

Slide 12

Slide 12 text

UW-Madison Policy on Data Stewardship, Access, and Retention • 4.2 Stewardship and Retention: Principal Investigators should adopt an orderly system of Data organization, access, and retention and should communicate the chosen system to all members of a research group and to the appropriate administrative personnel, where applicable. Particularly for long-term research projects, PIs should establish and maintain procedures for the protection of essential records in the event of a natural disaster or other emergency. Research Data must be archived for a minimum of seven years after the final project close-out, with original Data retained wherever possible.
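The seven-year minimum in section 4.2 can be turned into a concrete date once a project's close-out date is known. The sketch below is illustrative only: the function name and the example close-out date are hypothetical, and because the policy states a minimum, the result is a lower bound that funder or journal requirements may extend.

```python
from datetime import date


def retention_end(closeout: date, years: int = 7) -> date:
    """Earliest date on which the minimum retention period has elapsed."""
    try:
        return closeout.replace(year=closeout.year + years)
    except ValueError:
        # Close-out fell on Feb 29 and the target year is not a leap year.
        return closeout.replace(year=closeout.year + years, day=28)


# Hypothetical close-out date, for illustration only.
print(retention_end(date(2016, 8, 25)))  # 2023-08-25
```

The same function covers other retention periods by changing `years`.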

Slide 13

Slide 13 text

UW-Madison Policy on Data Stewardship, Access, and Retention • 5.0 Roles and Responsibilities: The Principal Investigator is responsible for the stewardship and retention of research Data as well as for determinations concerning access to and appropriate use of Data. Other Research Contributors are responsible to cooperate with the PI in carrying out the requirements of this policy.

Slide 14

Slide 14 text

White House Office of Science and Technology Policy • In February 2013, OSTP released a memo on “Increasing Access to the Results of Federally Funded Scientific Research” – A formal statement on the importance of sharing data obtained from federally funded research – Requires agencies that invest in research and development to have clear and coordinated data access policies, including policies related to data management – Requires granting agencies to evaluate the merits of proposed data management plans and ensure that researchers comply with these plans – Also requires that agencies allow proposals to include costs for data management and access • To date, no coordinated policy has been released

Slide 15

Slide 15 text

NIH Data Sharing Policy and Implementation • Grantees should note that, under the NIH Grants Policy Statement, they are required to keep the data for 3 years following closeout of a grant or contract agreement...Thus, the grantee institution may have additional policies and procedures regarding the custody, distribution, and required retention period for data produced under research awards.

Slide 16

Slide 16 text

NIH Research Integrity Requirements for Making a Finding of Research Misconduct • The Regulation imposes a 6-year time limitation for occurrences of research misconduct to be brought to the attention of an institution or the Department of Health and Human Services (HHS) (see § 93.105)

Slide 17

Slide 17 text

NSF Dissemination and Sharing of Research Results • All researchers are expected to be able to explain and defend their results. Doing so usually entails maintaining complete records of how data were collected. The manner in which one maintains such records and makes them available to others will vary from project to project. What constitutes reasonable procedures will be determined by the community of interest through the process of peer review and program management. These standards are likely to evolve as new technologies and resources become available. (http://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp)

Slide 18

Slide 18 text

A sadly true satire “Now, as to my actual data management plan, here is how I plan to deal with research data in the future: I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.” C. Titus Brown (2010). http://ivory.idyll.org/blog/data-management.html

Slide 19

Slide 19 text

CURRENT RDM PRACTICES IN A CAMPUS RESEARCH LAB

Slide 20

Slide 20 text

Investigating current practices • Searched for data from six papers published between 2013 and 2016

Slide 21

Slide 21 text

Found and not (yet) found Paper Found Not found 2013 Figures 2, 3, 5; Table 1 Figure 4 plots S1 2013 Figures 2, 3, 4B, 6, and 7 Figures 4A, 5 (raw or analyzed) 2015 Figures 3 through 6 Figure 2 (various) Figure 7 (Bliss model) Most raw images BEC data (S1, Figure 2) β-actin data (S2) 2016 Possible raw image data Possibly related code All 2016 SigmaPlot versions of Figures 2, 3, 4, 6, 8, and 11 Raw data (confirmation) Figures 4, 9, 10, 12, 13 Tables 1, 2, and 3 2016 All analyzed data Raw images for Figure 2 Part A

Slide 22

Slide 22 text

What worked well • Dates in lab notebooks and digital names – And they are logical and correspond to each other • Ties between paper lab notebook and digital files – Print-outs pasted into lab notebook • Descriptive file and directory names that provided information about the data they contain • Logical storage locations • Scanned copies of lab notebooks are easier to search through – And provide a backup in case of disaster

Slide 23

Slide 23 text

Example: Kinetics of antiviral state development cells following treatment by interferons

Slide 24

Slide 24 text

Researcher directory: \_Data Backup\Past Group Members\EAV Lab notebooks Data

Slide 25

Slide 25 text

Scanned lab notebook Dated 11/18/13

Slide 26

Slide 26 text

Digital data organized by date of experiment Dated 11/14/13

Slide 27

Slide 27 text

Images and spreadsheets This file has the analyzed data This directory has the images

Slide 28

Slide 28 text

Analyzed data file

Slide 29

Slide 29 text

Images α, γ, λ1, λ2, λ3, CM

Slide 30

Slide 30 text

What’s missing • Which rows are which IFN? • How were the images translated to numerical values? No labels. No background images. No codes.

Slide 31

Slide 31 text

What did not work well • Data stored in various directories and/or on various servers with no clear links between them • Data located on lab computers, not one of the servers (i.e., not backed up) • Copies of figures in PowerPoint slides but the underlying data is not linked • Lack of dates in file and directory names • Lack of link between lab notebook and digital data • Non-descriptive file and directory names • Incomplete lab notebooks (e.g., a notebook indicates an experiment was rerun in response to reviewer comments on October 20, 2013 but there are no entries after this date) • Analysis files (codes, output) not stored on one of the shared servers or not with the data • Codes not documented sufficiently • In Excel files, much of the data was on sheets named “Sheet1” and the columns were not labeled
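Two of the problems listed above, missing dates and non-descriptive names, can be checked automatically. The following is a minimal sketch, not the lab's actual tooling: the accepted date formats and the list of "generic" names are assumptions chosen for illustration, and a date anywhere in the path (file or directory name) counts as dated.

```python
import re
from pathlib import Path

# Assumption: dates in names look like 2013-11-14, 20131114, or 11-14-13.
DATE_PATTERN = re.compile(r"\d{4}-?\d{2}-?\d{2}|\d{1,2}[-_.]\d{1,2}[-_.]\d{2,4}")
# Assumption: names like these say nothing about the contents.
GENERIC_STEMS = {"data", "new", "final", "test", "untitled", "book1", "sheet1"}


def audit_names(root: Path) -> list[tuple[Path, str]]:
    """List files under root whose names lack a date or are generic."""
    problems = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Check the path relative to root so parent folders outside the
        # data directory do not accidentally supply a "date".
        relative = path.relative_to(root).as_posix()
        if not DATE_PATTERN.search(relative):
            problems.append((path, "no date in file or directory name"))
        if path.stem.lower() in GENERIC_STEMS:
            problems.append((path, "non-descriptive file name"))
    return problems
```

Calling `audit_names(Path("some/data/directory"))` returns a list of `(path, reason)` pairs that a group member could review before files are moved to a shared server.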

Slide 32

Slide 32 text

LAB NOTEBOOKS

Slide 39

Slide 39 text

DIGITAL DATA

Slide 40

Slide 40 text

No papers from 2016

Slide 44

Slide 44 text

7,012 TIFs

Slide 45

Slide 45 text

Submitting raw data to a journal is not enough • Some journals (e.g., PLoS, Nature) are beginning to require publishing of raw data along with the paper • Submitting the raw data to the journal is not sufficient data management – Data files do not include enough documentation – Without the code used to analyze the data, the authors will not be able to respond to questions about their work • Don’t be like Sam: https://youtu.be/N2zK3sAtr-4

Slide 46

Slide 46 text

IMPROVING THE LAB’S RDM PRACTICES

Slide 47

Slide 47 text

Case studies and resources • Carlson, J. & Johnston, L.R. (Eds.). (2015). Data information literacy: Librarians, data, and the education of a new generation of researchers. West Lafayette, Indiana: Purdue University Press. • Akmon, D., Zimmerman, A., Daniels, M., & Hedstrom, M. (2011). The application of archival concepts to a data-intensive environment: Working with scientists to understand data management and preservation needs. Archival Science, 11(3), 329-348. • Briney, K. (2015). Data management for researchers: Organize, maintain and share your data for research success. Exeter, UK: Pelagic Publishing.

Slide 48

Slide 48 text

Standard Operating Procedures (SOPs) • Purpose is to standardize how data is stored • Four drafts – Legal Obligations – Naming Conventions – Data to Record – Preparing Publications • Corresponding form PublicationForm.xlsx

Slide 49

Slide 49 text

Summary of SOPs • You must manage your data, both for yourself and for the PI • You must name the files in a way that current and future group members will understand • You must save all relevant data in consistent, logical locations • You must identify exactly what data is used in each publication
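One way to "identify exactly what data is used in each publication" is a per-paper manifest that lists every file behind each figure along with a checksum, so later questions can be answered against files known to be unchanged. The sketch below is an assumption about how such a record could look; it is not the content of the lab's PublicationForm.xlsx, and the column names are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large image files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(figure_files: dict[str, list[Path]], manifest: Path) -> None:
    """Write one CSV row per file: figure label, path, size, checksum."""
    with manifest.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["figure", "path", "bytes", "sha256"])
        for figure, paths in figure_files.items():
            for path in paths:
                writer.writerow(
                    [figure, str(path), path.stat().st_size, sha256_of(path)]
                )
```

A call such as `write_manifest({"Figure 2": [raw_image, analysis_script]}, Path("manifest.csv"))` would record both the raw data and the code used to analyze it, which addresses the gap noted earlier when raw images were found without the corresponding code.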

Slide 50

Slide 50 text

Next steps • Pledge to follow good RDM practices • Try the SOPs • Revise and expand the SOPs • Refresh good lab notebook practices • Consider move to an electronic lab notebook – Searchable! Backed-up! Can upload files from servers (or not)!

Slide 51

Slide 51 text

Why should librarians care? • RDM is required for all scientists doing publicly funded and/or published research at the UW – Librarians can remind researchers of these obligations • Researchers can be really bad at RDM – Librarians can teach very basic skills • Start small and build from there – Librarians can help a lab get started

Slide 52

Slide 52 text

End

Slide 53

Slide 53 text

(More) Retractions for missing data • Reproducible subcutaneous transplantation of cell sheets into recipient mice – “After learning of concerns that two figures are “very similar” and “some of the error bars look unevenly positioned,” the rest of the authors were unable to locate the raw data, according to the note.” http://retractionwatch.com/2016/02/26/stap-stem-cell-researcher-obokata-loses-another-paper/#more-37272 • Experimental evidence that maternal corticosterone controls adaptive offspring sex ratios – “But after questions about the data were raised, the authors were unable to address the “mismatch” between the experimental data and those that were published.” http://retractionwatch.com/2015/07/23/data-mismatch-and-authors-illness-pluck-finch-study-from-literature/#more-29839 • Eleven papers by one author – “Following an investigation by Nanyang Technological University, primary data are no longer available to be authenticated and we have been informed that there are serious concerns about the ethical environment in which the data were collected.” http://retractionwatch.com/2016/06/14/journal-to-retract-all-yes-all-articles-by-education-researcher-after-investigation/#more-40947 • Three “expressions of concern from two journals” for one author (with seven other retractions) – “In the past, Walumbwa has said he only keeps data until his papers are published, but a lack of raw data has become a common theme in his notices, which now also include four corrections, and one other EOC (making a new total of four). There are no standard rules about how long to store raw data, but one journal that issued two of the new EOCs has since updated its submission policy to require that authors keep data for at least five years.” http://retractionwatch.com/2016/04/01/concerns-attached-to-three-more-papers-by-retraction-laden-management-researcher/#more-36700

Slide 54

Slide 54 text

CASE STUDIES

Slide 55

Slide 55 text

Akmon, 2011 • Interviews about RDM in a materials science lab • PI recognized that the group’s data management practices were poor, but did not feel that she had the expertise to put in place and enforce standards • RDM is generally worse in “little science” labs than “big science” labs (Borgman, Wallis & Enyedy, 2007) • In the absence of formal procedures, students will create their own data management and documentation systems. • Inconsistent data management and documentation systems prevent data from being used by other researchers in the group. • Descriptive file names are essential for allowing researchers to understand their own data in the future as well as to share data.

Slide 56

Slide 56 text

Westra and Walton, 2015 • Interviews and training for an ecology lab • Developed a “one-shot” training session that addressed good lab notebook practices, file naming, data structure, sharing data (e.g., through repositories in this field), metadata, and data ownership and preservation • Presented policies and guidelines and asked the students to reflect on how their own practices aligned, using materials from DataONE – Explaining the underlying policies shows why data management is important – RDM practices must be aligned with research workflow and publication practices – Training of this type works best with faculty who have “bought into” the concept of research data management

Slide 57

Slide 57 text

Johnston and Jeffryes, 2015 • Online training for civil engineering graduate students • Created an online course to help students understand and track data quality in published research (not for credit) • Assignments for each module helped students develop a data management plan for their research • The first step in developing a data management plan is to inventory the types of data and data storage options • Files need to have descriptive names that indicate content and need to be consistent across team members so that data can be used when one teammate leaves • Directory structures should be predictable and have shared terminology across the team • There needs to be documentation for how data moved from raw to processed states • The content of data files should identify the creator and date(s) generated, and data should be clearly labeled (e.g., axis labels on charts, column headings on spreadsheets).
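The first step named above, inventorying the types of data, can be started with a simple count of files by extension. This is a minimal sketch under the assumption that file extensions are a reasonable proxy for data type; it does not open or interpret any file.

```python
from collections import Counter
from pathlib import Path


def inventory(root: Path) -> dict[str, tuple[int, int]]:
    """Map each file extension under root to (file count, total bytes)."""
    counts = Counter()
    sizes = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Lower-case so .TIF and .tif are counted together.
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return {ext: (counts[ext], sizes[ext]) for ext in sorted(counts)}
```

Running this over a data server gives a quick picture of what is stored, for example how many image files exist and how much space they take, before decisions are made about storage options.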

Slide 58

Slide 58 text

Bracke and Fosmire, 2015 • Training for an agricultural and biological engineering lab • A series of three workshops with homework assignments between – Presented basic data information literacy skills and had students discuss the procedures their faculty advisor had developed – Students revised procedures so they fit better with the experimental workflows – Students explored metadata standards by searching for existing data that could be of use in their research – Students analyzed their own data and described them using metadata standards from an appropriate online repository • Standards need to be detailed and unambiguous • Standards need to be realistic with respect to the laboratory’s research methods • Standards must be developed collaboratively so that students will follow them

Slide 59

Slide 59 text

GOOD LAB NOTEBOOK PRACTICES REFRESHER

Slide 60

Slide 60 text

General lab notebook rules • Label spine and cover of notebook with your name and a number indicating the sequential order of notebooks over time • Label each page with the date • Do not skip pages – If you accidentally skip a page, mark an “X” over the page in pen • Record all data – If errors are made, cross entry out with a single line and note reason – Do not remove pages or use whiteout to cover entries • Tape or paste all external materials to the page. – Do NOT insert notes on separate pieces of paper (e.g., notes recorded on paper towels) without pasting them to the page • Follow naming conventions for experiments • If you need to annotate a previous entry, use a different color of ink to mark the change and initial and date the revision

Slide 61

Slide 61 text

What to record in the lab notebook • Project or experiment name (follow naming conventions) • Researcher name/initials • Rationale behind experiment (i.e., why you are doing the experiment) • Date(s) of experiment • Type(s) of data collected (e.g., microscopy images, plaque counts) • Cell type/line(s) and provenance (e.g., source, passage) • Virus type(s) and provenance (e.g., source) • Conditions (e.g., temperatures, pressures, growth media, dilutions and stock sources) • Protocols and methods – This can be a reference to a standard protocol with all deviations, planned or accidental, recorded – This also includes names of the files used in processing the data (e.g., JEX scripts used for processing microscope images) • Instruments used • Results, which may include – Hand-recorded data – Processed data, if of reasonable size to paste into the notebook • Location (directory path) of raw data on Raw Data Storage server • Location (directory path) of processed data and results on the Document Storage server (or other server)
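The checklist above could also be kept as a small structured record saved next to the digital data, so the link between notebook and files survives even if the paper notebook is incomplete. The sketch below is one possible shape, not a format the lab uses: the class name and field names are hypothetical and simply mirror the bullets.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    """Digital counterpart of a notebook entry (field names are illustrative)."""

    experiment_name: str
    researcher: str
    rationale: str
    dates: list[str]
    data_types: list[str]
    cells: str
    virus: str
    conditions: str
    protocol: str
    instruments: list[str]
    raw_data_path: str
    processed_data_path: str
    processing_files: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, so gaps are caught at recording time."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialize the record for saving alongside the data files."""
        return json.dumps(asdict(self), indent=2)
```

Checking `missing_fields()` before an experiment directory is closed out would flag an entry that, for example, never recorded where the raw data lives or which scripts processed it.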