Research data management (RDM): Convincing a lab group they really need to manage their data

Research Data Services
August 26, 2016

Presentation by RDS practicum student, Allison Langham. August 25, 2016.


Transcript

  1. Research data management (RDM): Convincing a lab group that they really need to manage their data
    Allison Langham | School of Library & Information Studies | UW-Madison

  2. Research data management (RDM): Convincing a lab group that they really need to manage their data
    August 25, 2016
  3. Outline for today
    • What the project was
    • Why RDM is important
    • How a research group currently manages their data
    • What the group can do to improve their practices
    • What comes next
  4. Exploring RDM practices in a campus lab
    • The PI from a campus research lab contacted RDS and SLIS about managing their data
    • The group ran out of room on available servers
      – Typical experiments generate 100-200 GB of raw data
      – Created an idealized workflow to justify purchasing a new data storage server (55 TB), but
      – In the process, they realized they were not sure what or where their old data was
    • Summer practicum project goals
      – Understand the lab’s current practices
      – Help the group establish better practices for the future
  5. None
  6. Why researchers should care about RDM
    • You don’t want your published articles to be retracted
    • You don’t want to lose your funding for non-compliance with policies
    • You don’t want to waste time finding your old data
    • You want to be able to share data with your groupmates (including future group members)
  7. You really don’t want one, but bad RDM could cause one
    RETRACTIONS
  8. Journal article retractions are not just for fraud
    • Retraction Watch (The Center for Scientific Integrity) estimates 500 to 600 retractions per year
    • Some are clear fraud
      – Andrew Wakefield
      – Hwang Woo Suk
    But papers can also be retracted if researchers cannot provide the raw data in response to questions!
  9. Retractions for missing data
    • Angiotensin II-inducible smooth muscle cell apoptosis involves the angiotensin II type 2 receptor, GATA-6 activation, and FasL-Fas engagement
      – As we reported in December, UNSW cleared Levon Khachigian of misconduct, concluding that his previous issues stemmed from “genuine error or honest oversight.” Now, Circulation Research is retracting one of his papers after an investigation commissioned by UNSW was unable to find electronic records for two similar images from a 2009 paper, nor records of the images in original lab books.
        http://retractionwatch.com/2016/02/05/investigation-prompts-5th-retraction-for-cancer-researcher-for-unresolvable-concerns/#more-36568
    • Docosahexaenoic acid in combination with celecoxib modulates HSP70 and p53 proteins in prostate cancer cells
      – A 2006 paper investigating the effects of docosahexaenoic acid (DHA) and celecoxib on prostate cancer cells has been retracted because it appears to contain panels that were duplicated, and the authors could not provide the raw data to show otherwise.
        http://retractionwatch.com/2016/02/23/we-are-living-in-hell-authors-retract-2nd-paper-due-to-missing-raw-data/#more-36865
    • Low sodium versus normal sodium diets in systolic heart failure: systematic review and meta-analysis
      – The Committee considered that without sight of the raw data on which the two papers containing the duplicate data were based, their reliability could not be substantiated. Following inquiries, it turns out that the raw data are no longer available having been lost as a result of computer failure. Under the circumstances, it was the Committee’s recommendation that the Heart meta-analysis should be retracted on the ground that the reliability of the data on which it is based cannot be substantiated
        http://retractionwatch.com/2013/05/02/heart-pulls-sodium-meta-analysis-over-duplicated-and-now-missing-data/#more-13986
  10. Several policies require you to manage your data
    POLICIES

  11. UW-Madison Policy on Data Stewardship, Access, and Retention
    • 4.0 Policy: UW-Madison must retain research data in sufficient detail and for an adequate period of time to enable appropriate responses to questions about accuracy, authenticity, primacy and compliance with laws and regulations governing the conduct of the research. It is the responsibility of the Principal Investigator to determine what needs to be retained under this policy.
  12. UW-Madison Policy on Data Stewardship, Access, and Retention
    • 4.2 Stewardship and Retention: Principal Investigators should adopt an orderly system of Data organization, access, and retention and should communicate the chosen system to all members of a research group and to the appropriate administrative personnel, where applicable. Particularly for long-term research projects, PIs should establish and maintain procedures for the protection of essential records in the event of a natural disaster or other emergency. Research Data must be archived for a minimum of seven years after the final project close-out, with original Data retained wherever possible.
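The seven-year minimum quoted above reduces to simple date arithmetic. The following is a minimal sketch, assuming the clock starts on the final project close-out date as the policy text states; the function name `earliest_disposal_date` and the leap-day handling are illustrative choices, not part of the policy, and other funders or agreements may impose longer periods.

```python
from datetime import date

# Minimum retention period stated in the policy text quoted on this slide.
RETENTION_YEARS = 7


def earliest_disposal_date(closeout: date, years: int = RETENTION_YEARS) -> date:
    """Return the first date on which the minimum retention period has elapsed.

    Hypothetical helper: the policy sets a minimum, not a disposal schedule.
    """
    try:
        return closeout.replace(year=closeout.year + years)
    except ValueError:
        # Close-out fell on Feb 29 and the target year is not a leap year;
        # roll forward to Mar 1 so the full period is covered.
        return closeout.replace(year=closeout.year + years, month=3, day=1)


if __name__ == "__main__":
    print(earliest_disposal_date(date(2016, 8, 26)))
```

Because the policy also asks that original Data be retained wherever possible, a date like this is best read as a floor for planning storage, not as a deletion trigger.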
  13. UW-Madison Policy on Data Stewardship, Access, and Retention
    • 5.0 Roles and Responsibilities: The Principal Investigator is responsible for the stewardship and retention of research Data as well as for determinations concerning access to and appropriate use of Data. Other Research Contributors are responsible to cooperate with the PI in carrying out the requirements of this policy.
  14. White House Office of Science and Technology Policy
    • In February 2013, OSTP released a memo on “Increasing Access to the Results of Federally Funded Scientific Research”
      – A formal statement on the importance of sharing data obtained from federally funded research
      – Requires agencies that invest in research and development to have clear and coordinated data access policies, including policies related to data management
      – Requires granting agencies to evaluate the merits of proposed data management plans and ensure that researchers comply with these plans
      – Also requires that agencies allow proposals to include costs for data management and access
    • To date, no coordinated policy has been released
  15. NIH Data Sharing Policy and Implementation
    • Grantees should note that, under the NIH Grants Policy Statement, they are required to keep the data for 3 years following closeout of a grant or contract agreement...Thus, the grantee institution may have additional policies and procedures regarding the custody, distribution, and required retention period for data produced under research awards.
  16. NIH Research Integrity Requirements for Making a Finding of Research Misconduct
    • The Regulation imposes a 6-year time limitation for occurrences of research misconduct to be brought to the attention of an institution or the Department of Health and Human Services (HHS) (see § 93.105)
  17. NSF Dissemination and Sharing of Research Results
    • All researchers are expected to be able to explain and defend their results. Doing so usually entails maintaining complete records of how data were collected. The manner in which one maintains such records and makes them available to others will vary from project to project. What constitutes reasonable procedures will be determined by the community of interest through the process of peer review and program management. These standards are likely to evolve as new technologies and resources become available. (http://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp)
  18. A sadly true satire
    Now, as to my actual data management plan, here is how I plan to deal with research data in the future: I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
    C. Titus Brown (2010). http://ivory.idyll.org/blog/data-management.html
  19. CURRENT RDM PRACTICES IN A CAMPUS RESEARCH LAB

  20. Investigating current practices
    • Searched for data from six papers published between 2013 and 2016
  21. Found and not (yet) found
    Columns: Paper, Found, Not found (one row per paper)
    • 2013: Figures 2, 3, 5; Table 1 Figure 4 plots S1
    • 2013: Figures 2, 3, 4B, 6, and 7 Figures 4A, 5 (raw or analyzed)
    • 2015: Figures 3 through 6 Figure 2 (various) Figure 7 (Bliss model) Most raw images BEC data (S1, Figure 2) β-actin data (S2)
    • 2016: Possible raw image data Possibly related code All
    • 2016: SigmaPlot versions of Figures 2, 3, 4, 6, 8, and 11 Raw data (confirmation) Figures 4, 9, 10, 12, 13 Tables 1, 2, and 3
    • 2016: All analyzed data Raw images for Figure 2 Part A
  22. What worked well
    • Dates in lab notebooks and digital names
      – And they are logical and correspond to each other
    • Ties between paper lab notebook and digital files
      – Print-outs pasted into lab notebook
    • Descriptive file and directory names that provided information about the data they contain
    • Logical storage locations
    • Scanned copies of lab notebooks are easier to search through
      – And provide a backup in case of disaster
  23. Example: Kinetics of antiviral state development cells following treatment by interferons
  24. Researcher directory: \_Data Backup\Past Group Members\EAV Lab notebooks Data

  25. Scanned lab notebook Dated 11/18/13

  26. Digital data organized by date of experiment Dated 11/14/13

  27. Images and spreadsheets
    This file has the analyzed data
    This directory has the images
  28. Analyzed data file

  29. Images
    α, γ, λ1, λ2, λ3, CM

  30. What’s missing
    • Which rows are which IFN?
    • How were the images translated to numerical values?
    No labels, no background images, no codes
  31. What did not work well
    • Data stored in various directories and/or on various servers with no clear links between them
    • Data located on lab computers, not one of the servers (i.e., not backed up)
    • Copies of figures in PowerPoint slides but the underlying data is not linked
    • Lack of dates in file and directory names
    • Lack of link between lab notebook and digital data
    • Non-descriptive file and directory names
    • Incomplete lab notebooks (e.g., a notebook indicates an experiment was rerun in response to reviewer comments on October 20, 2013 but there are no entries after this date)
    • Analysis files (codes, output) not stored on one of the shared servers or not with the data
    • Codes not documented sufficiently
    • In Excel files, much of the data was on sheets named “Sheet1” and the columns were not labeled
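Two of the problems listed above, missing dates and non-descriptive names, can be spotted automatically. The sketch below is one possible audit, not the method used in the project: the function `audit_names`, the ISO-style date pattern (`YYYY-MM-DD` or `YYYYMMDD`), and the list of "generic" names are all assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Assumed convention: a date written as YYYY-MM-DD or YYYYMMDD somewhere in the path.
DATE_PATTERN = re.compile(r"(?<!\d)\d{4}-?\d{2}-?\d{2}(?!\d)")

# Hypothetical examples of names that say nothing about their content.
GENERIC_STEMS = {"untitled", "new folder", "data", "book1", "sheet1", "document1"}


def audit_names(root):
    """Return (relative_path, problem) pairs for files under root."""
    root = Path(root)
    findings = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel_text = path.relative_to(root).as_posix()
        # A date anywhere in the relative path (file or parent directory) counts.
        if not DATE_PATTERN.search(rel_text):
            findings.append((rel_text, "no date in file or directory name"))
        if path.stem.lower() in GENERIC_STEMS:
            findings.append((rel_text, "non-descriptive file name"))
    return findings


if __name__ == "__main__":
    for rel_path, problem in audit_names("."):
        print(f"{rel_path}: {problem}")
```

A report like this only flags candidates for review. It cannot tell whether a dated, descriptive file is actually linked to a lab notebook entry, which still has to be checked by hand.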
  32. LAB NOTEBOOKS

  33. None
  34. None
  35. None
  36. None
  37. None
  38. None
  39. DIGITAL DATA

  40. No papers from 2016

  41. None
  42. None
  43. None
  44. 7,012 TIFs

  45. Submitting raw data to a journal is not enough
    • Some journals (e.g., PLoS, Nature) are beginning to require publishing of raw data along with the paper
    • Submitting the raw data to the journal is not sufficient data management
      – Data files do not include enough documentation
      – Without the code used to analyze the data, the authors will not be able to respond to questions about their work
    • Don’t be like Sam: https://youtu.be/N2zK3sAtr-4
  46. IMPROVING THE LAB’S RDM PRACTICES

  47. Case studies and resources
    • Carlson, J., & Johnston, L. R. (Eds.). (2015). Data information literacy: Librarians, data, and the education of a new generation of researchers. West Lafayette, IN: Purdue University Press.
    • Akmon, D., Zimmerman, A., Daniels, M., & Hedstrom, M. (2011). The application of archival concepts to a data-intensive environment: Working with scientists to understand data management and preservation needs. Archival Science, 11(3), 329-348.
    • Briney, K. (2015). Data management for researchers: Organize, maintain and share your data for research success. Exeter, UK: Pelagic Publishing.
  48. Standard Operating Procedures (SOPs)
    • Purpose is to standardize how data is stored
    • Four drafts
      – Legal Obligations
      – Naming Conventions
      – Data to Record
      – Preparing Publications
    • Corresponding form: PublicationForm.xlsx
  49. Summary of SOPs
    • You must manage your data, both for yourself and for the PI
    • You must name the files in a way that current and future group members will understand
    • You must save all relevant data in consistent, logical locations
    • You must identify exactly what data is used in each publication
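The naming requirement above is easier to follow when file names are generated and not typed by hand. The slides do not give the lab's actual convention, so the sketch below assumes a hypothetical one: `date_INITIALS_experiment_datatype.ext`, with an ISO date first so that files sort chronologically. The helper names `slug` and `build_name` are illustrative.

```python
import re
from datetime import date


def slug(text):
    """Lowercase the text and collapse every run of non-alphanumerics to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_name(experiment_date, initials, experiment, data_type, ext):
    """Build a file name under the assumed (hypothetical) convention."""
    parts = [
        experiment_date.isoformat(),  # e.g. 2013-11-14
        initials.upper(),             # who ran the experiment
        slug(experiment),             # what the experiment was
        slug(data_type),              # what kind of data the file holds
    ]
    return "_".join(parts) + "." + ext.lower().lstrip(".")


if __name__ == "__main__":
    print(build_name(date(2013, 11, 14), "ab", "IFN kinetics", "Plaque Counts", ".XLSX"))
```

Whatever convention a group settles on, the useful property is that a future group member can read the date, the person, and the content from the name alone.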
  50. Next steps
    • Pledge to follow good RDM practices
    • Try the SOPs
    • Revise and expand the SOPs
    • Refresh good lab notebook practices
    • Consider move to an electronic lab notebook
      – Searchable! Backed-up! Can upload files from servers (or not)!
  51. Why should librarians care?
    • RDM is required for all scientists doing publicly funded and/or published research at the UW
      – Librarians can remind researchers of these obligations
    • Researchers can be really bad at RDM
      – Librarians can teach very basic skills
    • Start small and build from there
      – Librarians can help a lab get started
  52. End

  53. (More) Retractions for missing data
    • Reproducible subcutaneous transplantation of cell sheets into recipient mice
      – “After learning of concerns that two figures are “very similar” and “some of the error bars look unevenly positioned,” the rest of the authors were unable to locate the raw data, according to the note.”
        http://retractionwatch.com/2016/02/26/stap-stem-cell-researcher-obokata-loses-another-paper/#more-37272
    • Experimental evidence that maternal corticosterone controls adaptive offspring sex ratios
      – “But after questions about the data were raised, the authors were unable to address the “mismatch” between the experimental data and those that were published.”
        http://retractionwatch.com/2015/07/23/data-mismatch-and-authors-illness-pluck-finch-study-from-literature/#more-29839
    • Eleven papers by one author
      – “Following an investigation by Nanyang Technological University, primary data are no longer available to be authenticated and we have been informed that there are serious concerns about the ethical environment in which the data were collected.”
        http://retractionwatch.com/2016/06/14/journal-to-retract-all-yes-all-articles-by-education-researcher-after-investigation/#more-40947
    • Three “expressions of concern from two journals” for one author (with seven other retractions)
      – “In the past, Walumbwa has said he only keeps data until his papers are published, but a lack of raw data has become a common theme in his notices, which now also include four corrections, and one other EOC (making a new total of four). There are no standard rules about how long to store raw data, but one journal that issued two of the new EOCs has since updated its submission policy to require that authors keep data for at least five years.”
        http://retractionwatch.com/2016/04/01/concerns-attached-to-three-more-papers-by-retraction-laden-management-researcher/#more-36700
  54. CASE STUDIES

  55. Akmon, 2011
    • Interviews about RDM in a materials science lab
    • PI recognized that the group’s data management practices were poor, but did not feel that she had the expertise to put in place and enforce standards
    • RDM is generally worse in “little science” labs than “big science” labs (Borgman, Wallis & Enyedy, 2007)
    • In the absence of formal procedures, students will create their own data management and documentation systems.
    • Inconsistent data management and documentation systems prevent data from being used by other researchers in the group.
    • Descriptive file names are essential for allowing researchers to understand their own data in the future as well as to share data.
  56. Westra and Walton, 2015
    • Interviews and training for an ecology lab
    • Developed a “one-shot” training session that addressed good lab notebook practices, file naming, data structure, sharing data (e.g., through repositories in this field), metadata, and data ownership and preservation
    • Presented policies and guidelines and asked the students to reflect on how their own practices aligned, using materials from DataONE
      – Explaining the policies underlying why data management is important. RDM practices must be aligned with research workflow and publication practices
      – Training of this type works best with faculty who have “bought into” the concept of research data management
  57. Johnston and Jeffryes, 2015
    • Online training for civil engineering graduate students
    • Created an online course to help students understand and track data quality in published research (not for credit)
    • Assignments for each module helped students develop a data management plan for their research
    • The first step in developing a data management plan is to inventory the types of data and data storage options
    • Files need to have descriptive names that indicate content and need to be consistent across team members so that data can be used when one teammate leaves
    • Directory structures should be predictable and have shared terminology across the team
    • There needs to be documentation for how data moved from raw to processed states
    • The content of data files should identify the creator and date(s) generated, and data should be clearly labeled (e.g., axis labels on charts, column headings on spreadsheets).
  58. Bracke and Fosmire, 2015
    • Training for an agricultural and biological engineering lab
    • A series of three workshops with homework assignments between
      – Presented basic data information literacy skills and had students discuss the procedures their faculty advisor had developed
      – Students revised procedures so they fit better with the experimental workflows
      – Students explored metadata standards by searching for existing data that could be of use in their research
      – Students analyzed their own data and described them using metadata standards from an appropriate online repository
    • Standards need to be detailed and unambiguous
    • Standards need to be realistic with respect to the laboratory’s research methods
    • Standards must be developed collaboratively so that students will follow them
  59. GOOD LAB NOTEBOOK PRACTICES REFRESHER

  60. General lab notebook rules
    • Label spine and cover of notebook with your name and a number indicating the sequential order of notebooks over time
    • Label each page with the date
    • Do not skip pages
      – If you accidentally skip a page, mark an “X” over the page in pen
    • Record all data
      – If errors are made, cross entry out with a single line and note reason
      – Do not remove pages or use whiteout to cover entries
    • Tape or paste all external materials to the page
      – Do NOT insert notes on separate pieces of paper (e.g., notes recorded on paper towels) without pasting them to the page
    • Follow naming conventions for experiments
    • If you need to annotate a previous entry, use a different color of ink to mark the change and initial and date the revision
  61. What to record in the lab notebook
    • Project or experiment name (follow naming conventions)
    • Researcher name/initials
    • Rationale behind experiment (i.e., why you are doing the experiment)
    • Date(s) of experiment
    • Type(s) of data collected (e.g., microscopy images, plaque counts)
    • Cell type/line(s) and provenance (e.g., source, passage)
    • Virus type(s) and provenance (e.g., source)
    • Conditions (e.g., temperatures, pressures, growth media, dilutions and stock sources)
    • Protocols and methods
      – This can be a reference to a standard protocol with all deviations, planned or accidental, recorded
      – This also includes names of the files used in processing the data (e.g., JEX scripts used for processing microscope images)
    • Instruments used
    • Results, which may include
      – Hand-recorded data
      – Processed data, if of reasonable size to paste into the notebook
    • Location (directory path) of raw data on Raw Data Storage server
    • Location (directory path) of processed data and results on the Document Storage server (or other server)
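The checklist above could also be kept as a small machine-readable record stored next to the digital data, so that the link between notebook and files survives even if one of them is misplaced. This is a sketch under assumptions: the class `ExperimentRecord`, its field names, and the helper `missing_fields` are hypothetical and simply mirror the bullet points on this slide.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ExperimentRecord:
    """Hypothetical sidecar record mirroring the notebook checklist."""
    experiment_name: str
    researcher: str
    rationale: str
    dates: list
    data_types: list
    cells: str = ""                 # cell type/line(s) and provenance
    virus: str = ""                 # virus type(s) and provenance
    conditions: str = ""
    protocol: str = ""              # reference to protocol plus deviations
    processing_files: list = field(default_factory=list)
    instruments: list = field(default_factory=list)
    raw_data_path: str = ""         # directory path on the raw data server
    processed_data_path: str = ""   # directory path for processed data


def missing_fields(record):
    """List the checklist items that are still empty."""
    return [name for name, value in asdict(record).items() if value in ("", [], None)]


if __name__ == "__main__":
    record = ExperimentRecord(
        experiment_name="example-experiment",
        researcher="AB",
        rationale="Illustrative entry only",
        dates=["2013-11-14"],
        data_types=["microscopy images"],
    )
    print(json.dumps(asdict(record), indent=2))
    print("Still to fill in:", missing_fields(record))
```

Saving the JSON output in the same directory as the raw data gives a future reader the same information the notebook page holds, without replacing the notebook itself.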