James Taylor
February 25, 2015

Complexity and Data Diversity panel for BD2K Workshop on Community-Based and Metadata Standards Development

Slides to facilitate panel discussion, February 25, 2015.

Slides contributed by panelists: Philippe Rocca-Serra, Michel Dumontier, Charles Bailey, Olivier Bodenreider

Transcript

  1. Community-Based Data and Metadata Standards: Complexity and Data Diversity

    Moderator: James Taylor (@jxtx)
    Panelists: Philippe Rocca-Serra (@Phil_at_OeRC), Michel Dumontier (@micheldumontier), Charles Bailey, Olivier Bodenreider
  2. Initial Questions

    1. Why do we need data standards? How will we know one is useful a priori?
    2. How do you know you need a standard? Determining requirements
    3. How do you know a standard is working well? How can you evaluate it?
    4. How do you know what standard will work best for any given application?
  3. When do we want standards?

    Proliferation of data types and formats – research data types and standards of varying quality and maturity
    • When does it make sense to create a standard?
    • Standards are always good, but when are they worth the effort?
    • If something is really needed by the community, why do we need to create additional incentives?
    • When do standards help vs. hinder research?
    • How can we predict or evaluate the impact of a standard? What are the appropriate metrics?
    • Where are the success stories?
    • How can we determine what domains / types of data will benefit most from standards efforts?
    • Where should we put our resources?
  4. Initial Q&A: ISA point of view

    • Why do we need standards?
      To enable communication, release and preservation of data
    • How do you know that you need a standard?
      – I/O bottlenecks: platform-specific API + costly licensed tools for data access
      – Disruptive technology with massive but uncontrolled data growth
    • How do you know a standard is working well?
      – Uptake by repositories, software vendors, publishers, open source efforts
      – Reuse / extension to fields other than those initially targeted
      – User support requests
    • How can you evaluate them?
      Ease of use, support, documentation, implementation guides, flexibility, extensibility, maintainability, interoperability with other standards -> promote modularity and factorisation of reusable components
    • How to determine what standard will work best for any given application?
      – Create and curate a registry of standards (BioSharing)
      – Create metrics and evaluation criteria (fitness-for-purpose tests)
      – Neutral assessment by review or standardization bodies (NIST?)
  5. Towards sustainable data sharing and development of community standards

    • Time. W3C HCLS Dataset Description Guideline
      – “a couple of months tops” - Over 2 years now.
      – 51 metadata elements identified, 12+ vocabularies surveyed
        • No single vocabulary to cover all needs.
      – Weekly 1hr teleconference calls, chairing, scribing, input from dozens of participants, mailing lists, issue tracker, document revisions
      -> Dedicated project staff to stay on track over 12-24 months.
    • Effort. Bio2RDF – linked data for the life sciences
      – “it’s low quality” Why? “they’re out of date”
      – All 30 database conversion scripts updated in yearly update.
      – Daily/weekly updates to perpetually changing schemas are nearly impossible.
      -> Add data interoperability into data sharing plans
    @micheldumontier::NIH CBS Workshop:25-02-2015
  6. What stands in the way of standards?

    • Need for common terminologies
    • Intentions of data collector (e.g. clinical vs research vs advocacy)
    • Historical differences in usage
    • Repeatable data characterization
    • Clinical vs research provenance
    • Definition of data quality requirements
    • Domain-specific requirements
    • Pediatrics - growth-related normalization
    • Study-specific granularity - need for hierarchical relationships
  7. Standards divide in the community (Olivier Bodenreider)

    • Content: Multiple disconnected standards
      – Makes it difficult to integrate translational data
      – Requires bridging
    • Prospective harmonization efforts
      – Among bio standards: OBO Foundry
      – Between clinical standards: SNOMED CT-LOINC; SNOMED CT-ICD-11
    • Post hoc mapping efforts (UMLS, BioPortal, GEM, cross-references)
    • Technical: How to best distribute standards?
      – RDF/OWL (Linked Open Data)
      – APIs (integration in software)
    • Social: Listening to the community
      – Use cases; feedback from users
      – Licensing restrictions (e.g., UMLS license agreement)
  8. Initial Questions – Discussion

    1. Why do we need data standards? How will we know one is useful a priori?
    2. How do you know you need a standard? Determining requirements
    3. How do you know a standard is working well? How can you evaluate it?
    4. How do you know what standard will work best for any given application?
  9. Initial Q&A: ISA point of view

    • How well did this work?
      Pretty well, given the acceptance and uptake of an effort which started very small and faced an uphill battle.
    • How effective was it in making data more findable?
      – Structuring information is distinct from indexing, as we have a lot of private users, but this is a first step.
      – The interaction with users always results in creating a data management plan and a curation policy, as discussion often identifies a big gap.
      – “Findable” means 2 things: syntax + vocabulary
      – Curation policies are essential
      – Documentation of coding patterns in the form of implementation guidelines
      – Convincing end users of the long-term value of those patterns
    • Does it actually aid reusability?
      Making data available is the first step, so to that extent ISA-Tab definitely aids reusability.
    • How to assess it?
      We would need to be able to detect dataset citations -> ongoing work
    • Can we improve?
      Certainly, there is always room for improvement in expressivity, tooling, pattern documentation.
  10. Initial Q&A: ISA point of view

    • Does it actually aid reusability?
      Making data available is the first step, so to that extent ISA-Tab definitely aids reusability.
    • How to assess it?
      We would need to be able to detect dataset citations -> ongoing work
    • Can we improve?
      Certainly, there is always room for improvement in expressivity, tooling, documentation of coding patterns.
    • Discuss any other aspect of your perspective and experience that you like in your slide. The goal is to highlight the problems and their diversity, and discuss social, technical, and financial solutions to solving them.
      – Technology geeks versus wet-lab biologists: keep it simple was the winning point for ISA-Tab. Think presentation layer and make it easy for end users.
      – Prospective data management vs retrospective data forensics -> changing the habits / the practice
      – Big problem: sustainability of standards development! Most standards-related work in academia regularly faces the axe, which is a major threat to any standardization effort, since such an effort requires long-term support to establish authoritative status. Furthermore, the goal is to operate in an open, free-to-access framework. Some standards development organizations such as ISO make standard specifications available for a fee or require user registration to access material. This model makes it difficult to ensure diffusion of standards (a single ISO standard document can reach several thousands of USD).
      – Big question: how to properly support standardization activities? Support for the BioSharing effort to establish an umbrella, one-stop shop for funding agencies/developments to come together, avoid duplication of efforts and broker development pathways.