Complexity and Data Diversity panel for BD2K Workshop on Community-Based Data and Metadata Standards Development

Community-Based Data and Metadata Standards: Complexity and Data Diversity
Moderator: James Taylor (@jxtx)
Panelists:
• Philippe Rocca-Serra (@Phil_at_OeRC)
• Michel Dumontier (@micheldumontier)
• Charles Bailey
• Olivier Bodenreider

Initial Questions
1. Why do we need data standards? How will we know one is useful a priori?
2. How do you know you need a standard? Determining requirements
3. How do you know a standard is working well? How can you evaluate them?
4. How do you know what standard will work best for any given application?

When do we want standards?
• Proliferation of data types and formats – research data types and standards of varying quality and maturity
• When does it make sense to create a standard?
• Standards are always good, but when are they worth the effort?
• If something is really needed by the community, why do we need to create additional incentives?
• When do standards help vs. hinder research?
• How can we predict or evaluate the impact of a standard? What are the appropriate metrics? Where are the success stories?
• How can we determine what domains / types of data will benefit most from standards efforts? Where should we put our resources?

Philippe Rocca-Serra

Initial Q&A: ISA point of view
• Why do we need standards?
  – To enable communication, release and preservation of data
• How do you know that you need a standard?
  – I/O bottlenecks: platform-specific API + costly licensed tools for data access
  – Disruptive technology with massive but uncontrolled data growth
• How do you know a standard is working well?
  – Uptake by repositories, software vendors, publishers, open source efforts
  – Reuse / extension to fields other than those initially targeted
  – User support requests
• How can you evaluate them?
  – Ease of use, support, documentation, implementation guides, flexibility, extensibility, maintainability, interoperability with other standards
  -> promote modularity and factorisation of reusable components
• How to determine what standard will work best for any given application?
  – Create and curate a registry of standards (BioSharing)
  – Create metrics and evaluation criteria (fitness-for-purpose tests)
  – Neutral assessment by review or standardization bodies (NIST?)

Michel Dumontier

Towards sustainable data sharing and development of community standards
• Time. W3C HCLS Dataset Description Guideline
  – “a couple of months tops”: over 2 years now.
  – 51 metadata elements identified, 12+ vocabularies surveyed; no single vocabulary covers all needs.
  – Weekly 1hr teleconference calls, chairing, scribing, input from dozens of participants, mailing lists, issue tracker, document revisions
  -> Dedicated project staff needed to stay on track over 12-24 months.
• Effort. Bio2RDF – linked data for the life sciences
  – “it’s low quality” Why? “they’re out of date”
  – All 30 database conversion scripts are updated in a yearly update.
  – Daily/weekly updates to perpetually changing schemas are near impossible.
  -> Add data interoperability into data sharing plans
@micheldumontier::NIH CBS Workshop:25-02-2015

Charles Bailey

What stands in the way of standards?
• Need for common terminologies
• Intentions of data collector (e.g. clinical vs research vs advocacy)
• Historical differences in usage
• Repeatable data characterization
• Clinical vs research provenance
• Definition of data quality requirements
• Domain-specific requirements
• Pediatrics - growth-related normalization
• Study-specific granularity - need for hierarchical relationships

Olivier Bodenreider

Standards divide in the community (Olivier Bodenreider)
• Content: Multiple disconnected standards
  – Makes it difficult to integrate translational data
  – Requires bridging
• Prospective harmonization efforts
  – Among bio standards: OBO Foundry
  – Between clinical standards: SNOMED CT-LOINC; SNOMED CT-ICD-11
• Post hoc mapping efforts (UMLS, BioPortal, GEM, cross-references)
• Technical: How to best distribute standards?
  – RDF/OWL (Linked Open Data)
  – APIs (integration in software)
• Social: Listening to the community
  – Use cases; feedback from users
  – Licensing restrictions (e.g., UMLS license agreement)

Initial Questions – Discussion
1. Why do we need data standards? How will we know one is useful a priori?
2. How do you know you need a standard? Determining requirements
3. How do you know a standard is working well? How can you evaluate them?
4. How do you know what standard will work best for any given application?

Initial Q&A: ISA point of view
• How well did this work?
  – Pretty well, given the acceptance and uptake for an effort that started very small and faced an uphill battle.
• How effective was it in making data more findable?
  – Structuring information is distinct from indexing, as we have a lot of private users, but it is a first step.
  – Interaction with users always results in creating a data management plan and a curation policy, as discussions often identify a big gap.
  – “Findable” means 2 things: syntax + vocabulary.
  – Curation policies are essential.
  – Documenting coding patterns in the form of implementation guidelines.
  – Convincing end users of the value of those patterns for the long term.

Initial Q&A: ISA point of view
• Does it actually aid reusability?
  – Making data available is the first step, so to that extent ISA-Tab definitely aids reusability.
• How to assess it?
  – We would need to be able to detect dataset citations -> ongoing work
• Can we improve?
  – Certainly, there is always room for improvement in expressivity, tooling, documentation of coding patterns.
• Discuss any other aspect of your perspective and experience that you like in your slide. The goal is to highlight the problems and their diversity, and discuss social, technical, and financial solutions to solving them.
  – Technology geeks versus wet-lab biologists: "keep it simple" was the winning point for ISA-Tab. Think presentation layer and make it easy for end users.
  – Prospective data management vs retrospective data forensics -> changing the habits / the practice
  – Big problem: sustainability of standards development! Most standards-related work in academia regularly faces the axe, which is a major threat to any standardization effort, since such efforts require long-term support to establish authoritative status.
  – Furthermore, the goal is to operate in an open, free-to-access framework. Some standards development organizations, such as ISO, make standards specifications available for a fee or require user registration to access material. This model makes it difficult to ensure diffusion of standards (a single ISO standard document can cost several thousand USD).
  – Big question: how to properly support standardization activities? Support for the BioSharing effort to establish an umbrella, one-stop shop for funding agencies/developments to come together, avoid duplication of efforts and broker development pathways.
