Colloquium M. Musen: Online datasets will never be FAIR without semantic technology: The case for CEDAR

Colloquium M. Musen – Wednesday, May 29, 2019

Online datasets will never be FAIR without semantic technology: The case for CEDAR

Abstract: The past few years have seen a flurry of interest in making online scientific data “FAIR”: findable, accessible, interoperable, and reusable. The problem is that most scientific datasets are not FAIR. When left to their own devices, scientists do an absolutely terrible job creating the metadata that describe the experimental datasets that make their way into online repositories. The lack of standardization makes it extremely difficult for other investigators to search for relevant datasets, to reanalyze them, and to integrate them with other data. There is an urgent need to make it easy for investigators to author metadata that adhere to community standards and that describe datasets in reproducible terms. The Center for Expanded Data Annotation and Retrieval (CEDAR) has the goal of enhancing the authoring of experimental metadata to make online datasets more useful to the scientific community. CEDAR technology includes methods for managing a library of templates for representing metadata and for interoperating with a repository of standard scientific ontologies that normalize the way in which the templates may be filled out. CEDAR uses a repository of previously authored metadata from which it learns rules that drive predictive data entry, making it easier for metadata authors to perform their work. Ongoing collaborations with several major research consortia are allowing us to explore how CEDAR may ease access to scientific datasets stored in online repositories and enhance the reuse of the data to drive new discoveries.

A short biography: Dr. Musen is Professor of Biomedical Informatics at Stanford University, where he is Director of the Stanford Center for Biomedical Informatics Research. Dr. Musen conducts research related to intelligent systems, reusable ontologies, metadata for publication of scientific datasets, and biomedical decision support. His group developed Protégé, the world’s most widely used technology for building and managing terminologies and ontologies. He is principal investigator of the National Center for Biomedical Ontology, one of the original National Centers for Biomedical Computing created by the U.S. National Institutes of Health (NIH). He is principal investigator of the Center for Expanded Data Annotation and Retrieval (CEDAR). CEDAR is a center of excellence supported by the NIH Big Data to Knowledge Initiative, with the goal of developing new technology to ease the authoring and management of biomedical experimental metadata. Dr. Musen chaired the Health Informatics and Modeling Topic Advisory Group for the World Health Organization’s revision of the International Classification of Diseases (ICD-11), and he currently directs the WHO Collaborating Center for Classification, Terminology, and Standards at Stanford University.