Machine-learning based automatic
assignment of Semantic Types to biomedical concepts
Supervisor: Clement Jonquet (LIRMM, University of Montpellier) –
jonquet@lirmm.fr
Profile: Computer science or (bio)informatics master's students
Context: Project SIFR (Semantic Indexing of French
Biomedical Data Resources).
Collaboration
with LGI2P (EMA): Andon Tchechmedjiev
Where: University of
Montpellier,
Laboratory of Informatics, Robotics, & Microelectronics of Montpellier (LIRMM)
When: 2nd
semester 2018-2019
Keywords: Semantic Web, biomedical ontologies,
knowledge representation, machine learning, classification.
Technologies: BioPortal, UMLS Metathesaurus
and Semantic Network, Semantic Web technologies (RDF, OWL, SKOS), machine
learning or classification frameworks (TensorFlow, Weka, etc.).
A key aspect in addressing semantic
interoperability for life sciences is the use of terminologies and ontologies
as a common denominator to structure biomedical data and make them interoperable.
Ontologies formalize the knowledge of a domain by means of concepts, relations
and rules that apply to that domain [1]. Stanford University has invested
considerable effort in developing terminology- and ontology-based tools and services to
assist health professionals and other users in searching for electronic
information available on the Web and in using ontologies. The group has developed
a Web-based portal, the NCBO BioPortal
(http://bioportal.bioontology.org) [2], which offers a variety of services
to search and index biomedical data as well as to search, explore, annotate
and visualize the available standard ontologies. In parallel, we are developing the
SIFR BioPortal (http://bioportal.lirmm.fr),
a similar resource dedicated to French [3]. These two platforms exploit
terminologies extracted from the UMLS Metathesaurus [4] and make use of the Semantic Types [5] that are assigned to the concepts
of these terminologies. This typing, done manually by experts, makes it possible to
select, across all UMLS resources, only concepts of certain types (virus, tissue,
chemical, etc.). The Semantic Network offers 133 Semantic Types, which
have also been grouped into 15 Semantic Groups (anatomy, objects,
procedures, etc.) [6]. However, this "coarse"
typing is only available for UMLS resources [7].
The internship aims to develop an automatic
classification component that assigns one or more Semantic Types to the concepts
of any biomedical terminology or ontology. To do this, we will adopt a supervised machine
learning approach that uses already typed resources as training and test
corpora. We will identify the features (e.g., labels, label patterns, source
ontology, hierarchy, etc.) that help classify new concepts, starting
by assigning Semantic Groups, then Semantic Types. We will first focus on French
resources (in the SIFR BioPortal) and then generalize on a larger scale to
resources in English (or other languages, in the NCBO BioPortal). The internship
will result in a web application prototype that will eventually be incorporated
into the NCBO technology.
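To give a rough idea of what label-based supervised classification could look like, here is a minimal sketch in Python. It is only an illustration, not the method the project will necessarily adopt: the choice of character trigrams as features, the naive Bayes model, the class name NaiveBayesGroupClassifier and the six training labels are all assumptions made for this example. Real training data would come from UMLS-typed terminologies.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(label, n=3):
    """Return the character n-grams of a lowercased, space-padded concept label."""
    text = f" {label.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesGroupClassifier:
    """Multinomial naive Bayes with add-one smoothing over label n-grams (illustrative only)."""

    def fit(self, labels, groups):
        self.group_counts = Counter(groups)
        self.ngram_counts = defaultdict(Counter)
        for label, group in zip(labels, groups):
            self.ngram_counts[group].update(char_ngrams(label))
        # Vocabulary of all n-grams seen in training, used for smoothing.
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def predict(self, label):
        total = sum(self.group_counts.values())
        best_group, best_score = None, -math.inf
        for group, count in self.group_counts.items():
            counts = self.ngram_counts[group]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior of the group
            for gram in char_ngrams(label):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_group, best_score = group, score
        return best_group


# Hypothetical toy training set: (French concept label, Semantic Group name).
TRAINING = [
    ("hépatite virale", "Disorders"),
    ("gastrite chronique", "Disorders"),
    ("bronchite aiguë", "Disorders"),
    ("appendicectomie", "Procedures"),
    ("hystérectomie", "Procedures"),
    ("gastrectomie partielle", "Procedures"),
]

model = NaiveBayesGroupClassifier().fit(
    [label for label, _ in TRAINING],
    [group for _, group in TRAINING],
)
print(model.predict("colectomie"))
print(model.predict("pancréatite"))
```

Character n-grams are used here because suffixes such as "-ite" or "-ectomie" often hint at the kind of concept. Other features mentioned above (source ontology, position in the hierarchy) would need a richer representation.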
The intern's tasks will be to:
- Review the papers describing the context
of the project
- Select a machine learning framework and
identify the classification features
- Extract the relevant training/test data from
UMLS
- Implement a methodology to automatically
classify concepts from a new ontology (Semantic Groups first, then Semantic
Types)
- Evaluate the results using cross validation
- Enrich the existing ontologies and
terminologies in the SIFR BioPortal and involve their developers
for validation
- Write a publication about the project and its
outcomes
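As an illustration of the evaluation step, the sketch below shows a plain k-fold cross-validation loop in Python. The function names (k_fold_indices, cross_validate, majority_baseline) and the toy data are hypothetical and chosen for this example only. A real evaluation would probably shuffle or stratify the data first and rely on the chosen machine learning framework's own utilities.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds whose sizes differ by at most one.

    Assumes 1 <= k <= n_items so that no fold is empty.
    """
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(train_fn, items, targets, k=5):
    """Return the accuracy on each held-out fold.

    train_fn(items, targets) must return a callable predict(item).
    """
    accuracies = []
    for fold in k_fold_indices(len(items), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(items) if i not in held_out]
        train_y = [y for i, y in enumerate(targets) if i not in held_out]
        predict = train_fn(train_x, train_y)
        correct = sum(predict(items[i]) == targets[i] for i in fold)
        accuracies.append(correct / len(fold))
    return accuracies


def majority_baseline(items, targets):
    """Baseline that always predicts the most frequent class seen in training."""
    most_common = Counter(targets).most_common(1)[0][0]
    return lambda item: most_common


# Hypothetical toy data: six concept identifiers with their Semantic Group.
concepts = ["c1", "c2", "c3", "c4", "c5", "c6"]
groups = ["Disorders"] * 5 + ["Procedures"]
print(cross_validate(majority_baseline, concepts, groups, k=3))
```

Comparing any trained classifier against such a majority-class baseline helps show whether the chosen features carry real signal, since Semantic Groups are likely to be unevenly represented.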
1. Gruber, T.R.: A translation approach to portable ontologies. Knowl. Acquis. 5, 199–220 (1993).
2. Noy, N.F., Shah, N.H., Whetzel, P.L., Dai, B., Dorf, M., Griffith, N.B., Jonquet,
C., Rubin, D.L., Storey, M.-A., Chute, C.G., Musen, M.A.: BioPortal: ontologies
and integrated data resources at the click of a mouse. Nucleic Acids Res. 37, 170–173 (2009).
3. Jonquet, C., Annane, A., Bouarech, K.,
Emonet, V., Melzi, S.: SIFR BioPortal : Un portail ouvert et générique
d’ontologies et de terminologies biomédicales françaises au service de
l’annotation sémantique. In: 16th Journées Francophones d’Informatique
Médicale, JFIM’16, p. 16. Genève, Suisse (2016).
4. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical
terminology. Nucleic Acids Res. 32, 267–270 (2004).
5. McCray, A.T.: An Upper-Level Ontology for the Biomedical Domain. Comp. Funct. Genomics
4, 80–84 (2003).
6. McCray, A.T., Burgun, A., Bodenreider, O.: Aggregating UMLS semantic types for reducing
conceptual complexity. Stud. Health Technol. Inform. 84, 216 (2001).
7. Tchechmedjiev, A., Jonquet, C.: Enrichment of French Biomedical Ontologies with UMLS Concepts
and Semantic Types for Biomedical Named Entity Recognition Though Ontological
Semantic Annotation. In: Workshop on Language, Ontology, Terminology and
Knowledge Structures, LOTKS’17. Montpellier, France (2017).
UMLS Semantic Network: https://semanticnetwork.nlm.nih.gov/
Current Semantic Types: https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
- Computer science or (bio)informatics
master's degree students.
- Experience with machine learning tools and motivation to learn more.
- Experience with Semantic Web technologies will be appreciated but is not mandatory.
- Good oral and written English skills. Good
knowledge of French or motivation to learn it is desirable.
- Excellent writing skills, as reports,
documentation, and technical notes will always be necessary.
- Autonomy and initiative: ability to make technical
decisions within the project and justify choices.
- Friendly person to join a small research team
in Montpellier.
For more information about this position, please contact Clement Jonquet
(jonquet@lirmm.fr)
and Andon Tchechmedjiev (andon.tchechmedjiev@mines-ales.fr).
To apply, please send an email including links to the following (please NO ATTACHED
DOCUMENTS):
- a motivation letter explaining YOUR interest in the internship;
- a curriculum vitae describing your experience and how it matches the expected
profile;
- names and contact details of referees.
Dates are flexible over the 2018-2019 academic year.