Marie-Curie Action

Marie-Curie funded PhD position in Computational Pan-genomics: Algorithms & Applications.

Computational Pan-genomics

Job description

We are looking for a highly motivated and talented PhD student with a strong interest in algorithms, data structures, and applications in genomics and bioinformatics. The PhD student will join a CNRS computer science and bioinformatics group within the framework of the “Algorithms for PAngenome” (ALPACA) network, funded by the European Commission through the Horizon 2020 Marie Sklodowska-Curie ITN Programme.

We offer a 36-month fellowship (two 18-month contracts), full social security, an interdisciplinary team in a renowned institute, the high-level scientific environment of Montpellier with many opportunities for collaboration, and integration in the Marie-Curie network ALPACA for training and networking.

Start: September or October 2021.

Requirements:

  • A Master's degree in Computer Science, Mathematics, or Bioinformatics, with a specialization in algorithms or combinatorial optimization.
  • Strong C++ programming skills and know-how.

Preferred qualifications and skills

  • Research talent demonstrated in a Master's degree project.
  • An excellent command of English.
  • Good academic writing and presentation skills.
  • Proficiency in a scripting language (Python).
  • Interest in biology, ecology, or life sciences.

Scientific context and topics

Since the first breakthroughs in genome sequencing in the early 1990s, obtaining a reference genome sequence for a species of interest has become an essential step for research programs in molecular biology, medicine, or ecology. With the advent of deep sequencing technologies circa 2005, individual genome or transcriptome sequencing has become commonplace, because precision medicine requires accounting for genomic variations at the individual level. Similarly, to understand biodiversity, ecology requires a view of variations at the population level. Currently, large-scale research initiatives, like the LifeTime or Human Cell Atlas consortia, go one step further and investigate cell heterogeneity within an organ or, for example, within a tumor, planning to run thousands of single-cell sequencing experiments routinely. This will give us access to numerous “individual” (and highly similar) genomes to interrogate and mine, but the algorithms to leverage this effort and data are still in their infancy.

A single reference genome cannot adequately represent the genetic diversity of a species. Numerous individual genomes are now being sequenced and together provide a much better view of the diversity of variants. However, current sequence analysis methods overwhelmingly use a single reference sequence, which limits their power. An alternative is to represent a collection of individual genomes via a sequence graph or variation graph, which stores both the sequence regions common to all individuals and the variations specific to each one. To date, a few data structures have been proposed for this purpose, but the community lacks algorithms able to exploit such graphs for numerous sequence analysis tasks. This domain of research is called Computational Pan-genomics. In this project, we intend to develop new algorithms and approaches to fill this gap.
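As an illustration of the idea, here is a minimal sketch of a variation graph in Python. It is a toy model and not one of the data structures proposed in the literature: the class name `VariationGraph`, its methods, and the example sequences are all hypothetical. Shared regions are stored once as nodes, and each individual genome is recorded as a path through the nodes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph (hypothetical): nodes carry DNA labels, paths spell genomes."""

    nodes: dict[int, str] = field(default_factory=dict)        # node id -> DNA label
    edges: set[tuple[int, int]] = field(default_factory=set)   # directed links between nodes
    paths: dict[str, list[int]] = field(default_factory=dict)  # individual -> list of node ids

    def add_path(self, name: str, labelled_nodes: list[tuple[int, str]]) -> None:
        """Register one individual's genome as a walk through (node id, label) pairs."""
        ids = []
        for node_id, label in labelled_nodes:
            # A node shared by several individuals must carry the same label for all.
            assert self.nodes.setdefault(node_id, label) == label
            ids.append(node_id)
        self.edges.update(zip(ids, ids[1:]))
        self.paths[name] = ids

    def spell(self, name: str) -> str:
        """Reconstruct the genome of one individual by concatenating its node labels."""
        return "".join(self.nodes[i] for i in self.paths[name])


# Two made-up individuals that differ by a single substitution (A vs T).
graph = VariationGraph()
graph.add_path("individual_A", [(1, "ACGT"), (2, "A"), (4, "GGC")])
graph.add_path("individual_B", [(1, "ACGT"), (3, "T"), (4, "GGC")])

print(graph.spell("individual_A"))  # ACGTAGGC
print(graph.spell("individual_B"))  # ACGTTGGC
print(len(graph.nodes))             # 4 nodes store both genomes
```

The common prefix and suffix are stored once (nodes 1 and 4), while the variant position forms a small "bubble" (nodes 2 and 3). Real pan-genome structures add much more, such as reverse-complement orientation and compressed indexes.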

Among the possible questions related to sequence graphs, we are particularly interested in

  1. efficient algorithms to search for approximate, complex, or probabilistic patterns in sequence graphs
  2. construction of specific variation graphs for sets of viral genomes.
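To make the first question concrete, the sketch below searches a small acyclic sequence graph for walks that spell a pattern with a bounded number of substitutions. It is a naive recursive baseline written for illustration only, not an efficient algorithm of the kind the project aims for. The function names, the graph encoding (a dict of node labels plus a dict of successors), and the example data are assumptions, and node labels are assumed to be non-empty.

```python
def min_mismatches(nodes, successors, node_id, offset, pattern, budget):
    """Fewest substitutions needed to spell `pattern` along some walk that
    starts at (node_id, offset), or None if more than `budget` are required."""
    if not pattern:
        return 0
    label = nodes[node_id]
    if offset == len(label):
        # End of this node's label: continue along every outgoing edge.
        results = [
            min_mismatches(nodes, successors, nxt, 0, pattern, budget)
            for nxt in successors.get(node_id, [])
        ]
        return min((r for r in results if r is not None), default=None)
    cost = 0 if label[offset] == pattern[0] else 1
    if cost > budget:
        return None
    rest = min_mismatches(nodes, successors, node_id, offset + 1,
                          pattern[1:], budget - cost)
    return None if rest is None else cost + rest


def search_graph(nodes, successors, pattern, max_mismatches):
    """Return sorted (node_id, offset, mismatches) triples for every start
    position from which some walk spells `pattern` within the budget."""
    hits = []
    for node_id, label in nodes.items():
        for offset in range(len(label)):
            best = min_mismatches(nodes, successors, node_id, offset,
                                  pattern, max_mismatches)
            if best is not None:
                hits.append((node_id, offset, best))
    return sorted(hits)


# Hypothetical graph with one substitution bubble (node 2 = "A", node 3 = "T").
nodes = {1: "ACGT", 2: "A", 3: "T", 4: "GGC"}
successors = {1: [2, 3], 2: [4], 3: [4]}

print(search_graph(nodes, successors, "GTTGG", 0))  # [(1, 2, 0)]
print(search_graph(nodes, successors, "GTCGG", 0))  # []
print(search_graph(nodes, successors, "GTCGG", 1))  # [(1, 2, 1)]
```

The running time of this baseline grows with the number of walks it explores, which is what makes the problem interesting on large graphs with many variants.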

Known patterns built from biological observations are key tools for annotating genes, regulatory regions, and protein-DNA or protein-RNA sequences in genomes and transcriptomes. For instance, patterns for transcription factor binding sites are built and collected in international databases like JASPAR. Each of these patterns represents a set of highly variable sequences, and one wishes to find similar sequence regions in new genomes or transcriptomes, and to predict whether genomic variations disrupt their function. One may also want to determine which individuals or populations share common binding sites.
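A common way to encode such a pattern is a position weight matrix, which assigns a score to each nucleotide at each position of the site. The sketch below scans a linear sequence with a small matrix and then checks whether a single substitution removes a predicted site. The matrix values, the threshold, and the sequences are invented for illustration; they are not taken from JASPAR, and the function name `scan` is hypothetical.

```python
# Hypothetical 3-column weight matrix with integer scores; its best-scoring
# word is "ATG". Real matrices are longer and use log-odds scores.
MATRIX = [
    {"A": 2, "C": -1, "G": -1, "T": -2},
    {"A": -2, "C": -1, "G": -1, "T": 2},
    {"A": -1, "C": -1, "G": 2, "T": -2},
]


def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at least `threshold`."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(matrix, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


reference = "CCATGATCATG"
variant = "CCATCATCATG"  # same sequence with G -> C at position 4

print(scan(reference, MATRIX, 5))  # [(2, 6), (8, 6)]
print(scan(variant, MATRIX, 5))    # [(8, 6)]: the site at position 2 is lost
```

Extending this kind of scan from one linear sequence to all walks of a sequence graph, without enumerating them, is one instance of the pattern search question stated above.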

During a viral disease outbreak, sequencing individual samples helps assess the variability of viral strains within a single host or within a population. Because of the immune system of infected hosts, viral genomes are under pressure to mutate and generate new variants that may better escape the host defense or resist therapies. We aim to design a structure for representing these variations and the associated functional information, such that it can be fully exploited in medical genomics. Such a structure may also be useful for investigating the evolutionary relationships between variants. The number of viral variants and the large range of their relative frequencies make this a very challenging computational question.

Remarks:

  • Although it may not be obvious, the two topics are related: viruses need to recruit proteins from their host in order to colonize it and to live and reproduce within it. One way of achieving this is to have viral RNAs or proteins bind to regions of the host genome or transcriptome, triggering functions that serve the virus's own needs.
  • Topics other than the two above may be discussed and addressed within this PhD thesis project; suggestions are welcome.

More background on variation graphs and computational pan-genomics can be found in the review article <10.1093/bib/bbw089>.

Information and application

To apply, please send an application file to the contact address, including:

  • A detailed CV, highlighting your achievements.
  • Copies of BSc and MSc transcripts and a list of courses.
  • A motivation letter in English, explaining your suitability for the position and topics, and why you want to pursue a PhD within this network.
  • Contact details and recommendation letters of two referees, in English or French.
  • A list of your skills and qualifications.
  • A list of publications, if applicable, indicating for each one whether you are a major contributor and what your contribution was.

Conditions (for details see https://alpaca-itn.eu/vacancies):

  • You qualify as an Early Stage Researcher, meaning that – on the starting date of your employment with the host institute – you are in the first four years of your research career and have not (yet) been awarded a doctoral degree.
  • You have not resided or carried out your main activity (study, work, etc.) in the country where the position is announced for more than 12 months during the 3 years prior to the starting date of your employment with the respective host institute.
  • You are proficient in English (academic level).

Note: For residents outside the EEA (European Economic Area), a TOEFL English language test might be required.

Location: Montpellier, South of France

Contact: Eric RIVALS, rivals@lirmm.fr

Department of Computer Science, LIRMM, University of Montpellier and CNRS.

First deadline for application: Feb 15th, 2021. Last deadline for application: July 15th, 2021.

Eric Rivals
CNRS Research Director in Computer Science and Bioinformatics

My research interests include string algorithms, bioinformatics, genomics.