Text Algorithms and Methods for High Throughput Sequencing Analysis

Sequencing technologies provide the sequence (series of symbols) which represents a molecule of DNA, RNA or even proteins. Bioinformatics aims to analyze these sequences thanks to algorithms from the field of text algorithms (stringology). With the revolution in high throughput sequencing technologies, the volume of sequences to be processed is exploding. Genomics or transcriptomics projects each produce hundreds of millions of sequences (also called reads). We design powerful algorithms to analyze biological sequences on a large scale. These algorithms are implemented in bioinformatics softwares which combine efficiency and sensitivity, for example for:

correct sequencing errors (LoRDEC, LoRMA)
find the position of a sequence in a genome (CRAC, MPSCAN)
compare genomes (QOD, YOC)
for assembly and scaffolding of genomes (SAVAGE, SCAFFOLDER)
for the search for protein binding pattern.

These software are distributed through our bioinformatics platform ATGC. They are then used on real data for studies in molecular biology, biodiversity and evolutionary biology. For example,

the genome project of a marine microalgae
study the replication of the vertebrate genome
or human, animal or plant transcriptomes.

Figure 1 shows an application result in a genomic project: Circular visualization of the chromsomes of the genome of O. tauri, a pico-algae present in all the seas of the globe. (Blanc-Mathieu et al., BMC Genomics, 15: 1103 10.1186 / 1471-2164-15-1103, 2014)
Representation of the O. Tauri genome

The effectiveness of these tools often comes from the use of indexing data structures, such as the suffix tree, the Burrows-Wheeler transform, or wavelet trees. For example, we have proposed algorithms for:

construct assembly graphs (Figure 2)
calculate superchains (Figures 3, 4)
compare sequences to graphs.

Figure 2: Representation of a De Bruijn graph in a generalized suffix tree structure. (Cazaux et al., J. Computer System & Sciences, http://dx.doi.org/10.1016/j.jcss.2016.06.008, 2016).
Representation of a De Bruijn graph in a generalized suffix tree structure

Most computer questions fall under the category of optimization problems and are difficult. We then look for approximation algorithms and we study their approximation ratio which measures a distance to an optimal solution. Greedy algorithms are one example.

Figure 3: For a set of three P sequences: = {ATCA, AGTA, CTGA} and their reversed complements, representation of the (classic) overlap graph (Cazaux et al., Data Compression Conf. 2016).
For a set of three sequences P: = {ATCA, AGTA, CTGA} and their inverted complementaries, representation of the (classical) overlap graph

Figure 4: For a set of three sequences P: = {ATCA, AGTA, CTGA} and their reversed complements, representations of the overlap graph, which contains the same information as the overlap graph, but uses a linear space memory (Cazaux et al. al., Data Compression Conf. 2016).