Fast Discerning Repeats in DNA Sequences with a Compression Algorithm

Abstract

Long direct repeats in genomes arise from molecular duplication mechanisms like retrotransposition, copy of genes, exon shuffling, … Their study in a given sequence reveals its internal repeat structure as well as part of its evolutionary history. Moreover, detailed knowledge about the mechanisms can be gained from a systematic investigation of repeats. The problem of finding such repeats is viewed as an NP-complete problem of the optimal compression of a sequence thanks to the encoding of its exact repeats. The repeats chosen for compression must not overlap each other as do the repeats which result from molecular duplications. We present a new heuristic algorithm, Search_Repeats, where the selection of exact repeats is guided by two biologically sound criteria: their length and the absence of overlap between those repeats. Search_Repeats detects approximate repeats, as clusters of exact sub-repeats, and points out large insertions/deletions in them. Search_Repeats takes only 3 seconds of CPU time for the genome of Haemophilus influenzae on a Sun Ultrasparc workstation.

Publication
Proceedings of 8th Workshop on Genome Informatics (GIW 97)
text compression repeats DNA chromosome genome Haemophilus influenzae bacteria