q-gram based database searching using a suffix array (QUASAR)

Abstract

With the increasing amount of DNA sequence information deposited in our databases searching for similarity to a query sequence has become a basic operation in molecular biology But even today s fast algorithms reach their limits when applied to all versus all comparisons of large databases Here we present a new data base searching algorithm dubbed QUASAR Q gram Alignment based on Su x ARrays which was designed to quickly detect se quences with strong similarity to the query in a context where many searches are conducted on one database Our algorithm applies a modi cation of q tuple ltering implemented on top of a su x array Two versions were de veloped one for a RAM resident su x array and one for access to the su x array on disk We compared our implementation with BLAST and found that our approach is an order of magnitude faster It is however restricted to the search for strongly similar DNA sequences as is typically required e g in the context of clustering expressed sequence tags ESTs.

Publication
Proceedings of the third annual international conference on Computational molecular biology - RECOMB ‘99
suffix array database BLAST search algorithm random-access memory time complexity N-gram cluster analysis