Bioinformatics Vol. 17 no. 90001 2001
Pages S5-S12
© 2001 Oxford University Press
An efficient algorithm for finding short approximate non-tandem repeats
1 Wilhelm-Schickard-Institut
für Informatik,
Universität
Tübingen, Sand 13,
Tübingen, 72076, Germany
2 Department of Computer Science and
Engineering, College of Engineering, University of California,
Riverside, USA
Received on February 6, 2001
; revised on April 2, 2001
; accepted on April 2, 2001
We study the problem of approximate non-tandem repeat
extraction. Given a
long subject string S of length N over a finite
alphabet
and a threshold D, we would like to
find all short substrings of S of length P that repeat
with at most D differences, i.e., insertions,
deletions, and mismatches. We give a careful theoretical
characterization of the set of seeds (i.e., some
maximal exact repeats) required by the algorithm, and prove a
sublinear bound on their expected numbers. Using this result, we
present a sub-quadratic algorithm for finding all short
(i.e., of length O(log N)) approximate
repeats. The running time of our algorithm is
O(DN3pow(
)-1log N), where
= D/P and pow(
) is an increasing, concave
function that is 0 when
= 0 and about
0.9 for DNA and protein sequences.
Contact: adebiyi{at}informatik.uni-tuebingen.de
Throughout the paper we only consider
non-tandem repeats.