Bioinformatics Vol. 17 no. 90001 2001
Pages S225-S233
© 2001 Oxford University Press
Fragment assembly with double-barreled data
1 Department of Computer Science and
Engineering, University of California at San Diego, La Jolla, CA
92093,
2 Department of Mathematics, University of
Southern California, Los Angeles, CA 90089,
Received on February 6, 2001
; revised on April 2, 2001
; accepted on April 2, 2001
For the last twenty years fragment assembly was dominated by the "overlap - layout - consensus" algorithms that are used in all currently available assembly tools. However, the limits of these algorithms are being tested in the era of genomic sequencing and it is not clear whether they are the best choice for large-scale assemblies. Although the "overlap - layout - consensus" approach proved to be useful in assembling clones, it faces difficulties in genomic assemblies: the existing algorithms make assembly errors even in bacterial genomes. We abandoned the "overlap - layout - consensus" approach in favour of a new Eulerian Superpath approach that outperforms the existing algorithms for genomic fragment assembly (Pevzner et al. 2001 InProceedings of the Fifth Annual International Conference on Computational Molecular Biology (RECOMB-01), 25626). In this paper we describe our new EULER-DB algorithm that, similarly to the Celera assembler takes advantage of clone-end sequencing by using the double-barreled data. However, in contrast to the Celera assembler, EULER-DB does not mask repeats but uses them instead as a powerful tool for contig ordering. We also describe a new approach for the Copy Number Problem: "How many times a given repeat is present in the genome?". For long nearly-perfect repeats this question is notoriously difficult and some copies of such repeats may be "lost" in genomic assemblies. We describe our EULER-CN algorithm for the Copy Number Problem that proved to be successful in difficult sequencing projects.
Contact: ppevzner{at}ucsd.edu