Bioinformatics Advance Access originally published online on January 22, 2004

Bioinformatics 20(4) © Oxford University Press 2004; all rights reserved.

Mismatch string kernels for discriminative protein classification

Christina S. Leslie 1,*, Eleazar Eskin 1, Adiel Cohen 1, Jason Weston 2 and William Stafford Noble 3,†

1 Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, Mail Code 0401, New York, NY 10027, USA, 2 Max-Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany and 3 Department of Genome Sciences, University of Washington, 1705 NE Pacific Street, Seattle, WA 98195, USA

Received on February 19, 2003; revised on June 21, 2003; accepted on August 5, 2003
Advance Access Publication January 22, 2004

Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns.

Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.
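
As an illustration of the idea described above, the following is a minimal sketch, not the authors' software. It assumes the usual (k, m) parameterization: every length-k substring of a sequence contributes to all length-k patterns that differ from it in at most m positions, and the kernel is the inner product of the resulting count vectors. All function names here are hypothetical, kernel normalization is omitted, and the second function only approximates the "one pass while traversing the tree" idea with a depth-first recursion over pattern prefixes.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter protein alphabet


def neighborhood(kmer, m, alphabet):
    """Return all strings of len(kmer) within m mismatches of kmer."""
    result = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for pos, ch in zip(positions, letters):
                    chars[pos] = ch
                result.add("".join(chars))
    return result


def feature_map(seq, k, m, alphabet=AMINO_ACIDS):
    """Count, for each pattern, the k-mers of seq within m mismatches of it."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts.update(neighborhood(seq[i:i + k], m, alphabet))
    return counts


def mismatch_kernel(x, y, k, m, alphabet=AMINO_ACIDS):
    """Naive kernel value: inner product of the two feature maps."""
    fx = feature_map(x, k, m, alphabet)
    fy = feature_map(y, k, m, alphabet)
    return sum(c * fy[p] for p, c in fx.items())


def mismatch_kernel_tree(seqs, k, m, alphabet=AMINO_ACIDS):
    """Kernel matrix via depth-first traversal over pattern prefixes.

    Each node keeps the k-mer instances still within m mismatches of the
    prefix spelled by the path; leaves at depth k update the whole matrix.
    """
    n = len(seqs)
    K = [[0] * n for _ in range(n)]
    # Instance = (sequence index, start offset, mismatches so far).
    root = [(s, i, 0) for s, seq in enumerate(seqs)
            for i in range(len(seq) - k + 1)]

    def descend(depth, instances):
        if not instances:
            return  # prune: no k-mer in the data is close to this prefix
        if depth == k:
            per_seq = Counter(s for s, _, _ in instances)
            for a, ca in per_seq.items():
                for b, cb in per_seq.items():
                    K[a][b] += ca * cb
            return
        for letter in alphabet:
            survivors = []
            for s, i, mis in instances:
                mis2 = mis + (seqs[s][i + depth] != letter)
                if mis2 <= m:
                    survivors.append((s, i, mis2))
            descend(depth + 1, survivors)

    descend(0, root)
    return K
```

With m = 0 the sketch reduces to counting exactly shared k-mers. The traversal variant visits only pattern prefixes that some k-mer in the data can still match, which is why it avoids enumerating the full pattern space.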

Availability: SVM software is publicly available at http://microarray.cpmc.columbia.edu/gist. Mismatch kernel software is available upon request.

Contact: cleslie@cs.columbia.edu

* To whom correspondence should be addressed.

† Formerly William Noble Grundy, see www.gs.washington.edu/noble/name-change.html


This article has been cited by other articles:

S. Ma and J. Huang
Penalized feature selection and classification in bioinformatics
Brief Bioinform, September 1, 2008; 9(5): 392 - 403.

T. Damoulas and M. A. Girolami
Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection
Bioinformatics, May 15, 2008; 24(10): 1264 - 1270.

P. Sonego, A. Kocsor, and S. Pongor
ROC analysis: applications to the classification of biological sequences and 3D structures
Brief Bioinform, May 1, 2008; 9(3): 198 - 209.

B. Jiang, M. Q. Zhang, and X. Zhang
OSCAR: One-class SVM for accurate recognition of cis-elements
Bioinformatics, November 1, 2007; 23(21): 2823 - 2828.

N. Nagamine and Y. Sakakibara
Statistical prediction of protein chemical interactions based on chemical structure and mass spectrometry data
Bioinformatics, August 1, 2007; 23(15): 2004 - 2012.

S. Hochreiter, M. Heusel, and K. Obermayer
Fast model-based protein homology detection without alignment
Bioinformatics, July 15, 2007; 23(14): 1728 - 1736.

T. Lingner and P. Meinicke
Remote homology detection based on oligomer distances
Bioinformatics, September 15, 2006; 22(18): 2224 - 2231.

F. Ferre and P. Clote
DiANNA 1.1: an extension of the DiANNA web server for ternary cysteine classification
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W182 - W185.

Q.-w. Dong, X.-l. Wang, and L. Lin
Application of latent semantic analysis to protein remote homology detection
Bioinformatics, February 1, 2006; 22(3): 285 - 290.

J. J. Gordon, M. W. Towsey, J. M. Hogan, S. A. Mathews, and P. Timms
Improved prediction of bacterial transcription start sites
Bioinformatics, January 15, 2006; 22(2): 142 - 148.

R. Kuang, J. Weston, W. S. Noble, and C. Leslie
Motif-based protein ranking by network propagation
Bioinformatics, October 1, 2005; 21(19): 3711 - 3718.

F. Lu, S. Keles, S. J. Wright, and G. Wahba
Framework for kernel regularization with application to protein clustering
PNAS, August 30, 2005; 102(35): 12332 - 12337.

G. Dror, R. Sorek, and R. Shamir
Accurate identification of alternatively spliced exons using support vector machine
Bioinformatics, April 1, 2005; 21(7): 897 - 901.
