Bioinformatics Vol. 17 no. 10 2001
Pages 977-987
© 2001 Oxford University Press
Model-based clustering and data transformations for gene expression data
1 Computer Science and Engineering, Box 352350
2 Statistics, Box 354322, University of Washington, Seattle, WA 98195, USA
3 Insightful Corporation, 1700 Westlake Avenue North, Suite 500, Seattle, WA 98109, USA
Received on April 20, 2001; accepted on July 6, 2001
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a good clustering method and determining the correct number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
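As an illustration of how choosing the number of clusters becomes a model selection problem, the following minimal sketch fits one-dimensional Gaussian mixtures by EM and compares them with the Bayesian Information Criterion (BIC). It is not the MCLUST implementation: the shared-variance model, the quantile-based initialization, the toy data, and all function names are assumptions made for this example. BIC is written here as 2 log L − m log n, so larger values are preferred.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(xs, k, n_iter=300):
    """EM for a 1-D mixture of k Gaussians sharing one variance.

    Returns (log-likelihood, component means). Illustrative only.
    """
    n = len(xs)
    srt = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles of the data.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    grand = sum(xs) / n
    var = sum((x - grand) ** 2 for x in xs) / n
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, var) for w, m in zip(weights, means)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update mixing weights, means, and the shared variance.
        sizes = [sum(r[j] for r in resp) for j in range(k)]
        weights = [s / n for s in sizes]
        means = [sum(r[j] * x for r, x in zip(resp, xs)) / max(sizes[j], 1e-12)
                 for j in range(k)]
        var = sum(r[j] * (x - means[j]) ** 2
                  for r, x in zip(resp, xs) for j in range(k)) / n
        var = max(var, 1e-6)  # guard against a degenerate variance
    loglik = sum(math.log(sum(w * normal_pdf(x, m, var)
                              for w, m in zip(weights, means))) for x in xs)
    return loglik, means


def bic(loglik, k, n):
    """BIC = 2 log L - m log n; larger is better under this convention."""
    n_params = 2 * k  # k means + (k - 1) mixing weights + 1 shared variance
    return 2.0 * loglik - n_params * math.log(n)


def toy_data():
    """Two well-separated groups built from normal quantiles (synthetic)."""
    base = [NormalDist(0.0, 1.0).inv_cdf((i + 0.5) / 100) for i in range(100)]
    return base + [x + 8.0 for x in base]


def choose_k(xs, candidates=(1, 2, 3, 4)):
    """Return the candidate number of clusters with the highest BIC."""
    scores = {k: bic(fit_mixture(xs, k)[0], k, len(xs)) for k in candidates}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best_k, scores = choose_k(toy_data())
    print("BIC by number of clusters:", scores)
    print("Selected number of clusters:", best_k)
```

The same comparison extends to choosing among covariance structures: each candidate model contributes its own parameter count to the penalty term, and the model and number of clusters with the highest BIC are selected together.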
Results: We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach had superior performance on our synthetic data sets, consistently selecting the correct model and the correct number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. To explore the validity of the Gaussian mixture assumption, we also assessed the degree to which these real gene expression data sets fit multivariate Gaussian distributions both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits.
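A small sketch can show why a transformation may bring data closer to a Gaussian shape. The abstract does not name the transformations studied; the log transform used below, the synthetic right-skewed "intensities", and the use of sample skewness as a rough symmetry check are all assumptions made for illustration, not the paper's procedure.

```python
import math
from statistics import NormalDist


def skewness(xs):
    """Sample skewness (third standardized moment); 0 for symmetric data."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def toy_intensities(n=200):
    """Synthetic right-skewed values: exponentials of normal quantiles."""
    std_normal = NormalDist(0.0, 1.0)
    return [math.exp(std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)]


if __name__ == "__main__":
    raw = toy_intensities()
    logged = [math.log(x) for x in raw]
    print("skewness before log transform:", skewness(raw))
    print("skewness after log transform:", skewness(logged))
```

Skewness addresses only one aspect of normality in a single dimension; assessing a multivariate Gaussian fit, as described above, requires more than this check.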
Availability: MCLUST is available at http://www.stat.washington.edu/fraley/mclust. The software for the diagonal model is under development.
Contact: kayee@cs.washington.edu
Supplementary information: http://www.cs.washington.edu/homes/kayee/model
* To whom all correspondence should be addressed.