Statistical and Machine Learning Methods in Computational Biology June 12 - June 19, 2010 | Lipari School on BioInformatics and Computational Biology

Silvio Bicciato, University of Modena and Reggio Emilia

An Overview of Statistical Tests for the Identification of Differentially Expressed Genes
Statistical methodologies for microarray data have evolved over the last decade from simple visual inspection of the results to complex algorithms for modelling the data and identifying expression changes. Very recently, new comparative approaches have been proposed to clarify and evaluate the performance of different algorithms on gene expression data. In a microarray experiment, statistical analysis is needed at four steps: experimental design, data filtering and normalization, testing for the identification of differentially expressed genes, and multivariate analysis. In this talk, different methodologies for the identification of differentially expressed genes will be discussed.
The first method used to identify significantly deregulated genes was the fold change (FC). However, FC is now considered inadequate as a statistical test because it does not incorporate variance and offers no associated level of confidence. The parametric t-test and the analysis of variance (ANOVA) are usually robust approaches for comparing groups of data (for example, to derive a list of differentially expressed genes). However, because of the small sample sizes typical of microarray data, standard parametric approaches are not recommended, and the usual t-test/ANOVA should be replaced by a moderated test combined with a non-parametric approach (1-3). Moderation modifies the denominator of the test statistic, which stabilizes the gene-wise variance estimate when the sample size is small, while the non-parametric approach estimates the null distribution through permutation analysis. In parallel, several other approaches have been proposed based on gene ranking (4-5), principal component analysis (6) and correlation (7).
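As a rough illustration of moderation and permutation, the sketch below adds a small constant to the denominator of a two-group t-like statistic, in the spirit of (1), and estimates a p-value for each gene by permuting the sample labels. The function names, the constant s0 = 0.1 and the simulated data are assumptions made for this example only. Unlike the published methods, it builds a separate null distribution for each gene instead of pooling permuted statistics across genes.

```python
import numpy as np

def moderated_stat(x, labels, s0):
    """Per-gene statistic: mean difference / (standard error + s0).

    x      : genes x samples matrix of expression values
    labels : boolean array, True for samples in the first group
    s0     : constant added to the denominator (s0 = 0 gives the
             ordinary equal-variance t-statistic)
    """
    a, b = x[:, labels], x[:, ~labels]
    n1, n2 = a.shape[1], b.shape[1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    # Pooled variance of the two groups, computed gene by gene.
    pooled = ((n1 - 1) * a.var(axis=1, ddof=1)
              + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return diff / (se + s0)

def permutation_pvalues(x, labels, s0, n_perm=200, seed=0):
    """Two-sided gene-wise p-values from random label permutations."""
    rng = np.random.default_rng(seed)
    observed = np.abs(moderated_stat(x, labels, s0))
    exceed = np.zeros(x.shape[0])
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)
        exceed += np.abs(moderated_stat(x, shuffled, s0)) >= observed
    # Add one to numerator and denominator so no p-value is exactly zero.
    return (exceed + 1) / (n_perm + 1)

# Simulated data (hypothetical): 100 genes, 4 vs 4 samples,
# with the first five genes shifted upwards in the first group.
rng = np.random.default_rng(1)
labels = np.array([True] * 4 + [False] * 4)
x = rng.normal(size=(100, 8))
x[:5, :4] += 3.0
pvals = permutation_pvalues(x, labels, s0=0.1)
```

With only four samples per group there are just 70 distinct label assignments, so the smallest attainable permutation p-value is bounded well above zero. This is one reason the published methods pool information across genes.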
Microarray data pose an important challenge for statistical analysis: testing the expression levels of tens of thousands of transcripts (multiple testing) may produce hundreds of false positives if the type I error threshold (the well-known alpha level commonly used in statistical tests) is applied to each test without correction. Multiple-testing corrections are therefore applied. The Bonferroni method is too conservative when the number of tests is large; a good alternative is to control the false discovery rate (FDR), defined as the expected proportion of false positives in a list of genes called significant (8).
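To make the difference between the two corrections concrete, the following minimal sketch computes Bonferroni-adjusted p-values and adjusted p-values from the Benjamini-Hochberg step-up procedure (8). The function names and the five example p-values are hypothetical, chosen only for illustration.

```python
import numpy as np

def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

# Hypothetical p-values for five genes, in arbitrary order.
pvals = [0.03, 0.001, 0.6, 0.012, 0.02]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

alpha = 0.05
n_bonf = int((bonf < alpha).sum())  # genes kept after Bonferroni
n_bh = int((bh < alpha).sum())      # genes kept at FDR < alpha
```

On this toy input the Bonferroni correction retains fewer genes than the FDR procedure at the same nominal level, which is the conservativeness referred to above.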

References

1. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98:5116-21.
2. Smyth GK, Yang Y-H, Speed TP. Statistical issues in microarray data analysis. Methods in Molecular Biology 2003;224:111-136.
3. Smyth GK. Limma: linear models for microarray data. In: 'Bioinformatics and Computational Biology Solutions using R and Bioconductor'. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, 2005.
4. Breitling R, Herzyk P. Rank-based methods as a non-parametric alternative of the T-statistic for the analysis of biological microarray data. J Bioinform Comput Biol 2005;3(5):1171-1189.
5. Breitling R, Armengaud P, Amtmann A, Herzyk P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 2004;573(1–3):83-92.
6. Culhane AC, Perriere G, Considine EC, Cotter TG, Higgins DG. Between-group analysis of microarray data. Bioinformatics 2002;18(12):1600-1608.
7. Pavlidis P, Noble WS. Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2001;2(10):RESEARCH0042.
8. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995;57:289-300.