An Overview of Statistical Tests for the Identification of Differentially Expressed Genes
Statistical methodologies for microarray data have evolved over the last decade from simple visual inspection of the results to complex algorithms for data modelling and the identification of expression changes. Very recently, new comparative approaches have been proposed to clarify and evaluate the performance of different algorithms on gene expression data. A microarray experiment has four steps in which statistical analysis is necessary: experimental design, data filtering and normalization, testing for the identification of differentially expressed genes, and multivariate analysis. In this talk, different methodologies for the identification of differentially expressed genes will be discussed.
The first method used for the identification of significantly deregulated genes was the fold-change (FC). However, FC is now considered an inadequate statistical test because it does not incorporate variance and offers no associated level of confidence. The parametric t-test and the analysis of variance (ANOVA) are usually robust approaches for the comparison of groups of data (such as lists of differentially expressed genes). However, because of the small sample sizes typical of microarray data, parametric approaches are not recommended, and the usual t-test/ANOVA should be replaced by a moderated test combined with a non-parametric approach (1-3). Moderation modifies the denominator of the test statistic, which has the advantage of correcting the gene-wise variance estimate when the sample size is small, while the non-parametric approach estimates the null distribution through permutation analysis. At the same time, several different approaches have been proposed based on gene ranking (4-5), principal component analysis (6) and correlation (7).
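The difference between fold-change, an ordinary t-type statistic, and a moderated statistic with a permutation-based null distribution can be sketched for a single gene as follows. This is a minimal illustration, not the procedure of the methods cited as (1-3): it assumes log2-scale expression values, and it assumes that moderation takes the simple form of adding a constant to the denominator. The function names and the constant `s0` are hypothetical and chosen for illustration only.

```python
import math
import random
from statistics import mean, variance


def fold_change(a, b):
    """Log2 fold-change: difference of group means on log2-scale data."""
    return mean(b) - mean(a)


def moderated_stat(a, b, s0=0.0):
    """Welch-type t statistic with a constant s0 added to the denominator.

    s0 = 0 gives the ordinary statistic. s0 > 0 (a hypothetical tuning
    constant) damps genes whose variance estimate is very small by chance,
    which happens easily with few replicates.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    denom = se + s0
    if denom == 0.0:
        return 0.0  # degenerate case: no variability at all
    return (mean(b) - mean(a)) / denom


def permutation_pvalue(a, b, s0=0.0, n_perm=2000, seed=0):
    """Two-sided p-value from random relabelling of the samples."""
    rng = random.Random(seed)
    observed = abs(moderated_stat(a, b, s0))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(moderated_stat(perm_a, perm_b, s0)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


control = [1.0, 2.0, 3.0]
treated = [3.0, 4.0, 5.0]
fc = fold_change(control, treated)
t_plain = moderated_stat(control, treated)
t_mod = moderated_stat(control, treated, s0=0.5)
p_perm = permutation_pvalue(control, treated, s0=0.5)
```

With three replicates per group only 20 distinct relabellings exist, so a gene-wise permutation p-value is necessarily coarse. This is one reason why, in practice, information is usually shared across genes rather than permuting each gene in isolation.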
Microarray data represent an important challenge for statistical analysis because testing the expression levels of tens of thousands of transcripts (multiple testing) may produce hundreds of false positives if the per-test type I error rate (the well-known alpha level commonly used in statistical tests) is applied without adjustment. For example, 10,000 unchanged transcripts tested at alpha = 0.05 would yield about 500 false positives on average. Multiple test corrections are therefore applied. The Bonferroni method is too conservative when the number of tests is large; a good alternative is the false discovery rate (FDR), defined as the expected proportion of false positives in a list of genes called significant (8).
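The contrast between the two corrections can be sketched with adjusted p-values. The FDR adjustment below is the Benjamini-Hochberg step-up procedure, a standard FDR-controlling method; that it is the method meant by (8) is an assumption. The function names and the example p-values are illustrative only.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the FDR).

    The p-value of rank k (ascending) is scaled by m / k, then a running
    minimum from the largest rank downwards keeps the result monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values for four genes (not real data).
pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

alpha = 0.05
n_bonf = sum(p <= alpha for p in bonf)  # genes kept by Bonferroni
n_bh = sum(p <= alpha for p in bh)      # genes kept at FDR 5%
```

In this toy example Bonferroni retains two of the four genes at the 0.05 level while the FDR adjustment retains all four, which illustrates why Bonferroni becomes overly conservative as the number of tests grows.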
References