Charles Lawrence, Brown University, Providence, USA

The two sides of genomic statistical inference

Advances in genomics have made increasingly large data sets available for analysis. Such large data sets would seem to promise ever more precise parameter estimates, yet paradoxically the opposite is becoming increasingly common. This has happened because the same technologies have opened opportunities to draw inferences on previously unanswerable high-dimensional questions. As a result, traditional highest-scoring estimation methods such as maximum likelihood estimation or maximum a posteriori (MAP) estimation no longer enjoy the favorable asymptotic properties for which they rightfully became famous, and can be seriously misleading.
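
As a small illustration of how a highest-scoring estimate can differ from an ensemble-based one, the sketch below compares the MAP state with a centroid-style estimate (the state minimizing expected Hamming distance, in the spirit of the centroid estimators listed in the references) on a toy posterior over binary vectors. The posterior values and all function names are hypothetical, chosen only for illustration; this is not code from the lectures.

```python
# Toy posterior over binary vectors of length 3 (hypothetical values).
posterior = {
    (0, 0, 0): 0.4,
    (1, 1, 1): 0.3,
    (1, 1, 0): 0.3,
}


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def expected_hamming(candidate, dist):
    """Posterior expected Hamming distance from candidate to the ensemble."""
    return sum(p * hamming(candidate, state) for state, p in dist.items())


def map_estimate(dist):
    """Highest-scoring state: the single most probable vector."""
    return max(dist, key=dist.get)


def centroid_estimate(dist):
    """For binary vectors under Hamming loss, set each coordinate to 1
    when its marginal posterior probability exceeds 1/2."""
    length = len(next(iter(dist)))
    marginals = [
        sum(p for state, p in dist.items() if state[i] == 1)
        for i in range(length)
    ]
    return tuple(1 if m > 0.5 else 0 for m in marginals)


map_state = map_estimate(posterior)
centroid_state = centroid_estimate(posterior)

print("MAP state:     ", map_state, expected_hamming(map_state, posterior))
print("Centroid state:", centroid_state, expected_hamming(centroid_state, posterior))
```

In this toy case the MAP state (0, 0, 0) carries the most mass of any single state, but the centroid (1, 1, 0) lies closer on average to the rest of the ensemble: its expected Hamming distance is about 1.1, against 1.5 for the MAP state.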

On the other hand, when attention is focused on unknowns of low to modest dimension, the massive size of genomic data sets is ideal for statistical inference using maximum likelihood and related methods, since the asymptotic requirement that the data be large compared to the number of unknowns is then well satisfied. However, this ideal setting presents its own challenges. Specifically, such large data sets greatly enhance the power of statistical tests, so even very small differences become highly significant. While this may seem desirable, small experimental artifacts often yield highly statistically significant results, and as a result familiar components of statistical testing, including p-values, FDR, and q-values, become useless.
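
The effect of sample size on significance can be sketched with a simple calculation. The example below assumes a two-sample z-test with known, equal standard deviation and a fixed, very small true difference in means (0.01 standard deviations, standing in for a minor experimental artifact), and evaluates the two-sided p-value at the expected value of the test statistic for increasing sample sizes. The effect size, sample sizes, and function names are assumptions made for illustration only.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def expected_z(delta, sigma, n_per_group):
    """Expected z statistic for a two-sample test with known sigma,
    true mean difference delta, and n_per_group samples in each group."""
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    return delta / standard_error


DELTA = 0.01  # assumed tiny difference in means, in units of sigma
SIGMA = 1.0   # assumed known standard deviation

results = {}
for n in (100, 10_000, 1_000_000, 10_000_000):
    z = expected_z(DELTA, SIGMA, n)
    results[n] = two_sided_p(z)
    print(f"n per group = {n:>10,d}   z = {z:6.2f}   p = {results[n]:.3g}")
```

The difference itself never changes; only the sample size does. At a hundred samples per group it is invisible to the test, while at millions per group the same negligible difference yields a vanishingly small p-value.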

The lectures in this course will describe both sides of this problem, illustrate them with examples, and describe alternative ensemble-based methods appropriate to this new setting.

References:

Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis, (2004) Efron B, Journal of the American Statistical Association: 99:96-104.

Centroid estimators for inference in high-dimensional discrete spaces (2008) Carvalho, LE. and Lawrence, CE. PNAS: USA, 105: 3209-3214.

Measuring Global Credibility with Application to Local Sequence Alignment, (2008) Webb-Robertson BJM. McCue LA. and Lawrence CE. PLoS Computational Biology 4(5): e1000077 doi:10.1371/journal.pcbi.1000077.

Exact Calculation of Distributions on Integers, with Application to Sequence Alignment (2009), Newberg LA. Lawrence CE. Journal of Computational Biology, 16(1): 1-18.

Automated mapping of large-scale chromatin structure in ENCODE (2008) Lian H, Thompson WA, Thurman R, Stamatoyannopoulos JA, Noble WS, Lawrence CE. Bioinformatics, 24(17): 1911-1916. Epub 2008 Jun 30.

A phylogenetic Gibbs sampler that yields centroid solutions for cis regulatory site prediction (2007) Lee A. Newberg, William A. Thompson, Sean P. Conlan, Thomas M. Smith, Lee Ann McCue, Charles E. Lawrence. Bioinformatics, 23: 1718-1727; doi:10.1093/bioinformatics/btm241.

Clustering of RNA secondary structures with application to messenger RNAs, (2006) Ding Y, Chan CY, and Lawrence CE, Journal of Molecular Biology, 359: 554–571.

RNA Secondary Structure Prediction by Centroids in a Boltzmann Weighted Ensemble. Ding Y, Chan CY, and Lawrence CE. RNA, 11 (8):1157-1166, 2005.

Sfold Web Server for Statistical Folding and Rational Design of Nucleic Acids. Ding Y, Chan CY, Lawrence CE. Nucleic Acids Res, 32(Web Server issue): W135-W141, 2004.