Abstract
It has been unambiguously shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This holds even for well-designed, randomized studies. We introduce "surrogate variable analysis" (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.
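The abstract does not spell out the procedure, so the following is only a toy illustration of the general idea, not the authors' SVA algorithm: simulate expression data with a modeled group variable and an unmodeled factor, remove the modeled variable from every gene, and take the dominant pattern left in the residuals as a candidate surrogate variable. All names (`group`, `hidden`, `surrogate`), the simulated effect sizes, and the use of power iteration are assumptions made for this sketch.

```python
import random

def residualize_on_group(values, groups):
    """Subtract each group's mean: the residual of an intercept-plus-group fit."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def top_eigenvector(matrix, rng, iterations=100):
    """Power iteration for the leading eigenvector of a symmetric matrix."""
    n = len(matrix)
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        vec = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(0)
n_samples, n_genes = 20, 200

# Modeled variable of interest: two alternating groups (hypothetical design).
group = [i % 2 for i in range(n_samples)]
# Unmodeled factor (e.g. a batch), balanced within each group by construction.
hidden = [1.0 if (i // 2) % 2 == 0 else -1.0 for i in range(n_samples)]

# Each simulated gene responds to the group, to the hidden factor, and to noise.
expression = []
for _ in range(n_genes):
    group_effect = rng.gauss(0.0, 1.0)
    hidden_effect = rng.gauss(0.0, 1.0)
    expression.append([
        group_effect * group[i] + hidden_effect * hidden[i] + rng.gauss(0.0, 0.5)
        for i in range(n_samples)
    ])

# Step 1: remove the modeled variable from every gene.
residuals = [residualize_on_group(row, group) for row in expression]

# Step 2: the leading eigenvector of the sample-by-sample Gram matrix of the
# residuals is the dominant leftover pattern across samples.
gram = [
    [sum(r[i] * r[j] for r in residuals) for j in range(n_samples)]
    for i in range(n_samples)
]
surrogate = top_eigenvector(gram, rng)

# Step 3: compare the estimate with the factor the analysis never observed.
surrogate_vs_hidden = abs(pearson(surrogate, hidden))
print("absolute correlation with hidden factor:", round(surrogate_vs_hidden, 3))
```

In a follow-up analysis under the same assumptions, `surrogate` would enter each gene's model as an extra covariate alongside `group`, so the hidden factor no longer inflates the noise or leaks into the group estimate.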
Publication Info
- Year: 2007
- Type: article
- Volume: 3
- Issue: 9
- Pages: e161
- Citations: 2074
- Access: Closed
Identifiers
- DOI: 10.1371/journal.pgen.0030161