Classify gene pair expression

From CoGepedia

Classifying the expression of a gene pair.

Removing datasets where the genes are off

Highly regulated genes will often be off in a majority of all datasets. This can bias attempts to measure correlation by creating a large cluster of datapoints around 0,0. To avoid this issue, any datapoint where gene1 is expressed at less than 1/10th the maximum observed expression of gene1 and gene2 is also expressed at less than 1/10th the maximum observed expression of gene2 is masked from downstream analysis.

The number of datasets analyzed for a given gene pair over the total number of datasets available is reported in the spreadsheet as "X".
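The masking rule and the reported ratio can be sketched as follows. This is a minimal illustration, not the code used to build the spreadsheet: the function name `mask_off_datasets`, the `fraction` parameter, and the example values are all hypothetical.

```python
def mask_off_datasets(gene1, gene2, fraction=0.1):
    """Return the (gene1, gene2) datapoints kept for downstream analysis.

    gene1, gene2: equal-length, non-empty lists of expression values,
    one value per dataset. A dataset is masked only when BOTH genes fall
    below `fraction` of their own maximum observed expression.
    """
    if len(gene1) != len(gene2):
        raise ValueError("expression lists must have the same length")
    cutoff1 = fraction * max(gene1)
    cutoff2 = fraction * max(gene2)
    return [
        (x, y)
        for x, y in zip(gene1, gene2)
        if not (x < cutoff1 and y < cutoff2)
    ]


# Made-up expression values for five datasets.
gene1 = [0.0, 0.5, 12.0, 40.0, 3.0]
gene2 = [0.1, 30.0, 0.2, 25.0, 1.0]
kept = mask_off_datasets(gene1, gene2)
print(f"{len(kept)}/{len(gene1)} datasets analyzed")
```

Note that a dataset where only one of the two genes is off is kept, since those datapoints are informative about the relationship between the genes.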

Differential Gene Expression

The following criteria are NOT tests of statistical significance and do not correct for multiple testing. Ideally, any results obtained through these analyses should be verified with qPCR prior to publication or labor-intensive downstream work.

Groups of datasets where plants were or were not exposed to a stress/stimulus were collected. Within these datasets, a gene was considered differentially expressed if there was a two-fold difference in average expression between the two conditions AND the highest expressed datapoint in the less expressed condition was less than the least expressed datapoint from the more expressed condition.

These criteria were picked heuristically and can be fine-tuned as we gain more experience with these data types.
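As a rough sketch of the two criteria applied to a single gene (the function name `is_differentially_expressed` and the `fold` parameter are hypothetical, and treating exactly two-fold as passing is an assumption):

```python
def is_differentially_expressed(stimulus, control, fold=2.0):
    """Apply the two heuristic criteria to one gene.

    stimulus, control: non-empty lists of expression values,
    one value per dataset in that condition.
    """
    mean_stim = sum(stimulus) / len(stimulus)
    mean_ctrl = sum(control) / len(control)
    # Identify the more and the less expressed condition by mean expression.
    if mean_stim >= mean_ctrl:
        high, low, mean_high, mean_low = stimulus, control, mean_stim, mean_ctrl
    else:
        high, low, mean_high, mean_low = control, stimulus, mean_ctrl, mean_stim
    # Criterion 1: at least a `fold`-fold difference in average expression.
    # Written as a product so a mean of zero does not divide by zero.
    fold_ok = mean_high >= fold * mean_low
    # Criterion 2: the highest datapoint of the less expressed condition is
    # below the lowest datapoint of the more expressed condition.
    separated = max(low) < min(high)
    return fold_ok and separated


# Made-up values: clean separation passes, one overlapping datapoint fails.
print(is_differentially_expressed([10, 12, 15], [2, 3, 4]))
print(is_differentially_expressed([10, 12, 3], [2, 3, 4]))
```

The second criterion makes the call sensitive to a single outlying dataset, which is one reason to confirm candidates experimentally.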

Here are the datasets currently used for testing:

Light Response

Stimulus: Circadian2, Circadian3, Control Shoot

Control: Circadian1, Circadian5, Circadian6 (I dropped Circadian4 because the lights would have just turned off and the plants wouldn't have had a chance to adapt yet).

Anaerobic Shoot

Stimulus: Anerobic Shoot, Anerobic Shoot2

Control: Control Shoot, Circadian2, Circadian3

Anaerobic Root

Stimulus: Anerobic Root

Control: Control Root

Dark Stress

Stimulus: Circadian1-6

Control: Constant Dark1-6

Overall Gene Pair Patterns

Automatically classified patterns of expression for a gene pair

db: Both Dead

Criteria: neither gene is expressed at >= 1 FPKM in any dataset in the analysis. These gene pairs are not necessarily dead, but they are either only turned on under conditions not studied in the analysis or are expressed at such a low level that pattern analysis is useless.

d1: Gene1 Dead

Criteria: gene1 is expressed at < 1 FPKM in all datasets in the analysis AND is always expressed at a level at least 10x lower than the average expression of gene2.

d2: Gene2 Dead

Criteria: Same as above, with gene1 and gene2 switched.
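The three "dead" classes can be sketched as below. This is only an illustration under stated assumptions: the function name `classify_dead` and its parameters are hypothetical, and "always at least 10x less" is read as every datapoint of the dead gene being at most one tenth of the other gene's mean.

```python
def classify_dead(gene1, gene2, min_fpkm=1.0, ratio=10.0):
    """Return 'db', 'd1', 'd2', or None if neither gene is called dead.

    gene1, gene2: non-empty lists of FPKM values, one per dataset.
    """
    on1 = any(x >= min_fpkm for x in gene1)
    on2 = any(y >= min_fpkm for y in gene2)
    if not on1 and not on2:
        return "db"
    mean1 = sum(gene1) / len(gene1)
    mean2 = sum(gene2) / len(gene2)
    # Assumed reading of "always at least 10x less than the average":
    # every datapoint times `ratio` stays at or below the other gene's mean.
    if not on1 and all(x * ratio <= mean2 for x in gene1):
        return "d1"
    if not on2 and all(y * ratio <= mean1 for y in gene2):
        return "d2"
    return None


# Made-up FPKM values.
print(classify_dead([0.1, 0.2], [0.3, 0.0]))
print(classify_dead([0.1, 0.2, 0.0], [50.0, 30.0, 40.0]))
print(classify_dead([0.9, 0.5], [2.0, 1.0]))
```

Pairs that return None here would go on to the correlation-based classes.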

nc: No Correlation

Criteria: The p-value of the Spearman correlation between the expression of the two genes (after removing omitted conditions) is > .01, or the R value is less than .65 and greater than -.65.

c1: Correlated, Gene1 dominates

Criteria: Gene pair didn't fail the no-correlation test, and the average expression of gene1 is at least 2x that of gene2 (after removing omitted conditions).

c2: Correlated, Gene2 dominates

Criteria: Same as above, with gene1 and gene2 switched.

ce: Correlated, Genes even

Criteria: Gene pair didn't fail the no-correlation test, there is less than a two-fold difference in mean expression between the two genes, and the direction of correlation is positive.

ic: Inverse Correlation

Criteria: Any correlated gene pair (p-value < .01, absolute value of R > .65) where the correlation is negative.
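The correlation-based classes can be combined into one decision function, sketched here under several assumptions. The function name `classify_correlated` and its parameters are hypothetical. The Spearman R and p-value are taken as inputs (they could come from, for example, `scipy.stats.spearmanr` run on the unmasked datapoints). The criteria above do not say whether a negatively correlated pair with a two-fold difference in means is c1/c2 or ic; this sketch assumes ic takes priority. Values exactly at the thresholds are also not defined above and are treated as correlated here.

```python
def classify_correlated(rho, p_value, mean1, mean2,
                        max_p=0.01, min_abs_rho=0.65, fold=2.0):
    """Return 'nc', 'ic', 'c1', 'c2', or 'ce' for one gene pair.

    rho, p_value: Spearman correlation and its p-value, computed after
    removing masked datasets. mean1, mean2: mean expression of gene1 and
    gene2 over the same datasets.
    """
    # No-correlation test: a weak p-value OR a weak coefficient.
    if p_value > max_p or abs(rho) < min_abs_rho:
        return "nc"
    # Assumption: negative correlations are reported as 'ic' before any
    # comparison of mean expression.
    if rho < 0:
        return "ic"
    if mean1 >= fold * mean2:
        return "c1"
    if mean2 >= fold * mean1:
        return "c2"
    return "ce"


# Made-up statistics for four gene pairs.
print(classify_correlated(0.30, 0.001, 10.0, 10.0))
print(classify_correlated(-0.80, 0.001, 10.0, 50.0))
print(classify_correlated(0.80, 0.001, 30.0, 10.0))
print(classify_correlated(0.80, 0.001, 10.0, 15.0))
```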