Hopefully someone on these forums can help me out with this basic problem in gene expression studies.

I did deep sequencing of an experimental and a control tissue. I then obtained fold enrichment values of genes in the experimental sample over control. The reference genome has ~15,000 genes. 3,000 out of 15,000 genes are enriched above a certain cut-off in my sample of interest compared to control.

So:

A = total gene population = 15,000

B = RNA-Seq enriched subpopulation = 3,000.

In a previous ChIP-chip experiment, I found 400 genes that are enriched by ChIP-chip. Of the 400 ChIP-chip genes, 100 genes are in the group of 3,000 enriched RNA-Seq transcripts.

So:

C = total # of ChIP-chip enriched genes = 400.

What is the probability that my 100 ChIP-chip genes would be enriched by RNA-Seq by chance alone? In other words, what is the most prudent way to calculate if my observed overlap between B and C (100 genes) is any better than that obtained by chance alone? From what I have read so far, the best way to test this is by using the hypergeometric distribution.

I used an online calculator (stattrek.com) to set up a hypergeometric distribution test with the following parameters:

– pop size = 15,000

– # of successes in population = 3,000

– sample size = 400

– # of successes in sample = 100

I get the following for the hypergeometric probability: P(x=100) = 0.00224050636447747.

The actual # of genes overlapping between B and C = 100. Is this better than by chance alone? Doesn’t look like it is if the chance of any one gene being enriched is 1:5 (3,000 out of 15,000). That’s why I don’t understand how come the P(x=100) I calculated above is 0.0022. That amounts to a 0.2% chance of the overlap occurring by chance. Shouldn’t this be much higher?

If I sampled 400 random genes from the big list of 15,000, then about 80 of these genes would be expected to be enriched by chance alone (1:5). The number of genes that are actually overlapping is 100, so this is just slightly better than by chance.

I also tried to come up with a solution using the dhyper or phyper functions in R (using what I saw in another post):

A=all genes in the genome (15,000)

B=RNA-Seq enriched genes (3,000)

C=ChIP-chip enriched genes (400)

Here’s the R input/output (adapted from a previous Stack Exchange post):

`> totalpop <- 15000`

`> sample1 <- 3000`

`> sample2 <- 400`

`> dhyper(0:2, sample1, totalpop-sample1, sample2)`

`[1] 4.431784e-40 4.584209e-38 2.364018e-36`

`> phyper(-1:2, sample1, totalpop-sample1, sample2)`

`[1] 0.000000e+00 4.431784e-40 4.628526e-38 2.410304e-36`

I’m not sure how to interpret these numbers. I believe 2.36e-36 is the probability of getting a complete overlap between B and C by chance alone? But this makes no sense, since that probability is much closer to 1:5. If I start with 15,000 genes, 3,000 will be enriched. Similarly, if I start with 400 ChIP-chip genes, 80 of them should be enriched in the RNA-Seq alone due to the 1:5 chances of enrichment in that data set.

What is the proper way to calculate the p-value, according to the hypergeometric distribution, for the overlap of B and C?

**Answer**

You are close with your use of `dhyper` and `phyper`, but I don’t understand where `0:2` and `-1:2` are coming from.

The p-value you want is the probability of getting 100 *or more* white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

```
sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)
```

These give 0.0078.
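Written out, all four lines compute the same upper-tail sum of hypergeometric probabilities. The symbol $X$ for the number of white balls drawn is notation introduced here, not something from the question:

$$
P(X \ge 100) \;=\; \sum_{x=100}^{400} \frac{\binom{3000}{x}\,\binom{12000}{400-x}}{\binom{15000}{400}}
$$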

`dhyper(x, m, n, k)` gives the probability of drawing exactly `x` white balls when `k` balls are drawn from an urn with `m` white and `n` black balls. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

`phyper(x, m, n, k)` gives the probability of getting `x` or fewer, so `phyper(x, m, n, k)` is the same as `sum(dhyper(0:x, m, n, k))`.
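One way to convince yourself of this identity is to compare the two computations over the whole support. This is only a quick sanity check using the numbers from the question; the variable names are my own.

```r
# Urn from the question: 3000 white balls, 12000 black balls, 400 drawn
m <- 3000; n <- 12000; k <- 400

# Running sums of the point probabilities for x = 0, 1, ..., k
cdf_from_dhyper <- cumsum(dhyper(0:k, m, n, k))

# The same cumulative probabilities taken directly from phyper
cdf_from_phyper <- phyper(0:k, m, n, k)

# Largest absolute disagreement between the two; expected to be
# at the level of floating-point rounding error
max_diff <- max(abs(cdf_from_dhyper - cdf_from_phyper))
print(max_diff)
```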

The `lower.tail=FALSE` is a bit confusing. `phyper(x, m, n, k, lower.tail=FALSE)` is the same as `1-phyper(x, m, n, k)`, and so is the probability of `x+1` or more. [I never remember this and so always have to double check.]
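The off-by-one is easy to check numerically. The sketch below (variable names are mine) contrasts passing `observed - 1` with passing `observed`. The two results should differ by exactly the point probability of the observed count.

```r
# Urn from the question: 3000 white balls, 12000 black balls, 400 drawn
m <- 3000; n <- 12000; k <- 400
observed <- 100

# P(X >= 100): note the "observed - 1"
p_at_least <- phyper(observed - 1, m, n, k, lower.tail = FALSE)

# P(X > 100) = P(X >= 101): what you get if you forget the "- 1"
p_strictly_more <- phyper(observed, m, n, k, lower.tail = FALSE)

# The gap between the two is the probability of exactly 100
gap <- p_at_least - p_strictly_more
print(c(p_at_least = p_at_least,
        p_strictly_more = p_strictly_more,
        gap = gap,
        point_prob = dhyper(observed, m, n, k)))
```

Forgetting the `- 1` drops the observed value itself from the tail, which makes the p-value too small.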

At that stattrek.com site, you want to look at the last row, “Cumulative Probability: P(X ≥ 100),” rather than the first row “Hypergeometric Probability: P(X = 100).”

Any *particular* number that you draw is going to have small probability (in fact, `max(dhyper(0:400, 3000, 12000, 400))` gives ∼0.050), and getting 101 or 102 or any larger number is even more interesting than 100. The p-value is the probability, if the null hypothesis were true, of getting a result as interesting or more so than what was observed.

Here’s a picture of the hypergeometric distribution in this case. You can see that it’s centered at 80 (20% of 400) and that 100 is pretty far out in the right tail.
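A plot along these lines can be drawn with a few lines of base R. This is a sketch rather than the exact code behind the figure; the x-range of 40 to 130, the labels, and the red highlighting of the tail are my own choices.

```r
# Urn from the question: 3000 white balls, 12000 black balls, 400 drawn
m <- 3000; n <- 12000; k <- 400

# Range of overlap counts to display (arbitrary window around the bulk)
x <- 40:130
prob <- dhyper(x, m, n, k)

# Expected overlap under the null: 20% of 400
expected <- k * m / (m + n)

# Vertical-bar plot of the point probabilities
plot(x, prob, type = "h",
     xlab = "Number of overlapping genes",
     ylab = "Probability",
     main = "Hypergeometric distribution (3000 white, 12000 black, 400 drawn)")

# Dashed line at the expected overlap
abline(v = expected, lty = 2)

# Highlight the upper tail, 100 or more, that makes up the p-value
tail_idx <- x >= 100
points(x[tail_idx], prob[tail_idx], type = "h", col = "red")
```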

**Attribution**

*Source: Link, Question Author: stlandroidfan, Answer Author: Karl*