# Linear Regression
# May 29, 2012
# Genome 560

######################################################################
# We are going to do 5,194 tests and plot their P values. This is for the data set of 
# Storey, Akey et al., the one called RMA_Filtered.txt. Then we're going to use 
# Storey's method to compute False Discovery Rates.

# The data consists of measurements of log gene expression for 5,194 loci for two 
# populations (European and African).  In each population they looked at 8 
# individuals and for each of them measured them twice.  We are going to simplify 
# things by pretending that these 16 values (for each population) are just 
# independent samples.   

# In a more complete analyses we would have to take the levels of individuals and 
# replicate measurements into account by an analysis of variance (or some such). 

# For this exercise we just use a t-test.

# The two populations each have 16 values.  These are the first 16 columns and the 
# next 16 columns. First we have to read in the data frame calling it (say) a:

a <- read.table(header = T, file="http://www.cs.washington.edu/homes/suinlee/genome560/RMA_Filtered.txt")

# (that makes sure the headings of columns are dealt with correctly.

b <- a[,2:33]

# This copies the numbers out into a 5194 by 32 matrix, with columns 1:32 of the new 
# matrix being the old columns 2:33 (so omitting the locus name)

# Now we make up a function "fun" of one argument which is needed to apply it to each 
# row of that matrix.  It returns the P-value for the two-sample  t-test: 

fun <- function(d){return(t.test(d[1:16],d[17:32])$p.value)}

# Do you see what it does to row  i  of matrix b?  Try typing "fun(b[10,])" or 
# whatever row you want, to just make sure it works and gets the P value for that row.

# Now we apply it to the rows of the matrix and get the P values in a vector

p <- apply(b, 1, fun)

# That may take a minute or two to run.
# Now  p  is the vector of  P values.

# To see what they look like, plot the histogram of the p's, using nclass=50 so you 
# can see where the 0.05 cutoff is (in case you want to use that).

# Choose a significance level alpha (maybe it would be 0.05, or 0.10, or whatever.  
# Call it alpha

# Find out how many there are:

alpha = 0.05
tabulate(as.numeric(p < alpha))       # tells you how many there are

# Look at a histogram of  P values.  

# Now, we will apply multiple hypothesis testing 
#p.adjust(p, method = p.adjust.methods, n = length(p))

p.adjust.methods

# Try
padj <- p.adjust(p, "fdr" )

tabulate(as.numeric(padj < alpha) )

par( mfrow = c(1,2) )
hist( p, breaks = 50)
hist( padj, breaks = 50)

# Try different methods for multiple hypothesis testing

padj <- p.adjust(p, "bonferroni" )

tabulate(as.numeric(padj < alpha) )