# Probability Distributions # May 15, 2012 # Stat560 This exercise uses a population genetics problem as an example of maximum likelihood estimation (MLE). It will also serve as some familiarization with how you write functions of variables in R. Suppose that there is a disease (let's say halitosis) which is partly genetically determined. The genotype aa has a 40% chance of getting the disease, and the other two possible genotypes, AA and Aa, each has a 10% chance of getting the disease. We want to estimate the frequency of the A allele, and put a rough confidence interval on it. If the gene frequency of the A allele is p, and that of a is 1-p, then the frequency of the disease is expected to be (if the genotypes are in Hardy-Weinberg proportions as a result of random mating: F = p^2 (0.1) + 2p(1-p) (0.1) + (1-p)^2 (0.4) (where the notation "^2" means "squared"). Now suppose we observe 1000 indviduals and find that the 182 of them have the disease. Using a binomial distribution, the probability that 182 out of 1000 have the disease is the binomial class probability for 182 out of 1000 when the probability of the event is F (which is a function of p). This is 1000! -------- F^(182) (1-F)^(818) 182! 818! we want to maximize this by varying p. The first thing to notice is that the factorial stuff in front doesn't matter -- it will be the same for all values of F. So we drop it. Then the log-likelihood is 182 log(F) + 818 log(1-F) We want to write a function LL, the log-likelihood for p. For this we create two functions (just copy them but look at them too and try to figure out what they do, as this is how you make your own functions): F <- function(p) { return(0.1*(p^2+2*p*(1-p)) + 0.4*(1-p)^2) } LL <- function(f) { return(182*log(f)+818*log(1-f)) } Note -- more generally you put statements inside the parentheses that define the body of the function, and these statements are separated by semicolons. The last one should be return(...) which computes the value of the value of the function. A function can also have multiple arguments such as LL(a, b, x) Enter these commands. Then 1. Calculate LL(F(p)) for some values of p. 2. To make life easier, using commands like seq( ... ) to make a vector (called, say, x) of gene frequencies from just above zero to nearly 1 (don't use 0 or 1 as the logarithms will explode, which is scary). 3. Make a vector of log-likelihood values by doing LL(F(x)). 4. Plot this against x 5. Repeat commands 2 and 3 to narrow down the search, using seq(...) in appropriate ways.