Tutoring Case: STAT 352
STAT 352: Data Modeling and Inference
Take-home for Exam 1. Due on Friday, March 13th.

Problem 1: Suppose that a model M0 is nested in a model MA, so that M0 can be described as a null hypothesis of MA with a difference in model dimension of d. The AIC-based test says to reject M0 in favor of MA if twice the difference in log-likelihoods of MA and M0 is greater than 2d. Plot the log of the Type I error rate (the probability of rejecting the null even though it is true) of such a test for d = 1, ..., 100, using the asymptotic χ²_d distribution for twice the difference in log-likelihoods when the null is true. Comment on your results.

Hint: Do not make this harder than it needs to be. My code for the solution is three lines. The pchisq function is vectorized in the arguments q, df, and ncp; it has a lower.tail option that defaults to TRUE and a log.p option that defaults to FALSE. The ncp input should be 0 because the non-centrality parameter is 0.

Problem 2: Suppose we have three sets of random variables Wh, Xi, and Yj (for h = 1, ..., k, i = 1, ..., m, and j = 1, ..., n), all of which are mutually independent. Assume that the three sets of random variables are all normally distributed with different means but the same standard deviation. The MLEs for the means are just the group means, and the MLE for the variance is the mean of the squared errors of the observations, where each observation is centered at its group's sample average. Write a function to fit this model to three observed data vectors w, x, y and return both the MLE and the log-likelihood evaluated at the MLE. Use the commands

data("iris")
w = iris$Sepal.Width[iris$Species=="setosa"]
x = iris$Sepal.Width[iris$Species=="versicolor"]
y = iris$Sepal.Width[iris$Species=="virginica"]

to read in some data to analyze using your function.
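A minimal sketch of the Problem 1 calculation, following the hint (the variable names here are my own, not part of the assignment): under the null, twice the log-likelihood difference is asymptotically χ²_d, so the Type I error rate is the upper-tail probability P(χ²_d > 2d).

```r
# Type I error rate of the AIC-based test: reject when the chi-squared
# statistic exceeds 2d, so the error rate is P(chisq_d > 2d).
d <- 1:100
log_type1 <- pchisq(2 * d, df = d, lower.tail = FALSE, log.p = TRUE)
plot(d, log_type1, type = "l",
     xlab = "difference in model dimension d",
     ylab = "log Type I error rate")
```

Because log.p = TRUE, pchisq returns the log of the tail probability directly, which is numerically safer than computing a tiny probability and then taking its log.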
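For Problem 2, the stated MLEs can be translated directly into a fitting function. This is only a sketch under those stated formulas; the name fit_three_groups is illustrative, not required by the assignment.

```r
# Fit three normal groups with separate means and one common variance.
# MLEs per the problem statement: group means for the means, and the
# pooled mean squared error (each observation centered at its own
# group's mean) for the variance.
fit_three_groups <- function(w, x, y) {
  mw <- mean(w); mx <- mean(x); my <- mean(y)
  n <- length(w) + length(x) + length(y)
  sse <- sum((w - mw)^2) + sum((x - mx)^2) + sum((y - my)^2)
  sigma2 <- sse / n  # MLE of the common variance (divide by n, not n - 1)
  ll <- sum(dnorm(w, mw, sqrt(sigma2), log = TRUE)) +
        sum(dnorm(x, mx, sqrt(sigma2), log = TRUE)) +
        sum(dnorm(y, my, sqrt(sigma2), log = TRUE))
  list(means = c(mw, mx, my), sigma2 = sigma2, loglik = ll)
}

data("iris")
w <- iris$Sepal.Width[iris$Species == "setosa"]
x <- iris$Sepal.Width[iris$Species == "versicolor"]
y <- iris$Sepal.Width[iris$Species == "virginica"]
fit_three_groups(w, x, y)
```

With the iris vectors loaded as above, the call returns the three group means, the pooled variance MLE, and the log-likelihood evaluated at those estimates.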
Compare the results from analyzing the data with the different-means model to the results from analyzing the data under the assumption that the means are all the same and the standard deviations are the same. Comment on your results.

Hint: You can use optim to do this. However, I have already told you in words what the MLEs are, so you can just translate those words into equations. If you are not comfortable translating the words into equations, just use optim. There is an example comparing two groups with various parameter assumptions in the "testing and model selection with likelihoods.pdf" file.

Problem 3: Generate a simulated data set with 100 observations based on the following model. Each data point is a vector Z = (X, Y), where X describes the age of a machine (New, FiveYearsOld, or TenYearsOld) and Y describes whether the quality of output from the machine is Normal or Abnormal. The probabilities of a machine being in the three states are

P(X = New) = 1/4, P(X = FiveYearsOld) = 1/3, P(X = TenYearsOld) = 5/12.

The probabilities of Normal output conditioned on machine age are

P(Y = Normal | X = New) = 8/10, P(Y = Normal | X = FiveYearsOld) = 8/10, P(Y = Normal | X = TenYearsOld) = 4/10.

Your data should consist of two vectors X and Y, both of class character. Convert these to factors using the as.factor function. Analyze your simulated data using the chisq.test function with the inputs x=x, y=y. This test tries to determine whether the random variables can be treated as though they are independent, which is the null hypothesis. By default, it uses an asymptotic χ² distribution. Perform the analysis with the exact same function, but with simulated p-values, using the inputs x=x, y=y, simulate.p.value=TRUE, B=10000. Would you trust the p-values from the asymptotic distribution or the simulated p-values more? What conclusions can you draw about your simulated data from this analysis?
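A sketch of the Problem 3 simulation in the style of the for-loop hint (the seed choice is arbitrary and only for reproducibility):

```r
set.seed(352)  # arbitrary seed, for reproducibility only
n <- 100
ages <- c("New", "FiveYearsOld", "TenYearsOld")
X <- character(n)
Y <- character(n)
for (i in 1:n) {
  # first sample the machine age ...
  X[i] <- sample(ages, 1, prob = c(1/4, 1/3, 5/12))
  # ... then sample the output quality conditioned on the age
  if (X[i] == "TenYearsOld") {
    Y[i] <- sample(c("Normal", "Abnormal"), 1, prob = c(4/10, 6/10))
  } else {
    Y[i] <- sample(c("Normal", "Abnormal"), 1, prob = c(8/10, 2/10))
  }
}
x <- as.factor(X)
y <- as.factor(Y)
chisq.test(x = x, y = y)
chisq.test(x = x, y = y, simulate.p.value = TRUE, B = 10000)
```

With only 100 observations, some expected cell counts can be small, so chisq.test may warn that the asymptotic approximation could be inaccurate; that warning is worth keeping in mind when deciding which p-value to trust.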
Repeat the simulation 100 times and store the asymptotic and simulated p-values from each simulation. Compare the asymptotic and simulated p-values that you obtain.

Hint: There are ways of generating this data in a few lines, but I recommend using a for loop. In this loop, first sample the age and then use an if{}else{} construct to sample the output. The sample function is very useful here, for example sample(c("a","b","c"),1,prob=c(.1,.4,.5)). Also, there are lots of ways to think about comparing these two p-values. Here, the p-values come in sets of 2, one set for each simulated dataset. You should think about using this fact.

Problem 4: There are lots of possible ways to think about estimating a parameter. We have mostly focused on minimizing losses, especially minimizing minus twice the log-likelihood. However, there are other ways of estimating parameters. One is called the "method of moments" and uses relationships between expectations of random variables and the parameters of interest. We are going to think about this in the context of the shape parameter of a gamma distribution. If Y ~ Gamma(shape = α, scale = θ), then we know E[Y] = αθ and Var[Y] = αθ², so the method of moments estimator for α from data y_1, ..., y_n is ȳ²/s², where s² is the sample variance. An alternative method of moments estimator uses the expected value of Y and the fact that E[log(Y)] = ψ(α) + log(θ), where ψ is the digamma function (whatever that is). Computing this second method of moments estimator involves solving digamma(alpha) - log(alpha) = mean(log(y)) - log(mean(y)). Somewhat surprisingly, this alternative method of moments estimator is actually equivalent to the MLE. Here are two R functions for computing these estimators.
mom_est = function(y, ind = 1:length(y)) {
  y = y[ind]
  mean(y)^2 / var(y)
}

mle_est = function(y, ind = 1:length(y)) {
  y = y[ind]
  diff = mean(log(y)) - log(mean(y))
  f = function(alpha) { if (alpha == 0) { -Inf } else { digamma(alpha) - log(alpha) - diff } }
  uniroot(f, c(0, 100), maxiter = 1000, tol = 1e-10, extendInt = "upX")$root
}

library(boot)
data("melanoma")
y = melanoma$thickness
mom_est(y)
mle_est(y)

Use the functions in the boot package (Lab 3) to compute the basic and bca 95% confidence intervals for the data y using 10000 bootstrap replicates. Compare and contrast the intervals you obtained. From these intervals, would you reject the null hypothesis that α = 1 (the exponential distribution) in favor of the alternative hypothesis α ≠ 1 (a gamma distribution)? Do you prefer the MOM estimate or the MLE? Why?

Hint: Do not make this harder than it needs to be. My code is something like five additional lines added to the end of the code above.
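A sketch of the bootstrap step, assuming the estimator functions above; mom_est is repeated here so the snippet runs on its own, and the MLE intervals follow the same pattern with mle_est in place of mom_est.

```r
library(boot)

# Method-of-moments estimator for the gamma shape, written with the
# (data, indices) signature that boot() expects for its statistic.
mom_est <- function(y, ind = 1:length(y)) {
  y <- y[ind]
  mean(y)^2 / var(y)
}

data("melanoma")
y <- melanoma$thickness
set.seed(352)  # arbitrary seed, for reproducibility only
b <- boot(y, mom_est, R = 10000)
boot.ci(b, conf = 0.95, type = c("basic", "bca"))
```

Checking whether 1 falls inside each interval gives the bootstrap answer to the α = 1 versus α ≠ 1 question.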