Tutoring Case: STAT 352
STAT 352: Data Modeling and Inference
Take-home for Exam 1. Due on Friday, March 13th.

Problem 1: Suppose that a model M0 is nested in a model MA, so that M0 can be described as a null hypothesis of MA with a difference in model dimension of d. The AIC-based test says to reject M0 in favor of MA if twice the difference in log-likelihoods of MA and M0 is greater than 2d. Plot the log of the Type I error rate (the probability of rejecting the null even though it is true) of such a test for d = 1, ..., 100, using the asymptotic χ²_d distribution for twice the difference in log-likelihoods when the null is true. Comment on your results.

Hint: Do not make this harder than it needs to be. My code for the solution is three lines. The pchisq function is vectorized in the arguments q, df, and ncp; it has a lower.tail option that defaults to TRUE and a log.p option that defaults to FALSE. The ncp input should be 0 because the non-centrality parameter is 0.

Problem 2: Suppose we have three sets of random variables Wh, Xi, and Yj (for h = 1, ..., k, i = 1, ..., m, and j = 1, ..., n), all of which are mutually independent. Assume that the three sets of random variables are all normally distributed with different means but the same standard deviation. The MLEs for the means are just the group means, and the MLE for the variance is the mean of the squared errors of the observations, where each observation is centered at its group's sample average. Write a function to fit this model to three observed data vectors w, x, y and return both the MLE and the log-likelihood evaluated at the MLE. Use the commands

data("iris")
w = iris$Sepal.Width[iris$Species=="setosa"]
x = iris$Sepal.Width[iris$Species=="versicolor"]
y = iris$Sepal.Width[iris$Species=="virginica"]

to read in some data to analyze using your function.
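A minimal sketch of the Problem 1 calculation, following the hint (the variable names here are my own, not part of the assignment): under the null, twice the log-likelihood difference is asymptotically χ²_d, so the Type I error rate is the upper-tail probability P(χ²_d > 2d).

```r
# Type I error rate of the AIC-based test: reject when the chi-squared
# statistic exceeds 2d, so the error rate is P(chisq_d > 2d).
d <- 1:100
log_type1 <- pchisq(2 * d, df = d, lower.tail = FALSE, log.p = TRUE)
plot(d, log_type1, type = "l",
     xlab = "difference in model dimension d",
     ylab = "log Type I error rate")
```

Because log.p = TRUE, pchisq returns the log of the tail probability directly, which is numerically safer than computing a tiny probability and then taking its log.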
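For Problem 2, the stated MLEs can be translated directly into a fitting function. This is only a sketch under those stated formulas; the name fit_three_groups is illustrative, not required by the assignment.

```r
# Fit three normal groups with separate means and one common variance.
# MLEs per the problem statement: group means for the means, and the
# pooled mean squared error (each observation centered at its own
# group's mean) for the variance.
fit_three_groups <- function(w, x, y) {
  mw <- mean(w); mx <- mean(x); my <- mean(y)
  n <- length(w) + length(x) + length(y)
  sse <- sum((w - mw)^2) + sum((x - mx)^2) + sum((y - my)^2)
  sigma2 <- sse / n  # MLE of the common variance (divide by n, not n - 1)
  ll <- sum(dnorm(w, mw, sqrt(sigma2), log = TRUE)) +
        sum(dnorm(x, mx, sqrt(sigma2), log = TRUE)) +
        sum(dnorm(y, my, sqrt(sigma2), log = TRUE))
  list(means = c(mw, mx, my), sigma2 = sigma2, loglik = ll)
}

data("iris")
w <- iris$Sepal.Width[iris$Species == "setosa"]
x <- iris$Sepal.Width[iris$Species == "versicolor"]
y <- iris$Sepal.Width[iris$Species == "virginica"]
fit_three_groups(w, x, y)
```

With the iris vectors loaded as above, the call returns the three group means, the pooled variance MLE, and the log-likelihood evaluated at those estimates.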
Compare the results from analyzing the data with the different-means model to the results from analyzing the data under the assumption that the means are all the same and the standard deviations are the same. Comment on your results.

Hint: You can use optim to do this. However, I have already told you in words what the MLEs are, so you can just translate those words into equations. If you are not comfortable translating the words into equations, just use optim. There is an example comparing two groups with various parameter assumptions in the "testing and model selection with likelihoods.pdf" file.

Problem 3: Generate a simulated data set with 100 observations based on the following model. Each data point is a vector Z = (X, Y), where X describes the age of a machine (New, FiveYearsOld, or TenYearsOld) and Y describes whether the quality of output from the machine is Normal or Abnormal. The probabilities of a machine being in the three states are

P(X = New) = 1/4, P(X = FiveYearsOld) = 1/3, P(X = TenYearsOld) = 5/12.

The probabilities of Normal output conditioned on machine age are

P(Y = Normal | X = New) = 8/10, P(Y = Normal | X = FiveYearsOld) = 8/10, P(Y = Normal | X = TenYearsOld) = 4/10.

Your data should consist of two vectors X and Y, both of class character. Convert these to factors using the as.factor function. Analyze your simulated data using the chisq.test function with the inputs x=x, y=y. This test tries to determine whether the random variables can be treated as though they are independent, which is the null hypothesis. By default, it uses an asymptotic χ² distribution. Perform the analysis with the exact same function, but with simulated p-values, using the inputs x=x, y=y, simulate.p.value=TRUE, B=10000. Would you trust the p-values from the asymptotic distribution or the simulated p-values more? What conclusions can you draw about your simulated data from this analysis?
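A sketch of the Problem 3 simulation in the style of the for-loop hint (the seed choice is arbitrary and only for reproducibility):

```r
set.seed(352)  # arbitrary seed, for reproducibility only
n <- 100
ages <- c("New", "FiveYearsOld", "TenYearsOld")
X <- character(n)
Y <- character(n)
for (i in 1:n) {
  # first sample the machine age ...
  X[i] <- sample(ages, 1, prob = c(1/4, 1/3, 5/12))
  # ... then sample the output quality conditioned on the age
  if (X[i] == "TenYearsOld") {
    Y[i] <- sample(c("Normal", "Abnormal"), 1, prob = c(4/10, 6/10))
  } else {
    Y[i] <- sample(c("Normal", "Abnormal"), 1, prob = c(8/10, 2/10))
  }
}
x <- as.factor(X)
y <- as.factor(Y)
chisq.test(x = x, y = y)
chisq.test(x = x, y = y, simulate.p.value = TRUE, B = 10000)
```

With only 100 observations, some expected cell counts can be small, so chisq.test may warn that the asymptotic approximation could be inaccurate; that warning is worth keeping in mind when deciding which p-value to trust.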
Repeat the simulation 100 times and store the asymptotic and simulated p-values from each simulation. Compare the asymptotic and simulated p-values that you obtain.

Hint: There are ways of generating this data in a few lines, but I recommend using a for loop. In this loop, first sample the age and then use an if{}else{} construct to sample the output. The sample function is very useful here, for example sample(c("a","b","c"),1,prob=c(.1,.4,.5)). Also, there are lots of ways to think about comparing these two p-values. Here, the p-values come in sets of 2, one set for each simulated dataset. You should think about using this fact.

Problem 4: There are lots of possible ways to think about estimating a parameter. We have mostly focused on minimizing losses, especially minimizing minus twice the log-likelihood. However, there are other ways of estimating parameters. One is called the "method of moments" and uses relationships between expectations of random variables and the parameters of interest. We are going to think about this in the context of the shape parameter of a gamma distribution. If Y ~ Gamma(shape = α, scale = θ), then we know E[Y] = αθ and Var[Y] = αθ², so the method of moments estimator for α from data y_1, ..., y_n is ȳ²/s², where s² is the sample variance. An alternative method of moments estimator uses the expected value of Y and the fact that E[log(Y)] = ψ(α) + log(θ), where ψ is the digamma function (whatever that is). Computing this second method of moments estimator involves solving digamma(alpha) - log(alpha) = mean(log(y)) - log(mean(y)). Somewhat surprisingly, this alternative method of moments estimator is actually equivalent to the MLE. Here are two R functions for computing these estimators.
mom_est = function(y, ind = 1:length(y)) {
  y = y[ind]
  mean(y)^2 / var(y)
}

mle_est = function(y, ind = 1:length(y)) {
  y = y[ind]
  diff = mean(log(y)) - log(mean(y))
  f = function(alpha) { if (alpha == 0) { -Inf } else { digamma(alpha) - log(alpha) - diff } }
  uniroot(f, c(0, 100), maxiter = 1000, tol = 1e-10, extendInt = "upX")$root
}

library(boot)
data("melanoma")
y = melanoma$thickness
mom_est(y)
mle_est(y)

Use the functions in the boot package (Lab 3) to compute the basic and bca 95% confidence intervals for the data y using 10000 bootstrap replicates. Compare and contrast the intervals you obtained. From these intervals, would you reject the null hypothesis that α = 1 (the exponential distribution) in favor of the alternative hypothesis α ≠ 1 (a gamma distribution)? Do you prefer the MOM estimate or the MLE? Why?

Hint: Do not make this harder than it needs to be. My code is something like five additional lines added to the end of the code above.
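A sketch of the bootstrap step, assuming the estimator functions above; mom_est is repeated here so the snippet runs on its own, and the MLE intervals follow the same pattern with mle_est in place of mom_est.

```r
library(boot)

# Method-of-moments estimator for the gamma shape, written with the
# (data, indices) signature that boot() expects for its statistic.
mom_est <- function(y, ind = 1:length(y)) {
  y <- y[ind]
  mean(y)^2 / var(y)
}

data("melanoma")
y <- melanoma$thickness
set.seed(352)  # arbitrary seed, for reproducibility only
b <- boot(y, mom_est, R = 10000)
boot.ci(b, conf = 0.95, type = c("basic", "bca"))
```

Checking whether 1 falls inside each interval gives the bootstrap answer to the α = 1 versus α ≠ 1 question.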