STAT 352: Data Modeling and Inference

Take-home for Exam 1. Due on Friday, March 13th.

Problem 1: Suppose that a model M0 is nested in a model MA such that M0 can be described as a null hypothesis of MA with a difference in model dimension of d. The AIC-based test says to reject M0 in favor of MA if twice the difference in log-likelihoods of MA and M0 is greater than 2d. Plot the log of the Type I Error Rate (the probability of rejecting the null even though it is true) of such a test for d = 1, . . . , 100, using the asymptotic χ² distribution with d degrees of freedom for twice the difference in log-likelihoods when the null is true. Comment on your results. Hint: Do not make this harder than it needs to be. My code for the solution is three lines. The pchisq function is vectorized in the arguments q, df, and ncp; it has a lower.tail option that defaults to TRUE and a log.p option that defaults to FALSE. The ncp input should be 0 because the Non-Centrality Parameter is 0.
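Following the hint, one possible sketch (an assumption about the intended approach, not the official solution) is:

```r
# Sketch: under the null, 2*(logLik_A - logLik_0) is asymptotically
# chi-squared with d degrees of freedom, so the Type I error rate of the
# AIC-based test is P(chisq_d > 2d).
d = 1:100
log_alpha = pchisq(2*d, df=d, ncp=0, lower.tail=FALSE, log.p=TRUE)
plot(d, log_alpha, xlab="d", ylab="log Type I Error Rate")
```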

Problem 2: Suppose we have three sets of random variables Wh, Xi, and Yj (for h = 1, . . . , k, i = 1, . . . , m, and j = 1, . . . , n), all of which are mutually independent. Assume that the three sets of random variables are all normally distributed with different means but the same standard deviation. The MLEs for the means are just the group means, and the MLE for the variance is the mean of the squared errors of the observations, where each observation is centered at its group's sample average. Write a function that fits this model to three observed data vectors w, x, y and returns both the MLEs and the log-likelihood evaluated at the MLEs. Use the commands

data("iris")
w = iris$Sepal.Width[iris$Species=="setosa"]
x = iris$Sepal.Width[iris$Species=="versicolor"]
y = iris$Sepal.Width[iris$Species=="virginica"]

to read in some data to analyze using your function. Compare the results from analyzing the data with the model for different means to the results from analyzing the data under the assumption that the means are all the same and the standard deviations are the same. Comment on your results. Hint: You can use optim to do this. However, I have already told you in words what the MLEs are, so you can just translate those words into equations. If you are not comfortable translating the words into equations, just use optim. There is an example comparing two groups under various parameter assumptions in the file testing and model selection with likelihoods.pdf.
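As a hypothetical sketch of the closed-form route (the function name and return structure are my own choices, not prescribed by the problem):

```r
# Hypothetical sketch: closed-form MLEs for three normal groups sharing one
# standard deviation (group means plus the pooled mean squared error).
fit_three_means = function(w, x, y){
  means = c(mean(w), mean(x), mean(y))             # MLEs of the group means
  resid = c(w - means[1], x - means[2], y - means[3])
  sigma2 = mean(resid^2)                           # MLE of the common variance
  loglik = sum(dnorm(resid, mean=0, sd=sqrt(sigma2), log=TRUE))
  list(means=means, sd=sqrt(sigma2), loglik=loglik)
}
```

The equal-means comparison fit follows the same pattern with a single grand mean in place of the three group means.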

Problem 3: Generate a simulated data set with 100 observations based on the following model. Each data point is a vector Z = (X, Y), where X describes the age of a machine (New, FiveYearsOld, or TenYearsOld) and Y describes whether the quality of output from the machine is Normal or Abnormal. The probabilities of a machine being in the three states are

P(X = New) = 1/4,  P(X = FiveYearsOld) = 1/3,  P(X = TenYearsOld) = 5/12.

The probabilities of Normal output conditioned on machine age are

P(Y = Normal | X = New) = 8/10,  P(Y = Normal | X = FiveYearsOld) = 8/10,  P(Y = Normal | X = TenYearsOld) = 4/10.

Your data should consist of two vectors x and y, both of which are of class character. Convert these to factors using the as.factor function. Analyze your simulated data using the chisq.test function with inputs x=x, y=y. This test tries to determine whether the random variables can be treated as though they are independent, which is the null hypothesis. By default, it uses an asymptotic χ² distribution. Perform the analysis with the exact same function, but with simulated p-values, using the inputs x=x, y=y, simulate.p.value=TRUE, B=10000. Would you trust the p-values from the asymptotic distribution or the simulated p-values more? What conclusions can you draw about your simulated data from this analysis? Repeat the simulation 100 times and store the asymptotic and simulated p-values for each simulation. Compare the asymptotic and simulated p-values that you obtain. Hint: There are ways of generating this data in a few lines, but I recommend using a for loop. In this loop, first sample the age and then use an if(){}else{} construct to sample the output. The sample function is very useful here; for example, sample(c("a","b","c"),1,prob=c(.1,.4,.5)). Also, there are lots of ways to think about comparing these two p-values. Here, the p-values come in pairs, one pair for each simulated dataset. You should think about using this fact.
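The hinted loop might look like the following sketch (the set.seed call is my addition for reproducibility; it is not required by the problem):

```r
# Sketch of the hinted for-loop simulation: sample the age first, then the
# output conditionally on the age.
set.seed(1)
n = 100
x = character(n)
y = character(n)
for(i in 1:n){
  x[i] = sample(c("New","FiveYearsOld","TenYearsOld"), 1, prob=c(1/4, 1/3, 5/12))
  if(x[i] == "TenYearsOld"){
    y[i] = sample(c("Normal","Abnormal"), 1, prob=c(4/10, 6/10))
  } else {
    y[i] = sample(c("Normal","Abnormal"), 1, prob=c(8/10, 2/10))
  }
}
chisq.test(x=as.factor(x), y=as.factor(y))
```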

Problem 4: There are lots of possible ways to think about estimating a parameter. We have mostly focused on minimizing losses, especially minimizing minus twice the log-likelihood. However, there are other ways of estimating parameters. One is called the "method of moments" and uses relationships between expectations of random variables and the parameters of interest. We are going to think about this in the context of the shape parameter of a gamma distribution. If Y ∼ Gamma(shape = α, scale = θ), then we know E[Y] = αθ and Var[Y] = αθ², and so the method of moments estimator for α from data y1, . . . , yn is ȳ²/s², where s² is the sample variance. An alternative method of moments estimator uses the expected value of Y and the fact that E[log(Y)] = ψ(α) + log(θ), where ψ is the digamma function (whatever that is). Computing this second method of moments estimator involves solving digamma(alpha)-log(alpha) = mean(log(y))-log(mean(y)). Somewhat surprisingly, this alternative method of moments estimator is actually equivalent to the MLE. Here are two R functions for computing these estimators.

mom_est = function(y, ind=1:length(y)){
  y = y[ind]
  mean(y)^2/var(y)
}

mle_est = function(y, ind=1:length(y)){
  y = y[ind]
  diff = mean(log(y)) - log(mean(y))
  f = function(alpha){if(alpha==0){-Inf}else{digamma(alpha) - log(alpha) - diff}}
  uniroot(f, c(0,100), maxiter=1000, tol=1e-10, extendInt="upX")$root
}

library(boot)
data("melanoma")
y = melanoma$thickness
mom_est(y)
mle_est(y)

Use the functions in the boot package (Lab 3) to compute the basic and BCa 95% confidence intervals for the data y using 10000 bootstrap replicates. Compare and contrast the intervals you obtained. From these intervals, would you reject the null hypothesis that α = 1 (the exponential distribution) in favor of the alternative hypothesis α ≠ 1 (a gamma distribution)? Do you prefer the MOM estimate or the MLE? Why? Hint: Do not make this harder than it needs to be. My code is something like five additional lines added to the end of the code above.
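For the MOM estimator, the Lab 3 pattern might look like the following sketch (the setup is repeated here so the snippet runs on its own; the mle_est intervals follow the same two-line pattern):

```r
library(boot)
data("melanoma")
y = melanoma$thickness
# mom_est repeated from above so this snippet is self-contained
mom_est = function(y, ind=1:length(y)){ y = y[ind]; mean(y)^2/var(y) }
# 10000 bootstrap replicates, then basic and BCa 95% intervals
b = boot(y, mom_est, R=10000)
boot.ci(b, conf=0.95, type=c("basic","bca"))
```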