STAT GU4206/GR5206 Sample Final Exam
Gabriel
Exam Sections:
The exam is broken down by the following sections:
Part I: Simulation - Problems (1) - (4)
Part II: MLE - Problem (5)
Part III: Bayesian Logistic Model and MCMC - Problems (6) - (9)
Part IV: Split/Apply/Combine and Model Assessment - Problems (10) - (13)
Extra Credit: Problem (14)
Part I: Simulation - Problems (1) - (4)
Consider the continuous random variable X with probability density function
f(x) = -\frac{3}{4}(x^2 - 8x + 15), \quad 3 < x < 5
1) Plot the curve f(x) over the interval 3 < x < 5. Use base R or ggplot to construct the graphic.
Solution
# Solution
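A minimal base R sketch of one possible approach (the helper name f.x is an illustrative choice, not part of the exam code):
# Density from Part I: f(x) = -(3/4)(x^2 - 8x + 15) on (3, 5)
f.x <- function(x) -(3/4) * (x^2 - 8*x + 15)
curve(f.x, from = 3, to = 5, xlab = "x", ylab = "f(x)",
      main = "Density f(x) on 3 < x < 5")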
2) Use Monte Carlo integration to check that f(x) is a valid probability density function, i.e., check that
the area under the curve equals 1. Use 50,000 cases to solve this problem.
Solution
# Solution
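A rough Monte Carlo sketch, assuming uniform draws over the support (3, 5) and the f.x helper from Problem (1):
set.seed(1)                            # illustrative seed
n.mc <- 50000
u <- runif(n.mc, min = 3, max = 5)     # uniform draws over the support
area.hat <- (5 - 3) * mean(f.x(u))     # (b - a) times the average height of f
area.hat                               # should be approximately 1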
3) Write a function named r.mypdf that simulates n cases from f(x). The function should use the
accept-reject algorithm or the inverse-transform method. Note that the accept-reject method is easier
to implement in this setting. Display 10 simulated cases from f(x).
Solution
# Solution
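One possible accept-reject sketch: f attains its maximum f(4) = 3/4 on (3, 5), so Uniform(3, 5) proposals with a flat envelope of height 3/4 suffice (f.x is the helper defined above):
r.mypdf <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    x.prop <- runif(n, min = 3, max = 5)      # proposals from Unif(3, 5)
    u <- runif(n, min = 0, max = 3/4)         # uniform heights under the envelope
    out <- c(out, x.prop[u < f.x(x.prop)])    # keep proposals that fall under the curve
  }
  return(out[1:n])
}
r.mypdf(10)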
4) Simulate 10,000 cases from f(x). Store this vector as my.sim.vec. Construct a histogram of the
simulated distribution with the density f(x) overlaid on the plot. Use base R or ggplot to construct
the graphic.
Solution
# Solution
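A brief sketch reusing r.mypdf and f.x from above:
my.sim.vec <- r.mypdf(10000)
hist(my.sim.vec, breaks = 30, freq = FALSE,
     xlab = "x", main = "Simulated draws from f(x)")
curve(f.x, from = 3, to = 5, add = TRUE, lwd = 2)   # overlay the true density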
Part II: MLE - Problem (5)
In this section we consider a binary response variable Y, two continuous features X1, X2, and a binary feature
X3. The first 7 cases of logistic.final (stored as my.data) are displayed below:
my.data <- read.csv("logistic.final.csv")
head(my.data,7)
## Y X1 X2 X3
## 1 1 -0.8969 -0.8383 0
## 2 1 0.1848 2.0663 0
## 3 0 1.5878 -0.5622 0
## 4 1 -1.1304 1.2757 0
## 5 1 -0.0803 -1.0476 1
## 6 0 0.1324 -1.9659 0
## 7 1 0.7080 -0.3230 1
Logistic model and MLE
Here we assume Y1, . . . , Y50 are independent Bernoulli random variables with success probabilities
E[Y_i] = p_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})}, \quad i = 1, 2, \ldots, 50
The fitted logistic model is estimated using the glm() function:
glm.model <- glm(Y~X1+X2+X3,data=my.data,family=binomial(link = "logit"))
glm.model$coefficients
## (Intercept) X1 X2 X3
## 1.765595 -3.896144 1.652278 2.594704
Basic Optimization: Gradient descent or Newton’s
5) In this question, you are required to compute the maximum likelihood estimate of the vector
(β0, β1, β2, β3)^T. This can be done using several methods, all based on iterative optimization. Choose
only one method and clearly label which method you chose!
Choice i) Minimize the negative log-likelihood by utilizing the grad.descent() function from class
and practice HW6. To solve this problem, you must successfully set up the negative log-likelihood
−ℓ(β0, β1, β2, β3 | y1, . . . , y50) and optimize the function. Break the algorithm when stopping.deriv < 0.001.
Also compare your answer to glm.model$coefficients and identify how many iterations the algorithm took
to converge.
Choice ii) Minimize the negative log-likelihood by utilizing the Newton.method() function from class
and practice Lab8. To solve this problem, you must successfully set up the negative log-likelihood
−ℓ(β0, β1, β2, β3 | y1, . . . , y50) and optimize the function. Break the algorithm when stopping.deriv < 0.001.
Also compare your answer to glm.model$coefficients and identify how many iterations the algorithm took
to converge.
Choice iii) Minimize the negative log-likelihood by applying Iteratively Reweighted Least Squares.
This function was defined in class and on Lab8. To solve this problem, you do not have to set-up the negative
log-likelihood. Break the algorithm when |β_{i,(t)} − β_{i,(t−1)}| < .0001, for i = 1, 2, 3, 4.
Also compare your answer to glm.model$coefficients and identify how many iterations the algorithm took
to converge.
# Solution
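As a hedged illustration of setting up the negative log-likelihood (checked here with base R's optim() rather than the class grad.descent()/Newton.method() functions, whose interfaces are not reproduced in this document):
# Design matrix (with intercept column) and response from the training data
X <- cbind(1, my.data$X1, my.data$X2, my.data$X3)
y <- my.data$Y

# Negative log-likelihood of the logistic model
neg.loglik <- function(beta) {
  z <- X %*% beta
  -sum(y * z - log(1 + exp(z)))
}

# Sanity check with a general-purpose optimizer; the exam itself asks for
# grad.descent(), Newton.method(), or IRLS from class
optim(rep(0, 4), neg.loglik, method = "BFGS")$par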
Part III: Bayesian Logistic Model and MCMC - Problems (6) - (9)
Consider a Bayesian logistic regression model. This is constructed by placing independent priors on
β0, β1, β2, β3. The full model assumes Y1, . . . , Y50 are independent Bernoulli random variables with success probabilities
E[Y_i] = p_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})}, \quad i = 1, 2, \ldots, 50
and prior distributions
\beta_0 \sim \pi(\beta_0) = \text{Exp}(\lambda = 1/2) \qquad \text{Note: rate} = \lambda
\beta_1 \sim \pi(\beta_1) = N(\mu_1 = -3, \sigma_1^2 = 1)
\beta_2 \sim \pi(\beta_2) = N(\mu_2 = 1, \sigma_2^2 = 1)
\beta_3 \sim \pi(\beta_3) = f(x)
The last prior is the probability density function from Part I
f(b) = -\frac{3}{4}(b^2 - 8b + 15), \quad 3 < b < 5.
The full prior is given by:
pi(β0, β1, β2, β3) = pi(β0)pi(β1)pi(β2)pi(β3)
6) Run the MCMC algorithm using the above Bayesian model. Note: Having the four independent priors
does not over-complicate the exercise. You can simply draw one case from each independent distribution
and put them together, i.e.,
\beta^* = (\beta_0^*, \beta_1^*, \beta_2^*, \beta_3^*) \sim \pi(\beta_0, \beta_1, \beta_2, \beta_3),
where
\beta_0^* \sim \text{Exp}(\lambda = 1/2), \quad \beta_1^* \sim N(\mu_1 = -3, \sigma_1^2 = 1), \quad \beta_2^* \sim N(\mu_2 = 1, \sigma_2^2 = 1), \quad \beta_3^* \sim f(x).
Run 10,000 iterations of the MCMC algorithm and display the first 10 cases of your simulated chain.
Solution:
# Solution
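A minimal sketch of one scheme consistent with the hint above: proposals are drawn directly from the joint prior (an independence Metropolis-Hastings sampler), in which case the acceptance ratio reduces to a likelihood ratio. It reuses X, y, and r.mypdf defined earlier; the exact algorithm from class may differ in its details.
# Log-likelihood of the logistic model at a given beta vector
log.lik <- function(beta) {
  z <- X %*% beta
  sum(y * z - log(1 + exp(z)))
}

# One draw from the joint prior pi(beta0, beta1, beta2, beta3)
r.prior <- function() {
  c(rexp(1, rate = 1/2),           # beta0 ~ Exp(rate = 1/2)
    rnorm(1, mean = -3, sd = 1),   # beta1 ~ N(-3, 1)
    rnorm(1, mean = 1, sd = 1),    # beta2 ~ N(1, 1)
    r.mypdf(1))                    # beta3 ~ f from Part I
}

n.iter <- 10000
chain <- matrix(NA, nrow = n.iter, ncol = 4)
chain[1, ] <- r.prior()
for (t in 2:n.iter) {
  beta.star <- r.prior()                                      # propose from the prior
  ratio <- exp(log.lik(beta.star) - log.lik(chain[t - 1, ]))  # MH ratio = likelihood ratio
  chain[t, ] <- if (runif(1) < ratio) beta.star else chain[t - 1, ]
}
head(chain, 10)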
7) Display four traceplots of parameters β0, β1, β2, β3.
# Solution
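One way to arrange the four traceplots, assuming the chain matrix from Problem (6):
par(mfrow = c(2, 2))
for (j in 1:4) {
  plot(chain[, j], type = "l", xlab = "Iteration",
       ylab = bquote(beta[.(j - 1)]))
}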
8) Remove the first 1000 cases of your simulated chain (10% burn-in) and display four histograms of the
simulated posterior(s). This will be one histogram for each βj .
Solution
# Solution
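A sketch of the burn-in removal and posterior histograms, again assuming the chain matrix from Problem (6):
chain.post <- chain[-(1:1000), ]    # drop the first 1000 cases (10% burn-in)
par(mfrow = c(2, 2))
for (j in 1:4) {
  hist(chain.post[, j], breaks = 30, xlab = "",
       main = bquote("Posterior of " ~ beta[.(j - 1)]))
}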
9) Display the Bayesian estimates produced from this MCMC simulation. Compare your result with the
MLE (or glm() output).
Solution
# Solution
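Using posterior means as the Bayesian estimates (one common choice), the comparison might look like:
bayes.est <- colMeans(chain.post)                   # posterior means after burn-in
names(bayes.est) <- names(glm.model$coefficients)
rbind(Bayes = bayes.est, MLE = glm.model$coefficients)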
Part IV: Split/Apply/Combine and Model Assessment - Problems (10) - (14)
In this section we consider the dataset logistic.final.FULL. The first 6 cases of logistic.final.FULL (stored as
data.full) are displayed below, along with the levels of the variable DataIndex. This dataset
represents 9 different test datasets of size n = 40, each simulated from the same population as my.data.
data.full <- read.csv("logistic.final.FULL.csv")
head(data.full)
## Y X1 X2 X3 DataIndex
## 1 1 -0.8969 -0.3836 0 Set1
## 2 0 0.1848 -1.9591 0 Set1
## 3 0 1.5878 -0.8417 1 Set1
## 4 1 -1.1304 1.9035 0 Set1
## 5 1 -0.0803 0.6225 1 Set1
## 6 1 0.1324 1.9909 0 Set1
levels(factor(data.full$DataIndex))
## [1] "Set1" "Set2" "Set3" "Set4" "Set5" "Set6" "Set7" "Set8" "Set9"
The original dataset from Part II (my.data) is our training data. In this case, the trained model is the
logistic regression estimated via MLE (or IRLS).
# Trained model = gml.model
gml.model <- glm(Y~X1+X2+X3,data=my.data,family=binomial(link = "logit"))
gml.model$coefficients
## (Intercept) X1 X2 X3
## 1.765595 -3.896144 1.652278 2.594704
Set-Up: Test Error in Logistic Regression - Problems (10) - (14)
The following function logistic_error computes the test error of a generic test dataset based on the logistic
regression model. The function has input df.test, which is the test dataframe of size n∗. The function also
has input trained.model, which is our trained logistic model defaulted to gml.model. The test error is
computed by the following process:
i) First: compute the linear prediction based on trained coefficients:
\hat{Z}_{test} = 1.765595 - 3.896144\,X_{1,test} + 1.652278\,X_{2,test} + 2.594704\,X_{3,test}
This will give a vector of length n∗. The β̂'s are estimated based on the single training dataset my.data.
The test features X1,test, X2,test, X3,test come from the dataframe df.test.
ii) Second: evaluate the predicted test probabilities
\hat{p}_{test} = \frac{\exp(\hat{Z}_{test})}{1 + \exp(\hat{Z}_{test})}
This will give a vector of length n∗.
iii) Third: predict Ŷtest = 1 if p̂test > .5 and Ŷtest = 0 otherwise.
iv) Fourth: compute the test error as the proportion of misclassified cases, i.e.,
\text{Error}_{test} = \frac{1}{n^*} \sum_{i=1}^{n^*} I[\hat{Y}_{i,test} \neq Y_{i,test}]
Note: I did all the work for you!! There is no need to run the above process for logistic regression. To
see how the function works, let’s see how well our trained model performs on the 8th test dataset.
logistic_error <- function(df.test, trained.model = gml.model) {
  Y.test <- df.test$Y
  X.test <- data.frame(X1 = df.test$X1, X2 = df.test$X2, X3 = df.test$X3)
  linear.predictions <- predict(trained.model, newdata = X.test)
  p.test <- exp(linear.predictions) / (1 + exp(linear.predictions))
  y.hat.test <- ifelse(p.test > .5, 1, 0)
  test.error <- mean(y.hat.test != Y.test)
  return(test.error)
}
test.data.8 <- data.full[data.full$DataIndex=="Set8",]
logistic_error(df.test=test.data.8,trained.model=gml.model)
## [1] 0.05
The trained logistic model misclassified 5% of the cases for this test set.
Tasks: Problems (10) - (13)
10) Run the Split/Apply/Combine procedure by splitting data.full by the variable DataIndex and
computing the test error for each test dataset. The final result should be a vector of length 9. This
solution should be done using base R commands.
Solution
# Solution
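A base R sketch using split() followed by sapply(), with logistic_error doing the per-dataset work:
data.split <- split(data.full, data.full$DataIndex)   # split by DataIndex
test.errors <- sapply(data.split, logistic_error)     # apply + combine
test.errors                                           # named vector of length 9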
11) Reproduce problem (10) using a function from the plyr package.
Solution
# Solution
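One plyr option is daply(), which splits the data frame by DataIndex and returns the per-group test errors as a named vector:
library(plyr)
daply(data.full, .(DataIndex), logistic_error)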
12) Reproduce the following code using a pipe from the purrr package. Note that this is not a
Split/Apply/Combine problem. Your code should simply compute the trained logistic model with
functions from purrr. You should still use the glm() function.
trained.model <- glm(Y~X1+X2+X3,data=my.data,family=binomial(link = "logit"))
trained.model$coefficients
## (Intercept) X1 X2 X3
## 1.765595 -3.896144 1.652278 2.594704
Solution
# Solution
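A rough purrr sketch: wrapping my.data in a list lets map() fit the model inside a pipe, and pluck() extracts the single fitted object (one of several reasonable arrangements):
library(purrr)
trained.model <- list(my.data) %>%
  map(~ glm(Y ~ X1 + X2 + X3, data = .x,
            family = binomial(link = "logit"))) %>%
  pluck(1)
trained.model$coefficients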
13) Reproduce problem (10 or 11) using a pipe from the purrr package. You can still use the base R
functions split() and glm().
# Solution
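A possible purrr pipe mirroring Problem (10), using base R's split() and then map_dbl():
library(purrr)
data.full %>%
  split(.$DataIndex) %>%
  map_dbl(logistic_error)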
Extra Credit: Problem (14)
Compute the test error based on our Bayesian model from Part (III). Compute the error on each test dataset.
Compare the MLE and Bayes’ models. Note: This problem allows up to 10 extra credit exam
points.
# Solution
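A hedged sketch of one approach: plug the posterior-mean coefficients (bayes.est from Problem (9)) into the same classification rule and reuse the Split/Apply/Combine machinery. The helper name bayes_error is an illustrative choice, not part of the exam code.
bayes_error <- function(df.test, beta = bayes.est) {
  z <- beta[1] + beta[2]*df.test$X1 + beta[3]*df.test$X2 + beta[4]*df.test$X3
  p <- exp(z) / (1 + exp(z))                   # predicted test probabilities
  mean(ifelse(p > .5, 1, 0) != df.test$Y)      # misclassification rate
}
sapply(split(data.full, data.full$DataIndex), bayes_error)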