STAT4064 Practice Exam
Instructions
• Time allowed: 2 hours.
• The exam has 3 questions for a total of 58 marks available.
• Marks are allocated to questions and parts of questions as indicated.
• Where comments or justifications are required, full marks will only be given if comments/reasons are
given as part of the answer.
• Include relevant output (sparingly) in your answers. Figures, if required, need to be included in your
answers in addition to interpretation of said figures. R code alone is not an acceptable answer to a
question.
• This is an open book exam, carried out either online on your computer or on a computer provided at
the University.
• Calculations are to be performed in R.
• Answers to be submitted via LMS.
• All questions or parts of questions may be attempted.
• You are allowed to use USB sticks, hard drives, books, lecture notes, hand-written notes, pens and
paper.
• Limited use of the internet is allowed for looking up reference material, R programming resources,
etc.; however, communication of any other kind is strictly prohibited. That is to say:
– No part of this assessment will be written for you, provided to you by a third party, or will be
completed as a result of a third party assisting you.
– You will take proper and reasonable care to prevent this work from being copied by another
student.
– You will not assist other students in completion of this assessment, except where collaboration is
explicitly authorised by the unit coordinator.
– You will not reproduce, capture, record or share any questions from this exam.
Questions
1. [15 marks] Consider the ISLR::Auto data.
(a) [2 marks] Check that the data have no missing values, e.g. using na.omit(). If you have any
missing values, remove them. Then, what is the dimension of the data? List the variables and
their interpretation.
(b) [2 marks] Produce a parallel coordinate plot for the variables mpg, cylinders, displacement,
horsepower, weight, acceleration, year. Comment on the plot.
(c) [2 marks] Compute the correlation matrix of all variables used in Q1(b) and state the pairs of
variables that have the highest and the lowest (absolute) correlation. Give the values of these correlations.
(d) [1 mark] Select the subset of data corresponding to year >= 73. Call this subset Auto73. How
many observations are there in Auto73?
(e) [3 marks] Fit a simple linear regression model of acceleration as response against horsepower
using the Auto73 data. Show a scatterplot of the observations in Auto73 with acceleration on
the vertical axis and horsepower on the horizontal axis. State the equation of the least squares
regression line, and include this line in the scatterplot.
(f) [2 marks] Show a residual plot resulting from the linear regression analysis of Q1(e) and comment.
(g) [3 marks] Using the regression model fit in Q1(e), what is the predicted acceleration associated with
horsepower = 93 and horsepower = 175 respectively? Calculate the associated 95% confidence
and prediction intervals corresponding to these two values of horsepower. Comment on the
relative size of these intervals.
2. [20 marks] Consider the ISLR::Auto data.
(a) [2 marks] Create a new variable, mpgclass, such that:
\[
\text{mpgclass} =
\begin{cases}
0 & \text{if mpg} \le 23 \\
1 & \text{if mpg} > 23
\end{cases}
\]
How many observations are in each of the two resulting classes?
(b) [2 marks] Describe what a confusion matrix for a 2-class classification problem looks like (e.g. in
the form of a table) and explain how the information provided in such a matrix can be interpreted.
(c) [3 marks] Fit a logistic regression model with mpgclass as the response and weight,
acceleration and horsepower as predictors, without interaction terms. Calculate the
classification error. Present your result in a confusion matrix. Summarise your results with a
comment.
(d) [2 marks] In the context of classification, explain “the validation set approach”. You may include
in your answer a brief description of how you might choose a validation set.
(e) [4 marks] Use observations 85:184 as the validation set, and use the complementary set of
observations as the training set. Fit the logistic regression of Q2(c) using the training data. Use
the resulting model to make class predictions for the training data, and present the results in a
confusion matrix. Comment on these results.
(f) [3 marks] Test the model obtained in Q2(e) on the validation set and report your findings in a
confusion matrix. What is the test error?
(g) [2 marks] Explain how the choice of validation set can affect the test results.
(h) [2 marks] Compare the results obtained in the three confusion matrices and interpret these
comparisons.
3. [23 marks] Consider the ISLR::Carseats data.
(a) [1 mark] What is the dimension of the data? List the variables. We will use all variables except
ShelveLoc. Create a variable corresponding to these data without the ShelveLoc variable, and
call that variable Carseats10.
(b) [4 marks] Describe K-means and hierarchical methods for cluster analysis (in no more than 10
lines of text). In your description, make sure to highlight some advantages and disadvantages of
the two approaches when compared to one another.
(c) [2 marks] Perform K-means clustering of the Carseats10 data for K = 2 with the option
nstart = 20. Report:
• the cluster mean and size of each of the two clusters,
• the total sum of squares (returned from kmeans() as totss),
• the between-cluster sum of squares (betweenss), the within-cluster sum of squares (withinss),
and the total within-cluster sum of squares (tot.withinss).
(d) [4 marks] Are the two clusters produced in Q3(c) similar in size and variability? Comment, and
refer to values reported in Q3(c) in your comment. What is the relationship between the various
values reported in Q3(c)?
(e) [3 marks] Why could you expect to obtain different results when you perform the K-means
algorithm repeatedly for the same K with nstart = 1? Which of the outcome quantities listed in
Q3(c) would you expect to change? Which would you expect to increase? Give reasons for your
answer.
(f) [6 marks] For each of the values of K = 3, 4, ..., 7 repeat the cluster analysis of Q3(c), using
nstart = 20 each time. For each value of K, save the results and, combined with the results of Q3(c),
report them by producing the following graphs:
(i) the total within-cluster sum of squares versus K,
(ii) the ratio of between-cluster sum of squares and total sum of squares versus K, and
(iii) the ratio of between-cluster sum of squares and total within-cluster sum of squares versus K.
Comment on what you observe in the graphs, including a proposed explanation for why you might
observe what you observe.
(g) [3 marks] Based on the analysis and graphical results in the previous parts of Q3, state the
number of clusters you would choose to use for the Carseats10 data. Explain your reasoning for
this choice.