ACTL90023 DATA ANALYTICS IN INSURANCE I
Practice Assessment, Semester 1, 2021
Total Marks: 30 marks
Number of pages:
Authorised materials: R; lecture material
Instructions to students:
This is a practice exam only. The question types are consistent with those in the final assessment, although this practice paper is shorter.
You need to submit, by the due time of this test, a single PDF file named ACTL90023xxxxxx.pdf, where xxxxxx is your student ID.
You will need to use R to complete this exam where required. Creating the final submission with R Markdown is the preferable way of editing your answers. You may use an R script file to produce your final submission, but then you will need to include the required outputs and/or handwritten answers manually.
Question One (2+4+4 = 10 marks)
You are given a data set which is named “ACTL90023_practice_2021.csv”. You are asked to build statistical
models that help to predict the response variable Y given new predictor observations.
(a)
Load the data set and build a multiple linear regression model using the data. Describe how well the obtained
model fits the given data, supporting your description with at least two pieces of numerical evidence.
(b)
You suspect that there might be a non-linear relationship between the response Y and the two
predictors.
• Present a polynomial regression model using the data. Optimise the degrees of any polynomial terms in
your model and explain how these best degrees are selected.
• Is this non-linear model a better fit to the given data than the MLR obtained in (a)? Why?
(c)
• Use the LOOCV method to estimate the test MSE of the models you obtained in (a) and (b).
• Which model is likely to give a more accurate prediction for any new observations? Why?
• Can the estimated test MSE accurately represent the true test MSE value? Why?
Question One Solutions:
(a)
data=read.csv("ACTL90023_practice_2021.csv")
str(data)
## 'data.frame': 30 obs. of 3 variables:
## $ X1: num 2.78 2.63 2.29 2.57 2.71 ...
## $ X2: num 113 119 101 188 20 ...
## $ Y : num 99.9 92.6 34.7 78.7 68 ...
attach(data)
lm.fit=lm(Y~., data)
summary(lm.fit)
##
## Call:
## lm(formula = Y ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7900 -5.5297 -0.3682 4.1799 19.5911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -226.8882 17.2725 -13.136 3.05e-13 ***
## X1 108.2979 6.9420 15.600 4.96e-15 ***
## X2 0.1736 0.0257 6.753 2.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.566 on 27 degrees of freedom
## Multiple R-squared: 0.927, Adjusted R-squared: 0.9216
## F-statistic: 171.3 on 2 and 27 DF, p-value: 4.547e-16
The MLR obtained above is a good fit to the given data. Firstly, the overall relationship between the response
variable and the two predictors is highly significant, with an F-test p-value of 4.55 × 10^-16. Secondly, R² = 92.7%, which
also indicates that the MLR is a strong fit.
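If desired, the two pieces of numerical evidence can be pulled out of the fitted object directly (a sketch, assuming the lm.fit object fitted above):

```r
s = summary(lm.fit)
s$r.squared        # multiple R-squared (about 0.927 here)
# overall F-test p-value, recomputed from the stored F statistic
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
```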
(b)
poly.fit=lm(Y~poly(X1,5)+poly(X2,5),data=data)
summary(poly.fit)
##
## Call:
## lm(formula = Y ~ poly(X1, 5) + poly(X2, 5), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4999 -1.7411 0.0192 2.6680 7.3079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.745 0.846 74.167 < 2e-16 ***
## poly(X1, 5)1 120.097 5.372 22.355 4.17e-15 ***
## poly(X1, 5)2 13.413 4.758 2.819 0.010959 *
## poly(X1, 5)3 -21.217 5.338 -3.975 0.000812 ***
## poly(X1, 5)4 -3.592 4.833 -0.743 0.466439
## poly(X1, 5)5 6.229 4.836 1.288 0.213251
## poly(X2, 5)1 43.778 5.508 7.948 1.85e-07 ***
## poly(X2, 5)2 -11.919 4.701 -2.535 0.020182 *
## poly(X2, 5)3 -9.746 4.877 -1.998 0.060212 .
## poly(X2, 5)4 8.163 4.964 1.644 0.116554
## poly(X2, 5)5 10.088 5.086 1.983 0.061961 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.634 on 19 degrees of freedom
## Multiple R-squared: 0.9807, Adjusted R-squared: 0.9706
## F-statistic: 96.66 on 10 and 19 DF, p-value: 4.04e-14
poly.fit2=lm(Y~poly(X1,3)+poly(X2,2),data=data)
summary(poly.fit2)
##
## Call:
## lm(formula = Y ~ poly(X1, 3) + poly(X2, 2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2453 -3.1879 -0.8365 2.6919 11.4377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.7447 0.9751 64.347 < 2e-16 ***
## poly(X1, 3)1 121.0470 5.5155 21.947 < 2e-16 ***
## poly(X1, 3)2 15.3261 5.3580 2.860 0.008625 **
## poly(X1, 3)3 -25.2440 5.9896 -4.215 0.000306 ***
## poly(X2, 2)1 40.3076 6.1094 6.598 7.99e-07 ***
## poly(X2, 2)2 -12.3973 5.3995 -2.296 0.030714 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.341 on 24 degrees of freedom
## Multiple R-squared: 0.9676, Adjusted R-squared: 0.9609
## F-statistic: 143.6 on 5 and 24 DF, p-value: < 2.2e-16
• From the first polynomial model that we try, one can see that X1 probably needs degree 3 and X2
degree 2, since the higher-order terms are not significant. Of course, a CV approach can be used to find the best degrees as well. The best polynomial
regression model is given in poly.fit2.
• Yes. The best polynomial regression model has a higher R² value (and a higher adjusted R²), which means it fits the
given data better than the MLR in (a).
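A CV-based degree search could be sketched as follows (an illustration only; it assumes the data frame data from (a), and the cap of degree 5 is an arbitrary choice):

```r
library(boot)
cv.mse = matrix(NA, 5, 5)
for (d1 in 1:5) {
  for (d2 in 1:5) {
    fit = glm(Y ~ poly(X1, d1) + poly(X2, d2), data = data)
    cv.mse[d1, d2] = cv.glm(data, fit)$delta[1]  # LOOCV estimate of test MSE
  }
}
which(cv.mse == min(cv.mse), arr.ind = TRUE)  # row = best degree for X1, col = best degree for X2
```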
(c)
library(boot)
glm.fit1=glm(Y~.,data=data)
cv.err1=cv.glm(data,glm.fit1)
cv.err1$delta[1]
## [1] 64.58769
glm.fit2=glm(Y~poly(X1,3)+poly(X2,2),data=data)
cv.err2=cv.glm(data,glm.fit2)
cv.err2$delta[1]
## [1] 36.21384
• The estimated test MSE for the MLR in (a) is 64.59, while that for the polynomial regression model
in (b) is 36.21. So the polynomial regression model tends to give a more accurate prediction than the
MLR.
• In general, LOOCV estimates the test MSE properly: it is approximately unbiased. However, as the data set
contains only 30 observations, the estimate can be quite variable, so it may over- or under-estimate the
true test MSE.
Question Two (3+1+4+2+3+2 = 15 marks)
This question aims to use a simulated data set to study a classification problem.
(a) Generate two predictors X1 and X2 with n = 50 as follows:
• the first 25 observations of X1 are normal with mean 2 and s.d. 1;
• the remaining 25 observations of X1 are normal with mean -1 and s.d. 2;
• the first 25 observations of X2 are normal with mean 10 and s.d. 5;
• the remaining 25 observations of X2 are normal with mean 8 and s.d. 2.
(b) Generate a response Y that takes value 1 for the first 25 observations and takes value 2 for the remaining
25 observations. Note here both 1 and 2 are labels.
(c) Use an appropriate discriminant method to construct a classifier for Y, and state your reasons for
choosing that method.
(d) Describe the shape of the decision boundary of the classifier you build in (c) with reasons.
(e) Construct a confusion matrix using the classifier you obtain in (c) and calculate the correct classification
rate of this classifier on this simulated data set.
(f) If you build another classifier that gives you a lower correct classification rate on this simulated data
set, then can you conclude that the classifier that you obtain in (c) is a better one? Why?
Question Two Solutions:
(a)
set.seed(1)
x=matrix(rnorm(50*2), ncol=2)
x[1:25,1]=x[1:25,1]+2
x[1:25,2]=x[1:25,2]*5+10
x[26:50,1]=x[26:50,1]-1
x[26:50,2]=x[26:50,2]*2+8  # s.d. 2, as specified in (a)
plot(x)
[Figure: scatter plot of x[,2] against x[,1]]
(b)
y=rep(1, 50)
y[26:50]=2
y=as.factor(y)
data=data.frame(y, x)
str(data)
## 'data.frame': 50 obs. of 3 variables:
## $ y : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ X1: num 1.37 2.18 1.16 3.6 2.33 ...
## $ X2: num 11.99 6.94 11.71 4.35 17.17 ...
(c)
According to the way the data set is generated, we know that the two predictors follow a two-dimensional
normal distribution under each class of the response, and these two normal distributions have different covariance
matrices. Therefore, QDA is a suitable approach to build a classifier for the response.
library(MASS)
qda.fit=qda(y~.,data=data)
qda.fit
## Call:
## qda(y ~ ., data = data)
##
## Prior probabilities of groups:
## 1 2
## 0.5 0.5
##
## Group means:
## X1 X2
## 1 2.1686652 10.827045
## 2 -0.9677686 8.346219
(d)
We will get a quadratic decision boundary using the QDA approach: because QDA does not assume a common covariance matrix across the two classes, the log-ratio of the two Gaussian densities is a quadratic function of (X1, X2), so the set of points where the two posterior probabilities are equal is a quadratic curve.
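One way to see the quadratic shape is to draw the 0.5-posterior contour over a grid (a sketch, assuming the data frame data and the fitted qda.fit from above):

```r
grid = expand.grid(X1 = seq(min(data$X1), max(data$X1), length = 100),
                   X2 = seq(min(data$X2), max(data$X2), length = 100))
post = predict(qda.fit, grid)$posterior[, 1]  # posterior probability of class 1
plot(data$X1, data$X2, col = data$y, pch = 19)
contour(unique(grid$X1), unique(grid$X2), matrix(post, 100, 100),
        levels = 0.5, add = TRUE)  # equal posteriors: the decision boundary
```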
(e)
qda.class=predict(qda.fit,data)$class
table(qda.class,y)
## y
## qda.class 1 2
## 1 23 0
## 2 2 25
mean(qda.class==y)
## [1] 0.96
The correct classification rate is 96%.
(f)
Having a higher correct classification rate on the training set, i.e. a lower training error rate, doesn't guarantee
a higher correct classification rate on new data. On the contrary, a very low training error rate
might indicate over-fitting. So we can't conclude that the classifier obtained in (c) is definitely the
better one.
Question Three (3 + 2 = 5 marks)
This is an unsupervised learning problem. There are four observations in a given data set, labelled as A, B,
C and D, which have the following dissimilarity matrix (rows and columns in the order A, B, C, D):

      A     B     C     D
A     0   0.2   0.4   0.6
B   0.2     0   0.1   0.5
C   0.4   0.1     0   0.3
D   0.6   0.5   0.3     0

For example, the dissimilarity between A and C is 0.4.
(a) (3 marks)
On the basis of this dissimilarity matrix, sketch the average linkage dendrogram that results from hierarchically
clustering these four observations. Indicate in your dendrogram the height at which each fusion occurs, as
well as the observations corresponding to each leaf in the dendrogram.
(b) (2 marks)
Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Suggest a height of cutting
and find which observations are in each cluster.
Question Three solutions:
(a)
You can draw this dendrogram by hand and present the scanned picture here, or produce it in R:
d = as.dist(matrix(c(0, 0.2, 0.4, 0.6,
0.2, 0, 0.1, 0.5,
0.4, 0.1, 0, 0.3,
0.6, 0.5, 0.3, 0), nrow=4))
hc = hclust(d, method="average")
plot(hc, labels=c("A","B","C","D"))
[Dendrogram: B and C fuse at height 0.1; A then joins {B, C} at height (0.2 + 0.4)/2 = 0.3; finally D joins {A, B, C} at height (0.6 + 0.5 + 0.3)/3 ≈ 0.467.]
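The fusion heights can be checked numerically (a sketch, assuming the dissimilarity object d defined above):

```r
hc = hclust(d, method = "average")
hc$height                 # the three fusion heights: 0.1, 0.3, 0.4666...
mean(c(0.2, 0.4))         # A vs {B, C}: average of d(A,B) and d(A,C)
mean(c(0.6, 0.5, 0.3))    # D vs {A, B, C}: average of the three remaining dissimilarities
```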