ACTL90023 Data Analytics in Insurance 1 -
2021 Assignment 2
1 Data
This assignment is based on the data set “Assignt2 data.csv”, which can
be downloaded from Canvas. The data set comes from an ongoing
cardiovascular study of residents of a town in the USA. The classification
goal is to predict whether a patient has a 10-year risk of future coronary
heart disease (CHD). The data set provides the patients’ information. It
includes over 4,000 records and 15 variables.
The data contain 14 predictors. Each predictor is a potential demographic,
behavioral or medical risk factor.
• sex: male or female (1 = male; 0 = female)
• age: age of the patient (numerical)
• smoker: whether or not the patient is a current smoker (1 = Yes; 0 =
No)
• cigs: the number of cigarettes that the person smoked on average in
one day (numerical)
• BPMeds: whether or not the patient was on blood pressure medication
(1 = Yes; 0 = No)
• stroke: whether or not the patient had previously had a stroke (1 =
Yes; 0 = No)
• Hyp: whether or not the patient was hypertensive (1 = Yes; 0 = No)
• diabetes: whether or not the patient had diabetes (1 = Yes; 0 = No)
• chol: total cholesterol level (numerical)
• sysBP: systolic blood pressure (numerical)
• diaBP: diastolic blood pressure (numerical)
• BMI: Body Mass Index (numerical)
• heartRate: heart rate (numerical)
• glucose: glucose level (numerical)
• CHD10 (response): 10-year risk of coronary heart disease (CHD) (1 =
Yes; 0 = No)
2.1 Descriptive analysis of the data set
Here you will need to perform some descriptive analysis of the data set.
Consider the variable CHD10 as the response variable.
1. Load the data set and check it. Perform any necessary data cleaning.
2. Show a numerical summary of the variables in the data.
3. Draw plots that display the relationships among variables that you
consider significant or interesting.
4. Split the given data into a training set containing 80% of the
observations and a test set containing the remaining 20%. Specify
a random seed for this step.
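The seeded 80/20 split in step 4 can be sketched as follows. The assignment itself is to be completed in R (where `set.seed()` and `sample()` play the same role); this sketch uses plain Python only to illustrate the logic. The function name `train_test_split`, the seed value 2021 and the toy `rows` list are assumptions made for illustration, not part of the assignment.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=2021):
    """Shuffle row indices with a fixed seed, then cut at train_fraction."""
    rng = random.Random(seed)            # local generator, so the seed is explicit
    indices = list(range(len(rows)))
    rng.shuffle(indices)                 # random order, reproducible for a given seed
    n_train = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

# Toy stand-in for the real records (hypothetical, not the assignment file).
rows = list(range(10))
train, test = train_test_split(rows, seed=2021)
print(len(train), len(test))
```

Because the generator is seeded, calling the function again with the same seed returns the same split, which is what makes the later error rates reproducible.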
2.2 Logistic regression
In this part you will need to finish the following tasks:
1. Use the training set to build a logistic regression model using all
predictors.
2. Discuss the significance of the relationship between the response
variable and the predictors, and suggest any improvements to your model.
3. Calculate the confusion matrix using your best logistic model and find
its training error rate.
4. Generate predictions on the test set using the best logistic model.
Calculate the confusion matrix and find the test error rate.
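The confusion matrix and error rate used in tasks 3 and 4 can be illustrated with a small sketch. It is written in plain Python purely to show the arithmetic (the assignment itself is to be done in R). The 0.5 probability cutoff, the helper names and the toy `actual` and `probs` values are assumptions, not results from the assignment data.

```python
def classify(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels (0.5 cutoff is an assumption)."""
    return [1 if pr > threshold else 0 for pr in probabilities]

def confusion_matrix(actual, predicted):
    """Count observations for each (actual, predicted) pair of binary labels."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

def error_rate(actual, predicted):
    """Share of observations whose predicted label differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical labels and fitted probabilities, for illustration only.
actual = [0, 0, 1, 1, 0, 1]
probs = [0.10, 0.60, 0.80, 0.30, 0.20, 0.70]

predicted = classify(probs)
cm = confusion_matrix(actual, predicted)
rate = error_rate(actual, predicted)
print(cm, rate)
```

The error rate is the sum of the two off-diagonal cells divided by the number of observations. The same calculation applies to both the training set and the test set.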
2.3 Linear discriminant analysis (LDA)
In this part you will need to finish the following tasks:
1. Use the training set to build an LDA model using appropriate predictors.
2. Suggest any potential improvement to your model.
3. Calculate the confusion matrix using your best LDA model and find
its training error rate.
4. Generate predictions on the test set using the best LDA model.
Calculate the confusion matrix and find the test error rate.
2.4 KNN method
In this part you will need to finish the following tasks:
1. Use the training set to implement the KNN method to generate
predictions on the test set.
2. Discuss which value of K gives you the best predictions.
3. Generate predictions on the test set using the best KNN model that
you find. Calculate the confusion matrix and find the test error rate.
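To make the role of K concrete, here is a minimal from-scratch sketch of KNN classification, written in plain Python for illustration only (the assignment itself is to be done in R). It assumes Euclidean distance and a simple majority vote. The function names and the toy training and test points are hypothetical and are not taken from the assignment data.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in distances[:k])
    return votes.most_common(1)[0][0]

def knn_error(train_x, train_y, test_x, test_y, k):
    """Test error rate of KNN for a given k."""
    wrong = sum(
        1 for x, y in zip(test_x, test_y)
        if knn_predict(train_x, train_y, x, k) != y
    )
    return wrong / len(test_y)

# Hypothetical two-feature data; the last training point is a deliberately noisy label.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (3.0, 3.2), (3.1, 2.9), (2.9, 3.0), (2.8, 2.8)]
train_y = [0, 0, 0, 1, 1, 1, 0]
test_x = [(1.1, 1.0), (2.7, 2.8), (3.0, 3.1)]
test_y = [0, 1, 1]

# Odd values of k avoid tied votes with two classes.
errors = {k: knn_error(train_x, train_y, test_x, test_y, k) for k in (1, 3, 5)}
print(errors)
```

In this toy example K = 1 follows the noisy training point and misclassifies one test observation, while K = 3 and K = 5 do not. Since KNN is distance-based, predictors measured on very different scales may need to be standardized first.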
2.5 Tree-based methods
In this part you will need to finish the following tasks:
1. Use the training set to build a classification tree.
2. Generate predictions on the test set using the tree that you build.
Calculate the confusion matrix and find the test error rate.
3. Are there any ways to improve the prediction performance of the
decision tree? Try them and show the improved results.
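As background for task 1, a classification tree is commonly grown by choosing, at each node, the split that most reduces an impurity measure such as the Gini index. The sketch below shows that single step for one numerical predictor. It is plain Python for illustration only (the assignment itself is to be done in R), the function names are hypothetical, and the `ages` and `chd` values are made up rather than taken from the assignment data.

```python
def gini(labels):
    """Gini impurity for binary 0/1 labels: 1 - p0^2 - p1^2 = 2 * p1 * (1 - p1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2 * p1 * (1 - p1)

def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting on value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try each observed value (except the largest) as a cut point."""
    candidates = sorted(set(values))[:-1]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

# Hypothetical ages and CHD labels, for illustration only.
ages = [35, 42, 48, 55, 61, 67]
chd = [0, 0, 0, 1, 1, 1]

cut = best_threshold(ages, chd)
print(cut, split_impurity(ages, chd, cut))
```

A full tree repeats this search over all predictors and then recursively within each resulting node.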
2.6 Model comparison
In this part you will need to finish the following tasks:
1. Compare the model prediction performance based on the results you
obtain in sections 2.2-2.5 and suggest the best classification approach.
2. Change the random seed in section 2.1.4 and generate new training/test
subsets. Repeat all tasks in 2.2-2.5 to see whether you still reach the
same conclusion as the first time. You need to repeat the whole procedure
at least three times to see whether your conclusions are consistent.
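The repeat-over-seeds procedure in task 2 can be organized as a loop that re-splits the data and records each method's test error per seed. The sketch below is plain Python for illustration only (the assignment itself is to be done in R). The two "methods" are deliberately trivial placeholders standing in for the models from sections 2.2-2.5, and the data, seeds and function names are all hypothetical.

```python
import random

def split(rows, seed, train_fraction=0.8):
    """Seeded shuffle followed by an 80/20 cut."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def majority_error(train, test):
    """Placeholder method: always predict the most common training label."""
    ones = sum(label for _, label in train)
    guess = 1 if ones * 2 > len(train) else 0
    return sum(1 for _, label in test if label != guess) / len(test)

def threshold_error(train, test):
    """Placeholder method: predict 1 when x exceeds the training mean of x."""
    cut = sum(x for x, _ in train) / len(train)
    return sum(1 for x, label in test if (1 if x > cut else 0) != label) / len(test)

def compare_over_seeds(rows, seeds, methods):
    """Test error of every method for every seed."""
    table = {}
    for seed in seeds:
        train, test = split(rows, seed)
        table[seed] = {name: fn(train, test) for name, fn in methods.items()}
    return table

# Hypothetical (x, label) records, for illustration only.
rows = [(x, 1 if x >= 12 else 0) for x in range(20)]
methods = {"majority": majority_error, "threshold": threshold_error}

table = compare_over_seeds(rows, seeds=[1, 2, 3], methods=methods)
for seed, errors in table.items():
    print(seed, errors, "best:", min(errors, key=errors.get))
```

Laying the results out per seed makes it easy to see whether the same method has the lowest test error each time.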
3 Instructions
• This assignment is due at 5pm on Sunday 30th May. As with
Assignment 1, Assignment 2 must be submitted in Canvas.
• You should generate an R markdown document and then produce a