ACTL90023 Data Analytics in Insurance 1 - 2021 Assignment 2

1 Data

This assignment is based on a data set named “Assignt2 data.csv”, which can be downloaded from Canvas. The data set comes from an ongoing cardiovascular study of residents of a town in the USA. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The data set provides the patients’ information. It includes over 4,000 records and 15 variables.

The data contain 14 predictors. Each predictor is a potential demographic, behavioral or medical risk factor.

• sex: male or female (1 = male; 0 = female)

• age: age of the patient (numerical)

• smoker: whether or not the patient is a current smoker (1 = Yes; 0 = No)

• cigs: the average number of cigarettes the patient smoked per day (numerical)

• BPMeds: whether or not the patient was on blood pressure medication (1 = Yes; 0 = No)

• stroke: whether or not the patient had previously had a stroke (1 = Yes; 0 = No)

• Hyp: whether or not the patient was hypertensive (1 = Yes; 0 = No)

• diabetes: whether or not the patient had diabetes (1 = Yes; 0 = No)

• chol: total cholesterol level (numerical)

• sysBP: systolic blood pressure (numerical)

• diaBP: diastolic blood pressure (numerical)

• BMI: Body Mass Index (numerical)

• heartRate: heart rate (numerical)

• glucose: glucose level (numerical)

• CHD10 (response): 10-year risk of coronary heart disease (CHD) (1 = Yes; 0 = No)

2 Tasks

2.1 Descriptive analysis of the data set

Here you will need to perform some descriptive analysis of the data set.

Consider the variable CHD10 as the response variable.

1. Load the data set and check it. Perform any necessary treatments on the data set.

2. Show a numerical summary of the variables in the data.

3. Draw plots that display relationships among the variables that you think are significant or interesting.

4. Generate a training data set that contains 80% of the observations in the given data, and use the remaining observations as the test set. Specify a random seed for this step.
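
The checking and splitting steps above can be sketched as follows. This is only an illustration under stated assumptions: the data frame is a small synthetic stand-in (the real file would be loaded with read.csv), dropping incomplete rows is just one possible treatment, and the seed value 2021 is an arbitrary choice.

```r
# Synthetic stand-in for the assignment data (assumption); the real file
# would be loaded with read.csv("Assignt2 data.csv"), which is not run here.
set.seed(1)
n <- 500
dat <- data.frame(
  age   = round(runif(n, 32, 70)),
  sysBP = rnorm(n, 132, 22),
  cigs  = rpois(n, 9)
)
dat$CHD10 <- rbinom(n, 1, plogis(-7 + 0.06 * dat$age + 0.02 * dat$sysBP))
dat$cigs[sample(n, 20)] <- NA            # inject some missing values

# Check the data: count missing values per column
print(colSums(is.na(dat)))

# One possible treatment (assumption): drop incomplete rows and
# store the response as a factor
dat <- na.omit(dat)
dat$CHD10 <- factor(dat$CHD10, levels = c(0, 1))

# Numerical summary of every variable
print(summary(dat))

# 80/20 split with a fixed seed (2021 is an arbitrary, hypothetical choice)
set.seed(2021)
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```

Fixing the seed immediately before sample() makes the split reproducible, which matters later when the analysis is repeated with other seeds.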

2.2 Logistic regression

In this part you will need to finish the following tasks:

1. Use the training set to build a logistic regression model using all predictors.

2. Discuss the significance of the relationship between the response variable and the predictors, and suggest any improvements to your model.

3. Calculate the confusion matrix using your best logistic model and find its training error rate.

4. Generate predictions on the test set using the best logistic model. Calculate the confusion matrix and find the test error rate.
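
A minimal sketch of this workflow is shown below, run on synthetic stand-in data rather than the assignment file. Backward stepwise selection by AIC via step() and the 0.5 probability cutoff are assumptions made for illustration; other ways of improving the model and other cutoffs may be equally reasonable. The helper err_rate is hypothetical.

```r
# Synthetic stand-in data (assumption), with a few of the handout's column names
set.seed(1)
n <- 500
dat <- data.frame(
  age       = round(runif(n, 32, 70)),
  sysBP     = rnorm(n, 132, 22),
  heartRate = rnorm(n, 75, 12)
)
dat$CHD10 <- factor(
  rbinom(n, 1, plogis(-7 + 0.06 * dat$age + 0.02 * dat$sysBP)),
  levels = c(0, 1)
)
idx   <- sample(n, floor(0.8 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Full model with all predictors
fit_all <- glm(CHD10 ~ ., data = train, family = binomial)
print(summary(fit_all)$coefficients)     # estimates, z values and p-values

# One possible improvement (assumption): stepwise selection by AIC
fit_best <- step(fit_all, trace = 0)

# Hypothetical helper: confusion matrix and error rate at a given cutoff
err_rate <- function(fit, newdata, cutoff = 0.5) {
  prob <- predict(fit, newdata = newdata, type = "response")
  pred <- factor(as.integer(prob > cutoff), levels = c(0, 1))
  cm   <- table(predicted = pred, actual = newdata$CHD10)
  list(confusion = cm, error = 1 - sum(diag(cm)) / sum(cm))
}

train_res <- err_rate(fit_best, train)
test_res  <- err_rate(fit_best, test)
print(test_res$confusion)
print(c(train = train_res$error, test = test_res$error))
```

Because CHD10 is a factor with levels 0 and 1, glm() models the probability of the second level, so the fitted probabilities refer to CHD10 = 1.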

2.3 Linear discriminant analysis (LDA)

In this part you will need to finish the following tasks:

1. Use the training set to build an LDA model using appropriate predictors.

2. Suggest any potential improvements to your model.

3. Calculate the confusion matrix using your best LDA model and find its training error rate.

4. Generate predictions on the test set using the best LDA model. Calculate the confusion matrix and find the test error rate.
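
The LDA tasks can be sketched with lda() from the MASS package, as below. The data are a synthetic stand-in, and the choice of age and sysBP as predictors is purely illustrative, not a recommendation for the real data.

```r
library(MASS)

# Synthetic stand-in data (assumption)
set.seed(1)
n <- 500
dat <- data.frame(
  age   = round(runif(n, 32, 70)),
  sysBP = rnorm(n, 132, 22)
)
dat$CHD10 <- factor(
  rbinom(n, 1, plogis(-7 + 0.06 * dat$age + 0.02 * dat$sysBP)),
  levels = c(0, 1)
)
idx   <- sample(n, floor(0.8 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Fit LDA on the training set (predictor choice is illustrative)
lda_fit <- lda(CHD10 ~ age + sysBP, data = train)

# Training confusion matrix and error rate
pred_train <- predict(lda_fit, newdata = train)$class
cm_train   <- table(predicted = pred_train, actual = train$CHD10)
train_err  <- mean(pred_train != train$CHD10)

# Test confusion matrix and error rate
pred_test <- predict(lda_fit, newdata = test)$class
cm_test   <- table(predicted = pred_test, actual = test$CHD10)
test_err  <- mean(pred_test != test$CHD10)

print(cm_test)
print(c(train = train_err, test = test_err))
```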

2.4 KNN method

In this part you will need to finish the following tasks:

1. Use the training set to implement the KNN method to generate predictions on the test set.

2. Discuss the choice of K that gives you the best predictions.

3. Generate predictions on the test set using the best KNN model that you find. Calculate the confusion matrix and find the test error rate.
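
One way to approach the KNN tasks is sketched below using knn() and knn.cv() from the class package on synthetic stand-in data. Two choices here are assumptions rather than requirements of the assignment: predictors are standardised with training-set means and standard deviations, and K is picked by leave-one-out cross-validation on the training set over a hypothetical grid of odd values.

```r
library(class)

# Synthetic stand-in data (assumption)
set.seed(1)
n <- 500
dat <- data.frame(
  age       = round(runif(n, 32, 70)),
  sysBP     = rnorm(n, 132, 22),
  heartRate = rnorm(n, 75, 12)
)
dat$CHD10 <- factor(
  rbinom(n, 1, plogis(-7 + 0.06 * dat$age + 0.02 * dat$sysBP)),
  levels = c(0, 1)
)
idx   <- sample(n, floor(0.8 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Standardise predictors using training-set statistics only
x_cols  <- c("age", "sysBP", "heartRate")
mu      <- colMeans(train[, x_cols])
sdv     <- apply(train[, x_cols], 2, sd)
x_train <- scale(train[, x_cols], center = mu, scale = sdv)
x_test  <- scale(test[, x_cols],  center = mu, scale = sdv)

# Leave-one-out CV error on the training set for each candidate K
ks     <- seq(1, 25, by = 2)             # hypothetical grid
cv_err <- sapply(ks, function(k) {
  mean(knn.cv(x_train, cl = train$CHD10, k = k) != train$CHD10)
})
best_k <- ks[which.min(cv_err)]

# Test-set predictions with the selected K
pred     <- knn(x_train, x_test, cl = train$CHD10, k = best_k)
cm       <- table(predicted = pred, actual = test$CHD10)
test_err <- mean(pred != test$CHD10)

print(best_k)
print(cm)
print(test_err)
```

Selecting K on the training set keeps the test set untouched until the final evaluation, so the reported test error is not biased by the tuning step.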

2.5 Tree-based methods

In this part you will need to finish the following tasks:

1. Use the training set to build a classification tree.

2. Generate predictions on the test set using the tree that you build. Calculate the confusion matrix and find the test error rate.

3. Are there any ways to improve the prediction performance of the decision trees? Try them and show the improved results.
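
A hedged sketch of these steps, using the rpart package on synthetic stand-in data, is given below. Growing a large tree with cp = 0 and then pruning it at the complexity value with the lowest cross-validated error is one possible improvement; ensemble methods such as bagging, random forests or boosting are alternatives that are not shown here.

```r
library(rpart)

# Synthetic stand-in data (assumption)
set.seed(1)
n <- 500
dat <- data.frame(
  age       = round(runif(n, 32, 70)),
  sysBP     = rnorm(n, 132, 22),
  heartRate = rnorm(n, 75, 12)
)
dat$CHD10 <- factor(
  rbinom(n, 1, plogis(-7 + 0.06 * dat$age + 0.02 * dat$sysBP)),
  levels = c(0, 1)
)
idx   <- sample(n, floor(0.8 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Grow a deliberately large classification tree (cp = 0 disables early stopping)
tree_fit <- rpart(CHD10 ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0, minsplit = 10))

# Test error of the unpruned tree
pred_full <- predict(tree_fit, newdata = test, type = "class")
cm_full   <- table(predicted = pred_full, actual = test$CHD10)
err_full  <- mean(pred_full != test$CHD10)

# Prune at the complexity parameter with the lowest cross-validated error
cp_tab  <- tree_fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(tree_fit, cp = best_cp)

# Test error of the pruned tree
pred_pruned <- predict(pruned, newdata = test, type = "class")
cm_pruned   <- table(predicted = pred_pruned, actual = test$CHD10)
err_pruned  <- mean(pred_pruned != test$CHD10)

print(cm_pruned)
print(c(unpruned = err_full, pruned = err_pruned))
```

Pruning does not always lower the test error on a particular split, so both error rates are printed for comparison.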

2.6 Model comparison

In this part you will need to finish the following tasks:

1. Compare the prediction performance of the models based on the results you obtain in sections 2.2-2.5 and suggest the best classification approach.

2. Change the random seed in step 4 of section 2.1 and generate new training/test subsets. Repeat all tasks in sections 2.2-2.5 to see whether you still reach the same conclusion as the first time. You need to attempt the whole process at least three times to see whether your conclusions are consistent.
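
Repeating the analysis over several seeds is easier if the split-fit-evaluate steps are wrapped in a function. The sketch below does this for two of the methods only (logistic regression and LDA) on synthetic stand-in data. The function name run_once and the seed values are hypothetical, and the other methods would be added in the same way.

```r
library(MASS)

# Synthetic stand-in data (assumption)
set.seed(1)
n <- 500
dat <- data.frame(
  age   = round(runif(n, 32, 70)),
  sysBP = rnorm(n, 132, 22)
)
dat$CHD10 <- factor(
  rbinom(n, 1, plogis(-7 + 0.06 * dat$age + 0.02 * dat$sysBP)),
  levels = c(0, 1)
)

# Hypothetical helper: one split, two models, test error rate of each
run_once <- function(seed, dat) {
  set.seed(seed)
  idx   <- sample(nrow(dat), floor(0.8 * nrow(dat)))
  train <- dat[idx, ]
  test  <- dat[-idx, ]

  glm_fit  <- glm(CHD10 ~ ., data = train, family = binomial)
  glm_prob <- predict(glm_fit, newdata = test, type = "response")
  glm_pred <- factor(as.integer(glm_prob > 0.5), levels = c(0, 1))

  lda_fit  <- lda(CHD10 ~ ., data = train)
  lda_pred <- predict(lda_fit, newdata = test)$class

  c(seed     = seed,
    logistic = mean(glm_pred != test$CHD10),
    lda      = mean(lda_pred != test$CHD10))
}

# Three different seeds (arbitrary values), one row of results per seed
results <- t(sapply(c(2021, 2022, 2023), run_once, dat = dat))
print(results)
```

If the ranking of the methods changes from row to row, the differences between them are probably within the variability caused by the split itself.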

3 Instructions

• This assignment is due at 5pm on Sunday 30th May. As with Assignment 1, Assignment 2 must be submitted in Canvas.

• You should generate an R Markdown document and then produce a PDF version for submission.

• Name your submission file with your student ID number.

• This assignment counts for 15% of the total assessment for this subject.
