
SEHS4538 Big Data Analytics
Individual Assignment

Question 1 [Total 10 marks]

For this question, use the Framingham_training data set.

This data set contains information about patients in a hospital who are being treated for heart
disease, and whether they die or not.

Do not perform any data preparation at all on this data set.

All 3 predictor variables should be used:

- Sex – 1 means male, 2 means female
- Age – Age of person in years
- Educ – Education level represented as an integer between 1 and 4

The target variable is:

- Death – Whether the person dies from heart disease. 0 means no, 1 means yes.



Question 1a [4 marks]

Write code to plot a CART Decision Tree, showing the proportion (not total) of records in each
category.
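
As a rough illustration of what such a plot involves, the sketch below fits a CART tree on synthetic data and plots it with class proportions in each node. It assumes the rpart and rpart.plot packages are installed; the data frame `heart` and its values are made up for the example and are not the assignment data set.

```r
library(rpart)       # CART implementation
library(rpart.plot)  # tree plotting (assumed to be installed)

# Synthetic stand-in for the data set (hypothetical values only)
set.seed(42)
n <- 500
heart <- data.frame(
  Sex  = sample(1:2, n, replace = TRUE),    # 1 = male, 2 = female
  Age  = sample(35:80, n, replace = TRUE),  # age in years
  Educ = sample(1:4, n, replace = TRUE)     # education level 1 to 4
)

# Made-up outcome: risk rises with age and is lower for females
p <- plogis(-7 + 0.1 * heart$Age - 0.8 * (heart$Sex == 2))
heart$Death <- rbinom(n, size = 1, prob = p)

# Fit a classification tree using all 3 predictors
fit <- rpart(Death ~ Sex + Age + Educ, data = heart, method = "class")

# extra = 4 shows the proportion of records per class in each node,
# rather than the number of records
rpart.plot(fit, type = 4, extra = 4)
```

Other values of `extra` display counts or percentages of all records, so the choice here is one reasonable reading of "proportion (not total)".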



Question 1b [2 marks]

Using only the plot (and nothing else) from Question 1a, describe the characteristics of people
who are most likely to die from heart disease.

Write your answer as a comment in the code, and clearly label your comment as “Question 1b”.



Question 1c [4 marks]

Write code to use a Random Forest (not the CART decision tree from Question 1a) and print out
the probability of death for a 63-year-old female who has the highest level of education. Do not
print any other statistics.
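
A minimal sketch of this kind of prediction is shown below. It assumes the randomForest package is installed and uses synthetic data, so the data frame `heart`, the number of trees and the printed value are illustrative only. The patient is encoded from the variable descriptions above: Sex = 2 (female), Age = 63, Educ = 4 (highest level).

```r
library(randomForest)  # assumed to be installed

# Synthetic stand-in for the data set (hypothetical values only)
set.seed(42)
n <- 500
heart <- data.frame(
  Sex  = sample(1:2, n, replace = TRUE),
  Age  = sample(35:80, n, replace = TRUE),
  Educ = sample(1:4, n, replace = TRUE)
)
p <- plogis(-7 + 0.1 * heart$Age - 0.8 * (heart$Sex == 2))
heart$Death <- rbinom(n, size = 1, prob = p)

# factor() in the formula requests classification rather than regression
rf <- randomForest(factor(Death) ~ Sex + Age + Educ, data = heart, ntree = 200)

# The person of interest: 63-year-old female with education level 4
patient <- data.frame(Sex = 2L, Age = 63L, Educ = 4L)

# type = "prob" returns one column per class; column "1" is death
prob_death <- predict(rf, newdata = patient, type = "prob")[, "1"]

# Print only the probability of death
print(unname(prob_death))
```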

Question 2 [Total 10 marks]

For this question, use the clothing_store_PCA_training data set.

This data set contains information about customers in a clothing store.

Perform only the data preparation given in the questions below.

The predictor variables are:

- Days since Purchase – Number of days since the customer’s last purchase
- Purchase Visits – Number of times the customer has bought something
- Days on File – Number of days the customer’s details have been stored
- Days between Purchases – Average number of days between purchases for the customer
- Diff Items Purchased – Number of different items the customer has bought

The target variable is:

- Sales per Visit – Average amount (in dollars) the customer spends when they buy



Question 2a [4 marks]

In lecture 2, slide 11, the data presented has almost certainly been modified, because all bank
mortgages older than 10 years have been rounded down to 10 years. The evidence to support this
is the fact that there are so many mortgages that are 10 years old compared to all other years, and
that there are no mortgages older than 10 years, which is extremely unlikely.

In lecture 2, slide 12, the data presented has also almost certainly been modified, because there
are so many bank account holders with zero and negative ages, which obviously does not make
sense. Also, there are a significant number of customers older than 100 years, even though there
are no customers between 90 and 100 years old, which seems extremely unlikely, suggesting that
these very old customers are modified data.

Similarly, the clothing_store_PCA_training data set has also almost certainly been modified.

Load the data set into your R program, and use RStudio to analyse the data set, explaining where
the data has been modified, and show the evidence that supports your analysis.

Write your answer as a comment in the code, and clearly label your comment as “Question 2a”.
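
The kind of evidence described in the lecture examples (a spike at one value and nothing beyond it) can be checked with a frequency table and a histogram. The sketch below is only an illustration on synthetic data: the vector `recorded` and the cap of 10 are made up to mimic the mortgage example, and it does not indicate which field of clothing_store_PCA_training has been modified.

```r
# Synthetic mortgage ages in years (hypothetical, not assignment data)
set.seed(1)
true_age <- rexp(2000, rate = 1 / 6)

# Simulate the modification: anything older than 10 years is recorded as 10
recorded <- pmin(floor(true_age), 10)

# Frequency of each recorded value; a capped field piles up at its maximum
freq <- table(recorded)
print(freq)

# Share of records sitting exactly at the maximum value
share_at_max <- mean(recorded == max(recorded))
print(share_at_max)

# One bar per whole year makes the spike at the cap easy to see
hist(recorded,
     breaks = seq(-0.5, 10.5, by = 1),
     main = "Recorded mortgage age (synthetic)",
     xlab = "Years")
```

On real data the same two checks, a frequency table and a histogram per field, are a reasonable starting point, alongside `summary()` for minimum and maximum values.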

Question 2b [3 marks]

If a customer goes for a long time without coming back to buy again from our clothing store,
then they are more likely to never return, and we will lose a valuable customer.

Write code to create an appropriate diagram to show roughly how long before a customer is
unlikely to return to our store to make a purchase.
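
One possible diagram for this is a histogram of the gap between purchases, read at the point where the bars tail off. The sketch below uses synthetic data; the column name `Days.between.Purchases` assumes that spaces in the field names become dots when the file is read into R, and whether this field or Days since Purchase is the more appropriate one is a judgment call left open here.

```r
# Synthetic stand-in for the clothing store data (hypothetical values only)
set.seed(2)
clothing <- data.frame(
  Days.between.Purchases = round(rexp(1000, rate = 1 / 120))
)

# Histogram of the gap between purchases; the region where the bars
# become very short suggests how long a gap is rarely followed by a return
h <- hist(clothing$Days.between.Purchases,
          breaks = 30,
          main = "Days between purchases (synthetic)",
          xlab = "Days")

# A numeric companion to the plot: the gap that 95% of customers stay under
cutoff <- quantile(clothing$Days.between.Purchases, probs = 0.95)
print(cutoff)
```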



Question 2c [3 marks]

Create a new field in the data set to store the total sales amount for each customer.

There is no need to generate a new file. Just change the data set so that the new field can be
viewed in RStudio’s Environment (top-right of the screen).
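
Since Sales per Visit is an average per purchase and Purchase Visits counts the purchases, one natural reading is that the total is their product. The sketch below shows that calculation on a tiny made-up data frame; the dotted column names and the new field name `Total.Sales` are assumptions for illustration.

```r
# Tiny made-up data frame standing in for the clothing store data
clothing <- data.frame(
  Purchase.Visits = c(2, 5, 1),
  Sales.per.Visit = c(40.0, 12.5, 99.9)
)

# Total sales = average spend per visit x number of purchase visits
clothing$Total.Sales <- clothing$Sales.per.Visit * clothing$Purchase.Visits

# The new column is now part of the data frame in the Environment
print(clothing)
```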



Important: For all questions, you are required to write comments for any code that you write, in
the same fashion and standard as has been demonstrated in the tutorials and sample solutions.
Otherwise, up to 50% of the marks value of that question may be deducted.


There are no more questions
