SEHS4538 Big Data Analytics
Individual Assignment

Question 1 [Total 10 marks]

For this question, use the Framiningham_training data set. This data set contains information about patients in a hospital who are being treated for heart disease, and whether they die or not. Do not perform any data preparation at all on this data set.

All 3 predictor variables should be used:
Sex – 1 means male, 2 means female
Age – Age of person in years
Educ – Education level represented as an integer between 1 and 4

The target variable is:
Death – Whether the person dies from heart disease. 0 means no, 1 means yes.

Question 1a [4 marks]
Write code to plot a CART Decision Tree, showing the proportion (not total) of records in each category.

Question 1b [2 marks]
Using only the plot (and nothing else) from Question 1a, describe the characteristics of people who are most likely to die from heart disease. Write your answer as a comment in the code, and clearly label your comment as “Question 1b”.

Question 1c [4 marks]
Write code to use a Random Forest (not the CART decision tree from Question 1a) and print out the probability of death for a 63 year old female who has the highest level of education. Do not print any other statistics.

Question 2 [Total 10 marks]

For this question, use the clothing_store_PCA_training data set. This data set contains information about customers in a clothing store. Perform only the data preparation given in the questions below.
The predictor variables are:
Days since Purchase – Number of days since the customer’s last purchase
Purchase Visits – Number of times the customer has bought something
Days on File – Number of days the customer’s details have been stored
Days between Purchases – Average number of days between purchases for the customer
Diff Items Purchased – Number of items the customer has bought

The target variable is:
Sales per Visit – Average amount (in dollars) the customer spends when they buy

Question 2a [4 marks]
In lecture 2, slide 11, the data presented has almost certainly been modified, because all bank mortgages older than 10 years have been rounded down to 10 years. The evidence to support this is the fact that there are so many mortgages that are 10 years old compared to all other years, and that there are no mortgages older than 10 years, which is extremely unlikely.

In lecture 2, slide 12, the data presented has also almost certainly been modified, because there are so many bank account holders with zero and negative ages, which obviously does not make sense. Also, there are a significant number of customers older than 100 years, even though there are no customers between 90 and 100 years old, which seems extremely unlikely, suggesting that these very old customers are modified data.

Similarly, the clothing_store_PCA_training data set has also almost certainly been modified. Load the data set into your R program, and use RStudio to analyse the data set, explaining where the data has been modified, and show the evidence that supports your analysis. Write your answer as a comment in the code, and clearly label your comment as “Question 2a”.

Question 2b [3 marks]
If a customer goes for a long time without coming back to buy again from our clothing store, then they are more likely to never return, and we will lose a valuable customer.
Write code to create an appropriate diagram to show roughly how long before a customer is unlikely to return to our store to make a purchase.

Question 2c [3 marks]
Create a new field in the data set to store the total amount of sales each customer has purchased. There is no need to generate a new file. Just change the data set so that the new field can be viewed in RStudio’s Environment (top-right of screen).

Important: For all questions, you are required to write comments for any code that you write, in the same fashion and standard as has been demonstrated in the tutorials and sample solutions. Otherwise, up to 50% of the marks value of that question may be deducted.

There are no more questions.