BUSINESS DATA MINING (IDS 572)
HOMEWORK 1
DUE DATE: WEDNESDAY, SEPTEMBER 16 AT 3:30 PM
• Please provide succinct answers to the questions below.
• You should submit an electronic PDF or Word file on Blackboard.
• Please include the names of all team members in your write-up and in the name of the file.
• One submission is sufficient for the entire group.
• You should include all the R functions you use in your pdf file.
• Please make sure your graphs have titles, labels, legends (if necessary), appropriate colors,
etc.
Problem 1. This question should be done without using R.
We have just put out a special promotion and would like to determine who responded to the mailing. We
have a sample of consumers, including both purchasers and non-purchasers (all received the promotion),
and would like to predict who is a purchaser. For each consumer, we have their age (bucketed into
ranges), their income (a numerical variable), and whether or not they responded to last year’s
mailing.
Purchase?   Age      Income   Last Year?
Yes         Young    60K      No
Yes         Middle   60K      Yes
Yes         Old      100K     Yes
Yes         Young    60K      Yes
Yes         Young    100K     Yes
Yes         Middle   60K      No
No          Old      150K     Yes
No          Middle   100K     No
No          Young    150K     Yes
No          Middle   100K     Yes
No          Old      150K     No
Please show your calculations in the following questions.
(a) Using the 1-rule method discussed in class, find the relevant sets of classification rules for the
target variable (Purchase?) by testing each of the input attributes. Which of these sets of rules
has the lowest misclassification rate?
(b) Now, we want to construct a decision tree using this data set. What is the entropy measure of
the entire data set?
(c) What is the information gain for splitting on age? On income? On last year? Which should be
the initial split variable?
(d) Construct the entire tree using the information gain. What is the decision at each terminal
node?
(e) What is the accuracy of your decision tree on this data set? Explain your answer.
(f) What would your tree predict for a Middle aged person with 90K income who did not purchase
last year? Justify your answer. Consider the following instances as test data points.
Purchase?   Age      Income   Last Year?
Yes         Middle   60K      No
No          Young    90K      Yes
Yes         Old      100K     No
Yes         Young    60K      Yes
Yes         Middle   140K     No
What is the accuracy of your decision tree model on the test data? Justify your answer.
(g) What are the support and confidence of the rules
– If Last Year = Yes, then Yes.
– If Age = Middle, and Income = 50K, then Yes.
(h) Based on your decision tree, which variable is considered the most important variable? Justify
your answer.
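Problem 1 is meant to be worked by hand, but once the calculations are done, a short R sketch along the following lines can be used to check the entropy and information gain arithmetic. The function names entropy and info_gain and the toy vectors are hypothetical illustrations, not part of the assignment data; the sketch assumes categorical attributes and base-2 logarithms.

```r
# Entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  p <- p[p > 0]                        # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain from splitting `labels` on a categorical `attribute`.
info_gain <- function(labels, attribute) {
  groups  <- split(labels, attribute, drop = TRUE)   # labels within each attribute value
  weights <- lengths(groups) / length(labels)        # share of rows in each branch
  child   <- vapply(groups, entropy, numeric(1))     # entropy of each branch
  entropy(labels) - sum(weights * child)
}

# Hypothetical toy data (NOT the homework table).
toy_class      <- c("Yes", "Yes", "No", "No")
toy_perfect    <- c("A", "A", "B", "B")   # separates the classes completely
toy_useless    <- c("A", "B", "A", "B")   # carries no information about the class

entropy(toy_class)                  # 1 bit for a 50/50 split
info_gain(toy_class, toy_perfect)   # 1: all uncertainty removed
info_gain(toy_class, toy_useless)   # 0: nothing learned
```

A numeric attribute such as income would first have to be treated as categorical or split at a threshold before calling info_gain in this sketch.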
The goal of the remaining questions is to review statistical R programming and get comfortable with
coding in R, especially for those of you who did not learn R in IDS 570. The questions ask
for simple tasks, but try your best to code them in the most elegant way that you can. For example,
you can use tibbles instead of data frames for nicer-looking output. Take care of
details and try to play with the function arguments to check how they work. Again, since some of you
are new to R, please feel free to contact me or the TA if you have any specific questions.
Problem 2. In this question, we use the built-in R dataset called attitude, which contains information
from a survey of the clerical employees of a large financial organization. To access this data set, use
data("attitude"). Learn more about each variable by reading the variable description in ?attitude.
(a) Summarize the main statistics of all the variables in the data set.
(b) How many observations are in the attitude dataset? What function in R did you use to display
this information?
(c) Produce a scatterplot matrix of the variables in the attitude dataset. What seems to be most
correlated with the overall rating?
(d) Produce a scatterplot of rating (on the y-axis) vs. learning (on the x-axis). Add a title to the
plot.
(e) Produce 2 side-by-side histograms, one for rating and one for learning. You will need to use
par(mfrow=...) to get the two plots together.
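For those new to R, the par(mfrow=...) mechanism mentioned in part (e) can be sketched as follows. This is only one possible layout, assuming base graphics; the titles, axis labels, and colors are illustrative choices.

```r
data("attitude")   # built-in dataset from the datasets package

dim(attitude)      # number of rows (observations) and columns (variables)

# Put two plots in one row; par() returns the previous settings so they can be restored.
old_par <- par(mfrow = c(1, 2))
hist(attitude$rating,
     main = "Overall rating", xlab = "Rating", col = "steelblue")
hist(attitude$learning,
     main = "Opportunity to learn", xlab = "Learning", col = "darkorange")
par(old_par)       # restore the original single-plot layout
```

Restoring the saved settings with par(old_par) keeps later plots from being squeezed into the same two-panel layout.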
Problem 3. Use hw1.xls to answer the following questions. Include all the charts with proper labels
in your report. For each chart you produce, please write a sentence or two explaining what you
see in the chart.
(a) Make a frequency distribution table for the gender variable.
(b) Make a bar chart for the gender variable.
(c) Make a histogram to display the distribution of the Height variable.
(d) Make a clustered bar chart (side-by-side bar chart) to examine the relationship between the gender and
Ate Fried Food variables.
(e) Make a scatter plot to examine the correlation between the Weight and Height variables, and write
a sentence to describe the trend you observe in the scatter plot.
(f) Find the 5-number summary for the Height data and make a boxplot for the Height data with
mild and extreme outliers identified using inner and outer fences. Draw the boxplot.
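The fence calculation in part (f) can be sketched as below. The heights vector is made-up illustrative data, not the contents of hw1.xls, and the sketch assumes the common convention of inner fences at 1.5 times and outer fences at 3 times the hinge spread beyond the lower and upper hinges; use whichever convention was given in class.

```r
# Hypothetical heights (inches); replace with the Height column from hw1.xls.
heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 69, 70, 80, 95)

five <- fivenum(heights)   # min, lower hinge, median, upper hinge, max
q1   <- five[2]
q3   <- five[4]
iqr  <- q3 - q1            # spread between the hinges

inner <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
outer <- c(lower = q1 - 3.0 * iqr, upper = q3 + 3.0 * iqr)

# Extreme outliers lie beyond the outer fences; mild ones lie between inner and outer.
is_extreme <- heights < outer["lower"] | heights > outer["upper"]
is_mild    <- (heights < inner["lower"] | heights > inner["upper"]) & !is_extreme

mild_outliers    <- heights[is_mild]
extreme_outliers <- heights[is_extreme]

# boxplot() uses range = 1.5 by default, so points beyond the inner fences are drawn individually.
boxplot(heights, horizontal = TRUE, main = "Height (hypothetical data)", xlab = "Height (in)")
```

Note that fivenum() returns Tukey's hinges, which can differ slightly from the quartiles reported by quantile() or summary().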
Problem 4. For this question you need the following R packages: MASS, plyr, dplyr, tibble, and
ggplot2. We are going to use the built-in data set “birthwt” (Risk Factors Associated with Low Infant
Birth Weight) from the MASS library. This dataset contains 189 instances and 10 variables. To learn
more about this data set you can use ?birthwt.
(a) All the variables are stored as integers. Write your own function that automatically converts
all the integer variables to factors (categorical).
(b) Repeat part (a) using mutate() and mapvalues() functions.
(c) Use the tapply() function to see what the average birthweight looks like when broken down
by race and smoking status. Does smoking status appear to have an effect on birth weight?
Does the effect of smoking status appear to be consistent across racial groups? What is the
association between race and birth weight?
(d) Use the kable() function from knitr to display the table you get in part (c).
(e) Use ddply() function to get the average birthweight by mother’s race and compare it with
tapply() function.
(f) Use ggplot2 to plot the average birthweight (computed in part (e)) for each race group in a
bar plot.
(g) Use ddply() function to look at the average birthweight and proportion of babies with low
birthweight broken down by smoking status.
(h) Split the data further by adding “mother smokes” to the ddply() formula used in part (g).
(i) Is the mother’s age correlated with birth weight? Does the correlation vary with smoking status?
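As a possible starting point for the grouped summaries in this problem, the sketch below shows how tapply() accepts a list of grouping variables and how a correlation can be computed within groups. It assumes the MASS package is installed and works on the raw integer codes in birthwt (see ?birthwt for their meaning); the object names are illustrative, and interpreting the numbers is left to you.

```r
library(MASS)
data("birthwt")

# Mean birth weight for every combination of race (rows) and smoking status (columns).
mean_bwt <- tapply(birthwt$bwt,
                   list(race = birthwt$race, smoke = birthwt$smoke),
                   mean)
mean_bwt

# Correlation between mother's age and birth weight, overall and within each smoking group.
cor_all <- cor(birthwt$age, birthwt$bwt)
cor_by_smoke <- sapply(split(birthwt, birthwt$smoke),
                       function(d) cor(d$age, d$bwt))
cor_all
cor_by_smoke
```

The same split-then-summarize idea is what ddply() expresses with its grouping argument, so the two results should agree once the groups are defined the same way.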
Problem 5. ggplot() produces far better and more easily customizable graphics than any other
visualization functions in R. There are two basic calls in ggplot2:
• qplot(x, y, ..., data): a “quick-plot” routine, which essentially replaces the base plot().
• ggplot(data, aes(x, y, ...), ...): defines a graphics object from which plots can be generated, along
with aesthetic mappings that specify how variables are mapped to visual properties.
In this question, we would like to quickly practice drawing different plots using ggplot2. For this
purpose, we use the “diamonds” dataset in R. You can access this dataset by writing data(diamonds).
(a) What type of variable is price? Would you expect its distribution to be symmetric, right-skewed,
or left-skewed? Why? Make a histogram of the distribution of diamond prices. Does the shape
of the distribution match your expectation? (Use geom_histogram().)
(b) Visualize a few other numerical variables in the dataset and discuss any interesting features.
When describing distributions of numerical variables we might also want to view statistics like
mean, median, etc.
(c) What type of variable is color? Which color is most prominently represented in the dataset?
(d) Make a bar plot of the distribution of cut, and describe its distribution. (Use geom_bar().)
(e) Make a histogram of the depths of diamonds, with binwidth of 0.2%, and add another variable
(say, cut) to the visualization. You can do this either using an aesthetic or a facet. Typical
diamonds of which cut have the highest depth? On average, does depth increase or decrease as
cut grade increases?
(f) Compare the distribution of price for the different cuts. Does anything seem unusual? Describe.
(g) Draw a scatterplot showing the price (y-axis) as a function of the carat (size).
(h) Make the points in your scatter plot from part (g) more transparent using the alpha argument in geom_point().
(i) Use the facet_wrap(~ factor1 + factor2 + ... + factorn) command to create scatter plots showing
how diamond price varies with carat size for different values of “cut” (use colour = color in
aes()).
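The ggplot() call pattern described at the start of this problem can be sketched as follows. This assumes ggplot2 is installed; the binwidth, alpha value, titles, and object names (p_hist, p_scatter) are illustrative choices rather than required settings.

```r
library(ggplot2)
data(diamonds)

# Histogram: one aesthetic (x) plus a geom; binwidth of 500 is an arbitrary illustrative value.
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  labs(title = "Distribution of diamond prices", x = "Price (USD)", y = "Count")

# Scatter plot: alpha controls point transparency, facet_wrap() draws one panel per cut.
p_scatter <- ggplot(diamonds, aes(x = carat, y = price, colour = color)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ cut) +
  labs(title = "Price vs. carat by cut", x = "Carat", y = "Price (USD)")

# The plots are objects; call print(p_hist) or print(p_scatter) to draw them.
```

Storing each plot in a variable makes it easy to add or swap layers later, for example replacing facet_wrap(~ cut) with a different faceting variable.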