CMDA-3654 2019 Summer II
Homework 6
Your name here
Due Oct 24th as a .pdf upload
Instructions:
Delete the Instructions section from your write-up!!
I have given you this assignment as an .Rmd (R Markdown) file.
• Change the name of the file to: Lastname_Firstname_CMDA_3654_HW6.Rmd, and your output should therefore match
but with a .pdf extension.
• You need to edit the R Markdown file by filling in the chunks appropriately with your code. Output will be generated
automatically when you compile the document.
• You also need to add your own text before and after the chunks to explain what you are doing or to interpret the output.
• Feel free to add additional chunks if needed. I will not be providing assignments to you like this for the entire semester,
just long enough for you to learn how to do it for yourself.
Required: The final product that you turn in must be a .pdf file.
• You can Knit this document directly to a PDF if you have LaTeX installed (which is preferred).
• If you absolutely can’t get LaTeX installed and/or working, then you can compile to a .html first, by clicking on the
arrow button next to knit and selecting Knit to HTML.
• You must then print your .html file to a .pdf by first opening it in a web browser and then printing to a .pdf.
Problem 1: [25 pts] Exploring Relationships between variables.
Load the DatasaurusDozen.tsv file into R.
This data consists of x and y observations for 13 sub-datasets that have the following names:
dino, away, h_lines, v_lines, x_shape, star, high_lines, dots, circle, bullseye, slant_up, slant_down, wide_lines
a. Use dplyr functions to summarize each dataset in the following way: Compute the mean for x, mean for y, sd for x,
sd for y, and the correlation coefficient between x and y. Please round your answers to 2 decimal places. The
answers should be returned automatically in a tibble. Use kable() or pandoc.table() (use results='asis' in the chunk
definition if using pandoc.table()) or some other function to make a nicely formatted table of your results.
b. What do the numerical summaries tell you about the data in the 13 different datasets? In particular, does the
correlation coefficient provide you with much information about the relationship between x and y?
c. Now make a basic scatterplot of x and y for the 13 different datasets. Use a different color for each dataset. My best
advice is to simply use ggplot() with facet_wrap(), as this can be done in a single line.
d. How does your interpretation about the relationships between x and y change after seeing the plots?
e. What lesson can be learned here?
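The grouped summary requested in part (a) can be sketched on a stand-in dataset. This is only an illustration under stated assumptions: it uses R's built-in anscombe data stacked into long form rather than DatasaurusDozen.tsv, and the column names dataset, x, and y as well as the object names toy and summary_tbl are hypothetical choices, not taken from the assignment file.

```r
library(dplyr)

# Stand-in data (assumption): Anscombe's quartet from base R, stacked into
# long form with hypothetical columns dataset, x, y.
toy <- bind_rows(lapply(1:4, function(i) {
  tibble(
    dataset = paste0("set", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# One row per sub-dataset: means, standard deviations, and the correlation,
# each rounded to 2 decimal places.
summary_tbl <- toy %>%
  group_by(dataset) %>%
  summarise(
    mean_x = round(mean(x), 2),
    mean_y = round(mean(y), 2),
    sd_x   = round(sd(x), 2),
    sd_y   = round(sd(y), 2),
    cor_xy = round(cor(x, y), 2)
  )

# Format the tibble as a table in the knitted document.
knitr::kable(summary_tbl)
```

For the homework data, the same pipeline would start from the tibble read from the .tsv file (for example with readr::read_tsv()), provided its columns match the names assumed here.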
Problem 2: [25 pts] Linear Regression
Consider the mtcars dataset. Say we want to build a linear regression model that predicts mpg, using any subset of the other
variables as predictors.
a. Begin by creating a scatterplot matrix between mpg and all other predictors. Report the correlations as well in either
the upper or lower half of the scatterplot matrix.
b. What are the three variables most highly correlated with mpg?
c. Fit three simple linear regression models using your previous three variables/predictors. Report summaries for the
models. Which model would you choose and why?
d. Create a multiple linear regression (MLR) model using stepAIC() to identify the best subset of predictors from all of
the variables in mtcars (obviously mpg is still the response variable). Report these predictors, and a summary of the
model these predictors produced.
e. Compare your MLR model to your three simple linear regression models from earlier. Are any of those predictors in your
MLR model? Are the coefficients the same for those predictors? If not, explain what may have caused the change.
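The mechanics of part (d) can be sketched as follows. This is a minimal example, assuming the MASS package is installed; the object names full_fit and step_fit are hypothetical, and the choice to start the search from the full model is one reasonable option rather than the required one.

```r
library(MASS)

# Start from the full model: mpg regressed on every other column of mtcars.
full_fit <- lm(mpg ~ ., data = mtcars)

# Stepwise search by AIC; trace = FALSE suppresses the step-by-step log.
# With no scope argument supplied, stepAIC() searches backward from full_fit.
step_fit <- stepAIC(full_fit, trace = FALSE)

# Predictors retained by the search, followed by the usual model summary.
attr(terms(step_fit), "term.labels")
summary(step_fit)
```

Loading MASS after dplyr masks dplyr's select(), so calling MASS::stepAIC() without library(MASS) is a common alternative.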
Problem 3: [25 pts] More Linear Regression
Sometimes your dataset is rather small, but you see that a simple linear regression is not appropriate so you try harder to fit a
more complicated model. This is an example of such a situation.
A poultry scientist was studying various dietary additives to increase the rate at which chickens gain weight. One of the
potential additives was studied by creating a new diet that consisted of a standard basal diet supplemented with varying
amounts of the additive (0, 20, 40, 60, 80, and 100 grams). There were 60 chicks available for the study. Each of the six diets
was randomly assigned to 10 chicks. At the end of 4 weeks, the feed efficiency ratio, feed consumed (gm) to weight gain (gm),
was obtained for the 60 chicks. The experiment was also concerned with the effects of high levels of copper in the chick feed.
Five of the 10 chicks in each level of the feed additive received 400 ppm of copper, while the remaining five chicks received no
copper.
The data is contained in the chicken.csv data file.
a. In order to explore the relationship between feed efficiency ratio (FER) and feed additive (A), plot the FER versus A.
b. What type of regression appears most appropriate?
c. Fit first-order, quadratic, and cubic regression models to the data. Which regression equation provides the best fit to
the data? Justify your answer using evidence based upon plots and relevant summaries.
d. Is there anything peculiar about any of the data values? Provide an explanation of what may have happened. (Hint:
Look at regression diagnostics like plots of the residuals versus the fitted values (or x), plot the leverages, or plot some
measure of influence.)
e. Using your best polynomial model from (b) and (c), fit a new model that includes the linear addition of copper and
display the estimate table. Does Copper provide a significant improvement to the fit? Carry out an F-test that compares
the Full model that contains Copper and the reduced model that has your polynomial model fit on the additive only.
Discuss the results.
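The nested-model F-test described in part (e) can be sketched with anova(). Everything specific in this example is an assumption: the data are simulated (not the values in chicken.csv), the column names FER, A, and Copper are guesses at the file's layout, and the quadratic polynomial is an arbitrary placeholder rather than the answer to part (c).

```r
set.seed(1)

# Simulated stand-in for chicken.csv (assumption): 6 additive levels,
# 10 chicks each, half of each group receiving 400 ppm copper.
chicks <- data.frame(
  A      = rep(c(0, 20, 40, 60, 80, 100), each = 10),
  Copper = rep(rep(c(0, 400), each = 5), times = 6)
)
chicks$FER <- 2 - 0.01 * chicks$A + 0.00008 * chicks$A^2 +
  rnorm(nrow(chicks), sd = 0.05)

# Reduced model: polynomial in the additive only.
reduced <- lm(FER ~ poly(A, 2, raw = TRUE), data = chicks)

# Full model: same polynomial plus a linear copper term.
full <- lm(FER ~ poly(A, 2, raw = TRUE) + Copper, data = chicks)

# Estimate table for the full model.
summary(full)$coefficients

# F-test of reduced versus full; one extra parameter gives 1 numerator df.
cmp <- anova(reduced, full)
print(cmp)
```

Because the two models differ by a single term, the F statistic from anova() equals the square of the t statistic for Copper in the full model's estimate table.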
Problem 4: [10 pts] Linear Regression with Indicator Variables
Consider the data in smoking_birthweight.csv. This data contains 3 variables: the birth weight of a baby (Weight), the
length of gestation (Gestation) in weeks, and the smoking status of the mother (Smoke). The smoking status of the mother
in this case is coded as yes or no. This is a categorical variable (aka factor) with 2 categories (a binary variable). We could
have coded the levels of this factor as an indicator variable using TRUE or FALSE, or equivalently 1 or 0, respectively.
a. Fit a first-order regression model with birth weight as the response variable and the gestation and smoking status as
predictors. Write down the fitted regression model equation and interpret the regression coefficients. If you can do this,
you should have no problem handling the extra credit.
b. Plot the fitted regression lines (yes, plural). Why are there two?
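The general form of a first-order model with one quantitative predictor and one binary indicator can be written out as below. This is a sketch under assumptions: the symbol S is introduced here and is taken to equal 1 when Smoke is yes and 0 when it is no, and b_0, b_1, b_2 stand for fitted coefficients whose values must come from your own fit.

```latex
\widehat{\text{Weight}} = b_0 + b_1\,\text{Gestation} + b_2\,S,
\qquad
S = \begin{cases} 1 & \text{if Smoke is yes} \\ 0 & \text{if Smoke is no} \end{cases}

% Substituting the two possible values of S:
S = 0:\quad \widehat{\text{Weight}} = b_0 + b_1\,\text{Gestation}
\qquad
S = 1:\quad \widehat{\text{Weight}} = (b_0 + b_2) + b_1\,\text{Gestation}
```

Under this coding the two cases share the slope b_1 and differ only in the intercept, by b_2.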
Problem 5: [15 pts] Parameter Interpretation with Indicator Variables
Recall that indicator variables, sometimes called “dummy” variables, are binary variables that indicate whether an event is
recognized or not (i.e., 1 if TRUE, 0 if FALSE). Suppose we have a data set of reported salaries and highest achieved education
levels. Suppose the variables are as follows: salary, noHS, highSchoolGrad, Assoc, Bach, Masters, Doctorate, where the
levels of education are either a 1 or 0 depending on whether that is the given observation’s highest level of achieved education.
• Write down the multiple linear regression model. Specify which βi are coefficients of indicator variables.
• Write interpretations for all of your model parameters, that is βi, for i ∈ {0, 1, 2, 3, 4, 5}.
• Now assume we were to add another variable to this data set: an observation’s gender. Write down this new model,
and now interpret β0.
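How R turns a six-level categorical variable into indicator columns can be inspected with model.matrix(). This is a small illustration, assuming default treatment coding; the object names edu_levels, education, and X are hypothetical, and the choice of noHS as the first (baseline) level is an assumption made here, not something the problem dictates.

```r
# Hypothetical factor with one observation per education level.
edu_levels <- c("noHS", "highSchoolGrad", "Assoc", "Bach", "Masters", "Doctorate")
education <- factor(edu_levels, levels = edu_levels)

# Default treatment coding: an intercept column plus one 0/1 column for every
# level except the first (here noHS), which becomes the baseline.
X <- model.matrix(~ education)

colnames(X)  # intercept plus five indicator columns
dim(X)       # rows = observations, columns = model parameters
```

Six levels therefore yield six parameters in total, which lines up with the index set i ∈ {0, 1, 2, 3, 4, 5} used in the problem.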