CMDA-3654 2019 Summer II
Homework 6
Your name here
Due Oct 24th as a .pdf upload

Instructions:

Delete the Instructions section from your write-up!!

I have given you this assignment as an .Rmd (R Markdown) file.

• Change the name of the file to: Lastname_Firstname_CMDA_3654_HW6.Rmd, and your output should therefore match but with a .pdf extension.
• You need to edit the R Markdown file by filling in the chunks appropriately with your code. Output will be generated automatically when you compile the document.
• You also need to add your own text before and after the chunks to explain what you are doing or to interpret the output.
• Feel free to add additional chunks if needed.

I will not be providing assignments to you like this for the entire semester, just long enough for you to learn how to do it for yourself.

Required: The final product that you turn in must be a .pdf file.

• You can Knit this document directly to a PDF if you have LaTeX installed (which is preferred).
• If you absolutely can't get LaTeX installed and/or working, then you can compile to a .html first, by clicking on the arrow button next to Knit and selecting Knit to HTML.
• You must then print your .html file to a .pdf by first opening it in a web browser and then printing to a .pdf.

Problem 1: [25 pts] Exploring Relationships between variables.

Load the DatasaurusDozen.tsv file into R. This data consists of x and y observations for 13 sub-datasets that have the following names: dino, away, h_lines, v_lines, x_shape, star, high_lines, dots, circle, bullseye, slant_up, slant_down, wide_lines

a. Use dplyr functions to summarize each dataset in the following way: Compute the mean for x, mean for y, sd for x, sd for y, and the correlation coefficient between x and y. Please round your answers to 2 decimal places. The answers should be returned automatically in a tibble.
Use kable() or pandoc.table() (use results='asis' in the chunk definition if using pandoc.table()) or some other function to make a nicely formatted table of your results.

b. What do the numerical summaries tell you about the data in the 13 different datasets? In particular, does the correlation coefficient provide you with much information about the relationship between x and y?

c. Now make a basic scatterplot of x and y for the 13 different datasets. Use a different color for each dataset. My best advice is to simply use ggplot() with facet_wrap(), as this can be done in a single line.

d. How does your interpretation about the relationships between x and y change after seeing the plots?

e. What lesson can be learned here?

Problem 2: [25 pts] Linear Regression

Consider the mtcars dataset. Say we want to build a linear regression model that predicts mpg, using any subset of the other variables as predictors.

a. Begin by creating a scatterplot matrix between mpg and all other predictors. Report the correlations as well in either the upper or lower half of the scatterplot matrix.

b. What are the three variables most highly correlated with mpg?

c. Fit three simple linear regression models using your previous three variables/predictors. Report summaries for the models. Which model would you choose and why?

d. Create a multiple linear regression (MLR) model using stepAIC() to identify the best subset of predictors from all of the variables in mtcars (obviously mpg is still the response variable). Report these predictors, and a summary of the model these predictors produced.

e. Compare your MLR model to your three simple linear regression models from earlier. Are any of those predictors in your MLR model? Are the coefficients the same for those predictors? If not, explain what may have caused the change.
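
The mechanics behind Problem 2 can be sketched roughly as follows. This is only a minimal illustration, not the expected write-up: the choice of wt as the example predictor and the direction = "both" setting are assumptions made for the sketch, and stepAIC() is taken from the MASS package.

```r
library(MASS)  # provides stepAIC()

# Correlation of mpg with every other column (mpg is the first column of mtcars)
cors <- cor(mtcars)["mpg", -1]
sort(abs(cors), decreasing = TRUE)

# One simple linear regression; wt is used here only as an example predictor
fit_wt <- lm(mpg ~ wt, data = mtcars)
summary(fit_wt)

# Stepwise selection by AIC, starting from the model with all predictors
# (direction = "both" is an assumption; "backward" is another common choice)
full <- lm(mpg ~ ., data = mtcars)
best <- stepAIC(full, direction = "both", trace = FALSE)
summary(best)
```

Comparing coef(fit_wt) with coef(best) is one way to see whether a predictor's coefficient changes once other predictors enter the model.
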

Problem 3: [25 pts] More Linear Regression

Sometimes your dataset is rather small, but you see that a simple linear regression is not appropriate, so you try harder to fit a more complicated model. This is an example of such a situation.

A poultry scientist was studying various dietary additives to increase the rate at which chickens gain weight. One of the potential additives was studied by creating a new diet that consisted of a standard basal diet supplemented with varying amounts of the additive (0, 20, 40, 60, 80, and 100 grams). There were 60 chicks available for the study. Each of the six diets was randomly assigned to 10 chicks. At the end of 4 weeks, the feed efficiency ratio, feed consumed (gm) to weight gain (gm), was obtained for the 60 chicks. The experiment was also concerned with the effects of high levels of copper in the chick feed. Five of the 10 chicks in each level of the feed additive received 400 ppm of copper, while the remaining five chicks received no copper.

The data is contained in the chicken.csv data file.

a. In order to explore the relationship between feed efficiency ratio (FER) and feed additive (A), plot the FER versus A.

b. What type of regression appears most appropriate?

c. Fit first-order, quadratic, and cubic regression models to the data. Which regression equation provides the best fit to the data? Justify your answer using evidence based upon plots and relevant summaries.

d. Is there anything peculiar about any of the data values? Provide an explanation of what may have happened. (Hint: Look at regression diagnostics like plots of the residuals versus the fitted values (or x), plot the leverages, or plot some measure of influence.)

e. Using your best polynomial model from (b) and (c), fit a new model that includes the linear addition of copper and display the estimate table. Does Copper provide a significant improvement to the fit?
Carry out an F-test that compares the Full model that contains Copper and the reduced model that has your polynomial model fit on the additive only. Discuss the results.

Problem 4: [10 pts] Linear Regression with Indicator Variables

Consider the data in smoking_birthweight.csv. This data contains 3 variables: the birth weight of a baby (Weight), the length of gestation (Gestation) in weeks, and the smoking status of the mother (Smoke). The smoking status of the mother in this case is coded as yes or no. This is a categorical variable (aka factor) with 2 categories (a binary variable). We could have coded the levels of this factor as an indicator variable using TRUE or FALSE, or equivalently 1 or 0, respectively.

a. Fit a first-order regression model with birth weight as the response variable and the gestation and smoking status as predictors. Write down the fitted regression model equation and interpret the regression coefficients. If you can do this, you should have no problem handling the extra credit.

b. Plot the fitted regression lines (yes, plural). Why are there two?

Problem 5: [15 pts] Parameter Interpretation with Indicator Variables

Recall that indicator variables, sometimes called "dummy" variables, are binary variables that indicate whether an event is recognized or not (i.e., 1 if TRUE, 0 if FALSE). Suppose we have a data set of reported salaries and highest achieved education levels. Suppose the variables are as follows: salary, noHS, highSchoolGrad, Assoc, Bach, Masters, Doctorate, where the levels of education are either a 1 or 0 depending on whether that is the given observation's highest level of achieved education.

• Write down the multiple linear regression model. Specify which βi correspond to indicator variables.
• Write interpretations for all of your model parameters, that is βi, for i ∈ {0, 1, 2, 3, 4, 5}.
• Now assume we were to add another variable to this data set: an observation's gender. Write down this new model, and now interpret β0.
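
As a reminder of how R turns a two-level factor into a 0/1 indicator column, here is a small sketch. The toy data frame and its column names (y, x, group) are hypothetical and invented purely for illustration; they are not taken from any of the assignment's data files.

```r
# Hypothetical toy data, invented only to show the indicator coding
toy <- data.frame(
  y     = c(10, 12, 15, 11, 14, 18),
  x     = c(1, 2, 3, 1, 2, 3),
  group = factor(c("no", "no", "no", "yes", "yes", "yes"))
)

# The design matrix shows the 0/1 column R creates for the factor;
# the first level ("no") is the baseline and gets no column of its own
X <- model.matrix(y ~ x + group, data = toy)
X

# In a first-order model the indicator's coefficient shifts the intercept,
# which gives one fitted line per level with a common slope
fit <- lm(y ~ x + group, data = toy)
coef(fit)
```

The baseline level is absorbed into the intercept, which is why a factor with k levels contributes only k - 1 indicator columns to the model.
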