Assignment 1

Q2 Analyzing wine data (30 points)

The data for this exercise comes from a paper by Cortez et al. (2009) (https://www.sciencedirect.com/science/article/abs/pii/S0167923609001377?via%3Dihub), in which the authors tried to relate various chemical properties of red and white wine to perceived quality. For this question, we will analyze only the data for the chemical properties, not the quality. Also, the original paper looked at both red and white wine; we will use only the data for red wine. The data can be read in via:

    library(tidyverse)
    wine_data <- read_csv("red_wine_data.csv")  # Be sure this is in your current working directory
    glimpse(wine_data)

    Rows: 1,599
    Columns: 12
    $ `fixed acidity`        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7…
    $ `volatile acidity`     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600…
    $ `citric acid`          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00,…
    $ `residual sugar`       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.…
    $ chlorides              <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069…
    $ `free sulfur dioxide`  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, …
    $ `total sulfur dioxide` <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 10…
    $ density                <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978,…
    $ pH                     <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39,…
    $ sulphates              <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47,…
    $ alcohol                <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 1…
    $ quality                <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5,…

The variables are self-evident from the names. We will not use the quality variable, so we create a new dataset without it via:

    wine_data_chem <- wine_data %>% select(-quality)
    head(wine_data_chem)

    # A tibble: 6 x 11
      `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
                <dbl>              <dbl>         <dbl>            <dbl>     <dbl>
    1             7.4                0.7             0              1.9     0.076
    2             7.8               0.88             0              2.6     0.098
    3             7.8               0.76          0.04              2.3     0.092
    4            11.2               0.28          0.56              1.9     0.075
    5             7.4                0.7             0              1.9     0.076
    6             7.4               0.66             0              1.8     0.075
    # … with 6 more variables: `free sulfur dioxide` <dbl>,
    #   `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
    #   alcohol <dbl>

This is the data you should analyze.

a. (10 points) Using only scatterplots and the sample correlation matrices, summarize what you believe are the most interesting associations you observe among these characteristics. Show both the plots and the summaries you generate to support your conclusions.

b. (20 points) Perform a principal component analysis of this data using your preferred function. As part of this analysis, be sure to complete the following tasks: Report the eigenvalues for all 11 principal components. For the first two principal components, plot and interpret the components in terms of the original variables.
In particular, explain which variables are most highly correlated with each of these two components and how the two components differ from each other. Choose the smallest number of principal components that you believe can summarize the information in the data, and justify your choice.
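
As a starting point for both parts, here is a minimal sketch in R of how the requested quantities might be computed. It is one possible approach, not a model solution. The helper names `top_correlations` and `pca_summary` are hypothetical and introduced only for illustration. The sketch uses base R (`read.csv`, `cor`, `pairs`, `prcomp`) rather than the tidyverse so that it runs without extra packages, and it assumes the variables are standardized before the PCA (`scale. = TRUE`), which is a choice you would need to justify given that the variables are measured on different scales.

```r
# Hypothetical helper: the k strongest pairwise sample correlations
# in an all-numeric data frame, sorted by absolute value.
top_correlations <- function(df, k = 5) {
  r <- cor(df)                                  # sample correlation matrix
  idx <- which(upper.tri(r), arr.ind = TRUE)    # list each pair only once
  out <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
  out <- out[order(-abs(out$r)), ]
  rownames(out) <- NULL
  head(out, k)
}

# Hypothetical helper: PCA on standardized variables
# (assumption: equivalent to working with the correlation matrix).
pca_summary <- function(df) {
  pca <- prcomp(df, center = TRUE, scale. = TRUE)
  eig <- pca$sdev^2                             # eigenvalues, one per component
  # For standardized data, cor(variable j, component k) = loading[j, k] * sdev[k]
  var_pc_cor <- sweep(pca$rotation, 2, pca$sdev, `*`)
  list(
    pca         = pca,
    eigenvalues = eig,
    prop_var    = eig / sum(eig),
    cum_var     = cumsum(eig) / sum(eig),
    var_pc_cor  = var_pc_cor
  )
}

# Usage on the wine data, run only if the file is in the working directory.
if (file.exists("red_wine_data.csv")) {
  wine_data_chem <- read.csv("red_wine_data.csv", check.names = FALSE)
  wine_data_chem$quality <- NULL                # drop the quality variable

  pairs(wine_data_chem, pch = ".")              # scatterplot matrix for part a
  print(round(cor(wine_data_chem), 2))
  print(top_correlations(wine_data_chem))

  res <- pca_summary(wine_data_chem)            # part b
  print(round(res$eigenvalues, 3))              # all 11 eigenvalues
  print(round(res$cum_var, 3))                  # cumulative proportion of variance
  print(round(res$var_pc_cor[, 1:2], 2))        # correlations with PC1 and PC2
  biplot(res$pca, cex = 0.5)
  screeplot(res$pca, npcs = 11, type = "lines")
}
```

The table of variable-component correlations is the natural place to look when interpreting the first two components, while the eigenvalues, the cumulative proportions, and the scree plot each support a different argument for how many components to keep. Which criterion you rely on is part of the justification the question asks for.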