Assignment 2 ETC1010-5510 Patricia Menéndez Tuesday, May 18 2021 library(naniar) library(broom) library(ggmap) library(knitr) library(lubridate) library(rwalkr) library(sugrrants) library(timeDate) library(tsibble) library(here) library(readr) library(tidyverse) library(ggResidpanel) library(gridExtra) tree_data0 <- read_csv("Data/Assignment_data.csv") Part I Question 1: Rename the variables Date Planted and Year Planted to Dateplanted and Yearplanted using the rename() function. Make sure Dateplanted is defined as a date variable. Then extract from the variable Dateplanted the year and store it in a new variable called Year. Display the first 6 rows of the data frame. (5pts) tree_data <- tree_data0 %>% rename(Dateplanted = `Date Planted`, Yearplanted = `Year Planted`) %>% mutate(Dateplanted = dmy(Dateplanted), Year = year(Dateplanted)) head(tree_data) ## # A tibble: 6 x 20 ## `CoM ID` `Common Name` `Scientific Nam~ Genus Family `Diameter Breas~ ##
1 ## 1 1057605 White Poplar Populus alba Popu~ Salic~ NA ## 2 1028440 London Plane Platanus x acer~ Plat~ Plata~ 62 ## 3 1058665 Small-leaved~ Tilia cordata Tilia Malva~ 19 ## 4 1026352 Variegated E~ Ulmus minor Ulmus Ulmac~ 26 ## 5 1038440 Canary Islan~ Pinus canariens~ Pinus Pinac~ 91 ## 6 1015128 London Plane Platanus x acer~ Plat~ Plata~ 99 ## # ... with 14 more variables: Yearplanted , Dateplanted , `Age ## # Description` , `Useful Life Expectency` , `Useful Life Expectency ## # Value` , Precinct , `Located in` , UploadDate , ## # CoordinateLocation , Latitude , Longitude , Easting , ## # Northing , Year #write_csv(tree_data, "Data/Assignment_data.csv") Question 2: Have you noticed any differences between the variables Year and Yearplanted? Why is that? Demonstrate your claims using R code. Fix the problem if there is one (Hint: Use ifelse inside a mutate function to fix the problem and store the data in tree_data_clean). After this question, please use the data in tree_data_clean to proceed. (3pts) Yes, the encoding for 1900 has been converted to 2000 instead. length(which(tree_data$Year!=tree_data$Yearplanted)) ## [1] 5321 tree_data_clean <- tree_data %>% mutate(Year = ifelse(Year != Yearplanted, Yearplanted, Year)) Question 3: Investigate graphically the missing values in the vari- able Dateplanted for the last 1000 rows of the data set. What do you observe? (max 30 words) (2pts) tree_data_singlevariable <- tree_data_clean %>% dplyr::select(Dateplanted) vis_miss(tail(tree_data_singlevariable, n = 1000) , warn_large_data = FALSE) 2 Da tep lan ted (0% ) 0 250 500 750 1000 O bs er va tio ns Present (100%) Question 4: What is the proportion of missing values in each vari- able in the tree data set? Display the results in descending order of the proportion. (2pts) miss_var_summary(tree_data_clean) %>% arrange(-pct_miss) ## # A tibble: 20 x 3 ## variable n_miss pct_miss ## ## 1 Precinct 6828 100 ## 2 Diameter Breast Height 1454 21.3 ## 3 Age Description 1454 21.3 ## 4 Useful Life Expectency 1454 21.3 ## 5 Useful Life Expectency Value 1454 21.3 ## 6 Dateplanted 2 0.0293 ## 7 Year 2 0.0293 ## 8 Common Name 1 0.0146 ## 9 Located in 1 0.0146 ## 10 CoM ID 0 0 ## 11 Scientific Name 0 0 ## 12 Genus 0 0 ## 13 Family 0 0 ## 14 Yearplanted 0 0 ## 15 UploadDate 0 0 ## 16 CoordinateLocation 0 0 3 ## 17 Latitude 0 0 ## 18 Longitude 0 0 ## 19 Easting 0 0 ## 20 Northing 0 0 Question 5: How many observations have a missing value in the variable Dateplanted? Identify the rows and display the infor- mation in those rows. Remove all the rows in the data set of which the variable Dateplanted has a missing value recorded and store the data in tree_data_clean1. Display the first 4 rows of tree_data_clean1. Use R inline code to complete the sentense below. (6pts) Two missing values in the following rows: tree_data_clean %>% dplyr::filter(is.na(Dateplanted)) ## # A tibble: 2 x 20 ## `CoM ID` `Common Name` `Scientific Nam~ Genus Family `Diameter Breas~ ## ## 1 1024155 Cyprus Plane Platanus orient~ Plat~ Plata~ 22 ## 2 1023092 London Plane Platanus x acer~ Plat~ Plata~ 29 ## # ... with 14 more variables: Yearplanted , Dateplanted , `Age ## # Description` , `Useful Life Expectency` , `Useful Life Expectency ## # Value` , Precinct , `Located in` , UploadDate , ## # CoordinateLocation , Latitude , Longitude , Easting , ## # Northing , Year tree_data_clean1 <- tree_data_clean %>% dplyr::filter(!is.na(Dateplanted)) head(tree_data_clean1, 4) ## # A tibble: 4 x 20 ## `CoM ID` `Common Name` `Scientific Nam~ Genus Family `Diameter Breas~ ## ## 1 1057605 White Poplar Populus alba Popu~ Salic~ NA ## 2 1028440 London Plane Platanus x acer~ Plat~ Plata~ 62 ## 3 1058665 Small-leaved~ Tilia cordata Tilia Malva~ 19 ## 4 1026352 Variegated E~ Ulmus minor Ulmus Ulmac~ 26 ## # ... with 14 more variables: Yearplanted , Dateplanted , `Age ## # Description` , `Useful Life Expectency` , `Useful Life Expectency ## # Value` , Precinct , `Located in` , UploadDate , ## # CoordinateLocation , Latitude , Longitude , Easting , ## # Northing , Year The number of rows in the cleaned data set are 6826 and the number of columns are 20 4 Question 6: Create a map with the tree locations in the data set. (2pts) # We have created the map below for you melb_map <- read_rds(here::here("Data/melb-map.rds")) # Here you just need to add the location for each tree into the map. ggmap(melb_map) + geom_point(data = tree_data_clean1, aes(x = Longitude, y = Latitude), colour = "#006400", alpha = 0.6, size = 0.2) −37.820 −37.815 −37.810 −37.805 −37.800 144.94 144.95 144.96 144.97 lon la t Question 7: Create another map and draw trees in the Genus groups of Eucalyptus, Macadamia, Prunus, Acacia, and Quercus. Use the “Dark2” color palette and display the legend at the bottom of the plot. (8pts) selected_group <- tree_data_clean1 %>% dplyr::filter(Genus %in% c("Eucalyptus", "Macadamia", "Prunus", 5 "Acacia", "Quercus")) %>% droplevels() ggmap(melb_map) + geom_point(data = selected_group, aes(x = Longitude, y = Latitude, color = Genus), alpha = 0.6, size = 0.2) + labs(x = "Longitude", y = "Latitude") + scale_colour_brewer(palette = "Dark2", name = "Genus") + guides(col = guide_legend(nrow = 2, byrow = TRUE)) + theme(legend.position = "bottom") −37.820 −37.815 −37.810 −37.805 −37.800 144.94 144.95 144.96 144.97 Longitude La tit ud e Genus Acacia Eucalyptus Prunus Quercus 6 Question 8: Filter the data tree_data_clean1 so that only the variables Year, Located in, and Common Name are displayed. Ar- range the data set by Year in descending order and display the first 4 lines. Call this new data set tree_data_clean_filter. Then answer the following question using inline R code: When (Year), where (Located in) and what tree (Common Name) was the first tree planted in Melbourne according to this data set? (8pts) # This will order the trees from the most recent planted to the older onces # becuase we are using descending order for the Year variable tree_data_clean_filter <- tree_data_clean1 %>% select("Year", "Located in", "Common Name") %>% arrange(desc(Year)) head(tree_data_clean_filter, 4) ## # A tibble: 4 x 3 ## Year `Located in` `Common Name` ## ## 1 2000 Street Small-leaved Linden ## 2 2000 Street Spotted Gum ## 3 2000 Street Drooping sheoak ## 4 2000 Park Kanooka # To find out the older trees you could simple look at the tail of the data # created in the previous R code chunk using --> tail(tree_data_clean_filter, 4) # the function tail() will show you the end of the data. # Alternatively you can simply re-do the same steps as above and # arrange the variable Year from smaller to larger as follows: tree_data_clean_filter2 <- tree_data_clean1 %>% select("Year", "Located in", "Common Name") %>% arrange(Year) head(tree_data_clean_filter2, 4) ## # A tibble: 4 x 3 ## Year `Located in` `Common Name` ## ## 1 1900 Park White Poplar ## 2 1900 Park London Plane ## 3 1900 Street Variegated Elm ## 4 1900 Park Canary Island Pine The first tree was planted in 1900 at a Park and the tree name is Small-leaved Linden 7 Question 9: How many trees were planted in parks and how many in streets? Tabulate the results (only for locations in parks and streets) using the function kable() from the kableExtra R package. (3pts) tree_data_clean1 %>% dplyr::filter(`Located in` %in% c("Park", "Street")) %>% count(`Located in`) %>% kable() Located in n Park 2737 Street 4088 Question 10: How many trees are there in each of the Family groups in the data set tree_data_clean1 (display the first 5 lines of the results in descending order)? (2pt) tree_data_clean1 %>% count(Family, sort = TRUE) %>% head(n = 5) ## # A tibble: 5 x 2 ## Family n ## ## 1 Myrtaceae 2102 ## 2 Platanaceae 1512 ## 3 Ulmaceae 1125 ## 4 Fabaceae 327 ## 5 Fagaceae 254 Question 11: Create a markdown table displaying the number of trees planted in each year (use variable Yearplanted) with common names Ironbark, Olive, Plum, Oak, and Elm (Hint: Use kable() from the gridExtra R package). What is the oldest most abundant tree in this group? (8pts) tree_data_clean1 %>% dplyr::filter(`Common Name` %in% c("Ironbark", "Olive", "Plum", "Oak", "Elm")) %>% group_by(Yearplanted, `Common Name`) %>% count(`Common Name`, sort = TRUE) %>% kable() 8 Yearplanted Common Name n 1900 Elm 179 1900 Ironbark 29 2000 Ironbark 23 2000 Elm 18 1900 Olive 17 2000 Oak 9 1900 Oak 4 The oldest most abundant tree was elm. Question 12: Select the trees with diameters (Diameter Breast Height) greater than 40 cm and smaller 100 cm and comment on where the trees are located (streets or parks). (max 25 words) (3pts) large_trees_data <- tree_data_clean1 %>% dplyr::filter(`Diameter Breast Height` > 40 , `Diameter Breast Height` < 100) %>% count(`Located in`) Question 13: Plot the trees within the diameter range that you have selected in Question 12, which are located in parks and streets on a map using 2 different colours to differentiate their locations (streets or parks). (6pts) large_trees_data_parks <- tree_data_clean1 %>% dplyr::filter(`Diameter Breast Height` > 40 , `Diameter Breast Height` < 100) Large trees seem to be concentrated on certain streets. ggmap(melb_map) + geom_point(data = large_trees_data_parks , aes(x = Longitude, y = Latitude, color = `Located in`), alpha = 0.6, size = 0.2) + labs(x = "Longitude", y = "Latitude") + scale_colour_brewer(palette = "Dark2") + guides(col = guide_legend(nrow = 2, byrow = TRUE)) + theme(legend.position = "bottom") 9 −37.820 −37.815 −37.810 −37.805 −37.800 144.94 144.95 144.96 144.97 Longitude La tit ud e Located in Park Street Question 14: Create a time series plot (using geom_line) that dis- plays the total number of trees planted per year in the data set tree_data_clean1 that belong to the Families: Myrtaceae, Are- caceae, and Ulmaceae. What do you observe from the plot? (6pts) Fig_data <- tree_data_clean1 %>% dplyr::filter(Family %in% c("Myrtaceae", "Arecaceae", "Ulmaceae")) %>% mutate(Family = as.factor(Family)) %>% group_by(Year, Family) %>% count(Family, sort = TRUE) ggplot(Fig_data, aes( x = Year, y = n, color = Family)) + geom_line() 10 0500 1000 1500 1900 1925 1950 1975 2000 Year n Family Arecaceae Myrtaceae Ulmaceae With time less arecaceae and ulmaceae family trees have been planted. After 1977 there was an increase in the number of myrtaceae family trees planted. Part 2: Simulation Exercise Question 15: Create a data frame called simulation_data that contains 2 variables with names response and covariate. Gen- erate the variables according to the following model: response = 3.5×covariate+epsilon where covariate is a variable that takes values 0, 1, 2, . . . , 100 and is generated according to a Normal distribution (Hint: Use the function rnorm() to generate epsilon.) (3pts) set.seed(2021) simulation_data <- data.frame( covariate = c(0:100), response = 3.5*c(0:100) + rnorm(101)) 11 Question 16: Display graphically the relationship between the vari- ables response and covariate (1pt) using a point plot. Which kind of relationship do you observe? (2pts) ggplot(simulation_data, aes(x = covariate, y = response)) + geom_point() 0 100 200 300 0 25 50 75 100 covariate re sp on se The relationship between the variables response and covariate is linear Question 17: Fit a linear model between the variables response and covariate that you generate in Question 15 and display the model summary. (2pts) mod <- lm(response ~ covariate, data = simulation_data) summary(mod) ## ## Call: ## lm(formula = response ~ covariate, data = simulation_data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.07431 -0.71466 0.05844 0.64196 2.25176 ## ## Coefficients: 12 ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.135896 0.199948 0.68 0.498 ## covariate 3.493775 0.003455 1011.35 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.012 on 99 degrees of freedom ## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999 ## F-statistic: 1.023e+06 on 1 and 99 DF, p-value: < 2.2e-16 Question 18: What are the values for the intercept and the slope in the estimated model in Question 17 (Hint: Use the function coef())? How do these values compare with the values in the simulation model? (max 50 words) (2pts) coef(mod) ## (Intercept) covariate ## 0.1358957 3.4937754 The slope is estimated well and the intercept in this model takes the value of epsilon when covariate takes value 0. Question 19: Create a figure to display the diagnostic plots of the linear model that you fit in Question 17. Comment on the diag- nostic plots (max 50 words). Is this a good/bad model and why? (max 30 words) (4pts) resid_panel(mod, plots = "all") 13 −2 −1 0 1 2 0 100 200 300 Predicted Values R es id ua ls Residual Plot −2 −1 0 1 2 0 25 50 75 100 Observation Number R es id ua ls Index Plot 0 100 200 300 0 100 200 300 Predicted Values re sp on se Response vs Predicted −2 −1 0 1 2 −2 −1 0 1 2 Theoretical Quantiles Sa m pl e Qu an tile s Q−Q Plot 0.0 0.2 0.4 −2.5 0.0 2.5 Residuals D en si ty Histogram −2 −1 0 1 2 R es id ua ls Boxplot 0.00 0.02 0.04 0.06 0 25 50 75 100 Observation CO O K' s D COOK's D Plot 0.0 0.5 1.0 1.5 0 100 200 300 Predicted Values St an da rd ize d Re sid ua ls Location−Scale Plot − − − Cook's distance contours−2 −1 0 1 2 0.00 0.01 0.02 0.03 0.04 LeverageSt an da rd ize d Re sid ua ls Residual−Leverage Plot Question 20: Report R2, Radjusted, AIC, and BIC. Is this a good/bad model? Please explain your answer. (max 30 words) (2pts) broom::glance(mod) ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## ## 1 1.00 1.00 1.01 1022819. 1.58e-200 1 -144. 293. 301. ## # ... with 3 more variables: deviance , df.residual , nobs 14 欢迎咨询51作业君