辅导案例-GU4206 GR5206
STAT GU4206 GR5206 Midterm Gabriel 3/8/2019 Part I - Character data and regular expressions Consider the following toy dataset strings_data.csv. This dataset has 461 rows (or length 461 using readLines) and consists of random character strings. char_data <- readLines("strings_data.csv") head(char_data,4) ## [1] "\"strings\"" ## [2] "\"rmJgFZUGKsBlvmuUOuWnFUyziiyWEEhiRROlJJXRXxOwp\"" ## [3] "\"bacUqblSKDopCEAYWdgD\"" ## [4] "\"qsPuSJdkmv\"" length(char_data) ## [1] 461 Among the 461 cases, several rows contain numeric digits and a specific string of the form “Group_Letter”, where “Letter”" is a single uppercase letter. For example, the 6th element contains the symbols “0.2”,“9.454”,“Group_I”. char_data[6] ## [1] "\"SBolTFf0.2nMoQ9.454lKlgjQZGroup_IOMLFgXj\"" c("0.2","9.454","Group_I") ## [1] "0.2" "9.454" "Group_I" Problem 1 Your task is to extract the numeric digits and the group variable from this character string vector. Notes: 1. The first number x is a single digit followed by a period and at least one digit. There are a few cases where the first number is only a single digit without a period. 2. The second number y is one or two digits followed by a period and at least one digit. Note that the second number can be negative or positive. 3. The group value is the string "Group_" followed by a single capital letter. For example "Group_I" and "Group_S" are both elements of the third string of interest. Once you extract all three symbols, make sure to convert the numeric digits to a numeric mode (use as.numeric()) and organize the scrapped information in a dataframe. Your final dataframe should have 230 rows by 3 columns. The first three rows of your dataframe should look like the following output: data.frame(x=c(0.20,0.36,0.56), y=c(9.454,9.454,9.454), Group=c("Group_I","Group_I","Group_I")) 1 ## x y Group ## 1 0.20 9.454 Group_I ## 2 0.36 9.454 Group_I ## 3 0.56 9.454 Group_I Solution num_exp <- "[0-9]|(-?[0-9]{1,2}\\.[0-9]{0,4})" num_index <- grep(char_data,pattern=num_exp) num_locations <- gregexpr(pattern=num_exp,char_data[num_index]) nums <- regmatches(char_data[num_index],num_locations) x <- sapply(nums,function(x) as.numeric(x[1])) y <-sapply(nums,function(x) as.numeric(x[2])) length(x) ## [1] 230 length(y) ## [1] 230 group_exp <- "Group_[A-Z]" num_locations <- gregexpr(pattern=group_exp,char_data[num_index]) nums <- regmatches(char_data[num_index],num_locations) Letter <- sapply(nums,function(x) x[1]) head(Letter) ## [1] "Group_I" "Group_I" "Group_I" "Group_I" "Group_I" "Group_I" Check result! my_data <- data.frame(x,y,Letter) my_data[1:3,] ## x y Letter ## 1 0.20 9.454 Group_I ## 2 0.36 9.454 Group_I ## 3 0.56 9.454 Group_I Problem 2 Use both base R and ggplot to construct a scatterplot of the variables y versus x and split the colors of the plot by the variable Group. Also include a legend, relabel the axes and include a title. Make sure the legend doesn’t cover up the plot in base R. Base R plot Letter_fac <- factor(my_data$Letter) plot(my_data$x,my_data$y,col=Letter_fac, ylab="y",xlab="x",main="My Base R Plot") legend("topright",legend=levels(Letter_fac), fill=1:length(levels(Letter_fac)),cex=.5) 2 0 2 4 6 8 10 0 2 4 6 8 10 My Base R Plot x y Group_A Group_H Group_I Group_S Group_T ggplot plot library(ggplot2) ggplot(data=my_data)+ geom_point(mapping=aes(x=x,y=y,col=Letter))+ labs(title="My ggplot Plot") 3 0.0 2.5 5.0 7.5 10.0 0.0 2.5 5.0 7.5 10.0 x y Letter Group_A Group_H Group_I Group_S Group_T My ggplot Plot Part II - Data proccessing and exploratory analysis The data comprise of roughly 25,000 records for males between the age of 18 and 70 who are full time workers. A variety of variables are given for each subject: years of education and job experience, college graduate (yes, no), working in or near a city (yes, no), US region (midwest, northeast, south, west), commuting distance, number of employees in a company, and race (African America, Caucasian, Other). The response variable is weekly wages (in dollars). The data are taken many decades ago so the wages are low compared to current times. salary_data <- read.csv("salary.txt",as.is=T,header=T) head(salary_data) ## wage edu exp city reg race deg com emp ## 1 354.94 7 45 yes northeast white no 24.3 200 ## 2 370.37 9 9 yes northeast white no 26.2 130 ## 3 754.94 11 46 yes northeast white no 26.4 153 ## 4 593.54 12 36 yes northeast other no 9.9 86 ## 5 377.23 16 22 yes northeast white yes 7.1 181 ## 6 284.90 8 51 yes northeast white no 11.4 32 Below I am defining a new variable in the salary_data dataframe which computes the natural logarithm of wages. salary_data$log_wage <- log(salary_data$wage) 4 Problem 3 Use the summary() function on the salary dataset to check if the variables make sense. Specifically, one of the continuous variables has some “funny” values. Remove the rows of the dataframe corresponding to these strange values. If you can’t figure this question out, then move on because you can still solve Problem 4 & 5 without Problem 3. Solution summary(salary_data) ## wage edu exp city ## Min. : 50.39 Min. : 0.00 Min. :-4.00 Length:24823 ## 1st Qu.: 356.13 1st Qu.:12.00 1st Qu.: 9.00 Class :character ## Median : 546.06 Median :12.00 Median :16.00 Mode :character ## Mean : 637.82 Mean :13.06 Mean :18.53 ## 3rd Qu.: 830.96 3rd Qu.:16.00 3rd Qu.:27.00 ## Max. :18777.20 Max. :18.00 Max. :63.00 ## reg race deg com ## Length:24823 Length:24823 Length:24823 Min. : 0.00 ## Class :character Class :character Class :character 1st Qu.: 5.30 ## Mode :character Mode :character Mode :character Median :10.50 ## Mean :11.66 ## 3rd Qu.:16.80 ## Max. :49.70 ## emp log_wage ## Min. : 3.0 Min. :3.920 ## 1st Qu.: 67.0 1st Qu.:5.875 ## Median :111.0 Median :6.303 ## Mean :123.2 Mean :6.265 ## 3rd Qu.:166.0 3rd Qu.:6.723 ## Max. :631.0 Max. :9.840 salary_data <- salary_data[-which(salary_data$exp<0),] Problem 4 Using ggplot, plot log_wages against work experience, i.e., x=exp and y=log_wages. In this graphic, change the transparency of the points so that the scatterplot does not look so dense. Note: the alpha parameter changes the transparency. Also label the plot appropriately. Solution library(ggplot2) ggplot(data=salary_data)+ geom_point(mapping = aes(x=exp,y=log_wage),alpha=1/10) 5 46 8 10 0 20 40 60 exp lo g_ wa ge Notice that your graphic constructed from Problem 4 shows a quadratic or curved relationship between log_wages against exp. The next task is to plot three quadratic functions for each race level “black”, “white” and “other”. To estimate the quadratic fit, you can use the following function quad_fit: quad_fit <- function(data_sub) { return(lm(log_wage~exp+I(exp^2),data=data_sub)$coefficients) } quad_fit(salary_data) ## (Intercept) exp I(exp^2) ## 5.685505082 0.060770975 -0.001095438 The above function computes the least squares quadratic fit and returns coefficients aˆ1,aˆ2 and aˆ3, where Yˆ = aˆ1 + aˆ2x+ aˆ3x2 and Yˆ = log(wage) and x = exp. Use ggplot to accomplish this task or use base R graphics for partial credit. Make sure to include a legend and appropriate labels. Solution data_black <- salary_data[salary_data$race=="black",] data_white <- salary_data[salary_data$race=="white",] data_other <- salary_data[salary_data$race=="other",] fit_black <- quad_fit(data_black) fit_white <- quad_fit(data_white) fit_other <- quad_fit(data_other) 6 x <- seq(0,60,by=.01) y_black <- fit_black[1]+fit_black[2]*x+fit_black[3]*x^2 y_white <- fit_white[1]+fit_white[2]*x+fit_white[3]*x^2 y_other <- fit_other[1]+fit_other[2]*x+fit_other[3]*x^2 len_x <- length(x) plot_data <- data.frame(x=rep(x,3), Average=c(y_black,y_white,y_other), Race=c(rep("Black",len_x), rep("White",len_x), rep("Other",len_x)) ) head(plot_data) ## x Average Race ## 1 0.00 5.514983 Black ## 2 0.01 5.515450 Black ## 3 0.02 5.515918 Black ## 4 0.03 5.516385 Black ## 5 0.04 5.516851 Black ## 6 0.05 5.517318 Black library(ggplot2) ggplot(data=plot_data)+ geom_line(mapping = aes(x=x,y=Average,col=Race))+ labs(x="Work Experience",y="Average Log-Wages",title="My Plot") 5.4 5.7 6.0 6.3 6.6 0 20 40 60 Work Experience Av e ra ge L og −W a ge s Race Black Other White My Plot 7 Part III - The Bootstrap Data and model description Consider a study that assesses how a drug affects someone’s resting heart rate. The study consists of n = 60 respondents. The researcher randomly places the respondents into three groups; control group and two dosage groups (20 each). The first drug group is given 200 mg (x1) and the second drug group is given 500 mg (x2). She then measures each respondent’s resting heart rate 1 hour after the drug was administered (Y ). She also measures other characteristics of each respondent; age (x3), weight (x4), height (x5), gender (x6) and initial resting heart rate before the drug was administered (x7). The statistical linear regression model is: Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + , iid∼ N(0, σ2). There are three dummy variables for this model: x1 = { 1 if 200mg 0 otherwse x2 = { 1 if 500mg 0 otherwse x6 = { 1 if male 0 otherwse Based on the above variable coding, the control group is described through the intercept β0. Exploratory analysis The dataset drugstudy.csv is read in below. drugstudy <- read.table("drugstudy.txt",header=T) head(drugstudy) ## Final.HR Initial.HR Dose1 Dose2 Age Height Weight Gender ## 1 75.1 73.6 0 0 29 73.1 251.73 1 ## 2 71.6 71.7 0 0 34 72.4 151.59 1 ## 3 65.5 66.5 0 0 25 67.2 133.89 1 ## 4 77.2 72.7 0 0 39 69.8 154.91 1 ## 5 75.8 75.8 0 0 32 72.7 186.59 1 ## 6 67.9 68.7 0 0 25 66.4 205.77 1 Problem 5 Compute the average final resting heart rate for each drug group. Also compute the average initial resting heart rate for each drug group. Display the results in dataframe or table. Solution Drug <- ifelse(drugstudy$Dose1==0 & drugstudy$Dose2==0,"Control", ifelse(drugstudy$Dose1==1 & drugstudy$Dose2==0,"200mg","500mg")) Final <- tapply(drugstudy$Final.HR,Drug,mean) Initial <- tapply(drugstudy$Initial.HR,Drug,mean) data.frame(Initial,Final) ## Initial Final ## 200mg 73.785 71.920 ## 500mg 74.890 71.145 ## Control 72.950 72.695 8 Problem 6 Construct a comparative boxplot of the respondents final resting heart rate for each drug group. Use base R or ggplot. Make sure to label the plot appropriately. Base R plot Drug <- factor(Drug,levels=c("Control","200mg","500mg")) boxplot(drugstudy$Final.HR~Drug,main="Comparative Boxplot",ylab="Final Resting Heart Rate") Control 200mg 500mg 60 65 70 75 80 Comparative Boxplot Fi na l R es tin g He ar t R at e ggplot plot library(ggplot2) ggplot(data=drugstudy)+ geom_boxplot(mapping=aes(x=Drug,y=Final.HR),col="blue")+ labs(title="Comparative Boxplot",x="",y="Final Resting Heart Rate") 9 60 65 70 75 80 Control 200mg 500mg Fi na l R es tin g He ar t R at e Comparative Boxplot Nonparametric analysis (bootstrap) Consider a nonparametric approach to assess the drug’s impact on final resting heart rate. More specifically, the researcher is going to perform a bootstrap procedure on the following parameters: 1. β1 2. β2 3. β1 − β2 The final bootstrap intervals incorporate the three testing procedures: 1. H0 : β1 = 0 vs. HA : β1 6= 0 2. H0 : β2 = 0 vs. HA : β2 6= 0 3. H0 : β1 − β2 = 0 vs. HA : β1 − β2 6= 0 When testing β1 = 0, we are investigating the impact of the 200mg dosage group versus the control group. Similarly, when testing β2 = 0, we are investigating the impact of the 500mg dosage group versus the control group. The third test β1 − β2 = 0 is describing if the low dosage group has the same impact on resting heart rate as the high dosage group. Problem 7 Perform the follwong tasks! Run a bootstrap procedure on parameters β1, β2 and β1 − β2. (i) Construct a table or dataframe displaying the least squares estimators of βˆ1, βˆ2 and βˆ1 − βˆ2 of the original dataset, (ii) the bootstrapped standard 10 errors, and (iii) the bootstrap 95% confidence intervals. Use the traditional bootstrap intervals with B = 1000 boot iterations. The table should look similar to the following output: Parameter Estimate Boot SE 95% Boot L-Bound 95% Boot U-Bound Beta1 # # # # Beta2 # # # # Beta1_Beta2 # # # # Solution LM_original <- lm(Final.HR~.,data=drugstudy)$coefficients LM_original ## (Intercept) Initial.HR Dose1 Dose2 Age Height ## -1.38229005 0.98409928 -2.48514208 -4.06064284 0.23352275 -0.12559980 ## Weight Gender ## 0.02845916 -0.29880464 beta1_hat <- LM_original["Dose1"] beta2_hat <- LM_original["Dose2"] beta1_beta_2_hat <- LM_original["Dose1"]-LM_original["Dose2"] # Boot B <- 1000 n <- nrow(drugstudy) boot_matrix <- matrix(NA,nrow=B,ncol=3) colnames(boot_matrix) <- c("beta1","beta2","beta1-beta2") for (b in 1:B) { boot_index <- sample(1:n,n,replace=T) Boot_slopes <- lm(Final.HR~.,data=drugstudy[boot_index,])$coefficients boot_matrix[b,1] <- Boot_slopes["Dose1"] boot_matrix[b,2] <- Boot_slopes["Dose2"] boot_matrix[b,3] <- Boot_slopes["Dose1"]-Boot_slopes["Dose2"] } out <- data.frame(Estimates=c(beta1_hat,beta2_hat,beta1_beta_2_hat), Boot.SE=apply(boot_matrix,2,sd), Boot.L=2*c(beta1_hat,beta2_hat,beta1_beta_2_hat)-apply(boot_matrix,2,quantile,probs=.975), Boot.U=2*c(beta1_hat,beta2_hat,beta1_beta_2_hat)-apply(boot_matrix,2,quantile,probs=.025)) row.names(out) <- c("beta1","beta2","beta1-beta2") out ## Estimates Boot.SE Boot.L Boot.U ## beta1 -2.485142 0.7058631 -3.9116682 -1.229388 ## beta2 -4.060643 0.6265362 -5.2910791 -2.801721 ## beta1-beta2 1.575501 0.7078828 0.1497114 2.837857 Problem 8 Briefly interpret your results. More specifically, check if zero falls in the bootstrap intervals and conclude if we do or do not show statistical significance. 11