辅导案例-GU4206 GR5206

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

STAT GU4206 GR5206 Midterm
Gabriel
3/8/2019
Part I - Character data and regular expressions
Consider the following toy dataset strings_data.csv. This dataset has 461 rows (or length 461 using
readLines) and consists of random character strings.
char_data <- readLines("strings_data.csv")
head(char_data,4)
## [1] "\"strings\""
## [2] "\"rmJgFZUGKsBlvmuUOuWnFUyziiyWEEhiRROlJJXRXxOwp\""
## [3] "\"bacUqblSKDopCEAYWdgD\""
## [4] "\"qsPuSJdkmv\""
length(char_data)
## [1] 461
Among the 461 cases, several rows contain numeric digits and a specific string of the form “Group_Letter”,
where “Letter”" is a single uppercase letter. For example, the 6th element contains the symbols
“0.2”,“9.454”,“Group_I”.
char_data[6]
## [1] "\"SBolTFf0.2nMoQ9.454lKlgjQZGroup_IOMLFgXj\""
c("0.2","9.454","Group_I")
## [1] "0.2" "9.454" "Group_I"
Problem 1
Your task is to extract the numeric digits and the group variable from this character string vector. Notes:
1. The first number x is a single digit followed by a period and at least one digit. There are a few cases
where the first number is only a single digit without a period.
2. The second number y is one or two digits followed by a period and at least one digit. Note that the
second number can be negative or positive.
3. The group value is the string "Group_" followed by a single capital letter. For example "Group_I" and
"Group_S" are both elements of the third string of interest.
Once you extract all three symbols, make sure to convert the numeric digits to a numeric mode (use
as.numeric()) and organize the scrapped information in a dataframe. Your final dataframe should have 230
rows by 3 columns. The first three rows of your dataframe should look like the following output:
data.frame(x=c(0.20,0.36,0.56),
y=c(9.454,9.454,9.454),
Group=c("Group_I","Group_I","Group_I"))
1
## x y Group
## 1 0.20 9.454 Group_I
## 2 0.36 9.454 Group_I
## 3 0.56 9.454 Group_I
Solution
num_exp <- "[0-9]|(-?[0-9]{1,2}\\.[0-9]{0,4})"
num_index <- grep(char_data,pattern=num_exp)
num_locations <- gregexpr(pattern=num_exp,char_data[num_index])
nums <- regmatches(char_data[num_index],num_locations)
x <- sapply(nums,function(x) as.numeric(x[1]))
y <-sapply(nums,function(x) as.numeric(x[2]))
length(x)
## [1] 230
length(y)
## [1] 230
group_exp <- "Group_[A-Z]"
num_locations <- gregexpr(pattern=group_exp,char_data[num_index])
nums <- regmatches(char_data[num_index],num_locations)
Letter <- sapply(nums,function(x) x[1])
head(Letter)
## [1] "Group_I" "Group_I" "Group_I" "Group_I" "Group_I" "Group_I"
Check result!
my_data <- data.frame(x,y,Letter)
my_data[1:3,]
## x y Letter
## 1 0.20 9.454 Group_I
## 2 0.36 9.454 Group_I
## 3 0.56 9.454 Group_I
Problem 2
Use both base R and ggplot to construct a scatterplot of the variables y versus x and split the colors of
the plot by the variable Group. Also include a legend, relabel the axes and include a title. Make sure the
legend doesn’t cover up the plot in base R.
Base R plot
Letter_fac <- factor(my_data$Letter)
plot(my_data$x,my_data$y,col=Letter_fac,
ylab="y",xlab="x",main="My Base R Plot")
legend("topright",legend=levels(Letter_fac),
fill=1:length(levels(Letter_fac)),cex=.5)
2
0 2 4 6 8 10
0
2
4
6
8
10
My Base R Plot
x
y
Group_A
Group_H
Group_I
Group_S
Group_T
ggplot
plot
library(ggplot2)
ggplot(data=my_data)+
geom_point(mapping=aes(x=x,y=y,col=Letter))+
labs(title="My ggplot Plot")
3
0.0
2.5
5.0
7.5
10.0
0.0 2.5 5.0 7.5 10.0
x
y
Letter
Group_A
Group_H
Group_I
Group_S
Group_T
My ggplot Plot
Part II - Data proccessing and exploratory analysis
The data comprise of roughly 25,000 records for males between the age of 18 and 70 who are full time workers.
A variety of variables are given for each subject: years of education and job experience, college graduate (yes,
no), working in or near a city (yes, no), US region (midwest, northeast, south, west), commuting distance,
number of employees in a company, and race (African America, Caucasian, Other). The response variable is
weekly wages (in dollars). The data are taken many decades ago so the wages are low compared to current
times.
salary_data <- read.csv("salary.txt",as.is=T,header=T)
head(salary_data)
## wage edu exp city reg race deg com emp
## 1 354.94 7 45 yes northeast white no 24.3 200
## 2 370.37 9 9 yes northeast white no 26.2 130
## 3 754.94 11 46 yes northeast white no 26.4 153
## 4 593.54 12 36 yes northeast other no 9.9 86
## 5 377.23 16 22 yes northeast white yes 7.1 181
## 6 284.90 8 51 yes northeast white no 11.4 32
Below I am defining a new variable in the salary_data dataframe which computes the natural logarithm of
wages.
salary_data$log_wage <- log(salary_data$wage)
4
Problem 3
Use the summary() function on the salary dataset to check if the variables make sense. Specifically, one of
the continuous variables has some “funny” values. Remove the rows of the dataframe corresponding to these
strange values. If you can’t figure this question out, then move on because you can still solve Problem 4 & 5
without Problem 3.
Solution
summary(salary_data)
## wage edu exp city
## Min. : 50.39 Min. : 0.00 Min. :-4.00 Length:24823
## 1st Qu.: 356.13 1st Qu.:12.00 1st Qu.: 9.00 Class :character
## Median : 546.06 Median :12.00 Median :16.00 Mode :character
## Mean : 637.82 Mean :13.06 Mean :18.53
## 3rd Qu.: 830.96 3rd Qu.:16.00 3rd Qu.:27.00
## Max. :18777.20 Max. :18.00 Max. :63.00
## reg race deg com
## Length:24823 Length:24823 Length:24823 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 5.30
## Mode :character Mode :character Mode :character Median :10.50
## Mean :11.66
## 3rd Qu.:16.80
## Max. :49.70
## emp log_wage
## Min. : 3.0 Min. :3.920
## 1st Qu.: 67.0 1st Qu.:5.875
## Median :111.0 Median :6.303
## Mean :123.2 Mean :6.265
## 3rd Qu.:166.0 3rd Qu.:6.723
## Max. :631.0 Max. :9.840
salary_data <- salary_data[-which(salary_data$exp<0),]
Problem 4
Using ggplot, plot log_wages against work experience, i.e., x=exp and y=log_wages. In this graphic,
change the transparency of the points so that the scatterplot does not look so dense. Note: the alpha
parameter changes the transparency. Also label the plot appropriately.
Solution
library(ggplot2)
ggplot(data=salary_data)+
geom_point(mapping = aes(x=exp,y=log_wage),alpha=1/10)
5
46
8
10
0 20 40 60
exp
lo
g_
wa
ge
Notice that your graphic constructed from Problem 4 shows a quadratic or curved relationship between
log_wages against exp. The next task is to plot three quadratic functions for each race level “black”,
“white” and “other”. To estimate the quadratic fit, you can use the following function quad_fit:
quad_fit <- function(data_sub) {
return(lm(log_wage~exp+I(exp^2),data=data_sub)$coefficients)
}
quad_fit(salary_data)
## (Intercept) exp I(exp^2)
## 5.685505082 0.060770975 -0.001095438
The above function computes the least squares quadratic fit and returns coefficients aˆ1,aˆ2 and aˆ3, where
Yˆ = aˆ1 + aˆ2x+ aˆ3x2
and Yˆ = log(wage) and x = exp.
Use ggplot to accomplish this task or use base R graphics for partial credit. Make sure to include a legend
and appropriate labels.
Solution
data_black <- salary_data[salary_data$race=="black",]
data_white <- salary_data[salary_data$race=="white",]
data_other <- salary_data[salary_data$race=="other",]
fit_black <- quad_fit(data_black)
fit_white <- quad_fit(data_white)
fit_other <- quad_fit(data_other)
6
x <- seq(0,60,by=.01)
y_black <- fit_black[1]+fit_black[2]*x+fit_black[3]*x^2
y_white <- fit_white[1]+fit_white[2]*x+fit_white[3]*x^2
y_other <- fit_other[1]+fit_other[2]*x+fit_other[3]*x^2
len_x <- length(x)
plot_data <- data.frame(x=rep(x,3),
Average=c(y_black,y_white,y_other),
Race=c(rep("Black",len_x),
rep("White",len_x),
rep("Other",len_x))
)
head(plot_data)
## x Average Race
## 1 0.00 5.514983 Black
## 2 0.01 5.515450 Black
## 3 0.02 5.515918 Black
## 4 0.03 5.516385 Black
## 5 0.04 5.516851 Black
## 6 0.05 5.517318 Black
library(ggplot2)
ggplot(data=plot_data)+
geom_line(mapping = aes(x=x,y=Average,col=Race))+
labs(x="Work Experience",y="Average Log-Wages",title="My Plot")
5.4
5.7
6.0
6.3
6.6
0 20 40 60
Work Experience
Av
e
ra
ge
L
og
−W
a
ge
s
Race
Black
Other
White
My Plot
7
Part III - The Bootstrap
Data and model description
Consider a study that assesses how a drug affects someone’s resting heart rate. The study consists of n = 60
respondents. The researcher randomly places the respondents into three groups; control group and two dosage
groups (20 each). The first drug group is given 200 mg (x1) and the second drug group is given 500 mg (x2).
She then measures each respondent’s resting heart rate 1 hour after the drug was administered (Y ). She also
measures other characteristics of each respondent; age (x3), weight (x4), height (x5), gender (x6) and initial
resting heart rate before the drug was administered (x7). The statistical linear regression model is:
Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + ,
iid∼ N(0, σ2).
There are three dummy variables for this model:
x1 =
{
1 if 200mg
0 otherwse
x2 =
{
1 if 500mg
0 otherwse
x6 =
{
1 if male
0 otherwse
Based on the above variable coding, the control group is described through the intercept β0.
Exploratory analysis
The dataset drugstudy.csv is read in below.
drugstudy <- read.table("drugstudy.txt",header=T)
head(drugstudy)
## Final.HR Initial.HR Dose1 Dose2 Age Height Weight Gender
## 1 75.1 73.6 0 0 29 73.1 251.73 1
## 2 71.6 71.7 0 0 34 72.4 151.59 1
## 3 65.5 66.5 0 0 25 67.2 133.89 1
## 4 77.2 72.7 0 0 39 69.8 154.91 1
## 5 75.8 75.8 0 0 32 72.7 186.59 1
## 6 67.9 68.7 0 0 25 66.4 205.77 1
Problem 5
Compute the average final resting heart rate for each drug group. Also compute the average initial resting
heart rate for each drug group. Display the results in dataframe or table.
Solution
Drug <- ifelse(drugstudy$Dose1==0 & drugstudy$Dose2==0,"Control",
ifelse(drugstudy$Dose1==1 & drugstudy$Dose2==0,"200mg","500mg"))
Final <- tapply(drugstudy$Final.HR,Drug,mean)
Initial <- tapply(drugstudy$Initial.HR,Drug,mean)
data.frame(Initial,Final)
## Initial Final
## 200mg 73.785 71.920
## 500mg 74.890 71.145
## Control 72.950 72.695
8
Problem 6
Construct a comparative boxplot of the respondents final resting heart rate for each drug group. Use base R
or ggplot. Make sure to label the plot appropriately.
Base R plot
Drug <- factor(Drug,levels=c("Control","200mg","500mg"))
boxplot(drugstudy$Final.HR~Drug,main="Comparative Boxplot",ylab="Final Resting Heart Rate")
Control 200mg 500mg
60
65
70
75
80
Comparative Boxplot
Fi
na
l R
es
tin
g
He
ar
t R
at
e
ggplot plot
library(ggplot2)
ggplot(data=drugstudy)+
geom_boxplot(mapping=aes(x=Drug,y=Final.HR),col="blue")+
labs(title="Comparative Boxplot",x="",y="Final Resting Heart Rate")
9
60
65
70
75
80
Control 200mg 500mg
Fi
na
l R
es
tin
g
He
ar
t R
at
e
Comparative Boxplot
Nonparametric analysis (bootstrap)
Consider a nonparametric approach to assess the drug’s impact on final resting heart rate. More specifically,
the researcher is going to perform a bootstrap procedure on the following parameters:
1. β1
2. β2
3. β1 − β2
The final bootstrap intervals incorporate the three testing procedures:
1. H0 : β1 = 0 vs. HA : β1 6= 0
2. H0 : β2 = 0 vs. HA : β2 6= 0
3. H0 : β1 − β2 = 0 vs. HA : β1 − β2 6= 0
When testing β1 = 0, we are investigating the impact of the 200mg dosage group versus the control group.
Similarly, when testing β2 = 0, we are investigating the impact of the 500mg dosage group versus the control
group. The third test β1 − β2 = 0 is describing if the low dosage group has the same impact on resting heart
rate as the high dosage group.
Problem 7
Perform the follwong tasks!
Run a bootstrap procedure on parameters β1, β2 and β1 − β2. (i) Construct a table or dataframe displaying
the least squares estimators of βˆ1, βˆ2 and βˆ1 − βˆ2 of the original dataset, (ii) the bootstrapped standard
10
errors, and (iii) the bootstrap 95% confidence intervals. Use the traditional bootstrap intervals with B = 1000
boot iterations. The table should look similar to the following output:
Parameter Estimate Boot SE 95% Boot L-Bound 95% Boot U-Bound
Beta1 # # # #
Beta2 # # # #
Beta1_Beta2 # # # #
Solution
LM_original <- lm(Final.HR~.,data=drugstudy)$coefficients
LM_original
## (Intercept) Initial.HR Dose1 Dose2 Age Height
## -1.38229005 0.98409928 -2.48514208 -4.06064284 0.23352275 -0.12559980
## Weight Gender
## 0.02845916 -0.29880464
beta1_hat <- LM_original["Dose1"]
beta2_hat <- LM_original["Dose2"]
beta1_beta_2_hat <- LM_original["Dose1"]-LM_original["Dose2"]
# Boot
B <- 1000
n <- nrow(drugstudy)
boot_matrix <- matrix(NA,nrow=B,ncol=3)
colnames(boot_matrix) <- c("beta1","beta2","beta1-beta2")
for (b in 1:B) {
boot_index <- sample(1:n,n,replace=T)
Boot_slopes <- lm(Final.HR~.,data=drugstudy[boot_index,])$coefficients
boot_matrix[b,1] <- Boot_slopes["Dose1"]
boot_matrix[b,2] <- Boot_slopes["Dose2"]
boot_matrix[b,3] <- Boot_slopes["Dose1"]-Boot_slopes["Dose2"]
}
out <- data.frame(Estimates=c(beta1_hat,beta2_hat,beta1_beta_2_hat),
Boot.SE=apply(boot_matrix,2,sd),
Boot.L=2*c(beta1_hat,beta2_hat,beta1_beta_2_hat)-apply(boot_matrix,2,quantile,probs=.975),
Boot.U=2*c(beta1_hat,beta2_hat,beta1_beta_2_hat)-apply(boot_matrix,2,quantile,probs=.025))
row.names(out) <- c("beta1","beta2","beta1-beta2")
out
## Estimates Boot.SE Boot.L Boot.U
## beta1 -2.485142 0.7058631 -3.9116682 -1.229388
## beta2 -4.060643 0.6265362 -5.2910791 -2.801721
## beta1-beta2 1.575501 0.7078828 0.1497114 2.837857
Problem 8
Briefly interpret your results. More specifically, check if zero falls in the bootstrap intervals and conclude if
we do or do not show statistical significance.
11