STAT 385 Homework Assignment 05
Due by 12:00 PM 11/16/2019
HW 5 Problems
Below you will find problems for you to complete as an individual. It is fine to discuss the homework problems with classmates, but
cheating is prohibited and will be harshly penalized if detected.
1. Using the ggplot function and tidyverse functionality, create the following
visualizations:
a. Recreate your improved visualization from problem 2c of HW04.
b. Add a new, visually appealing layer that helps clarify the plot from part a. Separately, include a short description beneath the
plot, such as “Fig. 1 shows…”
c. Recreate your improved visualization from problem 4c of HW04.
d. Add a new, visually appealing layer that helps clarify the plot from part c. Separately, include a short description beneath the
plot, such as “Fig. 2 shows…”
2. Successfully import the US Natality Data (for year 2015). The data are
available as a single tab-delimited file
(https://uofi.box.com/shared/static/iogdsmwxzmcqzd5jfgvdq81f44u2smc8.txt)
that is 1.56 GB in size. If your computer cannot handle a file of that size,
use the partitioned version of the data
(https://uofi.box.com/s/ksobwxvudewssucmq7fspg7trx41qce1), a folder of 20
comma-separated files that together contain the same US Natality Data.
Bonus (worth 10 additional points, i.e. your max HW 05 score could be 20 out of 10): do problem 2
using the parallel programming ideas (particularly mclapply) discussed in class. Do not use any
functions or packages other than those discussed in the notes on parallel programming.
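As an illustration of the idea behind the bonus, the following is a minimal sketch of reading a folder of partitioned comma-separated files with mclapply and stacking the pieces into one data frame. The folder name, file names, and columns are hypothetical stand-ins written to a temporary directory; they are not the actual Natality files. The sketch also assumes a Unix-like system, since mclapply cannot use more than one core on Windows.

```r
library(parallel)

# Hypothetical stand-in for the partitioned data: three tiny CSV parts
# written to a temporary folder (not the real Natality files).
part_dir <- file.path(tempdir(), "natality_parts")
dir.create(part_dir, showWarnings = FALSE)
for (i in 1:3) {
  part <- data.frame(id = (i - 1) * 2 + 1:2, part = i)
  write.csv(part, file.path(part_dir, sprintf("part_%02d.csv", i)),
            row.names = FALSE)
}

# List the parts in a stable order.
files <- sort(list.files(part_dir, pattern = "\\.csv$", full.names = TRUE))

# mclapply forks worker processes; fall back to one core on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Read each part in its own worker, then bind the rows together.
parts <- mclapply(files, read.csv, mc.cores = n_cores)
natality <- do.call(rbind, parts)

dim(natality)
```

For the single tab-delimited file, the analogous base R reader would be read.delim rather than read.csv.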
3. Using the ggplot function and tidyverse functionality, recreate or
reimagine the following visualizations using the appropriate data. Be sure to
use the visual design considerations from Knaflic’s Storytelling with Data.
a. The image below uses the US Natality Data. Also, explain the image with Markdown syntax.
b. The image below uses the US Natality Data. Also, explain the image with Markdown syntax (do not include the explanation within
the visualization).
c. The image below uses the Chicago Food Inspections Data
(https://uofi.box.com/shared/static/5637axblfhajotail80yw7j2s4r27hxd.csv). Also, explain the image with Markdown syntax (do not
include the explanation within the visualization).
d. The image below uses the Chicago Food Inspections Data. Also, explain the image with Markdown syntax (do not include the
explanation within the visualization).
4. Do the following:
Redo problem 3 in HW03 using the mclapply function. Does parallel computing perform the tasks in parts c and e faster than the
method that you used in HW03? Show your work including the runtimes for the unparallelized and parallelized versions.
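One way to report the runtimes is with system.time, comparing the "elapsed" (wall-clock) entries. The sketch below uses a hypothetical stand-in task, slow_square, in place of the HW03 computations, and assumes a Unix-like system with at least two cores available.

```r
library(parallel)

# Hypothetical stand-in task: pause briefly, then square the input.
slow_square <- function(i) {
  Sys.sleep(0.05)
  i^2
}

inputs <- 1:8
# mclapply cannot use more than one core on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Time the unparallelized version.
time_serial <- system.time(
  res_serial <- lapply(inputs, slow_square)
)

# Time the parallelized version.
time_parallel <- system.time(
  res_parallel <- mclapply(inputs, slow_square, mc.cores = n_cores)
)

# Compare wall-clock times; the results themselves should agree.
c(serial = time_serial[["elapsed"]], parallel = time_parallel[["elapsed"]])
```

Whether the parallel version is actually faster depends on the size of each task relative to the overhead of forking workers, so the comparison should be run on the real computations.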
5. Problem in parallel coding
a. Install the conformal.glm R package, which can be found at https://github.com/DEck13/conformal.glm.
Run the following code:
library(conformal.glm)
set.seed(13)
n <- 250
# generate predictors
x <- runif(n)
# set regression coefficient vector
beta <- c(3, 5)
# generate responses from a linear regression model
y <- rnorm(n, mean = cbind(1, x) %*% beta, sd = 3)
# store predictors and responses as a dataframe
dat <- data.frame(y = y, x = x)
# fit linear regression model
model <- lm(y ~ x, data = dat)
# obtain OLS estimator of beta
betahat <- model$coefficients
# convert predictors into a matrix
Xk <- as.matrix(x, nrow = n)
# extract internal model information, this is necessary for the assignment
call <- model$call
formula <- call$formula
family <- "gaussian"
link <- "identity"
newdata.formula <- as.matrix(model.frame(formula, as.data.frame(dat))[, -1])
# This function takes on a new (x,y) data point and reports a
# value corresponding to how similar this new data point is
# with the data that we generated, higher numbers are better.
# The goal is to use this function to get a range of new y
# values that agrees with our generated data at each x value in
# our generated data set.
density_score <- function(ynew, xnew){
  rank(phatxy(ynew = ynew, xnew = xnew, Yk = y, Xk = Xk, xnew.modmat = xnew,
    data = dat, formula = formula, family = family, link = link))[n+1]
}
# We try this out on the first x value in our generated data set.
# In order to do this we write two line searches
xnew <- x[1]
# start line searches at the predicted response value
# corresponding to xnew
ystart <- ylwr <- yupr <- as.numeric(c(1,xnew) %*% betahat)
score <- density_score(ynew = ystart, xnew = xnew)
# line search 1: line search that estimates the largest y
# value corresponding to the first x value that agrees with
# our generated data
while(score > 13){
  yupr <- yupr + 0.01
  score <- density_score(ynew = yupr, xnew = xnew)
}
# line search 2: line search that estimates the smallest y
# value corresponding to the first x value that agrees with
# our generated data
score <- density_score(ynew = ystart, xnew = xnew)
while(score > 13){
  ylwr <- ylwr - 0.01
  score <- density_score(ynew = ylwr, xnew = xnew)
}
b. Write a function which runs the two line searches in part a for the jth generated predictor value.
c. Use mclapply to run the function you wrote in part b, setting the mc.cores argument to the output of a call to the
detectCores() function. Save the output and record the time that it took to perform these calculations. Note that setting the
number of cores in mclapply to the number of available cores could grind your computer to a halt if you are using your
computer for other tasks.
d. Redo the calculation in part c using lapply and record the time it took to run this job. Which method was faster?
e. Using ggplot, plot the original data and depict lines of the lower and upper boundaries that you computed in part c.
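The general pattern in parts b and c (wrap the two line searches in a function of the index j, then map that function over all indices) can be sketched as follows. This is an illustration only: toy_score is a hypothetical stand-in for density_score that does not require conformal.glm, the fitted line 3 + 5x is assumed known rather than estimated, and only 20 predictor values are used. It assumes a Unix-like system for mclapply.

```r
library(parallel)

set.seed(13)
x <- runif(20)

# Hypothetical stand-in for density_score(): the score is highest on the
# line 3 + 5 * x and falls off linearly with distance from it.
toy_score <- function(ynew, xnew) 20 - 10 * abs(ynew - (3 + 5 * xnew))

# Run both line searches for the jth predictor value.
line_search <- function(j, step = 0.01, cutoff = 13) {
  xnew <- x[j]
  ystart <- 3 + 5 * xnew  # start at the (assumed) predicted response

  # Move in one direction until the score no longer exceeds the cutoff.
  search <- function(direction) {
    ycur <- ystart
    while (toy_score(ycur, xnew) > cutoff) {
      ycur <- ycur + direction * step
    }
    ycur
  }

  c(x = xnew, lwr = search(-1), upr = search(+1))
}

# mclapply cannot use more than one core on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One row per predictor value, with columns x, lwr, and upr.
bounds <- do.call(rbind, mclapply(seq_along(x), line_search, mc.cores = n_cores))
head(bounds)
```

Returning a named vector from each call makes it straightforward to bind the results into a matrix or data frame, which is a convenient shape for plotting the lower and upper boundaries against x.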