Assignment 2
MAST90083 Computational Statistics and Data Mining
Due time: 5PM, Friday October 25th
You must submit your report via LMS
Data analysis
The data set chicago, in the package gamair, contains data on the relationship between
air pollution and the death rate in Chicago from 1 January 1987 to 31 December 2000.
The seven variables are: the total number of (non-accidental) deaths each day (death); the
median density over the city of large pollutant particles (pm10median); the median density
of smaller pollutant particles (pm25median); the median concentration of ozone (O3) in the
air (o3median); the median concentration of sulfur dioxide (SO2) in the air (so2median);
the time in days (time); and the daily mean temperature (tmpd).
We will model how the death rate changes with pollution and temperature. Epidemiologists
tell us that risk factors usually multiply rather than add, so we will fit additive models
to the logarithm of the number of deaths. For fitting additive models, please use the
mgcv package.
For this exercise, you may find it helpful to refer to Chapters 7 and 8 of Shalizi.
1. Load the data set and run summary on it.
(a) Is temperature given in degrees Fahrenheit or degrees Celsius?
(b) The pollution variables are negative at least half the time. What might this mean?
(c) We will ignore the pm25median variable in the rest of this problem set. Why is
this reasonable?
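A minimal sketch of this first step, assuming the gamair package is installed; the two lines after summary() are optional checks that bear on parts (b) and (c):

```r
# Load the chicago data set from the gamair package and summarise it
library(gamair)
data(chicago)
summary(chicago)

# Fraction of days on which a pollution variable is negative (part (b))
mean(chicago$pm10median < 0, na.rm = TRUE)

# Proportion of missing values in pm25median (part (c))
mean(is.na(chicago$pm25median))
```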
2. Fit a smoothing spline of log(death) on time. (You can use either smooth.spline
or gam.)
(a) Plot the smoothing spline along with the actual values.
(b) There should be four large outliers, right next to each other in time. When are
they?
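Either route mentioned in the question works; a sketch of both, with illustrative object names:

```r
library(gamair)
library(mgcv)
data(chicago)

# Route 1: smooth.spline, smoothing parameter chosen by cross-validation
fit.ss <- smooth.spline(chicago$time, log(chicago$death))

# Route 2: a one-term gam with a spline of time
fit.gam <- gam(log(death) ~ s(time), data = chicago)

# Part (a): the data with the smoothing spline overlaid
plot(chicago$time, log(chicago$death), pch = ".",
     xlab = "time (days)", ylab = "log(death)")
lines(fit.ss, col = "red", lwd = 2)

# Part (b): the largest residuals point at the outlying days
resid.ss <- log(chicago$death) - predict(fit.ss, chicago$time)$y
head(order(abs(resid.ss), decreasing = TRUE))
```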
3. Use gam to fit an additive model for log(death) on pm10median, o3median, so2median,
tmpd and time. Use spline smoothing for each of these predictor variables. Hint:
Because of some missing-data issues, some plots later may be easier to make if you set
the na.action=na.exclude option when estimating the model.
(a) Plot the partial response functions, with partial residuals. Describe the partial
response functions in words.
(b) Plot the fitted values as a function of time, along with the actual values of
log(death).
(c) Are the outliers still there? Are they any better?
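A sketch of the fit and the two plots, assuming the data are loaded as above; na.action = na.exclude keeps fitted values aligned with the original rows, as the hint suggests:

```r
library(gamair)
library(mgcv)
data(chicago)

# Additive model with a spline smooth of every predictor
fit.am <- gam(log(death) ~ s(pm10median) + s(o3median) + s(so2median) +
                s(tmpd) + s(time),
              data = chicago, na.action = na.exclude)

# Part (a): partial response functions with partial residuals
plot(fit.am, pages = 1, residuals = TRUE)

# Part (b): fitted values over time against the actual series
plot(chicago$time, log(chicago$death), pch = ".",
     xlab = "time (days)", ylab = "log(death)")
lines(chicago$time, fitted(fit.am), col = "blue")
```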
4. Using the last model you fit, we will consider the predicted impact of a 2°C
increase in temperature on log(death), taking the last full year of the data as a
baseline. (2°C is in the middle of the range of current projections for the global
average effect of climate change by the end of this century.)
(a) Prepare a data frame containing only the last full year of the data. What is the
average predicted value of log(deaths)?
(b) Modify this data frame to increase all temperatures by 2°C.
(c) Find the new average change in the predicted values of log(deaths) associated
with a 2°C warming.
(d) Find a standard error for this average predicted change, using the standard errors
for the prediction on each day, and assuming no correlation among them. Include
an explanation of why your calculation is correct. Also give the corresponding
Gaussian 95% confidence interval. Hint 1: The se.fit option to predict. Hint
2: Appendix C to Shalizi on propagation of error.
(e) Find a standard error for the predicted change in the number of deaths (not the
change in log(death)) and the corresponding 95% Gaussian confidence interval.
Hint: Propagation of error again.
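A sketch of parts (a)-(d), assuming the fitted model fit.am from question 3. The subset below is only illustrative (the exact rows for the last full year depend on how time is coded), and the warming step assumes question 1(a) found the temperature in Fahrenheit, so that 2°C corresponds to 3.6°F:

```r
# Last full year of the data (illustrative selection: final 365 days)
last.year <- subset(chicago, time > max(time, na.rm = TRUE) - 365)

# Part (a): average predicted log(death) over the baseline year
pred.base <- predict(fit.am, newdata = last.year, se.fit = TRUE)
mean(pred.base$fit, na.rm = TRUE)

# Part (b): warm every day by 2 degrees C = 3.6 degrees F
warmed <- last.year
warmed$tmpd <- warmed$tmpd + 3.6

# Part (c): average change in predicted log(death)
pred.warm <- predict(fit.am, newdata = warmed, se.fit = TRUE)
delta <- mean(pred.warm$fit - pred.base$fit, na.rm = TRUE)

# Part (d): treating the daily predictions as uncorrelated, the variance
# of the mean change is the sum of per-day variances divided by n^2.
# (One simple, conservative treatment adds the baseline and warmed
# variances for each day.)
n <- sum(!is.na(pred.warm$fit))
se.delta <- sqrt(sum(pred.warm$se.fit^2 + pred.base$se.fit^2,
                     na.rm = TRUE)) / n
c(estimate = delta,
  lower = delta - 1.96 * se.delta,
  upper = delta + 1.96 * se.delta)
```

For part (e), propagation of error applies the delta method to the map from log(death) to death, i.e. multiply the standard error on the log scale by the derivative exp(·) evaluated at the prediction.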
EM algorithm
For the following exercises, you may find it helpful to refer to Chapter 17 in Shalizi.
5. Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K = 3.
How well does your code assign data-points to clusters if you give it the actual Gaussian
parameters as your initial guess? If you give it other initial parameters?
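A minimal EM sketch for a univariate mixture of K Gaussians; the function and variable names are illustrative, not required by the question:

```r
em.gauss <- function(x, K, mu, sigma, pi, iters = 200) {
  n <- length(x)
  for (it in 1:iters) {
    # E step: responsibilities r[i, k] of component k for point i
    r <- sapply(1:K, function(k) pi[k] * dnorm(x, mu[k], sigma[k]))
    r <- r / rowSums(r)
    # M step: weighted updates of mixing weights, means, and sds
    nk <- colSums(r)
    pi <- nk / n
    mu <- colSums(r * x) / nk
    sigma <- sqrt(colSums(r * (x - rep(mu, each = n))^2) / nk)
  }
  list(mu = mu, sigma = sigma, pi = pi, cluster = max.col(r))
}

# Simulate from K = 3 well-separated components and check recovery
# when the true parameters are supplied as the initial guess
set.seed(1)
x <- c(rnorm(100, -4), rnorm(100, 0), rnorm(100, 4))
fit <- em.gauss(x, K = 3, mu = c(-4, 0, 4), sigma = rep(1, 3),
                pi = rep(1/3, 3))
table(fit$cluster, rep(1:3, each = 100))
```

Rerunning with deliberately bad initial parameters (e.g. all means equal) shows how sensitive the assignments are to the starting point.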
6. Read Section 17.4 in Shalizi for the analysis of the Snoqualmie Falls data with a
Gaussian mixture. As it turns out, the Gaussian mixture is rather unsatisfactory.
Write a function to fit a mixture of exponential distributions using the EM algorithm.
Does it do any better than a Gaussian mixture at discovering sensible structure in the
Snoqualmie Falls data? You can read the dataset into R with the command
snoqualmie <- scan("http://www.stat.washington.edu/peter/book.data/set1",skip=1)
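A sketch of the exponential-mixture version, with illustrative names. The E step mirrors the Gaussian code; the M step updates each rate as a weighted maximum-likelihood estimate. Since the exponential has support on the positive reals, one would fit it to the positive (wet-day) precipitation values only:

```r
em.exp <- function(x, K = 2, rate = c(0.1, 1), pi = rep(1/K, K),
                   iters = 200) {
  n <- length(x)
  for (it in 1:iters) {
    # E step: responsibilities under exponential densities
    r <- sapply(1:K, function(k) pi[k] * dexp(x, rate[k]))
    r <- r / rowSums(r)
    # M step: weighted MLE, rate = weighted count / weighted sum
    nk <- colSums(r)
    pi <- nk / n
    rate <- nk / colSums(r * x)
  }
  list(rate = rate, pi = pi, cluster = max.col(r))
}

# Fit to the positive precipitation amounts only
snoq <- snoqualmie[snoqualmie > 0]
fit.exp <- em.exp(snoq)
fit.exp$rate
fit.exp$pi
```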