FINA6610 Assignment 2
Ben Lim
Due: May 27 2024, 5:00pm
In this and future assignments, you may be asked to use additional R packages or functions. Recall that,
for any function f whose usage you are unsure of, you may type ?f in the R console to open the
help file for that function, or ??f to search the help system.
Please go through the coding labs for the relevant chapters of ISLR2 before attempting the coding
questions. Rmd files for the labs can be found in the relevant course content folders.
In this assignment, where hypothesis tests are required, please use α = 0.05 unless stated otherwise.
1. In this problem, we will consider the Boston dataset in the ISLR2 library. The data description can
be obtained by typing ?Boston.
(a) Delete the observations with medv = 50. After this step, use best subset selection with the BIC
criterion and nvmax = 13 to select the best performing subset of predictors for the prediction of
the medv variable. Justify your answer by showing the plot of BIC values.
(b) Similarly, use forward and backward selection to select the best performing subset of variables.
Show the plot of BIC values for each of the two methods.
(c) Fit a lasso model to the data, using cross-validation to select the tuning parameter. Which coefficients
does the model select? Do the conclusions differ between these methods?
(d) Split the data into a validation set consisting of the first 100 data points and the remainder as
the training set. Using best subset selection, select the best performing set of predictors for the
prediction of medv using the validation error criterion.
(e) Similarly, use forward and backward selection to select the best performing subset of variables.
Do the conclusions differ between these methods?
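A minimal sketch of the BIC-based selection in part (a), assuming the ISLR2 and leaps packages are installed and following the regsubsets() workflow from the ISLR2 Chapter 6 lab. The object names (boston, fit_best, best_size) are illustrative only, and nvmax is capped at the number of predictors actually present in the data frame.

```r
library(ISLR2)  # assumed installed; provides the Boston data
library(leaps)  # assumed installed; provides regsubsets()

# Drop the observations with medv equal to 50.
boston <- subset(Boston, medv != 50)

# Best subset selection; nvmax = 13 as in the question, capped at the
# number of predictors available in this copy of the data.
n_pred <- ncol(boston) - 1
fit_best <- regsubsets(medv ~ ., data = boston, nvmax = min(13, n_pred))
fit_sum <- summary(fit_best)

# The model size with the smallest BIC is the one selected.
best_size <- which.min(fit_sum$bic)

# Plot BIC against model size and mark the minimum.
plot(fit_sum$bic, type = "b", xlab = "Number of predictors", ylab = "BIC")
points(best_size, fit_sum$bic[best_size], col = "red", pch = 19)

# Coefficients (intercept included) of the selected model.
coef(fit_best, best_size)
```

Passing method = "forward" or method = "backward" to regsubsets() gives the stepwise variants asked for in part (b); the rest of the sketch is unchanged.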
2. (Based on ISLR2 Chapter 6, Question 9) In this exercise, we will predict the number of applications
received using the other variables in the College data set.
(a) Split the data set into a training set consisting of the first 700 observations, and the remainder as
the test set.
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the
test error obtained.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error
obtained, along with the number of non-zero coefficient estimates.
(e) Comment on the results obtained. How accurately can we predict the number of college applications
received? Is there much difference among the test errors resulting from these three
approaches?
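A minimal sketch of the lasso step in part (d), assuming the ISLR2 and glmnet packages are installed. The seed and the object names (x_train, cv_lasso, test_mse, n_nonzero) are illustrative choices, and the use of lambda.min rather than lambda.1se is an assumption, not a requirement of the question.

```r
library(ISLR2)   # assumed installed; provides the College data
library(glmnet)  # assumed installed; provides cv.glmnet()

set.seed(1)  # cross-validation folds are random; the seed is arbitrary

# First 700 observations for training, the remainder for testing.
train <- College[1:700, ]
test  <- College[-(1:700), ]

# glmnet needs a numeric design matrix; drop the intercept column.
x_train <- model.matrix(Apps ~ ., data = train)[, -1]
x_test  <- model.matrix(Apps ~ ., data = test)[, -1]

# alpha = 1 gives the lasso; alpha = 0 would give ridge regression.
cv_lasso <- cv.glmnet(x_train, train$Apps, alpha = 1)

# Test mean squared error at the cross-validated lambda.
pred <- predict(cv_lasso, newx = x_test, s = "lambda.min")
test_mse <- mean((test$Apps - pred)^2)

# Number of non-zero coefficient estimates, excluding the intercept.
beta <- as.matrix(coef(cv_lasso, s = "lambda.min"))[-1, 1]
n_nonzero <- sum(beta != 0)
```

The least squares fit in part (b) can be compared on the same split with lm(Apps ~ ., data = train) and predict() on the test set.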
3. Demand for Term Life Insurance. We continue our study of term life insurance demand, found in
TermLife.csv. Specifically, we examine the 2004 Survey of Consumer Finances (SCF), a nationally
representative sample that contains extensive information on assets, liabilities, income, and demographic
characteristics of those sampled (potential U.S. customers). We now return to the original
sample of n = 500 families with positive incomes to study whether a family purchases term life insurance.
From our sample, it turns out that 225 did not (FACEPOS = 0), whereas 275 did purchase term
life insurance (FACEPOS = 1).
(a) Provide a table of means of explanatory variables by level of the dependent variable FACEPOS.
Interpret what we learn from this table.
(b) Fit a logistic regression model using FACEPOS as the dependent variable and LNINCOME,
EDUCATION, AGE, and GENDER as continuous explanatory variables, together with the factor
MARSTAT.
(c) For this model, identify which variables appear to be statistically significant. In your identification,
describe the basis for your conclusions.
(d) For this model, which measure summarizes the goodness of fit?
(e) Define MARSTAT1 to be a binary variable that indicates MARSTAT = 1. Fit a second logistic
regression model using LNINCOME, EDUCATION, and MARSTAT1.
(f) Compare the two models using a likelihood ratio test. State your null and alternative hypotheses,
decision-making criterion, and decision-making rule.
(g) Using this second model, who is more likely to purchase term life insurance, married or nonmar-
ried? Provide an interpretation in terms of the odds of purchasing term life insurance for the
variable MARSTAT1.
(h) Consider a married male who is age 54. Assume that this person has 13 years of education, annual
wages of $70,000, and lives in a household composed of four people. For this second model, what
is the estimate of the probability of purchasing term life insurance?
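A minimal sketch of the modelling steps in parts (b) and (e) to (h), using base R only. Because TermLife.csv is not reproduced here, the data frame sim is simulated stand-in data: its values, the 0/1/2 coding of MARSTAT, and the use of log(70000) as LNINCOME for the person in part (h) are all assumptions made purely for illustration, so the numbers it produces say nothing about the real survey.

```r
set.seed(6610)  # arbitrary seed for the simulated stand-in data
n <- 500

# Hypothetical stand-in for TermLife.csv (NOT the real data).
sim <- data.frame(
  LNINCOME  = rnorm(n, mean = 11, sd = 1),
  EDUCATION = sample(8:17, n, replace = TRUE),
  AGE       = sample(25:75, n, replace = TRUE),
  GENDER    = rbinom(n, 1, 0.8),
  MARSTAT   = factor(sample(0:2, n, replace = TRUE))  # assumed coding
)
eta <- -8 + 0.5 * sim$LNINCOME + 0.15 * sim$EDUCATION +
  0.6 * (sim$MARSTAT == 1)
sim$FACEPOS  <- rbinom(n, 1, plogis(eta))
sim$MARSTAT1 <- as.integer(sim$MARSTAT == 1)  # indicator of MARSTAT = 1

# First model: four continuous variables plus the factor MARSTAT.
full <- glm(FACEPOS ~ LNINCOME + EDUCATION + AGE + GENDER + MARSTAT,
            family = binomial, data = sim)

# Second model: nested in the first one.
reduced <- glm(FACEPOS ~ LNINCOME + EDUCATION + MARSTAT1,
               family = binomial, data = sim)

# Likelihood ratio test: the drop in deviance is compared with a
# chi-square distribution on the difference in parameter counts.
lrt <- anova(reduced, full, test = "Chisq")

# Odds ratio for MARSTAT1 and a fitted probability for one person.
odds_ratio <- exp(coef(reduced)["MARSTAT1"])
new_person <- data.frame(LNINCOME = log(70000), EDUCATION = 13, MARSTAT1 = 1)
p_hat <- predict(reduced, newdata = new_person, type = "response")
```

With the real data, read.csv("TermLife.csv") would replace the simulated data frame, and summary(full) would report the coefficient tests relevant to part (c).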
4. Medical Expenditures Data. This exercise considers data from the Medical Expenditure Panel Survey
(MEPS) described in Exercise 1.1 and Section 11.4. The data is found in HealthExpend.csv. Our
dependent variable consists of the number of outpatient (COUNTOP) visits. For MEPS, outpatient
events include hospital outpatient department visits, office-based provider visits, and emergency room
visits excluding dental services. (Dental services, compared to other types of health care services, are
more predictable and occur on a more regular basis.) Hospital stays with the same date of admission
and discharge, known as zero-night stays, were also included in outpatient counts and expenditures.
(Payments associated with emergency room visits that immediately preceded an inpatient stay were
included in the inpatient expenditures. Prescribed medicines that can be linked to hospital admissions
were included in inpatient expenditures, not outpatient utilization.) Consider the explanatory variables
described in Section 11.4.
(a) Provide a table of counts, a histogram, and summary statistics of COUNTOP. What is the shape
of the distribution of COUNTOP? Is the sample variance larger, approximately equal to, or
smaller than the sample mean?
(b) The aggregate function can be used to calculate the mean by level of a categorical variable, with
one example as follows:
aggregate(df$col_to_aggregate, list(df$col_to_group_by), FUN=mean)
Here df is the data frame, col_to_aggregate is the column to be aggregated, and col_to_group_by
is the column to be grouped by. Create tables of means of COUNTOP by level of GENDER, ethnicity,
region, education, self-rated physical health, self-rated mental health, activity limitation,
income, and insurance. Do the tables suggest that the explanatory variables have an impact on
COUNTOP?
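As a small illustration of the aggregate call in part (b), the sketch below uses a toy data frame rather than HealthExpend.csv. The six rows and the 0/1 coding of GENDER are made up for the example.

```r
# Toy data (made up); the real analysis would use HealthExpend.csv.
toy <- data.frame(
  COUNTOP = c(0, 2, 5, 1, 0, 7),
  GENDER  = c(0, 1, 1, 0, 0, 1)
)

# Same form as the call in the question; naming the list element
# labels the grouping column in the output.
means_by_gender <- aggregate(toy$COUNTOP, list(GENDER = toy$GENDER), FUN = mean)

# Equivalent formula interface, which keeps the original column names.
means_formula <- aggregate(COUNTOP ~ GENDER, data = toy, FUN = mean)

print(means_by_gender)
print(means_formula)
```

The formula interface is often easier to read when the same table is repeated for several grouping variables.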
(c) As a baseline, estimate a Poisson model without any explanatory variables and calculate Pearson's
chi-square statistic for goodness of fit at the individual level. Recall that this is given by
\[ \chi^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}, \]
and has n - (p + 1) degrees of freedom, where p is the number of explanatory variables used (k = 0 if