FINA6610 Assignment 2

Ben Lim

Due: May 27 2024, 5:00pm

In this and future assignments, you may be asked to use additional R packages or functions. Recall that,

for any function f whose usage you are unsure of, you may type ?f in the R console to obtain the

help file for that function, or ??f to search the help system.

Please go through the coding labs for the relevant chapters of ISLR2 before attempting the coding

questions. Rmd files for the labs can be found in the relevant course content folders.

In this assignment, where hypothesis tests are required, please use α = 0.05 unless stated otherwise.

1. In this problem, we will consider the Boston dataset in the ISLR2 library. The data description can

be obtained by typing ?Boston.

(a) Delete the observations with medv = 50. After this step, use best subset selection with the BIC

criterion and nvmax = 13 to select the best performing subset of predictors for the prediction of

the medv variable. Justify your answer by showing the plot of BIC values.

(b) Similarly, use forward and backward selection to select the best performing subset of variables.

Show the plot of BIC values for each of the two methods.

(c) Fit a lasso model to the data, using cross-validation to select the tuning parameter λ. Which coefficients

does the model select? Do the conclusions differ between these methods?

(d) Split the data into a validation set consisting of the first 100 data points and the remainder as

the training set. Using best subset selection, select the best performing set of predictors for the

prediction of medv using the validation error criterion.

(e) Similarly, use forward and backward selection to select the best performing subset of variables.

Do the conclusions differ between these methods?
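
The validation-set idea in parts (d) and (e) can be illustrated with a minimal base R sketch. Everything below is an assumption made for illustration: the data are simulated, the variable names (x1, x2, x3, y) are hypothetical, and a short hand-written list of candidate models stands in for a full subset search. It only shows the mechanics of fitting on the training rows and scoring on the first 100 rows.

```r
# Simulated data standing in for a real data set (hypothetical names).
set.seed(1)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 + 1.5 * dat$x1 - 1.0 * dat$x2 + rnorm(n)  # x3 is pure noise

# The first 100 rows form the validation set; the remainder is the training set.
val_idx <- 1:100
train <- dat[-val_idx, ]
valid <- dat[val_idx, ]

# Candidate models: a small illustrative list, not an exhaustive subset search.
candidates <- list(
  m1 = y ~ x1,
  m2 = y ~ x1 + x2,
  m3 = y ~ x1 + x2 + x3
)

# Fit each candidate on the training rows and compute its validation MSE.
val_mse <- sapply(candidates, function(f) {
  fit <- lm(f, data = train)
  mean((valid$y - predict(fit, newdata = valid))^2)
})

print(val_mse)

# The candidate with the smallest validation error is selected.
best <- names(which.min(val_mse))
print(best)
```

The same loop structure applies when the candidates come from a subset-selection routine: only the list of formulas (or coefficient sets) changes, while the training and validation rows stay fixed.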

2. (Based on ISLR2 Chapter 6, Question 9) In this exercise, we will predict the number of applications

received using the other variables in the College data set.

(a) Split the data set into a training set consisting of the first 700 observations, and the remainder as

the test set.

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the

test error obtained.

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error

obtained, along with the number of non-zero coefficient estimates.

(e) Comment on the results obtained. How accurately can we predict the number of college applications

received? Is there much difference among the test errors resulting from these three

approaches?

3. Demand for Term Life Insurance. We continue our study of term life insurance demand, found in

TermLife.csv. Specifically, we examine the 2004 Survey of Consumer Finances (SCF), a nationally

representative sample that contains extensive information on assets, liabilities, income, and demographic

characteristics of those sampled (potential U.S. customers). We now return to the original

sample of n = 500 families with positive incomes to study whether a family purchases term life insurance.

From our sample, it turns out that 225 did not (FACEPOS = 0), whereas 275 did purchase term

life insurance (FACEPOS = 1).


(a) Provide a table of means of explanatory variables by level of the dependent variable FACEPOS.

Interpret what we learn from this table.

(b) Fit a logistic regression model using FACEPOS as the dependent variable and LNINCOME,

EDUCATION, AGE, and GENDER as continuous explanatory variables, together with the factor

MARSTAT.

(c) For this model, identify which variables appear to be statistically significant. In your identification,

describe the basis for your conclusions.

(d) For this model, which measure summarizes the goodness of fit?

(e) Define MARSTAT1 to be a binary variable that indicates MARSTAT = 1. Fit a second logistic

regression model using LINCOME, EDUCATION, and MARSTAT1.

(f) Compare the two models using a likelihood ratio test. State your null and alternative hypotheses,

decision-making criterion, and decision-making rule.

(g) Using this second model, who is more likely to purchase term life insurance, married or nonmarried?

Provide an interpretation in terms of the odds of purchasing term life insurance for the

variable MARSTAT1.

(h) Consider a married male who is age 54. Assume that this person has 13 years of education, annual

wages of $70,000, and lives in a household composed of four people. For this second model, what

is the estimate of the probability of purchasing term life insurance?
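
The mechanics behind parts (f) to (h) can be sketched in base R on simulated data. This is an illustration under stated assumptions, not a solution: the data frame fam, its columns (lincome, education, married, purchase), and the coefficients used to simulate it are all hypothetical, and the income variable is assumed to be the natural log of annual income.

```r
# Simulated families (hypothetical names and coefficients, for illustration only).
set.seed(42)
n <- 500
fam <- data.frame(
  lincome   = rnorm(n, mean = 11, sd = 1),       # assumed: natural log of income
  education = sample(8:17, n, replace = TRUE),   # years of education
  married   = rbinom(n, 1, 0.6)                  # 1 = married, 0 = not married
)
eta <- -12 + 0.8 * fam$lincome + 0.15 * fam$education + 0.7 * fam$married
fam$purchase <- rbinom(n, 1, plogis(eta))

# Two nested logistic regression models.
full    <- glm(purchase ~ lincome + education + married,
               family = binomial, data = fam)
reduced <- glm(purchase ~ lincome + education,
               family = binomial, data = fam)

# Likelihood ratio test. H0: the extra coefficient(s) in the full model are zero.
# The test statistic is the drop in deviance, compared with a chi-square
# distribution whose degrees of freedom equal the number of dropped parameters.
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# Odds interpretation: exp(coefficient) is the multiplicative change in the odds
# of purchase for married versus non-married families, other variables held fixed.
odds_ratio <- exp(coef(full)["married"])
print(odds_ratio)

# Estimated purchase probability for one hypothetical profile.
new_fam <- data.frame(lincome = log(70000), education = 13, married = 1)
p_hat <- predict(full, newdata = new_fam, type = "response")
print(p_hat)
```

Only the variables that appear in the fitted model enter the prediction, so any other characteristics of the profile play no role in the estimated probability.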

4. Medical Expenditures Data. This exercise considers data from the Medical Expenditure Panel Survey

(MEPS) described in Exercise 1.1 and Section 11.4. The data is found in HealthExpend.csv. Our

dependent variable consists of the number of outpatient (COUNTOP) visits. For MEPS, outpatient

events include hospital outpatient department visits, office-based provider visits, and emergency room

visits excluding dental services. (Dental services, compared to other types of health care services, are

more predictable and occur on a more regular basis.) Hospital stays with the same date of admission

and discharge, known as zero-night stays, were also included in outpatient counts and expenditures.

(Payments associated with emergency room visits that immediately preceded an inpatient stay were

included in the inpatient expenditures. Prescribed medicines that can be linked to hospital admissions

were included in inpatient expenditures, not outpatient utilization.) Consider the explanatory variables

described in Section 11.4.

(a) Provide a table of counts, a histogram, and summary statistics of COUNTOP. What is the shape

of the distribution of COUNTOP? Is the sample variance larger, approximately equal to, or

smaller than the sample mean?

(b) The aggregate function can be used to calculate the mean by level of a categorical variable, with

one example as follows:

aggregate(df$col_to_aggregate, list(df$col_to_group_by), FUN=mean)

Here df is the data frame, col_to_aggregate is the column to be aggregated, and col_to_group_by

is the column to be grouped by. Create tables of means of COUNTOP by level of GENDER, ethnicity,

region, education, self-rated physical health, self-rated mental health, activity limitation,

income, and insurance. Do the tables suggest that the explanatory variables have an impact on

COUNTOP?

(c) As a baseline, estimate a Poisson model without any explanatory variables and calculate a Pearson’s

chi-square statistic for goodness of fit at the individual level. Recall that this is given

by

$$\sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i},$$

and has n − (p + 1) degrees of freedom, where p is the number of explanatory variables used (p = 0 if

there is only an intercept).


(d) Estimate a Poisson model using the explanatory variables in part (b). As a sequential ANOVA

is used later on, please ensure model terms are added in the following order: gender, ethnicity,

region, education, self-rated physical health, self-rated mental health, activity limitation, income,

and insurance.

i. Print the summary of the model. Also, the following code is useful to display the significance

of categorical variables in a glm object:

anova(glmobject, test = "LRT")

Comment briefly on the significance of each individual variable.

ii. Provide an interpretation for the GENDER coefficient.

iii. Calculate an (individual-level) Pearson’s chi-square statistic for goodness of fit. Compare

this to the one in part (c). On the basis of this statistic and the statistical significance of

coefficients discussed in part d(i), which model do you prefer?

iv. Reestimate the model using the quasi-likelihood estimator of the dispersion parameter, and

print the summary. How have your comments in part d(i) changed?

(e) Estimate a negative binomial model using the explanatory variables in part (d).

i. Print the summary and comment briefly on the statistical significance of each variable.

ii. Calculate an (individual-level) Pearson’s chi-square statistic for goodness of fit. Compare this

to the ones in parts (c) and (d). Which model do you prefer? Also cite the AIC statistic in

your comparison.

iii. Reestimate the model, dropping the factor income, and print the summary. Use the likelihood

ratio test to say whether income is a statistically significant factor.

(f) As a robustness check, create a new indicator variable for COUNTOP > 0, and estimate a logistic

regression model using the explanatory variables in part (d). Do the signs and significance of the

coefficients of this model fit give the same interpretation as with the negative binomial model in

part (e)?
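
The individual-level Pearson statistic and the quasi-likelihood dispersion estimate used in this question can be sketched in base R. The example below rests on assumptions: the counts are simulated (deliberately overdispersed), the data frame visits and its columns female and count are hypothetical, and pearson_chisq is a helper written for this illustration rather than a function from any package.

```r
# Simulated overdispersed counts (hypothetical data, for illustration only).
set.seed(7)
n <- 400
visits <- data.frame(female = rbinom(n, 1, 0.5))
visits$count <- rnbinom(n, mu = exp(1 + 0.4 * visits$female), size = 1)

# Sample mean versus sample variance: a variance well above the mean
# points to overdispersion relative to the Poisson assumption.
print(c(mean = mean(visits$count), variance = var(visits$count)))

# Illustrative helper: sum over observations of (y_i - mu_hat_i)^2 / mu_hat_i.
pearson_chisq <- function(fit) {
  mu_hat <- fitted(fit)
  sum((fit$y - mu_hat)^2 / mu_hat)
}

# Intercept-only baseline and a model with one explanatory variable.
fit0 <- glm(count ~ 1,      family = poisson, data = visits)
fit1 <- glm(count ~ female, family = poisson, data = visits)

chi0 <- pearson_chisq(fit0)
chi1 <- pearson_chisq(fit1)

# The same statistic can be obtained from the Pearson residuals.
chi1_resid <- sum(residuals(fit1, type = "pearson")^2)

# Quasi-Poisson fit: same coefficients, but standard errors are scaled by the
# estimated dispersion, which is the Pearson statistic over the residual df.
qfit1 <- glm(count ~ female, family = quasipoisson, data = visits)
phi_hat <- summary(qfit1)$dispersion

print(c(chi0 = chi0, chi1 = chi1,
        phi_hat = phi_hat, by_hand = chi1 / df.residual(fit1)))

# With a log link, exp(coefficient) is the multiplicative change in the
# expected count between the two levels of the indicator.
print(exp(coef(fit1)["female"]))
```

Because the quasi-Poisson fit leaves the point estimates unchanged and only rescales the standard errors, a dispersion estimate far above 1 can turn apparently significant Poisson coefficients into insignificant ones.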
