MAST90044 Thinking and Reasoning with Data
Semester 1 2021
Assignment 3
Due: 17:00, Tue 25 May

Student name:______________________________

Student number:____________________________

• Please label your assignment with the following information in the appropriate spots
at the top of this document:
o your name
o your student number
• This assignment is worth 20% of the marks in this subject, and covers the work done
in weeks 8 to 10.
• The total number of marks for this assignment is 61.
• Your assignment should show all working and reasoning, as marks will be given for
method as well as for correct answers. Please spellcheck your document.
• Paste any R code and output into the appropriate places so that it can be seen easily
along with your other work. Graphics from R can be resized within your document;
make them smaller as necessary.
• Tutors will not help you directly with assignment questions. However, they may give
some help with R.
• Please note that we may only mark a subset of questions.
• Assignments are to be saved as a pdf and submitted (uploaded) via GradeScope.
• Each question is followed by an empty box for the answer. Please answer each
question in the dedicated box. If you need more space for a question you can use the
empty pages at the end of this document BUT clearly state that the answer to the
question continues on the additional pages. Please DO NOT resize/move the boxes,
or add additional pages (except for right at the end). The document needs to be the
same format as it is currently for ease of marking.

Question 1 [3+4+1+3+5+2+3+3]

Fox, J. and Weisberg, S. (2019) collected academic salary data for Assistant Professors, Associate
Professors and Professors at a college in the U.S. during 2008–2009. The data were collected
as part of the ongoing effort of the college’s administration to monitor salary differences
between male and female faculty members.
The data frame includes 397 observations and 6 variables. The variables are:
- rank: a factor with levels AssocProf, AsstProf, Prof
- discipline: a factor with levels A (“theoretical” departments) or B (“applied”
departments).
- yrs.since.phd: years since PhD.
- yrs.service: years of service.
- sex: a factor with levels Female and Male
- salary: nine-month salary, in dollars.
The entire data set is in Salaries.csv on Canvas.

(a) Construct an appropriate graph of these data. Comment on the apparent strength or
otherwise of the relationships between the response variable and each of the
explanatory variables. Also comment on the strength or otherwise of the
relationships between the explanatory variables.

(b) Fit a multiple regression that includes all the explanatory variables (Model I). State
the fitted equation and the percentage of variability in the salary variable explained
by the model. Comment on the overall suitability of the linear model.
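
As a rough illustration of the mechanics only, the sketch below fits a multiple regression with lm() and extracts the coefficients and R-squared. The data frame toy is simulated and is an assumption made for this example; it does not contain the values in Salaries.csv, and only two explanatory variables are used.

```r
# Simulated stand-in data (hypothetical): NOT the real Salaries.csv values.
set.seed(1)
n <- 60
toy <- data.frame(
  rank        = factor(sample(c("AsstProf", "AssocProf", "Prof"), n, replace = TRUE)),
  yrs.service = round(runif(n, 0, 40))
)
toy$salary <- 80000 + 30000 * (toy$rank == "Prof") + 200 * toy$yrs.service +
  rnorm(n, sd = 10000)

# Multiple regression of salary on a factor and a numeric variable
fit <- lm(salary ~ rank + yrs.service, data = toy)

coef(fit)                      # estimates that make up the fitted equation
r2 <- summary(fit)$r.squared   # proportion of variability explained
round(100 * r2, 1)             # the same quantity as a percentage
```

With the real data, the same calls would presumably be applied to the data frame returned by read.csv("Salaries.csv"), using all five explanatory variables.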

(c) Use tests of significance to determine the “best” model. Re-fit the model with the
significant variables only and state the fitted equation of this model (Model II).

(d) Using Model II, give an interpretation of the coefficients of the rankProf and
yrs.service variables.

(e) Produce the diagnostic plots for Model II. List the assumptions underlying the model
and comment on whether each of them holds.

(f) Using Model II, find the residual for subject 26.
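
A residual is the observed response minus the fitted value for that observation. The sketch below checks this on simulated data; the data frame toy and the simple model y ~ x are assumptions for illustration and are not Model II.

```r
# Simulated stand-in data (hypothetical), 30 observations
set.seed(2)
toy <- data.frame(x = 1:30)
toy$y <- 5 + 2 * toy$x + rnorm(30)

fit <- lm(y ~ x, data = toy)

i <- 26                                         # row of interest
res_manual  <- unname(toy$y[i] - fitted(fit)[i]) # observed minus fitted
res_builtin <- unname(residuals(fit)[i])         # same value from residuals()
all.equal(res_manual, res_builtin)
```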

(g) Returning to the full dataset, use AIC to determine the most appropriate model using
forward selection. Report the final model (Model III).
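
One possible way to run forward selection by AIC in base R is step(), starting from the intercept-only model. The sketch below uses simulated data with hypothetical variables x1, x2 and x3, of which only x1 truly affects y; it is not the Salaries analysis.

```r
# Simulated stand-in data (hypothetical): only x1 is related to y
set.seed(3)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

# Start from the intercept-only model and let step() add terms one at a time.
# A single formula given as scope is treated as the largest model allowed.
null_fit <- lm(y ~ 1, data = toy)
fwd <- step(null_fit, scope = ~ x1 + x2 + x3, direction = "forward", trace = 0)

formula(fwd)   # terms retained by forward selection
AIC(fwd)       # AIC of the selected model
```

Setting trace = 1 instead would print the AIC at each step, which may be useful when reporting how the final model was reached. Note that the AIC values printed by step() differ from AIC() by a constant for a given data set, so the ranking of models is unaffected.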

(h) Using adjusted R² as the criterion for comparing models, determine the most
appropriate model from Models I, II and III. Taking everything into consideration,
which model would you adopt? Briefly explain.
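
For reference, the usual definition of adjusted R-squared is given below, where n is the number of observations and p is the number of estimated coefficients excluding the intercept (a factor with k levels contributes k - 1 of them). The symbols n and p are notation introduced here, not taken from the assignment.

```latex
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

In R, this value is available as summary(fit)$adj.r.squared for a model fit created with lm().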

Question 2 [6+4+4+3+5]

A team in a cancer research institute ran a study a few years ago to investigate the
association between lung cancer and smoking history. The data that were used are
saved in the file lungCancer.csv on Canvas. The data set has several variables:
LungCancer – an indicator of whether the participant was diagnosed with lung
cancer (1-Yes, 0-No).
Age – the age of the participant when he or she joined the study.
Gender – 1 for male and 0 for female.
Ever_smoked – an indicator of whether the participant ever smoked (1-Yes, 0-No).

Read the data into R and answer the following questions.

(a) Create a data frame in R to summarise the frequency of the response variable
LungCancer for the four combinations of the two potential explanatory variables,
Ever_smoked and Gender. The first row of your data frame should look like this:
Ever_smoked  Gender  LungCancer  total
no           F       2           21
Note: you may find it useful to use the function table() to help you calculate the
relevant numbers to fill your data frame.
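
A minimal sketch of one way to build such a summary with table() and xtabs() follows. The data frame toy is simulated and is an assumption for this example, so its counts will not match lungCancer.csv.

```r
# Simulated stand-in data (hypothetical): NOT the real lungCancer.csv values
set.seed(4)
n <- 200
toy <- data.frame(
  Ever_smoked = rbinom(n, 1, 0.5),
  Gender      = rbinom(n, 1, 0.5)
)
toy$LungCancer <- rbinom(n, 1, 0.1 + 0.2 * toy$Ever_smoked)

# Group sizes: number of participants in each Ever_smoked x Gender cell
totals <- as.data.frame(table(Ever_smoked = toy$Ever_smoked, Gender = toy$Gender))
# Cases: xtabs() sums the 0/1 LungCancer indicator within each cell
cases  <- as.data.frame(xtabs(LungCancer ~ Ever_smoked + Gender, data = toy))

# Both data frames list the cells in the same order, so they can be combined
summary_df <- data.frame(
  Ever_smoked = ifelse(totals$Ever_smoked == "1", "yes", "no"),
  Gender      = ifelse(totals$Gender == "1", "M", "F"),
  LungCancer  = cases$Freq,
  total       = totals$Freq
)
summary_df
```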

(b) To keep the model simple, we consider only the effect of smoking (Ever_smoked)
on LungCancer. Find a point estimate and 90% confidence interval for the
difference between the percentage of participants who ever smoked among the
sick participants (LungCancer = 1) and the percentage of participants who ever
smoked among the healthy participants (LungCancer = 0). Carry out a corresponding
hypothesis test (at the 10% significance level) and briefly describe your conclusion.
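
The calculation for a difference between two proportions can be sketched as follows. The counts are invented for illustration and are not taken from lungCancer.csv; the interval shown is the large-sample (Wald) interval without continuity correction, which is one common choice and not necessarily the one expected here.

```r
# Hypothetical counts, invented for illustration only
smokers <- c(sick = 30, healthy = 40)    # number who ever smoked in each group
group_n <- c(sick = 50, healthy = 100)   # group sizes

p_hat <- smokers / group_n
diff  <- unname(p_hat["sick"] - p_hat["healthy"])   # point estimate
se    <- sqrt(sum(p_hat * (1 - p_hat) / group_n))   # unpooled standard error
ci    <- diff + c(-1, 1) * qnorm(0.95) * se         # 90% Wald interval

# prop.test() gives the same interval when the continuity correction is off,
# together with a p-value for the test of equal proportions
pt <- prop.test(smokers, group_n, conf.level = 0.90, correct = FALSE)
pt$conf.int
pt$p.value
```

Multiplying diff and ci by 100 expresses the estimate and interval as percentages.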


(c) Fit a logistic regression model with LungCancer as the outcome, and
Ever_smoked as the explanatory variable. Using your regression summary
output, calculate the probabilities of being diagnosed with lung cancer for those
who ever smoked and for those who never smoked.
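
Logistic regression coefficients are on the log-odds scale, so a probability is recovered by applying the inverse logit, p = exp(eta) / (1 + exp(eta)), which plogis() computes in R. The sketch below uses simulated data; the data frame toy and its probabilities are assumptions, not the study data.

```r
# Simulated stand-in data (hypothetical): NOT the real lungCancer.csv values
set.seed(5)
n <- 500
toy <- data.frame(Ever_smoked = rbinom(n, 1, 0.5))
toy$LungCancer <- rbinom(n, 1, ifelse(toy$Ever_smoked == 1, 0.30, 0.10))

fit <- glm(LungCancer ~ Ever_smoked, family = binomial, data = toy)
b <- coef(fit)

# Linear predictor is b0 for never-smokers and b0 + b1 for ever-smokers
p_never <- plogis(b[["(Intercept)"]])
p_ever  <- plogis(b[["(Intercept)"]] + b[["Ever_smoked"]])
c(never = p_never, ever = p_ever)
```

With a single binary explanatory variable, these fitted probabilities coincide with the observed proportions in each group, which gives a quick check on the arithmetic.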

(d) Now fit the logistic regression that explains LungCancer by using both
Ever_smoked and Gender. Test the significance of the effect of Gender when it
is added to the model with Ever_smoked.
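
One way to test an added term in a logistic regression is to compare the nested models by their change in deviance (a likelihood ratio test). The sketch below does this with anova() on simulated data in which Gender has no true effect; the data frame toy is an assumption for illustration. The Wald z-test in the summary() output of the larger model is an alternative.

```r
# Simulated stand-in data (hypothetical): Gender has no true effect here
set.seed(6)
n <- 400
toy <- data.frame(
  Ever_smoked = rbinom(n, 1, 0.5),
  Gender      = rbinom(n, 1, 0.5)
)
toy$LungCancer <- rbinom(n, 1, plogis(-2 + 1.2 * toy$Ever_smoked))

fit_small <- glm(LungCancer ~ Ever_smoked, family = binomial, data = toy)
fit_big   <- glm(LungCancer ~ Ever_smoked + Gender, family = binomial, data = toy)

# Drop in deviance from adding Gender, compared with a chi-squared distribution
lrt <- anova(fit_small, fit_big, test = "Chisq")
p_gender <- lrt[2, "Pr(>Chi)"]
lrt
```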

(e) It is suspected that age is associated with LungCancer; check whether the data support
this (do this simply, with summary statistics, not significance tests).
In this dataset, were the people who ever smoked (Ever_smoked=1) older or
younger, on average, than those who never smoked?
Should we include both Ever_smoked and Age in our model to explain
LungCancer? How could this be tested?

Question 3 [3+3+3+3+3]

Arkys walckenaeri spiders, found on the leaves of eucalypt trees in Tasmania, have six eyes.
Researchers are concerned that this species can develop eye infections due to exposure to
eucalypt oil. They decided to run an experiment to see whether diluted tea tree oil can help
treat the eye infections. For the experiment they caught 20 spiders. The aim of the
experiment was to determine whether the infection is healed after 5 days of treatment.
They considered several experimental designs.
(Image source: http://www.tasmanianspiders.info/050.htm)

For each of the experimental designs below, state what the experimental unit is, whether
blocking has been used (and identify the blocking factor), and any flaws in the design
(statistically unsound aspects).

I. Ten spiders are randomly chosen to receive the tea tree mixture. The remaining
spiders receive a water treatment.

II. The ten biggest spiders receive the tea tree mixture. The others receive the water
treatment.

III. All spiders receive the tea tree mixture, and the results are compared to
measurements taken before the experiment for each of the spiders.

IV. Researchers select at random 10 trees that are at least 10 meters apart
and collect two spiders from each tree. One of the two spiders from each tree
receives the tea tree mixture and the other spider receives the water treatment.

V. For each spider, its left or right large eye is randomly chosen to receive the tea tree
mixture. The other large eye receives the water treatment.

Additional space:

Additional space:
