MAST90044 Thinking and Reasoning with Data
Semester 1 2021
Assignment 3
Due: 17:00, Tue 25 May

Student name:______________________________
Student number:____________________________

• Please label your assignment with the following information in the appropriate spots at the top of this document:
  o your name
  o your student number
• This assignment is worth 20% of the marks in this subject, and covers the work done in weeks 8 to 10.
• The total number of marks for this assignment is 61.
• Your assignment should show all working and reasoning, as marks will be given for method as well as for correct answers. Please spellcheck your document.
• Paste any R code and output into the appropriate places so that it can be seen easily along with your other work. Graphics from R can be resized within your document; make them smaller as necessary.
• Tutors will not help you directly with assignment questions. However, they may give some help with R.
• Please note that we may only mark a subset of questions.
• Assignments are to be saved as a PDF and submitted (uploaded) via Gradescope.
• Each question is followed by an empty box for the answer. Please answer each question in the dedicated box. If you need more space for a question, you can use the empty pages at the end of this document, BUT clearly state that the answer to the question continues on the additional pages. Please DO NOT resize or move the boxes, or add additional pages (except right at the end). The document needs to keep its current format for ease of marking.

Question 1 [3+4+1+3+5+2+3+3]

Fox, J. and Weisberg, S. (2019) collected academic salary data for Assistant Professors, Associate Professors and Professors at a college in the U.S. during 2008–2009. The data were collected as part of the ongoing effort of the college's administration to monitor salary differences between male and female faculty members. The data frame includes 397 observations and 6 variables.
The variables are:
- rank: a factor with levels AssocProf, AsstProf, Prof.
- discipline: a factor with levels A ("theoretical" departments) or B ("applied" departments).
- yrs.since.phd: years since PhD.
- yrs.service: years of service.
- sex: a factor with levels Female and Male.
- salary: nine-month salary, in dollars.

The entire data set is in Salaries.csv on Canvas.

(a) Construct an appropriate graph of these data. Comment on the apparent strength or otherwise of the relationships between the response variable and each of the explanatory variables. Also comment on the strength or otherwise of the relationships between the explanatory variables.

(b) Fit a multiple regression that includes all the explanatory variables (Model I). State the fitted equation and the percentage of variability in the salary variable explained by the model. Comment on the overall suitability of the linear model.

(c) Use tests of significance to determine the "best" model. Re-fit the model with the significant variables only and state the fitted equation of this model (Model II).

(d) Using Model II, give an interpretation of the coefficients of the rankProf and yrs.service variables.

(e) Plot the diagnostic plots for Model II. List the assumptions underlying the model and comment on whether each of them holds or not.

(f) Using Model II, find the residual for subject 26.

(g) Returning to the full dataset, use AIC to determine the most appropriate model using forward selection. Report the final model (Model III).

(h) Using adjusted R² as the criterion for comparing models, determine the most appropriate model from Models I, II and III. Taking everything into consideration, which model would you adopt? Briefly explain.

Question 2 [6+4+4+3+5]

A team in a cancer research institute ran a study a few years ago to investigate the association between lung cancer and smoking history. The data that were used are saved in the file lungCancer.csv on Canvas.
The data have several variables:
LungCancer – an indicator of whether the participant was diagnosed with lung cancer (1-Yes, 0-No).
Age – the age of the participant when he or she joined the study.
Gender – 1 for male and 0 for female.
Ever_smoked – an indicator of whether the participant ever smoked (1-Yes, 0-No).

Read the data into R and answer the following questions.

(a) Create a data frame in R to summarise the frequency of the response variable LungCancer for the four combinations of the two potential explanatory variables, Ever_smoked and Gender. The first row of your data frame should look like this:

Ever_smoked   Gender   LungCancer   total
no            F        2            21

Note: you may find it useful to use the function table() to help you calculate the relevant numbers to fill your data frame.

(b) To keep the model simple, we consider only the effect of smoking (Ever_smoked) on LungCancer. Find a point estimate and 90% confidence interval for the difference between the percentage of participants who ever smoked among the sick participants and the percentage of participants who ever smoked among the healthy participants. Carry out a corresponding hypothesis test (at the 10% significance level) and briefly describe your conclusion.

(c) Fit a logistic regression model with LungCancer as the outcome, and Ever_smoked as the explanatory variable. Using your regression summary output, calculate the probabilities of being diagnosed with lung cancer for those who ever smoked and for those who never smoked.

(d) Now fit the logistic regression that explains LungCancer by using both Ever_smoked and Gender. Test the significance of the effect of Gender when it is added to the model with Ever_smoked.

(e) It is suspected that age is associated with LungCancer; check if the data support this (do this simply, with summary statistics, not significance tests). In this dataset, were the people who ever smoked (Ever_smoked=1) older, or younger, on average, than those who never smoked?
Should we include both Ever_smoked and Age in our model to explain LungCancer? How could this be tested?

Question 3 [3+3+3+3+3]

Arkys walckenaeri spiders, found on the leaves of eucalypt trees in Tasmania, have six eyes. Researchers are concerned that this species can develop eye infections due to exposure to eucalypt tree oil. They decided to run an experiment to see if diluted tea tree oil can help heal the eye infections. For the experiment they caught 20 spiders. The aim of the experiment was to determine whether, after 5 days of treatment, the infection would be healed. They considered several experimental designs.
(http://www.tasmanianspiders.info/050.htm)

For each of the experimental designs below, state what the experimental unit is, whether blocking has been used (and identify the blocking factor), and any flaws in the design (statistically unsound aspects).

I. Ten spiders are randomly chosen to receive the tea tree mixture. The remaining spiders receive a water treatment.

II. The ten biggest spiders receive the tea tree mixture. The others receive the water treatment.

III. All spiders receive the tea tree mixture, and the results are compared to measurements taken before the experiment for each of the spiders.

IV. Researchers select at random 10 trees at least 10 meters apart and collect two spiders from each tree. One of the two spiders from each tree receives the tea tree mixture and the other spider receives the water treatment.

V. For each spider, its left or right large eye is randomly chosen to receive the tea tree mixture. The other large eye receives the water treatment.

Additional space: