STAT5002: Introduction to Statistics - Semester 1, 2022

Submission Due Date: Friday, 27th May, 2022 before 11:59 pm (Sydney time)

Instructions:

1. You are required to type up your entire assignment, including any equations. If you are using Word, you should use the equation editor for any maths notation.

2. Copy and paste relevant R code and outputs while discussing your answer in the text. Do not put all R code and outputs at the end of the document.

3. Answer all questions in the given order; i.e., 1(a), 1(b), etc. Keep your answers clear, brief, and concise.

4. Convert your assignment to PDF and upload it to the Turnitin assignment box on Canvas.

5. Data used in this assignment are in the spreadsheet ADataset.xlsx.

6. You MUST write up solutions on your own. Do not discuss the assignment with your classmates. Students caught cheating will automatically receive a mark of 0 and are subject to disciplinary action.

7. This assignment carries a weight of 8% towards your final mark for STAT5002.

1. Assume that the marks in the following subjects are normally distributed:

Subject       Mean (µ)   Standard deviation (σ)
Statistics    50         12
Economics     65         10
Mathematics   76         8

(a) Douglas obtained a final mark of 68 in Statistics, 73 in Economics, and 71 in Mathematics. In which subject did he perform best compared to the rest of the class?

(b) Maria’s z scores were 1.2 in Statistics, -0.5 in Economics, and -1.5 in Mathematics. Calculate Maria’s mark in the subject where she performed worst compared to the rest of the class.

(c) Examiners often use z scores to scale marks via a new mean and a new standard deviation. The new marks are then directly comparable. Calculate the scaled marks with a new mean of 100 and a new standard deviation of 20 in each subject for Douglas and Maria.

(d) Refer to part (c). Who had the best overall performance?
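
The standardisation and rescaling used in Question 1 come down to two formulas: z = (x - µ)/σ, and its inverse x_new = µ_new + z·σ_new. The following is a minimal sketch only, written in Python rather than the R used elsewhere in this assignment. The function names are illustrative, and the marks, mean, and standard deviation are made up; they are not taken from the table above.

```python
def z_score(mark, mean, sd):
    """Standardise a raw mark: z = (x - mu) / sigma."""
    return (mark - mean) / sd


def mark_from_z(z, mean, sd):
    """Invert the standardisation: x = mu + z * sigma."""
    return mean + z * sd


def rescale(mark, mean, sd, new_mean=100, new_sd=20):
    """Map a raw mark onto a scale with a new mean and standard deviation."""
    return mark_from_z(z_score(mark, mean, sd), new_mean, new_sd)


# Hypothetical subject (NOT from the assignment): mean 60, standard deviation 5.
z = z_score(70, mean=60, sd=5)          # two standard deviations above the mean
raw = mark_from_z(-1.0, mean=60, sd=5)  # raw mark that corresponds to z = -1
scaled = rescale(70, mean=60, sd=5)     # the same z score on the 100/20 scale

print(z, raw, scaled)
```

Because rescaling preserves the z score, marks from subjects with different means and standard deviations end up on one common scale.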

2. It has long been known that brain weight scales with body weight across large groups of animals. The data were collected on n = 24 mammals and are found in the Q2 sheet of the Excel spreadsheet. Let X be the body weight (kg) and Y be the brain weight (g).

(a) Produce a scatter plot of "Brain weights (g)" versus "Body weights (kg)". Make sure you label your axes properly and that your graph has an appropriate title. Briefly describe the nature of the relationship between these two variables. Are there any outliers? If yes, can we remove them? Why or why not?

(b) You would like to build a linear regression model to predict brain weight (g) using body weight (kg). Which model (linear-linear, log-linear, linear-log, or log-log) fits the data best? Provide visual evidence to support your argument. Write down the model of your choice.
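
The four candidate models in part (b) differ only in which of X and Y is log-transformed before a straight line is fitted. The sketch below, in Python rather than R, fits all four to synthetic power-law data, not to the Q2 sheet. Two assumptions are made here: "log-linear" is read as log(Y) on X, and "linear-log" as Y on log(X). The helper `ols` and the variable names are illustrative.

```python
import math


def ols(x, y):
    """Simple least-squares line y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    a = my - b * mx
    r2 = sxy ** 2 / (sxx * syy)
    return a, b, r2


# Synthetic power-law data (NOT the Q2 sheet): y = 10 * x^0.75 exactly.
body = [0.1, 1, 5, 20, 100, 500, 2500]
brain = [10 * w ** 0.75 for w in body]

log_body = [math.log(w) for w in body]
log_brain = [math.log(w) for w in brain]

fits = {
    "linear-linear": ols(body, brain),
    "log-linear": ols(body, log_brain),   # assumed meaning: log(Y) on X
    "linear-log": ols(log_body, brain),   # assumed meaning: Y on log(X)
    "log-log": ols(log_body, log_brain),
}
for name, (a, b, r2) in fits.items():
    print(f"{name:14s} intercept={a:8.3f} slope={b:8.3f} R^2={r2:.4f}")
```

On data that follow a power law, the log-log fit recovers the exponent as its slope. R² values are not strictly comparable when the response is on different scales, so scatter plots of each transformed pair remain the primary evidence.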

3. The dataset Q3 contains the following information on a sample of n = 36 severely depressed individuals.

Variable   Description
Eff        Measure of the effectiveness of the treatment
Age        Age (years)
Tmt        Treatment received (A, B or C)

(a) Produce a scatter plot of Eff versus Age. What does it show?

(b) Run a regression of Eff on Age. Write down the fitted regression equation.

(c) Produce another scatter plot of Eff versus Age but this time with colour coding and different regression lines for each of the three treatments. Does the treatment appear to interact with age in explaining the response? Explain why or why not.

(d) Code up dummy variables for treatments A and B as well as an interaction between Age and each of treatments A and B. Attach the R code to show how you create the dummies and interaction terms. Why don’t we need a dummy variable for treatment C?

(e) Using Age, the dummies, and the interactions as predictors, perform backward elimination to obtain the best model according to the AIC. Write down the final estimated regression equation. What percentage of the total variation in Eff is explained by the model?

(f) Use the partial F test to determine which model [the one in part (b) or the one in part (e)] fits the data better. State H0 and H1, the observed value of the test statistic, the p-value, the decision, and the conclusion.

(g) Predict the effectiveness of treatments for the following people:

Patient   Age   Treatment
Peter     20    A
Anna      56    B
Louis     69    C
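
Two pieces of Question 3 are mechanical enough to sketch: building the dummy and interaction columns of part (d), and the partial F statistic of part (f) for nested models, F = ((RSS_reduced - RSS_full)/q) / (RSS_full/df_full), where q is the number of extra terms in the full model and df_full is its residual degrees of freedom. The sketch below is in Python rather than the R that part (d) asks for. The column names `TmtA`, `TmtB`, `Age_TmtA`, and `Age_TmtB` are illustrative, and the patients and residual sums of squares are made up; none of them come from the Q3 sheet.

```python
def design_row(age, tmt):
    """Predictors for one patient, with treatment C as the reference level."""
    d_a = 1 if tmt == "A" else 0   # dummy for treatment A
    d_b = 1 if tmt == "B" else 0   # dummy for treatment B
    return {
        "Age": age,
        "TmtA": d_a,
        "TmtB": d_b,
        "Age_TmtA": age * d_a,     # interaction: Age slope may differ for A
        "Age_TmtB": age * d_b,     # interaction: Age slope may differ for B
    }


def partial_f(rss_reduced, rss_full, extra_terms, resid_df_full):
    """Partial F statistic for comparing a reduced model nested in a full one."""
    return ((rss_reduced - rss_full) / extra_terms) / (rss_full / resid_df_full)


# Hypothetical patients (NOT rows from the Q3 sheet).
rows = [design_row(30, "A"), design_row(45, "B"), design_row(60, "C")]
for r in rows:
    print(r)

# Hypothetical residual sums of squares, chosen only to exercise the formula.
f_stat = partial_f(rss_reduced=500.0, rss_full=300.0,
                   extra_terms=4, resid_df_full=30)
print(f_stat)
```

The observed F value is then compared with an F distribution on (q, df_full) degrees of freedom to obtain the p-value.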

4. As part of the 2020 College Alcohol Study, students who drank alcohol in their senior year were asked if drinking ever resulted in missing a class. The data are given in the following table:

                 Drinking status
Missed a class   Nonbinger   Occasional binger   Frequent binger   Total
No               41          18                  11                70
Yes              4           8                   18                30
Total            45          26                  29                100

(a) At the 0.05 level of significance, is there evidence of a significant association between missing a class and drinking status? State H0 and H1, the observed value of the test statistic, the p-value, the decision, and the conclusion.

(b) What is the conditional distribution of drinking status?

(c) What are the odds that a nonbinger missed a class?

(d) What are the odds that a frequent binger missed a class?

(e) What is the odds ratio for nonbingers versus frequent bingers who missed a class?

(f) Fit a logistic regression of whether a senior student never missed a class on drinking status. Treat frequent binger as the base group. Write down the fitted regression equation.

(g) Refer to part (f). What is the odds ratio for nonbingers versus frequent bingers who missed a class? Is it the same as your calculation in part (e)? Explain why or why not.
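
The quantities in Question 4 can be computed directly from the cell counts of a contingency table: expected counts E_ij = (row total × column total)/n for the Pearson chi-square statistic, odds as (count with the outcome)/(count without it), and the odds ratio as a ratio of two odds. The sketch below is in Python rather than R, on a made-up 2 × 2 table that is not the assignment's data. The function names are illustrative.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def odds(events, non_events):
    """Odds of an event = count with the event / count without it."""
    return events / non_events


# Hypothetical 2 x 2 table (NOT the assignment's data).
# Rows: outcome no / yes; columns: group 1 / group 2.
table = [[30, 10],
         [20, 40]]

stat, df = chi_square(table)
odds_g1 = odds(table[1][0], table[0][0])   # outcome "yes" in group 1: 20 / 30
odds_g2 = odds(table[1][1], table[0][1])   # outcome "yes" in group 2: 40 / 10
odds_ratio = odds_g1 / odds_g2             # group 1 versus group 2

print(stat, df, odds_g1, odds_g2, odds_ratio)
```

The p-value for the test comes from the chi-square distribution with `df` degrees of freedom. Swapping which outcome counts as the "event" inverts each odds, and therefore the odds ratio.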
