Semester 1 Main, 2018
The University of Sydney
School of Mathematics and Statistics
MATH1015
Biostatistics
June 2018 Lecturer: J Chan, K Wang, S Romanes
Time Allowed: Reading - 10 minutes; Writing - 1.5 hours
Exam Conditions: This is a closed-book examination — no material permitted. Writing
is not permitted at all during reading time.
Family Name: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SID: . . . . . . . . . . . . . . . . . . . . . . . . . . .
Other Names: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seat Number: . . . . . . . . . . . . . . . . .
Please check that your examination paper is complete (16 pages) and indicate by signing below.
I have checked the examination paper and affirm it is complete.
Signature: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: . . . . . . . . . . . . . . . . . . . . . . . . .
This examination consists of 16 pages, numbered from 1 to 16.
There are 3 questions, numbered from 1 to 3.
You may bring in one double-sided A4 sheet of notes. Calculators are not permitted.
Answer these questions in the spaces provided.
If rounding is required, please give your answer to 2 decimal places.
1. Women have increasingly become the primary income earners for their families. To
test this claim, the editor of The Working Mothers magazine included a form in the
magazine for its 615 subscribers to report their salaries and their partners’ salaries. 124
of them returned the forms with their salary information.
(a) Name one possible source of bias in the study design and suggest a way to improve it.
(b) The salary data, in thousands of dollars, were saved in salary.csv. To visualise the
data, comparative boxplots are drawn below.
data = read.csv("salary.csv")   # the file name must be a quoted string
x = data$husband                # husbands' salaries
y = data$wife                   # wives' salaries
boxplot(x, y, names = c("X", "Y"))
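[Figure: comparative boxplots of X and Y on a common vertical axis running from 60 to 160; boxplot Y shows one point plotted as a circle.]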
Briefly compare the centre, the spread and the skewness of the two distributions.
Explain the circle in boxplot Y.
(c) Adam decided to use R to calculate the sample correlation coefficient between the
salaries of husbands and wives but was not sure how to do it. He tried three different
approaches; his R code and output are given below. Use this information to
state the correct sample correlation coefficient.
n = length(x)
zx = (x-mean(x))/sd(x) # standardise x
zy = (y-mean(y))/sd(y) # standardise y
a = mean(zx*zy)
a
## [1] 0.7185
b = a*n/(n-1)
b
## [1] 0.7244
c = sd(y)/sd(x)
c
## [1] 0.7749
(d) A scatter plot of the data is provided below. Briefly comment on the suitability of a
linear regression model.
[Figure: scatter plot of wife salary (vertical axis, 60 to 140) against husband salary (horizontal axis, 60 to 160).]
(e) Using the following R output, write down the regression model in the form:
Y = intercept + slope X where Y = wife’s salary and X = husband’s salary.
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 30.6254 0.5613
(f) Using the regression model in (e), estimate the increase or decrease in the wife's
salary if the husband's salary increases by 1K. Should we conclude that a 1K
increase in the husband's salary will bring about that increase or decrease in the wife's
salary?
(g) Using the information in (c) and the following summary statistics, provide expressions
to calculate the slope and Y intercept for the regression model in (e). No
evaluations of these expressions are needed as the values are already provided in (e).
c(mean(x),mean(y),mean(x*y))
## [1] 106.60 90.46 9889.42
c(sd(x),sd(y))
## [1] 21.05 16.32
2.(a) Consider two box models: Box A = {−1, 1}, and Box B = {−2, 2}.
Calculate the mean of Box A. Show that Box B shares the same mean.
(b) Calculate the standard deviations of Box A and Box B.
(c) Suppose that in an experiment we take 1000 random draws from Box A and then 1000
random draws from Box B. We then calculate the mean of each set of 1000
draws, denoted by SampleMeanA and SampleMeanB respectively.
In general, should we expect SampleMeanA = SampleMeanB exactly? Justify
your answer.
(d) In general, will the mean of Box A in (a) equal SampleMeanA in (c) exactly?
Explain briefly.
(e) State the Law of Averages (Law of Large Numbers) and explain if the result in (d)
contradicts the law.
(f) The experiment in (c) was repeated so that we have multiple copies of SampleMeanA
and SampleMeanB. Below are two histograms labelled "Histogram X" and
"Histogram Y". One of the histograms represents the multiple copies of SampleMeanA
and the other those of SampleMeanB.
[Figure: two density histograms, "Histogram X" and "Histogram Y"; each horizontal axis runs from −0.2 to 0.2 and the vertical axis (density) runs from 0 to 15.]
Identify which histogram corresponds to SampleMeanA and briefly justify your
answer.
3. A local government body investigated if fewer than 10% of the IV drug users in a
particular population are HIV positive. 250 users were sampled from clinics around the
state, with budding statistician Matt assigned to analyse the results.
(a) Matt wants to perform a hypothesis test to arrive at an appropriate conclusion.
(i) State the appropriate hypothesis test for this study, write down the null (H0)
and alternate (H1) hypotheses and define any terms used.
(ii) Fill in the blanks of the box model describing the null distribution below.
(iii) What theorem does Matt need to assume in order to perform this hypothesis
test? Why does he need it?
(iv) From the study, it is found that 21 users of IV drugs presented to the clinic
are HIV positive. Using the R code below, sketch the appropriate area under
the normal curve that Matt would use to find the p-value for this test. What
range of values does the p-value belong in? What conclusion would Matt
arrive at? Annotate clearly the test statistic for this test.
meanbox = mean(box)
sdbox = popsd(box)
ev = 250*meanbox
se = sqrt(250)*sdbox
ev
## [1] 25
se
## [1] 4.743416
(21-ev)/se
## [1] -0.843274
[Figure: blank horizontal axis from −3 to 3 for sketching the normal curve.]
(b) Matt’s associate, Foggy, has different ideas about how to perform hypothesis testing
for this study. He decides to simulate from the box model 1000000 times, and deduce
the p-value. The results from his simulations are below:
totals = replicate(1000000, sum(sample(box, 250, replace = TRUE)))
table(totals)
## totals
## 7 8 9 10 11 12 13 14 15 16 17 18
## 9 27 80 239 511 1191 2428 4729 8094 13483 20411 29617
## 19 20 21 22 23 24 25 26 27 28 29 30
## 39844 51106 62337 72335 79265 83822 83598 80815 73990 65848 55818 45736
## 31 32 33 34 35 36 37 38 39 40 41 42
## 36093 27549 20123 14276 9698 6563 4100 2617 1559 956 501 321
## 43 44 45 46 47 48 49 50 51
## 164 71 35 23 5 6 5 1 1
mean(totals)
## [1] 24.99963
sd(totals)
## [1] 4.741202
hist(totals)
cumsum(table(totals)) # cumulative sum of the frequencies of the totals
## 7 8 9 10 11 12 13 14 15
## 9 36 116 355 866 2057 4485 9214 17308
## 16 17 18 19 20 21 22 23 24
## 30791 51202 80819 120663 171769 234106 306441 385706 469528
## 25 26 27 28 29 30 31 32 33
## 553126 633941 707931 773779 829597 875333 911426 938975 959098
## 34 35 36 37 38 39 40 41 42
## 973374 983072 989635 993735 996352 997911 998867 999368 999689
## 43 44 45 46 47 48 49 50 51
## 999853 999924 999959 999982 999987 999993 999998 999999 1000000
(i) What p-value would Foggy arrive at? Does his conclusion differ from Matt's?
(ii) In your own words, in reference to the histogram above, explain the concept
of a p-value.
Examination solutions to extended answer questions
(1) (a) One answer from each category.
Source of bias:
Nonresponse bias: Working mothers who respond may have higher salaries than
working mothers in general.
Selection bias: Only subscribers to the magazine are included.
Measurement bias: Respondents may not know their partners' salaries.
Improvement:
Improve the response rate by putting the survey online.
Assure respondents of confidentiality.
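(b) The median of X is higher than that of Y, and X has a wider spread. Both
distributions are symmetric. The circle represents an outlier.
(c) The sample correlation coefficient is r = b = 0.7244. Since sd() divides by n − 1,
the sum of zx*zy must also be divided by n − 1 rather than n, which is what
multiplying a by n/(n − 1) achieves.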
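The link between Adam's attempts a and b can be checked with a short sketch. The exam's salary.csv is not available, so the data below are simulated purely for illustration; the seed, sample values and coefficients are assumptions, and only the identity between b and cor(x, y) is the point.

```r
# Hypothetical data for illustration only; these are not the exam's salaries.
set.seed(1)
x <- rnorm(124, mean = 107, sd = 21)
y <- 30 + 0.56 * x + rnorm(124, sd = 11)

n  <- length(x)
zx <- (x - mean(x)) / sd(x)   # sd() divides by n - 1
zy <- (y - mean(y)) / sd(y)

a <- mean(zx * zy)            # sum(zx * zy) / n: slightly too small in size
b <- a * n / (n - 1)          # sum(zx * zy) / (n - 1)

all.equal(b, cor(x, y))       # checks b against R's built-in correlation
```

Attempt c, sd(y)/sd(x), is only the ratio of the two spreads and carries no information about how x and y move together.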
(d) A linear regression model is suitable as there is a clear linear trend with no
obvious outliers.
(e) The regression model is Y = 30.63 + 0.5613X
(f) An increase of 1K in the husband's salary is associated with an estimated increase
in the wife's salary equal to the slope, 0.5613K.
One should only talk about association, not causation. That means one
cannot say that a certain increase in the husband's salary will cause or bring about an
increase in the wife's salary.
(g) Slope = r × SD(Y)/SD(X) = 0.7244 × 16.32/21.05
Y-intercept = mean(Y) − Slope × mean(X) = 90.46 − 0.5613 × 106.60
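Although no evaluation is required, the two expressions can be checked numerically. The sketch below uses only the rounded summary statistics quoted in the question, so the results should agree with the lm() output in (e) only up to rounding; the variable names are chosen here for illustration.

```r
r      <- 0.7244    # correct correlation from part (c)
mean_x <- 106.60    # mean husband salary
mean_y <- 90.46     # mean wife salary
sd_x   <- 21.05
sd_y   <- 16.32

slope     <- r * sd_y / sd_x
intercept <- mean_y - slope * mean_x

# Compare with the lm() coefficients 30.6254 and 0.5613 reported in (e).
round(c(intercept = intercept, slope = slope), 2)
```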
(2) (a) Mean for Box A = $\frac{-1 + 1}{2} = 0$. The mean for Box B is also 0, as the values in Box
B also sum to 0.
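(b) $SD_A = (\text{big} - \text{small}) \times \sqrt{\text{prop}_{\text{big}} \times \text{prop}_{\text{small}}} = [1 - (-1)] \times \sqrt{\tfrac{1}{2} \times \tfrac{1}{2}} = 2 \times \sqrt{\tfrac{1}{4}} = 1$.
Similarly, $SD_B = 2$.
An alternative answer using the RMS of the gaps is $SD_A = \sqrt{\frac{(1 - 0)^2 + (-1 - 0)^2}{2}} = 1$.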
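The box means and SDs can be verified with a few lines of R. The helper pop_sd below is a hypothetical stand-in written for this sketch (base R's sd() divides by n − 1 and so does not give the SD of a box).

```r
# Hypothetical helper: population SD, i.e. the RMS of the gaps from the mean.
pop_sd <- function(box) sqrt(mean((box - mean(box))^2))

box_a <- c(-1, 1)
box_b <- c(-2, 2)

c(mean(box_a), mean(box_b))       # both boxes have mean 0
c(pop_sd(box_a), pop_sd(box_b))   # box SDs: 1 and 2
sd(box_a)                         # sample SD, sqrt(2), which is not the box SD
```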
(c) No, we should not expect the sample means to equal each other exactly. This
is because the mean of a random sample is itself random. There is only a very
small chance that the two means are the same.
(d) Part (a) asks for the expectation, which is a fixed population mean. Part (c) asks
for the sample mean, which is a random variable. So one should not expect the
population mean to equal a sample mean exactly.
(e) The law states that the sample mean becomes more stable and approaches a
fixed number as the number of draws increases.
The answer in (d) does not contradict the law: the sample mean will
approach the fixed population mean, but this does not imply that the sample mean
will equal the population mean for any particular number of draws.
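A small simulation can illustrate the law. This is a sketch only: the seed and the sample sizes are arbitrary choices made here, not part of the exam.

```r
set.seed(2018)                        # arbitrary seed, for reproducibility
box_a <- c(-1, 1)
sizes <- c(10, 100, 1000, 100000)     # numbers of draws, chosen for illustration

# Mean of n draws (with replacement) from Box A, for each n in sizes.
sample_means <- sapply(sizes, function(n) mean(sample(box_a, n, replace = TRUE)))

data.frame(draws = sizes, sample_mean = sample_means)
```

The sample means typically settle closer to the box mean of 0 as the number of draws grows, yet there is no guarantee that any of them equals 0 exactly.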
(f) Box A corresponds to Histogram Y.
Even though both boxes share the same mean, Box A has a lower standard
deviation than Box B, and thus SampleMeanA should spread out less from the centre
when the experiment is repeated.
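The difference in spread can be reproduced by repeating the experiment in R. In this sketch the number of repetitions (n_reps) and the seed are assumptions, since the exam does not say how many copies were generated.

```r
set.seed(1015)                    # arbitrary seed
box_a   <- c(-1, 1)
box_b   <- c(-2, 2)
n_draws <- 1000                   # draws per experiment, as in part (c)
n_reps  <- 2000                   # hypothetical number of repetitions

means_a <- replicate(n_reps, mean(sample(box_a, n_draws, replace = TRUE)))
means_b <- replicate(n_reps, mean(sample(box_b, n_draws, replace = TRUE)))

# Observed spread of the sample means against the theoretical SE = SD / sqrt(n).
c(observed = sd(means_a), theory = 1 / sqrt(n_draws))
c(observed = sd(means_b), theory = 2 / sqrt(n_draws))
```

Calling hist(means_a) and hist(means_b) on the same horizontal scale would then show the narrower histogram for Box A.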
(3) (a) (i) The appropriate hypothesis test is the percentage test (test for proportions).
We have H0: p = 0.1 vs H1: p < 0.1, where p is the proportion of HIV
positive IV drug users in the population.
(ii) In order to get p = 0.1, we need 9 ‘0’ tickets and 1 ‘1’ ticket. We also need
250 draws.
(iii) Matt needs to assume that the Central Limit Theorem applies to the sum of
tickets drawn from the box. He needs this in order to calculate the p-value,
because the test treats the sampling distribution of the sum (equivalently, of the
sample proportion) as approximately normal.
(iv) From the study, 21 of the sampled IV drug users are HIV positive, so the
test statistic is (21 − 25)/4.743416 ≈ −0.84. This lies
between 0 and −1 standard deviations from the mean of the standard
normal distribution, and the p-value is the area under the curve to the left of it.
Using known areas of the normal distribution (about 16% lies below −1 and 50%
lies below 0), the p-value is between 0.16 and 0.5. As this is greater than 0.05,
Matt would retain the null hypothesis.
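With R available, the range can be narrowed to a single value. The sketch below rebuilds the null box from (a)(ii); pop_sd is a hypothetical stand-in for the popsd() function used in the question, whose definition is not shown in the paper.

```r
box <- c(rep(0, 9), 1)            # nine '0' tickets and one '1' ticket

# Hypothetical stand-in for popsd(): population SD (divides by n).
pop_sd <- function(b) sqrt(mean((b - mean(b))^2))

ev <- 250 * mean(box)             # expected number of HIV positive users under H0
se <- sqrt(250) * pop_sd(box)     # standard error of the sum of 250 draws
z  <- (21 - ev) / se              # test statistic

p_value <- pnorm(z)               # lower-tail area, matching H1: p < 0.1
c(ev = ev, se = se, z = z, p_value = p_value)
```

The resulting p-value falls inside the interval (0.16, 0.5) obtained from the known normal areas.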
(b) (i) The p-value is the probability, under H0, of the observed result or one
more extreme. Since the test asks whether fewer than 10% of the IV drug users
are HIV positive, H1 is lower-sided. The idea of (b) is to relax
the normal distribution assumption and use the distribution of simulated
counts (simulating from the box) to approximate the distribution of the count
under H0.
With a lower-sided H1, we need the probability of a count of 21 (the observed
number of HIV positive users) or below. The table gives the cumulative
sum of frequencies, so the entry under 21 (ie 234106) is the total
frequency for values of 21 and below. Hence the relative frequency
234106/1000000 ≈ 0.23 approximates the probability of observing 21 or
fewer, which is the p-value. Since this is
greater than 0.05, Foggy would also retain the null hypothesis and arrive
at the same conclusion as Matt (ie, that the proportion of HIV positive
users may be 10% or even higher).
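The arithmetic can be checked directly from Foggy's frequency table. The sketch below copies the counts for totals 7 to 21 from the table(totals) output; the comparison with pbinom() is an addition made here, on the assumption that the number of HIV positive users in 250 draws follows a Binomial(250, 0.1) distribution under H0.

```r
# Frequencies for totals 7, 8, ..., 21, copied from the table(totals) output.
counts <- c(9, 27, 80, 239, 511, 1191, 2428, 4729, 8094,
            13483, 20411, 29617, 39844, 51106, 62337)

sum(counts)                       # matches the cumulative entry under 21
p_sim <- sum(counts) / 1000000    # simulated p-value

# Exact lower-tail probability under the assumed Binomial(250, 0.1) model.
p_exact <- pbinom(21, size = 250, prob = 0.1)

c(simulated = p_sim, exact = p_exact)
```

The simulated value should sit close to the exact binomial probability, and both are well above 0.05.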
(ii) The p-value is the probability of observing a test statistic as extreme as, or
more extreme than, the one observed from the sample, given that the null
hypothesis is true. In this case, this corresponds to observing 21 or fewer HIV
positive users given that the null hypothesis is true. On the histogram this
is the area of the bars from 0 up to and including 21.
