Semester 1 Main, 2018
The University of Sydney
School of Mathematics and Statistics
MATH1015 Biostatistics
June 2018
Lecturer: J Chan, K Wang, S Romanes

Time Allowed: Reading - 10 minutes; Writing - 1.5 hours
Exam Conditions: This is a closed-book examination: no material permitted. Writing is not permitted at all during reading time.

Family Name: ....................  SID: ....................
Other Names: ....................  Seat Number: ....................

Please check that your examination paper is complete (16 pages) and indicate by signing below.
I have checked the examination paper and affirm it is complete.
Signature: ....................  Date: ....................

This examination consists of 16 pages, numbered from 1 to 16. There are 3 questions, numbered from 1 to 3.
You may bring in 1 sheet of two-sided A4 page of notes. Calculators are not permitted.

Marker’s use only

Answer these questions in the spaces provided. If rounding is required, please give your answer to 2 decimal places.

1. Women have increasingly become the primary income earners for their families. To test this claim, the editor of The Working Mothers magazine included a form in the magazine for its 615 subscribers to report their salaries and their partners’ salaries. 124 of them returned the forms with their salary information.

(a) Name one possible source of bias in the study design and suggest a way of improvement.

(b) The salary data in thousand dollars were saved in salary.csv. To visualise the data, comparative boxplots were drawn below.
    data = read.csv("salary.csv")        # the file name must be quoted
    x = data$husband                     # husbands' salaries
    y = data$wife                        # wives' salaries
    boxplot(x, y, names = c("X", "Y"))

[Figure: comparative boxplots of X and Y; salary axis from 60 to 160; a circle is plotted in boxplot Y]

Briefly compare the centre, the spread and the skewness of the two distributions. Explain the circle in boxplot Y.

(c) Adam decided to use R to calculate the sample correlation coefficient between the salaries of husbands and wives but was not sure how to do it. He tried three different approaches; his R code and output are given below. Use the information to state the correct sample correlation coefficient.

    n = length(x)
    zx = (x-mean(x))/sd(x) # standardise x
    zy = (y-mean(y))/sd(y) # standardise y
    a = mean(zx*zy)
    a
    ## [1] 0.7185
    b = a*n/(n-1)
    b
    ## [1] 0.7244
    c = sd(y)/sd(x)
    c
    ## [1] 0.7749

(d) A scatter plot of the data is provided below. Briefly comment on the suitability of a linear regression model.

[Figure: scatter plot of wife salary (vertical axis, 60 to 140) against husband salary (horizontal axis, 60 to 160)]

(e) Using the following R output, write down the regression model in the form Y = intercept + slope X, where Y = wife’s salary and X = husband’s salary.

    ##
    ## Call:
    ## lm(formula = y ~ x)
    ##
    ## Coefficients:
    ## (Intercept)            x
    ##     30.6254       0.5613

(f) Using the regression model in (e), estimate the amount of increase/decrease in the wife’s salary if the husband’s salary increases by 1K. Should we conclude that a 1K increase in the husband’s salary will bring about that amount of increase/decrease in the wife’s salary?

(g) Using the information in (c) and the following summary statistics, provide expressions to calculate the slope and Y intercept for the regression model in (e).
No evaluations of these expressions are needed as the values are already provided in (e).

    c(mean(x),mean(y),mean(x*y))
    ## [1]  106.60   90.46 9889.42
    c(sd(x),sd(y))
    ## [1] 21.05 16.32

2. (a) Consider two box models: Box A = {−1, 1}, and Box B = {−2, 2}. Calculate the mean of Box A. Show that Box B shares the same mean.

(b) Calculate the standard deviations of Box A and Box B.

(c) Suppose in an experiment, we take 1000 random draws from Box A and then 1000 random draws from Box B. We then calculate the respective means of these two sets of 1000 draws, denoted by SampleMeanA and SampleMeanB. In general, should we expect SampleMeanA = SampleMeanB exactly? Justify your answer.

(d) In general, will the mean of Box A in (a) be exactly equal to SampleMeanA in (c)? Explain briefly.

(e) State the Law of Averages (Law of Large Numbers) and explain if the result in (d) contradicts the law.

(f) The experiment in (c) was repeated so that we have multiple copies of SampleMeanA and SampleMeanB. Below are two histograms labelled as “Histogram X” and “Histogram Y”. One of the histograms represents the multiple copies of SampleMeanA and the other those of SampleMeanB.

[Figure: two density histograms, “Histogram X” and “Histogram Y”; each horizontal axis runs from −0.2 to 0.2; vertical axis “density” from 0 to 15]

Identify which histogram corresponds to SampleMeanA and briefly justify your answer.

3. A local government body investigated if fewer than 10% of the IV drug users in a particular population are HIV positive. 250 users were sampled from clinics around the state, with budding statistician Matt assigned to analyse the results.

(a) Matt wants to perform a hypothesis test to arrive at an appropriate conclusion.

(i) State the appropriate hypothesis test for this study, write down the null (H0) and alternate (H1) hypotheses and define any terms used.

(ii) Fill in the blanks of the box model describing the null distribution below.
(iii) What theorem does Matt need to assume in order to perform this hypothesis test? Why does he need it?

(iv) From the study, it is found that 21 users of IV drugs presented to the clinic are HIV positive. Using the R code below, sketch the appropriate area under the normal curve that Matt would use to find the p-value for this test. What range of values does the p-value belong in? What conclusion would Matt arrive at? Annotate clearly the test statistic for this test.

    meanbox = mean(box)
    sdbox = popsd(box)
    ev = 250*meanbox
    se = sqrt(250)*sdbox
    ev
    ## [1] 25
    se
    ## [1] 4.743416
    (21-ev)/se
    ## [1] -0.843274

[Blank axis for the sketch, marked from −3 to 3]

(b) Matt’s associate, Foggy, has different ideas about how to perform hypothesis testing for this study. He decides to simulate from the box model 1000000 times and deduce the p-value. The results from his simulations are below:

    totals = replicate(1000000, sum(sample(box, 250, replace = TRUE)))
    table(totals)
    ## totals
    ##     7     8     9    10    11    12    13    14    15    16    17    18
    ##     9    27    80   239   511  1191  2428  4729  8094 13483 20411 29617
    ##    19    20    21    22    23    24    25    26    27    28    29    30
    ## 39844 51106 62337 72335 79265 83822 83598 80815 73990 65848 55818 45736
    ##    31    32    33    34    35    36    37    38    39    40    41    42
    ## 36093 27549 20123 14276  9698  6563  4100  2617  1559   956   501   321
    ##    43    44    45    46    47    48    49    50    51
    ##   164    71    35    23     5     6     5     1     1
    mean(totals)
    ## [1] 24.99963
    sd(totals)
    ## [1] 4.741202
    hist(totals)
    cumsum(table(totals)) # cumulative sum of the totals
    ##       7       8       9      10      11      12      13      14      15
    ##       9      36     116     355     866    2057    4485    9214   17308
    ##      16      17      18      19      20      21      22      23      24
    ##   30791   51202   80819  120663  171769  234106  306441  385706  469528
    ##      25      26      27      28      29      30      31      32      33
    ##  553126  633941  707931  773779  829597  875333  911426  938975  959098
    ##      34      35      36      37      38      39      40      41      42
    ##  973374  983072  989635  993735  996352  997911  998867  999368  999689
    ##      43      44      45      46      47      48      49      50      51
    ##  999853  999924  999959  999982  999987  999993  999998  999999 1000000

(i) What p-value would
Foggy arrive at? Does his conclusion differ from Matt’s?

(ii) In your own words, in reference to the histogram above, explain the concept of a p-value.

Examination Solution of extended answer questions

(1) (a) One answer from each of the following.
Source of bias:
Nonresponse bias: Working mothers who respond may have higher salaries than working mothers in general.
Selection bias: Only subscribers of the magazine are included.
Measurement bias: Respondents may not know the salary of their partners.
Improvement: Improve the response rate by putting the survey online. Respondents should be assured of confidentiality.

(b) The median of X is higher than that of Y and X has a wider spread. Both distributions are symmetric. The circle represents an outlier.

(c) The sample correlation coefficient is r = b = 0.7244.

(d) A linear regression model is suitable as there is a clear linear trend with no obvious outliers.

(e) The regression model is Y = 30.63 + 0.5613X.

(f) An increase of 1K in the husband’s salary is associated with an increase in the wife’s salary equal to the slope, 0.5613K. One should only talk about association, not causation. That means one cannot say that a certain increase in the husband’s salary will cause or bring about an increase in the wife’s salary.

(g) Slope = r × SD(Y)/SD(X) = 0.7244 × 16.32/21.05
Y-intercept = mean(Y) − Slope × mean(X) = 90.46 − 0.5613 × 106.60

(2) (a) Mean for Box A $= \frac{-1 + 1}{2} = 0$. The mean for Box B is also 0 as the values in Box B also sum to 0.

(b) $SD_A = (\text{big} - \text{small}) \times \sqrt{\text{prop}_{\text{big}} \times \text{prop}_{\text{small}}} = [1 - (-1)] \times \sqrt{\tfrac{1}{2} \times \tfrac{1}{2}} = 2 \times \sqrt{\tfrac{1}{4}} = 1$. Similarly $SD_B = 2$.
An alternative answer using the RMS of the gaps is $SD_A = \sqrt{\frac{(1 - 0)^2 + (-1 - 0)^2}{2}} = 1$.

(c) No, we should not expect the sample means to equal each other exactly. This is because the mean of a random sample is itself random. There is only a very small chance that the two means are the same.

(d) Part (a) asks for expectation which is a fixed population mean.
Part (c) asks for the sample mean, which is a random variable. So one should not expect the population mean to equal a sample mean exactly.

(e) The law states that the sample mean becomes more stable and approaches a fixed number (the population mean) as the number of draws increases. The answer in (d) does not contradict the law: the sample mean approaches the fixed population mean, but that does not imply the sample mean will equal the population mean for any particular number of draws.

(f) Box A corresponds to Histogram Y. Even though both boxes share the same mean, Box A has a lower standard deviation than Box B, and thus SampleMeanA should spread out less from the centre when the experiment is repeated.

(3) (a) (i) The appropriate hypothesis test is the percentage test (test for proportions). We have H0: p = 0.1 vs H1: p < 0.1, where p is the proportion of HIV positive IV drug users in the population.

(ii) In order to get p = 0.1, we need 9 ‘0’ tickets and 1 ‘1’ ticket. We also need 250 draws.

(iii) Matt needs to assume that the Central Limit Theorem applies to the sum of draws from the box. He needs it to calculate the p-value, because the test treats the sampling distribution of the sum (equivalently, of the sample proportion) as approximately normal.

(iv) From the study, it is found that 21 users of IV drugs presented to the clinic are HIV positive. The test statistic is approximately −0.843274, which lies between −1 and 0 standard deviations from the mean of the standard normal distribution. Using known areas of the normal distribution, the p-value (the area to the left of the test statistic) is between 0.16 and 0.5. As this is greater than 0.05, Matt would retain the null hypothesis.

(b) (i) The p-value is the probability, under H0, of the observed value or one more extreme. Since this test asks whether fewer than 10% of the IV drug users are HIV positive, we have a lower-sided H1.
The idea of (b) is to relax the normal distribution assumption and use the distribution of simulated counts to approximate the distribution of the count under H0 (simulating from the box). With a lower-sided H1, the p-value is the probability of observing 21 (the observed number of HIV positive users) or fewer. As the table gives the cumulative sum of frequencies, the value under 21 (i.e. 234106) is the total frequency of all values of 21 and below, which is what we want. Hence the relative frequency 234106 / 1000000 (about 0.23) approximates the probability of observing 21 or fewer, which is the p-value. Since this is greater than 0.05, Foggy would also retain the null hypothesis and arrive at the same conclusion as Matt (i.e. that the proportion of HIV positive users may be 10% or even higher).

(ii) The p-value is the probability of observing a test statistic as extreme as or more extreme than the one observed from the sample, given that the null hypothesis is true. In this case, this corresponds to observing 21 or fewer HIV positive users in the sample given that the null hypothesis is true. On the histogram this is the area bounded by 0 and 21.
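
The two p-value routes in solution (3) can be checked with a short R sketch. This is only an illustration under stated assumptions: the box is taken from solution (3)(a)(ii); `pop_sd` is a hypothetical stand-in for the `popsd()` call in the question, which is not a base R function; and the exact binomial tail is an added cross-check that the exam does not ask for.

```r
# Null box from solution (3)(a)(ii): one '1' ticket and nine '0' tickets (p = 0.1).
box <- c(1, rep(0, 9))
n_draws <- 250
observed <- 21

# Population SD (divide by N, not N - 1). Hypothetical helper standing in
# for popsd(), which is not part of base R.
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

ev <- n_draws * mean(box)          # expected count under H0
se <- sqrt(n_draws) * pop_sd(box)  # standard error of the count
z  <- (observed - ev) / se         # test statistic

# Matt's approach: lower-tail area under the standard normal curve.
p_normal <- pnorm(z)

# Foggy's approach: proportion of simulated totals at or below 21,
# read from the cumulative table in the question.
p_simulated <- 234106 / 1000000

# Exact binomial lower tail (an added cross-check, not part of the exam).
p_exact <- pbinom(observed, size = n_draws, prob = 0.1)

print(round(c(ev = ev, se = se, z = z), 4))
print(round(c(normal = p_normal, simulated = p_simulated, exact = p_exact), 4))
```

Both approaches give a lower-tail probability well above 0.05, so either way the null hypothesis is retained; the simulated value sits closer to the exact binomial tail because it does not rely on the normal approximation.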