TRINITY COLLEGE DUBLIN
School of Computer Science and Statistics
Final Assignment 2020-21
STU33009: Statistical Methods for Computer Science

Submitting Your Report
• Reports must be typed (no handwritten answers please) and submitted on Blackboard.
• As a guideline, reports should be about 5 pages in length including all plots (please don’t go a lot over this).
• You will need to use matlab to calculate values from the assignment dataset, or alternatively write a short program in python to do this. In either case give the code used as an appendix to the report (it doesn’t count towards the page limit), but please keep the code short.
• In order to obtain full credit it is essential that you explain/justify how you obtained your results and, where appropriate, that you critically reflect upon them. Simply giving raw numbers as answers will receive few marks as will saying “see code for details” and the like, even if the code contains explanatory comments.
• It is mandatory to complete the declaration that the work is entirely your own and you have not collaborated with anyone - the declaration form is available on Blackboard.

Downloading Dataset
• Download the assignment dataset from https://www.scss.tcd.ie/doug.leith/ST3009/final2021.php. Important: You must fetch your own copy of the dataset, do not use the dataset downloaded by someone else.
• The data file consists of three columns of COVID testing data. The first column is the number of people tested for COVID, the second column the number testing positive and the third column is the number of people presenting with significant symptoms. Each row corresponds to one week and the numbers reported are cumulative (so the value in first column of row i is the total number of people tested up to and including week i). Please cut and paste the first line of the data file (which begins with a #) into your report as it identifies your dataset.

Assignment

1.
In the first part of the assignment you’ll work with just the last row of data in the file you downloaded. That has three values: the number N of people tested for COVID, the number P testing positive and the number S with significant symptoms.
   (a) A key concern with COVID is that people may be infected but show no significant symptoms. Assuming that the people tested are drawn uniformly at random from the population, use your data to estimate the fraction of the population expected to (i) test positive for COVID but have no significant symptoms and (ii) test positive and have significant symptoms. Explain/discuss your calculations. [5 marks]
   (b) Estimate a confidence interval for each of your two estimates in part (a). Explain/discuss your calculation. [5 marks]
   (c) Is it important to assume that the people tested are drawn uniformly at random? How might it affect your estimates if this isn’t the case? [5 marks]
   (d) For people without significant symptoms the COVID test used has a false positive rate of 0.01 and a false negative rate of 0.1. That is, if you don’t have COVID there is a 0.01 probability that the test will incorrectly give a positive result, while if you do have COVID (but have no symptoms) there’s a 0.1 probability that the test will incorrectly give a negative result. Use this information, combined with your estimate from part (a)(i), to estimate the fraction of the population that have COVID (rather than just testing positive) but have no significant symptoms. Hint: Use marginalisation. [5 marks]
   (e) Estimate a confidence interval for your estimate in part (d). Explain/discuss your calculation. [5 marks]
   (f) Given that you test positive for COVID but have no significant symptoms, estimate the probability that you actually have the disease. Hint: Use Bayes Rule. [5 marks]
   (g) Using matlab (or python) write a short stochastic simulation of the setup you have just analysed.
       Namely, for each person in the population there is a probability of catching COVID but showing no symptoms, and a probability of catching COVID and showing significant symptoms. If a person shows symptoms then when tested this will come up positive but if they have no symptoms there is a high probability that they will test positive but also a small probability that they test negative. Explain/discuss your code. [10 marks]
   (h) Using this simulation, estimate the probability that a person (i) tests positive for COVID but has no significant symptoms and (ii) tests positive and has significant symptoms. Compare with your estimates in part (a) and discuss. [10 marks]

2. In this part of the assignment you’ll use the full dataset that you downloaded. Let $x_k$ be the number of infected people during week $k$. Assuming growth is exponential then $x_k = e^{ak} x_0$ where $a$ is a growth parameter and $x_0$ is the initial number of people infected. When $a < 0$ then the infection decays over time, when $a > 0$ then it grows. Taking logs, $\log x_k = ak + \log x_0$ and so if $a$ is a constant then we expect a plot of $\log x_k$ to be a straight line with slope $a$.
   (a) Plot the number of people testing positive vs time and also plot the logarithm of the number of people testing positive vs time. Discuss and, if appropriate, roughly estimate growth parameter $a$. [5 marks]
   (b) Write a short piece of matlab (or python) code that trains a linear regression model using gradient descent. You should implement this from scratch (so you’ll need to calculate the cost function and its gradient, update these etc). Do not use any built in functions/libraries for linear regression. [10 marks]
   (c) Using your code from (b) train a linear regression model on the $\log x_k$ data and so estimate growth parameter $a$ and the level of initial infections $\log x_0$.
      (i) How did you choose the gradient descent step size? Justify your choice (and present data to back it up).
          [5 marks]
      (ii) A linear regression model makes some statistical assumptions regarding the data. What are these assumptions? [5 marks]
      (iii) Discuss whether the infection data is likely to satisfy or violate any of these assumptions. [5 marks]
   (d) Explain how to use bootstrapping to estimate confidence intervals for the linear regression estimates of $a$ and $\log x_0$. [5 marks]
   (e) Now write a short piece of code to implement bootstrapping, and report the confidence intervals that you obtain. Discuss. [10 marks]
   (f) By plugging in a range of values for $a$ that lie within the confidence interval into the formula $x_k = e^{ak} x_0$ estimate a confidence interval for $x_k$ when $k = 10$ weeks. In this formula just use the value of $\log x_0$ (recall $x_0 = \exp(\log x_0)$) that you estimated in (c), there’s no need to consider a range of $x_0$ values. Explain/discuss. [5 marks]
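
As a quick check on the exponential growth model used in Question 2, the short Python sketch below generates a synthetic series and confirms that its logarithm is a straight line in the week index, with slope equal to the growth parameter and intercept equal to the log of the initial value. The values of `a` and `x0` are made up purely for illustration and are not taken from any dataset. The sketch does not fit a model, so it is not a substitute for the gradient descent code asked for in 2(b).

```python
import numpy as np

# Hypothetical values chosen only for illustration; not from the assignment data.
a = 0.3                    # assumed growth parameter
x0 = 50.0                  # assumed initial number of infected people
k = np.arange(0, 11)       # week indices 0..10

x = x0 * np.exp(a * k)     # exponential growth: x_k = e^{a k} x_0
log_x = np.log(x)          # taking logs: log x_k = a k + log x_0

# For a straight line, successive differences of log x_k are constant (= a)
# and the value at k = 0 is the intercept log x_0.
slopes = np.diff(log_x)
intercept = log_x[0]

print("week-to-week change in log x_k:", slopes)
print("intercept:", intercept, "log x0:", np.log(x0))
```

With real data the week-to-week differences will not be exactly constant, which is one reason the question asks whether a rough estimate of the slope is appropriate.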