EMAT30007 Applied Statistics 2021/22

Coursework

Due: 27th April 2022, 1:00 pm (week 23)

Submit on BLACKBOARD

General information:

Attempt to answer all questions. This coursework has three parts and four questions. The questions are released successively as the material is covered in the course.

Submit a Matlab script (.m file) with your answers. The script should run when copied into the same folder as the data files (see below) and should use only commands and Matlab packages used in the worksheets. Clearly annotate your code and include the required discussion of your findings directly in the script.

The limit for your submission is specified for each question and given in lines of standard Matlab script with at most 100 characters per line, in addition to a restriction on the number of figures/plots for each question (see below).

There are additional files available on Blackboard for this piece of coursework that are specified for each question separately. The contents of the files are described in more detail below.

The only way I will answer questions about the coursework is via the dedicated coursework discussion forum on Blackboard. This is to ensure that the entire class has access to the same information.

There are a total of 60 marks for the coursework, 20 marks for each part of the coursework.

PART I – (≤300 lines of code)

Question 1 (20 marks):

A computer simulation model is an algorithm that allows us to generate data for a real-world process on a computer. Consider the following computer simulation model for a network of 100 Geiger counters:

Counters are independent of each other. Each counter reports the number of radioactive particles detected in a fixed time interval, t=10 seconds. Radioactive particles can be assumed to be detected independently of each other at a constant rate of r=0.4 per second (1/second) at all counter locations. The goal of the simulation model is to generate the cumulative count of detected particles for 10-second time intervals, that is to say the sum of particle detections across all counters.
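As an illustration only, the process described above can be sketched from first principles in Matlab. The sketch assumes that a constant detection rate corresponds to exponentially distributed waiting times between detections; this is one possible reading of the description, not a model answer, and all variable names are hypothetical. It generates a single cumulative count for one interval.

```matlab
% Hypothetical sketch: one 10-second interval for the whole network,
% built directly from the stated assumptions (names are illustrative only).
rng(1);                 % fix the seed so the sketch is reproducible
nCounters = 100;        % counters in the network
t = 10;                 % length of the counting interval in seconds
r = 0.4;                % detection rate per second at every counter

counts = zeros(nCounters, 1);
for k = 1:nCounters
    % ASSUMPTION: waiting times between detections are exponential with rate r
    elapsed = -log(rand) / r;              % time of the first detection
    while elapsed <= t
        counts(k) = counts(k) + 1;         % detection falls inside the interval
        elapsed = elapsed - log(rand) / r; % add the next waiting time
    end
end
cumulativeCount = sum(counts);             % network total for this interval
fprintf('Cumulative count: %d (mean is r*t per counter, %g in total)\n', ...
        cumulativeCount, nCounters * r * t);
```

Repeating the loop gives further samples of the cumulative count; the event-by-event construction is slow but makes every assumption explicit.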

Answer the following questions:

(a) Simulate 1,000 samples of the cumulative count across the network, choosing and justifying the correct probability distribution for the number of particles detected by each Geiger counter. Plot the resulting empirical distribution of cumulative counts and the corresponding empirical cumulative distribution function, CDF (1 figure, multiple panels are allowed).

(b) Implement two alternative appropriate approximate simulation models for the cumulative count across the network and discuss how accurate they are (the assumed probability distribution for each counter cannot be changed). Show, by plotting the resulting empirical CDFs in one figure, that the simulations approximately match your answer from (a).

(c) Using the simulated data from (a), test the hypothesis that the cumulative count across the network is larger than 400 detected particles in a 10-second interval at a significance level of 5%. Discuss the use of hypothesis tests on computer-simulated data.

(d) How would your answer to (b) change if the rate r were not the same for all counters? (In words only, no code needed.)

(e) Consider the situation where 50 counters detect at a constant rate of r=0.4 per second, and 50 detect at a constant rate of r=1 per second. Simulate 100 samples of counts from the entire network (not cumulative counts) and plot the resulting empirical distribution and the corresponding empirical CDF (1 figure, multiple panels are allowed).

(f) Fit a bimodal probability distribution, based on two Normal distributions, to the data you simulated in (e) from first principles (that is to say using fminsearch in Matlab). Discuss the appropriateness of the fitted distribution.

PART II (20 marks) – (≤300 lines of code)

Question 2 (10 marks):

The file burning_experiment.txt contains a data set from controlled burning experiments on 25 commercially available scented candle brands. In the experiments, the amount of a poisonous gas (in milligrams, mg) emitted by each candle brand was recorded alongside the amounts (in mg) of two ingredients it contains.

Use a statistical analysis of this data to discuss how this experiment can be used to investigate the effect of each ingredient on poisonous gas emission and to make predictions about gas emission. Make suggestions for how to improve the experiment.

In your discussion, show evidence for the statements you make (1 figure, multiple panels are allowed).

This data is hypothetical.

Question 3 (10 marks):

Discuss to what extent it is possible to build a Linear Model with fit lines that look approximately like a smiley, such as the one shown in the figure below (Tip: lecture 7). You are allowed to define additional variables/predictors on top of the one shown in the figure below. You must assume that this predictor is drawn from a uniform distribution between -1 and 1. For your closest attempt, state the model explicitly, simulate data for it, and show by fitting your suggested model that a Linear Model is appropriate.

(2 figures, multiple panels are allowed)

PART III – (≤150 lines of code)

Question 4 (20 marks):

The following story is fictitious.

The Roadways Agency monitors the number of road accidents on all stretches of principal roads in the country. To understand and reduce the number of accidents several interventions are undertaken, and available data is analysed.

(i) Each year, speed cameras are installed on the twenty road segments that had the most accidents in the previous year and had no speed camera installed already. Welch’s t-tests are used to compare the observed reduction of accident counts before and after speed camera installation on each set of twenty road segments. From a p-value of roughly 0.001 every year, it is deduced that speed cameras are a highly effective tool for substantially reducing accident numbers.

(ii) To determine what causes high accident numbers on road segments, statistical modelling is used. For a random selection of 1,000 road segments, additional data is recorded which is used to generate a total of 100 quantitative and qualitative predictors for the response which is set to be the accident count. Checking for correlations between predictors results in the removal of 10 predictors. Subsequently, an automated model selection procedure is used to determine the most relevant predictors. This procedure successively adds randomly chosen predictors to a Linear Model for the data and retains them if the parameter-specific test corresponding to the added predictor returns a p-value lower than 0.05. First all predictors are considered on their own and subsequently, interaction terms between included predictors are also considered in the same way. The final model has 85 parameters. From a very low p-value for the F-test on the entire model (p = 4.35 × 10^-8), it is concluded that the model fit to the data is excellent.
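To make the selection procedure in (ii) concrete, here is a minimal Matlab sketch of its main-effects stage only. It assumes that fitlm from the Statistics and Machine Learning Toolbox is available. The story provides no data, so the predictors and response below are simulated noise, and all variable names are hypothetical. The interaction stage is omitted.

```matlab
% Hypothetical sketch of the main-effects stage of the procedure in (ii).
rng(2);                         % fixed seed for reproducibility
n = 1000;                       % road segments, as in the story
p = 90;                         % predictors left after removing 10 of 100
X = randn(n, p);                % ASSUMPTION: simulated predictors (pure noise)
y = randn(n, 1);                % ASSUMPTION: simulated response, unrelated to X

included = [];                  % indices of the retained predictors
for j = randperm(p)             % visit the predictors in random order
    mdl = fitlm(X(:, [included j]), y);     % refit with candidate j added last
    pNew = mdl.Coefficients.pValue(end);    % t-test p-value of the candidate
    if pNew < 0.05
        included = [included j];            % retain the candidate
    end
end
fprintf('Retained %d of %d predictors\n', numel(included), p);
```

Because the simulated response is unrelated to every predictor, the number retained here reflects the selection rule alone, not any real effect.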

Write a statistical critique of the two data collection and analysis methods described in the story above, labelled (i)-(ii). Provide your critique in bullet points (not continuous text) and consider any improvements.

No coding or figures are required for this question but your answer should be included in the Matlab script, as described above.

end of coursework