Statistical computing MATH10093

Coursework B 2019/20

Finn Lindgren

20/3/2020

Summary

Handout Friday 20/3/2020, handin as pdf via Learn by noon Tuesday

14/4/2020. Discussion of the assignment with others is permitted, but

handin must be individual solutions. The work will be marked out of 50,

and counts for 50% of the total grade.

General notes

• Use RMarkdown (or knitr) used to produce the PDF-handin. See the

Learn Announcement about working in the free web based RStudio.cloud

if you do not have a local rstudio installation.

• Ordinary text should be typset as test, and not as R code chunk comments.

• Use LaTeX math typsetting with $· · · $ for inline formulas and $$· · · $$

for displayed equations (see lecture 8).

• The code in the CWB2020code.R file should be included via source(), and

not included in your report.

• Write readable code.

• Do not hide R code with echo=FALSE.

• Do hide unnecessary R output, such as long data listings, with results=’hide’

as RMarkdown code chunk

• Avoid unneccessarily repeating identical code, for example when adding

to a previous plot, use the pl + new stuff() technique for ggplot().

Suggested RMarkdown startup code chunk:

set.seed(12345L)

source("CWB2020code.R")

suppressPackageStartupMessages(library(tidyverse))

theme_set(theme_bw())

# Read data for part 2 of the assignment:

filament <- read.csv("filamentCWB.csv", stringsAsFactors = FALSE)

# Note: this is a different data file than in CWA.

The functions from CWB2020code.R that you need to call in your own code

are either mentioned in the text or part of provided code outlines. Look at the

code file for documentation, and to see functions that are used internally and

that you do not need to call yourself.

1

Part 1: Archaeology

“Anno Domini MCCCLXI feria III post Jacobi ante

portas Visby in manibus Danorum ceciderunt Gutenses,

hic sepulti, orate pro eis!”

“In the year of our Lord 1361, on the third day

after St. Jacob, the Goth fell outside the gates of Visby

at the hands of the Danish. They are buried here. Pray

for them!”

In 1361 the Danish king Valdemar Atterdag conquered Gotland1 and cap-

tured the rich Hanseatic town of Visby. The conquest was followed by a plunder

of Visby. Most of the defenders2 were killed in the attack and are buried in a

field, Korsbetningen3, outside of the walls of Visby.

In the 1920s the gravesite was subject to several archaeological excavations.

A total of 493 femurs4 (256 left, and 237 right) were found. We want to figure

out how many persons were likely buried at the gravesite. It must reasonably

have been at least 256, but how many more?

Statistical model

To build a simple model for this problem, we assume that the number of left

(y1 = 256) and right (y2 = 237) femurs are two independent observations from

a Bin(N,φ) distribution. Here N is the total number of people buried and φ is

the probability of finding a femur, left or right, and both N and φ are unknown

parameters.

The probability function for a single observation y ∼ Bin(N,φ) is

p(y|N,φ) =

(

N

y

)

φy(1− φ)N−y.

The function arch loglike() in CWB2020code.R evaluates the combined log-

likelihood log[p(y|N,φ)] for a collection y of y-observations. If a data.frame

with columns N and phi is provided, the log-likelihood for each row-pair (N,φ)

is returned.

Questions

1. An archaeological researcher tries to obtain 95% confidence intervals for N

and φ. They realise that since φ must fall in the finite interval (0, 1), it is

1Strategically located in the middle of the Baltic sea, Gotland had shifting periods of

being partly self-governed, and in partial control by the Hanseatic trading alliance, Sweden,

Denmark, and the Denmark-Norway-Sweden union, until settling as part of Sweden in 1645.

Gotland has an abundance of archaeological treasures, with coins dating back to Viking era

trade routes via Russia to the Arab Caliphates.

2Primarily local farmers that could not take shelter inside the city walls.

3Literal translation: the grazing field that is marked by a cross, as shown in the picture.

4thigh bone

2

reasonable to reparameterise to θ = log(φ)− log(1−φ), the so-called logit-

transformation of φ. They then attempt to apply Maximum Likelihood

theory to N and θ, where the estimates are assumed to be approximately

Gaussian, by running the following code:

like_est <- arch_likelihood_estimation(c(256, 237))

The code automatically transforms the confidence interval limits for θ back

to the φ scale, as φ = 1/(1 + eθ), and they obtain the intervals

CIN = (−51, 827),

CIφ = (0.073, 0.975).

After also introducing a log-transformation for N in the internal calcu-

lations,5 the computed confidence interval for N changes to (125, 1199),

with only minor changes to the confidence interval for φ.

Explain, in words, what is problematic with these frequentistic confidence

intervals. What assumptions were violated?

2. You should now do a Bayesian analysis of the problem (see Lecture 8), to

make an improvement over the researchers failed frequentistic attempt.

Let N have a Geometric(ξ), ξ > 0, prior distribution, and let φ have a

Beta(a, b), a, b > 0, prior distribution:

pN (n) = P(N = n) = ξ (1− ξ)n, n = 0, 1, 2, 3, . . . ,

pφ(φ) =

φa−1(1− φ)b−1

B(a, b)

, φ ∈ [0, 1].

Before the excavation took place, the archaeologist believed that around

1000 individuals were buried, and that they would find around half on the

femurs. To encode that belief in the Bayesian analysis, set ξ = 1/(1 +

1000), which corresponds to an expected total count of 1000, and a = b =

2, which makes φ more likely to be close to 1/2 than to 0 or 1.

The posterior probability function (PF) for N given y = (y1, y2) is given

by

p(N |Y )(n) = P(N = n|Y = y) = P(N = n,Y = y)

P(Y = y)

The numerator can be obtained by integrating the joint distribution for

N , φ, and Y ,

P(N = n,Y = y) =

∫ 1

0

pN (n)pφ(φ)P(Y = y|N = n, φ) dφ.

5This model reparameterisation would only require minor changes to

arch param to theta() and arch param from theta(), but this assignment doesn’t in-

volve doing that.

3

The denominator can be obtained by summing P(N = n,Y = y) for all

n,

P(Y = y) =

∞∑

n=max(y1,y2)

P(N = n,Y = y).

Estimate the posterior probability function (PF) pN |y(n) = P(N = n|y1, y2)

by first performing Monte Carlo integration6 over φ to obtain P(N =

n,Y = y) for n = max(y1, y2), . . . , 3000. The remaining terms, for

n > 3000 are negligible. Use 1000 Monte Carlo samples of φ ∼ pφ(φ)

in the integration.

You can use the following code template for the calculations, replacing

each ###***### with suitable code

y <- c(256, 237)

N_xi <- 1/(1 + 1000)

phi_ab <- list(a = 2, b = 2)

N <- max(y):3000

n_mc <- 1000

# Use the prior for phi for sampling:

phi <- ###***###

P_NY <- numeric(length(N))

for (loop_phi in phi) {

# Compute and accumulate the integrand:

P_NY <- P_NY + ###***###

}

df <- data.frame(N = N,

P_N_posterior = ###***###)

3. Draw a figure with the prior and posterior probability functions for N .

Pay attention to the fact that the prior and posterior distributions for N

have different supports.

4. Compute a Bayesian 95% credible interval for N , with equal upper and

lower tail probabilities, by finding the appropriate quantiles via the pos-

terior cumulative distribution function for N ,

P(N ≤ n|y1, y2) =

n∑

k=0

P(N = k|y1, y2).

Compare the result to the frequentistic confidence intervals the archaeol-

ogist computed.

6Remark: An alternative, Monte Carlo free approach for this particular model would be to

show that the conditional posterior distribution for (φ|N = n,Y = y) is Beta(a+ y1+ y2, b+

2n− y1 − y2). Then the identity p(n,y) = p(φ,n,y)p(φ|n,y) would provide a direct expression for the

joint probability function for N and Y , eliminating the need for numerical integration.

4

Part 2: LOOCV and randomisation tests

Recall the 3D printer filament problem from Coursework A. Of the models in-

vestigated there, the overall best model had a log-linear model for the variance,

that we’ll now call model A:

yi ∼ N

[

θ1 + θ2xi, e

θ3+θ4xi

]

,

where xi are the CAD Weight values of filamentCWB.csv, and yi are the ob-

served Actual Weight values. See the suggested setup code chunk for how to

load the data file. The data columns are

• Index: An observation index, 1, 2, 3, . . . , n

• Date: The date the object was printed.

• Material: The material colour

• CAD Weight: The CAD-estimated required filament weight (gram)

• Actual Weight: The measured actual object weight (gram)

The Date and Material variables are not needed for this assignment.

We are now interested in a slightly different model, that we’ll call model B:

yi ∼ N

[

θ1 + θ2xi, e

θ3 + eθ4x2i

]

.

Code for estimating the parameters for each of the two models, and com-

puting predictions for new data, is available in CWB2020code.R as the function

estimate and predict().

1. Show that Model B is mathematically equivalent to an error model that

can be written yi = α0 + γixi + ei, where γi and ei are random variables

with independent Gaussian distributions, N(µγ , σ

2

γ) and N(0, σ

2

e).

What are the parameters for the γi and ei distributions that yield Model

B?

2. Implement leave-one-out cross validation scoring and compare the predic-

tion scores for the two models, for Squared Error (SE), Dawid-Sebastiani

(DS), and the Interval score (Interval, use the default 90% intervals).

First, fill in suitable code instead of ###***###:

pred <- list(A = estimate_and_predict_output_template(nrow(filament)),

B = estimate_and_predict_output_template(nrow(filament)))

for (ind in filament$Index) {

for (model in c("A", "B")) {

# Leave out one observation, estimate the model, predict the left-out obs:

pred[[model]][ind,] <- ###***###

}

}

5

The scores SAi = S(F

A

i , yi) and S

B

i = S(F

B

i , yi) can then be computed

and stored as follows:

scores <- list(A = calc_scores(pred[["A"]], filament[["Actual_Weight"]]),

B = calc_scores(pred[["B"]], filament[["Actual_Weight"]]))

3. Let SA−B denote the average of the pairwise score differences SAi − SBi ,

and let G denote the prediction of the true date generating model. For

each type of score, estimate the p-value, PT , for a suitable randomisation

test of

H0 : EG(SA−B) = 0,

H1 : EG(SA−B) > 0.

Use the code below, replacing ###***### with suitable7 code. Explain, in

words, what assumptions the test is based on.

# Compute the test statistics

stat <- colMeans(scores$A - scores$B)

J <- 10000

rand_stat <- data.frame(SE = numeric(J),

DS = numeric(J),

Interval = numeric(J))

# Compute the randomised test statistics

for (loop in seq_len(J)) {

rand_stat[loop, ] <- colMeans(###***###)

}

# Estimate the P-values for the tests

p_value <- colMeans(rand_stat >= rep(stat, each = J))

4. For each of the three tests, Z(j) = I(T (j) ≥ T ) ∼ Bin(1, PT ), independent

across j = 1, . . . , J . By the central limit theorem, the estimate P̂T has,

approximately, a N(PT , σ

2

T ) distribution. Show that σ

2

T = PT (1− PT )/J ,

and compute a standard deviation estimate σ̂T for each test, by replacing

PT by P̂T . What is the relative Monte Carlo error (std.dev. divided by

the estimate) for each of the three p-value estimates? Also compute an

approximate 95% confidence interval for each PT .

7Consult lecture 6 for material on randomisation tests.

6