辅导案例-MATH10093

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Statistical computing MATH10093
Coursework B 2019/20
Finn Lindgren
20/3/2020
Summary
Handout Friday 20/3/2020, handin as pdf via Learn by noon Tuesday
14/4/2020. Discussion of the assignment with others is permitted, but
handin must be individual solutions. The work will be marked out of 50,
and counts for 50% of the total grade.
General notes
• Use RMarkdown (or knitr) used to produce the PDF-handin. See the
Learn Announcement about working in the free web based RStudio.cloud
if you do not have a local rstudio installation.
• Ordinary text should be typset as test, and not as R code chunk comments.
• Use LaTeX math typsetting with $· · · $ for inline formulas and $$· · · $$
for displayed equations (see lecture 8).
• The code in the CWB2020code.R file should be included via source(), and
not included in your report.
• Write readable code.
• Do not hide R code with echo=FALSE.
• Do hide unnecessary R output, such as long data listings, with results=’hide’
as RMarkdown code chunk
• Avoid unneccessarily repeating identical code, for example when adding
to a previous plot, use the pl + new stuff() technique for ggplot().
Suggested RMarkdown startup code chunk:
set.seed(12345L)
source("CWB2020code.R")
suppressPackageStartupMessages(library(tidyverse))
theme_set(theme_bw())
# Read data for part 2 of the assignment:
filament <- read.csv("filamentCWB.csv", stringsAsFactors = FALSE)
# Note: this is a different data file than in CWA.
The functions from CWB2020code.R that you need to call in your own code
are either mentioned in the text or part of provided code outlines. Look at the
code file for documentation, and to see functions that are used internally and
that you do not need to call yourself.
1
Part 1: Archaeology
“Anno Domini MCCCLXI feria III post Jacobi ante
portas Visby in manibus Danorum ceciderunt Gutenses,
hic sepulti, orate pro eis!”
“In the year of our Lord 1361, on the third day
after St. Jacob, the Goth fell outside the gates of Visby
at the hands of the Danish. They are buried here. Pray
for them!”
In 1361 the Danish king Valdemar Atterdag conquered Gotland1 and cap-
tured the rich Hanseatic town of Visby. The conquest was followed by a plunder
of Visby. Most of the defenders2 were killed in the attack and are buried in a
field, Korsbetningen3, outside of the walls of Visby.
In the 1920s the gravesite was subject to several archaeological excavations.
A total of 493 femurs4 (256 left, and 237 right) were found. We want to figure
out how many persons were likely buried at the gravesite. It must reasonably
have been at least 256, but how many more?
Statistical model
To build a simple model for this problem, we assume that the number of left
(y1 = 256) and right (y2 = 237) femurs are two independent observations from
a Bin(N,φ) distribution. Here N is the total number of people buried and φ is
the probability of finding a femur, left or right, and both N and φ are unknown
parameters.
The probability function for a single observation y ∼ Bin(N,φ) is
p(y|N,φ) =
(
N
y
)
φy(1− φ)N−y.
The function arch loglike() in CWB2020code.R evaluates the combined log-
likelihood log[p(y|N,φ)] for a collection y of y-observations. If a data.frame
with columns N and phi is provided, the log-likelihood for each row-pair (N,φ)
is returned.
Questions
1. An archaeological researcher tries to obtain 95% confidence intervals for N
and φ. They realise that since φ must fall in the finite interval (0, 1), it is
1Strategically located in the middle of the Baltic sea, Gotland had shifting periods of
being partly self-governed, and in partial control by the Hanseatic trading alliance, Sweden,
Denmark, and the Denmark-Norway-Sweden union, until settling as part of Sweden in 1645.
Gotland has an abundance of archaeological treasures, with coins dating back to Viking era
trade routes via Russia to the Arab Caliphates.
2Primarily local farmers that could not take shelter inside the city walls.
3Literal translation: the grazing field that is marked by a cross, as shown in the picture.
4thigh bone
2
reasonable to reparameterise to θ = log(φ)− log(1−φ), the so-called logit-
transformation of φ. They then attempt to apply Maximum Likelihood
theory to N and θ, where the estimates are assumed to be approximately
Gaussian, by running the following code:
like_est <- arch_likelihood_estimation(c(256, 237))
The code automatically transforms the confidence interval limits for θ back
to the φ scale, as φ = 1/(1 + eθ), and they obtain the intervals
CIN = (−51, 827),
CIφ = (0.073, 0.975).
After also introducing a log-transformation for N in the internal calcu-
lations,5 the computed confidence interval for N changes to (125, 1199),
with only minor changes to the confidence interval for φ.
Explain, in words, what is problematic with these frequentistic confidence
intervals. What assumptions were violated?
2. You should now do a Bayesian analysis of the problem (see Lecture 8), to
make an improvement over the researchers failed frequentistic attempt.
Let N have a Geometric(ξ), ξ > 0, prior distribution, and let φ have a
Beta(a, b), a, b > 0, prior distribution:
pN (n) = P(N = n) = ξ (1− ξ)n, n = 0, 1, 2, 3, . . . ,
pφ(φ) =
φa−1(1− φ)b−1
B(a, b)
, φ ∈ [0, 1].
Before the excavation took place, the archaeologist believed that around
1000 individuals were buried, and that they would find around half on the
femurs. To encode that belief in the Bayesian analysis, set ξ = 1/(1 +
1000), which corresponds to an expected total count of 1000, and a = b =
2, which makes φ more likely to be close to 1/2 than to 0 or 1.
The posterior probability function (PF) for N given y = (y1, y2) is given
by
p(N |Y )(n) = P(N = n|Y = y) = P(N = n,Y = y)
P(Y = y)
The numerator can be obtained by integrating the joint distribution for
N , φ, and Y ,
P(N = n,Y = y) =
∫ 1
0
pN (n)pφ(φ)P(Y = y|N = n, φ) dφ.
5This model reparameterisation would only require minor changes to
arch param to theta() and arch param from theta(), but this assignment doesn’t in-
volve doing that.
3
The denominator can be obtained by summing P(N = n,Y = y) for all
n,
P(Y = y) =
∞∑
n=max(y1,y2)
P(N = n,Y = y).
Estimate the posterior probability function (PF) pN |y(n) = P(N = n|y1, y2)
by first performing Monte Carlo integration6 over φ to obtain P(N =
n,Y = y) for n = max(y1, y2), . . . , 3000. The remaining terms, for
n > 3000 are negligible. Use 1000 Monte Carlo samples of φ ∼ pφ(φ)
in the integration.
You can use the following code template for the calculations, replacing
each ###***### with suitable code
y <- c(256, 237)
N_xi <- 1/(1 + 1000)
phi_ab <- list(a = 2, b = 2)
N <- max(y):3000
n_mc <- 1000
# Use the prior for phi for sampling:
phi <- ###***###
P_NY <- numeric(length(N))
for (loop_phi in phi) {
# Compute and accumulate the integrand:
P_NY <- P_NY + ###***###
}
df <- data.frame(N = N,
P_N_posterior = ###***###)
3. Draw a figure with the prior and posterior probability functions for N .
Pay attention to the fact that the prior and posterior distributions for N
have different supports.
4. Compute a Bayesian 95% credible interval for N , with equal upper and
lower tail probabilities, by finding the appropriate quantiles via the pos-
terior cumulative distribution function for N ,
P(N ≤ n|y1, y2) =
n∑
k=0
P(N = k|y1, y2).
Compare the result to the frequentistic confidence intervals the archaeol-
ogist computed.
6Remark: An alternative, Monte Carlo free approach for this particular model would be to
show that the conditional posterior distribution for (φ|N = n,Y = y) is Beta(a+ y1+ y2, b+
2n− y1 − y2). Then the identity p(n,y) = p(φ,n,y)p(φ|n,y) would provide a direct expression for the
joint probability function for N and Y , eliminating the need for numerical integration.
4
Part 2: LOOCV and randomisation tests
Recall the 3D printer filament problem from Coursework A. Of the models in-
vestigated there, the overall best model had a log-linear model for the variance,
that we’ll now call model A:
yi ∼ N
[
θ1 + θ2xi, e
θ3+θ4xi
]
,
where xi are the CAD Weight values of filamentCWB.csv, and yi are the ob-
served Actual Weight values. See the suggested setup code chunk for how to
load the data file. The data columns are
• Index: An observation index, 1, 2, 3, . . . , n
• Date: The date the object was printed.
• Material: The material colour
• CAD Weight: The CAD-estimated required filament weight (gram)
• Actual Weight: The measured actual object weight (gram)
The Date and Material variables are not needed for this assignment.
We are now interested in a slightly different model, that we’ll call model B:
yi ∼ N
[
θ1 + θ2xi, e
θ3 + eθ4x2i
]
.
Code for estimating the parameters for each of the two models, and com-
puting predictions for new data, is available in CWB2020code.R as the function
estimate and predict().
1. Show that Model B is mathematically equivalent to an error model that
can be written yi = α0 + γixi + ei, where γi and ei are random variables
with independent Gaussian distributions, N(µγ , σ
2
γ) and N(0, σ
2
e).
What are the parameters for the γi and ei distributions that yield Model
B?
2. Implement leave-one-out cross validation scoring and compare the predic-
tion scores for the two models, for Squared Error (SE), Dawid-Sebastiani
(DS), and the Interval score (Interval, use the default 90% intervals).
First, fill in suitable code instead of ###***###:
pred <- list(A = estimate_and_predict_output_template(nrow(filament)),
B = estimate_and_predict_output_template(nrow(filament)))
for (ind in filament$Index) {
for (model in c("A", "B")) {
# Leave out one observation, estimate the model, predict the left-out obs:
pred[[model]][ind,] <- ###***###
}
}
5
The scores SAi = S(F
A
i , yi) and S
B
i = S(F
B
i , yi) can then be computed
and stored as follows:
scores <- list(A = calc_scores(pred[["A"]], filament[["Actual_Weight"]]),
B = calc_scores(pred[["B"]], filament[["Actual_Weight"]]))
3. Let SA−B denote the average of the pairwise score differences SAi − SBi ,
and let G denote the prediction of the true date generating model. For
each type of score, estimate the p-value, PT , for a suitable randomisation
test of
H0 : EG(SA−B) = 0,
H1 : EG(SA−B) > 0.
Use the code below, replacing ###***### with suitable7 code. Explain, in
words, what assumptions the test is based on.
# Compute the test statistics
stat <- colMeans(scores$A - scores$B)
J <- 10000
rand_stat <- data.frame(SE = numeric(J),
DS = numeric(J),
Interval = numeric(J))
# Compute the randomised test statistics
for (loop in seq_len(J)) {
rand_stat[loop, ] <- colMeans(###***###)
}
# Estimate the P-values for the tests
p_value <- colMeans(rand_stat >= rep(stat, each = J))
4. For each of the three tests, Z(j) = I(T (j) ≥ T ) ∼ Bin(1, PT ), independent
across j = 1, . . . , J . By the central limit theorem, the estimate P̂T has,
approximately, a N(PT , σ
2
T ) distribution. Show that σ
2
T = PT (1− PT )/J ,
and compute a standard deviation estimate σ̂T for each test, by replacing
PT by P̂T . What is the relative Monte Carlo error (std.dev. divided by
the estimate) for each of the three p-value estimates? Also compute an
approximate 95% confidence interval for each PT .
7Consult lecture 6 for material on randomisation tests.
6