辅导案例-MATH10093
Statistical computing MATH10093 Coursework B 2019/20 Finn Lindgren 20/3/2020 Summary Handout Friday 20/3/2020, handin as pdf via Learn by noon Tuesday 14/4/2020. Discussion of the assignment with others is permitted, but handin must be individual solutions. The work will be marked out of 50, and counts for 50% of the total grade. General notes • Use RMarkdown (or knitr) used to produce the PDF-handin. See the Learn Announcement about working in the free web based RStudio.cloud if you do not have a local rstudio installation. • Ordinary text should be typset as test, and not as R code chunk comments. • Use LaTeX math typsetting with $· · · $ for inline formulas and $$· · · $$ for displayed equations (see lecture 8). • The code in the CWB2020code.R file should be included via source(), and not included in your report. • Write readable code. • Do not hide R code with echo=FALSE. • Do hide unnecessary R output, such as long data listings, with results=’hide’ as RMarkdown code chunk • Avoid unneccessarily repeating identical code, for example when adding to a previous plot, use the pl + new stuff() technique for ggplot(). Suggested RMarkdown startup code chunk: set.seed(12345L) source("CWB2020code.R") suppressPackageStartupMessages(library(tidyverse)) theme_set(theme_bw()) # Read data for part 2 of the assignment: filament <- read.csv("filamentCWB.csv", stringsAsFactors = FALSE) # Note: this is a different data file than in CWA. The functions from CWB2020code.R that you need to call in your own code are either mentioned in the text or part of provided code outlines. Look at the code file for documentation, and to see functions that are used internally and that you do not need to call yourself. 1 Part 1: Archaeology “Anno Domini MCCCLXI feria III post Jacobi ante portas Visby in manibus Danorum ceciderunt Gutenses, hic sepulti, orate pro eis!” “In the year of our Lord 1361, on the third day after St. Jacob, the Goth fell outside the gates of Visby at the hands of the Danish. They are buried here. Pray for them!” In 1361 the Danish king Valdemar Atterdag conquered Gotland1 and cap- tured the rich Hanseatic town of Visby. The conquest was followed by a plunder of Visby. Most of the defenders2 were killed in the attack and are buried in a field, Korsbetningen3, outside of the walls of Visby. In the 1920s the gravesite was subject to several archaeological excavations. A total of 493 femurs4 (256 left, and 237 right) were found. We want to figure out how many persons were likely buried at the gravesite. It must reasonably have been at least 256, but how many more? Statistical model To build a simple model for this problem, we assume that the number of left (y1 = 256) and right (y2 = 237) femurs are two independent observations from a Bin(N,φ) distribution. Here N is the total number of people buried and φ is the probability of finding a femur, left or right, and both N and φ are unknown parameters. The probability function for a single observation y ∼ Bin(N,φ) is p(y|N,φ) = ( N y ) φy(1− φ)N−y. The function arch loglike() in CWB2020code.R evaluates the combined log- likelihood log[p(y|N,φ)] for a collection y of y-observations. If a data.frame with columns N and phi is provided, the log-likelihood for each row-pair (N,φ) is returned. Questions 1. An archaeological researcher tries to obtain 95% confidence intervals for N and φ. They realise that since φ must fall in the finite interval (0, 1), it is 1Strategically located in the middle of the Baltic sea, Gotland had shifting periods of being partly self-governed, and in partial control by the Hanseatic trading alliance, Sweden, Denmark, and the Denmark-Norway-Sweden union, until settling as part of Sweden in 1645. Gotland has an abundance of archaeological treasures, with coins dating back to Viking era trade routes via Russia to the Arab Caliphates. 2Primarily local farmers that could not take shelter inside the city walls. 3Literal translation: the grazing field that is marked by a cross, as shown in the picture. 4thigh bone 2 reasonable to reparameterise to θ = log(φ)− log(1−φ), the so-called logit- transformation of φ. They then attempt to apply Maximum Likelihood theory to N and θ, where the estimates are assumed to be approximately Gaussian, by running the following code: like_est <- arch_likelihood_estimation(c(256, 237)) The code automatically transforms the confidence interval limits for θ back to the φ scale, as φ = 1/(1 + eθ), and they obtain the intervals CIN = (−51, 827), CIφ = (0.073, 0.975). After also introducing a log-transformation for N in the internal calcu- lations,5 the computed confidence interval for N changes to (125, 1199), with only minor changes to the confidence interval for φ. Explain, in words, what is problematic with these frequentistic confidence intervals. What assumptions were violated? 2. You should now do a Bayesian analysis of the problem (see Lecture 8), to make an improvement over the researchers failed frequentistic attempt. Let N have a Geometric(ξ), ξ > 0, prior distribution, and let φ have a Beta(a, b), a, b > 0, prior distribution: pN (n) = P(N = n) = ξ (1− ξ)n, n = 0, 1, 2, 3, . . . , pφ(φ) = φa−1(1− φ)b−1 B(a, b) , φ ∈ [0, 1]. Before the excavation took place, the archaeologist believed that around 1000 individuals were buried, and that they would find around half on the femurs. To encode that belief in the Bayesian analysis, set ξ = 1/(1 + 1000), which corresponds to an expected total count of 1000, and a = b = 2, which makes φ more likely to be close to 1/2 than to 0 or 1. The posterior probability function (PF) for N given y = (y1, y2) is given by p(N |Y )(n) = P(N = n|Y = y) = P(N = n,Y = y) P(Y = y) The numerator can be obtained by integrating the joint distribution for N , φ, and Y , P(N = n,Y = y) = ∫ 1 0 pN (n)pφ(φ)P(Y = y|N = n, φ) dφ. 5This model reparameterisation would only require minor changes to arch param to theta() and arch param from theta(), but this assignment doesn’t in- volve doing that. 3 The denominator can be obtained by summing P(N = n,Y = y) for all n, P(Y = y) = ∞∑ n=max(y1,y2) P(N = n,Y = y). Estimate the posterior probability function (PF) pN |y(n) = P(N = n|y1, y2) by first performing Monte Carlo integration6 over φ to obtain P(N = n,Y = y) for n = max(y1, y2), . . . , 3000. The remaining terms, for n > 3000 are negligible. Use 1000 Monte Carlo samples of φ ∼ pφ(φ) in the integration. You can use the following code template for the calculations, replacing each ###***### with suitable code y <- c(256, 237) N_xi <- 1/(1 + 1000) phi_ab <- list(a = 2, b = 2) N <- max(y):3000 n_mc <- 1000 # Use the prior for phi for sampling: phi <- ###***### P_NY <- numeric(length(N)) for (loop_phi in phi) { # Compute and accumulate the integrand: P_NY <- P_NY + ###***### } df <- data.frame(N = N, P_N_posterior = ###***###) 3. Draw a figure with the prior and posterior probability functions for N . Pay attention to the fact that the prior and posterior distributions for N have different supports. 4. Compute a Bayesian 95% credible interval for N , with equal upper and lower tail probabilities, by finding the appropriate quantiles via the pos- terior cumulative distribution function for N , P(N ≤ n|y1, y2) = n∑ k=0 P(N = k|y1, y2). Compare the result to the frequentistic confidence intervals the archaeol- ogist computed. 6Remark: An alternative, Monte Carlo free approach for this particular model would be to show that the conditional posterior distribution for (φ|N = n,Y = y) is Beta(a+ y1+ y2, b+ 2n− y1 − y2). Then the identity p(n,y) = p(φ,n,y)p(φ|n,y) would provide a direct expression for the joint probability function for N and Y , eliminating the need for numerical integration. 4 Part 2: LOOCV and randomisation tests Recall the 3D printer filament problem from Coursework A. Of the models in- vestigated there, the overall best model had a log-linear model for the variance, that we’ll now call model A: yi ∼ N [ θ1 + θ2xi, e θ3+θ4xi ] , where xi are the CAD Weight values of filamentCWB.csv, and yi are the ob- served Actual Weight values. See the suggested setup code chunk for how to load the data file. The data columns are • Index: An observation index, 1, 2, 3, . . . , n • Date: The date the object was printed. • Material: The material colour • CAD Weight: The CAD-estimated required filament weight (gram) • Actual Weight: The measured actual object weight (gram) The Date and Material variables are not needed for this assignment. We are now interested in a slightly different model, that we’ll call model B: yi ∼ N [ θ1 + θ2xi, e θ3 + eθ4x2i ] . Code for estimating the parameters for each of the two models, and com- puting predictions for new data, is available in CWB2020code.R as the function estimate and predict(). 1. Show that Model B is mathematically equivalent to an error model that can be written yi = α0 + γixi + ei, where γi and ei are random variables with independent Gaussian distributions, N(µγ , σ 2 γ) and N(0, σ 2 e). What are the parameters for the γi and ei distributions that yield Model B? 2. Implement leave-one-out cross validation scoring and compare the predic- tion scores for the two models, for Squared Error (SE), Dawid-Sebastiani (DS), and the Interval score (Interval, use the default 90% intervals). First, fill in suitable code instead of ###***###: pred <- list(A = estimate_and_predict_output_template(nrow(filament)), B = estimate_and_predict_output_template(nrow(filament))) for (ind in filament$Index) { for (model in c("A", "B")) { # Leave out one observation, estimate the model, predict the left-out obs: pred[[model]][ind,] <- ###***### } } 5 The scores SAi = S(F A i , yi) and S B i = S(F B i , yi) can then be computed and stored as follows: scores <- list(A = calc_scores(pred[["A"]], filament[["Actual_Weight"]]), B = calc_scores(pred[["B"]], filament[["Actual_Weight"]])) 3. Let SA−B denote the average of the pairwise score differences SAi − SBi , and let G denote the prediction of the true date generating model. For each type of score, estimate the p-value, PT , for a suitable randomisation test of H0 : EG(SA−B) = 0, H1 : EG(SA−B) > 0. Use the code below, replacing ###***### with suitable7 code. Explain, in words, what assumptions the test is based on. # Compute the test statistics stat <- colMeans(scores$A - scores$B) J <- 10000 rand_stat <- data.frame(SE = numeric(J), DS = numeric(J), Interval = numeric(J)) # Compute the randomised test statistics for (loop in seq_len(J)) { rand_stat[loop, ] <- colMeans(###***###) } # Estimate the P-values for the tests p_value <- colMeans(rand_stat >= rep(stat, each = J)) 4. For each of the three tests, Z(j) = I(T (j) ≥ T ) ∼ Bin(1, PT ), independent across j = 1, . . . , J . By the central limit theorem, the estimate P̂T has, approximately, a N(PT , σ 2 T ) distribution. Show that σ 2 T = PT (1− PT )/J , and compute a standard deviation estimate σ̂T for each test, by replacing PT by P̂T . What is the relative Monte Carlo error (std.dev. divided by the estimate) for each of the three p-value estimates? Also compute an approximate 95% confidence interval for each PT . 7Consult lecture 6 for material on randomisation tests. 6