FIT3154 Extra Notes Note on Bayesian Prediction Dr. Daniel F. Schmidt∗ August 22, 2022 1 Prediction Let p(y |θ) be a probability distribution (model) with parameters θ. It is usual that, once we have fitted a model to data, we use it to make predictions about the population it is representing. For example, imagine we are interested in modelling the average heights (why is every example always heights in this unit?) of the Australian population. We might use a normal model y ∼ N(µ, σ2), and from a sample of heights y = (y1, . . . , yn) we estimate the unknown parameters by maximum likelihood. Call the ML estimates µˆ and σˆ2. Then we might be asked: what is the probability of a random person from our population having a height greater than 1.7m, i.e., what is P(Y > 1.7)? One way to do this is take our estimates, plug them into the probability model and then compute the probability assuming this probability model is a representative of our population, i.e., estimate P(Y > 1.7) using ∫ ∞ 1.7 p(y | µˆ, σˆ2). This is called a “plug-in” estimator, because we take the estimated parameter values and plug them into the probability model, and treat them as if they were the unknown population parameters. The problem with this approach is that, of course, the estimates µˆ and σˆ2 are not the exact population values; instead, they are values we have estimated (guessed) using the data, and are subject to uncer- tainty. Ignoring this uncertainty can lead to poorer predictions, and does not allow us to characterise how uncertain we are in our prediction itself. The Bayesian approach allows us to incorporate this uncertainty in a straightforward fashion. Note that more generally, a prediction is made using our model; and therefore, using our model parameters. So in general, a prediction is just a function of our model parameters, i.e., f(θ). Some examples would be: • The average value of future realisations of Y from our population would be predicted by the mean of the fitted distribution: E [ Y | θˆ ] = ∫ ∞ −∞ y p(y | θˆ)dy ∗Copyright (C) Daniel F. Schmidt, 2020 1 • The median of future realisations from our population would be predicted by the median of the fitted distribution, i.e., median(Y | θˆ) = { m : ∫ m −∞ p(y | θˆ)dy = 0.5 } that is, the value m such that P(Y ≤ m) = P(Y > m) under our fitted model p(y | θˆ). • The probability that a random individual from our population has a value greater than c would just be P(Y > c | θˆ) = ∫ ∞ c p(y | θˆ)dy and so on. Sometimes predictions might depend on extra values, like values for the predictors in the case of regression, but in all cases, the prediction is clearly a function of the model parameters θˆ. 2 Bayesian Prediction In the Bayesian approach we put a prior distribution, pi(θ), over our parameters θ, which expresses our beliefs about how likely different values of θ are to be the population values; and then we form the posterior p(θ |y) = p(y |θ)pi(θ)∫ p(y |θ)pi(θ)dθ . (1) The posterior distribution tells us the probability of various values of θ being the population value, in light of our prior beliefs and the data we have observed. To make a prediction using the information in our posterior, we might try and replicate the plug-in approach: we could take a single value as our best guess, using the posterior, and plug it in to the model. For example, we could use the posterior mean. If our prediction function was f(θ) we could use f(E [θ |y]) as our prediction in a similar way to above, but using the posterior mean in place of the ML estimate. This gives us a prediction, but has the same weakness as above: it treats the guess E [θ |y] as if it was exactly known and ignores the uncertainty inherent in estimating parameters. However, we can use the posterior distribution directly to incorporate this uncertainty into our prediction. Essentially, the idea is that in the Bayesian framework the parameter θ is a random variable following the posterior distribution p(θ |y). As a prediction is just a function (transformation), f(θ), of the model parameters, and the model parameters are random variables, then there is implicitly a distribution over the predictions, i.e., our predictions follow a distribution p(f(θ) |y) (2) Once we have this distribution we can do things like find a best prediction using the mean of the prediction distribution, or plot the distribution to get an idea of how variable our guesses for likely values of the prediction are, or find credible intervals to characterise how accurate we believe our predictions are, and so on. Finding the distribution in a simple form is in general, quite difficult due to the complications in transforming random variables. However, if we have m samples of θ drawn from our posterior, say θ(1), . . . ,θ(m), as we often do, we can approximate the distribution by applying the prediction function to each sample of the parameters: f(θ(1)), . . . , f(θ(m)) These m predictions, each made using one of the models we sampled from our posterior, can then be used to approximate the posterior distribution of the predictions p(f(θ) |y) in the usual way. The 2 mean of these m predictions gives us a good guess at the average prediction, the quantiles give us access to credible intervals (i.e., ranges of likely values for the predictions) and so on. Notice how general the idea is: as long as we have a prediction function we can just apply it to all our samples to get a distribution over the predictions with no clever mathematics required! 2.1 Example Let’s examine a simple example in the context of normal distribution we looked at in Studio 6. The Bayesian hierarchy was yi |µ, σ ∼ N(µ, σ2), i = 1, . . . , n µ |m, s ∼ C(m, s) σ ∼ C+(0, 1) We cannot calculate the posterior distribution (1) directly but we can use the provided code from Studio 6 to generate samples of µ and σ2, and use this instead. As a test, we can set the population parameter µ = 0 and σ2 = 1, and set the hyperparameters to m = 0 and s = 1 (it really doesn’t make any difference to the ideas). Then we can generate a sample of n = 50 data points from the population model yi ∼ N(µ = 1, σ2 = 1), and draw samples of µ and σ2 from our posterior. Imagine our prediction is the probability that a random realisation from our population would have a value greater than 1, i.e., we want to estimate P(Y > 1). Under our population model this is easily calculated using R to be 1 - pnorm(1, 0, 1) ≈ 0.1586 The following code generates a random sample from the population, samples from the posterior dis- tribution (to fit the normal to the sample) and then computes the posterior distribution over the predictions, and finds a 95% interval for these predictions. set.seed(3154) y = rnorm(50, 0, 1) # Draw a random sample of size n=50 1 - pnorm(1, 0, 1) # Theoretical probability # Sample from the posterior rv = bayes.cauchy.normal(y, m=0, s=1, n.samples=1e6) # The plug-in estimate of P(Y>1) 1-pnorm(1, mean(rv$mu.samples), mean(rv$sigma.samples)) # Generate distribution of predictions prb = 1-pnorm(1, rv$mu.samples, rv$sigma.samples) mean(prb) quantile(prb, c(0.025,0.975)) my.density.plot(prb,xlabel="P(Y>1)",ylabel="P(P(Y>1) | y)") This code generates n = 50 samples from the N(0, 1) population model, then uses the code from bayes.cauchy.normal.R (from Studio 6) to sample 106 samples of µ and σ2 from the posterior dis- tribution. It then computes the plug-in prediction of P(Y > 1) by plugging in the posterior mean in 3 place of the population mean. It then computes predictions for each one of the samples, and then computes the 95% credible interval for the predictions and plots the histogram of these predictions. This histogram is essentially the posterior distribution over our prediction, i.e., Equation (2). Note that the prediction made by plugging in the posterior mean (≈ 0.21) is quite a bit larger than the theoretical P(Y > 1), but the credible interval of predictions ≈ (0.13, 0.3) does contain the true value of ≈ 0.158. So taking into account the variability gives us a range of plausible predictions that do include the true value, which is nice. Note to produce a distribution over another type of prediction, we would just change the code that produces the predictions as desired. 4
欢迎咨询51作业君