程序辅导案例 > Program >

程序代写案例-FIT3154

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

FIT3154 Extra Notes
Note on Bayesian Prediction
Dr. Daniel F. Schmidt∗
August 22, 2022
1 Prediction
Let p(y |θ) be a probability distribution (model) with parameters θ. It is usual that, once we have
fitted a model to data, we use it to make predictions about the population it is representing. For
example, imagine we are interested in modelling the average heights (why is every example always
heights in this unit?) of the Australian population. We might use a normal model y ∼ N(µ, σ2),
and from a sample of heights y = (y1, . . . , yn) we estimate the unknown parameters by maximum
likelihood. Call the ML estimates µˆ and σˆ2. Then we might be asked: what is the probability of a
random person from our population having a height greater than 1.7m, i.e., what is P(Y > 1.7)?
One way to do this is take our estimates, plug them into the probability model and then compute
the probability assuming this probability model is a representative of our population, i.e., estimate
P(Y > 1.7) using ∫ ∞
1.7
p(y | µˆ, σˆ2).
This is called a “plug-in” estimator, because we take the estimated parameter values and plug them
into the probability model, and treat them as if they were the unknown population parameters. The
problem with this approach is that, of course, the estimates µˆ and σˆ2 are not the exact population
values; instead, they are values we have estimated (guessed) using the data, and are subject to uncer-
tainty. Ignoring this uncertainty can lead to poorer predictions, and does not allow us to characterise
how uncertain we are in our prediction itself. The Bayesian approach allows us to incorporate this
uncertainty in a straightforward fashion.
Note that more generally, a prediction is made using our model; and therefore, using our model
parameters. So in general, a prediction is just a function of our model parameters, i.e., f(θ). Some
examples would be:
• The average value of future realisations of Y from our population would be predicted by the
mean of the fitted distribution:
E
[
Y | θˆ
]
=
∫ ∞
−∞
y p(y | θˆ)dy
∗Copyright (C) Daniel F. Schmidt, 2020
1
• The median of future realisations from our population would be predicted by the median of the
fitted distribution, i.e.,
median(Y | θˆ) =
{
m :
∫ m
−∞
p(y | θˆ)dy = 0.5
}
that is, the value m such that P(Y ≤ m) = P(Y > m) under our fitted model p(y | θˆ).
• The probability that a random individual from our population has a value greater than c would
just be
P(Y > c | θˆ) =
∫ ∞
c
p(y | θˆ)dy
and so on. Sometimes predictions might depend on extra values, like values for the predictors in the
case of regression, but in all cases, the prediction is clearly a function of the model parameters θˆ.
2 Bayesian Prediction
In the Bayesian approach we put a prior distribution, pi(θ), over our parameters θ, which expresses
our beliefs about how likely different values of θ are to be the population values; and then we form
the posterior
p(θ |y) = p(y |θ)pi(θ)∫
p(y |θ)pi(θ)dθ . (1)
The posterior distribution tells us the probability of various values of θ being the population value, in
light of our prior beliefs and the data we have observed. To make a prediction using the information
in our posterior, we might try and replicate the plug-in approach: we could take a single value as our
best guess, using the posterior, and plug it in to the model. For example, we could use the posterior
mean. If our prediction function was f(θ) we could use f(E [θ |y]) as our prediction in a similar way
to above, but using the posterior mean in place of the ML estimate. This gives us a prediction, but
has the same weakness as above: it treats the guess E [θ |y] as if it was exactly known and ignores the
uncertainty inherent in estimating parameters.
However, we can use the posterior distribution directly to incorporate this uncertainty into our
prediction. Essentially, the idea is that in the Bayesian framework the parameter θ is a random variable
following the posterior distribution p(θ |y). As a prediction is just a function (transformation), f(θ),
of the model parameters, and the model parameters are random variables, then there is implicitly a
distribution over the predictions, i.e., our predictions follow a distribution
p(f(θ) |y) (2)
Once we have this distribution we can do things like find a best prediction using the mean of the
prediction distribution, or plot the distribution to get an idea of how variable our guesses for likely
values of the prediction are, or find credible intervals to characterise how accurate we believe our
predictions are, and so on. Finding the distribution in a simple form is in general, quite difficult due to
the complications in transforming random variables. However, if we have m samples of θ drawn from
our posterior, say θ(1), . . . ,θ(m), as we often do, we can approximate the distribution by applying the
prediction function to each sample of the parameters:
f(θ(1)), . . . , f(θ(m))
These m predictions, each made using one of the models we sampled from our posterior, can then be
used to approximate the posterior distribution of the predictions p(f(θ) |y) in the usual way. The
2
mean of these m predictions gives us a good guess at the average prediction, the quantiles give us
access to credible intervals (i.e., ranges of likely values for the predictions) and so on. Notice how
general the idea is: as long as we have a prediction function we can just apply it to all our samples to
get a distribution over the predictions with no clever mathematics required!
2.1 Example
Let’s examine a simple example in the context of normal distribution we looked at in Studio 6. The
Bayesian hierarchy was
yi |µ, σ ∼ N(µ, σ2), i = 1, . . . , n
µ |m, s ∼ C(m, s)
σ ∼ C+(0, 1)
We cannot calculate the posterior distribution (1) directly but we can use the provided code from
Studio 6 to generate samples of µ and σ2, and use this instead. As a test, we can set the population
parameter µ = 0 and σ2 = 1, and set the hyperparameters to m = 0 and s = 1 (it really doesn’t
make any difference to the ideas). Then we can generate a sample of n = 50 data points from the
population model yi ∼ N(µ = 1, σ2 = 1), and draw samples of µ and σ2 from our posterior. Imagine
our prediction is the probability that a random realisation from our population would have a value
greater than 1, i.e., we want to estimate P(Y > 1). Under our population model this is easily calculated
using R to be
1 - pnorm(1, 0, 1) ≈ 0.1586
The following code generates a random sample from the population, samples from the posterior dis-
tribution (to fit the normal to the sample) and then computes the posterior distribution over the
predictions, and finds a 95% interval for these predictions.
set.seed(3154)
y = rnorm(50, 0, 1) # Draw a random sample of size n=50
1 - pnorm(1, 0, 1) # Theoretical probability
# Sample from the posterior
rv = bayes.cauchy.normal(y, m=0, s=1, n.samples=1e6)
# The plug-in estimate of P(Y>1)
1-pnorm(1, mean(rv$mu.samples), mean(rv$sigma.samples))
# Generate distribution of predictions
prb = 1-pnorm(1, rv$mu.samples, rv$sigma.samples)
mean(prb)
quantile(prb, c(0.025,0.975))
my.density.plot(prb,xlabel="P(Y>1)",ylabel="P(P(Y>1) | y)")
This code generates n = 50 samples from the N(0, 1) population model, then uses the code from
bayes.cauchy.normal.R (from Studio 6) to sample 106 samples of µ and σ2 from the posterior dis-
tribution. It then computes the plug-in prediction of P(Y > 1) by plugging in the posterior mean in
3
place of the population mean. It then computes predictions for each one of the samples, and then
computes the 95% credible interval for the predictions and plots the histogram of these predictions.
This histogram is essentially the posterior distribution over our prediction, i.e., Equation (2). Note
that the prediction made by plugging in the posterior mean (≈ 0.21) is quite a bit larger than the
theoretical P(Y > 1), but the credible interval of predictions ≈ (0.13, 0.3) does contain the true value
of ≈ 0.158. So taking into account the variability gives us a range of plausible predictions that do
include the true value, which is nice. Note to produce a distribution over another type of prediction,
we would just change the code that produces the predictions as desired.
4

欢迎咨询51作业君