\chapter{Summarising the Posterior Distribution}\label{chapter:summaries}
The posterior distribution is the full answer to any Bayesian problem. It gives
a complete description of our state of knowledge and our uncertainty about the
value(s) of unknown
parameters. From the posterior distribution, we can calculate any probability
we want. For example, if we had a posterior distribution $p(\theta|x)$ and we
wanted to know the probability that $\theta$ is greater than or equal to 100, we could do:
\begin{eqnarray}
P(\theta \geq 100 | x) &=& \int_{100}^\infty p(\theta | x) \, d\theta
\end{eqnarray}
or
\begin{eqnarray}
P(\theta \geq 100 | x) &=& \sum_{\theta \geq 100} p(\theta | x)
\end{eqnarray}
depending on whether the set of possible $\theta$ values is continuous or
discrete. We could also work out the probability of anything else.
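For a discrete set of possibilities, this kind of calculation is one line of R.
Here is a sketch, assuming vectors {\tt theta} (the possible parameter values)
and {\tt post} (their posterior probabilities), like the columns of a Bayes' Box:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
# Posterior probability that theta is at least 100
prob = sum(post[theta >= 100])
\end{minted}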
However, the posterior distribution is sometimes too much
information for us to think about easily. Maybe a giant list of $\theta$
values and probabilities isn't easy to digest. Sometimes,
we need to {\it summarise} the posterior distribution to help us
communicate our results with others. A giant Bayes' Box (or a million MCMC
samples of the parameter; we will see those later) might technically
contain everything we want, but it's not easy to talk about.
For example, say you were trying to estimate
a parameter, and a colleague asked you to state your uncertainty about the
parameter. Well, your posterior distribution might be complicated. It might
have bumps and wiggles in it, or some other kind of structure. If there were
two or more unknown parameters, there might be dependence in the posterior
distribution. In some cases there might even be multiple separate peaks!
Figure~\ref{fig:complicated_posterior} shows an example of what a complicated
posterior distribution might look like. If this was your result, your colleague
might not care about all the little wiggles in this plot. They just want to know
the ``big picture'' of your results.
\begin{figure}[!ht]
\begin{center}
\includegraphics[scale=0.6]{Figures/complicated_posterior.pdf}
\caption{\it A complicated posterior distribution. When communicating with others,
it is often useful to summarise the posterior distribution with a few numbers. In
this case, something like ``the parameter = $5 \pm 1$'' might be a useful summary.
\label{fig:complicated_posterior}}
\end{center}
\end{figure}
The idea of summarising the posterior distribution is very closely related to
the idea of summarising a data set, which you probably encountered when you
studied descriptive statistics.
\begin{framed}
{\bf In descriptive statistics, you often make summaries of a complex data set
(e.g. the mean and the standard deviation) so that you can communicate about
the data set in a concise way. In Bayesian statistics, you often do a similar
thing, but instead of giving a concise description of the {\it data}, you give a
concise description of the {\it posterior distribution}.}
\end{framed}
\section{Point Estimates}
A ``point estimate'' refers to a single number guess for the value of a parameter.
If you have several parameters, a point estimate would be a single guess for the
value of each parameter (like a single point in a multidimensional space).
If you look at the
posterior distribution plotted in Figure~\ref{fig:complicated_posterior}, you
can see that the true value of the parameter is probably somewhere around 5,
but with some uncertainty. If you were to provide a single number as a guess of
the parameter, you would probably say something close to 5. In classical statistics, a
single number guess is called an ``estimate'', and a rule for generating such
guesses is called an ``estimator''. Estimates are usually written by putting a
hat over the name of the parameter. So, by looking at the plot of the
posterior, you could give an estimate like this:
\begin{eqnarray}
\hat{\theta} = 5.
\end{eqnarray}
But there are better things you could do than just looking at the plot, and you've
probably learnt some of them
in previous statistics courses. Here are three methods you could use to
choose a point estimate using the posterior distribution: the posterior mean
(the expectation value of the parameter), the posterior median
(the value that divides the probability
in half), and the posterior mode (the value where the posterior distribution has its
peak). In our illustrative example, the values of these three point estimates
are:
\begin{eqnarray}
\hat{\theta} &=& 4.988 \textnormal{ (the posterior mean)}\\
\hat{\theta} &=& 4.924 \textnormal{ (the posterior median)}\\
\hat{\theta} &=& 4.996 \textnormal{ (the posterior mode)}
\end{eqnarray}
In this example, there's not much of a difference between these three methods.
But in other situations, they can be quite different (this usually happens if
the posterior distribution is skewed, or has multiple modes; you may notice a
strong analogy between this topic and descriptive statistics). Is there a way
to say which one is the {\it best}? It turns out there is, but that
depends on what you mean by ``best''.
Before we move on to the formal ways of deciding what constitutes a good estimate, I would
like to mention a very common method that is easy to use. If the posterior distribution
looks even vaguely like a normal distribution, it is common to summarise it like
this:
\begin{eqnarray}
\theta = \textnormal{posterior mean} \pm \textnormal{posterior standard deviation}.
\end{eqnarray}
I use this kind of summary frequently in my own research.
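Both of these quantities are easy to compute from a Bayes' Box. Here is a
sketch, assuming the same {\tt theta} and {\tt post} vectors as before:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
post_mean = sum(theta*post)
# Standard deviation via E[theta^2] - E[theta]^2
post_sd = sqrt(sum(theta^2*post) - post_mean^2)
\end{minted}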
\subsection{A Very Brief Introduction to Decision Theory}
Decision theory is a very important topic. In this course we will use a
{\it tiny} amount of it, just enough to solve the problem of ``which point
estimate is best?''. If you think about it, this is a bit of a weird question.
Obviously, the best point estimate is the true value. Of course it is, how could it
be otherwise? Our only problem is that we can't actually implement this suggestion.
We don't know the true value. We only have the posterior distribution
(which is based on all the evidence we have), and we have to do
the best we can with our incomplete information. To think about which decision is best,
the first thing we should think
about is which decisions are {\it possible}. For estimating a single parameter, any
real number is a possible guess.
The key idea in decision theory is the concept of {\it utility}, and the related
concept of {\it loss} (loss is just negative utility). Utility is
a numerical measure of how good it would be if a certain outcome came true.
Conversely, loss is a measure of how bad it would be if a certain outcome
came true.
Utilities are often subjective (not unlike prior probabilities), but in some
applications utility can be more concrete. For example, in betting or investment
decisions the utility can be measured in dollars. The problem with utility is
that we have uncertainty about what is going to happen, or about what is true,
so we can't just choose the decision that gives us the greatest utility. Instead
we will use our posterior probabilities and choose the decision that gives us
the maximum possible {\it expected value} of the utility.
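As a minimal sketch of this principle in R, suppose there is a finite set of
possible decisions and outcomes, and a hypothetical matrix {\tt U} (not defined
anywhere else in these notes) whose entry {\tt U[i, j]} is the utility of
decision $i$ if outcome $j$ turns out to be true:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
# 'post' holds the posterior probabilities of the outcomes.
# Matrix multiplication gives one expected utility per decision.
expected_utility = U %*% post
best_decision = which.max(expected_utility)
\end{minted}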
Imagine we were estimating a parameter $\theta$ and we wanted to give a
point estimate $\hat{\theta}$. One idea for what the utility or loss might be is the
{\it quadratic} loss function, which is given by
\begin{eqnarray}
L(\theta, \hat{\theta}) = \left(\hat{\theta} - \theta\right)^2.
\end{eqnarray}
The expression inside the parentheses is the difference between our point estimate
and the true value. This formula says that if our point estimate is off by 2,
that is four times worse than if we were off by 1. If we were off by 10, that is
100 times worse than if we were off by 1, due to the squaring in the quadratic
loss function formula.
It turns out (we will prove this below) that {\it if the loss function is
quadratic, the best estimate you can give is the posterior mean}.
Here is the proof. The expected value of the loss is
\begin{eqnarray}
\mathds{E}\left[L(\theta, \hat{\theta})\right] =
\int p(\theta|x)(\hat{\theta} - \theta)^2 \, d\theta
\end{eqnarray}
Since we are summing (integrating) over all possible true $\theta$ values,
the expected loss is only a function of our estimate $\hat{\theta}$. To minimise a function
of one variable, you differentiate it and then set the derivative to zero.
The derivative is
\begin{eqnarray}
\frac{d}{d\hat{\theta}}\mathds{E}\left[L(\theta, \hat{\theta})\right] &=&
\int p(\theta|x)\frac{d}{d\hat{\theta}}(\hat{\theta} - \theta)^2 \, d\theta \\
&=& \int p(\theta|x)2(\hat{\theta} - \theta) \, d\theta
\end{eqnarray}
Setting this equal to zero gives
$\hat{\theta}\int p(\theta|x)\,d\theta = \int \theta p(\theta|x)\,d\theta$, and
since the posterior distribution integrates to 1, solving for $\hat{\theta}$
gives the final result:
\begin{eqnarray}
\hat{\theta} &=& \int \theta p(\theta|x) \, d\theta,
\end{eqnarray}
which is the posterior mean. Some people call the posterior mean the ``Bayes
Estimate'' for this reason. I don't like that term because I don't think
point estimates are really Bayesian.
The actual output of a Bayesian analysis is the posterior distribution.
Note that I didn't verify that $\hat{\theta}$ actually minimises the expected
loss, because setting the derivative to zero would also find a maximum. To make sure
it really does minimise the expected loss, you can calculate the second
derivative and verify that it is
positive. But that's not really needed. It would be pretty bizarre if the
posterior mean was the {\it worst} estimate!
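If you would rather check this result numerically than trust the calculus, here
is a sketch, again assuming the Bayes' Box vectors {\tt theta} and {\tt post}:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
# Expected quadratic loss over a grid of candidate estimates
theta_hat = seq(min(theta), max(theta), length.out=1000)
expected_loss = rep(NA, length(theta_hat))
for(i in 1:length(theta_hat))
    expected_loss[i] = sum(post*(theta_hat[i] - theta)^2)
theta_hat[which.min(expected_loss)] # Close to sum(theta*post)
\end{minted}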
\subsection{Absolute Loss}
Sometimes, the quadratic loss/utility is not a reasonable model
for the consequences of an incorrect estimate. Another plausible form for the loss
function is the {\it absolute} loss. This looks like:
\begin{eqnarray}
L(\theta, \hat{\theta}) &=& |\hat{\theta} - \theta|.
\end{eqnarray}
With this assumption, the ``badness'' of an incorrect estimate is proportional
to how far the estimate is from the true value. If the estimate is twice as
far from the true value, it is twice as bad. We will not prove it (although you
are welcome to derive this yourself), but for this loss function
the best estimate is the posterior median, which is the value of $\hat{\theta}$
for which $P(\theta \leq \hat{\theta}) = P(\theta > \hat{\theta}) = 0.5$.
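The same numerical check as in the previous subsection supports this claim:
replace the squared difference with an absolute difference, and the minimiser
moves to the posterior median.
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
# Expected absolute loss over a grid of candidate estimates
theta_hat = seq(min(theta), max(theta), length.out=1000)
expected_loss = rep(NA, length(theta_hat))
for(i in 1:length(theta_hat))
    expected_loss[i] = sum(post*abs(theta_hat[i] - theta))
theta_hat[which.min(expected_loss)] # Close to the posterior median
\end{minted}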
\subsection{All-or-nothing Loss}
The third kind of loss function we will look at is the ``all-or-nothing'' loss,
also sometimes called {\it 0-1 loss}.
Sometimes, you may need your estimate to be
completely correct, and if it isn't correct, then it is irrelevant how far your
estimate was from the true value. All incorrect estimates are equally bad.
The all-or-nothing loss looks like:
\begin{eqnarray}
L(\theta, \hat{\theta}) = \left\{
\begin{array}{lr}
0, & \hat{\theta} = \theta\\
1, & \textnormal{otherwise.}
\end{array}
\right.
\end{eqnarray}
If you were in this situation, you would want to make your chance of being
exactly right as high as possible, which implies you should simply choose the
most probable value of $\theta$ as your point estimate $\hat{\theta}$. This
intuition is correct: with all-or-nothing loss, the best estimate is the
posterior mode. The three loss functions we consider
in STATS 331 are shown in Figure~\ref{fig:utility}.
\begin{figure}[!ht]
\begin{center}
\includegraphics[scale=0.5]{Figures/utility.pdf}
\caption{\it Three kinds of loss function, which measure how bad it is for our
point estimate $\hat{\theta}$ to be different from the true value of the parameter
$\theta$. Note that the all-or-nothing loss has a small amount of width in this
plot, just so that we can clearly see the spike at
$\hat{\theta} - \theta = 0$.\label{fig:utility}}
\end{center}
\end{figure}
\subsection{Invariance of Decisions}
You may be wondering about the definitions of our loss functions. For example,
we defined the quadratic loss as $(\hat{\theta} - \theta)^2$, but what if we
defined it as $3(\hat{\theta} - \theta)^2 + 5$ instead? Would our
best decision change? Luckily, the answer is no. The decision (estimate) which
minimises the expected value of a loss function $L$ also minimises the expected
value of a different loss function $aL + b$, where $a$ is any positive number
and $b$ is any other number.
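This follows from the linearity of expectation,
\begin{eqnarray}
\mathds{E}\left[aL + b\right] &=& a\,\mathds{E}\left[L\right] + b,
\end{eqnarray}
which, since $a$ is positive, is minimised by exactly the same decision as
$\mathds{E}\left[L\right]$.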
For the mathematicians, the optimal decision is invariant under positive affine
transformations of the utility or loss function. Phew!
\subsection{Computing Point Estimates from a Bayes' Box}
We have just discussed three different point estimates, and under what
circumstances we can consider them to be the best possible estimate we could
make. Now, we will look at how to actually obtain the point estimates.
The posterior mean is straightforward. It's the expectation value of the parameter
using the posterior distribution. In R, the code is:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
post_mean = sum(theta*post)
\end{minted}
You should also know how to compute this manually from a Bayes' Box, using a
calculator.
The posterior mode is also fairly straightforward. First, we can find the
highest probability in the Bayes' Box. Then we find the corresponding parameter
value.
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
highest_probability = max(post)
post_mode = theta[post == highest_probability]
\end{minted}
In the case of a tie, {\tt post\_mode} might be a vector, indicating
that there isn't a single mode.
The posterior median is a little harder. We need to find the $\theta$ value
which has 50\% of the probability to the left and 50\% of the probability to the
right. Note that this isn't precisely defined in some cases, particularly with
discrete distributions. For example, if $\theta$ could be 1, 2, or 3, and the
probabilities of these were 0.3, 0.6, and 0.1, then what is the median? It is
not entirely clear. However, when there are a large number of possible values,
the ambiguity matters much less.
To calculate the posterior median in R, we need to use the cumulative distribution
which is defined as $F(t) = P(\theta \leq t)$. If we then find the value of $t$
where $F(t) = 0.5$, we have found the posterior median. This isn't always
possible but we can always find the value of $t$ which makes $F(t)$ very close to
0.5.
To obtain the cumulative distribution
in R you can use the {\tt cumsum} function, which calculates the cumulative sum
of a vector. The posterior vector contains the probabilities of $\theta$ equalling
certain values. If we want the probability that $\theta$ is less than or
equal to a certain value, we sum all the probabilities up to and including that value. The
cumulative sum function achieves this. Here is the code for calculating the
posterior median:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
F = cumsum(post)
dist = abs(F - 0.5) # Distance of the F-values from 0.5
post_median = theta[dist == min(dist)]
\end{minted}
Note that this may also produce more than one result. Like the mode, the
posterior median is not always uniquely defined.
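As a quick usage example, apply this code to the small discrete distribution
mentioned above:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
theta = c(1, 2, 3)
post = c(0.3, 0.6, 0.1)
F = cumsum(post)         # 0.3, 0.9, 1.0
dist = abs(F - 0.5)      # 0.2, 0.4, 0.5
theta[dist == min(dist)] # Gives 1
\end{minted}
The code returns 1 because $F(1) = 0.3$ is the cumulative probability closest
to 0.5, even though you could argue for 2 instead. This is exactly the
ambiguity discussed above, and it fades as the number of possible values grows.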
\subsection{Computing Point Estimates from Samples}
When we use a Bayes' Box (or the equivalent R commands which represent the
columns of a Bayes' Box as vectors), we end up with a vector of possible parameter
values and another vector containing the posterior distribution.
When we use MCMC and JAGS, the output is different.
We will only have a vector of parameter
values, without corresponding probabilities. The vector of parameter values is
meant to be a random sample of values drawn from the posterior distribution.
It's like saying ``here are a bunch of guesses for the parameter'', and any
region where there are a lot of guesses is considered to be a region with high
probability.
When we have samples instead of an exhaustive list of parameter values and
probabilities, the methods for computing the summaries are different. For a
parameter called $\theta$, the methods for computing the summaries are given
below.
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
# Posterior mean using samples
post_mean = mean(theta)
# Posterior mode using samples
# post_mode = ??? (this can't be done easily with samples!)
# If you have a really large number of samples,
# visually finding the peak of
# a histogram can work.
# Posterior median using samples
sorted = sort(theta)
post_median = sorted[round(0.5*length(theta))] # Round to get a valid index
\end{minted}
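If you do want an approximate mode from samples, one rough option is a kernel
density estimate, using R's built-in {\tt density} function. This is only a
sketch; the answer depends on the amount of smoothing:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
# Approximate posterior mode from samples
d = density(theta)
post_mode = d$x[which.max(d$y)]
\end{minted}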
\section{Credible Intervals}
Credible intervals are another useful kind of summary. They are used to make
statements like ``There is a 95\% probability the parameter is between
100 and 150''. The basic idea is to
use the posterior distribution to find an interval $[a, b]$ such that
\begin{eqnarray}
P(a \leq \theta \leq b | x) &=& \alpha
\end{eqnarray}
where $\alpha$ is some pre-defined probability. 95\% seems to be the
most popular choice. An example of a 95\% credible interval is given in
Figure~\ref{fig:credible_interval}.
\begin{figure}[!ht]
\begin{center}
\includegraphics[scale=0.5]{Figures/credible_interval.pdf}
\caption{\it A central 95\% credible interval is defined as an interval that
contains 95\% of the posterior probability, while having 2.5\% of the probability
above the upper limit and 2.5\% of the probability below the lower limit. The
credible interval is formed by finding the edges of the grey region. In this
case the credible interval is $[3.310, 7.056]$.
\label{fig:credible_interval}}
\end{center}
\end{figure}
Note that the interval shown in Figure~\ref{fig:credible_interval} is not the
only possible interval that would contain 95\% of the probability. However, to
make the notion of a credible interval precise, we usually use a {\it central}
credible interval. A central credible interval containing an amount of probability
$\alpha$ will leave $(1-\alpha)/2$ of the probability to its left and
the same amount $(1-\alpha)/2$ of the probability to its right.
\subsection{Computing Credible Intervals from a Bayes' Box}
The method for computing credible intervals is closely related to the method
for computing the posterior median. With the median, we found the value of
$\theta$ which has 50\% of the posterior probability to its left and 50\% to its
right. To find the lower end of a 95\% credible interval, we find the $\theta$
value that has 2.5\% of the probability to its left. To find the upper end we
find the value of $\theta$ that has 2.5\% of the posterior probability to its
right, or 97.5\% to the left.
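Here is a sketch in R, following the same pattern as the posterior median code
(and with the same caveats about ties and coarse grids):
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
F = cumsum(post)
dist_lower = abs(F - 0.025) # For the lower end of the interval
dist_upper = abs(F - 0.975) # For the upper end
ci_lower = theta[dist_lower == min(dist_lower)]
ci_upper = theta[dist_upper == min(dist_upper)]
\end{minted}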
\subsection{Computing Credible Intervals from Samples}
If you have used MCMC and have obtained random samples from the posterior
distribution, you can find a credible interval in a similar way to how you would
find the posterior median. Instead of finding the 0.5 quantile of the
posterior distribution, you find the 0.025 quantile and the 0.975 quantile
(if you want a central 95\% credible interval).
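R's built-in {\tt quantile} function does this directly. Assuming {\tt theta}
is the vector of posterior samples:
\begin{minted}[mathescape,
numbersep=5pt,
gobble=0,
frame=single,
framesep=2mm, fontsize=\small]{r}
# Central 95 percent credible interval from posterior samples
ci = quantile(theta, probs=c(0.025, 0.975))
\end{minted}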
\section{Confidence Intervals}
In previous stats courses you have probably come across the concept of a
{\it confidence interval}. A confidence interval is a concept in classical
statistics that is somewhat similar to a credible interval in Bayesian statistics.
When people calculate confidence intervals, they usually want to say they are
95\% sure that the parameter is in that interval,
given the data. This is what Bayesian credible intervals do, but {\it it is not
what classical confidence intervals do}!
Luckily, a lot of the time, the classical and the Bayesian methods for making
intervals will actually give the same interval. But this isn't always the case!
In lectures we will study an example (taken from an Ed Jaynes paper from the
70s)
where the Bayesian credible interval and
the classical confidence interval give completely different results. The result
is shown in Figure~\ref{fig:jaynes}. The key thing to note is that the classical
confidence interval lies entirely in a region where we are certain
(from the data) that $\theta$ cannot possibly be!
\begin{figure}[!ht]
\begin{center}
\includegraphics[scale=0.5]{Figures/jaynes.pdf}
\caption{\it An example of a Bayesian credible interval and a frequentist confidence
interval applied to a particular problem where they give different answers.
The posterior distribution in blue shows that the parameter $\theta$ is probably
somewhere between 11 and 12, and values above 12 are completely impossible.
However, the entire frequentist confidence interval lies above $\theta=12$.
\label{fig:jaynes}}
\end{center}
\end{figure}