Nested sampling is a stochastic algorithm to evaluate the evidence of a Bayesian problem
$$ Z = p(d | \mathcal{H}) = \int p(d | \boldsymbol{\theta}, \mathcal{H}) p(\boldsymbol{\theta}| \mathcal{H}) d\boldsymbol{\theta}$$
where $d$ is the observed data, $\boldsymbol{\theta}$ is the set of model parameters, and $\mathcal{H}$ is the assumed hypothesis. This quantity is the probability of observing the data $d$ if the hypothesis $\mathcal{H}$ is true. When we only have the observation but do not know the underlying mechanism, we can use the evidence to statistically quantify which candidate model explains the data best.
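To make this concrete, here is a hypothetical 1D toy example (not taken from the references): a Gaussian likelihood with a uniform prior, where the evidence can simply be brute-forced on a grid. All names and numbers below are illustrative.

```python
import numpy as np

# Hypothetical 1D toy problem: Gaussian likelihood centred at 1.0 with width 0.2,
# and a uniform prior on [-5, 5]. The evidence is the prior-weighted average likelihood.
theta = np.linspace(-5.0, 5.0, 100_001)
dtheta = theta[1] - theta[0]

likelihood = np.exp(-0.5 * ((theta - 1.0) / 0.2) ** 2) / (0.2 * np.sqrt(2.0 * np.pi))
prior = np.full_like(theta, 1.0 / 10.0)   # uniform density over an interval of length 10

Z = np.sum(likelihood * prior) * dtheta   # brute-force numerical integration
print(f"evidence Z ≈ {Z:.4f}")            # ≈ 0.1 here, since the likelihood integrates to 1
```

Such a grid-based approach only works in low dimensions, which is exactly why we need something like nested sampling.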
Transformation of the integral
In the nested sampling literature, we usually use $\mathcal{L}(\boldsymbol{\theta})$ to represent the likelihood $p(d | \boldsymbol{\theta})$ and $\pi(\boldsymbol{\theta})$ to represent the prior $p(\boldsymbol{\theta})$, so the evidence is written as
$$Z = \int \mathcal{L}(\boldsymbol{\theta}) \pi(\boldsymbol{\theta})d\boldsymbol{\theta}\,. $$
One should notice that the above integral is multi-dimensional, and it is very difficult to evaluate when the dimensionality of $\boldsymbol{\theta}$ is high. Nested sampling instead transforms the multi-dimensional integral into a one-dimensional integral by performing the following transformation
$$\pi(\boldsymbol{\theta})d\boldsymbol{\theta} = dX$$
such that
$$X(\mathcal{L}_{\text{min}}) = \int_{\mathcal{L}(\boldsymbol{\theta}) > \mathcal{L}_{\text{min}}}\pi(\boldsymbol{\theta})d\boldsymbol{\theta}\,.$$
By slightly abusing the notation (the $\mathcal{L}$ here is not the $\mathcal{L}$ above), we define the inverse transform of $X(\mathcal{L}_{\text{min}})$ as $\mathcal{L}(X) = \mathcal{L}_{\text{min}}$, and then we have
$$Z = \int \mathcal{L}(X) dX \,.$$
Figure 1: The figure on the left is the contour plot of the likelihood in a two-dimensional parameter space. The figure on the right shows the likelihood as a function of $X$. The area under the curve is exactly the evidence. From Ref. [1].
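To get a feeling for the change of variables, here is a crude Monte Carlo illustration of $X(\mathcal{L}_{\text{min}})$ for the same hypothetical 1D toy problem as above: it is simply the fraction of prior draws whose likelihood exceeds the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical toy problem: Gaussian likelihood (up to a constant) and a uniform
# prior on [-5, 5]. X(L_min) is the prior mass enclosed by the contour L = L_min.
def log_likelihood(theta):
    return -0.5 * ((theta - 1.0) / 0.2) ** 2

samples = rng.uniform(-5.0, 5.0, size=1_000_000)   # draws from the prior
logl = log_likelihood(samples)

for logl_min in (-8.0, -2.0, -0.5):
    X = np.mean(logl > logl_min)                   # fraction of prior mass above the threshold
    print(f"log L_min = {logl_min:5.1f}  ->  X ≈ {X:.4f}")
```

As expected, $X$ shrinks from $1$ towards $0$ as the likelihood threshold is raised.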
This integral, however, does not look very useful, because we don't know $\mathcal{L}(X)$.
The trick to get $\mathcal{L}$ and $X$
Nested sampling chooses $\mathcal{L}$ levels whose corresponding $X$ values are statistically known. If we sample from the prior $\pi(\boldsymbol{\theta})$, the corresponding prior volumes $X$ are distributed uniformly between $0$ and $1$, as guaranteed by the inverse transform theorem [2]. This implies that $X(\mathcal{L}_{1}), X(\mathcal{L}_{2}), ..., X(\mathcal{L}_{N})$ are uniformly distributed. Determining the $\mathcal{L}_{\text{min}}$ corresponding to a chosen $X$ is, however, very difficult. Instead, we can do things slightly differently.
The $\mathcal{L}_{\text{min}}$ can be obtained by drawing a number of samples $\boldsymbol{\theta}$ from $\pi(\boldsymbol{\theta})$: we take the minimum of the likelihoods $\mathcal{L}(\boldsymbol{\theta}_{i})$ over the set of samples as $\mathcal{L}_{\text{min}}$, and then determine the corresponding $X(\mathcal{L}_{\text{min}})$ statistically.
If we draw $N$ samples from the prior $\pi(\boldsymbol{\theta})$, and each sample has its corresponding $\mathcal{L}_{i}$ and $X_{i}$, then the $X$ corresponding to the minimum likelihood $\mathcal{L}_{\text{min}} = \min\{\mathcal{L}_{i}\}$ is the largest of the $X_{i}$, and its cumulative distribution function is
$$P(X < t) = P(X_{1} < t, X_{2} < t, ..., X_{N} < t) = t^{N}\,.$$
The probability density function of $t$ is therefore
$$p(t) = \frac{dP(t)}{dt} = Nt^{N-1}\,,$$
which is also known as the Beta distribution $\text{Beta}(N, 1)$.
Figure 2: Distribution of $t$ for different numbers of samples $N$.
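As a quick numerical check (a hypothetical snippet, with $N$ and the number of trials chosen arbitrarily), keeping the largest of $N$ uniform prior volumes indeed reproduces the $\text{Beta}(N,1)$ statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50                # number of live points
trials = 100_000

# Draw N uniform prior volumes per trial and keep the largest one,
# which corresponds to the live point with the *lowest* likelihood.
t = rng.uniform(size=(trials, N)).max(axis=1)

# Beta(N, 1) has mean N / (N + 1) and pdf N t^(N-1).
print(f"empirical mean of t: {t.mean():.5f}")
print(f"Beta(N,1) mean     : {N / (N + 1):.5f}")
```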
Let's take a short break and summarize what we have so far:
- We draw $N$ samples from the prior $\pi(\boldsymbol{\theta})$ (the collection of samples is also called live points)
- We order the likelihood of the $N$ samples from the lowest to the highest $\boldsymbol{\mathcal{L}} = \{\mathcal{L}_{(1)}, \mathcal{L}_{(2)}, ..., \mathcal{L}_{(N)}\}$
- The prior volume $X_{1}$ corresponding to the minimum likelihood $\mathcal{L}_{1} = \mathcal{L}_{(1)}$ is unknown, but we know that it follows the beta distribution, i.e. $X_{1}\sim \text{Beta}(N,1)$.
If we know $X_{1}$, then we can obtain the point $(X_{1}, \mathcal{L}_{1})$ in the figure below.
Figure 3: The one-dimensional integral of the evidence in the $X$ space.
How can we get the other grid points $X_{2}, X_{3}, ...$?
Since $\mathcal{L}(X)$ is a monotonically decreasing function, we can get a smaller $X_{2} < X_{1}$ from a larger likelihood $\mathcal{L}_{2} > \mathcal{L}_{1}$. How can we get a likelihood larger than $\mathcal{L}_{1}$? Simple: just keep sampling from the prior $\pi(\boldsymbol{\theta})$ until we get a sample whose likelihood is greater than $\mathcal{L}_{1}$. In other words, we are sampling from a constrained prior, and the sample that we obtain is the so-called nested sample.
In the second iteration, we are exploring only a fraction of the prior volume, namely $X_{1}$, since we restrict the samples to have a likelihood greater than $\mathcal{L}_{1}$. Notice that we do not need to draw $N$ samples again. We only need to draw one new sample that satisfies the likelihood constraint, since the samples are independent of each other.
- The ordered live points from the first iteration: $\{\mathcal{L}_{1}, \mathcal{L}_{2}, ..., \mathcal{L}_{N}\}$
- Remove the sample with the minimum likelihood from the set: $\{\mathcal{L}_{2}, \mathcal{L}_{3}, ..., \mathcal{L}_{N}\}$
- The sample with likelihood $\mathcal{L}_{1}$ is called a dead point
- Draw a new sample from the prior $\pi(\boldsymbol{\theta})$ whose likelihood is greater than $\mathcal{L}_{1}$, and add it to the set of live points (a minimal sketch of this replacement step is given below)
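Here is a minimal sketch of that replacement step, assuming a hypothetical toy problem (a 2D Gaussian likelihood with a uniform prior on $[-3, 3]^2$); the brute-force rejection loop is deliberately naive, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_likelihood(theta):
    # Hypothetical toy likelihood: 2D unit Gaussian centred at the origin.
    return -0.5 * np.sum(theta ** 2)

def sample_prior():
    # Hypothetical uniform prior on the square [-3, 3]^2.
    return rng.uniform(-3.0, 3.0, size=2)

def draw_constrained(logl_min):
    """Draw one new live point from the prior, subject to L > L_min (brute-force rejection)."""
    while True:
        theta = sample_prior()
        if log_likelihood(theta) > logl_min:   # accept only points inside the likelihood contour
            return theta

# e.g. new_point = draw_constrained(-2.0)
```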
The $X$ of the new sample with the minimum likelihood also follows the Beta distribution, but hold on, we are now exploring a constrained prior volume, so $X_{2}$ should instead be $X_{2} = tX_{1}$, where $t$ follows the $\text{Beta}(N, 1)$ distribution. By repeating this procedure, the prior volume shrinks geometrically, or equivalently the log prior volume shrinks linearly
- $\log X_{1} = \log t$
- $\log X_{2} = \log X_{1} + \log t$
- $\log X_{3} = \log X_{2} + \log t$
- $...$
But still, we don't know what the $X$s are...
The only thing that we know is their distribution. The best thing that we can do is therefore to approximate each $\log X$ by the mean of its distribution (from now on, we will work in $\log$ space)! The mean of $\log t$ is
$$\left<\log{t}\right> = -\frac{1}{N}\,,$$
and the standard deviation is
$$\sigma(\log t) = \frac{1}{N}\,.$$
Then, we have the estimated log prior volume in each iteration to be
$$\log X_{i} = -\frac{i}{N} \pm \frac{\sqrt{i}}{N}\,.$$
We can see from the standard deviation that the distribution is more sharply peaked for a larger number of live points, i.e. we will get a more accurate result with more live points.
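As a quick numerical sanity check (a hypothetical, self-contained snippet using NumPy's Beta sampler), we can verify both the quoted mean and standard deviation of $\log t$ and the compounded shrinkage $\log X_{i} \approx -i/N \pm \sqrt{i}/N$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_iter, trials = 100, 500, 2000

# log t for t ~ Beta(N, 1): check the mean and standard deviation quoted above.
log_t = np.log(rng.beta(N, 1, size=1_000_000))
print(f"mean(log t) = {log_t.mean():.5f}   (expected {-1 / N:.5f})")
print(f"std(log t)  = {log_t.std():.5f}   (expected {1 / N:.5f})")

# Compounded shrinkage: log X_i is a sum of i independent log t draws,
# so it concentrates around -i/N with spread sqrt(i)/N.
log_X = np.cumsum(np.log(rng.beta(N, 1, size=(trials, n_iter))), axis=1)
print(f"log X_{n_iter}: {log_X[:, -1].mean():.3f} ± {log_X[:, -1].std():.3f} "
      f"(expected {-n_iter / N:.3f} ± {np.sqrt(n_iter) / N:.3f})")
```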
Finally, we can evaluate the evidence in the $X$ space
To summarize nested sampling, we
1. Sample $N$ live points from the prior $\pi(\boldsymbol{\theta})$
2. Sort the samples by their likelihood from the lowest to the highest $\boldsymbol{\mathcal{L}} = \{\mathcal{L}_{(1)}, \mathcal{L}_{(2)}, ..., \mathcal{L}_{(N)}\}$
3. Obtain the minimum likelihood from the set of live points $\mathcal{L}_{1} = \mathcal{L}_{(1)}$
4. Assign the log prior volume $\log X_{1}$ corresponding to $\mathcal{L}_{1}$ to be $-\frac{1}{N}$
5. Remove the sample with likelihood $\mathcal{L}_{1}$ from the set of live points
6. Draw a new sample from the prior $\pi(\boldsymbol{\theta})$ whose likelihood is greater than $\mathcal{L}_{1}$
7. Add the new sample to the set of live points
8. Sort the live points again and obtain the new minimum likelihood $\mathcal{L}_{2}$
9. Assign the log prior volume $\log X_{2}$ corresponding to $\mathcal{L}_{2}$ to be $-\frac{2}{N}$
10. Remove the sample with the minimum likelihood from the set of live points
11. Repeat steps 6-10 until the result converges (a minimal end-to-end sketch is given below)
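Putting the steps together, here is a minimal end-to-end sketch for a hypothetical toy problem (a 2D Gaussian likelihood with a uniform prior on $[-3, 3]^2$). It uses the same brute-force rejection step as above and a fixed iteration budget; the proper stopping rule is discussed below. All names are illustrative and not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical toy problem: 2D unit Gaussian likelihood, uniform prior on [-3, 3]^2,
# so the true evidence is roughly 1/36. A naive illustration, not a production sampler.
def log_likelihood(theta):
    return -0.5 * np.sum(theta ** 2, axis=-1) - np.log(2.0 * np.pi)

def sample_prior(n):
    return rng.uniform(-3.0, 3.0, size=(n, 2))

N = 100          # number of live points
n_iter = 600     # fixed iteration budget; the proper stopping rule is discussed below

# Step 1: draw N live points from the prior.
live = sample_prior(N)
live_logl = log_likelihood(live)

dead_points, dead_logl = [], []   # dead points and their log-likelihoods, in order of removal
for i in range(1, n_iter + 1):
    # Steps 2-5: the worst live point becomes a dead point with log X_i ≈ -i/N
    # (an argmin replaces the explicit sort).
    worst = np.argmin(live_logl)
    dead_points.append(live[worst].copy())
    dead_logl.append(live_logl[worst])
    logl_min = live_logl[worst]

    # Steps 6-7: replace it with a prior draw satisfying L > L_min (brute-force rejection).
    while True:
        theta = sample_prior(1)[0]
        logl = log_likelihood(theta)
        if logl > logl_min:
            live[worst], live_logl[worst] = theta, logl
            break

dead_points, dead_logl = np.array(dead_points), np.array(dead_logl)
print(f"collected {dead_logl.size} dead points")
```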
We now have a set of dead points $(X_{i}, \mathcal{L}_{i})$, and the evidence can be evaluated using the trapezoidal rule
$$Z = \int \mathcal{L}(X)dX \approx \sum_{i}\mathcal{L}_{i}w_{i}\,,$$
where $w_{i} = \frac{1}{2}(X_{i-1} - X_{i+1})$, and $X_{i} = \exp(-i/N)$.
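A small helper that turns the dead points into an evidence estimate, assuming the `dead_logl` array and the number of live points `N` from the sketch above (the boundary convention $X_{0} = 1$, $X_{n+1} = 0$ is an assumption of this sketch):

```python
import numpy as np

def log_evidence(dead_logl, N):
    """Trapezoidal evidence estimate from the dead points' log-likelihoods."""
    i = np.arange(1, len(dead_logl) + 1)
    X = np.exp(-i / N)                        # estimated prior volumes X_i = exp(-i/N)
    X_pad = np.concatenate(([1.0], X, [0.0])) # assumed boundary values X_0 = 1, X_{n+1} = 0
    w = 0.5 * (X_pad[:-2] - X_pad[2:])        # w_i = (X_{i-1} - X_{i+1}) / 2
    return np.log(np.sum(np.exp(dead_logl) * w))

# e.g. print(log_evidence(dead_logl, N))
# should come out near log(1/36) ≈ -3.6 for the toy problem above,
# up to truncation and Monte Carlo noise.
```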
How do we know the result converges?
The algorithm terminates when we believe the set of dead points has encompassed the vast majority of the posterior, which means the increment of the (log) evidence in an iteration should be smaller than some small threshold
$$\Delta \log \hat{Z}_{i} \equiv \log\left( \hat{Z}_{i} + \Delta \hat{Z}_{i} \right) - \log\left( \hat{Z}_{i} \right) < \epsilon\,.$$
Although the remaining evidence $\Delta \hat{Z}_{i}$ is unknown, we can set an upper bound on it
$$\Delta \hat{Z}_{i} \leq \mathcal{L}^{\text{max}}X_{i}\,.$$
Again, neither $\mathcal{L}^{\text{max}}$ nor $X_{i}$ is known exactly. We may replace the upper bound with their estimates, i.e.
$$\Delta \hat{Z}_{i} \lessapprox \mathcal{L}_{i}^{\text{max}}\hat{X}_{i} $$
where $\mathcal{L}_{i}^{\text{max}}$ is the maximum likelihood in the live points at the $i$-th iteration, and $\hat{X}_{i} = \exp(-i/N)$. In practice, the termination condition is therefore
$$\Delta \log \hat{Z}_{i} \equiv \log\left( \hat{Z}_{i} + \mathcal{L}_{i}^{\text{max}}\hat{X}_{i} \right) - \log\left( \hat{Z}_{i} \right) < \epsilon\,.$$
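A sketch of this stopping rule (the function name, arguments, and default threshold are illustrative), using `np.logaddexp` to evaluate $\log\left( \hat{Z}_{i} + \mathcal{L}_{i}^{\text{max}}\hat{X}_{i} \right)$ stably:

```python
import numpy as np

def should_stop(logZ_current, logl_max_live, i, N, eps=1e-3):
    """Check whether the estimated remaining evidence is negligible.

    logZ_current  : running log-evidence accumulated from the dead points so far
    logl_max_live : maximum log-likelihood among the current live points
    i, N          : iteration index and number of live points, so log X_i = -i / N
    """
    log_remaining = logl_max_live - i / N                          # log(L_max * X_i)
    delta_logZ = np.logaddexp(logZ_current, log_remaining) - logZ_current
    return delta_logZ < eps
```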
The by-product: posterior samples
Although nested sampling is designed to estimate the evidence, we can obtain the posterior samples from the dead points. Since we know the $\mathcal{L}_{i}$ and $X_{i}$ of each dead point, we can assign a weight to each dead point
$$p_{i} = \frac{\mathcal{L}_{i}w_{i}}{Z}$$
which is exactly the posterior probability mass associated with that dead point. We can then resample the dead points with these weights, and the generated samples will follow the posterior distribution!
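A minimal sketch of this resampling step, assuming the `dead_points` and `dead_logl` arrays collected in the sketch above (the function name is illustrative):

```python
import numpy as np

def resample_posterior(dead_points, dead_logl, N, n_draws, rng):
    """Resample the dead points with weights p_i ∝ L_i * w_i to get posterior samples."""
    i = np.arange(1, len(dead_logl) + 1)
    X = np.concatenate(([1.0], np.exp(-i / N), [0.0]))
    w = 0.5 * (X[:-2] - X[2:])                  # trapezoidal prior-volume weights
    logp = dead_logl + np.log(w)                # unnormalised log posterior weights
    p = np.exp(logp - np.max(logp))
    p /= p.sum()                                # normalising is equivalent to dividing by Z_hat
    idx = rng.choice(len(dead_logl), size=n_draws, p=p, replace=True)
    return dead_points[idx]

# e.g. posterior_samples = resample_posterior(dead_points, dead_logl, N, 1000, rng)
```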
Why are there so many nested samplers?
The idea of nested sampling is simple, but there is one particular step in the algorithm that can be very inefficient. Different samplers handle that step differently and try to improve the efficiency of the algorithm.
So, which step is the troublemaker?
References:
[1] Feroz, F., Hobson, M. P., & Bridges, M. (2009). MultiNest: An efficient and robust Bayesian inference tool for cosmology and particle physics. Monthly Notices of the Royal Astronomical Society, 398(4), 1601–1614. https://doi.org/10.1111/j.1365-2966.2009.14548.x
[2] Inverse transform sampling - Wikipedia. (2021). Retrieved 5 December 2021, from https://en.wikipedia.org/wiki/Inverse_transform_sampling
[3] Speagle, J. (2020). dynesty: a dynamic nested sampling package for estimating Bayesian posteriors and evidences. Monthly Notices of the Royal Astronomical Society, 493(3), 3132-3158. https://doi.org/10.1093/mnras/staa278