11.5 Bayesian inference and signal detection

Probability is half the story. The other half is inference: given observed data, what can we conclude about the underlying state of the world? This lesson develops Bayes’ rule — the single most-used identity in inferential statistics — and applies it to two settings: Bayesian inference (continuous-parameter estimation) and signal detection theory (discriminating signal from noise). Both are central to perception (Hearing Ch 8).

Bayes’ rule

The joint probability $\mathrm{Pr}(A \text{ and } B)$ of two events can be factored two ways:

\mathrm{Pr}(A, B) \;=\; \mathrm{Pr}(A \mid B)\, \mathrm{Pr}(B) \;=\; \mathrm{Pr}(B \mid A)\, \mathrm{Pr}(A).

Equating the two right-hand sides and rearranging:

\boxed{\;\mathrm{Pr}(A \mid B) \;=\; \frac{\mathrm{Pr}(B \mid A)\, \mathrm{Pr}(A)}{\mathrm{Pr}(B)}.\;}

This is Bayes’ rule. It’s a one-line consequence of the definition of conditional probability. The novelty is the interpretation — and a whole school of statistics is built on it.

In an inferential setting, replace $A$ with a hypothesis $H$ (the parameter being estimated) and $B$ with the data $D$ (the observation). Bayes’ rule becomes

\underbrace{\mathrm{Pr}(H \mid D)}_{\text{posterior}} \;=\; \frac{\overbrace{\mathrm{Pr}(D \mid H)}^{\text{likelihood}} \cdot \overbrace{\mathrm{Pr}(H)}^{\text{prior}}}{\underbrace{\mathrm{Pr}(D)}_{\text{evidence}}}.

The four named pieces:

Prior $\mathrm{Pr}(H)$ — your belief about the hypothesis before seeing the data. Encodes background knowledge, default expectations, or “ignorance” if you want to be uninformative.
Likelihood $\mathrm{Pr}(D \mid H)$ — the probability of observing the data if the hypothesis were true. This is what the physical/biological model gives you (Gaussian noise, Poisson statistics, etc.).
Posterior $\mathrm{Pr}(H \mid D)$ — your updated belief about the hypothesis after seeing the data. The output of inference.
Evidence (or marginal likelihood) $\mathrm{Pr}(D) = \sum_H \mathrm{Pr}(D \mid H)\, \mathrm{Pr}(H)$: a normalising constant that makes the posterior sum to 1. When the hypothesis is a continuous parameter, the sum becomes an integral over $H$.

The structure is recursive: the likelihood turns the prior into a posterior, and if you then collect more data, the current posterior becomes the new prior for the next update. This sequential property, Bayesian updating, is what makes Bayes’ rule the inferential workhorse of online estimation, Kalman filtering, and modern probabilistic machine-learning systems.
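The update and its sequential property can be sketched for a discrete set of hypotheses. This is a minimal illustration, not a library API: the function name `bayes_update`, the coin hypotheses, and the head probabilities are all assumptions made up for the example. It checks that updating one observation at a time gives the same posterior as a single batch update with the joint likelihood.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over a discrete set of hypotheses.

    prior      -- dict mapping hypothesis -> Pr(H)
    likelihood -- dict mapping hypothesis -> Pr(D | H) for the observed D
    """
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalised.values())  # Pr(D), the normalising constant
    return {h: p / evidence for h, p in unnormalised.items()}


# Hypothetical example: is a coin fair, or biased towards heads?
prior = {"fair": 0.5, "biased": 0.5}
pr_heads = {"fair": 0.5, "biased": 0.8}  # Pr(heads | H), assumed values

# Sequential updating: three heads, one at a time.
# Each posterior becomes the prior for the next observation.
belief = dict(prior)
for _ in range(3):
    belief = bayes_update(belief, pr_heads)

# Batch updating: the joint likelihood of three independent heads at once.
batch = bayes_update(prior, {h: p ** 3 for h, p in pr_heads.items()})

print(belief)
print(batch)
```

The two printed posteriors agree up to floating-point rounding, because multiplying likelihoods one at a time and renormalising is the same as multiplying them all and renormalising once.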

A continuous example: estimating a Gaussian mean

The textbook starter example. We want to estimate the mean $\mu$ of a Gaussian distribution with known variance $\sigma_\text{obs}^2$ . We observe $n$ independent samples $x_1, x_2, \ldots, x_n$ . We have a Gaussian prior on $\mu$ : $\mathcal{N}(\mu_0, \sigma_0^2)$ .

The posterior turns out to be Gaussian (the Gaussian is self-conjugate under Gaussian likelihoods), with parameters

\boxed{\;\sigma_\text{post}^2 \;=\; \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_\text{obs}^2} \right)^{-1}, \qquad \mu_\text{post} \;=\; \sigma_\text{post}^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum x_i}{\sigma_\text{obs}^2} \right).\;}

In words: the posterior precision (reciprocal of variance) is the sum of the prior precision and the data precision. The posterior mean is the precision-weighted average of the prior mean and the sample mean.
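The boxed formulas translate almost line for line into code. The sketch below assumes a hypothetical helper name `gaussian_posterior` and three made-up observations; the prior and noise values are illustrative only.

```python
def gaussian_posterior(mu0, sigma0, sigma_obs, xs):
    """Posterior mean and variance for a Gaussian mean with known noise variance."""
    n = len(xs)
    prior_precision = 1.0 / sigma0 ** 2
    data_precision = n / sigma_obs ** 2
    # Precisions add; the posterior variance is the reciprocal of the total.
    var_post = 1.0 / (prior_precision + data_precision)
    # Precision-weighted combination of the prior mean and the data.
    mu_post = var_post * (mu0 / sigma0 ** 2 + sum(xs) / sigma_obs ** 2)
    return mu_post, var_post


xs = [1.2, 0.8, 1.6]  # hypothetical observations, sample mean 1.2
mu_post, var_post = gaussian_posterior(mu0=0.0, sigma0=1.5, sigma_obs=1.0, xs=xs)
print(mu_post, var_post)
```

With these numbers the posterior mean lands between the prior mean (0) and the sample mean (1.2), much closer to the sample mean because the data precision (3) is larger than the prior precision (about 0.44).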

▶ Why Gaussian × Gaussian = Gaussian Derivation

The prior is $f(\mu) \propto \exp(-(\mu - \mu_0)^2 / (2 \sigma_0^2))$ . The likelihood for observations $x_1, \ldots, x_n$ is

\mathrm{Pr}(\mathbf{x} \mid \mu) \;=\; \prod_{i=1}^n \frac{1}{\sigma_\text{obs}\sqrt{2\pi}}\, \exp\!\left( -\frac{(x_i - \mu)^2}{2 \sigma_\text{obs}^2} \right) \;\propto\; \exp\!\left( -\frac{\sum_i (x_i - \mu)^2}{2 \sigma_\text{obs}^2} \right).

The posterior (up to normalisation) is the product:

\mathrm{Pr}(\mu \mid \mathbf{x}) \;\propto\; \exp\!\left( -\frac{(\mu - \mu_0)^2}{2 \sigma_0^2} - \frac{\sum_i (x_i - \mu)^2}{2 \sigma_\text{obs}^2} \right).

The exponent is a quadratic in $\mu$ ; any quadratic-in- $\mu$ exponent is a Gaussian in $\mu$ . Complete the square: write the exponent as $-(\mu - \mu_\text{post})^2 / (2 \sigma_\text{post}^2)$ plus a $\mu$ -independent constant. Matching coefficients of $\mu^2$ and $\mu$ gives

\frac{1}{\sigma_\text{post}^2} \;=\; \frac{1}{\sigma_0^2} + \frac{n}{\sigma_\text{obs}^2}, \qquad \frac{\mu_\text{post}}{\sigma_\text{post}^2} \;=\; \frac{\mu_0}{\sigma_0^2} + \frac{\sum x_i}{\sigma_\text{obs}^2}.

Solving for $\sigma_\text{post}^2$ and $\mu_\text{post}$ gives the boxed formulas. A Gaussian prior combined with a Gaussian likelihood of known variance yields a Gaussian posterior, a property called conjugacy, and the algebraic update is just precision-weighted addition.
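The derivation can be checked numerically without completing the square: evaluate prior × likelihood on a grid of $\mu$ values, normalise, and compare the resulting mean and variance with the closed form. The grid range, step count, and observations below are assumptions chosen for illustration.

```python
import math


def grid_posterior_moments(mu0, sigma0, sigma_obs, xs,
                           lo=-10.0, hi=10.0, steps=20001):
    """Mean and variance of the posterior computed by brute force on a grid."""
    step = (hi - lo) / (steps - 1)
    grid = [lo + i * step for i in range(steps)]
    # Unnormalised log-posterior: log prior + log likelihood, constants dropped.
    log_post = [
        -(m - mu0) ** 2 / (2 * sigma0 ** 2)
        - sum((x - m) ** 2 for x in xs) / (2 * sigma_obs ** 2)
        for m in grid
    ]
    peak = max(log_post)  # subtract the peak to avoid underflow in exp
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)  # plays the role of the evidence
    mean = sum(w * m for w, m in zip(weights, grid)) / total
    var = sum(w * (m - mean) ** 2 for w, m in zip(weights, grid)) / total
    return mean, var


xs = [1.2, 0.8, 1.6]  # hypothetical observations
grid_mean, grid_var = grid_posterior_moments(0.0, 1.5, 1.0, xs)

# Closed form from matching coefficients of mu^2 and mu.
closed_var = 1.0 / (1.0 / 1.5 ** 2 + len(xs) / 1.0 ** 2)
closed_mean = closed_var * (0.0 / 1.5 ** 2 + sum(xs) / 1.0 ** 2)

print(grid_mean, closed_mean)
print(grid_var, closed_var)
```

The grid version never uses conjugacy, so agreement between the two pairs of numbers is an independent check on the algebra.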

[Interactive figure: prior, likelihood, and posterior densities for μ. Default settings: prior mean μ₀ = 0.00, prior std σ₀ = 1.50, observation noise σ_obs = 1.00, number of observations = 3.]

The blue prior is your belief about parameter μ before seeing any data. The red likelihood peaks at the empirical mean of the observations and is sharpened by both larger sample size and lower observation noise. The green posterior is Bayes' rule: prior × likelihood, renormalised. With many observations, the posterior tracks the likelihood (data dominates). With few observations or high noise, the prior pulls the posterior toward itself. This is the inferential engine underwriting [Hearing 8 — perception as Bayesian inference](/hearing/meaning/bayes).

Drag the prior mean and width; drag the observations; watch the posterior reform as the precision-weighted compromise. Three things to feel for:

With a tight prior, the posterior tracks the prior. Even when the data say otherwise, a confident prior pulls hard. You’d need many strong observations to overcome it.
With a flat prior, the posterior tracks the data. $\sigma_0 \to \infty$ makes the prior uninformative; the posterior becomes the likelihood, centred at the sample mean.
More observations sharpen the posterior. Once the data precision dominates the prior precision, the posterior variance shrinks like $\sigma_\text{obs}^2/n$, a direct consequence of the precision-additivity rule.
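The three regimes above can also be seen in numbers. The sketch below uses a hypothetical helper `posterior` that takes the sample mean and count directly; all parameter values are made up for illustration.

```python
def posterior(mu0, sigma0, sigma_obs, n, sample_mean):
    """Posterior (mean, variance) given n observations with the stated sample mean."""
    precision = 1.0 / sigma0 ** 2 + n / sigma_obs ** 2
    var_post = 1.0 / precision
    mu_post = var_post * (mu0 / sigma0 ** 2 + n * sample_mean / sigma_obs ** 2)
    return mu_post, var_post


# Three observations with sample mean 2.0, prior centred at 0.
tight = posterior(mu0=0.0, sigma0=0.1, sigma_obs=1.0, n=3, sample_mean=2.0)
flat = posterior(mu0=0.0, sigma0=100.0, sigma_obs=1.0, n=3, sample_mean=2.0)
print("tight prior:", tight)  # mean stays near the prior mean 0
print("flat prior: ", flat)   # mean sits almost exactly at the sample mean 2

# Growing n: n * var_post approaches sigma_obs^2 = 1, i.e. variance ~ 1/n.
scaled = {}
for n in (1, 10, 100, 1000):
    _, var_post = posterior(0.0, 1.5, 1.0, n, 2.0)
    scaled[n] = n * var_post
    print(n, var_post, scaled[n])
```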

▶ Bayesian posterior for a hearing-test detection scenario Worked Example

A screening test for hearing loss has sensitivity 90% (detects loss when present) and specificity 85% (correctly passes normal hearing). The prevalence of hearing loss in the tested population is 10%. A patient tests positive. What is the posterior probability they actually have hearing loss?

Let $H$ = hearing loss. Prior: $P(H) = 0.10$ . Likelihood: $P(+|H) = 0.90$ . False-alarm rate: $P(+|\neg H) = 0.15$ .

Evidence: $P(+) = P(+|H)\,P(H) + P(+|\neg H)\,P(\neg H) = 0.90\times0.10 + 0.15\times0.90 = 0.09 + 0.135 = 0.225.$

Posterior: $P(H|+) = \frac{P(+|H)\,P(H)}{P(+)} = \frac{0.09}{0.225} = 0.40.$

Despite the positive test, there is only a 40% chance of actual hearing loss: with a low base rate (10%), most positives are false alarms (0.135 of the 0.225 total probability of a positive result). This is one reason clinical audiometry uses multiple frequencies and repeat testing.
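The worked example is short enough to reproduce in code, and the same function shows what repeat testing buys. The helper name `posterior_given_positive` is an assumption, and so is the modelling choice for the second test: it is treated as conditionally independent of the first given the patient's true state, which real repeat tests may not satisfy.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Pr(condition | positive test) from Bayes' rule."""
    false_alarm_rate = 1.0 - specificity
    evidence = sensitivity * prior + false_alarm_rate * (1.0 - prior)
    return sensitivity * prior / evidence


# Numbers from the worked example: prevalence 10%, sensitivity 90%, specificity 85%.
after_one = posterior_given_positive(0.10, 0.90, 0.85)

# Assumed: a second, conditionally independent positive test.
# The first posterior is used as the new prior.
after_two = posterior_given_positive(after_one, 0.90, 0.85)

print(round(after_one, 3))  # 0.4
print(round(after_two, 3))  # 0.8
```

Under that independence assumption, a second positive result doubles the posterior from 0.40 to 0.80.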

A note on priors

The Bayesian framework requires you to specify a prior: what you believed before seeing the data. This is sometimes felt as a weakness (“subjective!”), and various attempts have been made to extract “objective” priors from symmetry or invariance arguments. In practice the prior matters most when data is scarce. Once you have many observations, the data dominates and the choice of prior becomes largely irrelevant, provided the prior gives nonzero weight to the true value. That is exactly what one would want.

The Bayesian and frequentist schools of statistics differ chiefly in whether they treat the parameter as having a distribution. To a Bayesian, $\mu$ is a random variable with a posterior; to a frequentist, $\mu$ is a fixed (unknown) number and the data is random. The two formalisms produce numerically identical answers in many practical settings — the disagreement is philosophical, not arithmetic.
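One concrete case where the two schools agree numerically is the Gaussian mean with known variance: with a very wide prior, the 95% Bayesian credible interval coincides with the 95% frequentist confidence interval. The sketch below uses hypothetical function names and made-up data; `sigma0 = 1e6` stands in for a flat prior.

```python
import math
from statistics import NormalDist


def credible_interval(mu0, sigma0, sigma_obs, xs, level=0.95):
    """Central Bayesian credible interval for mu under a Gaussian prior."""
    n = len(xs)
    var_post = 1.0 / (1.0 / sigma0 ** 2 + n / sigma_obs ** 2)
    mu_post = var_post * (mu0 / sigma0 ** 2 + sum(xs) / sigma_obs ** 2)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(var_post)
    return mu_post - half_width, mu_post + half_width


def confidence_interval(sigma_obs, xs, level=0.95):
    """Frequentist confidence interval for mu with known noise std."""
    n = len(xs)
    sample_mean = sum(xs) / n
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma_obs / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


xs = [1.2, 0.8, 1.6, 1.0]  # hypothetical observations
bayes = credible_interval(mu0=0.0, sigma0=1e6, sigma_obs=1.0, xs=xs)
freq = confidence_interval(sigma_obs=1.0, xs=xs)
print(bayes)
print(freq)
```

The endpoints match to many decimal places, yet the two intervals are read differently: the first is a statement about where $\mu$ probably lies given the data, the second about how often the procedure covers the fixed $\mu$ over repeated experiments.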

Signal detection theory

A related but distinct inferential setting: given a single noisy observation, decide between two hypotheses. Is there a signal, or is it just noise? This is the signal detection problem, and it underpins much of psychophysics (including hearing thresholds), radar processing, medical screening, and audio compression artefact detection.

The classical setup: under the “noise-only” hypothesis $H_0$ , the observation $X$ is drawn from a distribution $f_0$ . Under the “signal-plus-noise” hypothesis $H_1$ , $X$ is drawn from a distribution $f_1$ (typically the same shape as $f_0$ shifted by the signal amplitude). The optimal Bayesian decision rule is to compute the likelihood ratio

L(x) \;=\; \frac{f_1(x)}{f_0(x)}

and compare it to a threshold $c$. If $L(x) > c$ , declare signal; otherwise declare noise. The threshold $c$ encodes the costs of the two types of error and the prior probabilities of the hypotheses.
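The decision rule can be sketched for the common case of equal-variance Gaussians, with noise centred at 0 and signal centred at a hypothetical amplitude `signal`. The threshold formula used here, $c = \dfrac{C_\text{FA}\,\mathrm{Pr}(H_0)}{C_\text{miss}\,\mathrm{Pr}(H_1)}$, is the standard minimum-expected-cost choice under the assumption that correct decisions cost nothing; the function names and numbers are illustrative.

```python
import math


def gaussian_pdf(x, mean, sigma):
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def likelihood_ratio(x, signal=1.0, sigma=1.0):
    """L(x) = f1(x) / f0(x) with f0 = N(0, sigma^2) and f1 = N(signal, sigma^2)."""
    return gaussian_pdf(x, signal, sigma) / gaussian_pdf(x, 0.0, sigma)


def bayes_threshold(p_signal, cost_false_alarm=1.0, cost_miss=1.0):
    """Threshold c on L(x), assuming zero cost for correct decisions."""
    return (cost_false_alarm * (1.0 - p_signal)) / (cost_miss * p_signal)


def declare_signal(x, c, signal=1.0, sigma=1.0):
    return likelihood_ratio(x, signal, sigma) > c


def criterion_on_x(c, signal=1.0, sigma=1.0):
    """Equivalent cut on x itself: L(x) is monotone in x for this model."""
    return sigma ** 2 * math.log(c) / signal + signal / 2.0


c_equal = bayes_threshold(p_signal=0.5)  # equal priors, equal costs -> c = 1
c_rare = bayes_threshold(p_signal=0.1)   # rare signal -> c = 9, stricter
print(c_equal, criterion_on_x(c_equal))  # cut at the midpoint x = 0.5
print(c_rare, criterion_on_x(c_rare))    # cut moves to larger x
```

Because the likelihood ratio is monotone in $x$ for this model, thresholding $L(x)$ at $c$ is the same as thresholding $x$ at a single criterion. A rarer signal or a costlier false alarm raises that criterion.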

Four outcomes

The decision produces one of four outcomes:

| | Truth: $H_0$ (no signal) | Truth: $H_1$ (signal) |
|---|---|---|
| Declare $H_0$ | True negative | Miss (false negative) |
| Declare $H_1$ | False alarm (false positive) | Hit (true positive) |

Two summary statistics matter:

Hit rate (or true positive rate, or sensitivity) $H = \mathrm{Pr}(\text{declare } H_1 \mid H_1)$ .
False-alarm rate (or false positive rate) $F = \mathrm{Pr}(\text{declare } H_1 \mid H_0)$ .

As you lower the decision threshold, both $H$ and $F$ go up: you catch more signals but also flag more noise. The ROC curve (Receiver Operating Characteristic) plots $H$ versus $F$ as the threshold sweeps. A useless detector — one whose output is independent of the true class — lies on the diagonal $H = F$ . A perfect detector reaches the upper-left corner $H = 1, F = 0$ . The area under the curve (AUC) measures detector quality; AUC = 0.5 is chance, AUC = 1 is perfect.
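A threshold sweep is easy to simulate. The sketch below draws made-up noise and signal samples from unit-variance Gaussians (an assumption for illustration, with a fixed seed), builds the ROC points by counting hits and false alarms at each threshold, and estimates the AUC with the trapezoid rule. A second "detector" whose output ignores the true class is included for comparison.

```python
import random


def roc_curve(noise, signal, thresholds):
    """List of (false-alarm rate, hit rate) points, from strict to lax threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(thresholds, reverse=True):
        false_alarm = sum(x > t for x in noise) / len(noise)
        hit = sum(x > t for x in signal) / len(signal)
        points.append((false_alarm, hit))
    points.append((1.0, 1.0))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    return sum((f1 - f0) * (h0 + h1) / 2.0
               for (f0, h0), (f1, h1) in zip(points, points[1:]))


rng = random.Random(0)  # fixed seed so the run is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
signal = [rng.gauss(1.0, 1.0) for _ in range(2000)]   # shifted by one noise std
useless = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # same distribution as noise

thresholds = [-5.0 + 0.05 * i for i in range(221)]  # sweep from -5 to 6

auc_good = area_under_curve(roc_curve(noise, signal, thresholds))
auc_useless = area_under_curve(roc_curve(noise, useless, thresholds))
print(auc_good)     # clearly above 0.5
print(auc_useless)  # close to 0.5, the diagonal
```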

The $d'$ statistic

When the signal-plus-noise and noise-only distributions are both Gaussian with the same standard deviation $\sigma$ but means differing by $\Delta\mu$ , the detector’s quality is summarised by the sensitivity index

d' \;=\; \frac{\Delta\mu}{\sigma}.

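$d'$ is the signal amplitude in units of the noise standard deviation. $d' = 1$ is a marginal detector (typical psychophysical threshold); $d' = 3$ is comfortable; $d' = 5$ or more is essentially unambiguous. Under this equal-variance Gaussian model, the AUC of the ROC curve and $d'$ are equivalent measures: $\mathrm{AUC} = \Phi(d'/\sqrt{2})$, or equivalently $d' = \sqrt{2}\, \Phi^{-1}(\mathrm{AUC})$ , where $\Phi$ is the standard normal cumulative distribution function.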

In psychophysics, the experimentally measured $d'$ tells you the effective signal-to-noise ratio at which a perceptual system can discriminate. The auditory threshold for detecting a tone in noise corresponds to $d' \approx 1$ at the just-noticeable level, by convention.
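In practice $d'$ is usually estimated from a measured hit rate and false-alarm rate, using $d' = \Phi^{-1}(H) - \Phi^{-1}(F)$, which follows from the equal-variance Gaussian model. The sketch below uses Python's standard-library `statistics.NormalDist`; the function names and the example rates are assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def dprime_from_rates(hit_rate, false_alarm_rate):
    """d' = z(H) - z(F) under the equal-variance Gaussian model."""
    return STD_NORMAL.inv_cdf(hit_rate) - STD_NORMAL.inv_cdf(false_alarm_rate)


def auc_from_dprime(d_prime):
    return STD_NORMAL.cdf(d_prime / math.sqrt(2.0))


def dprime_from_auc(auc):
    return math.sqrt(2.0) * STD_NORMAL.inv_cdf(auc)


# A detector with d' = 1 and its criterion midway between the two means
# has H = Phi(0.5) and F = Phi(-0.5).
hit = STD_NORMAL.cdf(0.5)
false_alarm = STD_NORMAL.cdf(-0.5)

d_est = dprime_from_rates(hit, false_alarm)
auc = auc_from_dprime(d_est)
print(d_est)  # recovers d' = 1
print(auc)    # about 0.76
print(dprime_from_auc(auc))
```

Hit or false-alarm rates of exactly 0 or 1 make `inv_cdf` raise an error, so measured rates at those extremes need some correction before this formula applies.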

History

⏳ The history — Bayes 1763, Laplace 1774, and a 200-year argument

Thomas Bayes was a Presbyterian minister and amateur mathematician in 18th-century England. He wrote An Essay towards solving a Problem in the Doctrine of Chances sometime before his death in 1761, but never published it. The manuscript was found among his papers by Richard Price, who edited and submitted it to the Royal Society; it appeared in the Philosophical Transactions in 1763, two years after Bayes had died.

The paper introduced what we now call Bayes’ rule — initially as a special case for the binomial distribution — and applied it to the problem of estimating an unknown probability from observed successes and failures. The crucial conceptual move was to treat the unknown parameter (the probability of success) as itself having a distribution. This was philosophically radical: parameters were generally thought of as fixed unknowns, not as random variables.

Pierre-Simon Laplace independently rediscovered and generalised the rule in his 1774 Mémoire sur la probabilité des causes par les événements. Laplace took it much further — using Bayesian arguments throughout his career to tackle problems from celestial mechanics (determining the orbits of comets) to demography (estimating population sizes from birth-rate data).

The Bayesian / frequentist split crystallised in the early 20th century, with Ronald Fisher, Jerzy Neyman, and Egon Pearson on the frequentist side arguing for objective, prior-free statistics, and Harold Jeffreys, Bruno de Finetti, and L. J. Savage on the Bayesian side defending the degree-of-belief interpretation of probability. The argument lasted decades; modern statistics largely shrugs and uses both. The rise of computational Bayesian methods (Markov-chain Monte Carlo, variational inference) in the 1990s tipped the practical balance toward Bayesian methods for complex models, and the spread of probabilistic-programming languages (Stan, PyMC, Pyro) has made Bayesian inference a routine tool in applied statistics and machine learning.

Read the original: An Essay towards solving a Problem in the Doctrine of Chances (Thomas Bayes, 1763)

What we use this for

Bayesian inference and signal detection appear repeatedly:

Bayesian perception (Hearing 8.2): the brain combines a prior over stimuli with sensory likelihoods to compute a perceptual posterior. The McGurk effect, phonemic restoration, and the Shepard tone are all commonly interpreted as consequences of this inferential structure.
Predictive coding (Hearing 8.4) — a neural-circuit-level implementation of approximate Bayesian inference.
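Psychophysical thresholds: measured by $d'$ , plotted as ROC curves, fitted with signal-detection-theory models. In a 2-alternative-forced-choice task, 50% correct is chance ( $d' = 0$ ); the commonly used 75%-correct threshold corresponds to $d' = \sqrt{2}\,\Phi^{-1}(0.75) \approx 0.95$ under the equal-variance Gaussian model.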
Speech perception in noise — every speech-in-noise audiometric test is a signal-detection problem.
Bayesian inference in modern engineering — Kalman filters, particle filters, ensemble Kalman filters, probabilistic-graphical-model algorithms — all built on Bayes’ rule and conjugate Gaussian updates.
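The link between the last item and the Gaussian update earlier in this lesson can be made concrete with a scalar Kalman filter. This is a minimal sketch under stated assumptions: a one-dimensional random-walk state model, hypothetical names, and made-up observations. With zero process noise, the filter reduces exactly to sequential conjugate Gaussian updating.

```python
def kalman_step(mean, var, z, process_var, obs_var):
    """One predict + update cycle for a scalar random-walk state."""
    # Predict: the state may drift, so uncertainty grows by the process noise.
    var_pred = var + process_var
    # Update: the conjugate Gaussian update written in gain form.
    gain = var_pred / (var_pred + obs_var)
    mean_new = mean + gain * (z - mean)
    var_new = (1.0 - gain) * var_pred
    return mean_new, var_new


xs = [1.2, 0.8, 1.6]  # hypothetical observations
mean, var = 0.0, 1.5 ** 2  # prior N(0, 1.5^2)
for z in xs:
    mean, var = kalman_step(mean, var, z, process_var=0.0, obs_var=1.0)

# Batch conjugate update with the same prior and data, for comparison.
batch_var = 1.0 / (1.0 / 1.5 ** 2 + len(xs) / 1.0)
batch_mean = batch_var * (0.0 / 1.5 ** 2 + sum(xs) / 1.0)

print(mean, var)
print(batch_mean, batch_var)
```

Setting `process_var` above zero makes the filter gradually forget old data, which is what lets it track a state that changes over time.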

Closing the chapter

That closes Foundations 11. The arc of the chapter: random variables and their distributions (11.1) give the vocabulary; the Central Limit Theorem (11.2) explains why the Gaussian is ubiquitous as the asymptotic shape of summed independent noise; random walks (11.3) and Poisson processes (11.4) are the two canonical stochastic models that capture most of what physical noise looks like at the macroscale; and Bayes’ rule (this lesson) is the inferential engine for going from data back to underlying state.