11.5 Bayesian inference and signal detection

Probability is half the story. The other half is inference: given observed data, what can we conclude about the underlying state of the world? This lesson develops Bayes’ rule — the single most-used identity in inferential statistics — and applies it to two settings the bookshelf needs: Bayesian inference (continuous-parameter estimation) and signal detection theory (discriminating signal from noise). Both are central to perception, which is the bridge to Hearing Ch 8.

Bayes’ rule

The joint probability $\mathrm{Pr}(A \text{ and } B)$ of two events can be factored two ways:

$$\mathrm{Pr}(A, B) \;=\; \mathrm{Pr}(A \mid B)\, \mathrm{Pr}(B) \;=\; \mathrm{Pr}(B \mid A)\, \mathrm{Pr}(A).$$

Equating the two right-hand sides and rearranging:

$$\boxed{\;\mathrm{Pr}(A \mid B) \;=\; \frac{\mathrm{Pr}(B \mid A)\, \mathrm{Pr}(A)}{\mathrm{Pr}(B)}.\;}$$

This is Bayes’ rule. It’s a one-line consequence of the definition of conditional probability. The novelty is the interpretation — and a whole school of statistics is built on it.

In an inferential setting, replace $A$ with a hypothesis $H$ (the parameter being estimated) and $B$ with the data $D$ (the observation). Bayes’ rule becomes

$$\underbrace{\mathrm{Pr}(H \mid D)}_{\text{posterior}} \;=\; \frac{\overbrace{\mathrm{Pr}(D \mid H)}^{\text{likelihood}} \cdot \overbrace{\mathrm{Pr}(H)}^{\text{prior}}}{\underbrace{\mathrm{Pr}(D)}_{\text{evidence}}}.$$

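The four named pieces:

- Posterior $\mathrm{Pr}(H \mid D)$: the belief in the hypothesis after seeing the data.
- Likelihood $\mathrm{Pr}(D \mid H)$: the probability of the observed data if the hypothesis were true.
- Prior $\mathrm{Pr}(H)$: the belief in the hypothesis before seeing the data.
- Evidence $\mathrm{Pr}(D)$: the total probability of the data across all hypotheses, which acts as the normalising constant.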

The update has a simple structure: the data turn the prior into a posterior via the likelihood. If you then collect more data, the current posterior becomes the new prior for the next update. This sequential property, Bayesian updating, is what makes Bayes’ rule the inferential workhorse of online estimation, Kalman filtering, and modern probabilistic machine-learning systems.
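A minimal sketch of sequential updating over a discrete set of hypotheses may make this concrete. The coin example, the hypothesis names, and the function `bayes_update` are illustrative assumptions, not part of the lesson. The evidence is computed as the sum of likelihood times prior over all hypotheses.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses, given dicts keyed by hypothesis."""
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalised.values())  # Pr(D), summed over all hypotheses
    return {h: p / evidence for h, p in unnormalised.items()}


# Hypothetical setup: a coin is either fair or biased towards heads.
prior = {"fair": 0.5, "biased": 0.5}
pr_heads = {"fair": 0.5, "biased": 0.8}  # likelihood of one head under each hypothesis

# Sequential updating: after each head, the posterior becomes the next prior.
posterior = prior
for _ in range(3):
    posterior = bayes_update(posterior, pr_heads)

# Batch updating: use the likelihood of all three heads at once.
batch = bayes_update(prior, {h: p ** 3 for h, p in pr_heads.items()})

print(posterior)
print(batch)
```

Both routes should give the same posterior, since multiplying the three single-observation likelihoods one at a time is the same as multiplying by their product once.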

A continuous example: estimating a Gaussian mean

The textbook starter example. We want to estimate the mean $\mu$ of a Gaussian distribution with known variance $\sigma_\text{obs}^2$. We observe $n$ independent samples $x_1, x_2, \ldots, x_n$. We have a Gaussian prior on $\mu$: $\mathcal{N}(\mu_0, \sigma_0^2)$.

The posterior turns out to be Gaussian (the Gaussian is self-conjugate under Gaussian likelihoods, a beautiful algebraic gift), with parameters

$$\boxed{\;\sigma_\text{post}^2 \;=\; \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_\text{obs}^2} \right)^{-1}, \qquad \mu_\text{post} \;=\; \sigma_\text{post}^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum x_i}{\sigma_\text{obs}^2} \right).\;}$$

In words: the posterior precision (reciprocal of variance) is the sum of the prior precision and the data precision. The posterior mean is the precision-weighted average of the prior mean and the sample mean.
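The boxed formulas translate almost directly into code. The sketch below uses the values from this lesson's figure (prior $\mathcal{N}(0.0, 2.25)$, observations 1.50, 2.00, 1.80, $\sigma_\text{obs} = 1.0$); the function name `gaussian_posterior` is an illustrative choice.

```python
def gaussian_posterior(mu0, var0, xs, var_obs):
    """Posterior mean and variance for a Gaussian mean with known noise variance."""
    n = len(xs)
    precision_post = 1.0 / var0 + n / var_obs  # prior precision + data precision
    var_post = 1.0 / precision_post
    mu_post = var_post * (mu0 / var0 + sum(xs) / var_obs)
    return mu_post, var_post


mu_post, var_post = gaussian_posterior(mu0=0.0, var0=2.25, xs=[1.5, 2.0, 1.8], var_obs=1.0)
print(round(mu_post, 2), round(var_post, 3))  # prints: 1.54 0.29
```

The result agrees with the posterior $\mathcal{N}(1.54, 0.290)$ quoted for the figure, where the second argument is the variance.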

Why Gaussian × Gaussian = Gaussian

The prior is $f(\mu) \propto \exp(-(\mu - \mu_0)^2 / (2 \sigma_0^2))$. The likelihood for observations $x_1, \ldots, x_n$ is

$$\mathrm{Pr}(\mathbf{x} \mid \mu) \;=\; \prod_{i=1}^n \frac{1}{\sigma_\text{obs}\sqrt{2\pi}}\, \exp\!\left( -\frac{(x_i - \mu)^2}{2 \sigma_\text{obs}^2} \right) \;\propto\; \exp\!\left( -\frac{\sum_i (x_i - \mu)^2}{2 \sigma_\text{obs}^2} \right).$$

The posterior (up to normalisation) is the product:

$$\mathrm{Pr}(\mu \mid \mathbf{x}) \;\propto\; \exp\!\left( -\frac{(\mu - \mu_0)^2}{2 \sigma_0^2} - \frac{\sum_i (x_i - \mu)^2}{2 \sigma_\text{obs}^2} \right).$$

The exponent is a quadratic in $\mu$ with a negative $\mu^2$ coefficient, and any such exponent describes a Gaussian in $\mu$. Complete the square: write the exponent as $-(\mu - \mu_\text{post})^2 / (2 \sigma_\text{post}^2)$ plus a $\mu$-independent constant. Matching coefficients of $\mu^2$ and $\mu$ gives

$$\frac{1}{\sigma_\text{post}^2} \;=\; \frac{1}{\sigma_0^2} + \frac{n}{\sigma_\text{obs}^2}, \qquad \frac{\mu_\text{post}}{\sigma_\text{post}^2} \;=\; \frac{\mu_0}{\sigma_0^2} + \frac{\sum x_i}{\sigma_\text{obs}^2}.$$

Solving for $\sigma_\text{post}^2$ and $\mu_\text{post}$ gives the boxed formulas. A Gaussian prior combined with a Gaussian likelihood yields a Gaussian posterior, a property called conjugacy, and the algebraic update is just precision-weighted addition.
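The coefficient matching can be written out term by term. Expanding both squares and collecting powers of $\mu$ gives the following, where $C$ and $C'$ are names introduced here for the $\mu$-independent constants:

$$
\begin{aligned}
-\frac{(\mu - \mu_0)^2}{2 \sigma_0^2} - \frac{\sum_i (x_i - \mu)^2}{2 \sigma_\text{obs}^2}
&= -\frac{1}{2} \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_\text{obs}^2} \right) \mu^2 + \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma_\text{obs}^2} \right) \mu + C, \\
-\frac{(\mu - \mu_\text{post})^2}{2 \sigma_\text{post}^2}
&= -\frac{1}{2}\, \frac{1}{\sigma_\text{post}^2}\, \mu^2 + \frac{\mu_\text{post}}{\sigma_\text{post}^2}\, \mu + C'.
\end{aligned}
$$

Equating the $\mu^2$ coefficients gives the precision identity, and equating the $\mu$ coefficients gives the identity for $\mu_\text{post} / \sigma_\text{post}^2$.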

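[Figure: prior, likelihood, and posterior densities over the parameter $\mu$ (axis from -4 to 4), with observations $x_1 = 1.50$, $x_2 = 2.00$, $x_3 = 1.80$. Prior $\mathcal{N}(0.0, 2.25)$, 3 observations ($\sigma_\text{obs} = 1.0$), posterior $\mathcal{N}(1.54, 0.290)$.]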

The blue prior is your belief about parameter μ before seeing any data. The red likelihood peaks at the empirical mean of the observations and is sharpened by both larger sample size and lower observation noise. The green posterior is Bayes' rule: prior × likelihood, renormalised. With many observations, the posterior tracks the likelihood (data dominates). With few observations or high noise, the prior pulls the posterior toward itself. This is the inferential engine underwriting [Hearing 8 — perception as Bayesian inference](/hearing/meaning/bayes).
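The "prior × likelihood, renormalised" recipe can also be carried out numerically, without using the closed-form result. The sketch below evaluates the unnormalised posterior on a grid of $\mu$ values and normalises it by brute force. The grid range, the resolution, and the name `grid_posterior` are assumptions chosen for illustration.

```python
import math


def grid_posterior(mu0, var0, xs, var_obs, lo=-10.0, hi=10.0, steps=20001):
    """Approximate the posterior mean and variance of mu on a uniform grid."""
    step = (hi - lo) / (steps - 1)
    grid = [lo + i * step for i in range(steps)]

    def log_unnormalised(mu):
        log_prior = -((mu - mu0) ** 2) / (2.0 * var0)
        log_likelihood = -sum((x - mu) ** 2 for x in xs) / (2.0 * var_obs)
        return log_prior + log_likelihood

    logs = [log_unnormalised(mu) for mu in grid]
    peak = max(logs)  # subtract the peak before exponentiating, for numerical stability
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)  # plays the role of the evidence, up to a constant factor
    probs = [w / total for w in weights]

    mean = sum(mu * p for mu, p in zip(grid, probs))
    var = sum((mu - mean) ** 2 * p for mu, p in zip(grid, probs))
    return mean, var


grid_mean, grid_var = grid_posterior(0.0, 2.25, [1.5, 2.0, 1.8], 1.0)
print(grid_mean, grid_var)
```

For the figure's values the grid estimate should land very close to the closed-form posterior mean and variance. The same brute-force approach still works when the prior is not Gaussian and no closed form exists.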

Drag the prior mean and width; drag the observations; watch the posterior reform as the precision-weighted compromise. Three things to feel for:

A note on priors

The Bayesian framework requires you to specify a prior: what you believed before seeing the data. This is sometimes felt as a weakness (“subjective!”), and various attempts have been made to extract “objective” priors from symmetry or invariance arguments. In practice the prior matters most when data are scarce. Once you have many observations, the data dominate and the posterior becomes insensitive to the choice of prior (provided the prior does not rule out the true value), which is exactly what one would want.
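A short numerical sketch shows how quickly the prior washes out in the Gaussian-mean example. Two deliberately opposed priors, $\mathcal{N}(-5, 1)$ and $\mathcal{N}(5, 1)$, are updated on the same synthetic data. The true mean, the seed, and the prior values are hypothetical choices made for illustration only.

```python
import random


def gaussian_posterior_mean(mu0, var0, xs, var_obs):
    """Posterior mean of a Gaussian mean under a Gaussian prior, known noise variance."""
    var_post = 1.0 / (1.0 / var0 + len(xs) / var_obs)
    return var_post * (mu0 / var0 + sum(xs) / var_obs)


rng = random.Random(0)  # fixed seed so the synthetic data are reproducible
TRUE_MU, VAR_OBS = 1.0, 1.0
data = [rng.gauss(TRUE_MU, VAR_OBS ** 0.5) for _ in range(1000)]

gaps = {}
for n in (1, 10, 100, 1000):
    sample = data[:n]
    low = gaussian_posterior_mean(-5.0, 1.0, sample, VAR_OBS)   # pessimistic prior
    high = gaussian_posterior_mean(5.0, 1.0, sample, VAR_OBS)   # optimistic prior
    gaps[n] = high - low
    print(n, round(low, 3), round(high, 3), round(gaps[n], 3))
```

With these settings the gap between the two posterior means works out to $10 / (1 + n)$ regardless of the data, so it shrinks from 5 at $n = 1$ to about 0.01 at $n = 1000$.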

The Bayesian and frequentist schools of statistics differ chiefly in whether they treat the parameter as having a distribution. To a Bayesian, $\mu$ is a random variable with a posterior; to a frequentist, $\mu$ is a fixed (unknown) number and the data is random. The two formalisms produce numerically identical answers in many practical settings; the disagreement is philosophical, not arithmetic.

Signal detection theory

A related but distinct inferential setting: given a single noisy observation, decide between two hypotheses. Is there a signal, or is it just noise? This is the signal detection problem, and it underpins much of psychophysics (including hearing thresholds), radar processing, medical screening, and audio compression artefact detection.

The classical setup: under the “noise-only” hypothesis $H_0$, the observation $X$ is drawn from a distribution $f_0$. Under the “signal-plus-noise” hypothesis $H_1$, $X$ is drawn from a distribution $f_1$ (typically the same shape as $f_0$ shifted by the signal amplitude). The optimal Bayesian decision rule is to compute the likelihood ratio

$$L(x) \;=\; \frac{f_1(x)}{f_0(x)}$$

and compare to a threshold. If $L > c$, declare signal; otherwise declare noise. The threshold $c$ encodes the costs of the two types of error and the prior probabilities of the hypotheses.
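The likelihood-ratio rule can be sketched for two unit-variance Gaussians. The lesson does not give a formula for the threshold $c$. The sketch assumes the standard minimum-expected-cost form, with zero cost for correct decisions: $c = \mathrm{Pr}(H_0)\, C_\text{FA} / (\mathrm{Pr}(H_1)\, C_\text{miss})$. The distributions, cost names, and function names are illustrative.

```python
from statistics import NormalDist


def likelihood_ratio(x, noise, signal):
    """L(x) = f1(x) / f0(x) for the two hypothesised distributions."""
    return signal.pdf(x) / noise.pdf(x)


def bayes_threshold(prior_signal, cost_false_alarm=1.0, cost_miss=1.0):
    """Assumed minimum-expected-cost threshold; correct decisions cost nothing."""
    prior_noise = 1.0 - prior_signal
    return (prior_noise * cost_false_alarm) / (prior_signal * cost_miss)


def decide(x, noise, signal, c):
    """Declare 'signal' when the likelihood ratio exceeds the threshold c."""
    return "signal" if likelihood_ratio(x, noise, signal) > c else "noise"


noise = NormalDist(mu=0.0, sigma=1.0)   # f0: noise only
signal = NormalDist(mu=2.0, sigma=1.0)  # f1: noise shifted by the signal amplitude

c_equal = bayes_threshold(prior_signal=0.5)  # equal priors and costs give c = 1
c_rare = bayes_threshold(prior_signal=0.1)   # rare signals raise the bar

print(decide(1.5, noise, signal, c_equal))  # prints: signal
print(decide(1.5, noise, signal, c_rare))   # prints: noise
```

The same observation flips from "signal" to "noise" when signals are rare a priori, which is the sense in which the threshold encodes the prior probabilities.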

Four outcomes

The decision produces one of four outcomes:

| | Truth: $H_0$ (no signal) | Truth: $H_1$ (signal) |
|---|---|---|
| Declare $H_0$ | True negative | Miss (false negative) |
| Declare $H_1$ | False alarm (false positive) | Hit (true positive) |

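Two summary statistics matter:

- The hit rate $H = \mathrm{Pr}(\text{declare } H_1 \mid H_1)$: the fraction of signal trials correctly detected.
- The false-alarm rate $F = \mathrm{Pr}(\text{declare } H_1 \mid H_0)$: the fraction of noise-only trials wrongly flagged as signal.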

As you lower the decision threshold, both the hit rate $H$ and the false-alarm rate $F$ go up: you catch more signals but also flag more noise. The ROC curve (Receiver Operating Characteristic) plots $H$ versus $F$ as the threshold sweeps. A useless detector, one whose output is independent of the true class, lies on the diagonal $H = F$. A perfect detector reaches the upper-left corner $H = 1, F = 0$. The area under the curve (AUC) measures detector quality; AUC = 0.5 is chance, AUC = 1 is perfect.
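An ROC curve can be traced by sweeping a threshold and recording the hit rate (probability of declaring signal when a signal is present) against the false-alarm rate (probability of declaring signal when only noise is present). The sketch below assumes equal-variance Gaussians, and it thresholds the observation $x$ directly. That is equivalent to thresholding the likelihood ratio only because $L(x)$ increases with $x$ in this setup. The threshold grid and function names are illustrative.

```python
from statistics import NormalDist


def roc_points(noise, signal, thresholds):
    """Return (false-alarm rate, hit rate) pairs, sorted by false-alarm rate."""
    points = []
    for t in thresholds:
        hit = 1.0 - signal.cdf(t)          # Pr(X > t | signal present)
        false_alarm = 1.0 - noise.cdf(t)   # Pr(X > t | noise only)
        points.append((false_alarm, hit))
    return sorted(points)


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (f0, h0), (f1, h1) in zip(points, points[1:]):
        area += (f1 - f0) * (h0 + h1) / 2.0
    return area


thresholds = [-8.0 + 0.01 * i for i in range(1801)]  # sweep from -8 to 10

noise = NormalDist(0.0, 1.0)
signal = NormalDist(1.0, 1.0)  # means one noise standard deviation apart
auc = auc_trapezoid(roc_points(noise, signal, thresholds))

chance_auc = auc_trapezoid(roc_points(noise, noise, thresholds))  # useless detector

print(round(auc, 2), round(chance_auc, 2))  # prints: 0.76 0.5
```

The useless detector, where signal and noise distributions coincide, traces the diagonal and recovers an area of 0.5.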

The $d'$ statistic

When the signal-plus-noise and noise-only distributions are both Gaussian with the same standard deviation $\sigma$ but means differing by $\Delta\mu$, the detector’s quality is summarised by the sensitivity index

$$d' \;=\; \frac{\Delta\mu}{\sigma}.$$

$d'$ is the signal amplitude in units of the noise standard deviation. $d' = 1$ is a marginal detector (typical psychophysical threshold); $d' = 3$ is comfortable; $d' = 5$ or more is essentially unambiguous. Under this equal-variance Gaussian model, the AUC of the ROC curve and $d'$ are equivalent measures: $d' = \sqrt{2}\, \Phi^{-1}(\mathrm{AUC})$, where $\Phi$ is the standard normal cumulative distribution function.
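The conversion between $d'$ and AUC is a one-liner in each direction. The sketch below also includes the standard estimator $d' = \Phi^{-1}(H) - \Phi^{-1}(F)$ from a single hit rate and false-alarm rate. That estimator is not stated in this lesson, so treat it as an added assumption that holds for the equal-variance Gaussian model. Function names are illustrative.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def auc_from_dprime(d_prime):
    """AUC = Phi(d' / sqrt(2)) under the equal-variance Gaussian model."""
    return STANDARD_NORMAL.cdf(d_prime / math.sqrt(2.0))


def dprime_from_auc(auc):
    """d' = sqrt(2) * Phi^{-1}(AUC); requires 0 < auc < 1."""
    return math.sqrt(2.0) * STANDARD_NORMAL.inv_cdf(auc)


def dprime_from_rates(hit_rate, false_alarm_rate):
    """Assumed standard estimator: difference of z-scores of the two rates."""
    return STANDARD_NORMAL.inv_cdf(hit_rate) - STANDARD_NORMAL.inv_cdf(false_alarm_rate)


for d_prime in (1.0, 3.0, 5.0):
    print(d_prime, auc_from_dprime(d_prime))

# Rates produced by a threshold at 0.7 when the true d' is 1.5.
hit_rate = 1.0 - NormalDist(1.5, 1.0).cdf(0.7)
false_alarm_rate = 1.0 - NormalDist(0.0, 1.0).cdf(0.7)
recovered = dprime_from_rates(hit_rate, false_alarm_rate)
print(recovered)
```

The three example values of $d'$ correspond to AUCs of roughly 0.76, 0.98, and 0.9998, which gives a feel for "marginal", "comfortable", and "essentially unambiguous".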

In psychophysics, the experimentally measured $d'$ tells you the signal-to-noise ratio at which a perceptual system can discriminate. The auditory threshold for detecting a tone in noise has $d' \approx 1$ at the just-noticeable level, because that is how the threshold is conventionally defined.

History

The history — Bayes 1763, Laplace 1774, and a 200-year argument

Thomas Bayes was a Presbyterian minister and amateur mathematician in 18th-century England. He wrote An Essay towards solving a Problem in the Doctrine of Chances sometime before his death in 1761, but never published it. The manuscript was found among his papers by Richard Price, who edited and submitted it to the Royal Society; it appeared in the Philosophical Transactions in 1763, two years after Bayes had died.

The paper introduced what we now call Bayes’ rule — initially as a special case for the binomial distribution — and applied it to the problem of estimating an unknown probability from observed successes and failures. The crucial conceptual move was to treat the unknown parameter (the probability of success) as itself having a distribution. This was philosophically radical: parameters were generally thought of as fixed unknowns, not as random variables.

Pierre-Simon Laplace independently rediscovered and generalised the rule in his 1774 Mémoire sur la probabilité des causes par les événements. Laplace took it much further — using Bayesian arguments throughout his career to tackle problems from celestial mechanics (determining the orbits of comets) to demography (estimating population sizes from birth-rate data).

The Bayesian / frequentist split crystallised in the early 20th century, with Ronald Fisher, Jerzy Neyman, and Egon Pearson on the frequentist side arguing for objective, prior-free statistics, and Harold Jeffreys, Bruno de Finetti, and L. J. Savage on the Bayesian side defending probability as a degree of belief. The argument lasted decades; modern statistics largely shrugs and uses both. The rise of computational Bayesian methods (Markov-chain Monte Carlo, variational inference) in the 1990s tipped the practical balance toward Bayesian methods for complex models, and machine learning’s adoption of probabilistic-programming languages (Stan, PyMC, Pyro) has made Bayesian inference a routine choice in many applied settings.

What we use this for

Bayesian inference and signal detection appear repeatedly:

Closing the chapter

That closes Foundations 11. The five lessons developed the working subset of probability and statistics the bookshelf uses: random variables and the named distributions (11.1), the Gaussian and the Central Limit Theorem that makes it ubiquitous (11.2), random walks and Brownian motion (11.3), Poisson processes (11.4), and Bayesian inference and signal detection theory (this lesson).

The arc of the chapter, in one paragraph: the world is statistical because most physical signals are sums of many independent micro-fluctuations (the CLT picture); random walks and Poisson processes are the two canonical stochastic models that capture most of what physical noise looks like at the macroscale; Bayes’ rule is the inferential engine for going from data back to underlying state. Almost everything probabilistic in the rest of the bookshelf is an instance of one of these.