The Gaussian distribution (or normal distribution) is the single most-used probability distribution in science. There is a reason: it is the asymptotic distribution of any sum of many independent, identically-distributed random variables with finite variance — the Central Limit Theorem. This means almost every “noise” we measure in physics, biology, finance, and engineering is approximately Gaussian, because most noises are sums of many independent micro-fluctuations.
This lesson develops the Gaussian, derives why it has the form it does, states the Central Limit Theorem, and demonstrates the CLT with an interactive. We also touch on the multivariate Gaussian, which becomes central in Bayesian inference.
The Gaussian PDF
The one-dimensional Gaussian (or normal) distribution with mean $\mu$ and variance $\sigma^2$ is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
It is the familiar “bell curve” centred at $\mu$ with width $\sigma$. The notation $N(\mu, \sigma^2)$ denotes “a Gaussian with mean $\mu$ and variance $\sigma^2$.” A standard normal is $N(0,1)$.
Three features encode almost everything about the Gaussian’s behaviour:
Symmetric about the mean. $f(\mu+t) = f(\mu-t)$ for any $t$.
Exponentially small tails. $f(\mu+k\sigma)$ falls as $e^{-k^2/2}$. At $|x-\mu| = 3\sigma$ the density is already $\sim 1\%$ of its peak. By $5\sigma$ it is a few times $10^{-6}$ of the peak.
The 68-95-99.7 rule. About 68% of the probability lies within ±1σ of the mean, 95% within ±2σ, 99.7% within ±3σ. Worth memorising.
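The tail and coverage figures above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `density_ratio` and `prob_within` are illustrative, not part of the lesson.

```python
import math

def density_ratio(k: float) -> float:
    """Density at mu + k*sigma divided by the peak density at mu."""
    return math.exp(-k * k / 2.0)

def prob_within(k: float) -> float:
    """P(|X - mu| <= k*sigma) for a Gaussian, via the error function."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3, 5):
    print(f"k={k}: density ratio {density_ratio(k):.2e}, "
          f"mass within +/- k sigma {prob_within(k):.5f}")
```

Both quantities depend only on $k$, not on $\mu$ or $\sigma$, which is why the rule can be memorised once and applied to any Gaussian.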
The normalisation constant $1/(\sigma\sqrt{2\pi})$ is fixed by the requirement that $\int f(x)\,dx = 1$:
Why the normalisation involves √(2π)
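Compute the integral
$$I = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx.$$
There is no elementary antiderivative. The trick is to square it, writing the second factor with a different dummy variable:
$$I^2 = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy.$$
Switch to polar coordinates $(r,\theta)$ with $r^2 = x^2+y^2$ and $dx\,dy = r\,dr\,d\theta$:
$$I^2 = \int_0^{2\pi} d\theta \int_0^{\infty} e^{-r^2/2}\, r\,dr = 2\pi \cdot \left[-e^{-r^2/2}\right]_0^{\infty} = 2\pi.$$
So $I = \sqrt{2\pi}$. For a Gaussian with general $\mu$ and $\sigma$, substitute $u = (x-\mu)/\sigma$, $du = dx/\sigma$:
$$\int_{-\infty}^{\infty} e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \sigma \int_{-\infty}^{\infty} e^{-u^2/2}\,du = \sigma\sqrt{2\pi}.$$
Hence the normalisation factor $1/(\sigma\sqrt{2\pi})$.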
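As a sanity check on the derivation, the integral can also be estimated numerically. This sketch uses a plain midpoint rule over $\pm 10\sigma$; the function name, the truncation width, and the example values of $\mu$ and $\sigma$ are arbitrary choices made for illustration.

```python
import math

def gaussian_integral(mu: float, sigma: float,
                      half_width: float = 10.0, steps: int = 20_000) -> float:
    """Midpoint-rule estimate of the integral of exp(-(x - mu)^2 / (2 sigma^2))."""
    a = mu - half_width * sigma
    b = mu + half_width * sigma
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return total * h

estimate = gaussian_integral(mu=1.5, sigma=0.7)
exact = 0.7 * math.sqrt(2.0 * math.pi)
print(f"numerical: {estimate:.10f}   sigma*sqrt(2*pi): {exact:.10f}")
```

Truncating at $10\sigma$ discards a tail of relative size around $e^{-50}$, far below floating-point resolution, so the agreement is limited only by rounding.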
Why the Gaussian is everywhere: the Central Limit Theorem
The Gaussian’s ubiquity is not a coincidence. The Central Limit Theorem says:
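Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables with mean $\mu$ and finite variance $\sigma^2$. Define the sample sum $S_n = X_1 + X_2 + \cdots + X_n$. Then as $n \to \infty$,
$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \longrightarrow N(0,1)$$
in distribution.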
In words: the sum of $n$ independent samples is approximately Gaussian with mean $n\mu$ and variance $n\sigma^2$, regardless of the underlying distribution of the $X_i$. The convergence is in distribution: the CDF of the standardised sum approaches the Gaussian CDF at every point.
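Convergence in distribution can be made concrete by measuring the largest gap between the empirical CDF of the standardised sum and the standard normal CDF. The sketch below assumes Exponential(1) summands (so that $\mu = \sigma = 1$); the function names, the trial count, and the seed are illustrative choices.

```python
import math
import random

def normal_cdf(z: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardised_sum(n: int, rng: random.Random) -> float:
    """(S_n - n*mu) / (sigma*sqrt(n)) for n Exponential(1) draws, where mu = sigma = 1."""
    s = sum(rng.expovariate(1.0) for _ in range(n))
    return (s - n) / math.sqrt(n)

def max_cdf_gap(n: int, trials: int = 10_000, seed: int = 0) -> float:
    """Largest gap between the empirical CDF of the standardised sum and the N(0,1) CDF."""
    rng = random.Random(seed)
    zs = sorted(standardised_sum(n, rng) for _ in range(trials))
    gap = 0.0
    for i, z in enumerate(zs):
        phi = normal_cdf(z)
        # The empirical CDF jumps from i/trials to (i+1)/trials at z.
        gap = max(gap, abs(phi - i / trials), abs((i + 1) / trials - phi))
    return gap

gaps = {n: max_cdf_gap(n) for n in (1, 5, 50)}
for n, gap in gaps.items():
    print(f"n={n:3d}: max CDF gap = {gap:.3f}")
```

The gap shrinks as $n$ grows but never reaches exactly zero in a simulation, because a finite number of trials contributes its own sampling noise.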
This is one of the deepest theorems in probability. It does not depend on what the $X_i$ actually are (Bernoulli, exponential, uniform, an irregular bimodal mixture), only on the finite-variance condition. The Gaussian is the attractor under summation, much as a Gaussian profile is the long-time attractor of the heat equation under time evolution.
The CLT, made visible
The Central Limit Theorem: regardless of the distribution we sample from, the sum of N independent samples approaches a Gaussian as N grows. At N = 1 the histogram traces the underlying distribution itself: uniform, exponential, or bimodal. By N = 5 the shape is already nearly Gaussian; by N = 10 it is indistinguishable from one with mean Nμ and variance Nσ², where μ and σ² are the mean and variance of a single sample. The red curve is the theoretical CLT prediction. The convergence happens for any distribution with finite variance; the only thing that changes is how fast.
Pick an underlying distribution (uniform, exponential, or bimodal) and slide the number of summands $N$ from 1 to 30. The histogram is the empirical distribution of $\sum_{i=1}^{N} X_i$ over 20,000 trials; the red curve is the theoretical CLT prediction $N(N\mu, N\sigma^2)$.
A few things to take from playing with this:
At N=1, the histogram traces the underlying distribution itself.
By N=5 the shape is already nearly Gaussian.
By N=10 it is indistinguishable from the theoretical Gaussian in any visible feature.
The mean of the sum grows linearly: $E[S_N] = N\mu$. The standard deviation grows as the square root: $\mathrm{std}(S_N) = \sigma\sqrt{N}$. The relative width $\sigma_{S_N}/|E[S_N]|$ therefore shrinks as $1/\sqrt{N}$: the sum becomes proportionally tighter around its mean.
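For the bimodal distribution, the convergence is slower (the histogram retains a slight bumpiness at small N) but still occurs. The symmetric uniform converges fastest among the three. The exponential is skewed, and the skewness of its sum decays only as $2/\sqrt{N}$, so a slight right lean persists to fairly large N.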
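The scaling claims in the list above can be reproduced outside the interactive. The sketch below is not the interactive's own code: it assumes Uniform(0, 1) and Exponential(1) as the underlying distributions, and the names `sum_stats` and `distributions` are hypothetical.

```python
import math
import random

def sum_stats(draw, n, trials=20_000, seed=1):
    """Empirical mean, standard deviation and skewness of the sum of n draws."""
    rng = random.Random(seed)
    sums = [sum(draw(rng) for _ in range(n)) for _ in range(trials)]
    mean = sum(sums) / trials
    std = math.sqrt(sum((s - mean) ** 2 for s in sums) / trials)
    skew = sum(((s - mean) / std) ** 3 for s in sums) / trials
    return mean, std, skew

# name -> (sampler, single-draw mean, single-draw standard deviation)
distributions = {
    "uniform": (lambda rng: rng.random(), 0.5, math.sqrt(1.0 / 12.0)),
    "exponential": (lambda rng: rng.expovariate(1.0), 1.0, 1.0),
}

results = {}
for name, (draw, mu, sigma) in distributions.items():
    for n in (1, 4, 16):
        mean, std, skew = sum_stats(draw, n)
        results[(name, n)] = (mean, std, skew)
        print(f"{name:12s} N={n:2d}  mean {mean:7.3f} (N*mu = {n * mu:.3f})  "
              f"std {std:6.3f} (sigma*sqrt(N) = {sigma * math.sqrt(n):.3f})  "
              f"skew {skew:+.3f}")
```

Skewness is zero for an exact Gaussian, so it serves as a simple one-number measure of how far the sum still is from the limiting shape.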
This “everyone’s the same after enough adding” is what makes the Gaussian so dominant in physics. Almost every noise we measure is a sum of many tiny independent fluctuations — molecular collisions, photon counts, thermal motions — and the CLT says the sum is Gaussian regardless of the underlying micro-distribution.
History
The history: from de Moivre to Laplace to Gauss
The bell curve’s first appearance was in 1733, when Abraham de Moivre computed the limiting shape of the binomial distribution as $n \to \infty$. He showed that $\binom{n}{k} p^k (1-p)^{n-k}$ is approximately Gaussian in $k$ for large $n$, what we’d now call a special case of the Central Limit Theorem. The result was buried in an obscure pamphlet; few people read it.
The curve was rediscovered and popularised by Pierre-Simon Laplace, who derived a more general central-limit result in his 1812 Théorie analytique des probabilités. Laplace argued that sums of many independent measurement errors should be Gaussian-distributed, regardless of the individual error distributions — the modern CLT framing.
Carl Friedrich Gauss developed the distribution from a completely different angle in 1809: he asked, what distribution makes the sample mean the maximum-likelihood estimator of the true value? The unique answer is the Gaussian. This is why we call it Gaussian today, even though de Moivre had the curve roughly three-quarters of a century earlier and Laplace had the limit theorem.
The proof of the CLT in its modern form is due to Aleksandr Lyapunov in 1901 and Jarl Waldemar Lindeberg in 1922. The Lindeberg condition, a precise statement of “no individual $X_i$ should dominate the sum”, is what makes the theorem rigorous.
Multivariate Gaussian
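For a vector-valued random variable $X = (X_1, \ldots, X_d)$ in $\mathbb{R}^d$, the multivariate Gaussian with mean $\mu$ and covariance matrix $\Sigma$ has PDF
$$f(x) = \frac{1}{\sqrt{(2\pi)^d \det\Sigma}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right).$$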
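The covariance matrix $\Sigma_{ij} = E[(X_i-\mu_i)(X_j-\mu_j)]$ encodes both the spread of each component (diagonal entries) and the linear correlation between components (off-diagonal entries). For a multivariate Gaussian, a diagonal $\Sigma$ means the components are independent, because the density then factorises into a product of one-dimensional Gaussians.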
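The link between a diagonal covariance and independence can be checked directly: the joint density should equal the product of the one-dimensional densities. This is a minimal sketch assuming NumPy is available; `mvn_pdf` and `normal_pdf` are hypothetical helper names, and the numerical values are arbitrary.

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Density of N(mu, cov) at x, using a linear solve instead of an explicit inverse."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = x - mu
    quad = diff @ np.linalg.solve(cov, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2.0 * np.pi) ** mu.size * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

def normal_pdf(x, mu, sigma):
    """One-dimensional Gaussian density."""
    return float(np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi)))

mu = np.array([1.0, -2.0])
x = np.array([1.3, -0.5])

cov_diag = np.diag([0.5 ** 2, 2.0 ** 2])   # uncorrelated components
cov_corr = np.array([[0.25, 0.6],
                     [0.6, 4.0]])          # same variances, non-zero covariance

joint_diag = mvn_pdf(x, mu, cov_diag)
joint_corr = mvn_pdf(x, mu, cov_corr)
product = normal_pdf(x[0], mu[0], 0.5) * normal_pdf(x[1], mu[1], 2.0)

print(f"diagonal Sigma:   joint {joint_diag:.6f}   product of marginals {product:.6f}")
print(f"correlated Sigma: joint {joint_corr:.6f}")
```

With the off-diagonal entry switched on, the marginal variances are unchanged but the joint density no longer factorises.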
The contours of constant density are ellipsoids aligned with the eigenvectors of $\Sigma$, a direct application of the eigenvalue analysis from Linear Algebra. The principal axes are the eigenvectors, and the principal lengths scale as $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues. Principal-component analysis (PCA) is exactly the eigendecomposition of an empirical covariance matrix.
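The eigenvector picture can be sketched in a few lines of NumPy. The covariance matrix below is an arbitrary example, and the variable names are illustrative; the second half performs a bare-bones PCA on samples drawn from that Gaussian.

```python
import numpy as np

cov = np.array([[3.0, 1.2],
                [1.2, 1.0]])

# eigh is intended for symmetric matrices: eigenvalues come back in
# ascending order and the eigenvectors are the orthonormal columns.
eigvals, eigvecs = np.linalg.eigh(cov)
semi_axes = np.sqrt(eigvals)  # semi-axis lengths of the one-sigma contour

# Tips of the principal axes, taking the mean as the origin.
tips = eigvecs * semi_axes  # column i is sqrt(lambda_i) * v_i
# Each tip should satisfy t^T Sigma^{-1} t = 1, i.e. lie on the one-sigma ellipse.
mahalanobis_sq = np.einsum("ij,ij->j", tips, np.linalg.solve(cov, tips))

# PCA: eigendecomposition of the empirical covariance of samples.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=50_000)
emp_cov = np.cov(samples, rowvar=False)
emp_vals, emp_vecs = np.linalg.eigh(emp_cov)

print("true eigenvalues:     ", eigvals)
print("empirical eigenvalues:", emp_vals)
print("squared Mahalanobis distance of axis tips:", mahalanobis_sq)
```

Eigenvectors are only defined up to sign, so a comparison between the true and empirical directions should use the absolute value of their dot product.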
The multivariate Gaussian is the natural workhorse for Bayesian inference on multi-dimensional parameters: conjugate Gaussian priors and Gaussian likelihoods produce Gaussian posteriors via a closed-form update, allowing entire belief states to be passed through algorithms as (μ,Σ) pairs. The Kalman filter, the Gauss–Markov theorem, and most of “linear filtering theory” live in this corner of the world.
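The closed-form update can be illustrated for the simplest case. The sketch below assumes a Gaussian prior $N(\mu_0, \Sigma_0)$ on a parameter vector $\theta$ and a single direct observation $y = \theta + \text{noise}$ with Gaussian noise covariance $\Sigma_n$; this observation model and the function name `gaussian_update` are assumptions made for the example, not something specified in the lesson.

```python
import numpy as np

def gaussian_update(mu0, cov0, y, cov_noise):
    """Posterior (mu, Sigma) for prior N(mu0, cov0) and one observation y ~ N(theta, cov_noise)."""
    prec0 = np.linalg.inv(cov0)            # prior precision
    prec_noise = np.linalg.inv(cov_noise)  # observation precision
    cov_post = np.linalg.inv(prec0 + prec_noise)          # precisions add
    mu_post = cov_post @ (prec0 @ mu0 + prec_noise @ y)   # precision-weighted mean
    return mu_post, cov_post

mu0 = np.array([0.0, 0.0])
cov0 = np.array([[4.0, 0.0],
                 [0.0, 4.0]])      # broad prior
y = np.array([1.0, -1.0])
cov_noise = np.array([[1.0, 0.5],
                      [0.5, 1.0]])  # correlated measurement noise

mu_post, cov_post = gaussian_update(mu0, cov0, y, cov_noise)
print("posterior mean:", mu_post)
print("posterior covariance:\n", cov_post)
```

The belief state going in and coming out is the same kind of object, a $(\mu, \Sigma)$ pair, which is what lets such updates be chained.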
Standard error and confidence
A practical corollary of the CLT. The sample mean of n independent samples is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
By the CLT, $\bar{X}$ is approximately Gaussian with mean $\mu$ (the true population mean) and variance $\sigma^2/n$. The standard error of the sample mean is $\sigma/\sqrt{n}$.
This is the famous “$1/\sqrt{n}$ scaling” of measurement uncertainty. Averaging four independent measurements halves the uncertainty. Averaging 100 measurements reduces it by a factor of 10. Averaging a million measurements reduces it by a factor of 1000.
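A short simulation shows the scaling at work. The sketch assumes Uniform(0, 1) measurements, for which $\sigma = 1/\sqrt{12}$; the function name, trial count, and seed are illustrative.

```python
import math
import random

SIGMA = math.sqrt(1.0 / 12.0)  # standard deviation of one Uniform(0, 1) draw

def spread_of_sample_mean(n, trials=5_000, seed=2):
    """Empirical standard deviation of the mean of n Uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    centre = sum(means) / trials
    return math.sqrt(sum((m - centre) ** 2 for m in means) / trials)

spreads = {n: spread_of_sample_mean(n) for n in (1, 4, 100)}
for n, spread in spreads.items():
    print(f"n={n:3d}: observed spread {spread:.4f}, "
          f"sigma/sqrt(n) = {SIGMA / math.sqrt(n):.4f}")
```

Going from 1 to 4 measurements should roughly halve the observed spread, and going from 1 to 100 should cut it by roughly a factor of 10.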
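The confidence interval for the true mean, given a sample mean $\bar{X}$, is $\bar{X} \pm z\sigma/\sqrt{n}$, where $z$ depends on the desired confidence level: $z = 1.96$ for 95% confidence, $z = 2.58$ for 99%, $z = 3$ for $\sim 99.7\%$. These are the $\pm 2\sigma$ and $\pm 3\sigma$ figures of the 68-95-99.7 rule, applied to the standard error.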
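The $z$ values come from inverting the standard normal CDF, which Python's standard library exposes as `statistics.NormalDist().inv_cdf`. The sketch below assumes $\sigma$ is known; the helper names and the example numbers (sample mean 10.0, $\sigma = 2$, $n = 100$) are made up for illustration.

```python
import math
from statistics import NormalDist

def z_for_confidence(level: float) -> float:
    """Two-sided critical value: P(|Z| <= z) = level for a standard normal Z."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)

def mean_confidence_interval(sample_mean: float, sigma: float, n: int,
                             level: float = 0.95) -> tuple[float, float]:
    """Interval sample_mean +/- z * sigma / sqrt(n), with sigma assumed known."""
    half_width = z_for_confidence(level) * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

for level in (0.95, 0.99, 0.9973):
    print(f"confidence {level}: z = {z_for_confidence(level):.3f}")

low, high = mean_confidence_interval(sample_mean=10.0, sigma=2.0, n=100)
print(f"95% interval: ({low:.3f}, {high:.3f})")
```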
When $\sigma$ is unknown and must be estimated from the same sample, the Gaussian is replaced by Student’s t-distribution, which has slightly heavier tails to account for the additional uncertainty in $\sigma$. For sample sizes $n > 30$ the t-distribution is close enough to the Gaussian (its 95% critical value is about 2.04 at 30 degrees of freedom, against 1.96) that most practical work uses the Gaussian approximation throughout.
What we use this for
Gaussians and the CLT show up wherever many small noises add:
Thermal noise: Johnson–Nyquist voltage fluctuations across a resistor, Gaussian with variance $4 k_B T R\,\Delta f$.
Measurement error: any reading of a noisy instrument is approximately Gaussian about the true value if the underlying physics has many independent error sources.
Diffusion: the spatial distribution of a Brownian particle is Gaussian with variance growing linearly in time. Developed properly in 11.3.
Photon shot noise at high intensity: a Poisson distribution with large mean is approximately Gaussian (the CLT applied to the Poisson distribution).
Bayesian inference with Gaussian priors and likelihoods: closed-form Gaussian posteriors. Developed in 11.5.
Modal density and statistical room acoustics: the random superposition of many room modes has a Gaussian envelope by the CLT. Connects to Sound 7.8.
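The shot-noise item in the list above is easy to check numerically: a Poisson distribution with mean $\lambda$ should approach a Gaussian with mean $\lambda$ and variance $\lambda$ as $\lambda$ grows. The sketch below is one way to quantify that; the function names and the choice of a 6-standard-deviation window are illustrative.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of k counts, computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """One-dimensional Gaussian density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def relative_gap(lam: float) -> float:
    """Largest |Poisson - Gaussian| within 6 standard deviations, as a fraction of the Gaussian peak."""
    half = 6.0 * math.sqrt(lam)
    k_lo = max(0, int(lam - half))
    k_hi = int(lam + half)
    worst = max(abs(poisson_pmf(k, lam) - gaussian_pdf(k, lam, lam))
                for k in range(k_lo, k_hi + 1))
    return worst / gaussian_pdf(lam, lam, lam)

for lam in (4.0, 40.0, 400.0):
    print(f"mean count {lam:5.0f}: largest mismatch = {100 * relative_gap(lam):.1f}% of peak")
```

The mismatch falls roughly as $1/\sqrt{\lambda}$, the same rate at which the skewness of the Poisson distribution decays.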
The next lesson, 11.3, develops random walks — the sum-of-i.i.d. picture of the CLT used to derive Brownian motion and the diffusion equation.