8.2 Spectrograms and the time-frequency picture

Most interesting sounds are not stationary: their frequency content changes over time. Speech is a sequence of phonemes; music is a sequence of notes; a slamming door is a transient. The Fourier transform of the whole signal collapses this temporal structure into a single spectrum. That spectrum is useful, but its magnitude no longer says when each frequency occurred. The fix is the short-time Fourier transform (STFT): chop the signal into overlapping windows, Fourier-transform each window, and plot the squared magnitude as a 2-D image with time on one axis and frequency on the other.

The result is a spectrogram, and it is the single most useful visualisation in audio analysis.

Construction

For a signal $x(t)$ and a window function $w(t)$ (e.g., a Gaussian or a Hamming window) of width $T_w$,

$$X(t, \omega) \;=\; \int_{-\infty}^\infty x(\tau)\, w(\tau - t)\, e^{-i\omega \tau}\, d\tau.$$

This is a function of two variables: time $t$ (where the window is centred) and frequency $\omega$. Take its magnitude squared, $|X(t, \omega)|^2$, and you have a 2-D heatmap: the spectrogram.
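A discrete version of this construction can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `stft_spectrogram`, the Hamming window, the frame length of 256 samples, and the hop of 64 samples are all assumptions chosen for the example. Each frame is transformed relative to its own start, which differs from the integral above only by a phase factor and leaves the squared magnitude unchanged.

```python
import numpy as np

def stft_spectrogram(x, fs, win_len=256, hop=64):
    """Return (times, freqs, power) for a real 1-D signal x sampled at fs Hz.

    power[k, m] is |X|^2 at frequency freqs[k] for the frame centred at times[m].
    """
    window = np.hamming(win_len)                    # w(t), sampled
    n_frames = 1 + (len(x) - win_len) // hop        # frames that fit entirely in x
    power = np.empty((win_len // 2 + 1, n_frames))
    for m in range(n_frames):
        start = m * hop
        frame = x[start:start + win_len] * window   # x(tau) * w(tau - t)
        power[:, m] = np.abs(np.fft.rfft(frame)) ** 2
    times = (np.arange(n_frames) * hop + win_len / 2) / fs
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)
    return times, freqs, power

# Test signal (an assumption for the demo): 440 Hz for 0.5 s, then 880 Hz.
fs = 8000
t = np.arange(fs) / fs
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

times, freqs, power = stft_spectrogram(x, fs)
peak_early = freqs[np.argmax(power[:, 0])]    # strongest bin in the first frame
peak_late = freqs[np.argmax(power[:, -1])]    # strongest bin in the last frame
print(peak_early, peak_late)
```

The strongest bin should sit near 440 Hz in the first frame and near 880 Hz in the last, to within the bin spacing of `fs / win_len`. A whole-signal Fourier transform would show both peaks but not their order. Plotting `power` (usually in decibels) against `times` and `freqs` gives the heatmap.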

A small interactive synthesizer

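[Interactive figure: time-domain waveform p(t) over 13.6 ms (three periods of f₀), beside a line spectrum (amplitude against frequency) with sliders for the harmonics at 220, 440, 660, 880, 1100, and 1320 Hz. Initial amplitudes: 1.00 for the fundamental and 0.00 for the other five. Preset buttons select the waveforms described below.]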

Adjust the amplitudes of the first six harmonics of a 220 Hz fundamental, watch the time-domain waveform and the line spectrum side by side, and press "play sound" to hear it. The presets are the canonical waveforms whose Fourier series are textbook material: pure sine, square wave (odd harmonics with $1/n$ falloff), sawtooth (all harmonics with $1/n$ falloff and alternating signs), triangle (odd harmonics with $1/n^2$ falloff), and a stylised vowel-like spectrum.
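The additive synthesis behind such presets can be sketched offline in NumPy. The `PRESETS` table and the `synthesize` function below are hypothetical names, not the widget's actual code. The amplitudes are the textbook Fourier-series coefficients truncated to six harmonics and normalised so the fundamental is 1, with the sign carrying the phase. The vowel-like preset is omitted because its amplitudes are not given in the text.

```python
import numpy as np

F0 = 220.0  # fundamental frequency in Hz

# Amplitudes of harmonics 1..6 (hypothetical table; signs encode phase).
PRESETS = {
    "sine":     [1, 0, 0, 0, 0, 0],
    "square":   [1, 0, 1 / 3, 0, 1 / 5, 0],                   # odd harmonics, 1/n
    "sawtooth": [1, -1 / 2, 1 / 3, -1 / 4, 1 / 5, -1 / 6],    # all harmonics, alternating
    "triangle": [1, 0, -1 / 9, 0, 1 / 25, 0],                 # odd harmonics, 1/n^2
}

def synthesize(amplitudes, duration=3 / F0, fs=44100):
    """Sum of sine harmonics of F0; the default duration is three periods."""
    t = np.arange(int(round(duration * fs))) / fs
    p = sum(a * np.sin(2 * np.pi * n * F0 * t)
            for n, a in enumerate(amplitudes, start=1))
    return t, p

# Time-domain view: three periods (about 13.6 ms) of the truncated sawtooth.
t_short, p_short = synthesize(PRESETS["sawtooth"])

# Frequency-domain view: with exactly 1 s of signal, FFT bin k sits at k Hz,
# so the harmonic amplitudes can be read straight off the spectrum.
t_long, p_long = synthesize(PRESETS["square"], duration=1.0)
spectrum = 2 * np.abs(np.fft.rfft(p_long)) / len(p_long)
recovered = [spectrum[int(n * F0)] for n in range(1, 7)]
print(np.round(recovered, 3))
```

Reading `recovered` back from the FFT should reproduce the magnitudes of the square-wave preset: the waveform and the line spectrum carry the same information. With only six harmonics, the square and sawtooth shapes show visible ripple near their jumps.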

This synthesizer doesn’t yet show a spectrogram (a time-series of such spectra) — that’s a more elaborate visualisation we’ll build in a later iteration. What it does show is the equivalence: the time-domain waveform and the frequency-domain spectrum are two views of the same object, neither more complete than the other.

The window-width tradeoff

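The width $T_w$ of the analysis window sets the resolution: a short window localises events in time but smears them in frequency, while a long window does the reverse. The limit comes from the uncertainty principle:

$$\Delta t \cdot \Delta \omega \;\geq\; \tfrac12.$$

No window gives sharp resolution in both time and frequency. Choosing the width is the central design decision in spectrogram analysis, and different applications call for different windows: short ones to localise transients, long ones to separate closely spaced tones.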
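The bound can be checked numerically. The sketch below assumes the usual definitions, which the text does not spell out: $\Delta t$ is the RMS width of $|w(t)|^2$ and $\Delta\omega$ is the RMS width of $|W(\omega)|^2$, where $W$ is the Fourier transform of the window. The helper name `spreads`, the sample spacing, and the three window choices are illustrative assumptions.

```python
import numpy as np

def spreads(w, dt):
    """RMS widths (delta_t, delta_omega) of a window sampled every dt seconds."""
    n = len(w)
    t = (np.arange(n) - n // 2) * dt
    e_t = np.abs(w) ** 2
    e_t = e_t / e_t.sum()                        # normalised energy density in time
    mean_t = (t * e_t).sum()
    delta_t = np.sqrt(((t - mean_t) ** 2 * e_t).sum())

    omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)  # angular frequency grid, rad/s
    e_w = np.abs(np.fft.fft(w)) ** 2
    e_w = e_w / e_w.sum()                        # normalised energy density in frequency
    mean_w = (omega * e_w).sum()
    delta_omega = np.sqrt(((omega - mean_w) ** 2 * e_w).sum())
    return delta_t, delta_omega

dt, n = 1e-3, 4096
t = (np.arange(n) - n // 2) * dt

windows = {
    "gaussian_5ms": np.exp(-t**2 / (2 * 0.005**2)),
    "gaussian_20ms": np.exp(-t**2 / (2 * 0.020**2)),
    "hann_200ms": np.pad(np.hanning(200), (0, n - 200)),  # zero-padded to n samples
}

products = {name: float(np.prod(spreads(w, dt))) for name, w in windows.items()}
for name, value in products.items():
    print(f"{name}: delta_t * delta_omega = {value:.4f}")
```

Both Gaussians should give a product very close to 0.5 whatever their width, since the Gaussian is the window that attains the bound. Narrowing it in time widens it in frequency by the same factor. The Hann window should come out slightly above 0.5. In sampled terms, an $N$-sample window at sampling rate $f_s$ spans $N/f_s$ seconds and gives a bin spacing of $f_s/N$ Hz: 256 samples at 8 kHz cover 32 ms with 31.25 Hz bins.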

What spectrograms reveal

A spectrogram of human speech makes phonetic structure visible: vowels appear as horizontal bars at the formant frequencies; fricatives as broadband high-frequency noise; stops as silences followed by bursts; voiced segments show the harmonic ladder of the vocal folds.

A spectrogram of music shows the harmonic series of pitched notes (vertical stacks of evenly spaced horizontal lines), chord changes (one harmonic stack disappears and another appears), rhythm (periodicity along the time axis), and timbre (the relative strengths of the harmonics of each note).

A spectrogram of birdsong reveals sweeps, trills, and species-specific frequency patterns. A spectrogram of underwater sound shows ship noise, marine mammal calls, and seismic activity, each in its own frequency band.

The same tool, applied differently, is the foundation for speech recognition (as input features for ASR systems), music information retrieval, bioacoustic monitoring, mechanical fault diagnosis, and ultrasonic medical imaging, among many other applications.

What we use this for in the rest of the book