3.2 The speech banana and the audibility map

The audiogram shows what the patient can detect; the speech banana shows where the acoustic energy of speech falls in frequency and level. Overlaying the two answers a question that pure tones alone cannot: which phonemes is this listener missing? A patient whose threshold curve dips into the banana, even by 10 dB, loses portions of the speech signal entirely. The losses are systematic: low-frequency loss costs vowels; high-frequency loss costs fricatives; mid-frequency loss costs voiced consonants and formant transitions.

This lesson develops the long-term average speech spectrum (LTASS), the banana shape it produces on the audiogram, and the phoneme map that lives inside it.

The long-term average speech spectrum

A long recording of conversational speech, analysed for its average spectral content over many seconds, produces a characteristic shape. The long-term average speech spectrum (LTASS) at conversational level (about 65 dB SPL overall) has:

A broad low-frequency peak around 500 Hz (dominated by voiced energy, chiefly the first vowel formant, F1)
A steady decline above 1 kHz, falling at roughly 6 dB per octave through about 4 kHz
A further drop above 4 kHz, with little speech energy beyond about 8 kHz

The spectrum is not a single line; speech is dynamic, so any given moment of speech has energy spanning a range above and below the long-term average. The standard audiometric convention is to plot the LTASS as a band — typically the 1st-to-99th percentile envelope of the moment-to-moment spectrum — rather than a single curve. That band, replotted in audiometric coordinates (frequency on log-x, dB HL on inverted-y), is the speech banana.
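
As a rough sketch, the spectral shape listed above can be written as a piecewise function of frequency. The function name `ltass_relative_db` and the 12 dB per octave slope above 4 kHz are assumptions made for illustration (the text says only that the spectrum drops further there). The region at and below 1 kHz is flattened to 0 dB for simplicity, and the output is in dB relative to the 1 kHz level, not calibrated dB SPL or dB HL.

```python
import math


def ltass_relative_db(freq_hz, slope_mid=6.0, slope_high=12.0):
    """Approximate LTASS level in dB relative to the level at 1 kHz.

    slope_mid  : dB per octave between 1 and 4 kHz (from the text).
    slope_high : dB per octave above 4 kHz (assumed; not given in the text).
    """
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    if freq_hz <= 1000:
        # Simplification: treat the broad low-frequency peak as flat.
        return 0.0
    if freq_hz <= 4000:
        return -slope_mid * math.log2(freq_hz / 1000)
    # Above 4 kHz: continue from the 4 kHz level with the steeper slope.
    level_at_4k = -slope_mid * math.log2(4000 / 1000)
    return level_at_4k - slope_high * math.log2(freq_hz / 4000)


if __name__ == "__main__":
    for f in (250, 500, 1000, 2000, 4000, 8000):
        print(f"{f:>5} Hz: {ltass_relative_db(f):6.1f} dB re 1 kHz")
```

Under these assumptions the sketch sits 12 dB below the 1 kHz level at 4 kHz, two octaves higher, which is part of why high-frequency speech sounds are the first to fall under a sloping threshold curve.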

What lives in the banana

Different phonemes occupy different regions of the banana:

Vowels are loud and low-frequency. Their energy lives near 500–1000 Hz, with formants spreading up to 2.5 kHz. Vowels at conversational level are typically 50–60 dB SPL — well within most listeners’ audibility range.
Voiced consonants (/m/, /n/, /r/, /l/, /b/, /d/, /g/, /v/, /z/) live in the mid-frequency range, 500–3000 Hz, at moderate intensities. They carry both phonemic content and the prosodic envelope of speech.
Voiceless fricatives (/s/, /f/, /ʃ/, /θ/, /h/) live in the high-frequency range, 3–8 kHz, at low intensities, typically 30–45 dB SPL. They are simultaneously the highest-frequency and the softest portions of speech. They are also among the most informationally important: the contrasts among /s/, /f/, /ʃ/, and /θ/ are critical phonemic distinctions, often the difference between think and sink, or sue and shoe.

The combination of high frequency and low intensity in fricatives is the central audiological problem of presbycusis and noise-induced hearing loss. Both preferentially affect the high frequencies (typically 2–8 kHz), exactly where the softest and most informationally important phonemes live. The classic patient complaint, “I can hear, I just can’t understand”, is the audiological signature of this mismatch.

The shaded "speech banana" is the long-term average spectrum of conversational speech at about 65 dB SPL, replotted in audiometric coordinates. Phonemes are scattered inside it: vowels concentrate at low frequencies with high energy; voiced consonants are mid-frequency; fricatives like /s/, /f/, /ʃ/, /θ/ are high-frequency and low-energy. A listener's threshold curve is overlaid in blue. Any phoneme that sits below the threshold (i.e., quieter than threshold at its frequency) is grayed out — inaudible. High-frequency sloping losses cut fricatives first; the resulting "I can hear, I just can't understand" complaint is what brings most adults to the clinic.

The interactive overlays a sample listener’s threshold curve (blue X markers) on the speech banana. Phonemes whose level is at or above the listener’s threshold at their frequency remain audible (coloured by group); phonemes quieter than threshold are grayed out as inaudible. The counter on the side tracks how many phonemes are inaudible. Switch presets to see how different loss patterns selectively erase different parts of the speech signal:

Normal — all phonemes audible.
Sloping mild — the highest-frequency fricatives (/s/, /f/, /θ/) start to disappear.
Sloping moderate — most fricatives gone; voiced consonants in the upper mids start to slip below threshold.
Flat severe — most of the banana below threshold; vowels and voiced consonants still partially accessible.
Cookie-bite — the mids fall, including formant transitions critical for distinguishing many consonants; the patient hears vowels at low frequencies and fricatives at high frequencies but loses the “middle” of speech.
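
The audibility check behind such an overlay can be sketched in a few lines. Everything specific here is an assumption for illustration: the phoneme positions in `PHONEMES`, the two audiograms, and the function names are hypothetical and are not taken from the interactive or from a published speech-banana chart. Thresholds between audiometric frequencies are interpolated linearly on a log-frequency axis, which is one reasonable choice among several.

```python
import math

# Hypothetical marker positions: label -> (group, frequency in Hz, level in dB HL).
# Illustrative values only; not taken from a published speech-banana chart.
PHONEMES = {
    "/u/": ("vowel", 300, 45),
    "/a/": ("vowel", 1000, 50),
    "/m/": ("voiced consonant", 500, 40),
    "/r/": ("voiced consonant", 1500, 40),
    "/ʃ/": ("fricative", 3000, 35),
    "/s/": ("fricative", 6000, 30),
    "/f/": ("fricative", 5000, 25),
    "/θ/": ("fricative", 6000, 20),
}

# Hypothetical audiograms: frequency in Hz -> threshold in dB HL.
NORMAL = {250: 10, 500: 10, 1000: 10, 2000: 10, 4000: 10, 8000: 10}
SLOPING = {250: 10, 500: 15, 1000: 20, 2000: 30, 4000: 45, 8000: 60}


def threshold_at(audiogram, freq_hz):
    """Interpolate the threshold (dB HL) at freq_hz on a log-frequency axis."""
    freqs = sorted(audiogram)
    if freq_hz <= freqs[0]:
        return audiogram[freqs[0]]
    if freq_hz >= freqs[-1]:
        return audiogram[freqs[-1]]
    for lo, hi in zip(freqs, freqs[1:]):
        if lo <= freq_hz <= hi:
            t = math.log2(freq_hz / lo) / math.log2(hi / lo)
            return audiogram[lo] + t * (audiogram[hi] - audiogram[lo])
    return audiogram[freqs[-1]]  # not reached for valid input


def inaudible_phonemes(audiogram, phonemes=PHONEMES):
    """Return labels of phonemes whose level is below threshold at their frequency."""
    return sorted(
        label
        for label, (_group, freq_hz, level_hl) in phonemes.items()
        if level_hl < threshold_at(audiogram, freq_hz)
    )


if __name__ == "__main__":
    print("normal :", inaudible_phonemes(NORMAL))
    print("sloping:", inaudible_phonemes(SLOPING))
```

With these made-up values, the normal audiogram leaves every marker audible, while the sloping audiogram removes the four fricative markers and leaves the vowels and voiced consonants in place.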

The “count the dots” heuristic

A clinical shorthand: at any given threshold curve, count how many phoneme markers the listener can still hear. The word recognition score (WRS) in quiet correlates roughly with the count: losing 1–2 phonemes drops WRS by about 10 percentage points; losing 4–5 drops it by 30–40. The map is imprecise because it does not capture contextual and linguistic redundancy, but it makes the configuration of the loss vivid in a way the audiogram alone does not.
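
To make the rule of thumb concrete, the two figures quoted above can be treated as anchor points and joined by straight lines. This is only a sketch: the function name `estimated_wrs_drop` is hypothetical, the anchors use the midpoints of the quoted ranges (1.5 phonemes for a drop of 10, 4.5 phonemes for a drop of 35), and the linear interpolation and extrapolation are assumptions, not a validated clinical model.

```python
def estimated_wrs_drop(n_inaudible):
    """Rough WRS drop, in percentage points, for a count of inaudible phonemes.

    Piecewise-linear through (0, 0), (1.5, 10) and (4.5, 35), the midpoints
    of the ranges quoted in the text. Illustrative only.
    """
    if n_inaudible < 0:
        raise ValueError("count of inaudible phonemes cannot be negative")
    anchors = [(0.0, 0.0), (1.5, 10.0), (4.5, 35.0)]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if n_inaudible <= x1:
            return y0 + (y1 - y0) * (n_inaudible - x0) / (x1 - x0)
    # Beyond the last anchor: extend the final slope, capped at 100 points.
    (x0, y0), (x1, y1) = anchors[-2], anchors[-1]
    return min(100.0, y1 + (y1 - y0) / (x1 - x0) * (n_inaudible - x1))


if __name__ == "__main__":
    for n in range(0, 7):
        print(f"{n} inaudible -> about {estimated_wrs_drop(n):.1f} points lower WRS")
```

The interpolation only restates the heuristic in numeric form; a measured WRS can differ considerably because listeners exploit context and redundancy.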

The phoneme map is also one of the most useful patient-counselling tools. Showing a patient exactly which phonemes their high-frequency loss removes makes the abstract audiogram concrete in a way that “you have a sloping sensorineural loss” never will.

When the banana fails: audibility ≠ intelligibility

The banana picture has limits. It treats speech as a spectrum — but speech is also a time-varying signal with rapid formant transitions, voice-onset-time cues, and prosodic patterns. A listener whose audibility is perfect according to the banana may still have:

Poor temporal resolution: degraded auditory-nerve coding (e.g., from cochlear synaptopathy, often called hidden hearing loss, or from auditory neuropathy spectrum disorder, ANSD) that disrupts the time-domain cues used to distinguish stop consonants.
Reduced frequency resolution: broadened cochlear filters that smear formant peaks, with the largest effect at low signal-to-noise ratios.
Cognitive limitations: speech understanding requires attention, working memory, and linguistic context; deficits in any of these reduce performance regardless of audibility.

This is why the speech banana is one of several tools, not a substitute for the SRT/WRS measurements of 3.1 or the speech-in-noise tests of 3.3. Audibility is necessary for intelligibility but is not sufficient — and intelligibility itself depends on signal, brain, and context together.

What’s next

The next lesson, 3.3 — Speech in noise, extends speech audiometry to the listening conditions patients actually live in. Real-world environments contain noise, reverberation, and competing talkers. The audiogram and the quiet WRS poorly predict performance in those conditions; modern speech-in-noise tests (HINT, QuickSIN) and the articulation index / speech intelligibility index address the gap.