3.2 The speech banana and the audibility map
The audiogram shows what the patient can detect; the speech banana shows what speech is. Overlaying the two lets you answer a question that pure tones alone cannot: which phonemes is this listener missing? A patient whose threshold curve dips into the banana — even by 10 dB — loses portions of the speech signal entirely. Those portions are systematic: low-frequency loss costs vowels; high-frequency loss costs fricatives; mid-frequency loss costs voiced consonants and formant transitions.
This lesson develops the long-term average speech spectrum (LTASS), the banana shape it produces on the audiogram, and the phoneme map that lives inside it.
The long-term average speech spectrum
A long recording of conversational speech, analysed for its average spectral content over many seconds, produces a characteristic shape. The long-term average speech spectrum (LTASS) at conversational level (about 65 dB SPL overall) has:
- A broad low-frequency peak around 500 Hz (dominated by voicing energy and the first vowel formant, F1; F2 usually lies higher)
- A steady decline above 1 kHz of roughly 6 dB per octave through about 4 kHz
- A further drop above 4 kHz, out to about 8 kHz, the upper limit of the conventional audiogram
The spectrum is not a single line; speech is dynamic, so any given moment of speech has energy spanning a range above and below the long-term average. The standard audiometric convention is to plot the LTASS as a band — typically the 1st-to-99th percentile envelope of the moment-to-moment spectrum — rather than a single curve. That band, replotted in audiometric coordinates (frequency on log-x, dB HL on inverted-y), is the speech banana.
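The spectral shape described above can be sketched numerically. This is a minimal illustration, not a calibrated LTASS: it encodes only the slope stated in the text (about 6 dB per octave from 1 to 4 kHz). Two simplifications are assumptions made for the sketch: the spectrum is treated as flat at and below 1 kHz, and the slope above 4 kHz (`slope_high_db_per_oct`) is a placeholder, since the text gives no figure for it. The function name `ltass_relative_db` is hypothetical.

```python
import math


def ltass_relative_db(freq_hz, slope_db_per_oct=6.0, slope_high_db_per_oct=12.0):
    """Level in dB relative to the low-frequency plateau of the spectrum.

    Assumptions (illustrative only):
    - flat at and below 1 kHz (the real spectrum peaks near 500 Hz),
    - -6 dB/octave from 1 to 4 kHz, as stated in the text,
    - a steeper, assumed slope above 4 kHz (placeholder value).
    """
    if freq_hz <= 1000.0:
        return 0.0
    if freq_hz <= 4000.0:
        return -slope_db_per_oct * math.log2(freq_hz / 1000.0)
    level_at_4k = -slope_db_per_oct * math.log2(4000.0 / 1000.0)
    return level_at_4k - slope_high_db_per_oct * math.log2(freq_hz / 4000.0)


# Print the relative level at the octave audiometric frequencies.
for f in (500, 1000, 2000, 4000, 8000):
    print(f, "Hz:", round(ltass_relative_db(f), 1), "dB re plateau")
```

Adding a relative level to an assumed plateau level gives an approximate band level in dB SPL. Replotting on the audiogram additionally requires subtracting the reference threshold at each frequency to obtain dB HL, which this sketch does not attempt.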
What lives in the banana
Different phonemes occupy different regions of the banana:
- Vowels are loud and low-frequency. Their energy lives near 500–1000 Hz, with formants spreading up to 2.5 kHz. Vowels at conversational level are typically 50–60 dB SPL — well within most listeners’ audibility range.
- Voiced consonants (/m/, /n/, /r/, /l/, /b/, /d/, /g/, /v/, /z/) live in the mid-frequency range, 500–3000 Hz, at moderate intensities. They carry both phonemic content and the prosodic envelope of speech.
- Voiceless fricatives (/s/, /f/, /ʃ/, /θ/, /h/) live mostly in the high-frequency range, 3–8 kHz, at low intensities, typically 30–45 dB SPL. They are simultaneously the highest-frequency and the softest portions of speech. They are also among the most informationally important: contrasts among /s/, /f/, /ʃ/, and /θ/ are often the only difference between words such as think and sink, sue and shoe, or fin and sin.
The combination of high frequency and low intensity makes fricatives the central audiological problem in presbycusis and noise-induced hearing loss. Both losses preferentially affect the high frequencies (typically 2–8 kHz), exactly where the softest and among the most informationally important phonemes live. The classic patient complaint, “I can hear, I just can’t understand,” is the audiological signature of this mismatch.
The shaded "speech banana" is the long-term average spectrum of conversational speech at about 65 dB SPL, replotted in audiometric coordinates. Phonemes are scattered inside it: vowels concentrate at low frequencies with high energy; voiced consonants are mid-frequency; fricatives like /s/, /f/, /ʃ/, /θ/ are high-frequency and low-energy. A listener's threshold curve is overlaid in blue. Any phoneme that is quieter than the threshold at its frequency is grayed out as inaudible. High-frequency sloping losses cut fricatives first; the resulting "I can hear, I just can't understand" complaint is what brings many adults to the clinic.
The interactive overlays a sample listener’s threshold curve (blue X markers) on the speech banana. Phonemes whose level is at or above the listener’s threshold at that frequency remain audible (coloured by group); phonemes whose level is below threshold are grayed out as inaudible. Because the dB HL axis is inverted, audible phonemes are plotted lower on the chart than the threshold marker, not higher. The counter on the side tracks how many phonemes are inaudible. Switch presets to see how different loss patterns selectively erase different parts of the speech signal:
- Normal — all phonemes audible.
- Sloping mild — the highest-frequency fricatives (/s/, /f/, /θ/) start to disappear.
- Sloping moderate — most fricatives gone; voiced consonants in the upper mids start to slip below threshold.
- Flat severe — most of the banana below threshold; vowels and voiced consonants still partially accessible.
- Cookie-bite — the mids fall, including formant transitions critical for distinguishing many consonants; the patient hears vowels at low frequencies and fricatives at high frequencies but loses the “middle” of speech.
The “count the dots” heuristic
A clinical shorthand: for any given threshold curve, count how many phoneme markers the listener can still hear. The word recognition score (WRS) in quiet correlates roughly with the count: losing 1–2 phonemes drops WRS by about 10%, and losing 4–5 drops it by 30–40%. The mapping is imprecise because it does not capture contextual and linguistic redundancy, but it makes the configuration of the loss vivid in a way the audiogram alone does not.
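The counting step can be sketched in a few lines. Everything specific here is an assumption for illustration: the marker positions in `MARKERS`, the two threshold presets, and the function names are hypothetical placeholders, not measured phoneme levels or the values used by the interactive. Thresholds between audiometric frequencies are interpolated linearly on a log-frequency axis, which is one reasonable choice among several.

```python
import math

# Hypothetical phoneme markers: (symbol, group, frequency in Hz, level in dB HL).
# These positions are illustrative placeholders, not measured values.
MARKERS = [
    ("a", "vowel", 750, 50),
    ("o", "vowel", 500, 45),
    ("m", "voiced", 500, 35),
    ("r", "voiced", 1500, 40),
    ("g", "voiced", 2000, 40),
    ("ʃ", "fricative", 3000, 35),
    ("s", "fricative", 6000, 25),
    ("f", "fricative", 4500, 20),
    ("θ", "fricative", 6000, 20),
]

# Hypothetical threshold presets in dB HL at standard audiometric frequencies.
NORMAL = {250: 10, 500: 10, 1000: 10, 2000: 10, 4000: 10, 8000: 10}
SLOPING = {250: 15, 500: 20, 1000: 30, 2000: 45, 4000: 60, 8000: 70}


def threshold_at(freq_hz, thresholds):
    """Threshold in dB HL at freq_hz, interpolated on a log-frequency axis.

    Frequencies outside the measured range take the nearest measured value.
    """
    freqs = sorted(thresholds)
    if freq_hz <= freqs[0]:
        return thresholds[freqs[0]]
    for lo, hi in zip(freqs, freqs[1:]):
        if freq_hz <= hi:
            frac = math.log2(freq_hz / lo) / math.log2(hi / lo)
            return thresholds[lo] + frac * (thresholds[hi] - thresholds[lo])
    return thresholds[freqs[-1]]


def count_the_dots(markers, thresholds):
    """Return (number of audible markers, list of inaudible symbols).

    A marker counts as audible when its level is at or above the
    listener's threshold at that frequency (both in dB HL).
    """
    audible = 0
    lost = []
    for symbol, _group, freq_hz, level_db_hl in markers:
        if level_db_hl >= threshold_at(freq_hz, thresholds):
            audible += 1
        else:
            lost.append(symbol)
    return audible, lost


for name, preset in (("normal", NORMAL), ("sloping", SLOPING)):
    n_audible, lost = count_the_dots(MARKERS, preset)
    print(f"{name}: {n_audible}/{len(MARKERS)} audible, lost: {lost}")
```

The count says nothing about how much each lost phoneme matters in context, which is the same limitation the heuristic itself carries.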
The overlay is also one of the most useful patient-counselling tools. Showing patients exactly which phonemes their high-frequency loss removes makes the abstract audiogram concrete in a way that “you have a sloping sensorineural loss” never will.
When the banana fails: audibility ≠ intelligibility
The banana picture has limits. It treats speech as a spectrum — but speech is also a time-varying signal with rapid formant transitions, voice-onset-time cues, and prosodic patterns. A listener whose audibility is perfect according to the banana may still have:
- Poor temporal resolution: degraded auditory-nerve coding (e.g., from auditory neuropathy spectrum disorder (ANSD) or cochlear synaptopathy, the proposed mechanism behind hidden hearing loss) that disrupts the time-domain cues used to distinguish stop consonants.
- Reduced frequency resolution: broadened cochlear filtering that smears formant peaks, with the largest cost at low SNRs.
- Cognitive limitations: speech understanding requires attention, working memory, and linguistic context; deficits in any of these reduce performance regardless of audibility.
This is why the speech banana is one of several tools, not a substitute for the SRT/WRS measurements of 3.1 or the speech-in-noise tests of 3.3. Audibility is necessary for intelligibility but is not sufficient — and intelligibility itself depends on signal, brain, and context together.
What’s next
The next lesson, 3.3 — Speech in noise, extends speech audiometry to the listening conditions patients actually live in. Real-world environments contain noise, reverberation, and competing talkers. The audiogram and the quiet WRS poorly predict performance in those conditions; modern speech-in-noise tests (HINT, QuickSIN) and the articulation index / speech intelligibility index address the gap.