7.6 Speech and music in cortex

A particularly striking feature of human auditory cortex is the degree of specialization for speech and music — categories that are clearly culturally constructed but that nevertheless have dedicated cortical machinery.

The superior temporal gyrus (STG), particularly on the left in most people, contains regions specialized for processing the phonemic structure of speech. fMRI studies show that speech elicits much stronger activation in STG than control stimuli such as scrambled speech or amplitude-matched noise. Intracranial recordings in awake humans have identified individual electrodes in STG whose activity tracks specific phonetic features: frication, voicing, place of articulation. In effect, the cortex is performing phonetic classification.

Music processing has its own partially separate substrate, often more strongly right-lateralized. Specific cortical regions track pitch, rhythm, and timbre, and a separate region appears specialized for music’s emotional content.

These specializations emerge during development. Newborns do not have them in mature form. They are not fixed at birth but are largely learned over the first decade of life, in response to whatever auditory environment the child is exposed to. A child raised in a tonal-language environment ends up with somewhat different cortical specializations than one raised in a non-tonal environment. The cortex’s organization is, in part, the imprint of one’s auditory experience.

For the phrase “Hey Dr. Miles!”, the cortical processing pipeline goes roughly like this. A1 represents the cochleagram tonotopically. Belt and parabelt integrate across frequencies and time, recognizing acoustic features. The STG segments the sound into phonemic categories: /h/, /eɪ/, /d/, and so on. Higher-order regions (Wernicke’s area in the left posterior STG, and beyond, into anterior temporal cortex) recognize the words: “Hey”, “Doctor”, “Miles”. The identity of a specific person, the speaker, may be inferred from voice characteristics processed in the anterior STG. The meaning of being addressed by name engages limbic and prefrontal regions.

All of this happens in a few hundred milliseconds.
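
As a loose illustration only, the order of stages just described can be written down as data, and the one step that is already symbolic (going from phonemic categories to words) can be caricatured as a lexicon lookup. Everything in the sketch below is an assumption made for illustration: the names PIPELINE, LEXICON, PHONEMES, and recognize_words are hypothetical, the phoneme transcriptions are rough, and greedy longest-match lookup is a toy stand-in, not a claim about how cortex recognizes words.

```python
# Toy sketch: stage order from the text, plus a caricature of word recognition.
# All names and transcriptions here are illustrative assumptions.

# The stages named in the text, in order, with the role the text assigns to each.
PIPELINE = [
    ("A1", "tonotopic representation of the cochleagram"),
    ("belt/parabelt", "integration across frequency and time"),
    ("STG", "segmentation into phonemic categories"),
    ("posterior STG / anterior temporal cortex", "word recognition"),
]

# Hypothetical mini-lexicon: rough phoneme sequences mapped to words.
LEXICON = {
    ("h", "eɪ"): "Hey",
    ("d", "ɒ", "k", "t", "ə"): "Doctor",
    ("m", "aɪ", "l", "z"): "Miles",
}

# Stand-in for the STG output: a flat stream of phoneme labels.
PHONEMES = ["h", "eɪ", "d", "ɒ", "k", "t", "ə", "m", "aɪ", "l", "z"]


def recognize_words(phonemes, lexicon):
    """Greedy longest-match lookup of phoneme runs in a lexicon.

    Phonemes that start no known word are skipped one at a time.
    """
    words = []
    max_len = max(len(key) for key in lexicon)
    i = 0
    while i < len(phonemes):
        # Try the longest candidate chunk first, then shorter ones.
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            chunk = tuple(phonemes[i:i + n])
            if chunk in lexicon:
                words.append(lexicon[chunk])
                i += n
                break
        else:
            i += 1  # no word starts here; move on by one phoneme
    return words


for region, role in PIPELINE:
    print(f"{region}: {role}")
print(recognize_words(PHONEMES, LEXICON))
```

The sketch treats the stages as strictly sequential and the phoneme stream as already clean, which is a simplification made only to keep the example short.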

But there is still one more step. The cortical representation we have just sketched is one of recognition. It does not yet describe what the listener understands the sentence to mean for them — the prior expectations, the social context, the recognition that “Hey Dr. Miles!” was directed at the listener and is, therefore, a greeting that calls for a response. That step — from cortical pattern to meaning, memory, and prediction — is movement 9.