Where others saw a minor inconsistency, Howard Nusbaum saw a major clue. In the early 1980s, as a postdoc in speech and hearing at Indiana University, Nusbaum heard people raving about a particular synthetic speech generator. Often its fans were blind programmers who relied on such machines at their workplaces. To him, though, the computerized voice sounded terrible.
Were people convincing themselves otherwise just because they had no alternative? Nusbaum, now a U of C associate professor of psychology, saw another explanation, one "deeply rooted in the way that the mind works." Perhaps they were relearning how to process sound, unconsciously adjusting to the machine's idiosyncrasies until its bizarre accent became intelligible. That change would require a mental revision of pronunciation rules that, according to linguistic dogma, are frozen in place when a child first learns to comprehend words. So Nusbaum tested the dogma, and his suspicions only grew. He found that blind young adults, asked to identify words in synthetic speech, improved from 20 percent correct to 70 percent in a mere eight days.
"Nothing predicted this," he recalls. "All of the extant theories basically say these 22-year-olds should have fixed and immutable speech-perception mechanisms."
The contradiction Nusbaum discovered has inspired him as he works to trace the mind's first step in understanding language: translating raw sound into meaningful words. Along the way, he's gathered evidence that the first step in listening is far more complex than previously believed. Even in adults, he says, it is an active mental process, more like a toddler's busy guesswork as she figures out language for the first time than a computer's rote procedure for "understanding" spoken words. So similar are adults to children, he says, that, under some conditions, they should be able to learn foreign languages with a fluency that most psychologists think only native speakers--and those who start young--can achieve.
Through the 1950s, says Nusbaum, speech perception wasn't seen as complex at all: The mind simply recognized syllables and words that had sounds as distinctive and unvarying as a printed spelling. Then in the 1960s, spectrograms of speech--visual records produced by sound spectrographs--showed that consonants, vowels, syllables, and words exhibit a variability that makes comprehension seem miraculous. The consonants b and d, for example, each correspond to many acoustic patterns, and the pattern we hear as b in one context may be interpreted as d in another--or when it's spoken by a different person.
The puzzle, explains Nusbaum, a member of the committees on biopsychology and cognition & communication, is that "we have these mental constructs--words and vowels--that are representations of [the sound] that's hitting our ear," but those constructs don't correspond in any simple way to actual acoustic properties.
Convinced that acoustic differences alone don't establish unambiguous meanings, psychologists like Nusbaum--who came to Chicago in 1986 and now heads his department's Center for Computational Psychology--focus on the listener's mental process rather than on rules of pronunciation.
Many assume the ear, like some computer programs, interprets speech in fixed "windows" of time, perhaps digesting phonemes--the smallest units of sound that distinguish one word from another, like a single vowel or consonant--by listening to several windows in a row, then analyzing more windows to process a word. Others theorize a sequence of processors: a basic one that translates the raw signal into its auditory properties, another that turns those properties into phonemes, and a third that turns phonemes into words.
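To make that staged view concrete, here is a minimal sketch--purely illustrative, not drawn from Nusbaum's work or from any published model--of how such a fixed pipeline might look in code. The function names, the toy "acoustic" numbers, and the lookup tables are all invented for demonstration.

```python
# Illustrative sketch of the "sequence of processors" view of speech
# perception: a fixed pipeline from raw signal to auditory properties
# to phonemes to words. All names and data here are hypothetical.

def extract_auditory_properties(signal, window_size=3):
    """Slice the raw signal into fixed 'windows' of time."""
    return [signal[i:i + window_size]
            for i in range(0, len(signal), window_size)]

def map_to_phonemes(windows, phoneme_table):
    """Translate each window's properties into a phoneme label."""
    return [phoneme_table.get(tuple(w), "?") for w in windows]

def group_into_words(phonemes, lexicon):
    """Combine the phoneme sequence into the word it spells."""
    return lexicon.get("".join(phonemes), "<unknown>")

# Toy example: three windows of invented acoustic features.
signal = [0.1, 0.9, 0.2, 0.4, 0.4, 0.4, 0.8, 0.1, 0.1]
phoneme_table = {(0.1, 0.9, 0.2): "b", (0.4, 0.4, 0.4): "a", (0.8, 0.1, 0.1): "d"}
lexicon = {"bad": "bad"}

windows = extract_auditory_properties(signal)
phonemes = map_to_phonemes(windows, phoneme_table)
print(group_into_words(phonemes, lexicon))  # -> "bad"
```

The point of the sketch is what it fixes in advance: each stage runs the same way no matter who is talking or how fast--precisely the assumption Nusbaum's findings call into question.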
Nusbaum's research, however, suggests the mind leaps back and forth among these tasks. "We listen to a stream of speech in different ways at different times," he says, depending on our knowledge of who's talking and what's being said. If the talker is a stranger, we might unconsciously listen more to the separate parts of each word. If the speech is fast, we will distinguish consonants using different acoustic cues than when it's slow.
How we pay attention depends, too, on our native language. English generally uses pitch as an attribute of whole sentences, such as a rising tone to mark a question. In Chinese, though, pitch is intrinsic to each word--da, for instance, can mean big, hit, or dozen, depending on its tone. By comparing native English and Chinese speakers, Nusbaum and Lisa Lee, AM'89, PhD'93, have shown that Chinese speakers are less able to ignore pitch even when focusing on other sound cues, like consonants.
Most researchers believe these "listening strategies" are largely fixed by age 1--by which time babies can distinguish their native language's rhythms and phonology--or, at the latest, by early adolescence, coordinated perhaps with brain developments that, once complete, also make new languages difficult to learn.
Nusbaum disagrees. Along with the concept of shifting attention, lifelong learning, he thinks, is fundamental to speech perception. In an experiment conducted with graduate student Kate Baldwin, he capitalized on the finding that English speakers distinguish bah, dah, and gah not by their different initial sounds but by slight differences in the vowels. Nusbaum and Baldwin synthesized a sound with one syllable's opening and another's vowel, and instructed listeners that this hybrid sound actually represented the opening sound's syllable. Unconsciously, the subjects learned to hear the new acoustic cue.
Such listening strategies "may not be as well-entrenched" as those learned in infancy, says Nusbaum, "but the mechanism is exactly the same." Then why can't adults become fluent in a new language? Many do, Nusbaum points out. "Typically, they're in immersion situations"--like people who move to another country. The difficulty, he believes, comes when adults study a new language "piecemeal," continuing to speak and hear their native tongue.
In that case, he speculates, "the body of knowledge you have is in conflict with the new thing you're learning." The Chinese tonal distinction contradicts English rules about the meaning of tone; Zulu clicks, on the other hand, are easy to learn because they hold no meaning in English. Just which phonological conflicts affect second-language learning he will investigate in future research. As for the present, Nusbaum observes, he and his colleagues haven't "even scratched the surface" of language understanding. --A.C.
Written by Andrew Campbell.