When speech or verbal language is processed, the human brain operates as an adaptive processor that decodes and encodes the information embedded in sound waves for successful perception and production. In this vein, speech processing can be described as information flow (Denes and Pinson, 1993) across three different processing levels: acoustic, physiological, and linguistic. Unlike the acoustic level, however, the physiological level remains poorly understood; that is, little is known about how speech emerges from sound waves. Chomskyan linguistics (Chomsky, 2000) can reveal only how information is represented and organized across different linguistic structures, so it can hardly tell us how linguistic information is extracted from specific sounds during online speech processing. Moreover, there is still no complete description of how meaningless sounds become meaningful, a process that is also essential in early language acquisition. It is thus necessary and important to specify the neural processes mediating between adjacent levels of speech processing: acoustic-to-physiological and physiological-to-linguistic processes, and vice versa.
In fact, human speech sounds are no more special than animal vocalizations and environmental sounds. What is really special emerges when sound waves are
All of this raises a well-posed question: what makes speech sounds different from other sounds? Answering it amounts to elucidating what happens at the intermediate stage between the acoustic and linguistic levels. As one such attempt, here we aim to describe the neural mechanisms of (1) speech codes in the brain; (2) sensorimotor integration supported by the auditory-motor interface; and (3) meaningful sound (word) learning. Three consecutive experiments (two using functional magnetic resonance imaging, fMRI; one using functional near infrared spectroscopy, fNIRS), all based on verbal repetition, are introduced to answer these questions. Verbal repetition is basically a form of vocal imitation and is commonly used in neuropsychology for the diagnosis of aphasic patients (Wallesch and Kertesz, 1993). Furthermore, it is well suited to investigating language processing.
Speech representation in the brain
Before information can be extracted, speech sounds must be represented as linguistic units in the brain. To probe this process, many studies have contrasted speech with meaningless sounds in terms of neural circuits. However, contrasting acoustically different sounds is problematic because the comparison is confounded by sound processing at the sensory level. Note that once learned, meaningless sounds become speech even though their acoustic features remain unchanged. It thus seems that speech representation interacts substantially with high-level linguistic processes, e.g. meaning. To test this assumption, we introduced novel sounds perceived as words or pseudowords depending on the interpretation of an ambiguous vowel in the sounds. With these stimuli, we monitored blood oxygenation level-dependent (BOLD) signals in an event-related functional magnetic resonance imaging (fMRI) experiment while subjects immediately repeated what they heard. Repeating the ambiguous sounds revealed the auditory-motor interface at the Sylvian fissures and superior temporal sulci bilaterally; more importantly, by contrasting word-versus-pseudoword trials we found neural activities unique to word-perceived repetition in the left posterior middle temporal areas and activities unique to pseudoword-perceived repetition in the left inferior frontal gyrus (Yoo et al., 2012).
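The logic of the word-versus-pseudoword contrast can be illustrated with a minimal general-linear-model sketch. This is not the authors' analysis pipeline: the regressors, noise level, and voxel response below are invented toy values, and real event-related fMRI analyses convolve trial onsets with a hemodynamic response function and correct for multiple comparisons across voxels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_scans = 120

# Stand-ins for HRF-convolved onset regressors (hypothetical values)
word_reg = rng.random(n_scans)      # word-perceived trial regressor
pseudo_reg = rng.random(n_scans)    # pseudoword-perceived trial regressor
X = np.column_stack([word_reg, pseudo_reg, np.ones(n_scans)])  # design matrix + intercept

# Toy BOLD time series for one voxel that responds more to word trials
y = 2.0 * word_reg + 0.5 * pseudo_reg + rng.normal(0.0, 0.1, n_scans)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # per-voxel GLM fit
contrast = np.array([1.0, -1.0, 0.0])          # word > pseudoword
effect = float(contrast @ beta)                # contrast estimate for this voxel
```

A positive contrast estimate marks a voxel as responding more strongly on word-perceived than pseudoword-perceived trials, which is the general logic behind localizing condition-unique activity.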
In the classical view, this result divides into two separate but interactive networks: a frontal and a temporal network. The frontal network, localized in bilateral premotor cortex (BA6), Broca's area (BA44, 45), and the latter's right homologue, corresponds to the articulation of target sounds. The prefrontal network, including dorsolateral prefrontal cortex (BA46) and inferior prefrontal gyrus (BA47), is likely correlated with syllabification (Hickok and Poeppel, 2004; Indefrey and Levelt, 2004). In contrast, the temporal network is involved in processing incoming sounds: they are first processed at Heschl's gyrus (BA41) and then spectrotemporally analyzed in the superior temporal gyrus (BA21, 22, 38, 41, 42) and middle temporal gyrus (BA21) (Hickok and Poeppel, 2007). Last, the Sylvian-parietal-temporal (Spt) area is engaged by the simultaneous involvement of speech perception and production (Binder et al., 2000; Hickok and Poeppel, 2004).
However, it should be noted that distinct neural activities modulated by subjects' perception are not easily incorporated into this classical view, as they require an account of phonological and semantic interaction at the stage of low-level speech processing. The notion of a phonological learning device can provide a possible account of the result (Baddeley, Gathercole, and Papagno, 1998). The auditory-motor interface has generally been regarded as a linguistic device, and we can assume that sounds become familiar after several imitations in the phonological loop provided by that interface. Once familiar, a sound is usually associated with certain meanings by learning faculties. In this way, even an acoustically identical sound may be processed differently after it has been learned. This is consistent with the dual-stream model (Hickok and Poeppel, 2007) in that two separate networks are used for articulation-based and acoustic-phonetic codes, respectively. Perisylvian connectivity revealed by DT-MRI (diffusion-tensor magnetic resonance imaging) also supports this notion of cooperative division in speech processing (Catani et al., 2005). In short, even for acoustically identical sounds, two distinct speech codes, i.e. an articulation-based code for pseudowords and an acoustic-phonetic code for words, are differentially used for verbal repetition according to whether the speech sounds are meaningful or not.
The auditory-motor interface and sensorimotor integration
As discussed, the auditory-motor interface seems to play a critical role in speech processing. However, the same interface is equally activated by nonspeech such as musical sounds (Koelsch et al., 2009), implying a relationship between verbal repetition and vocal imitation. Vocal imitation is rare in animals, but some species, e.g. songbirds and elephants, are known to be capable of it (Schachner et al., 2009). In humans, vocal imitation or auditory-oral matching capabilities play an important role in speech and language acquisition (Kuhl and Meltzoff, 1996; Chen et al., 2004). Once a novel sound is heard, we can articulate it by phonological imitation or assimilation, which is mediated by the auditory-motor interface (articulatory learning), and then usually associate it with a specific meaning (semantic learning). That is, the auditory-motor interface is not only a linguistic device but also a phonological learning device capable of handling novel auditory sounds.
Speech perception may therefore be a subset of sound perception, in which linguistic information such as phonetic features, voice-onset time, associated concepts, and grammar is extracted from the sounds through sound parsing so that they become meaningful for the listener. Theories of speech perception largely divide into two distinct perspectives. The first assumes an acoustic representation of speech sounds in the brain, defined by a number of acoustic characteristics (Stevens and Blumstein, 1981; Massaro, 1987; Goldinger, 1997; Johnson, 1997; Coleman, 1998). The second holds that speech perception is a problem not in the acoustic domain but in the articulatory domain: speaking and listening are both regulated by the same structural constraints and grammar, and the listener perceives the articulatory movements that generate the actual sounds (Liberman and Mattingly, 1985; Fowler, 1986).
The divergence of speech codes, i.e. the articulatory and acoustic-phonetic codes observed in Yoo et al. (2012), partially supports the notion of speech perception in an articulatory domain. It is also intriguing that the locus of articulatory codes (left inferior frontal gyrus) is part of the mirror neuron system (MNS), known as a core network of movement imitation (Iacoboni, 2005; Iacoboni and Dapretto, 2006). This led us to investigate whether the locus of articulation-based codes is modulated while subjects perceive sounds of different categories. We monitored the hemoglobin concentration change at the inferior frontal gyri (IFG) bilaterally, measured by functional near infrared spectroscopy (fNIRS), while the subjects listened to natural sounds, animal vocalizations, human emotional sounds, pseudowords, and words, and verbally repeated the speech sounds (pseudowords and words) only.
As a result, we observed that the oxygenated hemoglobin (O2Hb) change at the left IFG was positive for both speech and nonspeech sounds, but negative at the right IFG. In addition, there was hemodynamic modulation by sound type at the IFG even during passive listening. Contrasting verbal repetition of words and pseudowords revealed that the proportion of O2Hb change in the total Hb concentration change was significantly higher for pseudowords than for words, indicating that articulatory codes at the left IFG were predominant for pseudowords but not for words. These results show that (1) articulatory circuits are important for perceiving and producing meaningless sounds and (2) the same neural circuits are modulated by nonspeech sounds according to sound type. While it is already known that perceiving speech sounds depends partly on motoric information (Fadiga et al., 2002), our results further imply that articulation-based perception applies equally to speech and nonspeech sounds.
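As a rough illustration of the fNIRS measure, a "proportion of O2Hb change in total Hb" can be computed from the oxygenated (O2Hb) and deoxygenated (HHb) concentration changes. The exact formula used in the study is not given here, so the ratio below and the micromolar values are assumptions for illustration only, not measured data.

```python
def o2hb_proportion(d_o2hb, d_hhb):
    """Share of the O2Hb change in the total Hb change: |dO2Hb| / (|dO2Hb| + |dHHb|).
    An assumed definition for illustration; the study's exact measure may differ."""
    return abs(d_o2hb) / (abs(d_o2hb) + abs(d_hhb))

# Hypothetical left-IFG concentration changes (micromolar), not measured values
pseudoword_share = o2hb_proportion(0.8, 0.2)   # larger O2Hb share
word_share = o2hb_proportion(0.5, 0.5)
```

Under this toy definition, a higher share for pseudowords than for words would correspond to the reported predominance of articulatory codes at the left IFG for pseudowords.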
We regard these findings as evidence of a sensorimotor integration supporting both speech and nonspeech perception (Pulvermüller et al., 2006; Wilson et al., 2004), in which motoric information is important for top-down control of auditory perception (Scott et al., 2010). The auditory-motor interface is well suited to this sensorimotor integration. Through verbal repetition in the auditory-motor interface, novel sounds may be imitated and then articulated exactly. After many repetitions, we learn how to articulate a sound and can then perceive it exactly. That is, imitative learning in the auditory-motor interface can eventually shape sound perception by generating relevant speech codes. Consistent with this notion, the left IFG is part of the mirror neuron system used in imitative learning (Rizzolatti and Arbib, 1998).
Associative learning and episodic buffer
Imitative learning in the auditory-motor interface generates speech codes for novel sounds. However, sounds are transformed into different speech codes according to subjects' perception even when the same sound waves are processed in the human brain (Yoo et al., 2012). This implies that, through semantic learning, speech sounds can be differentially represented in connection with specific concepts in long-term memory. The next question, then, is how this learning is achieved in terms of neural plasticity, i.e. which brain regions are recruited for it. How a novel sound comes to have specific meanings in the human brain has long been an intriguing topic in cognitive linguistics and neurolinguistics. Word learning often requires considerable time and effort before a word is incorporated into the human language system, i.e. the mental lexicon. A recent study revealed that learning novel spoken words requires at least a day of consolidation, by which the neural representation of newly learned words is strengthened and thus recognized as rapidly as existing words in the mental lexicon (Davis et al., 2008).
To be learned successfully, sounds must first be represented and processed in the auditory cortex and several corresponding regions, e.g. premotor cortex, inferior frontal cortex, and inferior parietal lobule, which reciprocally mediate one another (Rauschecker and Scott, 2009). Notably, there is ample neuroscientific evidence demonstrating that once learned, speech is processed differently in specific brain regions. This means that the learning process may reorganize some brain regions to distinguish learned sounds from others. The learning effect is so evident that word-specific neural activities have been repeatedly described in many neuroimaging studies. However, there is still no complete description of how such learning proceeds and what neural circuits are involved.
It has been suggested that word learning is mediated by the phonological loop (PL) posited in verbal working memory (Baddeley et al., 1998). However, there is no direct evidence yet that the PL is an essential component of word learning. Neuropsychological data show that verbal working memory and some linguistic abilities, e.g. speech perception and sentence comprehension, can be dissociated from each other (Friedrich et al., 1984, 1985; Martin, 2006). This raises the question of whether word learning is processed independently of normal speech circuitries. All this confusion may stem from the fact that the notion of working memory is easily confounded with higher mental processes such as cognitive aptitudes (Cowan et al., 2005, 2006).
To resolve this, we examined the brain regions mediating word learning, from novel sounds to known words. We designed a simple, novel associative learning paradigm combined with event-related fMRI. For the associative learning, some novel sounds were presented with their meanings in simple stories (learned condition), while others were presented without meanings in the same stories (unlearned condition). The subjects could easily learn the novel sounds by associating them with specific meanings. After this short learning phase, they were again asked to repeat the same sounds, and we contrasted verbal repetition of the novel sounds before and after learning. The results revealed that unlearned sounds uniquely evoked neural activities at the superior and middle frontal gyri bilaterally, whereas learned sounds uniquely evoked neural activities at the superior and inferior parietal lobules as well as the superior and middle frontal gyri bilaterally. A connectivity analysis using dynamic causal modeling (DCM) suggested that the dorsal fronto-parietal network might serve as the episodic buffer used for associative learning of novel sounds.
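The intuition behind testing a direction of influence between two regions can be sketched with a toy lagged-regression (Granger-style) comparison. This is only a stand-in for intuition: it is not DCM, which fits a biophysical generative model of the hemodynamics, and the coupling strength and noise values below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Toy time series in which "frontal" activity drives "parietal" activity one step later
frontal = rng.normal(size=n)
parietal = np.zeros(n)
for t in range(1, n):
    parietal[t] = 0.8 * frontal[t - 1] + 0.1 * rng.normal()

def lagged_fit_error(source, target):
    """Mean squared residual when target[t] is predicted from source[t-1]."""
    X = np.column_stack([source[:-1], np.ones(len(source) - 1)])
    beta, *_ = np.linalg.lstsq(X, target[1:], rcond=None)
    resid = target[1:] - X @ beta
    return float(np.mean(resid ** 2))

err_f2p = lagged_fit_error(frontal, parietal)   # frontal -> parietal model
err_p2f = lagged_fit_error(parietal, frontal)   # parietal -> frontal model
# The lower-error model marks the better-supported direction of influence
```

In this toy data the frontal-to-parietal model fits far better, mirroring the kind of directional inference (here by a much cruder method) that motivated the fronto-parietal interpretation.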
A novel sound constitutes an unfamiliar auditory scene with continuous acoustic waves. To be understood, it must be parsed into recognizable phonetic and phonological items in the brain. Two learning processes seem to be required here: one for naming the sounds and the other for understanding them. The former means that we need to build an abstract representation to refer to the sounds, because the auditory scene is novel and we have no corresponding speech codes in the mental lexicon. The latter means that the referred name of the sounds must be associated with specific semantic information, which provides a relevant understanding of the sounds. This is similar to novel face recognition: we have knowledge of the individual features making up a face, but no idea of how a novel face organizes those features. By seeing it, we come to have an abstract representation of the face; that is, we can distinguish it from others by learning the visual scene. After this kind of learning, we can additionally associate the face with specific meanings correlated with its visual features. This corresponds to the second learning process, at the semantic level.
In this vein, Tsukiura and colleagues showed that bilateral prefrontal areas are crucial for associating newly learned people's faces and names (Tsukiura et al., 2002), near the loci activated by repeating the unlearned stimuli in our second experiment. They also found that the left superior parietal lobules were activated when contrasting novel and familiar stimuli for both names and occupations (Tsukiura et al., 2002), matching our finding for the repetition of learned stimuli. Therefore, learning sounds likely requires two different steps: one associating the auditory object with its name, and the other associating the name with its semantic contents. It is also intriguing that this learning process seems to apply to visual and auditory scenes alike. In this regard, we noted that the activated loci overlapped significantly with the prefrontal cortex and temporo-parietal junction, which are supposed neural correlates of the episodic buffer (Blumenfeld and Ranganath, 2006; Zhang et al., 2004; Wagner et al., 2005; Hutchinson et al., 2009; Baddeley, 2000).
Responding to an episodic event, that is, learning an auditory or visual scene, seems to be handled in the episodic buffer irrespective of the modality of the objects. Before consolidation, the episodic buffer may support the formation and retrieval of memories for events (Ranganath et al., 2003; Davis et al., 2008; Vargha-Khadem et al., 1997). As a temporary store, the episodic buffer maintains and organizes multimodal sensory information as a single event by strengthening associations among individual items. The respective roles of the prefrontal cortices and parietal lobules in the episodic buffer are probably the monitoring and manipulation of auditory objects (Champod and Petrides, 2007): naming the sounds does not require semantic information (monitoring), but that information is important in associating the sounds with specific meanings (manipulation). There is a directional connectivity from the prefrontal cortex to the posterior parietal cortex, indicating that manipulation is initiated after monitoring. This fronto-parietal network is likely modulated by attentional control (Wagner et al., 2005; Wang et al., 2009; Deserno et al., 2012).
However, it is still unclear how speech codes relate to associative learning. The DCM analysis showed directional connectivity within the episodic buffer, but at the same time there was little connectivity between the neural circuits for speech codes and those for associative learning. This implies that another neural mechanism may support speech codes.
In this research report, we have explored recent findings from neurolinguistic studies based on verbal repetition. With well-posed questions and methodologies, we revealed that (1) speech has distinct neural codes modulated by high-level linguistic processes specified at the semantic level, irrespective of the acoustic features of sounds; (2) such modulation results from sound learning in the episodic buffer, mediated by the dorsal fronto-parietal pathways from the superior and middle frontal gyri (BA9, 10) to the superior and inferior parietal lobules (BA7, 40) bilaterally; and (3) the left inferior frontal gyrus (BA47) is pivotal for the perception as well as production of meaningless and meaningful sounds in terms of brain hemodynamics specified by systolic and diastolic pulsations.
These findings are valuable in that they provide neural correlates of the missing link at the intermediate level between low-level acoustic processes and high-level linguistic processes. Speech processing is in general hard to study because the invasive methods used in animal studies cannot be applied to the human brain. The present study nevertheless revealed a facet of online speech processing by designing novel experimental paradigms that overcome this limitation. At the same time, the behavioral tasks used in the study recruited not only speech processing but also memory and learning mechanisms in the brain. As a result, we could describe associative learning and imitative learning in terms of sensorimotor integration based on the auditory-motor interface.
Another point to note is our use of verbal repetition to study language. As shown, verbal repetition is a simple but important tool for investigating online speech processing in the human brain. As speech production is inextricably linked to speech perception, it is necessary to study speech processing while both perception and production are simultaneously active. Verbal repetition is also well suited to studying auditory feedback and sensorimotor integration in speech processing, which are prominent features of human speech. However, many neurolinguistic processes in verbal repetition remain unclear, so it is important to reveal its neural mechanisms: doing so bridges the gap between sounds and speech and sheds light on early language acquisition as well.
For future studies, we need to investigate how individual speech sounds are organized at the phrase or sentence level. At this level, sequential processing and syntactic structure building are necessary, requiring more complex memory structures beyond phonological memory, e.g. memory for syntactic structures or semantic (thematic) features. For such studies, it is important to investigate the sequencing mechanism between consecutive auditory items and how a specific meaning is built through learning, which will also deepen our understanding of auditory sentence processing.