Zhang Caicai. 2018

Journal of Chinese Linguistics Monograph Series No. 27: Phonetic Constancy in the Perception of Chinese Tones

How humans achieve constancy in the perception of an object (e.g., the size, color and brightness of a visual object) despite variations in its physical appearance is a fundamental question in human cognition. Phonetic constancy, e.g., the ability to recognize a speech sound produced by different talkers as the same sound despite acoustic variation, is equally critical in speech perception. Multiple mechanisms have been proposed in the literature to account for phonetic constancy, based primarily on studies of the perception of consonants and vowels. For instance, the intrinsic normalization mechanism holds that the critical acoustic cues of a speech sound (e.g., fundamental frequency, F0) are rescaled or transformed against other cues to a talker’s voice characteristics (e.g., voice quality) that are contained in the speech target itself, thereby reducing variation. The extrinsic normalization mechanism, on the other hand, emphasizes the importance of extrinsic cues, e.g., a speech context. According to this mechanism, listeners adapt to a particular talker’s voice via the distribution of acoustic cues in the surrounding context. However, few studies have examined the perception of lexical tones, which are highly susceptible to talker variation. As a result, it remains unclear which mechanisms support the perceptual normalization of tones, and to what extent the mechanisms proposed on the basis of consonant and vowel studies apply to tones. Furthermore, neuroimaging studies of phonetic constancy are relatively scarce, and the neural signatures of the normalization processes remain largely unknown. In this monograph, the author reports a series of behavioral and neuroimaging studies conducted to examine the psychological mechanisms and neural processes of talker normalization, using Chinese tones as a test case. With these studies and related work in the literature, an understanding of how phonetic constancy is achieved in lexical tone perception is emerging.
The major findings are summarized below. First, in a cross-linguistic study, tone inventories were found to influence the categorization of multi-talker tone stimuli. Mandarin listeners correctly categorized multi-talker stimuli presented in isolation (i.e., via intrinsic normalization), whereas Cantonese listeners performed poorly. This suggests that intrinsic cues may be sufficient for tone normalization in simpler tone inventories like Mandarin, where tones are primarily distinguished by F0 contour, but not in more complex tone inventories like Cantonese, where several tones share a similar F0 contour. This finding has implications for understanding how the structure of a phonological inventory affects its resistance to talker variability. Second, without contextual cues, the accuracy of categorizing multi-talker tone stimuli in Cantonese was low and greatly affected by talker typicality. Cantonese words with level tones produced by typical talkers, whose F0 range is close to the population-average F0 range, were often correctly categorized, whereas the same words produced by less typical talkers, whose F0 range is higher or lower than the population average, were often biased towards higher or lower tones, respectively. This suggests that Cantonese listeners rely on a set of tone templates/representations shaped by population-average F0 characteristics when perceiving tones without contextual cues. Third, speech contexts containing cues to a talker’s full F0 range (i.e., extrinsic normalization) greatly enhanced phonetic constancy in Cantonese tone categorization and eliminated the influence of talker typicality, such that the accuracy of tone categorization was uniformly high regardless of whether the talkers were typical or less typical. This confirms the importance of extrinsic normalization in Cantonese tone perception.
The context effect is the cumulative end product of multiple levels of cues in the context (general auditory, phonetic, phonological, semantic and syntactic cues). However, it is primarily driven by phonological cues, which help listeners adapt to a particular talker’s tonal space, whereas the effect of general auditory cues (e.g., a nonspeech context) is negligible. Fourth, the author used event-related potential (ERP) methods to study the temporal loci of extrinsic normalization in Cantonese tone perception. The earliest reliable effects of extrinsic normalization were observed in the time-windows of the N400 (250-500 ms) and the late positive component (LPC, 500-800 ms). This suggests that speech contexts facilitated lexical activation in the N400 time-window, presumably by reducing the lexical ambiguity or competition caused by talker variability, and further facilitated decisional processes in the LPC time-window. When extrinsic normalization was implemented in a top-down way, by pre-adjusting the phonetic expectation of a tone according to talker-specific F0 cues obtainable from a speech context in order to guide the analysis of F0 in incoming speech signals, the effects of tone normalization shifted earlier, into pre-lexical phonemic processing in the time-window of the phonological mismatch negativity (PMN, 250-350 ms). Last, the neural circuits subserving the integral processing of lexical tone and talker information were examined in a functional MRI (fMRI) study. In order to recognize speech sounds produced by different talkers, listeners adapt to a particular talker’s voice, suggesting that phonetic processing relies on talker processing. This raises the question of whether phonetic processing and talker processing are subserved by overlapping brain circuits in the processing pathway.
The author found that lexical tone and talker changes are processed integrally in the bilateral superior temporal gyrus (STG), providing evidence for a general neural mechanism of integral phonetic and talker processing in the bilateral STG, irrespective of the specific acoustic parameters involved (F0 or vocal tract length). Based on the findings above, the author proposed a new model of talker normalization, which integrates the effects of population-level tone templates/representations and the dynamic context processes described above. The author also proposed a hybrid model of multi-level representations of tones, from the lowest level, containing talker-specific episodic exemplars, through the intermediate level of population-level tone templates/representations, to the highest level of abstract representations. These models should be carefully tested in future studies, and modified where necessary, to reach a deeper and more general understanding of the mechanisms of talker normalization and the nature of the representations of speech sounds in the brain. Finally, the ERP and fMRI studies reported here, though exploratory, are among the first to examine the temporal and spatial neural signatures of phonetic constancy in tone perception. More neuroimaging studies are required to achieve a full understanding of the neurobiological bases of phonetic constancy. Future directions are also identified and discussed.
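The contrast between matching against population-level tone templates and adapting to a talker through a speech context can be illustrated with a small simulation. The sketch below is only a toy under stated assumptions, not the model proposed in the monograph: the F0 values for the three level tones are hypothetical, each talker is simulated by a hypothetical uniform scaling of those values, and extrinsic normalization is approximated by a simple z-score rescaling against the context.

```python
from statistics import mean, pstdev

# Hypothetical population-average F0 values (Hz) for three level tones.
# These numbers are illustrative assumptions, not data from the monograph.
POPULATION_TEMPLATES = {"high": 200.0, "mid": 170.0, "low": 150.0}


def nearest_tone(value, templates):
    """Return the tone label whose template value is closest to `value`."""
    return min(templates, key=lambda tone: abs(templates[tone] - value))


def categorize_isolated(f0_hz):
    """No context: match the token directly against population templates."""
    return nearest_tone(f0_hz, POPULATION_TEMPLATES)


def categorize_with_context(f0_hz, context_f0_hz):
    """With context: z-score the token against the talker's context F0,
    then match it against z-scored population templates.
    (Z-scoring is an assumed stand-in for extrinsic normalization.)"""
    ctx_mean, ctx_sd = mean(context_f0_hz), pstdev(context_f0_hz)
    token_z = (f0_hz - ctx_mean) / ctx_sd
    pop_values = list(POPULATION_TEMPLATES.values())
    pop_mean, pop_sd = mean(pop_values), pstdev(pop_values)
    templates_z = {
        tone: (value - pop_mean) / pop_sd
        for tone, value in POPULATION_TEMPLATES.items()
    }
    return nearest_tone(token_z, templates_z)


def simulate_talker(scale):
    """Simulate a talker whose F0 is the population value times `scale`
    (a hypothetical simplification of talker variation).
    Returns {intended tone: (isolated response, context response)}."""
    productions = {t: v * scale for t, v in POPULATION_TEMPLATES.items()}
    # The context is assumed to span the talker's full F0 range.
    context = list(productions.values())
    return {
        tone: (categorize_isolated(f0), categorize_with_context(f0, context))
        for tone, f0 in productions.items()
    }


if __name__ == "__main__":
    talkers = [("typical", 1.0), ("low-pitched", 0.85), ("high-pitched", 1.15)]
    for label, scale in talkers:
        print(label, simulate_talker(scale))
```

In this toy setup, the isolated responses for the less typical talkers drift toward lower or higher tones, while the context-based responses recover the intended tone for every talker, mirroring the qualitative pattern reported for the Cantonese level tones. The uniform scaling makes the z-scored values identical across talkers, which is why the context condition is perfect here; real talker variation is not this tidy.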