Read my lips: McGurk and speech perception

Think of a situation in which you might try to lip read. Would it be chatting with friends in a noisy bar? Or watching a muted TV? Maybe you were born (or became) deaf. In all of these scenarios, the channel for auditory input is severely obstructed or entirely absent. But what if I told you that lip reading isn’t limited to those contexts? The surprising truth is that we lip read, to some extent, during practically every spoken interaction where we can observe the speaker’s face.

This means that the visual information involved in speech perception and processing is way more crucial than we would imagine. For a demonstration, watch this video.[1]

As the video mentions, the phenomenon is called the McGurk effect. It’s named after one of the researchers who first discovered it – psychologists Harry McGurk and John MacDonald described the effect in their 1976 paper “Hearing Lips and Seeing Voices”. They found that when a person watches someone pronounce one sound, but hears the paired audio of another sound, they will perceive a third sound which lies somewhere in-between the visual and auditory cues.

More specifically, the paper details an experiment in which adults watched a video of a woman making the lip/mouth movements for ga, but the audio was dubbed with the syllable ba. Subjects reported hearing a third syllable, da. (When they watched the undubbed video, they correctly heard ga, and when they listened to the audio only, they correctly heard ba.)

How is da “in-between” ba and ga?

In phonetics, speech sounds, or phones (particularly consonants), are classified according to three main parameters: voicing, place of articulation, and manner of articulation. Voicing is whether your vocal folds vibrate when producing the sound – if they vibrate, the sound is voiced; if they don’t, it’s voiceless. Place of articulation deals with how and where your mouth (mostly tongue and lips) moves to constrict or obstruct the vocal tract, creating different consonants. Manner of articulation concerns the way in which airflow passes through or is blocked by the articulators (i.e. mouth/tongue/lips) when they’ve taken a given shape.

Identifying such parameters for the ba–da–ga example in the 1976 paper:

[Voicing] /b/, /d/, and /g/ are all VOICED consonants
- (Try putting fingers to your Adam’s apple while saying ba, da, and ga, and you should feel the vibration of your vocal folds)
[Manner of articulation] /b/, /d/, and /g/ are all stops or plosives – meaning airflow through both the mouth and the nose is briefly blocked entirely
[Place of articulation]
- /b/ is a bilabial – formed with both lips
- /d/ is an alveolar – formed when the tongue tip touches the alveolar ridge (behind the upper front teeth)
- /g/ is a velar – formed when the tongue body arches towards the velum or soft palate (upper back of the mouth)

Here’s a little sketch of the vocal tract to help visualize.

[Figure: Articulators and places of articulation]

We can see that the three consonants in ba, da, and ga share voicing and articulatory manner, and only differ in their place of articulation. Interestingly, the “intermediate” consonant that subjects perceived, /d/, falls almost exactly in-between /b/ (the audio consonant) and /g/ (the lip-read consonant) in its place of articulation – /b/ is formed at the front of the mouth, /g/ is formed at the back, and /d/ is formed in the middle. Our brain is combining the conflicting auditory and visual information, and perceiving a sound lying between the two cues. The combination isn’t always this exact, but the perceived phone will usually have some features in common with both sensory inputs.
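For the programmers in the audience, here is one way to make that “in-between” idea concrete. It is only a toy sketch: the feature table covers just the three consonants discussed above, and the names (PLACE_ORDER, CONSONANTS, lies_between) are invented for illustration rather than taken from McGurk and MacDonald’s paper. The list order encodes front-to-back position only, not any measured distance.

```python
# Places of articulation mentioned in this post, ordered front to back.
# Assumption: only the ordering matters here, not real articulatory distances.
PLACE_ORDER = ["bilabial", "alveolar", "velar"]

# Toy feature table for the three consonants in the ba/da/ga example.
CONSONANTS = {
    "b": {"voiced": True, "manner": "stop", "place": "bilabial"},
    "d": {"voiced": True, "manner": "stop", "place": "alveolar"},
    "g": {"voiced": True, "manner": "stop", "place": "velar"},
}


def place_index(phone):
    """Return the front-to-back rank of a consonant's place of articulation."""
    return PLACE_ORDER.index(CONSONANTS[phone]["place"])


def shared_features(a, b):
    """Return the names of the features on which two consonants agree."""
    return {
        name
        for name in ("voiced", "manner", "place")
        if CONSONANTS[a][name] == CONSONANTS[b][name]
    }


def lies_between(candidate, audio, visual):
    """True if candidate's place falls strictly between the two cues' places."""
    low, high = sorted((place_index(audio), place_index(visual)))
    return low < place_index(candidate) < high


# The audio cue /b/ and the visual cue /g/ agree on voicing and manner...
print(sorted(shared_features("b", "g")))  # ['manner', 'voiced']
# ...and the perceived /d/ sits between them in place of articulation.
print(lies_between("d", "b", "g"))  # True
```

A real model of audiovisual integration would need far more than an ordered list, of course, but the sketch shows why /d/ is a natural compromise: it keeps everything the two cues agree on and splits the difference on the one feature where they conflict.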

The McGurk effect is strong and, as far as we know, universal. It happens even when the voice and face are in different locations, or of different genders! All people tested, regardless of their language background, have demonstrated this speech integration behavior. However, the precise kinds of synthesis (as well as the effect’s intensity) vary by language. The phenomenon is more apparent in German, English, Spanish, and Turkish listeners than in Chinese and Japanese listeners, for example. The particular syllabic configuration of Japanese, and cultural norms (like face avoidance), may mean that Japanese listeners don’t rely on visual evidence of speech to the same extent. Loud environments heighten the effect for everyone, though.

The overall robustness of the McGurk effect indicates just how automatically and unconsciously we combine auditory and visual stimuli when perceiving speech. A slew of related studies demonstrate that our brain really functions on constant multisensory (or multimodal) data, and is not limited to the sense we take as primary for a particular type of input.

So what other perceptual abilities might be a mix of stuff from several senses? If you have some guesses and are curious to know more, I recommend See What I’m Saying, by Lawrence D. Rosenblum. (In fact, he was the guy in the video above!) Like McGurk and MacDonald, he’s not quite a linguist, but we can forgive him.

[1] The video says you should have perceived the sound fa during the second clip of the man, but I actually get a kind of va / fa mix, which makes sense – va is intermediate in that it’s voiced like ba (the audio cue) but has a labiodental place of articulation, like fa (the visual cue).

*Photo attribution: page 241 Internal Ear
