Read my lips: McGurk and speech perception

[Image: internal ear]

Think of a situation in which you might try to lip read. Would it be chatting with friends in a noisy bar? Or watching a muted TV? Maybe you were born (or became) deaf. In all of these scenarios, the channel for auditory input is severely obstructed or entirely absent. But what if I told you that lip reading isn’t limited to those contexts? The surprising truth is that we lip read, to some extent, during practically every spoken interaction where we can observe the speaker’s face.

This means that the visual information involved in speech perception and processing is way more crucial than we would imagine. For a demonstration, watch this video.[1]

As the video mentions, the phenomenon is called the McGurk effect, named after one of the two researchers who first described it – psychologists Harry McGurk and John MacDonald reported the effect in their 1976 paper “Hearing Lips and Seeing Voices”. They found that when a person watches someone pronounce one sound but hears the paired audio of another sound, they will perceive a third sound lying somewhere in-between the visual and auditory cues.

More specifically, the paper details an experiment in which adults watched a video of a woman making the lip/mouth movements for ga, but the audio was dubbed with the syllable ba. Subjects reported hearing a third syllable, da. (When they watched the undubbed video, they correctly heard ga, and when they listened to the audio only, they correctly heard ba.)

How is da “in-between” ba and ga?

In phonetics, speech sounds, or phones (particularly consonants), are classified according to three main parameters: voicing, place of articulation, and manner of articulation. Voicing is whether your vocal folds vibrate when producing the sound – if they vibrate, the sound is voiced; if they don’t, it’s voiceless. Place of articulation is where in the mouth the articulators (mostly the tongue and lips) move to constrict or obstruct the vocal tract, creating different consonants. Manner of articulation concerns the way in which airflow passes through or is blocked by the articulators (i.e. mouth/tongue/lips) once they’ve taken a given shape.

Identifying these parameters for the ba/da/ga example in the 1976 paper:

  • [Voicing] /b/, /d/, and /g/ are all VOICED consonants
    • (Try putting fingers to your Adam’s apple while saying ba, da, and ga, and you should feel the vibration of your vocal folds)
  • [Manner of articulation] /b/, /d/, and /g/ are all stops or plosives – meaning airflow through the mouth and nose is entirely blocked
  • [Place of articulation]
    • /b/ is a bilabial – formed with both lips
    • /d/ is an alveolar – formed when the tongue tip touches the alveolar ridge (behind the upper front teeth)
    • /g/ is a velar – formed when the tongue body arches towards the velum or soft palate (upper back of the mouth)

Here’s a little sketch of the vocal tract to help visualize.

Articulators and places of articulation

[Image: sketch of the vocal tract]

We can see that the three consonants in ba, da, and ga share voicing and articulatory manner, and only differ in their place of articulation. Interestingly, the “intermediate” consonant that subjects perceived, /d/, falls almost exactly in-between /b/ (the audio consonant) and /g/ (the lip-read consonant) in its place of articulation – /b/ is formed at the front of the mouth, /g/ is formed at the back, and /d/ is formed in the middle. Our brain is combining the conflicting auditory and visual information, and perceiving a sound lying between the two cues. The combination isn’t always this exact, but the perceived phone will usually have some features in common with both sensory inputs.
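To make the comparison concrete, here’s a small sketch in Python. The feature table and the front-to-back numbering of places are my own toy illustration (the numeric scale isn’t standard phonetic notation), but they capture the point: /b/, /d/, and /g/ agree on voicing and manner, and the perceived /d/ sits between the audio /b/ and the lip-read /g/ in place of articulation.

```python
# A toy feature table for the three stops in the ba/da/ga example.
# Places of articulation are listed front-to-back, so an index into this
# list tells us roughly "where in the mouth" a consonant is formed.
PLACE_ORDER = ["bilabial", "labiodental", "dental", "alveolar",
               "postalveolar", "palatal", "velar"]

CONSONANTS = {
    "b": {"voiced": True, "manner": "stop", "place": "bilabial"},
    "d": {"voiced": True, "manner": "stop", "place": "alveolar"},
    "g": {"voiced": True, "manner": "stop", "place": "velar"},
}

def place_index(c):
    """Front-to-back position of a consonant's place of articulation."""
    return PLACE_ORDER.index(CONSONANTS[c]["place"])

audio, visual, perceived = "b", "g", "d"

# /b/, /d/, and /g/ all share voicing and manner...
assert len({CONSONANTS[c]["voiced"] for c in "bdg"}) == 1
assert len({CONSONANTS[c]["manner"] for c in "bdg"}) == 1

# ...and the perceived /d/ falls between the audio /b/ (front of the
# mouth) and the lip-read /g/ (back of the mouth).
assert place_index(audio) < place_index(perceived) < place_index(visual)
print("front-to-back positions (b, d, g):",
      place_index(audio), place_index(perceived), place_index(visual))
```

Running this prints the three positions in order (0, 3, 6 on this made-up scale), which is just the “in-between” relationship described above, restated as data.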

The McGurk effect is strong and, as far as we know, universal. It happens even when the voice and face are in different locations, or belong to different genders! All people tested, regardless of language background, have demonstrated this speech integration behavior. However, the precise kinds of synthesis (as well as the effect’s intensity) vary by language. The phenomenon is more apparent in listeners of German, English, Spanish, and Turkish than in Chinese and Japanese participants, for example. The particular syllabic configuration of Japanese, along with cultural norms (like face avoidance), may mean that Japanese listeners don’t rely on visual evidence of speech to the same extent. Noisy environments heighten the effect for everyone, though.

The overall robustness of the McGurk effect indicates just how automatically and unconsciously we combine auditory and visual stimuli when perceiving speech. A slew of related studies demonstrate that our brain really functions on constant multisensory (or multimodal) data, and is not limited to the sense we take as primary for a particular type of input.

So what other perceptual abilities might be a mix of stuff from several senses? If you have some guesses and are curious to know more, I recommend See What I’m Saying, by Lawrence D. Rosenblum. (In fact, he was the guy in the video above!) Like McGurk and MacDonald, he’s not quite a linguist, but we can forgive him.

 

[1] The video says you should have perceived the sound fa during the second clip of the man, but I actually get a kind of va / fa mix, which makes sense – va is intermediate in that it’s voiced like ba (the audio cue) but has a labiodental place of articulation, like fa (the visual cue).

*Photo attribution: page 241 Internal Ear

Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin…

[Image: “All work and no play makes Jack a dull boy” painting by Florian Rodarte]

Happy Halloween!

In tribute to the holiday (my favorite), here’s a smidge of spooky linguistics. Ever play that game where you repeat some word so many times it starts to lose its meaning? That’s actually a thing, called semantic satiation.

Semantic satiation is a psychological phenomenon in which a person temporarily loses the meaning of a repeated word and perceives only nonsensical sounds. It can happen via reading as well as at the verbal/aural level. The term was coined by psychology professor Leon Jakobovits James in his 1962 dissertation[1]. His and later research show that word repetition repeatedly triggers the peripheral sensorimotor and central cortical activity corresponding to the word’s meaning, which in turn causes reactive inhibition: the strength of the activity diminishes with each new repetition. More recent semantic verification studies have confirmed that this satiation genuinely occurs at the level of semantic memory, and is not just a byproduct of fatigue in pre-semantic sensory/perceptual processes (for example, the acoustic system).[2]
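The “diminishing with each repetition” part can be pictured with a tiny toy simulation. To be clear, this is my own illustration, not a model from any of the papers – the decay rate is made up, chosen only to show the trend of reactive inhibition.

```python
# Toy illustration of reactive inhibition (not from the cited research):
# each repetition of a word leaves less activation for its meaning.
DECAY = 0.7  # fraction of activation retained per repetition (arbitrary)

def activation(repetition, initial=1.0, decay=DECAY):
    """Activation strength of a word's meaning after n repetitions."""
    return initial * decay ** repetition

strengths = [round(activation(n), 3) for n in range(6)]
print(strengths)  # activation shrinks toward zero as repetitions pile up
```

By the fifth or sixth repetition the activation is a small fraction of where it started – the word is still a sound, but its meaning has (temporarily) drained away.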

That’s all well and good, but how is it spooky? In answer I say, see this movie – Pontypool.

In case you don’t have time to view it right this minute, my non-spoiler summary:

“Pontypool” is a low-budget psychological thriller that I found delightfully horrifying. The premise revolves around a virus that is infecting people in a small, remote Canadian town. The virus spreads…through language. When people hear an infected word, they begin repeating it until their entire speech grows garbled, they turn insane and zombie-like, and finally start attacking others in gory fashion.

I highly recommend leaving the office early today to go watch the film. Or to go trick-or-treating. Or something. Since everyone knows that all work and no play…

 

[Image: Droste effect TV]

 

[1] “Effects of repeated stimulation on cognitive aspects of behavior: some experiments on the phenomenon of semantic satiation”
[2] See “Introduction – Recent Studies” in “On the Locus of the Semantic Satiation Effect: Evidence from Event-Related Brain Potentials”

*Photo attributions: The Shining painting; Droste effect tv