Read my lips: McGurk and speech perception

[Image: internal ear (cropped)]

Think of a situation in which you might try to lip read. Would it be chatting with friends in a noisy bar? Or watching a muted TV? Maybe you were born (or became) deaf. In all of these scenarios, the channel for auditory input is severely obstructed or entirely absent. But what if I told you that lip reading isn’t limited to those contexts? The surprising truth is that we lip read, to some extent, during practically every spoken interaction where we can observe the speaker’s face.

This means that the visual information involved in speech perception and processing is way more important than we'd imagine. For a demonstration, watch this video.[1]

As the video mentions, the phenomenon is called the McGurk effect. It’s named after one of the researchers who first discovered it – psychologists Harry McGurk and John MacDonald described the effect in their 1976 paper “Hearing Lips and Seeing Voices”. They found that when a person watches someone pronounce one sound, but hears the paired audio of another sound, they will perceive a third sound which lies somewhere in-between the visual and auditory cues.

More specifically, the paper details an experiment in which adults watched a video of a woman making the lip/mouth movements for ga, but the audio was dubbed with the syllable ba. Subjects reported hearing a third syllable, da. (When they watched the undubbed video, they correctly heard ga, and when they listened to the audio only, they correctly heard ba.)

How is da “in-between” ba and ga?

In phonetics, speech sounds, or phones (particularly consonants), are classified according to three main parameters: voicing, place of articulation, and manner of articulation. Voicing is whether your vocal folds vibrate when producing the sound – if they vibrate, the sound is voiced; if they don’t, it’s voiceless. Place of articulation deals with how and where your mouth (mostly tongue and lips) moves to constrict or obstruct the vocal tract, creating different consonants. Manner of articulation concerns the way in which air flow passes through or is blocked by the articulators (i.e. mouth/tongue/lips) when they’ve taken a given shape.

Identifying such parameters for the ba/da/ga example in the 1976 paper:

  • [Voicing] /b/, /d/, and /g/ are all VOICED consonants
    • (Try putting fingers to your Adam’s apple while saying ba, da, and ga, and you should feel the vibration of your vocal folds)
  • [Manner of articulation] /b/, /d/, and /g/ are all stops or plosives – meaning airflow is completely blocked in the mouth, with the nasal passage closed off as well
  • [Place of articulation]
    • /b/ is a bilabial – formed with both lips
    • /d/ is an alveolar – formed when the tongue tip touches the alveolar ridge (behind the upper front teeth)
    • /g/ is a velar – formed when the tongue body arches towards the velum or soft palate (upper back of the mouth)

Here’s a little sketch of the vocal tract to help visualize.

Articulators and places of articulation

[Image: drawing of the vocal tract, showing articulators and places of articulation]

We can see that the three consonants in ba, da, and ga share voicing and articulatory manner, and only differ in their place of articulation. Interestingly, the “intermediate” consonant that subjects perceived, /d/, falls almost exactly in-between /b/ (the audio consonant) and /g/ (the lip-read consonant) in its place of articulation – /b/ is formed at the front of the mouth, /g/ is formed at the back, and /d/ is formed in the middle. Our brain is combining the conflicting auditory and visual information, and perceiving a sound lying between the two cues. The combination isn’t always this exact, but the perceived phone will usually have some features in common with both sensory inputs.

The McGurk effect is strong and, as far as we know, universal. It happens even when the voice and face are in different locations, or of different genders! All people tested, regardless of their language background, have demonstrated this speech integration behavior. However, the precise kinds of synthesis (as well as the effect’s intensity) vary by language. The phenomenon is more apparent in listeners of German, English, Spanish, and Turkish than in Chinese and Japanese listeners, for example. Japanese’s particular syllabic configuration, along with cultural norms (like face avoidance), may mean that Japanese listeners don’t rely on visual evidence of speech to the same extent. Loud environments heighten the effect for everyone, though.

The overall robustness of the McGurk effect indicates just how automatically and unconsciously we combine auditory and visual stimuli when perceiving speech. A slew of related studies demonstrate that our brain really functions on constant multisensory (or multimodal) data, and is not limited to the sense we take as primary for a particular type of input.

So what other perceptual abilities might be a mix of stuff from several senses? If you have some guesses and are curious to know more, I recommend See What I’m Saying, by Lawrence D. Rosenblum. (In fact, he was the guy in the video above!) Like McGurk and MacDonald, he’s not quite a linguist, but we can forgive him.

 

[1] The video says you should have perceived the sound fa during the second clip of the man, but I actually get a kind of va / fa mix, which makes sense – va is intermediate in that it’s voiced like ba (the audio cue) but has a labiodental place of articulation, like fa (the visual cue).

*Photo attribution: page 241 Internal Ear

O Syntax Tree, O Syntax Tree!

[Image: how_lovely]

Digital voice agents like Alexa, Siri, and Google Assistant are all the rage these days. But when we talk to our smart devices, are they actually “understanding” our speech in the same way that another human understands it? Take the command, “Find flights from Chicago to New York on February 21.” We can easily comprehend this sentence; our newborn brains were predisposed to acquire language, and we’ve been using it ever since.

Computers, on the other hand, cannot acquire language. They must be trained. In order to train them, computational linguists, other linguists, and engineers have broken language down into more manageable parts that can be tackled individually. Automatic speech recognition (ASR) deals with training machines to recognize speech (via acoustic properties, etc.), and convert that speech to text. Next, natural language processing (NLP) attempts to figure out what is meant by that text[1]. An NLP system itself is composed of multiple modules[2], one of which will likely be a syntactic parser.

Today we’re going to delve into the parser component. Let’s start with some syntactic basics!

Syntax is the set of rules and processes governing sentence structure in any natural language. It involves things like word order, and constituents (words or phrases that form functional units). One of the most common ways to represent syntactic information (at least as of the 20th century) is with a syntax tree. Traditional syntax trees specify:

  • The words of a phrase/sentence
  • Part of speech for each word, usually abbreviated
    • N (noun); V (verb); P (preposition); D or DET (determiner, a.k.a. article); A (adjective); etc.
  • Larger phrases, also abbreviated
    • S (sentence); NP (noun phrase); VP (verb phrase); etc.
  • Relationships between all of the words and phrases
    • These are hierarchical relationships that show how constituents combine into larger ones (or split into smaller ones, if starting from the opposite end of the tree)

Here’s a tree diagram (specifically, a constituency tree) for the sentence, “My parakeet drinks mimosas in the morning”:

[Image: constituency tree for “My parakeet drinks mimosas in the morning”]

You can see that my parakeet forms a larger chunk which is a noun phrase, in the morning forms a larger chunk which is a prepositional phrase, drinks mimosas in the morning forms an even larger chunk which is a verb phrase, and both the NP and VP combine to form the largest chunk, a full sentence S. Remember that syntax focuses on phrasal order and structure, not meaning or context – so it can’t tell us why on earth you’re feeding boozy orange juice to your pet bird.

Onto the parsing! Very generally, a parser is a piece of software (often a trained machine learning model) that takes input text, and outputs a parse tree or similar structural representation, based on syntactic rules and statistics learned from its training data.

Many syntactic parsers include a component called a context-free grammar (CFG), which has:

  1. A set of non-terminal symbols – abbreviations for language constituents (lexical parts of speech and phrasal types):

{S, NP, VP, PP, D, N, A…}

  2. A set of terminal symbols – words of the phrase/sentence:

{drinks, parakeet, mimosas, morning, my, in, the}

  3. A set of rules like:

S → NP VP  (a sentence S is composed of a noun phrase NP and verb phrase VP)

NP → D N  (a noun phrase NP is composed of a determiner D and a noun N)

VP → VP PP  (etc.)

PP → P NP

  4. A start symbol: S

The parser starts at S, and applies its rules successively, until it arrives at the terminal symbols. The resulting parse is the labeled relationships connecting those terminals (i.e. words).
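If you’d like to play with a toy grammar yourself, here’s a minimal sketch in Python using the NLTK library. The grammar mirrors the rules above; the tiny lexicon and lowercase tokens are my own simplifications, not anything a real parser would be limited to:

```python
import nltk

# Toy context-free grammar mirroring the rules above, plus lexical rules
# that map parts of speech to actual words.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> D N | N
    VP -> V NP | VP PP
    PP -> P NP
    D  -> 'my' | 'the'
    N  -> 'parakeet' | 'mimosas' | 'morning'
    V  -> 'drinks'
    P  -> 'in'
""")

parser = nltk.ChartParser(grammar)
tokens = "my parakeet drinks mimosas in the morning".split()

# Print every tree the grammar licenses for this sentence (here, just one).
for tree in parser.parse(tokens):
    tree.pretty_print()
```

Real parsers, of course, learn statistics over thousands of such rules from annotated treebanks rather than relying on a hand-written grammar like this one.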

There are two main kinds of syntactic parsers: dependency and constituency. To keep this post to a reasonable length, I’ll focus on dependency only, but constituency parsers output structures similar to the parakeet tree above[3]. A dependency parser builds a tree for each input sentence by starting with a sentence root (usually the main verb), and assigning a head word to each word, until it gets to the end of the sentence. (Heads link to dependents.) When it’s done, each word has at least one branch, or relationship, with another word. The parser also characterizes each word-word relationship. These are things like: nominal subject of a verb (“nsubj”); object of a verb or a preposition (“dobj” and “pobj,” respectively); conjunction (“cc” for the conjunction word, and “conj” for the elements being conjoined); determiner (“det”); and adverbial modifier (“advmod”).

A visualized example will probably help. Taking that same sentence, “My parakeet drinks mimosas in the morning,” a visualization of the dependency parse might look like this:

[Image: displaCy dependency parse of “My parakeet drinks mimosas in the morning”]

Can you spot the root, or main verb? It’s the one without any arrows going towards it: drinks. The parser then finds the subject of drinks, which is parakeet, and labels that relationship “nsubj.” It finds mimosas as the direct object of drinks, and labels it “dobj.” And so on and so forth.
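To reproduce a parse like this programmatically, here’s a minimal sketch using spaCy, the Python library behind displaCy. It assumes you’ve installed spaCy and downloaded its small English model, en_core_web_sm; exact labels can vary a bit between model versions:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("My parakeet drinks mimosas in the morning.")

# Each token gets a dependency label and a head (the word it attaches to).
for token in doc:
    print(f"{token.text:10} {token.dep_:10} head: {token.head.text}")

# The root is the token labeled ROOT (it heads itself).
root = [t for t in doc if t.dep_ == "ROOT"][0]
print("Root / main verb:", root.text)  # expected: drinks
```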

Let’s look at another example, for a dollop of variety. Here is “Mr. Vanderloop had smiled and said hello”:

[Image: displaCy dependency parse of “Mr. Vanderloop had smiled and said hello”]

In this one, the past participle smiled is the root/main verb, which has multiple dependents: its subject Vanderloop, its auxiliary (a.k.a. “helping verb”) had, its conjunction and, and the other verb with which it conjoins, said. The subject Vanderloop has a dependent Mr., with which it forms a compound (proper) noun; said’s dependent is the interjection hello.

How about our sentence from the beginning, “Find flights from Chicago to New York on February 21”? How might it be parsed? (You can check your hypotheses by typing the sentence into an interactive demo of the displaCy dependency visualizer, from which the visualizations above also came[4].) Something to keep in mind here is that English imperative structure leaves the subject – whoever is being addressed – implicit.

A slight aside: I’ve chosen simple examples for demonstration, but parsing gets decidedly complicated when input sentences are themselves complicated. Questions, subordinate clauses, coordination (or all three: “What’s the name of the movie where the guy drives a flying taxi and saves the human race from aliens?”), and structurally ambiguous sentences (“The horse raced past the barn fell”) get tricky quickly.

So now we have some parsed output. How is this structured, annotated data useful? Well, one thing you can do with these word relations is identify noun phrases. Identifying noun phrases across sentences helps with another step in the NLP pipeline called Named Entity Recognition, or NER. NER tries to recognize nouns/noun phrases (names, places, dates, etc.) and label them with categories of concepts from the real world. In our flights example, “Chicago” and “New York” should get tagged with some label like CITY or GEOGRAPHIC LOCALE, and “February 21” should get tagged with DATE. Once a text has been automatically annotated for such named entities, information about those entities can then be pulled from a knowledge base (say, Wikipedia).
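As a concrete illustration, here’s a minimal NER sketch with spaCy on the flights sentence (same en_core_web_sm assumption as above). The labels shown, like GPE for geopolitical entities, come from spaCy’s label scheme, and the expected output is just my guess at what the model returns, not a guarantee:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Find flights from Chicago to New York on February 21.")

# Each recognized entity span comes with a category label.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

# Roughly expected output:
#   Chicago     -> GPE   (geopolitical entity: city/state/country)
#   New York    -> GPE
#   February 21 -> DATE
```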

Having parts of speech and word relations also makes it easier to match up the specifics of a given user command (e.g. “Text mom saying I’ll call tonight,” or “Show popular Thai restaurants near me”) with slightly more generalized intents (e.g. Send text or Get restaurants); machine models can start learning how words typically pattern across the main verb and direct object positions for various commands. Code then uses the more generalized intent to fulfill that request on a device – be it smartphone, tablet, or home speaker. “Find flights from Chicago to New York on February 21” would hopefully be matched with a more general Get flights intent, and the particular noun phrases could be passed to fields for origin, destination, and date.
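To make that concrete, here’s a toy sketch that grabs the root verb and its direct object from a spaCy parse and looks them up in a tiny hand-written intent table. The intent names and the extract_intent helper are hypothetical, purely for illustration; real assistants learn these mappings from data rather than using a lookup table.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical mapping from (root verb lemma, direct object lemma) to an intent.
INTENTS = {
    ("find", "flight"): "Get flights",
    ("show", "restaurant"): "Get restaurants",
    ("text", "mom"): "Send text",
}

def extract_intent(command: str) -> str:
    """Toy sketch: guess an intent from the root verb and its direct object."""
    doc = nlp(command)
    root = [t for t in doc if t.dep_ == "ROOT"][0]
    # Direct object of the root, if the parser found one ("dobj" in spaCy's English models).
    dobj = next((child for child in root.children if child.dep_ == "dobj"), None)
    key = (root.lemma_.lower(), dobj.lemma_.lower() if dobj else None)
    return INTENTS.get(key, "Unknown intent")

print(extract_intent("Find flights from Chicago to New York on February 21."))
# Expected: "Get flights" (assuming the parser labels "flights" as the dobj of "Find")
```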

* * * * *

Before leaving you to your holiday leftovers, I’d like to reiterate that syntactic parsing is only one step in an NLP system. Its parses don’t tell us much about the actual semantics of the linguistic input. Language meaning, however, is a whole other ball of wax, best left for the new year…

 

[1] There is often terminological confusion between NLP and NLU (natural language understanding). See this graphic for one common breakdown, although I’ve heard the terms used interchangeably as well.

[2] If you’re interested to learn about other NLP steps, read this accessible post, Natural Language Processing is Fun!

[3] You can also play around with this interactive demo from Stanford CoreNLP, http://corenlp.run. In the second “Annotations” field dropdown, make sure you have “constituency parse” selected.

[4] The visualizer is from the creators of spaCy, an awesome open-source NLP library in Python; a dependency parser is one of its components.

Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin…

[Image: “All work and no play makes Jack a dull boy” painting by Florian Rodarte]

Happy Halloween!

In tribute to the holiday (my favorite), here’s a smidge of spooky linguistics. Ever play that game where you repeat some word so many times it starts to lose its meaning? That’s actually a thing, called semantic satiation.

Semantic satiation is a psychological phenomenon in which a person temporarily loses the meaning of a repeated word and perceives only nonsensical sounds. It can happen via reading as well as at the verbal/aural level. The term was coined by psychology professor Leon Jakobovits James in his 1962 dissertation[1]. His and later research shows that repeating a word triggers the same peripheral sensorimotor and central neural activity in the cortex over and over (activity corresponding with the meaning of that particular expression), which in turn causes reactive inhibition. Consequently, the strength of activity diminishes with each new repetition. More recent semantic verification studies have confirmed that this satiation legitimately falls under semantic memory, and is not just a byproduct of exhaustion of pre-semantic sensory/perceptual processes (for example, the acoustic system).[2]

That’s all well and good, but how is it spooky? In answer I say, see this movie – Pontypool.

In case you don’t have time to view it right this minute, my non-spoiler summary:

“Pontypool” is a low-budget psychological thriller that I found delightfully horrifying. The premise revolves around a virus that is infecting people in a small, remote Canadian town. The virus spreads…through language. When people hear an infected word, they begin repeating it until their entire speech grows garbled, they turn insane and zombie-like, and finally start attacking others in gory fashion.

I highly recommend leaving the office early today to go watch the film. Or to go trick-or-treating. Or something. Since everyone knows that all work and no play…

 

[Image: Droste effect TV]

 

[1] “Effects of repeated stimulation on cognitive aspects of behavior: some experiments on the phenomenon of semantic satiation”
[2] See “Introduction – Recent Studies” in “On the Locus of the Semantic Satiation Effect: Evidence from Event-Related Brain Potentials”

*Photo attributions: The Shining painting; Droste effect tv

I heart hangry bagel droids (or: How new words form)

[Image: The fin de siècle newspaper proprietor (cropped)]

You’re probably familiar with the old adage “the only thing that’s constant is change.” Still, so many people tend to think about language as a relatively fixed affair. I’ve said it before (and will inevitably say it again): all living languages change all the time, and at all levels – phonological (sounds!), morphological (word-bits!), lexical (words!), syntactic (clauses!), and semantic (meaning!).

Historical linguistics (also known as diachronic linguistics) is the study of how and why languages change over time. In this post I’m going to discuss categories of change at the morphological and lexical levels – how new words come into being. In the future, I’ll explore semantic and perhaps phonological change.

Without further ado, here are the main mechanisms of word formation. Almost all examples are for English, but these formation types apply to other languages as well. (NOTE: Processes are not mutually exclusive. It is quite possible for a word to undergo multiple processes simultaneously, or one on the heels of another.)

  1. Derivation

New words are born by adding affixes to existing words. Affixes are bound[1] morphemes that can be prefixes, suffixes, and even (for certain languages, although not really for English) infixes and circumfixes. Derivation is a very common process cross-linguistically.

Zero derivation (also known as conversion) is a special case where a new word, with a new word class (part of speech), is created from an existing word of a different class, without any change in form.

Examples:
(Derivation) hater [hate + -er], truthiness [truth + -i (-y) + -ness], deglobalization [de- + globalization], hipsterdom [hipster + -dom]

(Zero derivation) heart as verb, as in “I heart coffee” [heart as noun]; friend as verb, as in “he friended me on Facebook” [friend as noun]; green as noun, in the golf lawn sense [green as adjective]; down as verb, as in “Hector downed a beer” [down as preposition]

  2. Back-formation

This process creates a new word through the removal of true or incorrectly assumed affixes. It’s kind of the opposite of derivation. This one is easier to explain through examples:

  • donate, automate, resurrect (verbs), from donation, automation, resurrection (nouns) – The nouns were borrowed into English first from Latin. The verbs were back-formed later by discarding the -ion suffix, which speakers did through analogy with other Latinate verb and (-ion) noun pairs that already existed in English.

  • pea, from pease – The older form was initially a mass noun (like water or sand), but was reanalyzed as plural. People then dropped the “plural” -s(e) to form the “singular” count noun pea.

  • beg, edit, hawk (verbs), from beggar, editor, hawker (nouns) – Speakers mistook the -ar, -or, and -er on the ends of these nouns (respectively) for the agentive suffix (that did/does exist in English), and removed it to form corresponding verbs.

  • lime-a-rita, mango-rita, from margarita; appletini, kiwini, from martini – These are actually examples of folk etymology, which is related to back-formation. Here, speakers incorrectly assumed that -rita in margarita and -(t)ini in martini were separate morphemes (indicating the class of cocktail). Under that assumption, they switched out the rest of the word and substituted it with morphemes indicating new twists/ingredients.

  3. Blending

Also known as portmanteaus. Blends are produced by combining two or more words, where parts of one or both words are deleted.

Examples: smog [smoke + fog], brunch [breakfast + lunch], infomercial [information + commercial], bromance [bro + romance], hangry [hungry + angry], clopen [close + open][2]

  4. Borrowing

Also known as loan words. These are expressions taken from other languages. Pronunciation is usually altered to fit the phonological rules of the borrowing language.

Examples: algebra [from Arabic], ménage à trois [from French], whisky [from Scots Gaelic or Irish], bagel [from Yiddish], doppelgänger [from German], karaoke [from Japanese]

  5. Coinage

Words can be created outright to fit some purpose. Many of these are initially product names.

Examples: Xerox, Kleenex, Jell-O, Google, zipper, Frisbee

  6. Compounding

Two or more words join together to form a compound. Frequently the joining words are nouns, but they can belong to different parts of speech, including verbs, adjectives, prepositions, etc. Compounds can be separated by spaces, by hyphens, or glued to each other with nothing intervening.

Examples: homework, grocery store, mother-of-pearl, first world problem, binge-watch, weaksauce, fake news

  7. Eponyms

These are words that derive from proper nouns – usually people and place names. If a proper noun is used frequently enough and across multiple contexts, it eventually becomes a common noun (or verb or adjective).

Examples: sandwich [after the fourth Earl of Sandwich], gargantuan [after Gargantua, name of the giant in Rabelais’ novels], boycott [after Capt. Charles C. Boycott], mesmerize [a back-formation from mesmerism, in turn after Franz Anton Mesmer], sadism [after the Marquis de Sade]

  8. Reducing

Several types of reducing processes exist.  The main ones are clipping, acronyms, and initialisms.

a. Clipping

New words can be formed by shearing one or more syllables off an existing longer word. Syllables can be removed from the word’s beginning, end, or both.

Examples: fax [facsimile], flu [influenza], droid [android], fridge [refrigerator], blog [weblog]

b. Acronyms

Words are created from the initial letters of several other words. Acronyms are pronounced as regular words (in contrast to initialisms below).

Examples: NASA [National Aeronautics and Space Administration], RAM [random-access memory], FOMO [fear of missing out]

c. Initialisms

Also known as alphabetisms. As with acronyms, a word is created from the initial letters of other words, but the resulting term is pronounced by saying each letter. This usually happens when the string of letters is not easily pronounced as a word according to the phonological rules of the language.

Examples: NFL [National Football League], UCLA [University of California, Los Angeles], MRI [magnetic resonance imaging], WTF [what the fuck]

  9. Reduplication

Reduplication is one of my favorite phenomena.[3] It’s a process whereby a word or sound is repeated or nearly repeated to form a new word/expression. This is a productive morphological process (meaning, it’s part of the grammar and happens frequently and rather systematically) in many languages – South-East Asian and Austronesian languages particularly (e.g. Malay, Tagalog, Samoan). It’s not an especially productive process in English, although it does still happen.

Examples:
(English) wishy-washy, teensy-weensy, goody-goody, cray-cray, po-po

(Samoan) savali [‘he travels’ – third person singular + verb]; savavali [‘they travel’ – third person plural + verb]

* * * * *

Phew! Since hopefully you can see the light at the end of this long lexical tunnel, I’ll mention that of course languages lose words as well. Diverse factors motivate word loss, but that’s a subject for another post. A few quick examples of words that have fallen out of favor in English:

pell-mell [in a disorderly, reckless, hasty manner]; davenport [couch/sofa – my grandma used to say this]; grass [for marijuana – my mom still says this]; porridge [an oatmeal-like dish boiled in water or milk]; tumbrel [a farmer’s cart for hauling manure]; fain [gladly or willingly]

* * * * *

And now… ADD WORDS TO THE SPREADSHEET – Word shenanigans!

I’ve got almost 200 in there to start us off. If you’re not sure about the process for any particular word, just leave it blank or take a guess. Free bagel droids[4] to all who contribute.

 

[1] Bound meaning they cannot exist on their own, but must be attached to another morpheme.

[2] Describes a shitty situation where one has to work a closing shift followed by an opening shift. We used this term as bartenders, although I’d never seen it in print until recently. It came up in some paperwork I had to sign relating to work week ordinances, and then I saw it here as well.

[3] Some languages even have triplication – where the sound/word is copied twice!

[4] Kidding! These do not exist outside of my head. Sorry.