Accents & dialects (part I): Yinzers and jawn

yinzers sign

What’s the difference between an accent, a dialect, and a language? These concepts are prone to a multitude of misconceptions, often with adverse consequences for millions of people whose speech doesn’t fall within the realm of what’s considered “standard” for their particular region. In this series of posts, I summarize three articles about accents and dialects, and I hope to pique your interest such that you check out the full pieces themselves!

To answer the initial question: an accent is one’s pronunciation and prosody (intonation, tone, stress, etc.) particularized by individual, geographic, temporal, cultural, and socioeconomic factors. A dialect is an accent PLUS all the other linguistic features of a language (syntax, lexicon, idioms, slang) also influenced by those factors. A language is basically a convenient abstraction over a grouping of mutually-intelligible dialects. It helps us conceptualize things, but it’s sometimes hard to draw fool-proof, scientifically valid lines between what’s a language versus a dialect, and aspects like culture and nationality further muddle these line-drawing attempts.

Consider the following two cases, mentioned frequently in linguistic realms: Swedish, Norwegian, and Danish are relatively mutually intelligible – in reality, they’re probably closer to dialects of a single Scandinavian language – but because they’re spoken in separate countries, they’re considered separate languages. The opposite situation holds for “Chinese.” There is actually no single “Chinese” language. There’s Mandarin and Cantonese, which are NOT mutually intelligible, as well as hundreds of other “dialects” across China which are also not necessarily understandable between their groups of speakers. However, because all of the speakers reside within a single nation (and share a writing system, among other things), Mandarin and Cantonese (and others) are usually considered “dialects” of a single, monolithic “Chinese” language.

To use an oft-quoted expression: “a language is a dialect with an army and navy.” This conveys the idea that the distinction between a language and a dialect is arbitrary, becoming possible only through a social lens; a language almost always has more official recognition, more cultural clout, more political power, etc. than a dialect.

Okay, time for some articles. The piece linked and summarized below is an enjoyable read about Pennsylvania dialects. My Part II follow-up will discuss accents, and Part III will consider Black English (also called AAVE – African-American Vernacular English).

greetings from PA

Pennsylvania dialects

Article: “Where Yinz At” (Slate)

“Pennsylvania, in case yinz didn’t know, is a regional dialect hotbed nonpareil.”

While states have on average two to three dialects, Pennsylvania has five – the ones associated with Philadelphia and Pittsburgh being the most widely known.

“The Philadelphia dialect features a focused avoidance of the ‘th’ sound, the swallowing of the L in lots of words, and wooder instead of water, among a zillion other things. In Pittsburgh, it’s dahntahn for downtown, and words like nebby and jagoff and yinz.”

(To sample the actual dialects, watch the funny clip embedded in the article – a skit of a Philly-Pittsburgh phone conversation between two pawnbrokers.)

Geography and migration likely shaped the unique speech patterns found in the Keystone State. North of Interstate 80 (which roughly bisects the state), ways of speaking were influenced by immigrants from southern England. South of that boundary line, people came from northern England, Scotland, and Northern Ireland. The Allegheny Mountains also created a barrier between Pittsburgh and other parts of the state. After a couple hundred years, Philadelphians and Pittsburghers have come to sound pretty distinct from each other.

“[…] people from Pittsburgh are talking about ‘gettin’ off the caach and gone dahntawn on the trawly to see the fahrworks for the Fourth a July hawliday n’at,’ while Philadelphia folks provide linguistic gems like the one Monahan offered up as the most Philly sentence possible: ‘Yo Antny, when you’re done your glass of wooder, wanna get a hoagie on Thirdyfish Street awn da way over to Moik’s for de Iggles game?’”

University of Pennsylvania linguistics professor William Labov says the Philadelphia dialect is generally a source of pride for residents, most of whom are positive about the city. Pittsburghese is similarly well-regarded by its speakers. The unique dialect has received a good deal of attention since linguists began visiting the area in the 1930s.

However, the fact that increasing numbers of young people are going farther away for college has resulted in Philly and Pittsburgh accents and dialects being dropped (since of course college kids want to fit in and be understood). Huge surges in online/text communication do not speed that decline though (as is often thought), and in fact, additional exposure to multiple dialects means people don’t judge others’ speech as much as they used to. Both Labov and Carnegie Mellon University English/linguistics professor Barbara Johnstone rightfully point out that the Philly and Pittsburgh dialects are (like all language) constantly evolving.

* * * * * * * * * * * * * * *

I’ll leave you with a few extra Pennsylvania dialect delights:


  • “The Enduring Mystery of ‘Jawn,’ Philadelphia’s All-Purpose Noun”
    • “The word ‘jawn’ is unlike any other English word. In fact, according to the experts that I spoke to, it’s unlike any other word in any other language. It is an all-purpose noun, a stand-in for inanimate objects, abstract concepts, events, places, individual people, and groups of people. It is a completely acceptable statement in Philadelphia to ask someone to ‘remember to bring that jawn to the jawn.’”

Lastly and more generally, if you really want to know just how complicated the dialect situation is in North America, take a gander at this incredibly detailed map/site.


*Photo attributions: Yinzers In The Burgh Sign; Greetings from Pennsylvania; Buses speak #Pittsburghese now, too. “Need vaccinated.”

Read my lips: McGurk and speech perception


Think of a situation in which you might try to lip read. Would it be chatting with friends in a noisy bar? Or watching a muted TV? Maybe you were born (or became) deaf. In all of these scenarios, the channel for auditory input is severely obstructed or entirely absent. But what if I told you that lip reading isn’t limited to those contexts? The surprising truth is that we lip read, to some extent, during practically every spoken interaction where we can observe the speaker’s face.

This means that the visual information involved in speech perception and processing is way more crucial than we would imagine. For a demonstration, watch this video.[1]

As the video mentions, the phenomenon is called the McGurk effect. It’s named after one of the researchers who first discovered it – psychologists Harry McGurk and John MacDonald described the effect in their 1976 paper “Hearing Lips and Seeing Voices”. They found that when a person watches someone pronounce one sound, but hears the paired audio of another sound, they will perceive a third sound which lies somewhere in-between the visual and auditory cues.

More specifically, the paper details an experiment in which adults watched a video of a woman making the lip/mouth movements for ga, but the audio was dubbed with the syllable ba. Subjects reported hearing a third syllable, da. (When they watched the undubbed video, they correctly heard ga, and when they listened to the audio only, they correctly heard ba.)

How is da “in-between” ba and ga?

In phonetics, speech sounds, or phones (particularly consonants), are classified according to three main parameters: voicing, place of articulation, and manner of articulation. Voicing is whether your vocal folds vibrate when producing the sound – if they vibrate, the sound is voiced; if they don’t, it’s voiceless. Place of articulation deals with where your mouth (mostly tongue and lips) moves to constrict or obstruct the vocal tract, creating different consonants. Manner of articulation concerns the way in which air flow passes through or is blocked by the articulators (i.e. mouth/tongue/lips) when they’ve taken a given shape.

Identifying such parameters for the ba/da/ga example in the 1976 paper:

  • [Voicing] /b/, /d/, and /g/ are all VOICED consonants
    • (Try putting fingers to your Adam’s apple while saying ba, da, and ga, and you should feel the vibration of your vocal folds)
  • [Manner of articulation] /b/, /d/, and /g/ are all stops or plosives – meaning airflow is briefly blocked completely, in both the mouth and the nose
  • [Place of articulation]
    • /b/ is a bilabial – formed with both lips
    • /d/ is an alveolar – formed when the tongue tip touches the alveolar ridge (behind the upper front teeth)
    • /g/ is a velar – formed when the tongue body arches towards the velum or soft palate (upper back of the mouth)

Here’s a little sketch of the vocal tract to help visualize.

Articulators and places of articulation


We can see that the three consonants in ba, da, and ga share voicing and articulatory manner, and only differ in their place of articulation. Interestingly, the “intermediate” consonant that subjects perceived, /d/, falls almost exactly in-between /b/ (the audio consonant) and /g/ (the lip-read consonant) in its place of articulation – /b/ is formed at the front of the mouth, /g/ is formed at the back, and /d/ is formed in the middle. Our brain is combining the conflicting auditory and visual information, and perceiving a sound lying between the two cues. The combination isn’t always this exact, but the perceived phone will usually have some features in common with both sensory inputs.
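To make the "in-between" idea concrete, here is a minimal Python sketch that encodes the three consonants as feature bundles, using only the values from the list above. The front-to-back ordering of places is a simplified rank covering just the three places mentioned in this post, and the function names are my own inventions for illustration.

```python
# Places of articulation ordered from the front of the mouth to the back.
# Only the three places discussed in the post are listed; the positions are
# illustrative ranks, not measurements.
PLACES_FRONT_TO_BACK = ["bilabial", "alveolar", "velar"]

# Feature bundles for the three consonants in ba, da, and ga.
CONSONANTS = {
    "b": {"voicing": "voiced", "manner": "stop", "place": "bilabial"},
    "d": {"voicing": "voiced", "manner": "stop", "place": "alveolar"},
    "g": {"voicing": "voiced", "manner": "stop", "place": "velar"},
}


def differing_features(a, b):
    """Return the names of the features on which two consonants differ."""
    return [f for f in CONSONANTS[a] if CONSONANTS[a][f] != CONSONANTS[b][f]]


def place_rank(consonant):
    """Return the front-to-back rank of a consonant's place of articulation."""
    return PLACES_FRONT_TO_BACK.index(CONSONANTS[consonant]["place"])


def is_between(middle, first, second):
    """True if `middle` is articulated strictly between `first` and `second`."""
    lo, hi = sorted((place_rank(first), place_rank(second)))
    return lo < place_rank(middle) < hi


print(differing_features("b", "g"))  # ['place']
print(is_between("d", "b", "g"))     # True
```

The only feature separating /b/ from /g/ is place, and /d/ sits between them on that one dimension, which is the pattern the McGurk result shows.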

The McGurk effect is strong and, as far as we know, universal. It happens even when the voice and face are in different locations, or of different genders! All people tested, regardless of their language background, have demonstrated this speech integration behavior. However, the precise kinds of synthesis (as well as the effect’s intensity) vary by language. The phenomenon is more apparent in German, English, Spanish, and Turkish language listeners than in Chinese and Japanese participants, for example. The particular syllabic configuration of Japanese, and cultural norms (like face avoidance) may mean that they don’t rely on visual evidence of speech to the same extent. Loud environments heighten the effect for everyone though.

The overall robustness of the McGurk effect indicates just how automatically and unconsciously we combine auditory and visual stimuli when perceiving speech. A slew of related studies demonstrate that our brain really functions on constant multisensory (or multimodal) data, and is not limited to the sense we take as primary for a particular type of input.

So what other perceptual abilities might be a mix of stuff from several senses? If you have some guesses and are curious to know more, I recommend See What I’m Saying, by Lawrence D. Rosenblum. (In fact, he was the guy in the video above!) Like McGurk and MacDonald, he’s not quite a linguist, but we can forgive him.


[1] The video says you should have perceived the sound fa during the second clip of the man, but I actually get a kind of va / fa mix, which makes sense – va is intermediate in that it’s voiced like ba (the audio cue) but has a labiodental place of articulation, like fa (the visual cue).

*Photo attribution: page 241 Internal Ear

O Syntax Tree, O Syntax Tree!


Digital voice agents like Alexa, Siri, and Google Assistant are all the rage these days. But when we talk to our smart devices, are they actually “understanding” our speech in the same way that another human understands it? Take the command, “Find flights from Chicago to New York on February 21.” We can easily comprehend this sentence; our newborn brains were predisposed to acquire language, and we’ve been using it ever since.

Computers, on the other hand, cannot acquire language. They must be trained. In order to train them, computational linguists, other linguists, and engineers have broken language down into more manageable parts that can be tackled individually. Automatic speech recognition (ASR) deals with training machines to recognize speech (via acoustic properties, etc.), and convert that speech to text. Next, natural language processing (NLP) attempts to figure out what is meant by that text[1]. An NLP system itself is composed of multiple modules[2], one of which will likely be a syntactic parser.

Today we’re going to delve into the parser component. Let’s start with some syntactic basics!

Syntax is the set of rules and processes governing sentence structure in any natural language. It involves things like word order, and constituents (words or phrases that form functional units). One of the most common ways to represent syntactic information (at least as of the 20th century) is with a syntax tree. Traditional syntax trees specify:

  • The words of a phrase/sentence
  • Part of speech for each word, usually abbreviated
    • N (noun); V (verb); P (preposition); D or DET (determiner, a.k.a. article); A (adjective); etc.
  • Larger phrases, also abbreviated
    • S (sentence); NP (noun phrase); VP (verb phrase); etc.
  • Relationships between all of the words and phrases
    • These are hierarchical relationships that show how constituents combine into larger ones (or split into smaller ones, if starting from the opposite end of the tree)

Here’s a tree diagram (specifically, a constituency tree) for the sentence, “My parakeet drinks mimosas in the morning”:


You can see that my parakeet forms a larger chunk which is a noun phrase, in the morning forms a larger chunk which is a prepositional phrase, drinks mimosas in the morning forms an even larger chunk which is a verb phrase, and both the NP and VP combine to form the largest chunk, a full sentence S. Remember that syntax focuses on phrasal order and structure, not meaning or context – so it can’t tell us why on earth you’re feeding boozy orange juice to your pet bird.

Onto the parsing! Very generally, a parser is a piece of software (often a trained machine learning model) that takes input text, and outputs a parse tree or similar structural representation, based on syntactic rules and statistics learned from its training data.

Many syntactic parsers (constituency parsers in particular) rely on a Context-Free Grammar, which has:

  1. A set of non-terminal symbols – abbreviations for language constituents (lexical parts of speech and phrasal types):

{S, NP, VP, PP, D, N, A…}

  2. A set of terminal symbols – words of the phrase/sentence:

{drinks, parakeet, mimosas, morning, my, in, the}

  3. A set of rules like:

S → NP VP  (a sentence S is composed of a noun phrase NP and verb phrase VP)

NP → D N  (a noun phrase NP is composed of a determiner D and a noun N)

VP → VP PP  (etc.)

  4. A start symbol: S

The parser starts at S, and applies its rules successively, until it arrives at the terminal symbols. The resulting parse is the labeled relationships connecting those terminals (i.e. words).
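For the curious, here is a small Python sketch of a grammar like the one above applied to the parakeet sentence. A few caveats: the rules VP → V NP and PP → P NP are my additions to cover the example, and listing mimosas directly as an NP is a shortcut. The sketch also works bottom-up (the CYK chart algorithm) rather than top-down from S as described above, because a left-recursive rule like VP → VP PP would send a naive top-down parser into an endless loop.

```python
from collections import defaultdict

# Binary rules (parent, left child, right child). The first three appear in
# the post; VP -> V NP and PP -> P NP are assumptions added for the example.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "D", "N"),
    ("VP", "VP", "PP"),
    ("VP", "V", "NP"),
    ("PP", "P", "NP"),
]

# Lexical rules (word -> categories). Treating "mimosas" as an NP directly
# is a shortcut that avoids a unary NP -> N rule.
LEXICON = {
    "my": ["D"],
    "the": ["D"],
    "parakeet": ["N"],
    "morning": ["N"],
    "mimosas": ["N", "NP"],
    "drinks": ["V"],
    "in": ["P"],
}


def cyk_parse(words, start="S"):
    """Return one parse tree as nested tuples, or None if there is no parse."""
    n = len(words)
    # chart[(i, j)] maps a category to one tree spanning words[i:j].
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1)][category] = (category, word)
    # Build longer spans out of pairs of adjacent shorter spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        tree = (parent, chart[(i, k)][left], chart[(k, j)][right])
                        chart[(i, j)].setdefault(parent, tree)
    return chart[(0, n)].get(start)


tree = cyk_parse("my parakeet drinks mimosas in the morning".split())
print(tree[0])  # S
print(tree[1])  # ('NP', ('D', 'my'), ('N', 'parakeet'))
```

A scrambled word order such as "parakeet my drinks" returns None, since no combination of rules covers it.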

There are two main kinds of syntactic parsers: dependency and constituency. To keep this post to a reasonable length, I’ll focus on dependency only, but constituency parsers output structures similar to the parakeet tree above[3]. A dependency parser builds a tree for each input sentence by starting with a sentence root (usually the main verb), and assigning a head word to each word, until it gets to the end of the sentence. (Heads link to dependents.) When it’s done, each word has at least one branch, or relationship, with another word. The parser also characterizes each word-word relationship. These are things like: nominal subject of a verb (“nsubj”); object of a verb or a preposition (“dobj” and “pobj,” respectively); conjunction (“cc” for the conjunction word, and “conj” for the elements being conjoined); determiner (“det”); and adverbial modifier (“advmod”).

A visualized example will probably help. Taking that same sentence, “My parakeet drinks mimosas in the morning,” a visualization of the dependency parse might look like this:


Can you spot the root, or main verb? It’s the one without any arrows going towards it: drinks. The parser then finds the subject of drinks, which is parakeet, and labels that relationship “nsubj.” It finds mimosas as the direct object of drinks, and labels it “dobj.” And so on and so forth.
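If it helps to see the same information as data, here is a hand-written Python sketch of that parse (not actual parser output). Each word stores the position of its head plus a relation label. The "poss" and "prep" labels and the choice to have the root point at itself are assumptions on my part, borrowed from common labeling schemes.

```python
# Each token: (word, index of its head word, dependency label).
# The root points to itself, a convention chosen for this sketch.
PARSE = [
    ("My", 1, "poss"),
    ("parakeet", 2, "nsubj"),
    ("drinks", 2, "ROOT"),
    ("mimosas", 2, "dobj"),
    ("in", 2, "prep"),
    ("the", 6, "det"),
    ("morning", 4, "pobj"),
]


def find_root(parse):
    """Return the word that is its own head, i.e. has no incoming arrow."""
    for index, (word, head, _label) in enumerate(parse):
        if head == index:
            return word
    return None


def dependents_of(parse, head_word):
    """Return (word, label) pairs for every word headed by `head_word`."""
    head_index = [word for word, _, _ in parse].index(head_word)
    return [
        (word, label)
        for index, (word, head, label) in enumerate(parse)
        if head == head_index and index != head_index
    ]


print(find_root(PARSE))                # drinks
print(dependents_of(PARSE, "drinks"))  # [('parakeet', 'nsubj'), ('mimosas', 'dobj'), ('in', 'prep')]
```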

Let’s look at another example, for a dollop of variety. Here is “Mr. Vanderloop had smiled and said hello”:


In this one, the past participle smiled is the root/main verb, which has multiple dependents: its subject Vanderloop, its auxiliary (a.k.a. “helping verb”) had, its conjunction and, and the other verb with which it conjoins, said. The subject Vanderloop has a dependent Mr., with which it forms a compound (proper) noun; said’s dependent is the interjection hello.

How about our sentence from the beginning, “Find flights from Chicago to New York on February 21”? How might it be parsed? (You can check your hypotheses by typing the sentence into an interactive demo of the displaCy dependency visualizer, from which the visualizations above also came[4].) Something to keep in mind here is that English imperative structure leaves the subject – whoever is being addressed – implicit.

A slight aside: I’ve chosen simple examples for demonstration, but parsing gets decidedly complicated when input sentences are themselves complicated. Questions, subordinate clauses, coordination (or all three: “What’s the name of the movie where the guy drives a flying taxi and saves the human race from aliens?”), and structurally ambiguous sentences (“The horse raced past the barn fell”) get tricky quickly.

So now we have some parsed output. How is this structured, annotated data useful? Well, one thing you can do with these word relations is identify noun phrases. Identifying noun phrases across sentences helps with another step in the NLP pipeline called Named Entity Recognition, or NER. NER tries to recognize nouns/noun phrases (names, places, dates, etc.) and label them with categories of concepts from the real world. In our flights example, “Chicago” and “New York” should get tagged with some label like CITY or GEOGRAPHIC LOCALE, and “February 21” should get tagged with DATE. Once a text has been automatically annotated for such named entities, information about those entities can then be pulled from a knowledge base (say, Wikipedia).
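As a toy illustration of the labeling step, here is a Python sketch that tags the flights sentence using a hard-coded list of cities and a simple date pattern. Both are hypothetical stand-ins: real NER components are typically trained statistical models, not lookup lists, so treat this only as a picture of the input and output.

```python
import re

# Hypothetical gazetteer; a real system would not rely on a fixed list.
CITIES = ["New York", "Chicago"]

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)
# Matches a month name followed by a one- or two-digit day, e.g. "February 21".
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}\b")


def tag_entities(text):
    """Return (entity text, label) pairs in left-to-right order."""
    found = []
    for city in CITIES:
        for match in re.finditer(re.escape(city), text):
            found.append((match.start(), match.group(), "CITY"))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    # Sorting by start position restores sentence order.
    return [(entity, label) for _, entity, label in sorted(found)]


print(tag_entities("Find flights from Chicago to New York on February 21"))
# [('Chicago', 'CITY'), ('New York', 'CITY'), ('February 21', 'DATE')]
```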

Having parts of speech and word relations also makes it easier to match up the specifics of a given user command (e.g. “Text mom saying I’ll call tonight,” or “Show popular Thai restaurants near me”) with slightly more generalized intents (e.g. Send text or Get restaurants); machine models can start learning how words typically pattern across the main verb and direct object positions for various commands. Code then uses the more generalized intent to fulfill that request on a device – be it smartphone, tablet, or home speaker. “Find flights from Chicago to New York on February 21” would hopefully be matched with a more general Get flights intent, and the particular noun phrases could be passed to fields for origin, destination, and date.
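Here is a rough Python sketch of that last matching step. It assumes the main verb and direct object have already been pulled out of a dependency parse (they are passed in by hand below), and it uses a hypothetical lookup table and a regular expression for the slots. Actual assistants learn these mappings from data; the intent names and field names are made up for illustration.

```python
import re

# Hypothetical mapping from (main verb, direct object) to an intent name.
INTENTS = {
    ("find", "flights"): "get_flights",
    ("text", "mom"): "send_text",
    ("show", "restaurants"): "get_restaurants",
}

# Slot pattern for flight requests of the form "from X to Y on Z".
FLIGHT_SLOTS = re.compile(
    r"from (?P<origin>.+?) to (?P<destination>.+?) on (?P<date>.+)$"
)


def match_intent(main_verb, direct_object):
    """Look up the generalized intent for a verb/object pair, if known."""
    return INTENTS.get((main_verb.lower(), direct_object.lower()))


def fill_flight_slots(command):
    """Extract origin, destination, and date fields from a flight command."""
    match = FLIGHT_SLOTS.search(command)
    return match.groupdict() if match else {}


command = "Find flights from Chicago to New York on February 21"
print(match_intent("Find", "flights"))  # get_flights
print(fill_flight_slots(command))
# {'origin': 'Chicago', 'destination': 'New York', 'date': 'February 21'}
```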

* * * * *

Before leaving you to your holiday leftovers, I’d like to reiterate that syntactic parsing is only one step in an NLP system. Its parses don’t tell us much about the actual semantics of the linguistic input. Language meaning, however, is a whole other ball of wax, best left for the new year…


[1] There is often terminological confusion between NLP and NLU (natural language understanding). See this graphic for one common breakdown, although I’ve heard the terms used interchangeably as well.

[2] If you’re interested to learn about other NLP steps, read this accessible post, Natural Language Processing is Fun!

[3] You can also play around with this interactive demo from Stanford CoreNLP. In the second “Annotations” field dropdown, make sure you have “constituency parse” selected.

[4] The visualizer is from the creators of spaCy, an awesome open-source NLP library in Python; a dependency parser is one of its components.

Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin…


Happy Halloween!

In tribute to the holiday (my favorite), here’s a smidge of spooky linguistics. Ever play that game where you repeat some word so many times it starts to lose its meaning? That’s actually a thing, called semantic satiation.

Semantic satiation is a psychological phenomenon where a person temporarily loses the meaning of a repeated word and perceives only nonsensical sounds. It can happen via reading as well as at the verbal/aural level. The term was coined by psychology professor Leon Jakobovits James in his 1962 dissertation[1]. His and later research shows that word repetition repeatedly triggers peripheral sensorimotor and central neural activity in the cortex (activity corresponding with the meaning of a particular expression), which in turn causes reactive inhibition. Consequently, the strength of activity for each new repetition diminishes. More recent semantic verification studies have confirmed that this satiation legitimately falls under semantic memory, and is not just a byproduct of exhaustion of pre-semantic sensory/perceptual processes (for example, the acoustic system).[2]

That’s all well and good, but how is it spooky? In answer I say, see this movie – Pontypool.

In case you don’t have time to view it right this minute, my non-spoiler summary:

“Pontypool” is a low-budget psychological thriller that I found delightfully horrifying. The premise revolves around a virus that is infecting people in a small, remote Canadian town. The virus spreads…through language. When people hear an infected word, they begin repeating it until their entire speech grows garbled, they turn insane and zombie-like, and finally start attacking others in gory fashion.

I highly recommend leaving the office early today to go watch the film. Or to go trick-or-treating. Or something. Since everyone knows that all work and no play…




[1] “Effects of repeated stimulation on cognitive aspects of behavior: some experiments on the phenomenon of semantic satiation”
[2] See “Introduction – Recent Studies” in “On the Locus of the Semantic Satiation Effect: Evidence from Event-Related Brain Potentials”

*Photo attributions: The Shining painting; Droste effect tv