Frame Semantics and FrameNet

FN image

I’d like to discuss a theory in cognitive linguistics which is very near to my heart[1]: frame semantics. I’ll also present FrameNet, a database built using frame semantic theory, which has been and continues to be an excellent resource in the fields of natural language processing (NLP) and machine learning (ML).

Why is frame semantics cool? Why should you want to learn about it? Just this: the theory is an intuitive and comprehensive way to categorize the meaning of any scenario you could possibly dream up and express via language. Unlike many other semantic and syntactic theories, the core concepts are quickly understandable to the non-linguist. What’s more, frame semantics can apply to language meaning at many different levels (from the tiniest morpheme to entire swaths of discourse), and it works equally well for any particular language – be it English, Mandarin, Arabic, or Xhosa. I’ll try to demonstrate the theory’s accessibility and applicability with some details.

American linguist Charles Fillmore developed the frame semantics research program in the 1980s, using the central idea of a frame: a cognitive scene or situation which is based on a person’s prototypical understanding of real-world (social, cultural, biological) experiences. A frame is ‘evoked’ by language – this can be a single word (called a lexical unit), a clause, a sentence, or even longer discourse. Each frame contains various participants and props, called frame elements (FEs). If you’ve studied syntax/semantics (the generative grammar kind), FEs are somewhat analogous to traditional theta roles.

FrameNet is a corpus-based lexicographic and relational database (sort of a complex dictionary) of English frames, the lexical units evoking them, annotated sentences containing those lexical units, and a hierarchy of frame-to-frame relations. It was built and continues to grow at the International Computer Science Institute (ICSI), a nonprofit research center affiliated with UC Berkeley. FrameNets have also been developed in other languages, such as Spanish, Brazilian Portuguese, Japanese, Swedish, French, Chinese, Italian, and Hebrew.

Each frame entry includes a definition, example sentences, frame elements, lexical units, and annotation that illustrates the various fillers (words) of the FEs as well as their syntactic patterns. Let’s unpack all of this!

We’ll take a look at the Motion frame in FrameNet. Some screenshots of the frame entry follow.

framenet_motion1

The Motion frame is first defined. Its definition includes the frame elements that belong to the frame (the text with color highlighting):

“Some entity (Theme) starts out in one place (Source) and ends up in some other place (Goal), having covered some space between the two (Path). Alternatively, the Area or Direction in which the Theme moves or the Distance of the movement may be mentioned.”

After the definition come example sentences, featuring lexical units that evoke the frame (the black-backgrounded text) such as move, drift, float, roll, go.

Further down is the list of frame elements with their definitions and examples.

framenet_motion2

Here, the Theme FE is “the entity that changes location,” while the Goal FE is “the location the Theme ends up in.” For language to evoke the Motion frame, it needs a frame-evoking lexical unit plus words or phrases that instantiate FEs such as the Theme or the Goal; a given sentence need not express every FE listed. In the examples above, “me” is a Theme in “The explosion made [me] MOVE in a hurry,” and “into the slow lane” is a Goal in “The car MOVED [into the slow lane].”

At the bottom of the entry is a list of lexical units that belong to or evoke the frame, as well as links to annotation of sentences from real data that contain those words.

framenet_motion3

Verbs like come, glide, roll, travel, and zigzag all evoke, quite sensibly, the Motion frame.
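If it helps to see the pieces of a frame entry side by side, here is a minimal sketch in Python. The `Frame` class and its field names are my own invention for illustration, not FrameNet's actual schema or file format; the definitions and lexical units are the ones quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Frame:
    """Toy stand-in for a frame entry (hypothetical, not FrameNet's schema)."""
    name: str
    definition: str
    # Frame element name -> its definition
    frame_elements: Dict[str, str] = field(default_factory=dict)
    # Words (lexical units) that evoke the frame
    lexical_units: Set[str] = field(default_factory=set)

    def evoked_by(self, word: str) -> bool:
        """True if the given word is listed as a lexical unit of this frame."""
        return word.lower() in self.lexical_units


motion = Frame(
    name="Motion",
    definition=(
        "Some entity (Theme) starts out in one place (Source) and ends up "
        "in some other place (Goal), having covered some space between "
        "the two (Path)."
    ),
    frame_elements={
        "Theme": "the entity that changes location",
        "Goal": "the location the Theme ends up in",
    },
    lexical_units={"move", "drift", "float", "roll", "go",
                   "come", "glide", "travel", "zigzag"},
)

print(motion.evoked_by("glide"))   # True
print(motion.evoked_by("purr"))    # False
```

The real database is far richer (parts of speech, frame-to-frame relations, annotated sentences), so treat this as a mental model rather than a faithful copy.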

Once you click on the “Annotation” link for a particular lexical item, you’re taken to a page that looks like this:

framenet_motion4

Natural language sentences pulled from online corpora (texts from newspapers, magazines, books, TV transcripts, scholarly articles, etc.) are annotated for their Motion FEs. Annotation for the lexical item glide gives us an idea of the types of “entities” (the purple-backgrounded text, or Theme FEs) that “change location” (i.e. that glide) – boats, pink clouds, men, cars, planes, gondolas, and so on.

* * * * *

After this mini FrameNet dive, you may be wondering how the database is used in a concrete sense. To illustrate, let’s compare two sentences:

  1. The boat GLIDED into the harbor.
  2. The dinghy DRIFTED away from the harbor.

The entities differ (boat vs. dinghy), the verbs differ (glide vs. drift) and the prepositions differ (into vs. [away] from). Yet at a higher level, both of these sentences describe a Theme which “changes location” – either moving towards a Goal in (1), or from a Source in (2). They both indicate motion. Because FrameNet helps machines “learn” that sentences with a variety of nouns, verbs, prepositions, and syntactic patterns can basically point to the same scenario, it’s a useful tool for many applications in the computational realm.
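To make that concrete, here is a rough sketch of how the two sentences might look once annotated, and how a program could group them by frame. The `Annotation` structure and `group_by_frame` helper are hypothetical names I made up for this example; FrameNet's own annotation format is different and more detailed.

```python
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    """One hand-annotated sentence (illustrative structure only)."""
    text: str
    frame: str                 # name of the evoked frame
    target: str                # the frame-evoking word
    elements: Dict[str, str]   # frame element name -> filler phrase


annotations = [
    Annotation(
        text="The boat glided into the harbor.",
        frame="Motion",
        target="glided",
        elements={"Theme": "The boat", "Goal": "into the harbor"},
    ),
    Annotation(
        text="The dinghy drifted away from the harbor.",
        frame="Motion",
        target="drifted",
        elements={"Theme": "The dinghy", "Source": "away from the harbor"},
    ),
]


def group_by_frame(items: List[Annotation]) -> Dict[str, List[Annotation]]:
    """Collect annotations under the frame they evoke."""
    grouped: Dict[str, List[Annotation]] = {}
    for item in items:
        grouped.setdefault(item.frame, []).append(item)
    return grouped


by_frame = group_by_frame(annotations)
themes = [a.elements["Theme"] for a in by_frame["Motion"]]
print(themes)  # ['The boat', 'The dinghy']
```

Different nouns, verbs, and prepositions on the surface, but both sentences land in the same Motion bucket with a Theme filled in, which is the kind of generalization a model trained on such data is meant to pick up.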

These days computers do all kinds of language-y things for us: answer questions, paraphrase texts, extract relevant information from text (and then maybe organize it thematically – for instance, around people, places, or events), and even generate new texts. These feats require that a computer parse natural language into accurate semantic chunks. FrameNet’s semantically and syntactically annotated data can be used as training input for machine models that “learn” how to analyze such meaning chunks, enabling our electronic devices to respond, paraphrase, or extract information appropriately.

To peruse a (very long) list of the projects which have used FrameNet data (organized by requester/researcher), check out the FrameNet Downloaders page.

So – on the off-chance that you find yourself stuck at home and bored out of your mind (?!?!)… you might perhaps enjoy a little investigation of frame-semantic characterization of scenes that involve applying heat, intoxication, or temporal collocation. 🙂

 

[1] Why am I so fond of frame semantics? A terrific professor of mine during grad school introduced the theory, and it resonated with me immediately. I used it in my master’s thesis, then presented the paper at the International Conference on Construction Grammar in 2014. Eventually, I had the privilege of working at FrameNet, where I came to know the brilliant lexicographers/semanticists/cognitive linguists who have dedicated decades of their lives to the theory and the project. Sadly, I never met the legendary Chuck Fillmore, as he passed away the year before I joined the FrameNet team.

Semantics 101 for Caterpillar Inc.

catvscatvscat

It seems that the world’s largest manufacturer of construction equipment, Caterpillar Inc., is in serious need of a basic semantics lesson. I came across this article a couple days ago:

“Santa Cruz coffee shop with ‘cat’ in its name hit with cease and desist from Caterpillar Inc.”

Beyond the ridiculousness of a giant corporation going after a tiny local café, what struck me as even more absurd was the following:

  1. Even if the trademarked ‘CAT’ of Caterpillar Inc. were an oft-used clipping (shortening) of the full word ‘caterpillar’ (and so indicated that wriggling, butterfly-metamorphosing insect), it would not be the same word as the ‘cat’ of the café’s name – “Cat and Cloud Coffee” – which refers to the common feline house pet. These would be homonyms – words which are spelled alike, but have different meanings.[1]
  2. As it is, no one ever calls the aforementioned insect a ‘cat’ (not that I’ve heard, anyway). So the trademarked term is something else entirely. It has its own unique sense, which can in fact refer to at least two related things: (a) a particular machine produced by the company, or (b) the company itself. Obviously, neither of these is that purring, internet-beloved animal either. They are yet another set of homonyms.

Totally different words. Totally different senses. The news piece doesn’t say this explicitly, but most people possess an intuitive understanding, as evidenced by quotes from café customers:

“’I don’t think anyone correlates the Caterpillar company with their big yellow massive trucks with a small café,’ said Rick Tawfik, of San Jose. ‘I mean, I never thought about Cat and Cloud and Caterpillar in the same sentence until we heard about this lawsuit.’

‘I don’t think they have a legitimate case,’ added Emma Davis, of San Jose. ‘I don’t think I would ever confuse the two of them. It doesn’t make sense to me.’”

Caterpillar’s trademark lawyers apparently lack such common sense, or are (more likely) willfully ignoring it.

 

[1] Etymologically, hundreds of years ago, the terms could have been related, in that (according to the Oxford English Dictionary) the Middle English word for ‘caterpillar’ catyrpel may have derived from the Old French chatepelose (literally “hairy or downy cat”)…but enough time has elapsed between now and the 11th century that it’s not reasonable to claim a modern meaning connection. Does anyone you know think of caterpillars as “hairy cats”?

*Photo attributions: CAT excavator; Caterpillar; Pet cat

Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin Pumpkin…

all-work-and-no-play-makes-jack-a-dull-boy-florian-rodarte

Happy Halloween!

In tribute to the holiday (my favorite), here’s a smidge of spooky linguistics. Ever play that game where you repeat some word so many times it starts to lose its meaning? That’s actually a thing, called semantic satiation.

Semantic satiation is a psychological phenomenon where a person temporarily loses the meaning of a repeated word and perceives only nonsensical sounds. It can happen via reading as well as at the verbal/aural level. The term was coined by psychology professor Leon Jakobovits James in his 1962 dissertation[1]. His and later research shows that word repetition repeatedly triggers peripheral sensorimotor and central neural activity in the cortex (activity corresponding with the meaning of a particular expression), which in turn causes reactive inhibition. Consequently, the strength of activity for each new repetition diminishes. More recent semantic verification studies have confirmed that this satiation legitimately falls under semantic memory, and is not just a byproduct of exhaustion of pre-semantic sensory/perceptual processes (for example, the acoustic system).[2]

That’s all well and good, but how is it spooky? In answer I say, see this movie – Pontypool.

In case you don’t have time to view it right this minute, my non-spoiler summary:

“Pontypool” is a low-budget psychological thriller that I found delightfully horrifying. The premise revolves around a virus that is infecting people in a small, remote Canadian town. The virus spreads…through language. When people hear an infected word, they begin repeating it until their entire speech grows garbled, they turn insane and zombie-like, and finally start attacking others in gory fashion.

I highly recommend leaving the office early today to go watch the film. Or to go trick-or-treating. Or something. Since everyone knows that all work and no play…

 

drost_tv

 

[1] “Effects of repeated stimulation on cognitive aspects of behavior: some experiments on the phenomenon of semantic satiation”
[2] See “Introduction – Recent Studies” in “On the Locus of the Semantic Satiation Effect: Evidence from Event-Related Brain Potentials”

*Photo attributions: The Shining painting; Droste effect tv

Voynich: The manuscript that keeps on giving

full

The Voynich manuscript is one of those marvels that, even in these times of boundless knowledge and incredible technology, eludes continual efforts to understand it.

Not heard of the thing? Welcome to the show. There has been a vigorous little dance of press coverage over the past couple years. It goes something like this:

Step to your left.  “An eternal mystery.”
Step to your right.  “I’ve cracked the code!” – some dude
Step back.  “Nope, you’re full of shit.”
Step forward.  “We’ve solved it this time for sure.” – some other dudes
Repeat.

The manuscript is a hand-written, illustrated codex that’s been shown through carbon dating to have originated in the early fifteenth century (1404–1438). The writing system used throughout its approximately 240 pages has yet to be identified.[1] Cryptographers, historians, computer scientists and others have proposed numerous hypotheses over the decades, including that it’s a hoax. Based on the illustrations, scholars divide the manuscript into five thematic sections: Herbal, Astrological, Biological, Pharmacological, and Recipes.

Below I list links to the (more recent) rhythmic pulse of “discoveries” and rejections, in chronological order. Under each link I’ve pulled out quotes of the more intriguing tidbits.

* * * * *

November 30, 2016: https://www.newyorker.com/books/page-turner/the-unsolvable-mysteries-of-the-voynich-manuscript

“The first half of the book is filled with drawings of plants; scholars call this the “herbal” section. None of the plants appear to be real, although they are made from the usual stuff (green leaves, roots, and so on […]). The next section contains circular diagrams of the kind often found in medieval zodiacal texts; scholars call this part “astrological,” which is generous. Next, the so-called “balneological” section shows “nude ladies,” in Clemens’s words, in pools of liquid, which are connected to one another via a strange system of tubular plumbing that often snakes around whole pages of text. […] Then we get what appear to be instructions in the practical use of those plants from the beginning of the book, followed by pages that look roughly like recipes.”

“The Voynich MS was an early attempt to construct an artificial or universal language of the a priori type.   –Friedman.”

* * * * *

September 8, 2017: https://arstechnica.com/science/2017/09/the-mysterious-voynich-manuscript-has-finally-been-decoded/

“Now, history researcher and television writer Nicholas Gibbs appears to have cracked the code, discovering that the book is actually a guide to women’s health that’s mostly plagiarized from other guides of the era.”

“Gibbs realized he was seeing a common form of medieval Latin abbreviations, often used in medical treatises about herbs. ‘From the herbarium incorporated into the Voynich manuscript, a standard pattern of abbreviations and ligatures emerged from each plant entry,’ he wrote. ‘The abbreviations correspond to the standard pattern of words used in the Herbarium Apuleius Platonicus – aq = aqua (water), dq = decoque / decoctio (decoction), con = confundo (mix), ris = radacis / radix (root), s aiij = seminis ana iij (3 grains each), etc.’ So this wasn’t a code at all; it was just shorthand. The text would have been very familiar to anyone at the time who was interested in medicine.”

“Gibbs concluded that it’s likely the Voynich Manuscript was a customized book, possibly created for one person, devoted mostly to women’s medicine.”

* * * * *

September 10, 2017: https://www.theatlantic.com/science/archive/2017/09/has-the-voynich-manuscript-really-been-solved/539310/

“This week, the venerable Times Literary Supplement published as its cover story a ‘solution’ for the Voynich manuscript. The article by Nicholas Gibbs suggests the manuscript is a medieval women’s-health manual copied from several older sources. And the cipher is no cipher at all, but simply abbreviations that, once decoded, turn out to be medicinal recipes.”

“’Frankly I’m a little surprised the TLS published it,’ says Lisa Fagin Davis, executive director of the Medieval Academy of America. When she was a doctoral student at Yale—whose Beinecke Library holds the Voynich manuscript—Davis read dozens of theories as part of her job. ‘If they had simply sent to it to the Beinecke Library, they would have rebutted it in a heartbeat,’ she says.”

“In the second part—only two paragraphs long—Gibbs gets into the meat of his solution: Each character in the manuscript is an abbreviated word, not a letter. This could be a breakthrough, but the TLS presents only two lines decoded using Gibbs’s method. Davis did not find those two lines convincing either. ‘They’re not grammatically correct. It doesn’t result in Latin that makes sense,’ she says.”

* * * * *

February 1, 2018: https://www.atlasobscura.com/articles/voynich-manuscript-artificial-intelligence-solved

“There are two problems with this notoriously difficult puzzle—it’s written in code, and no one knows what language that code enciphers.”

“’That was surprising,’ Kondrak said, in a statement. ‘And just saying “this is Hebrew” is the first step. The next step is how do we decipher it.’ The scientists think the code used in the manuscript might have been created using alphagrams. (In standard alphagrams, the letters in a word are placed in alphabetical order—the alphagram of ‘alphagram,’ for example, is ‘aaaghlmpr.’) Vowels also seemed to have been dropped. These assumptions made, they tried to come up with an algorithm to decipher this scrambled Hebrew text, to striking effect. ‘It turned out that over 80 percent of the words were in a Hebrew dictionary,’ said Kondrak.”

“Hebrew-speaking data scientist Shlomo Argamon offered some excoriating feedback. ‘They are saying it looks more like Hebrew than other languages,’ he said. ‘In my opinion, that’s not necessarily saying all that much.’ The use of Google Translate, too, struck him as somewhat unscientific. […] Other scholars have raised doubts about the scientists’ use of modern, rather than medieval, Hebrew.”
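
For the curious, the two scrambling steps described in the quote above (alphabetize the letters, drop the vowels) are easy to sketch. This is only a toy illustration of the encoding direction using English words and the Latin alphabet; it is not the researchers' algorithm, which had to work backwards from scrambled text to candidate Hebrew words. The function names are mine.

```python
def alphagram(word: str) -> str:
    """Return the letters of a word in alphabetical order."""
    return "".join(sorted(word.lower()))


def drop_vowels(word: str, vowels: str = "aeiou") -> str:
    """Remove vowels (assumed here to be the English a, e, i, o, u)."""
    return "".join(ch for ch in word if ch.lower() not in vowels)


def scramble(word: str) -> str:
    """Drop the vowels, then alphabetize what is left."""
    return alphagram(drop_vowels(word))


print(alphagram("alphagram"))   # aaaghlmpr
print(scramble("manuscript"))   # cmnprst
```

Note how lossy this is: many different words collapse to the same scrambled string, which is part of why a dictionary match rate on its own is weak evidence.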

* * * * *

Certain researchers have made a compelling case against the “hoax” hypothesis, in any event. In 2013, an interesting paper analyzed the Voynich manuscript from an information theory perspective. The authors looked at the organizational structure resulting from word distribution over the entire text, and concluded that there was “presence of a genuine linguistic structure”.[2] You can read the full paper here.

A couple information theory takeaways:

  1. Highly informative content words occur much more irregularly (and in clusters) throughout a text, while more uninformative function words tend to have a more homogeneous or uniform distribution. So it’s the content words that indicate specific text sections.
  2. Words that are semantically related tend to co-occur in the same sections of a text.
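
The first takeaway can be illustrated with a small sketch: split a text into equal sections and measure how evenly a word's occurrences are spread across them, using Shannon entropy. This is a simplified stand-in of my own, not the measure used in the paper, and the tiny "text" below is made up.

```python
import math
from typing import List


def section_entropy(tokens: List[str], word: str, n_sections: int) -> float:
    """Shannon entropy (in bits) of a word's counts across equal-sized sections.

    High entropy: the word is spread evenly (typical of function words).
    Low entropy: the word clusters in few sections (typical of content words).
    """
    size = math.ceil(len(tokens) / n_sections)
    counts = [tokens[i * size:(i + 1) * size].count(word)
              for i in range(n_sections)]
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts if c)


# Made-up two-section text: an "herbal" half and an "astrological" half.
tokens = ("the root of the herb " * 3 + "the star in the sky " * 3).split()

print(section_entropy(tokens, "the", 2))   # 1.0  (evenly spread)
print(section_entropy(tokens, "herb", 2))  # 0.0  (confined to one section)
```

With two sections the maximum possible value is 1 bit, so “the” is as uniform as it can be while “herb” is fully concentrated in one half.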

 

Who will claim to have cracked the code next? My personal opinion, of course, is that they should throw some linguists on it.

 

[1] https://en.wikipedia.org/wiki/Voynich_manuscript

[2] Montemurro MA, Zanette DH. (2013). Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis. PLoS ONE 8(6): e66344, 5. https://doi.org/10.1371/journal.pone.0066344