O Syntax Tree, O Syntax Tree!

[Image: how_lovely]

Digital voice agents like Alexa, Siri, and Google Assistant are all the rage these days. But when we talk to our smart devices, are they actually “understanding” our speech in the same way that another human understands it? Take the command, “Find flights from Chicago to New York on February 21.” We can easily comprehend this sentence; our newborn brains were predisposed to acquire language, and we’ve been using it ever since.

Computers, on the other hand, cannot acquire language. They must be trained. In order to train them, computational linguists, other linguists, and engineers have broken language down into more manageable parts that can be tackled individually. Automatic speech recognition (ASR) deals with training machines to recognize speech (via acoustic properties, etc.), and convert that speech to text. Next, natural language processing (NLP) attempts to figure out what is meant by that text[1]. An NLP system itself is composed of multiple modules[2], one of which will likely be a syntactic parser.

Today we’re going to delve into the parser component. Let’s start with some syntactic basics!

Syntax is the set of rules and processes governing sentence structure in any natural language. It involves things like word order, and constituents (words or phrases that form functional units). One of the most common ways to represent syntactic information (at least as of the 20th century) is with a syntax tree. Traditional syntax trees specify:

  • The words of a phrase/sentence
  • Part of speech for each word, usually abbreviated
    • N (noun); V (verb); P (preposition); D or DET (determiner, a.k.a. article); A (adjective); etc.
  • Larger phrases, also abbreviated
    • S (sentence); NP (noun phrase); VP (verb phrase); etc.
  • Relationships between all of the words and phrases
    • These are hierarchical relationships that show how constituents combine into larger ones (or split into smaller ones, if starting from the opposite end of the tree)

Here’s a tree diagram (specifically, a constituency tree) for the sentence, “My parakeet drinks mimosas in the morning”:

[Image: constituency tree for “My parakeet drinks mimosas in the morning”]

You can see that my parakeet forms a larger chunk which is a noun phrase, in the morning forms a larger chunk which is a prepositional phrase, drinks mimosas in the morning forms an even larger chunk which is a verb phrase, and both the NP and VP combine to form the largest chunk, a full sentence S. Remember that syntax focuses on phrasal order and structure, not meaning or context – so it can’t tell us why on earth you’re feeding boozy orange juice to your pet bird.
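If you’d like to poke at this structure in code, the same tree can be written as a bracketed string. Here’s a minimal sketch using NLTK’s Tree class; the bracketing is my hand-written rendering of the parakeet tree, not the output of an actual parser.

```python
# Represent the constituency tree above as a bracketed string with NLTK.
from nltk import Tree

parse = Tree.fromstring("""
(S
  (NP (D My) (N parakeet))
  (VP
    (VP (V drinks) (NP (N mimosas)))
    (PP (P in) (NP (D the) (N morning)))))
""")

parse.pretty_print()            # draws the tree as ASCII art
print(parse.label())            # 'S', the root constituent
for np in parse.subtrees(lambda t: t.label() == "NP"):
    print(np.leaves())          # the words inside each noun phrase
```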

Onto the parsing! Very generally, a parser is a piece of software (often a trained machine learning model) that takes input text, and outputs a parse tree or similar structural representation, based on syntactic rules and statistics learned from its training data.

Syntactic parsers include a component called a context-free grammar (CFG), which has:

  1. A set of non-terminal symbols – abbreviations for language constituents (lexical parts of speech and phrasal types):

{S, NP, VP, PP, D, N, A…}

  2. A set of terminal symbols – words of the phrase/sentence:

{drinks, parakeet, mimosas, morning, my, in, the}

  3. A set of rules like:

S → NP VP  (a sentence S is composed of a noun phrase NP and verb phrase VP)

NP → D N  (a noun phrase NP is composed of a determiner D and a noun N)

VP → V NP  (a verb phrase VP is composed of a verb V and a noun phrase NP)

VP → VP PP  (or of a verb phrase plus a prepositional phrase PP, etc.)

PP → P NP

  4. A start symbol: S

The parser starts at S and applies its rules successively until it arrives at the terminal symbols. The resulting parse is the set of labeled relationships connecting those terminals (i.e. the words).
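As a toy illustration, here is roughly what such a grammar might look like written out with NLTK, along with a chart parser that applies the rules to our parakeet sentence. This is a hand-written sketch covering only this one sentence; a real parser’s grammar and statistics would be learned from (much larger) training data.

```python
# A toy context-free grammar for the parakeet sentence, plus a chart parser
# that applies the rules to recover the tree.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> D N | N
VP -> V NP | VP PP
PP -> P NP
D  -> 'my' | 'the'
N  -> 'parakeet' | 'mimosas' | 'morning'
V  -> 'drinks'
P  -> 'in'
""")

parser = nltk.ChartParser(grammar)
tokens = "my parakeet drinks mimosas in the morning".split()
for tree in parser.parse(tokens):
    print(tree)   # the bracketed parse(s) this grammar licenses for the sentence
```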

There are two main kinds of syntactic parsers: dependency and constituency. To keep this post to a reasonable length, I’ll focus on dependency only, but constituency parsers output structures similar to the parakeet tree above[3]. A dependency parser builds a tree for each input sentence by starting with a sentence root (usually the main verb), and assigning a head word to each word, until it gets to the end of the sentence. (Heads link to dependents.) When it’s done, each word has at least one branch, or relationship, with another word. The parser also characterizes each word-word relationship. These are things like: nominal subject of a verb (“nsubj”); object of a verb or a preposition (“dobj” and “pobj,” respectively); conjunction (“cc” for the conjunction word, and “conj” for the elements being conjoined); determiner (“det”); and adverbial modifier (“advmod”).

A visualized example will probably help. Taking that same sentence, “My parakeet drinks mimosas in the morning,” a visualization of the dependency parse might look like this:

[Image: displaCy dependency parse of “My parakeet drinks mimosas in the morning”]

Can you spot the root, or main verb? It’s the one without any arrows going towards it: drinks. The parser then finds the subject of drinks, which is parakeet, and labels that relationship “nsubj.” It finds mimosas as the direct object of drinks, and labels it “dobj.” And so on and so forth.
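If you want to reproduce this yourself, here’s a minimal sketch using spaCy, the library behind the displaCy visualizations shown here (see footnote 4). I’m assuming the small English model en_core_web_sm has been downloaded; exact labels can vary a bit by model version.

```python
# Dependency parsing with spaCy: print each word, its relation label,
# and the head word it depends on.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes: python -m spacy download en_core_web_sm
doc = nlp("My parakeet drinks mimosas in the morning")

for token in doc:
    print(f"{token.text:10} {token.dep_:8} -> {token.head.text}")
# Expect relations like: parakeet -nsubj-> drinks, mimosas -dobj-> drinks,
# and drinks labeled ROOT (its head is itself).
```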

Let’s look at another example, for a dollop of variety. Here is “Mr. Vanderloop had smiled and said hello”:

[Image: displaCy dependency parse of “Mr. Vanderloop had smiled and said hello”]

In this one, the past participle smiled is the root/main verb, which has multiple dependents: its subject Vanderloop, its auxiliary (a.k.a. “helping verb”) had, its conjunction and, and the other verb with which it conjoins, said. The subject Vanderloop has a dependent Mr., with which it forms a compound (proper) noun; said’s dependent is the interjection hello.

How about our sentence from the beginning, “Find flights from Chicago to New York on February 21”? How might it be parsed? (You can check your hypotheses by typing the sentence into an interactive demo of the displaCy dependency visualizer, from which the visualizations above also came[4].) Something to keep in mind here is that English imperative structure leaves the subject – whoever is being addressed – implicit.
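If you’d rather not leave your editor, the same kind of visualization can be generated locally. Here’s a small sketch, again assuming the en_core_web_sm model is installed.

```python
# Render the dependency arcs for the flights command in a browser,
# much like the hosted displaCy demo.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Find flights from Chicago to New York on February 21")

# The imperative verb "Find" should come out as the ROOT, with no explicit subject.
displacy.serve(doc, style="dep")   # serves the visualization at http://localhost:5000
```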

A slight aside: I’ve chosen simple examples for demonstration, but parsing gets decidedly complicated when input sentences are themselves complicated. Questions, subordinate clauses, coordination (or all three: “What’s the name of the movie where the guy drives a flying taxi and saves the human race from aliens?”), and structurally ambiguous sentences (“The horse raced past the barn fell”) get tricky quickly.

So now we have some parsed output. How is this structured, annotated data useful? Well, one thing you can do with these word relations is identify noun phrases. Identifying noun phrases across sentences helps with another step in the NLP pipeline called Named Entity Recognition, or NER. NER tries to recognize nouns/noun phrases (names, places, dates, etc.) and label them with categories of concepts from the real world. In our flights example, “Chicago” and “New York” should get tagged with some label like CITY or GEOGRAPHIC LOCALE, and “February 21” should get tagged with DATE. Once a text has been automatically annotated for such named entities, information about those entities can then be pulled from a knowledge base (say, Wikipedia).
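With spaCy, both of these come more or less for free once the pipeline has run: the parse exposes noun chunks, and the NER component exposes labeled entities. A quick sketch follows; spaCy’s GPE label (geopolitical entity) is its rough analogue of the CITY / GEOGRAPHIC LOCALE idea above, and the exact output is model-dependent.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Find flights from Chicago to New York on February 21")

# Noun phrases identified via the dependency parse
print([chunk.text for chunk in doc.noun_chunks])
# e.g. ['flights', 'Chicago', 'New York', 'February 21']

# Named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Chicago GPE, New York GPE, February 21 DATE
```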

Having parts of speech and word relations also makes it easier to match up the specifics of a given user command (e.g. “Text mom saying I’ll call tonight,” or “Show popular Thai restaurants near me”) with slightly more generalized intents (e.g. Send text or Get restaurants); machine models can start learning how words typically pattern across the main verb and direct object positions for various commands. Code then uses the more generalized intent to fulfill that request on a device – be it smartphone, tablet, or home speaker. “Find flights from Chicago to New York on February 21” would hopefully be matched with a more general Get flights intent, and the particular noun phrases could be passed to fields for origin, destination, and date.
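To make the slot-filling idea concrete, here is a deliberately simplified, hypothetical sketch that pulls origin, destination, and date out of the parse by looking at the objects of “from,” “to,” and “on.” Real assistants use trained intent and slot models rather than hand-written rules like these; this is only to show how the parse structure makes the mapping tractable.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Find flights from Chicago to New York on February 21")

slots = {}
for token in doc:
    if token.dep_ == "pobj":                              # object of a preposition
        phrase = " ".join(t.text for t in token.subtree)  # the whole prepositional object
        prep = token.head.text.lower()                    # the preposition itself
        if prep == "from":
            slots["origin"] = phrase
        elif prep == "to":
            slots["destination"] = phrase
        elif prep == "on":
            slots["date"] = phrase

print("intent: Get flights")
print(slots)   # e.g. {'origin': 'Chicago', 'destination': 'New York', 'date': 'February 21'}
```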

* * * * *

Before leaving you to your holiday leftovers, I’d like to reiterate that syntactic parsing is only one step in an NLP system. Its parses don’t tell us much about the actual semantics of the linguistic input. Language meaning, however, is a whole other ball of wax, best left for the new year…

 

[1] There is often terminological confusion between NLP and NLU (natural language understanding). See this graphic for one common breakdown, although I’ve heard the terms used interchangeably as well.

[2] If you’re interested in learning about other NLP steps, read this accessible post, Natural Language Processing is Fun!

[3] You can also play around with this interactive demo from Stanford CoreNLP, http://corenlp.run. In the second “Annotations” field dropdown, make sure you have “constituency parse” selected.

[4] The visualizer is from the creators of spaCy, an awesome open-source NLP library in Python; a dependency parser is one of its components.

On machine translation, the media, and meaning (a response)

[Image: Shallowness_GoogTrans]

I’m a Douglas Hofstadter fan. I read his book I Am a Strange Loop years ago, and it remains one of my three favorite non-fiction books, period. I highly recommend it to anyone who is at all interested in the nature of consciousness. The cognitive scientist’s Pulitzer Prize-winning Gödel, Escher, Bach: An Eternal Golden Braid has also been on my to-read list for a long time. So I was excited to see this article by him in The Atlantic, on another area that interests me: machine translation and machine “intelligence”.

Early on in the piece, Hofstadter says he has a “longstanding belief that it’s important to combat exaggerated claims about artificial intelligence”. Having worked in the machine learning/AI field for a little under a year now (but what an intense year it has been!), and having read countless popular media articles touting the astonishing advances in natural language processing/understanding, ML, and AI, I heartily agree with his sentiment. Such reporting is as misleading as it is annoying.

I came across a statement of this type the other day, in Stanford AI researchers make ‘socially inclusive’ NLP:

“The average person working with NLP today may consider language identification a solved problem.”

I have trouble believing that any researcher working in NLP/NLU/ML/AI thinks anything is a solved problem. Despite much progress, the field is still in its infancy. Doesn’t anyone remember Einstein’s quote (adapted from a similar idea expressed by Socrates) – “The more I learn, the more I realize how much I don’t know”? Where I work, every possible solution to a given problem brings up more questions, and even the “simplest” “facts” cannot always be taken for granted. (Remember when you were taught parts of speech like verb, noun, and preposition in grade school? Working at the level of detail we do, even these fundamental rules are often inadequate, requiring further specification. Turns out it’s hard to throw messy, real language into clean, fixed bins.) So I think the media does the field, its researchers, and the reading public a great disservice by sensationalizing and oversimplifying the challenges.

Hofstadter’s argument about understanding is even more poignant:

“The practical utility of Google Translate and similar technologies is undeniable, and probably it’s a good thing overall, but there is still something deeply lacking in the approach, which is conveyed by a single word: understanding. Machine translation has never focused on understanding language. Instead, the field has always tried to ‘decode’— to get away without worrying about what understanding and meaning are.”

We call the study of meaning and understanding semantics and pragmatics. People’s real world knowledge plays a key role here as well. To my mind, meaning (only complete when tied to real world knowledge) is the last frontier for AI and language. Today’s mobile/home voice assistants have definitely not yet mastered meaning. Technologies have made serious headway in resolving structural patterns (syntax), proper nouns (Named Entity Recognition) and some other aspects of language. But meaning, that great magical beast, eludes its pursuers. It is really, really challenging to computationally model the depth and complexity of human understanding. Because, although language itself is quite complicated, it’s still an impoverished medium for conveying the millions of subtle things we want and are able to convey – it relies heavily on context, implicature, presupposition, entailment, prosody, speaker-listener relationship, etc. I agree again with the author when he says that human-like machines are “not around the corner.”

I do think Hofstadter, while recognizing how hard the task of modeling meaning is, doesn’t give enough credit for what has already been accomplished. Google Translate is way better now than it was at its inception over ten years ago. I also think he glosses over the tool’s main usage – which is more functional than artistic or poetic. Am I wrong to assume people use Translate much more for a quick word or phrase, when traveling or speaking on the fly, than for translating longer passages of literature? If they’re doing the latter… per the author’s experimental results, they clearly shouldn’t be.

 

What do you think – about media reportage of technology, machine “intelligence”, Hofstadter’s article? Feel free to comment!