Career interviews: Computational linguist for a virtual assistant

[Image: “Wugs go to work”]

After much delay (eek! just realized it’s been a year!), I have another interview with a career linguist for your reading pleasure. [See the first interview here.] Even though I still get the “I’ve never met a real-live linguist” reaction when telling folks what I do, these days there are indeed people working full-time and earning living wages as these specialized language nuts – and not all as professors in academia or as translators/interpreters for the UN.

* * * * *

Just like with my last interviewee, I met Allan at Samsung Research America, where we worked together on Bixby, Samsung’s virtual voice assistant. On the Bixby linguist team, we worked with engineers, Quality Assurance (QA) testers and others to develop a personal assistant that would carry out thousands of different spoken user commands. Also like with my last interviewee, Allan is no longer at the job I interviewed him about. (He’s now a Language Engineer on Amazon’s Alexa!). I’m keeping questions and answers in present tense, however, because I feel like it.

[Photo: Allan Schwade, a graduate student in linguistics, won the Humanities Division Dean's Award for his poster on the adaptation of Russian words by English speakers]

  1. What kind of work do you do?

I’m a computational linguist, which means I create solutions for natural language processing problems using computers. More specifically, I work on the systems and machine learning models that enable your smart devices to understand you when you say “set an alarm for 7am” or “tell me the weather in Chicago”.

  2. Describe a typical day at your job.

I usually start the day by meeting with my manager. The lab I work in supports products in production and conducts research and development for smart devices. If there is an issue with a product in production, I’ll work with the team to solve the problem. Usually this involves curating the training data for the machine learning models – removing aberrant data from training sets or generating new data to support missing patterns. If nothing is on fire, there are usually several projects I’ll be working on at any given time. Projects generally start out with me doing a lot of reading on the state of the art, then I’ll reach a point where I’m confident enough to build a proof of concept (POC). While I’m creating the POC, the linguists will generate data for the models. Once the code and data are ready, I’ll build the models and keep iterating until performance is satisfactory. The only really dependable things in my schedule are lunch and a mid-afternoon coffee break with colleagues, both of which are indispensable.

  3. How does your linguistics background inform your current work?

My degree in linguistics is crucial for my current line of work. When building machine learning models, so much rests on the data you feed into your models. If your data set is diverse and representative of the problem, your model will be robust.

Having a linguistics background also gives me quick insight into data sets and how to balance them. Understanding the latent structures in the data allows me to engineer informative feature vectors for my models (feature vectors are derived from the utterances collected and are the true inputs to the machine learning model).
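To make the idea of feature vectors a bit more concrete, here is a minimal sketch of the simplest possible version – a bag-of-words vectorizer over a few toy commands using scikit-learn. (This is my own illustration, not the pipeline Allan describes; real assistant features are much richer.)

```python
# Minimal illustration: turn raw utterances into numeric feature vectors.
from sklearn.feature_extraction.text import CountVectorizer

utterances = [
    "set an alarm for 7am",
    "tell me the weather in Chicago",
    "set a timer for ten minutes",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(utterances)   # one row (feature vector) per utterance

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # the vectors a model would train on
```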

  4. What do you enjoy most and/or least about the job?

I really enjoy getting to see differences between human and machine learning. We have a pretty good idea of the types of things humans will attend to when learning language, but sometimes those things aren’t informative for machines. It can be frustrating when something I’d call “obvious” is useless in a model and even more frustrating when something “marginal” is highly informative. But I never tire of the challenge; the satisfaction I feel at the end of a project is worth it.

The thing I enjoy least is data annotation. The process of doing it is indispensable because you become intimately familiar with the problem, but after a couple of hours of it my mind goes numb.

  5. What did you study in college and/or grad school?

I got my BA from Rutgers University and my MS from the University of California, Santa Cruz. Both degrees were in linguistics and both schools specialized in generative linguistic theory. I enjoyed a lot about the programs, but they did a better job of preparing people for careers in academia than in industry. Learning programming or common annotation tools and schemas before graduating would have made industry life easier for me.

  6. What is your favorite linguistic phenomenon?

Loanword adaptation! I wrote my master’s thesis on it. Seeing how unfamiliar phonemes are digested by speakers never fails to pique my interest. In general, I love it when stable systems are forced to reconcile things outside their realm of experience.

  7. (If you had the time) what language would you learn, and why?

As a phonetician I’d love to learn Georgian for its consonant clusters, Turkish for its morpho-phonology, Hmong for its tones, or ASL because it’s a completely different modality than what I specialized in. As a subjective entity who does things for personal enjoyment, I’d love to learn Japanese.

  8. Do you have any advice for young people looking to pursue a career in linguistics?

If you want to go into industry doing natural language processing, I cannot stress enough how important the ability to code is. It’s true that for annotation work you won’t usually need it, but if you want to be annotation lead, the ability to write utility scripts will save you a lot of time. My own transition from annotator to computational linguist also came from showing basic coding competency – the engineers were too busy to work on some projects, so they threw the smaller ones my way. This brings me to my next piece of advice: always voice your interest to the people who can actually get you involved. Telling your co-worker you really want to work on a cool project will do next to nothing, but telling your manager or the project lead that you are interested in a project may get you involved.

Frame Semantics and FrameNet


I’d like to discuss a theory in cognitive linguistics which is very near to my heart[1]: frame semantics. I’ll also present FrameNet, a database built using frame semantic theory, which has been and continues to be an excellent resource in the fields of natural language processing (NLP) and machine learning (ML).

Why is frame semantics cool? Why should you want to learn about it? Just this: the theory is an intuitive and comprehensive way to categorize the meaning of any scenario you could possibly dream up and express via language. Unlike many other semantic and syntactic theories, the core concepts are quickly understandable to the non-linguist. What’s more, frame semantics can apply to language meaning at many different levels (from the tiniest morpheme to entire swaths of discourse), and it works equally well for any particular language – be it English, Mandarin, Arabic, or Xhosa. I’ll try to demonstrate the theory’s accessibility and applicability with some details.

American linguist Charles Fillmore developed the frame semantics research program in the 1980s, using the central idea of a frame: a cognitive scene or situation which is based on a person’s prototypical understanding of real-world (social, cultural, biological) experiences. A frame is ‘evoked’ by language – this can be a single word (called a lexical unit), a clause, a sentence, or even longer discourse. Each frame contains various participants and props, called frame elements (FEs). If you’ve studied syntax/semantics (the generative grammar kind), FEs are somewhat analogous to traditional theta roles.

FrameNet is a corpus-based lexicographic and relational database (sort of a complex dictionary) of English frames, the lexical units evoking them, annotated sentences containing those lexical units, and a hierarchy of frame-to-frame relations. It was built and continues to grow at the International Computer Science Institute (ICSI), a nonprofit research center affiliated with UC Berkeley. FrameNets have also been developed in other languages, such as Spanish, Brazilian Portuguese, Japanese, Swedish, French, Chinese, Italian, and Hebrew.

Each frame entry includes a definition, example sentences, frame elements, lexical units, and annotation that illustrates the various fillers (words) of the FEs as well as their syntactic patterns. Let’s unpack all of this!

We’ll take a look at the Motion frame in FrameNet. Some screenshots of the frame entry follow.

[Screenshot: the Motion frame entry in FrameNet – definition and example sentences]

The Motion frame is first defined. Its definition includes the frame elements that belong to the frame (the text with color highlighting):

“Some entity (Theme) starts out in one place (Source) and ends up in some other place (Goal), having covered some space between the two (Path). Alternatively, the Area or Direction in which the Theme moves or the Distance of the movement may be mentioned.”

After the definition come example sentences, featuring lexical units that evoke the frame (the black-backgrounded text) such as move, drift, float, roll, go.

Further down is the list of frame elements with their definitions and examples.

[Screenshot: the Motion frame’s frame elements, with definitions and examples]

Here, the Theme FE is “the entity that changes location,” while the Goal FE is “the location the Theme ends up in.” In order for language to evoke this Motion frame, it must have some words or phrases which instantiate the Theme, the Goal, and the other FEs listed. In the examples above, me is a Theme in The explosion made [me] MOVE in a hurry; and into the slow lane is a Goal in The car MOVED [into the slow lane].

At the bottom of the entry is a list of lexical units that belong to or evoke the frame, as well as links to annotation of sentences from real data that contain those words.

[Screenshot: lexical units belonging to the Motion frame, with links to annotation]

Verbs like come, glide, roll, travel, and zigzag all evoke, quite sensibly, the Motion frame.

Once you click on the “Annotation” link for a particular lexical item, you’re taken to a page that looks like this:

[Screenshot: annotated corpus sentences for the lexical unit glide]

Natural language sentences pulled from online corpora (texts from newspapers, magazines, books, TV transcripts, scholarly articles, etc.) are annotated for their Motion FEs. Annotation for the lexical item glide gives us an idea of the types of “entities” (the purple-backgrounded text, or Theme FEs) that “change location” (i.e. that glide) – boats, pink clouds, men, cars, planes, gondolas, and so on.
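If you’d rather poke around the Motion frame programmatically than through the website, the FrameNet 1.7 data also ships as an NLTK corpus. A small sketch, assuming NLTK is installed and the framenet_v17 corpus has been downloaded:

```python
import nltk
nltk.download("framenet_v17")  # one-time download of the FrameNet 1.7 data
from nltk.corpus import framenet as fn

motion = fn.frame("Motion")    # look the frame up by name

print(motion.definition)              # the definition quoted above
print(sorted(motion.FE.keys()))       # frame elements: Theme, Source, Goal, Path, ...
print(sorted(motion.lexUnit.keys()))  # lexical units: glide.v, drift.v, move.v, ...
```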

* * * * *

After this mini FrameNet dive, you may be wondering how the database is used in a concrete sense. To illustrate, let’s compare two sentences:

  1. The boat GLIDED into the harbor.
  2. The dinghy DRIFTED away from the harbor.

The entities differ (boat vs. dinghy), the verbs differ (glide vs. drift) and the prepositions differ (into vs. [away] from). Yet at a higher level, both of these sentences describe a Theme which “changes location” – either moving towards a Goal in (1), or from a Source in (2). They both indicate motion. Because FrameNet helps machines “learn” that sentences with a variety of nouns, verbs, prepositions, and syntactic patterns can basically point to the same scenario, it’s a useful tool for many applications in the computational realm.

These days computers do all kinds of language-y things for us: answer questions, paraphrase texts, extract relevant information from text (and then maybe organize it thematically – for instance, around people, places, or events), and even generate new texts. These feats require that a computer parse natural language into accurate semantic chunks. FrameNet’s semantically- and syntactically-annotated data can be used as training input for machine models that “learn” how to analyze such meaning chunks, enabling our electronic devices to respond, paraphrase, or extract information appropriately.

To peruse a (very long) list of the projects which have used FrameNet data (organized by requester/researcher), check out the FrameNet Downloaders page.

So – on the off-chance that you find yourself stuck at home and bored out of your mind (?!?!)… you might perhaps enjoy a little investigation of frame-semantic characterization of scenes that involve applying heat, intoxication, or temporal collocation. 🙂

 

[1] Why am I so fond of frame semantics? A terrific professor of mine during grad school introduced the theory, and it resonated with me immediately. I used it in my master’s thesis, then presented the paper at the International Conference on Construction Grammar in 2014. Eventually, I had the privilege of working at FrameNet, where I came to know the brilliant lexicographers/semanticists/cognitive linguists who have dedicated decades of their lives to the theory and the project. Sadly, I never met the legendary Chuck Fillmore, as he passed away the year before I joined the FrameNet team.

O Syntax Tree, O Syntax Tree!


Digital voice agents like Alexa, Siri, and Google Assistant are all the rage these days. But when we talk to our smart devices, are they actually “understanding” our speech in the same way that another human understands it? Take the command, “Find flights from Chicago to New York on February 21.” We can easily comprehend this sentence; our newborn brains were predisposed to acquire language, and we’ve been using it ever since.

Computers, on the other hand, cannot acquire language. They must be trained. In order to train them, computational linguists, other linguists, and engineers have broken language down into more manageable parts that can be tackled individually. Automatic speech recognition (ASR) deals with training machines to recognize speech (via acoustic properties, etc.), and convert that speech to text. Next, natural language processing (NLP) attempts to figure out what is meant by that text[1]. An NLP system itself is composed of multiple modules[2], one of which will likely be a syntactic parser.

Today we’re going to delve into the parser component. Let’s start with some syntactic basics!

Syntax is the set of rules and processes governing sentence structure in any natural language. It involves things like word order and constituents (words or phrases that form functional units). One of the most common ways to represent syntactic information (at least as of the 20th century) is with a syntax tree. Traditional syntax trees specify:

  • The words of a phrase/sentence
  • Part of speech for each word, usually abbreviated
    • N (noun); V (verb); P (preposition); D or DET (determiner, a.k.a. article); A (adjective); etc.
  • Larger phrases, also abbreviated
    • S (sentence); NP (noun phrase); VP (verb phrase); etc.
  • Relationships between all of the words and phrases
    • These are hierarchical relationships that show how constituents combine into larger ones (or split into smaller ones, if starting from the opposite end of the tree)

Here’s a tree diagram (specifically, a constituency tree) for the sentence, “My parakeet drinks mimosas in the morning”:

[Tree diagram: constituency tree for “My parakeet drinks mimosas in the morning”]

You can see that my parakeet forms a larger chunk which is a noun phrase, in the morning forms a larger chunk which is a prepositional phrase, drinks mimosas in the morning forms an even larger chunk which is a verb phrase, and both the NP and VP combine to form the largest chunk, a full sentence S. Remember that syntax focuses on phrasal order and structure, not meaning or context – so it can’t tell us why on earth you’re feeding boozy orange juice to your pet bird.
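If you want to play with this structure yourself, NLTK’s Tree class can render a rough ASCII version of the parakeet tree. A minimal sketch – the bracketing below is my own simplified analysis, not the only possible one:

```python
from nltk import Tree

# Bracketed notation: each (LABEL children...) group is a constituent.
parakeet = Tree.fromstring("""
(S
  (NP (D My) (N parakeet))
  (VP
    (VP (V drinks) (NP (N mimosas)))
    (PP (P in) (NP (D the) (N morning)))))
""")

parakeet.pretty_print()  # draws the tree in the terminal
```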

On to the parsing! Very generally, a parser is a piece of software (often a trained machine learning model) that takes input text and outputs a parse tree or similar structural representation, based on syntactic rules and statistics learned from its training data.

Syntactic parsers include a component called a Context-Free Grammar, which has:

  1. A set of non-terminal symbols – abbreviations for language constituents (lexical parts of speech and phrasal types):

{S, NP, VP, PP, D, N, A…}

  2. A set of terminal symbols – words of the phrase/sentence:

{drinks, parakeet, mimosas, morning, my, in, the}

  3. A set of rules like:

S → NP VP  (a sentence S is composed of a noun phrase NP and verb phrase VP)

NP → D N  (a noun phrase NP is composed of a determiner D and a noun N)

VP → VP PP  (etc.)

PP → P NP

  4. A start symbol: S

The parser starts at S and applies its rules successively until it arrives at the terminal symbols. The resulting parse is the set of labeled relationships connecting those terminals (i.e. the words).
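Here’s that machinery as a toy NLTK grammar. I’ve filled out the rule set above with a couple of extra rules (VP → V NP, NP → N) and the terminal productions this one sentence needs; it’s a sketch for illustration, nowhere near a real grammar of English:

```python
import nltk

# A toy Context-Free Grammar: non-terminals on the left, terminals in quotes.
grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> D N | N
  VP -> VP PP | V NP
  PP -> P NP
  D  -> 'my' | 'the'
  N  -> 'parakeet' | 'mimosas' | 'morning'
  V  -> 'drinks'
  P  -> 'in'
""")

parser = nltk.ChartParser(grammar)
tokens = "my parakeet drinks mimosas in the morning".split()

# The parser starts at S and applies rules until it reaches the words.
for tree in parser.parse(tokens):
    tree.pretty_print()
```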

There are two main kinds of syntactic parsers: dependency and constituency. To keep this post to a reasonable length, I’ll focus on dependency only, but constituency parsers output structures similar to the parakeet tree above[3]. A dependency parser builds a tree for each input sentence by starting with a sentence root (usually the main verb), and assigning a head word to each word, until it gets to the end of the sentence. (Heads link to dependents.) When it’s done, each word has at least one branch, or relationship, with another word. The parser also characterizes each word-word relationship. These are things like: nominal subject of a verb (“nsubj”); object of a verb or a preposition (“dobj” and “pobj,” respectively); conjunction (“cc” for the conjunction word, and “conj” for the elements being conjoined); determiner (“det”); and adverbial modifier (“advmod”).

A visualized example will probably help. Taking that same sentence, “My parakeet drinks mimosas in the morning,” a visualization of the dependency parse might look like this:

[Dependency parse visualization: “My parakeet drinks mimosas in the morning”]

Can you spot the root, or main verb? It’s the one without any arrows going towards it: drinks. The parser then finds the subject of drinks, which is parakeet, and labels that relationship “nsubj.” It finds mimosas as the direct object of drinks, and labels it “dobj.” And so on and so forth.

Let’s look at another example, for a dollop of variety. Here is “Mr. Vanderloop had smiled and said hello”:

[Dependency parse visualization: “Mr. Vanderloop had smiled and said hello”]

In this one, the past participle smiled is the root/main verb, which has multiple dependents: its subject Vanderloop, its auxiliary (a.k.a. “helping verb”) had, its conjunction and, and the other verb with which it conjoins, said. The subject Vanderloop has a dependent Mr., with which it forms a compound (proper) noun; said’s dependent is the interjection hello.

How about our sentence from the beginning, “Find flights from Chicago to New York on February 21”? How might it be parsed? (You can check your hypotheses by typing the sentence into an interactive demo of the displaCy dependency visualizer, from which the visualizations above also came[4].) Something to keep in mind here is that English imperative structure leaves the subject – whoever is being addressed – implicit.
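You can also reproduce these diagrams locally with spaCy, the library the displaCy visualizer comes from. A quick sketch, assuming the small English model is installed (python -m spacy download en_core_web_sm); exact labels can vary a bit across model versions:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Find flights from Chicago to New York on February 21")

# Print each word with its dependency label and the head it attaches to.
for token in doc:
    print(f"{token.text:10} {token.dep_:10} <- {token.head.text}")

# Serve an arrow diagram like the ones above in your browser.
displacy.serve(doc, style="dep")
```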

A slight aside: I’ve chosen simple examples for demonstration, but parsing gets decidedly complicated when input sentences are themselves complicated. Questions, subordinate clauses, coordination (or all three: “What’s the name of the movie where the guy drives a flying taxi and saves the human race from aliens?”), and structurally ambiguous sentences (“The horse raced past the barn fell”) get tricky quickly.

So now we have some parsed output. How is this structured, annotated data useful? Well, one thing you can do with these word relations is identify noun phrases. Identifying noun phrases across sentences helps with another step in the NLP pipeline called Named Entity Recognition, or NER. NER tries to recognize nouns/noun phrases (names, places, dates, etc.) and label them with categories of concepts from the real world. In our flights example, “Chicago” and “New York” should get tagged with some label like CITY or GEOGRAPHIC LOCALE, and “February 21” should get tagged with DATE. Once a text has been automatically annotated for such named entities, information about those entities can then be pulled from a knowledge base (say, Wikipedia).
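spaCy’s default pipeline also includes an NER component, so you can watch this tagging happen on the flights sentence. (Its label set differs slightly from my made-up CITY example – cities come out as GPE, for geopolitical entity.)

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Find flights from Chicago to New York on February 21")

for ent in doc.ents:
    print(ent.text, ent.label_)

# Expected output (roughly):
#   Chicago      GPE
#   New York     GPE
#   February 21  DATE
```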

Having parts of speech and word relations also makes it easier to match up the specifics of a given user command (e.g. “Text mom saying I’ll call tonight,” or “Show popular Thai restaurants near me”) with slightly more generalized intents (e.g. Send text or Get restaurants); machine models can start learning how words typically pattern across the main verb and direct object positions for various commands. Code then uses the more generalized intent to fulfill that request on a device – be it smartphone, tablet, or home speaker. “Find flights from Chicago to New York on February 21” would hopefully be matched with a more general Get flights intent, and the particular noun phrases could be passed to fields for origin, destination, and date.
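As a toy illustration of that matching – emphatically not how Bixby, Alexa, or any production assistant actually does it – you could imagine reading a rough intent off the root verb and its direct object, and the slots off the named entities:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def toy_intent(text):
    """Guess a (verb, object) 'intent' plus entity 'slots' for a spoken command."""
    doc = nlp(text)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")              # main verb
    dobj = next((t for t in root.children if t.dep_ == "dobj"), None)  # its direct object
    intent = f"{root.lemma_}_{dobj.lemma_}" if dobj else root.lemma_
    slots = [(ent.label_, ent.text) for ent in doc.ents]
    return intent, slots

print(toy_intent("Find flights from Chicago to New York on February 21"))
# Something like: ('find_flight', [('GPE', 'Chicago'), ('GPE', 'New York'), ('DATE', 'February 21')])
```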

* * * * *

Before leaving you to your holiday leftovers, I’d like to reiterate that syntactic parsing is only one step in an NLP system. Its parses don’t tell us much about the actual semantics of the linguistic input. Language meaning, however, is a whole other ball of wax, best left for the new year…

 

[1] There is often terminological confusion between NLP and NLU (natural language understanding). See this graphic for one common breakdown, although I’ve heard the terms used interchangeably as well.

[2] If you’re interested to learn about other NLP steps, read this accessible post, Natural Language Processing is Fun!

[3] You can also play around with this interactive demo from Stanford CoreNLP, http://corenlp.run. In the second “Annotations” field dropdown, make sure you have “constituency parse” selected.

[4] The visualizer is from the creators of spaCy, an awesome open-source NLP library in Python; a dependency parser is one of its components.

On machine translation, the media, and meaning (a response)


I’m a Douglas Hofstadter fan. I read his book I Am a Strange Loop years ago, and it remains one of my three favorite non-fiction books, period. I highly recommend it to anyone who is at all interested in the nature of consciousness. The cognitive scientist’s Pulitzer Prize-winning Gödel, Escher, Bach: An Eternal Golden Braid has also been on my to-read list for a long time. So I was excited to see this article by him in The Atlantic, on another area that interests me: machine translation and machine “intelligence”.

Early on in the piece, Hofstadter says he has a “longstanding belief that it’s important to combat exaggerated claims about artificial intelligence”. Having worked in the machine learning/AI field for a little under a year now (but what an intense year it has been!), and having read countless popular media articles touting the astonishing advances in natural language processing/understanding, ML, and AI, I heartily agree with his sentiment. Such reporting is as misleading as it is annoying.

I came across a statement of this type the other day, in Stanford AI researchers make ‘socially inclusive’ NLP:

“The average person working with NLP today may consider language identification a solved problem.”

I have trouble believing that any researcher working in NLP/NLU/ML/AI thinks anything is a solved problem. Despite much progress, the field is still in its infancy. Doesn’t anyone remember Einstein’s quote (adapted from a similar idea expressed by Socrates) – “The more I learn, the more I realize how much I don’t know”? Where I work, every possible solution to a given problem brings up more questions, and even the “simplest” “facts” cannot always be taken for granted. (Remember when you were taught parts of speech like verb, noun, and preposition in grade school? Working at the level of detail we do, even these fundamental categories are often inadequate, requiring further specification. Turns out it’s hard to throw messy, real language into clean, fixed bins.) So I think the media does the field, its researchers, and the reading public a great disservice by sensationalizing and oversimplifying the challenges.

Hofstadter’s argument about understanding is even more poignant:

“The practical utility of Google Translate and similar technologies is undeniable, and probably it’s a good thing overall, but there is still something deeply lacking in the approach, which is conveyed by a single word: understanding. Machine translation has never focused on understanding language. Instead, the field has always tried to ‘decode’— to get away without worrying about what understanding and meaning are.”

We call the study of meaning and understanding semantics and pragmatics. People’s real world knowledge plays a key role here as well. To my mind, meaning (only complete when tied to real world knowledge) is the last frontier for AI and language. Today’s mobile/home voice assistants have definitely not yet mastered meaning. Technologies have made serious headway in resolving structural patterns (syntax), proper nouns (Named Entity Recognition) and some other aspects of language. But meaning, that great magical beast, eludes its pursuers. It is really, really challenging to computationally model the depth and complexity of human understanding. Because, although language itself is quite complicated, it’s still an impoverished medium for conveying the millions of subtle things we want and are able to convey – it relies heavily on context, implicature, presupposition, entailment, prosody, speaker-listener relationship, etc. I agree again with the author when he says that human-like machines are “not around the corner.”

I do think that Hofstadter, while rightly recognizing how hard the task of modeling meaning is, doesn’t give enough credit for what has been accomplished. Google Translate is way better now than it was at its inception over ten years ago. I also think he glosses over the tool’s main usage – which is more functional than artistic or poetic. Am I wrong to assume people use Translate much more for a quick word or phrase, when traveling or speaking on the fly, than for translating longer passages of literature? If they’re doing the latter… per the author’s experimental results, they clearly shouldn’t be.

 

What do you think – about media reportage of technology, machine “intelligence”, Hofstadter’s article? Feel free to comment!