Career interviews: Computational linguist for a virtual assistant


Wugs go to work

After much delay (eek! just realized it’s been a year!), I have another interview with a career linguist for your reading pleasure. [See the first interview here.] Even though I still get the “I’ve never met a real-live linguist” reaction when telling folks what I do, these days there are indeed people working full-time, earning living wages, as these specialized language nuts – and not all as professors in academia, or as translators/interpreters for the UN.

* * * * *

Just like with my last interviewee, I met Allan at Samsung Research America, where we worked together on Bixby, Samsung’s virtual voice assistant. On the Bixby linguist team, we worked with engineers, Quality Assurance (QA) testers and others to develop a personal assistant that would carry out thousands of different spoken user commands. Also like with my last interviewee, Allan is no longer at the job I interviewed him about. (He’s now a Language Engineer on Amazon’s Alexa!). I’m keeping questions and answers in present tense, however, because I feel like it.

Allan Schwade, a graduate student in linguistics, won the Humanities Division Dean's Award for his poster on the adaptation of Russian words by English speakers

  1. What kind of work do you do?

I’m a computational linguist, which means I create solutions for natural language processing problems using computers. More specifically, I work on the systems and machine learning models that enable your smart devices to understand you when you say “set an alarm for 7am” or “tell me the weather in Chicago”.

  2. Describe a typical day at your job.

I usually start the day by meeting with my manager. The lab I work in supports products in production and conducts research and development for smart devices. If there is an issue with a product in production, I’ll work with the team to solve the problem. Usually this involves curating the training data for the machine learning models – removing aberrant data from training sets or generating new data to support missing patterns. If nothing is on fire, there are usually several projects I’ll be working on at any given time. Projects generally start out with me doing a lot of reading on the state of the art, then I’ll reach a point where I’m confident enough to build a proof of concept (POC). While I’m creating the POC, the linguists will generate data for the models. Once the code and data are ready, I’ll build the models and keep iterating until performance is satisfactory. The only really dependable thing in my schedule is lunch and a mid-afternoon coffee break with colleagues, which are both indispensable.

  3. How does your linguistics background inform your current work?

My degree in linguistics is crucial for my current line of work. When building machine learning models, so much rests on the data you feed into your models. If your data set is diverse and representative of the problem, your model will be robust.

Having a linguistics background also gives me quick insight into data sets and how to balance them. Understanding the latent structures in the data allows me to engineer informative feature vectors for my models (feature vectors are derived from the utterances collected and are the true inputs to the machine learning model).
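To make the idea of feature vectors concrete, here is a minimal sketch of one classic approach, a bag-of-words featurizer. This is an illustration only, not Bixby's actual pipeline; the function names (`build_vocab`, `featurize`) and the example utterances are my own.

```python
# A toy bag-of-words featurizer: each utterance becomes a vector of
# word counts, indexed by a vocabulary learned from the training data.
from collections import Counter

def build_vocab(utterances):
    """Map every word seen in training to a fixed vector index."""
    vocab = {}
    for utt in utterances:
        for word in utt.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def featurize(utterance, vocab):
    """Count word occurrences; words unseen in training are dropped."""
    counts = Counter(utterance.lower().split())
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:
            vec[vocab[word]] = n
    return vec

train = ["set an alarm for 7am", "tell me the weather in chicago"]
vocab = build_vocab(train)
print(featurize("set an alarm", vocab))  # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

Real systems use far richer features (n-grams, embeddings, linguistically engineered cues), but the shape of the problem is the same: utterances in, fixed-length numeric vectors out.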

  4. What do you enjoy most and/or least about the job?

I really enjoy getting to see differences between human and machine learning. We have a pretty good idea of the types of things humans will attend to when learning language, but sometimes those things aren’t informative for machines. It can be frustrating when something I’d call “obvious” is useless in a model, and even more frustrating when something “marginal” is highly informative. But I never tire of the challenge; the satisfaction I feel at the end of a project is worth it.

The thing I enjoy least is data annotation. The process of doing it is indispensable because you become intimately familiar with the problem, but after a couple of hours of it my mind goes numb.

  5. What did you study in college and/or grad school?

I got my BA from Rutgers University and my MS from the University of California, Santa Cruz. Both degrees were in linguistics and both schools specialized in generative linguistic theory. I enjoyed a lot about the programs but they did a better job of preparing people for careers in academia than industry. Learning programming or common annotation tools and schemas before graduating would have made industry life easier for me.

  6. What is your favorite linguistic phenomenon?

Loanword adaptation! I wrote my master’s thesis on it. Seeing how unfamiliar phonemes are digested by speakers never fails to pique my interest. In general, I love it when stable systems are forced to reconcile things outside their realm of experience.

  7. (If you had the time) what language would you learn, and why?

As a phonetician I’d love to learn Georgian for its consonant clusters, Turkish for its morpho-phonology, Hmong for its tones, or ASL because it’s a completely different modality than what I specialized in. As a subjective entity who does things for personal enjoyment, I’d love to learn Japanese.

  8. Do you have any advice for young people looking to pursue a career in linguistics?

If you want to go into industry doing natural language processing, I cannot stress enough how important the ability to code is. It’s true that for annotation work you won’t usually need it, but if you want to be annotation lead, the ability to write utility scripts will save you a lot of time. Also, my transition from annotator to computational linguist came from showing basic coding competency – the engineers were too busy to work on some projects, so they threw the smaller ones my way. This brings me to my next piece of advice: always voice your interest to the people with the power to get you involved. Telling your co-worker you really want to work on a cool project will do next to nothing, but telling your manager or the project lead that you are interested may actually get you on it.

Frame Semantics and FrameNet


I’d like to discuss a theory in cognitive linguistics which is very near to my heart[1]: frame semantics. I’ll also present FrameNet, a database built using frame semantic theory, which has been and continues to be an excellent resource in the fields of natural language processing (NLP) and machine learning (ML).

Why is frame semantics cool? Why should you want to learn about it? Just this: the theory is an intuitive and comprehensive way to categorize the meaning of any scenario you could possibly dream up and express via language. Unlike many other semantic and syntactic theories, the core concepts are quickly understandable to the non-linguist. What’s more, frame semantics can apply to language meaning at many different levels (from the tiniest morpheme to entire swaths of discourse), and it works equally well for any particular language – be it English, Mandarin, Arabic, or Xhosa. I’ll try to demonstrate the theory’s accessibility and applicability with some details.

American linguist Charles Fillmore developed the frame semantics research program in the 1980s, using the central idea of a frame: a cognitive scene or situation which is based on a person’s prototypical understanding of real-world (social, cultural, biological) experiences. A frame is ‘evoked’ by language – this can be a single word (called a lexical unit), a clause, a sentence, or even longer discourse. Each frame contains various participants and props, called frame elements (FEs). If you’ve studied syntax/semantics (the generative grammar kind), FEs are somewhat analogous to traditional theta roles.

FrameNet is a corpus-based lexicographic and relational database (sort of a complex dictionary) of English frames, the lexical units evoking them, annotated sentences containing those lexical units, and a hierarchy of frame-to-frame relations. It was built and continues to grow at the International Computer Science Institute (ICSI), a nonprofit research center affiliated with UC Berkeley. FrameNets have also been developed in other languages, such as Spanish, Brazilian Portuguese, Japanese, Swedish, French, Chinese, Italian, and Hebrew.

Each frame entry includes a definition, example sentences, frame elements, lexical units, and annotation that illustrates the various fillers (words) of the FEs as well as their syntactic patterns. Let’s unpack all of this!

We’ll take a look at the Motion frame in FrameNet. Some screenshots of the frame entry follow.

[Screenshot: the Motion frame definition in FrameNet]

The Motion frame is first defined. Its definition includes the frame elements that belong to the frame (the text with color highlighting):

“Some entity (Theme) starts out in one place (Source) and ends up in some other place (Goal), having covered some space between the two (Path). Alternatively, the Area or Direction in which the Theme moves or the Distance of the movement may be mentioned.”

After the definition come example sentences, featuring lexical units that evoke the frame (the black-backgrounded text) such as move, drift, float, roll, go.

Further down is the list of frame elements with their definitions and examples.

[Screenshot: the Motion frame’s frame elements]

Here, the Theme FE is “the entity that changes location,” while the Goal FE is “the location the Theme ends up in.” In order for language to evoke this Motion frame, it must have some words or phrases which instantiate the Theme, the Goal, and the other FEs listed. In the examples above, me is a Theme in The explosion made [me] MOVE in a hurry; and into the slow lane is a Goal in The car MOVED [into the slow lane].

At the bottom of the entry is a list of lexical units that belong to or evoke the frame, as well as links to annotation of sentences from real data that contain those words.

[Screenshot: lexical units belonging to the Motion frame]

Verbs like come, glide, roll, travel, and zigzag all evoke, quite sensibly, the Motion frame.

Once you click on the “Annotation” link for a particular lexical item, you’re taken to a page that looks like this:

[Screenshot: annotated sentences for the lexical unit glide]

Natural language sentences pulled from online corpora (texts from newspapers, magazines, books, tv transcripts, scholarly articles, etc.) are annotated for their Motion FEs. Annotation for the lexical item glide gives us an idea of the types of “entities” (the purple-backgrounded text, or Theme FEs) that “change location” (i.e. that glide) – boats, pink clouds, men, cars, planes, gondolas, and so on.

* * * * *

After this mini FrameNet dive, you may be wondering how the database is used in a concrete sense. To illustrate, let’s compare two sentences:

  1. The boat GLIDED into the harbor.
  2. The dinghy DRIFTED away from the harbor.

The entities differ (boat vs. dinghy), the verbs differ (glide vs. drift), and the prepositions differ (into vs. [away] from). Yet at a higher level, both of these sentences describe a Theme which “changes location” – either moving towards a Goal in (1), or away from a Source in (2). They both indicate motion. Because FrameNet helps machines “learn” that sentences with a variety of nouns, verbs, prepositions, and syntactic patterns can point to essentially the same scenario, it’s a useful tool for many applications in the computational realm.
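One way to picture what a frame-semantic annotation buys you computationally: once both sentences are labeled, their shared scenario is visible to a program even though their surface words differ. This is a toy data structure of my own devising, not FrameNet’s actual format or API.

```python
# Two surface-different sentences annotated with the same Motion frame.
from dataclasses import dataclass, field

@dataclass
class FrameAnnotation:
    frame: str                 # e.g. "Motion"
    lexical_unit: str          # the frame-evoking word
    elements: dict = field(default_factory=dict)  # FE name -> text span

s1 = FrameAnnotation("Motion", "glide",
                     {"Theme": "the boat", "Goal": "into the harbor"})
s2 = FrameAnnotation("Motion", "drift",
                     {"Theme": "the dinghy", "Source": "away from the harbor"})

# Different verbs, nouns, and prepositions -- but the same frame,
# so a program can treat both as descriptions of motion:
assert s1.frame == s2.frame
print(s1.frame)  # Motion
```

(If you want to play with the real thing, the NLTK library ships a reader for the actual FrameNet data release.)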

These days computers do all kinds of language-y things for us: answer questions, paraphrase texts, extract relevant information from text (and then maybe organize it thematically – for instance, around people, places, or events), and even generate new texts. These feats require that a computer parse natural language into accurate semantic chunks. FrameNet’s semantically- and syntactically-annotated data can be used as training input for machine models that “learn” how to analyze such meaning chunks, enabling our electronic devices to respond, paraphrase, or extract information appropriately.

To peruse a (very long) list of the projects which have used FrameNet data (organized by requester/researcher), check out the FrameNet Downloaders page.

So – on the off-chance that you find yourself stuck at home and bored out of your mind (?!?!)… you might perhaps enjoy a little investigation of frame-semantic characterization of scenes that involve applying heat, intoxication, or temporal collocation. 🙂

 

[1] Why am I so fond of frame semantics? A terrific professor of mine during grad school introduced the theory, and it resonated with me immediately. I used it in my master’s thesis, then presented the paper at the International Conference on Construction Grammar in 2014. Eventually, I had the privilege of working at FrameNet, where I came to know the brilliant lexicographers/semanticists/cognitive linguists who have dedicated decades of their lives to the theory and the project. Sadly, I never met the legendary Chuck Fillmore, as he passed away the year before I joined the FrameNet team.

Back from hiatus

Why, hello there! It’s been ages since I’ve posted, but I’ve been pretty busy with a tiny new experiment:

[Baby photo]

Ryden was born in October (the photo was taken at not-quite-two-months) and is now emphatically ingesting solids, crawling (but only backwards), and beginning to babble.

Now that my life has gone from hallucinatorily topsy-turvy to relatively stable (in a pandemic – yes, that’s how childbirth and newborn-land will relativize things), I plan on posting again more regularly. Coming up, stuff on:

  • frame semantics and FrameNet
  • “parentese” (apropos, yes?)
  • another linguist career interview
  • “crashblossoms”

Hurray!

 

A Norwegian smörgåsbord

[Photo: Norwegian sign]

Okay, “smörgåsbord” is a Swedish borrowing, but close enough. It’s appropriate for this post, which will be a buffet of miscellaneous facts about the Norwegian language.

I became interested in and started learning Norwegian because my brother has been living in Oslo for the past several years, where he is getting his Ph.D. in lichenology.[1] My family and I traveled to visit him last summer. To characterize the country in a few words, I’d say Norway is – more iconically – Vikings, fjords, trolls, nature, Norse mythology, and – more personally – lichens, stellar black coffee, gross sweet brown cheese, overly-restricted booze-purchasing hours, part of my paternal ancestry, and vampires.[2]

Heddal stavkirke (stave church), built in the early 13th century

So what’s cool about Norwegian?

Dialects

First (as I mentioned in one of the recent dialect posts), Norwegian forms a dialect continuum with Swedish and Danish, languages with which it is, to a greater or lesser extent, mutually intelligible. These are Scandinavian or North Germanic languages, along with Icelandic and Faroese. My brother, who now has a decent command of Norwegian, says he can understand Swedish relatively well too, although Danish is harder. Have a listen to differences between Danish and Norwegian in this video.

However, there are also a staggering number of Norwegian dialects spread across Norway. People claim it’s often harder to understand someone from a different part of the country (for example, Oslo inhabitants vs. speakers of trøndersk, a group of sub-dialects in north-central Trøndelag county) than it is to understand a Swede speaking Swedish. Wikipedia corroborates: “Variations in grammar, syntax, vocabulary, and pronunciation cut across geographical boundaries and can create a distinct dialect at the level of farm clusters. Dialects are in some cases so dissimilar as to be unintelligible to unfamiliar listeners.”

There are two official standard forms for the written language, even if there is no standard for spoken Norwegian (since local dialects rule in most situations). Bokmål (literally “book tongue”) is used in the majority of publications, and Nynorsk (“new Norwegian”) in under 10% of written communication.

Lexicon and Morphology

Onto smaller language-y bits: words and morphemes. Norwegian is super fun because it is prone to extensive compounding (like German), and these compounds often break down into etymologically amusing or charming pieces. By this I mean that the component words reveal interesting (but usually sensible) semantic relationships with the larger compound. Let me give you some examples:

| Norwegian compound | English word | Individual morphemes |
| --- | --- | --- |
| fruktkjøtt | “pulp” | frukt (“fruit”) + kjøtt (“meat”) ⇒ “fruit meat” |
| matbit | “snack” | mat (“food”) + bit (“bite”) ⇒ “food bite” |
| sommerfugl | “butterfly” | sommer (“summer”) + fugl (“bird”) ⇒ “summer bird” |
| morkake | “placenta” | mor (“mother”) + kake (“cake”) ⇒ “mother cake” |
| verdensrommet | “(outer) space” | verden (“world”) + s (possessive) + rom (“room”) + et (“the”) ⇒ “the room of the world” |
| sykehus | “hospital” | syke (“sick”) + hus (“house”) ⇒ “sick house” |
| grønnsak | “vegetable” | grønn (“green”) + sak (“thing”) ⇒ “green thing” |
| støvsuger | “vacuum cleaner” | støv (“dust”) + suger (“suck[er]”) ⇒ “dust suck[er]” |
| flaggermus | “bat” | flagre (“flutter”) + mus (“mouse”) ⇒ “flutter mouse” |
| piggsvin | “hedgehog” | pigg (“spike”) + svin (“pig”) ⇒ “spike pig” |

Morphosyntax 



Rest stop on the road back to Oslo. Rømmegraut is the Nynorsk word for a traditional porridge – kind of like cream of wheat, but sweeter and topped with butter.

One facet of Norwegian morphosyntax that was novel to me is the structure of its determiners. In English, both definite (“the”) and indefinite (“a / an”) articles are independent words that always precede their noun or noun phrase. So we have:

“the house”          “the big blue house”
“a house”             “a big blue house”

The same is true for the Romance languages I know about (French, Spanish, Italian), the other Germanic language I’m familiar with (German)… and it is simply not relevant for the Asian languages I’ve dabbled in (Japanese, Cantonese) because they lack articles entirely.

In Norwegian (as well as in Swedish and Danish), indefinite articles are, familiarly, the independent words which precede the noun, while definite articles are actually suffixes, which attach to the end of the noun they modify. What’s more – if you place something in front of the noun, like an adjective or a number, there’s another set of determiners to use, called demonstratives (in English: this, that, these, those). These precede the noun phrase (adjective/number + noun), where the noun already contains its definite suffix. Again, a table might help illustrate:

Norwegian (Bokmål) determiners

|  | Masc. singular | Fem. singular | Neuter singular |
| --- | --- | --- | --- |
| Indefinite article | en | ei | et |
|  | en sykkel “a bicycle” | ei jente “a girl” | et hus “a house” |
| Definite article (suffix) | -en | -a | -et |
|  | bilen “the car” | døra “the door” | huset “the house” |
| Demonstrative + noun phrase | den | den | det |
|  | den røde bilen “the red car” | den røde døra “the red door” | det røde huset “the red house” |

Because Norwegian and English are closely related in their linguistic genealogy, a native English speaker may have less trouble learning Norwegian than, say, Taa (also known as !Xóõ, a southern African language with possibly the largest phoneme inventory in the world, including dozens of clicks) – but as the determiner situation here demonstrates, it’s still no piece of bløtkake.
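The suffixing pattern is regular enough to sketch in a few lines of code. This is a deliberately simplified toy (the function names are mine, and real Bokmål has plenty of exceptions, plurals, and irregular nouns this rule ignores), but it captures the mechanism described above: a gender-conditioned definite suffix, plus a demonstrative when an adjective intervenes.

```python
# Toy Bokmål definite-marking rule: suffix by gender, and the
# "double definite" (demonstrative + suffixed noun) with adjectives.
DEF_SUFFIX = {"m": "en", "f": "a", "n": "et"}
DEMONSTRATIVE = {"m": "den", "f": "den", "n": "det"}

def definite(noun, gender):
    """'bil' (m) -> 'bilen'; a final -e merges with the suffix."""
    stem = noun[:-1] if noun.endswith("e") else noun
    return stem + DEF_SUFFIX[gender]

def definite_with_adj(noun, gender, adj):
    """Adjective present -> demonstrative + adj + suffixed noun."""
    return f"{DEMONSTRATIVE[gender]} {adj} {definite(noun, gender)}"

print(definite("bil", "m"))                   # bilen
print(definite("dør", "f"))                   # døra
print(definite("hus", "n"))                   # huset
print(definite_with_adj("bil", "m", "røde"))  # den røde bilen
```

Note that the English learner has to internalize not just new forms but a new ordering: the definiteness marker trails the noun instead of leading it.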


View (!) from our rental house deck on Hardangerfjord

Phonology and Prosody

Norwegian is what’s called a pitch-accent language. There are roughly three categories of languages when it comes to stress and pitch. Here’s a super abridged breakdown [3]:

  1. Stress-accented languages

Stress (emphasis) is placed on a syllable in a word, or on a word in a phrase/sentence. This can create a difference in word meaning, but it doesn’t have to. Stress is a combination of loudness, length, and higher pitch.

  • Example languages: English, Czech, Finnish, Classical Arabic, Quechua, Italian
  • Example words/phrases [English]:
    • On a word in a sentence (no difference in meaning) – “I REALLY like your jacket”
    • On a syllable in a word (meaning difference) –

NOUNS vs. VERBS
REcord vs. reCORD
INcrease vs. inCREASE
PERmit vs. perMIT

  2. Pitch-accented languages

A syllable on a word/morpheme is accentuated by a particular pitch contour (instead of by stress). So only pitch is involved, not loudness or length. Distinct tonal patterns occur in words that otherwise look and sound the same, giving them different meanings.

  • Example languages: Norwegian, Swedish, Japanese, Turkish, Filipino, Yaqui (a Native American language)
  • Example words/phrases [Norwegian]:
    • Norwegian has two kinds of tonal accents or pitch patterns:

ACCENT 1 (ACUTE) and ACCENT 2 (GRAVE)

(Audio extracted from video by “Norwegian Teacher – Karin”)

hender – “hands” vs. hender – “happens”
ånden – “the spirit” vs. ånden – “the breath”
bønder – “farmers” vs. bønner – “beans”
været – “the weather” vs. være – “to be”

  3. Tonal languages

Each syllable of the language has an independent tone or pitch contour. Tones are used to distinguish between words (they create a difference in meaning between words that otherwise look and sound the same).

  • Example languages: Mandarin, Cantonese, Thai, Zulu, Navajo, Yucatec (a Mayan language)
  • Examples words/phrases [Mandarin]:
    • Tones combine with the syllable ma, resulting in different words:
  1. mā – “mother” [high level tone]
  2. má – “hemp” [mid pitch rising to high pitch]
  3. mǎ – “horse” [low with slight fall]
  4. mà – “scold” [short, sharply falling tone]
  5. ma – an interrogative particle [neutral, used on weak syllables]

 

The pitch-accent feature of Norwegian contributes to the language’s sing-song quality. Just listen to the melodiousness of Norway’s King Harald V as he gives a speech:

(Audio extracted from full NRK video)

Orthography

Norwegian writing uses the same Latin alphabet as English, except that it has three additional letters at the end – æ, ø, and å. I highly recommend (nay, insist) that you watch this ridiculous video to hear how the vowels are pronounced, and to be entertained in musically nerdy fashion. (Final note: Contrary to the video’s main argument, several letters – c, q, w, x, and z – are not actually used to spell native Norwegian words, although they do appear in loan words. One could therefore quibble that they shouldn’t count towards the alphabet size…)

[Image: the vowels æ, ø, and å]

 

 

[1] If you want to ogle some gorgeous macrophotography of lichens, scope out his Instagram, https://www.instagram.com/lichens_of_norway/.

[2] The ancient stave churches for some reason reminded me of True Blood (plus three of the show’s main characters, Eric, Pam, and Godric, were Swedish and Norwegian); also I was coincidentally reading The Vampire Lestat while we were there… but NO I’m not generally obsessed with vampires.

[3] This subject gets really complex. There are a lot more subtleties and distinctions than I make above.