Career interviews: Computational linguist for a virtual assistant


Wugs go to work

After much delay (eek! just realized it’s been a year!), I have another interview with a career linguist for your reading pleasure. [See the first interview here.] Even though I still get the “I’ve never met a real-live linguist” reaction when telling folks what I do, these days there are indeed people working full-time, earning living wages, as these specialized language nuts – and not all as professors in academia, or as translators/interpreters for the UN.

* * * * *

Just like with my last interviewee, I met Allan at Samsung Research America, where we worked together on Bixby, Samsung’s virtual voice assistant. On the Bixby linguist team, we worked with engineers, Quality Assurance (QA) testers and others to develop a personal assistant that would carry out thousands of different spoken user commands. Also like with my last interviewee, Allan is no longer at the job I interviewed him about. (He’s now a Language Engineer on Amazon’s Alexa!). I’m keeping questions and answers in present tense, however, because I feel like it.

Allan Schwade, a graduate student in linguistics, won the Humanities Division Dean's Award for his poster on the adaptation of Russian words by English speakers

  1. What kind of work do you do?

I’m a computational linguist, which means I create solutions for natural language processing problems using computers. More specifically, I work on the systems and machine learning models that enable your smart devices to understand you when you say “set an alarm for 7am” or “tell me the weather in Chicago”.

  2. Describe a typical day at your job.

I usually start the day by meeting with my manager. The lab I work in supports products in production and conducts research and development for smart devices. If there is an issue with a product in production, I’ll work with the team to solve the problem. Usually this involves curating the training data for the machine learning models – removing aberrant data from training sets or generating new data to support missing patterns. If nothing is on fire, there are usually several projects I’ll be working on at any given time. Projects generally start out with me doing a lot of reading on the state of the art; then I’ll reach a point where I’m confident enough to build a proof of concept (POC). While I’m creating the POC, the linguists will generate data for the models. Once the code and data are ready, I’ll build the models and keep iterating until performance is satisfactory. The only really dependable things in my schedule are lunch and a mid-afternoon coffee break with colleagues, both of which are indispensable.
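[A quick aside from me rather than Allan: if you’re wondering what “curating the training data” and “building a proof of concept” can look like in practice, here’s a minimal Python sketch. It is emphatically not Allan’s actual pipeline – the utterances, intent labels, and keyword filter are all invented for illustration.]

```python
# Toy sketch only – invented data and labels, not a real assistant's pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled utterances; real training sets hold many thousands.
data = [
    ("set an alarm for 7am", "alarm.create"),
    ("wake me up at six thirty", "alarm.create"),
    ("tell me the weather in Chicago", "weather.lookup"),
    ("what's the forecast for tomorrow", "weather.lookup"),
    ("asdf qwerty", "weather.lookup"),  # an "aberrant" example to weed out
]

# "Curating the training data": drop utterances that clearly don't fit their label.
KEYWORDS = {"alarm.create": ("alarm", "wake"), "weather.lookup": ("weather", "forecast")}

def looks_aberrant(utterance, label):
    return not any(word in utterance.lower() for word in KEYWORDS[label])

clean = [(u, lab) for u, lab in data if not looks_aberrant(u, lab)]
utterances = [u for u, _ in clean]
labels = [lab for _, lab in clean]

# A quick proof-of-concept model: bag-of-words features + logistic regression.
# "Iterating until performance is satisfactory" would mean evaluating on held-out
# data and going back to the data or features; that part is omitted here.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(utterances, labels)
print(model.predict(["set an alarm for noon"]))  # most likely ['alarm.create']
```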

  3. How does your linguistics background inform your current work?

My degree in linguistics is crucial for my current line of work. When building machine learning models, so much rests on the data you feed into your models. If your data set is diverse and representative of the problem, your model will be robust.

Having a linguistics background also gives me quick insight into data sets and how to balance them. Understanding the latent structures in the data allows me to engineer informative feature vectors for my models (feature vectors are derived from the utterances collected and are the true inputs to the machine learning model).
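[Another aside from me: “feature vector” can sound abstract, so here’s a toy Python example of hand-engineering a few features from a single utterance. These particular features are made up for illustration and aren’t the ones Allan’s team used.]

```python
import re

# Invented example features – just to make "feature vector" concrete.
def featurize(utterance):
    tokens = utterance.lower().split()
    return {
        "num_tokens": len(tokens),
        "has_time_expression": bool(re.search(r"\b\d{1,2}\s*(am|pm)\b", utterance.lower())),
        "starts_with_wh_word": tokens[0] in {"what", "when", "where", "who", "how"},
        "mentions_weather": any(t in {"weather", "forecast", "rain"} for t in tokens),
    }

print(featurize("set an alarm for 7am"))
# {'num_tokens': 5, 'has_time_expression': True, 'starts_with_wh_word': False, 'mentions_weather': False}
```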

  4. What do you enjoy most and/or least about the job?

I really enjoy getting to see the differences between human and machine learning. We have a pretty good idea of the types of things humans will attend to when learning language, but sometimes those things aren’t informative for machines. It can be frustrating when something I’d call “obvious” is useless in a model, and even more frustrating when something “marginal” is highly informative. But I never tire of the challenge; the satisfaction I feel at the end of a project is worth it.

The thing I enjoy least is data annotation. The process of doing it is indispensable because you become intimately familiar with the problem, but after a couple of hours of it my mind goes numb.

  5. What did you study in college and/or grad school?

I got my BA from Rutgers University and my MS from the University of California, Santa Cruz. Both degrees were in linguistics and both schools specialized in generative linguistic theory. I enjoyed a lot about the programs but they did a better job of preparing people for careers in academia than industry. Learning programming or common annotation tools and schemas before graduating would have made industry life easier for me.

  6. What is your favorite linguistic phenomenon?

Loanword adaptation! I wrote my master’s thesis on it. Seeing how unfamiliar phonemes are digested by speakers never fails to pique my interest. In general, I love it when stable systems are forced to reconcile things outside their realm of experience.

  7. (If you had the time) what language would you learn, and why?

As a phonetician I’d love to learn Georgian for its consonant clusters, Turkish for its morpho-phonology, Hmong for its tones, or ASL because it’s a completely different modality than what I specialized in. As a subjective entity who does things for personal enjoyment, I’d love to learn Japanese.

  8. Do you have any advice for young people looking to pursue a career in linguistics?

If you want to go into industry doing natural language processing, I cannot stress enough how important the ability to code is. It’s true that for annotation work you won’t usually need it, but if you want to be an annotation lead, the ability to write utility scripts will save you a lot of time. My own transition from annotator to computational linguist also came from showing basic coding competency – the engineers were too busy to work on some projects, so they threw the smaller ones my way. This brings me to my next piece of advice: always voice your interest to the people who can actually get you involved. Telling your co-worker you really want to work on a cool project will do next to nothing, but telling your manager or the project lead that you are interested may get you involved.

A Norwegian smörgåsbord


Okay, “smörgåsbord” is a Swedish borrowing, but close enough. It’s appropriate for this post, which will be a buffet of miscellaneous facts about the Norwegian language.

I became interested in and started learning Norwegian because my brother has been living in Oslo for the past several years, where he is getting his Ph.D. in lichenology.[1] My family and I traveled to visit him last summer. To characterize the country in a few words, I’d say Norway is – more iconically – Vikings, fjords, trolls, nature, Norse mythology, and – more personally – lichens, stellar black coffee, gross sweet brown cheese, overly-restricted booze-purchasing hours, part of my paternal ancestry, and vampires.[2]

Heddal stavkirke (stave church), built in the early 13th century

So what’s cool about Norwegian?

Dialects

First (as I mentioned in one of the recent dialect posts), Norwegian forms a dialect continuum with Swedish and Danish, languages with which it is, to a greater or lesser extent, mutually intelligible. These are Scandinavian or North Germanic languages, along with Icelandic and Faroese. My brother, who now has a decent command of Norwegian, says he can understand Swedish relatively well too, although Danish is harder. Have a listen to differences between Danish and Norwegian in this video.

However, there are also a staggering number of Norwegian dialects spread across Norway. People claim it’s often harder to understand someone from a different part of the country (for example, Oslo inhabitants vs. speakers of trøndersk, a group of sub-dialects in north-central Trøndelag county) than it is to understand a Swede speaking Swedish. Wikipedia corroborates: “Variations in grammar, syntax, vocabulary, and pronunciation cut across geographical boundaries and can create a distinct dialect at the level of farm clusters. Dialects are in some cases so dissimilar as to be unintelligible to unfamiliar listeners.”

There are two official standard forms for the written language, even if there is no standard for spoken Norwegian (since local dialects rule in most situations). Bokmål (literally “book tongue”) is used in the majority of publications, and Nynorsk (“new Norwegian”) in under 10% of written communication.

Lexicon and Morphology

Onto smaller language-y bits: words and morphemes. Norwegian is super fun because it is prone to extensive compounding (like German), and these compounds often break down into etymologically amusing or charming pieces. By this I mean that the component words reveal interesting (but usually sensible) semantic relationships with the larger compound. Let me give you some examples:

Norwegian compound    English word         Individual morphemes
fruktkjøtt            “pulp”               frukt (“fruit”) + kjøtt (“meat”)  ⇒  “fruit meat”
matbit                “snack”              mat (“food”) + bit (“bite”)  ⇒  “food bite”
sommerfugl            “butterfly”          sommer (“summer”) + fugl (“bird”)  ⇒  “summer bird”
morkake               “placenta”           mor (“mother”) + kake (“cake”)  ⇒  “mother cake”
verdensrommet         “(outer) space”      verden (“world”) + s (possessive) + rom (“room”) + et (“the”)  ⇒  “the room of the world”
sykehus               “hospital”           syke (“sick”) + hus (“house”)  ⇒  “sick house”
grønnsak              “vegetable”          grønn (“green”) + sak (“thing”)  ⇒  “green thing”
støvsuger             “vacuum cleaner”     støv (“dust”) + suger (“suck[er]”)  ⇒  “dust suck[er]”
flaggermus            “bat”                flagger (“flying”) + mus (“mouse”)  ⇒  “flying mouse”
piggsvin              “hedgehog”           pigg (“spike”) + svin (“pig”)  ⇒  “spike pig”

Morphosyntax 



Rest stop on the road back to Oslo. Rømmegraut is the Nynorsk word for a traditional porridge – kind of like cream of wheat, but sweeter and topped with butter.

One facet of Norwegian morphosyntax that was novel to me is the structure of its determiners. In English, both definite (“the”) and indefinite (“a / an”) articles are independent words that always precede their noun or noun phrase. So we have:

“the house”          “the big blue house”
“a house”             “a big blue house”

The same is true for the Romance languages I know about (French, Spanish, Italian), the other Germanic language I’m familiar with (German)… and it is simply not relevant for the Asian languages I’ve dabbled in (Japanese, Cantonese) because they lack articles entirely.

In Norwegian (as well as in Swedish and Danish), indefinite articles are, familiarly, the independent words which precede the noun, while definite articles are actually suffixes, which attach to the end of the noun they modify. What’s more – if you place something in front of the noun, like an adjective or a number, there’s another set of determiners to use, called demonstratives (in English: this, that, these, those). These precede the noun phrase (adjective/number + noun), where the noun already contains its definite suffix. Again, a table might help illustrate:

Norwegian (Bokmål) determiners

                          Masc. singular       Fem. singular        Neuter singular

Indefinite articles       en                   ei                   et
                          en sykkel            ei jente             et hus
                          “a bicycle”          “a girl”             “a house”

Definite articles         -en                  -a                   -et
(suffixes)                bilen                døra                 huset
                          “the car”            “the door”           “the house”

Demonstratives +          den                  den                  det
noun phrase               den røde bilen       den røde døra        det røde huset
                          “the red car”        “the red door”       “the red house”
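For fun, here’s a toy Python sketch of the suffix pattern in the table, using only the nouns shown above. Real Norwegian morphology has plenty of irregularities that this ignores, so treat it as an illustration of the pattern rather than a working morphological tool.

```python
# Toy illustration of the Bokmål definite-suffix pattern from the table above.
DEFINITE_SUFFIX = {"masc": "en", "fem": "a", "neut": "et"}

def definite(noun, gender):
    suffix = DEFINITE_SUFFIX[gender]
    # Drop a final -e before the vowel-initial suffix: jente -> jenta, not *jentea
    if noun.endswith("e"):
        noun = noun[:-1]
    return noun + suffix

print(definite("bil", "masc"))    # bilen  "the car"
print(definite("dør", "fem"))     # døra   "the door"
print(definite("hus", "neut"))    # huset  "the house"
print(definite("jente", "fem"))   # jenta  "the girl"
```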

Because Norwegian and English are closely related in their linguistic genealogy, a native English speaker may have less trouble learning Norwegian than, say, Taa (also known as !Xóõ, a southern African language with possibly the largest phoneme inventory in the world, including dozens of clicks) – but as the determiner situation here demonstrates, it’s still no piece of bløtkake.


View (!) from our rental house deck on Hardangerfjord

Phonology and Prosody

Norwegian is what’s called a pitch-accent language. There are roughly three categories of languages when it comes to stress and pitch. Here’s a super abridged breakdown [3]:

  1. Stress-accented languages

Stress (emphasis) is placed on a syllable in a word, or on a word in a phrase/sentence. This can create a difference in word meaning, but it doesn’t have to. Stress is a combination of loudness, length, and higher pitch.

  • Example languages: English, Czech, Finnish, Classical Arabic, Quechua, Italian
  • Example words/phrases [English]:
    • On a word in a sentence (no difference in meaning) – “I REALLY like your jacket”
    • On a syllable in a word (meaning difference) –

NOUNS vs. VERBS
REcord vs. reCORD
INcrease vs. inCREASE
PERmit vs. perMIT

  2. Pitch-accented languages

A syllable in a word/morpheme is accentuated by a particular pitch contour (instead of by stress). So only pitch is involved, not loudness or length. Distinct tonal patterns occur in words that otherwise look and sound the same, giving them different meanings.

  • Example languages: Norwegian, Swedish, Japanese, Turkish, Filipino, Yaqui (a Native American language)
  • Example words/phrases [Norwegian]:
    • Norwegian has two kinds of tonal accents or pitch patterns:

ACCENT 1 (ACUTE) and ACCENT 2 (GRAVE)

(Audio extracted from video by “Norwegian Teacher – Karin”)

hender – “hands” vs. hender – “happens”
ånden – “the spirit” vs. ånden – “the breath”
bønder – “farmers” vs. bønner – “beans”
været – “the weather” vs. være – “to be”

  3. Tonal languages

Each syllable of the language has an independent tone or pitch contour. Tones are used to distinguish between words (they create a difference in meaning between words that otherwise look and sound the same).

  • Example languages: Mandarin, Cantonese, Thai, Zulu, Navajo, Yucatec (a Mayan language)
  • Example words/phrases [Mandarin]:
    • Tones combine with the syllable ma, resulting in different words:
  1. mā – “mother” [high level tone]
  2. má – “hemp” [mid pitch rising to high pitch]
  3. mǎ – “horse” [low with slight fall]
  4. mà – “scold” [short, sharply falling tone]
  5. ma – an interrogative particle [neutral, used on weak syllables]

 

The pitch-accent feature of Norwegian contributes to the language’s sing-song quality. Just listen to the melodiousness of Norway’s King Harald V as he gives a speech:

(Audio extracted from full NRK video)

Orthography

Norwegian writing uses the same Latin alphabet as English, except that it has three additional letters at the end – æ, ø, and å. I highly recommend – nay, insist – that you watch this ridiculous video to hear how the vowels are pronounced, and to be entertained in musically nerdy fashion. (Final note: Contrary to the video’s main argument, several letters – c, q, w, x, and z – are not actually used to spell native Norwegian words, although they’re sometimes used in loan words. One could therefore quibble that they shouldn’t count towards the alphabet size…)


[1] If you want to ogle some gorgeous macrophotography of lichens, scope out his Instagram, https://www.instagram.com/lichens_of_norway/.

[2] The ancient stave churches for some reason reminded me of True Blood (plus three of the show’s main characters, Eric, Pam, and Godric, were Swedish and Norwegian); also I was coincidentally reading The Vampire Lestat while we were there… but NO I’m not generally obsessed with vampires.

[3] This subject gets really complex. There are a lot more subtleties and distinctions than I make above.

Read my lips: McGurk and speech perception


Think of a situation in which you might try to lip read. Would it be chatting with friends in a noisy bar? Or watching a muted TV? Maybe you were born (or became) deaf. In all of these scenarios, the channel for auditory input is severely obstructed or entirely absent. But what if I told you that lip reading isn’t limited to those contexts? The surprising truth is that we lip read, to some extent, during practically every spoken interaction where we can observe the speaker’s face.

This means that the visual information involved in speech perception and processing is way more crucial than we would imagine. For a demonstration, watch this video.[1]

As the video mentions, the phenomenon is called the McGurk effect. It’s named after one of the two researchers who first described it: psychologists Harry McGurk and John MacDonald reported the effect in their 1976 paper “Hearing Lips and Seeing Voices”. They found that when a person watches someone pronounce one sound, but hears the paired audio of another sound, they will perceive a third sound which lies somewhere in-between the visual and auditory cues.

More specifically, the paper details an experiment in which adults watched a video of a woman making the lip/mouth movements for ga, but the audio was dubbed with the syllable ba. Subjects reported hearing a third syllable, da. (When they watched the undubbed video, they correctly heard ga, and when they listened to the audio only, they correctly heard ba.)

How is da “in-between” ba and ga?

In phonetics, speech sounds, or phones (particularly consonants), are classified according to three main parameters: voicing, place of articulation, and manner of articulation. Voicing is whether your vocal folds vibrate when producing the sound – if they vibrate, the sound is voiced; if they don’t, it’s voiceless. Place of articulation deals with how and where your mouth (mostly tongue and lips) moves to constrict or obstruct the vocal tract, creating different consonants. Manner of articulation concerns the way in which airflow passes through or is blocked by the articulators (i.e. mouth/tongue/lips) when they’ve taken a given shape.

Identifying such parameters for the ba / da / ga example in the 1976 paper:

  • [Voicing] /b/, /d/, and /g/ are all VOICED consonants
    • (Try putting fingers to your Adam’s apple while saying ba, da, and ga, and you should feel the vibration of your vocal folds)
  • [Manner of articulation] /b/, /d/, and /g/ are all stops or plosives – meaning airflow is completely blocked, with no air escaping through the mouth or nose
  • [Place of articulation]
    • /b/ is a bilabial – formed with both lips
    • /d/ is an alveolar – formed when the tongue tip touches the alveolar ridge (behind the upper front teeth)
    • /g/ is a velar – formed when the tongue body arches towards the velum or soft palate (upper back of the mouth)

Here’s a little sketch of the vocal tract to help visualize.

Articulators and places of articulation


We can see that the three consonants in ba, da, and ga share voicing and articulatory manner, and only differ in their place of articulation. Interestingly, the “intermediate” consonant that subjects perceived, /d/, falls almost exactly in-between /b/ (the audio consonant) and /g/ (the lip-read consonant) in its place of articulation – /b/ is formed at the front of the mouth, /g/ is formed at the back, and /d/ is formed in the middle. Our brain is combining the conflicting auditory and visual information, and perceiving a sound lying between the two cues. The combination isn’t always this exact, but the perceived phone will usually have some features in common with both sensory inputs.
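If you like seeing that reasoning spelled out, here’s a tiny Python sketch that treats /b/, /d/, and /g/ as bundles of the features described above and checks which one lies “between” the audio and visual cues. The feature values come straight from the classification above; the front-to-back ordering of places is just a convenient way to express “in between”, not a claim about how the brain actually computes.

```python
# Feature bundles for /b/, /d/, /g/, taken from the classification above.
# Places are ordered front-to-back so "in between" is easy to check.
PLACE_ORDER = {"bilabial": 0, "alveolar": 1, "velar": 2}

CONSONANTS = {
    "b": {"voiced": True, "manner": "stop", "place": "bilabial"},
    "d": {"voiced": True, "manner": "stop", "place": "alveolar"},
    "g": {"voiced": True, "manner": "stop", "place": "velar"},
}

heard, seen = CONSONANTS["b"], CONSONANTS["g"]   # audio cue /b/, visual (lip-read) cue /g/

for symbol, feats in CONSONANTS.items():
    in_between = (
        feats["voiced"] == heard["voiced"] == seen["voiced"]
        and feats["manner"] == heard["manner"] == seen["manner"]
        and PLACE_ORDER[heard["place"]] < PLACE_ORDER[feats["place"]] < PLACE_ORDER[seen["place"]]
    )
    print(f"/{symbol}/ lies between the two cues: {in_between}")

# Only /d/ comes out True: it shares voicing and manner with both cues, and its
# place of articulation sits between bilabial (front) and velar (back).
```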

The McGurk effect is strong and, as far as we know, universal. It happens even when the voice and face are in different locations, or of different genders! All people tested, regardless of their language background, have demonstrated this speech integration behavior. However, the precise kinds of synthesis (as well as the effect’s intensity) vary by language. The phenomenon is more apparent in German, English, Spanish, and Turkish listeners than in Chinese and Japanese participants, for example. The particular syllabic configuration of Japanese, and cultural norms (like face avoidance), may mean that Japanese listeners don’t rely on visual evidence of speech to the same extent. Loud environments heighten the effect for everyone, though.

The overall robustness of the McGurk effect indicates just how automatically and unconsciously we combine auditory and visual stimuli when perceiving speech. A slew of related studies demonstrate that our brain really functions on constant multisensory (or multimodal) data, and is not limited to the sense we take as primary for a particular type of input.

So what other perceptual abilities might be a mix of stuff from several senses? If you have some guesses and are curious to know more, I recommend See What I’m Saying, by Lawrence D. Rosenblum. (In fact, he was the guy in the video above!) Like McGurk and MacDonald, he’s not quite a linguist, but we can forgive him.

 

[1] The video says you should have perceived the sound fa during the second clip of the man, but I actually get a kind of va / fa mix, which makes sense – va is intermediate in that it’s voiced like ba (the audio cue) but has a labiodental place of articulation, like fa (the visual cue).

*Photo attribution: page 241 Internal Ear

Foreign accent syndrome and speech understanding

Maybe you’ve heard about people who, due to a stroke or accident, can no longer speak or understand speech normally. Aphasia is a condition where damage (usually from stroke or head trauma) to particular regions of the brain causes a person to lose specific language production and comprehension abilities. Aphasia studies are remarkable, and have taught us much about how language works inside the brain. Recently, however, I learned of a strange condition kind of similar to, but not as debilitating as, aphasia: foreign accent syndrome.

Here’s the article: The curious case of people who can’t stop speaking in foreign accents.

You should read the full piece, but a few highlights follow.

“Foreign accent syndrome (FAS for short) is a real thing, though it’s very rare — fewer than 200 cases diagnosed since it was first described in 1907. It may sound like it’s just a delusion or fantasy, but fewer than 10 percent of cases have a psychological basis (for example, related to schizophrenia). Nearly all of the rest are of neurological origin: They are caused by damage or impairment to a specific area of the brain.”

“The short explanation is that FAS (possibly excepting the few psychological cases) is a disorder of speech planning and execution.”

“Naturally, when we hear speech tweaked on one or several of those parameters, we associate it with an accent that has a similar feature set. And if we hear an accent that sounds somehow different but we’re not sure, we make our best guess at what it is. The less familiar we are with an accent, the less we’re likely to notice things that aren’t quite right about it: Many Americans can’t tell Australian accents from English ones even though people from England and Australia have no trouble telling the two apart. Researchers studying cases of FAS have sometimes tested samples of the speech with listeners who didn’t know its origin to see if they could identify where it was from, and, unsurprisingly, opinions varied considerably. But when they tested “controls” — samples of native speakers with unimpaired speech from several countries — listeners couldn’t always identify their accents accurately either.”

Videos of people with FAS:

Texan woman with “British English” accent
English woman with “Chinese” accent

What does the author mean by feature set in the quote above? Each language has its own unique set of phonemes or word-distinguishing units of sound (called its phonemic inventory). This set is pulled from a larger set of all possible speech sounds. We recognize different languages and accents by these feature sets. And yet, the FAS article had me thinking about our general ineptitude when it comes to identifying accents. Because the speech of the person with FAS in the first video is close enough to a British English accent from an American English point of view, we lump it in that category…while to a native British English speaker, they’re noticeably dissimilar. Even in our native accent, speech can be hard to disambiguate. The recent viral “Laurel/Yanny” debate is a good example.

“Laurel”/“Yanny” makes you wonder how we understand each other at all. I’d say that pragmatics, i.e. context, plays a vital role. When communication is stripped of context, it’s significantly more difficult to interpret. My current work has made this very clear to me. We’re listening to and transcribing 2-to-4-second audio clips of users’ commands (and other speech) to their smartphone voice assistants. Knowing the contexts of “voice assistant-directed speech” and “things one might want to do on their phone” helps some. But beyond that, because the sound clips are so short and we’re not actually there interacting with the speaker as part of the conversation, it’s often a challenge to resolve (i.e. accurately transcribe) certain categories of words. Proper nouns are especially tricky: contacts’ names, product or app names, names of unfamiliar towns and musicians, et cetera. If we were face-to-face with the speaker, we’d have the entire conversational, situational, and interpersonal background at our disposal, which helps resolve such unknowns.

Successful communication must rely on many factors. And language doesn’t exist in a vacuum.