Of Kanji and Kana

The Japanese writing system, like other aspects of Japanese culture, is complicated and fascinating. Its three main character sets are a notorious struggle for second-language learners and young native speakers alike. While many tongues have what is called synchronic digraphia (where two or more writing systems for the same language coexist), Japanese is famous for having three main character sets within a single writing system.[1] Of interest to linguistics-minded folks, these three character sets systematically express different areas of the language’s grammar (word classes, for instance). Below is my attempt at a fun, informative introduction to the system.

The three main character sets of Japanese are kanji, hiragana, and katakana.

漢字 | KANJI

Kanji characters are logographic, meaning they cannot be spelled (sounded) out, but instead must be memorized whole. As many know, they were taken from the Chinese writing system. The term kanji literally means “Chinese characters”. If you’ve ever complained about the obtuse nature of English orthography, or remember the pain of memorizing weird word spellings as a child, consider this: a Japanese person of average education knows (i.e. has memorized) about three thousand kanji. Dictionaries contain about ten thousand kanji.[2]

Kanji are used for content words – nouns, verb stems, adjective stems, adverbs, personal names, and place names. They’re composed of radicals, graphical pieces that often have either a semantic or a phonetic quality (they indicate part of the character’s meaning or its sound, respectively). Each character has a prescribed stroke order, which everyone is expected to follow when writing. And as if all that weren’t enough of a challenge, most kanji also have two classes of readings – on’yomi (Sino-Japanese) and kun’yomi (native) – and which reading applies depends on context and the surrounding characters.
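For a concrete taste of the readings problem, here is a toy Python sketch. It’s my own illustration and a big simplification – the “compounds take on’yomi, standalone words take kun’yomi” pattern is only a tendency, with many exceptions – but the two characters’ readings themselves are standard:

```python
# Romanized readings of two common kanji.
READINGS = {
    "東": {"on": "tou", "kun": "higashi"},  # "east"
    "京": {"on": "kyou", "kun": "miyako"},  # "capital"
}

def read_compound(word: str) -> str:
    """Sino-Japanese compounds typically use each character's on'yomi."""
    return "".join(READINGS[ch]["on"] for ch in word)

print(read_compound("東京"))   # -> toukyou (Tokyo)
print(READINGS["東"]["kun"])  # -> higashi: 東 used as a standalone word
```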

Here are some examples of kanji:

東京 – Tokyo (place name)

長谷川 – Hasegawa (surname)

薔薇 – bara (a noun, means “rose”)

違う – chigau (a verb, means “to differ” or “to be wrong”. Only the first character, the verb stem, is kanji; the second character, the inflectional ending, is hiragana)

仮名 | KANA

Kana characters comprise the two sets hiragana and katakana. Both are phonetic, meaning they can be sounded out. Kana also originally came from Chinese, but the characters are so altered and simplified that their sources are no longer apparent. Japan adopted Chinese writing in the third century and ran into trouble, since the two spoken languages were completely unrelated. Japanese scribes began using characters not for their meanings but for their sound values alone. Both modern-day kana sets have an inventory of 46 characters (along with two types of diacritics), and each constitutes a syllabary[3] of consonant-vowel pairings.
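A geeky aside: Unicode stores the two kana sets in parallel blocks offset by 0x60 code points, mirroring the fact that they spell the same syllabary. So a toy hiragana-to-katakana converter is nearly a one-liner – a sketch only; a real transliterator would handle edge cases this ignores:

```python
def hira_to_kata(text: str) -> str:
    """Shift each hiragana code point (U+3041-U+3096) into the katakana block."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

print(hira_to_kata("ひらがな"))  # -> ヒラガナ
```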

ひらがな | HIRAGANA

Hiragana has rounded symbols, smooth curves. The hiragana syllabary is used for native words, and grammatical elements like particles, auxiliary verbs, and inflections (e.g. verb conjugations, noun suffixes). Japanese children’s books are mostly in hiragana since younger kids haven’t yet learned many kanji. When books do include kanji, they have small furigana by the side – hiragana or katakana to help with pronunciation.

Here are some examples of hiragana:

ありがとう – arigatou (“thank you”)

ください – kudasai (“please”)

です – desu (auxiliary verb, “is”)

の, は, を – no, wa, o (particles)

カタカナ | KATAKANA

With katakana, you’ll notice similarities to hiragana, but the symbol shapes are clearly more angular. Katakana is used for foreign names and words, loanwords, onomatopoeia, and emphasis.

Here are some examples of katakana:

アメリカ – amerika (foreign name, “America”)

サラリーマン – sarariiman (“salaryman”, i.e. office worker)

テレビ – terebi (loanword, “television”)

ニャンニャン – nyan nyan (onomatopoeia, sound of cat meowing)

* * * * *

The Japanese system has TWO directions for writing: vertical (tategaki), and horizontal (yokogaki). Vertical is the traditional form, running from top to bottom, right to left on the page. Books written with vertical text open the opposite way from Western language books. Horizontal is the direction Western language readers are used to – left to right on the page. This Western style is used in more modern applications, like websites. To maximize space, newspapers, magazines, and signs frequently use both directions![4] Then, because we still haven’t juggled enough variables, Japanese text doesn’t include spaces between words, so readers must infer based on context where divisions are to be made.
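To make the vertical layout concrete, here’s a toy Python renderer (purely my own illustration): it stacks characters into top-to-bottom columns and orders the columns right to left, padding with full-width spaces.

```python
def tategaki(text: str, height: int = 5) -> str:
    """Lay text out in traditional vertical columns, read right to left."""
    # Break the text into columns of `height` characters each.
    cols = [text[i:i + height] for i in range(0, len(text), height)]
    cols.reverse()  # the first column belongs on the right
    rows = ["　".join(col[r] if r < len(col) else "　" for col in cols)
            for r in range(height)]
    return "\n".join(rows)

print(tategaki("たてがきのれんしゅう"))
# れ　た
# ん　て
# し　が
# ゅ　き
# う　の
```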

Cool Japanese literature tangent: The Tale of Genji (源氏物語 – Genji Monogatari), written by noblewoman Lady Murasaki Shikibu in the early 11th century, is frequently considered the world’s first novel or first modern novel.

I’ll leave you with some marvelously idiosyncratic Japanese words and concepts, for which there are definitely no concise words/phrases in English. You can observe how the three character sets interact in various ways. (Most of the words come from this site).

Enjoy!

| Japanese | Pronunciation (in rōmaji) | Character set(s) | Definition | Literal meaning |
| --- | --- | --- | --- | --- |
| 教育ママ | kyouiku mama | kanji + katakana | A mother who is obsessed with her children’s education | |
| バーコード人 | baakoudo jin | katakana + kanji | Men with ridiculous comb-overs | “barcode people” |
| 横飯 | yoko meshi | kanji | Western food | “horizontal rice” |
| 侘寂 | wabi-sabi | kanji | An aesthetic that sees beauty in the ephemerality and imperfection of things both natural and manmade | |
| ぽかぽか | poka poka | hiragana | Feeling warm throughout one’s body | |
| 口寂しい | kuchi sabishii | kanji + hiragana | When you’re not hungry but you eat anyway | “mouth lonely” |
| 猫糞 | neko baba | kanji | To steal/pocket and pretend innocence | “cat feces” |
| ありがた迷惑 | arigata meiwaku | hiragana + kanji | “An act someone does for you that you didn’t want to have them do and tried to avoid having them do, but they went ahead anyway, determined to do you a favor, and then things went wrong and caused you a lot of trouble, yet in the end social conventions required you to express gratitude”[5] | |

 

[1] I say three “main” character sets because there are actually more, if you count Arabic numerals, rōmaji (i.e. the Roman alphabet), punctuation, etc. Also, this person argues that the focus on three-plus character sets in Japanese is silly, since English and other writing systems have multiple sets as well (capital and lowercase letters in English, for example); to keep things succinct here, I didn’t go into that level of detail. Additionally, I disagree with them: capital vs. lowercase Roman letters don’t carry the grammatical significance that kanji/hiragana/katakana do, so it isn’t an apples-to-apples comparison.

[2] https://nihongoichiban.com/2011/05/24/the-japanese-writing-system/

[3] Where each symbol represents a syllable.

[4] See this nice article with lots of illustrative pictures.

[5] https://sobadsogood.com/2012/04/28/25-words-that-simply-dont-exist-in-english/

Voynich: The manuscript that keeps on giving

The Voynich manuscript is one of those marvels that, even in these times of boundless knowledge and incredible technology, continues to elude all efforts to understand it.

Not heard of the thing? Welcome to the show. There has been a vigorous little dance of press coverage over the past couple years. It goes something like this:

Step to your left.  “An eternal mystery.”
Step to your right.  “I’ve cracked the code!” – some dude
Step back.  “Nope, you’re full of shit.”
Step forward.  “We’ve solved it this time for sure.” – some other dudes
Repeat.

The manuscript is a hand-written, illustrated codex that’s been shown through carbon dating to have originated in the early fifteenth century (1404–1438). The writing system used throughout its approximately 240 pages has yet to be identified.[1] Cryptographers, historians, computer scientists and others have proposed numerous hypotheses over the decades, including that it’s a hoax. Based on the illustrations, scholars divide the manuscript into five thematic sections: Herbal, Astrological, Biological, Pharmacological, and Recipes.

Below I list links to the (more recent) rhythmic pulse of “discoveries” and rejections, in chronological order. Under each link I’ve pulled out quotes of the more intriguing tidbits.

* * * * *

November 30, 2016: https://www.newyorker.com/books/page-turner/the-unsolvable-mysteries-of-the-voynich-manuscript

“The first half of the book is filled with drawings of plants; scholars call this the “herbal” section. None of the plants appear to be real, although they are made from the usual stuff (green leaves, roots, and so on […]). The next section contains circular diagrams of the kind often found in medieval zodiacal texts; scholars call this part “astrological,” which is generous. Next, the so-called “balneological” section shows “nude ladies,” in Clemens’s words, in pools of liquid, which are connected to one another via a strange system of tubular plumbing that often snakes around whole pages of text. […] Then we get what appear to be instructions in the practical use of those plants from the beginning of the book, followed by pages that look roughly like recipes.”

“The Voynich MS was an early attempt to construct an artificial or universal language of the a priori type.” – Friedman

* * * * *

September 8, 2017: https://arstechnica.com/science/2017/09/the-mysterious-voynich-manuscript-has-finally-been-decoded/

“Now, history researcher and television writer Nicholas Gibbs appears to have cracked the code, discovering that the book is actually a guide to women’s health that’s mostly plagiarized from other guides of the era.”

“Gibbs realized he was seeing a common form of medieval Latin abbreviations, often used in medical treatises about herbs. ‘From the herbarium incorporated into the Voynich manuscript, a standard pattern of abbreviations and ligatures emerged from each plant entry,’ he wrote. ‘The abbreviations correspond to the standard pattern of words used in the Herbarium Apuleius Platonicus – aq = aqua (water), dq = decoque / decoctio (decoction), con = confundo (mix), ris = radacis / radix (root), s aiij = seminis ana iij (3 grains each), etc.’ So this wasn’t a code at all; it was just shorthand. The text would have been very familiar to anyone at the time who was interested in medicine.”

“Gibbs concluded that it’s likely the Voynich Manuscript was a customized book, possibly created for one person, devoted mostly to women’s medicine.”

* * * * *

September 10, 2017: https://www.theatlantic.com/science/archive/2017/09/has-the-voynich-manuscript-really-been-solved/539310/

“This week, the venerable Times Literary Supplement published as its cover story a ‘solution’ for the Voynich manuscript. The article by Nicholas Gibbs suggests the manuscript is a medieval women’s-health manual copied from several older sources. And the cipher is no cipher at all, but simply abbreviations that, once decoded, turn out to be medicinal recipes.”

“’Frankly I’m a little surprised the TLS published it,’ says Lisa Fagin Davis, executive director of the Medieval Academy of America. When she was a doctoral student at Yale—whose Beinecke Library holds the Voynich manuscript—Davis read dozens of theories as part of her job. ‘If they had simply sent it to the Beinecke Library, they would have rebutted it in a heartbeat,’ she says.”

“In the second part—only two paragraphs long—Gibbs gets into the meat of his solution: Each character in the manuscript is an abbreviated word, not a letter. This could be a breakthrough, but the TLS presents only two lines decoded using Gibbs’s method. Davis did not find those two lines convincing either. ‘They’re not grammatically correct. It doesn’t result in Latin that makes sense,’ she says.”

* * * * *

February 1, 2018: https://www.atlasobscura.com/articles/voynich-manuscript-artificial-intelligence-solved

“There are two problems with this notoriously difficult puzzle—it’s written in code, and no one knows what language that code enciphers.”

“’That was surprising,’ Kondrak said, in a statement. ‘And just saying “this is Hebrew” is the first step. The next step is how do we decipher it.’ The scientists think the code used in the manuscript might have been created using alphagrams. (In standard alphagrams, the letters in a word are placed in alphabetical order—the alphagram of ‘alphagram,’ for example, is ‘aaaghlpmr.’) Vowels also seemed to have been dropped. These assumptions made, they tried to come up with an algorithm to decipher this scrambled Hebrew text, to striking effect. ‘It turned out that over 80 percent of the words were in a Hebrew dictionary,’ said Kondrak.”

“Hebrew-speaking data scientist Shlomo Argamon offered some excoriating feedback. ‘They are saying it looks more like Hebrew than other languages,’ he said. ‘In my opinion, that’s not necessarily saying all that much.’ The use of Google Translate, too, struck him as somewhat unscientific. […] Other scholars have raised doubts about the scientists’ use of modern, rather than medieval, Hebrew.”
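A side note for the programmatically curious: the alphagram-plus-dropped-vowels transformation described in the quote above is easy to sketch. Below is my own toy illustration, not the researchers’ code:

```python
def alphagram(word: str, drop_vowels: bool = False) -> str:
    """Sort a word's letters alphabetically, optionally dropping vowels first."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if drop_vowels:
        letters = [ch for ch in letters if ch not in "aeiou"]
    return "".join(sorted(letters))

print(alphagram("alphagram"))                    # -> aaaghlmpr
print(alphagram("alphagram", drop_vowels=True))  # -> ghlmpr
```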

* * * * *

Certain researchers have made a compelling case against the “hoax” hypothesis, in any event. In 2013, an interesting paper analyzed the Voynich manuscript from an information-theoretic perspective. Its authors examined the organizational structure that emerges from word distribution over the entire text, and concluded that there is a “presence of a genuine linguistic structure”.[2] You can read the full paper here.

A couple information theory takeaways:

  1. Highly informative content words occur much more irregularly (and in clusters) throughout a text, while uninformative function words tend toward a more homogeneous, uniform distribution. So it’s the content words that indicate specific text sections (see the sketch below).
  2. Words that are semantically related tend to co-occur in the same sections of a text.
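To make the first takeaway concrete, here is a minimal Python sketch – my own illustration, not the paper’s actual method (roughly speaking, Montemurro and Zanette compare word-distribution entropies against randomly shuffled versions of the text). It splits a token list into equal sections and computes the entropy of one word’s counts across them: evenly spread function words approach the maximum of log2(number of sections), while clustered content words score lower.

```python
import math
from collections import Counter

def section_entropy(tokens: list[str], word: str, n_sections: int = 8) -> float:
    """Entropy (bits) of `word`'s occurrence counts across equal text sections.

    Near log2(n_sections): spread evenly, function-word-like.
    Well below that: clustered, content-word-like.
    """
    size = max(1, len(tokens) // n_sections)
    counts = [Counter(tokens[i * size:(i + 1) * size])[word]
              for i in range(n_sections)]
    total = sum(counts)
    if total == 0:
        return float("nan")  # the word never occurs
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Usage: on ordinary English prose, section_entropy(tokens, "the") comes out
# near log2(8) = 3.0 bits, while a topical word clusters in fewer sections.
```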

 

Who will claim to have cracked the code next? My personal opinion, of course, is that they should throw some linguists on it.

 

[1] https://en.wikipedia.org/wiki/Voynich_manuscript

[2] Montemurro MA, Zanette DH. (2013). Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis. PLoS ONE 8(6): e66344, 5. https://doi.org/10.1371/journal.pone.0066344

On machine translation, the media, and meaning (a response)

I’m a Douglas Hofstadter fan. I read his book I Am a Strange Loop years ago, and it remains one of my three favorite non-fiction books, period. I highly recommend it to anyone who is at all interested in the nature of consciousness. The cognitive scientist’s Pulitzer Prize-winning Gödel, Escher, Bach: An Eternal Golden Braid has also been on my to-read list for a long time. So I was excited to see this article by him in The Atlantic, on another area that interests me: machine translation and machine “intelligence”.

Early on in the piece, Hofstadter says he has a “longstanding belief that it’s important to combat exaggerated claims about artificial intelligence”. Having worked in the machine learning/AI field for a little under a year now (but what an intense year it has been!), and having read countless popular media articles touting the astonishing advances in natural language processing/understanding, ML, and AI, I heartily agree with his sentiment. Such reporting is as misleading as it is annoying.

I came across a statement of this type the other day, in “Stanford AI researchers make ‘socially inclusive’ NLP”:

“The average person working with NLP today may consider language identification a solved problem.”

I have trouble believing that any researcher working in NLP/NLU/ML/AI thinks anything is a solved problem. Despite much progress, the field is still in its infancy. Doesn’t anyone remember Einstein’s quote (adapted from a similar idea expressed by Socrates) – “The more I learn, the more I realize how much I don’t know”? Where I work, every possible solution to a given problem brings up more questions, and even the “simplest” “facts” cannot always be taken for granted. (Remember when you were taught parts of speech like verb, noun, and preposition in grade school? Working at the level of detail we do, even these fundamental rules are often inadequate, requiring further specification. Turns out it’s hard to throw messy, real language into clean, fixed bins.) So I think the media does the field, its researchers, and the reading public a great disservice by sensationalizing and oversimplifying the challenges.

Hofstadter’s argument about understanding is even more poignant:

“The practical utility of Google Translate and similar technologies is undeniable, and probably it’s a good thing overall, but there is still something deeply lacking in the approach, which is conveyed by a single word: understanding. Machine translation has never focused on understanding language. Instead, the field has always tried to ‘decode’— to get away without worrying about what understanding and meaning are.”

We call the study of meaning and understanding semantics and pragmatics. People’s real-world knowledge plays a key role here as well. To my mind, meaning (only complete when tied to real-world knowledge) is the last frontier for AI and language. Today’s mobile/home voice assistants have definitely not yet mastered meaning. Technologies have made serious headway in resolving structural patterns (syntax), proper nouns (Named Entity Recognition), and some other aspects of language. But meaning, that great magical beast, eludes its pursuers. It is really, really challenging to computationally model the depth and complexity of human understanding. Because, although language itself is quite complicated, it’s still an impoverished medium for conveying the millions of subtle things we want and are able to convey – it relies heavily on context, implicature, presupposition, entailment, prosody, speaker-listener relationship, etc. I agree again with the author when he says that human-like machines are “not around the corner.”

I do think that Hofstadter, while rightly recognizing how hard the task of modeling meaning is, doesn’t give enough credit for what has already been accomplished. Google Translate is far better now than it was at its inception over ten years ago. I also think he glosses over the tool’s main usage – which is functional rather than artistic or poetic. Am I wrong to assume people use Translate much more for a quick word or phrase, when traveling or speaking on the fly, than for translating longer passages of literature? If they’re doing the latter… per the author’s experimental results, they clearly shouldn’t be.

 

What do you think – about media reportage of technology, machine “intelligence”, Hofstadter’s article? Feel free to comment!

Literally cray: A linguist’s attitude toward speech errors and slang

In a recent Lyft Line, it surfaced that the other rider in the car with me also had a linguistics background. Our driver was a non-native English speaker (from his accent maybe Russian) – although his English was pretty fluent. As he was deciding whether to make a left turn at a chaotic, construction-clogged intersection, he stuttered a bit and said, “well, it’s not not allowed”. Then, making the turn, he followed that with, “oh boy, and making these language mistakes with two linguists in the car…” The driver was assuming, as many do, that we would be more critical than the average person of said language “mistakes”.

First off, the driver’s statement wasn’t even a real speech error. Although slightly harder for us to process cognitively because of the two negatives, it’s not not allowed is in fact a perfectly grammatical sentence of English. One could say something similar that avoids the duplicated not – it’s not illegal, for example. But what’s going on here is this:

It’s [not [not [allowed]₁]₂]₃.

Between each opening and closing bracket is a structural unit, called a constituent in syntax. (The sentence as a whole is also a constituent, but I didn’t want to blind you with brackets.) So, allowed by itself is a constituent (subscript 1). The inner not negates allowed; together they’re a constituent (subscript 2). The outer not negates not allowed, and becomes a larger unit of its own (subscript 3). In the end, this structure has a very nuanced meaning – more nuanced than just it’s not illegal – which is something like, “this action is not necessarily encouraged and may even be frowned upon, but it’s not against the law”.
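If it helps to see the nesting explicitly, the bracketing can be rendered as a little tree structure – a purely illustrative Python sketch, with made-up labels (C1–C3 matching the subscripts above):

```python
# Each constituent is a (label, *children) tuple; nesting mirrors the brackets.
tree = ("S", "it's", ("C3", "not", ("C2", "not", ("C1", "allowed"))))

def show(node, depth=0):
    """Print the constituent tree, one level of nesting per indent."""
    if isinstance(node, tuple):
        label, *children = node
        print("  " * depth + label)
        for child in children:
            show(child, depth + 1)
    else:
        print("  " * depth + node)

show(tree)  # S > it's, C3 > not, C2 > not, C1 > allowed
```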

Second, even if the driver had made a speech error, linguists as a group are much less inclined to judge than the average person. There is a prevalent misconception that linguists and English teachers are siblings in a “grammar nazi” family.  This is untrue. Indeed, just as biologists thrill in discovering some new mutation in a species, linguists are generally delighted by speech errors and seek them out as important material to study; they give vital insights into how human language and the human brain function.

It shouldn’t come as a surprise, then, that a couple of my colleagues and I have had fun collecting both native and non-native English speech errors we’ve encountered over the past year. Here is a sample:

| Actual speech | Intended speech | Speaker’s native lang | Type of error |
| --- | --- | --- | --- |
| “thinking loudly” | “thinking out loud” | Farsi | Idiom |
| “cross the finger” | “fingers crossed” | Farsi | Idiom |
| “stepping over their toes” | “stepping on their toes” | Farsi | Idiom |
| “thank you for fast react” | “thank you for the fast reply/response” | Korean | Dropped definite article; wrong word |
| “confusication” | probably “confusion” or “miscommunication” | Hindi | Blend |
| “decrepit rules” | “deprecated rules” | English | Wrong word |
| “laids norm” | “Lord’s name” | English | Metathesis[1] |
| “my tights are hip” | “my hips are tight” | English | Metathesis |

 

Of major relevance to the topic of speech attitudes are two concepts, flip sides of the same coin: descriptivism and prescriptivism.

Descriptivism is an approach that attempts to objectively describe actual language usage, as well as speakers’ basic, intuitive linguistic knowledge. From several centuries of descriptive investigation, researchers have concluded that all languages and dialects are complex and rule-governed. No clearly superior or inferior languages or dialects exist.[2] The judgements we, as members of a society, have about a particular language or dialect are inextricably influenced by sociological factors.

Prescriptivism, on the other hand, is an approach that attempts to prescribe, subjectively, how language should be used. You are familiar with this from years of English and grammar classes and from style guides mandating rules for spoken and written language. What you may not know is that many of these rules are arbitrary, based on personal taste and accidents of history.

A few of the most common “rules” that persist today are actually confused misappropriations of Latin grammar into English, made by pompous old men playing king-of-the-intellectual-castle games. One example is the rule against preposition stranding, which dictates: do not separate a preposition from its noun, leaving it at the end of a clause. Say “To whom did you talk?” instead of “Who did you talk to?” The seventeenth-century poet John Dryden made this up (misapplying Latin, where preposition-like pieces attach to nouns and truly cannot be separated from them) in order to disparage the work of Ben Jonson. Other examples include the predicate nominative, split infinitives, and the count–mass noun distinction (less vs. fewer).

English teachers are not alone in their prescriptivist tendencies. People generally are rather opinionated about language. Certain “errors” even become so despised as to prompt real-world action. Take the word literally. A New York City bar now has signage banning its use and warns that offending customers will be kicked out. Countless online articles and forums bemoan the word’s ubiquity with the rationale that speakers are using it to mean its opposite (figuratively). A bit of history and context, however, lend perspective.

Literally has been used to mean figuratively – or, more precisely, as an intensifier – for over 300 years. Such literary greats as Charles Dickens, Mark Twain, and James Joyce (among others) have used it in this emphatic way. And the adverb’s paradoxical plight is shared by a whole cast of terms, known as auto-antonyms. Interestingly, none of the other English auto-antonyms get the attention that is lavished on “literally”.

Now that I’ve outlined descriptivism and prescriptivism, I would like to add two final clarifications. First, being a descriptivist does not mean throwing out the idea of spelling conventions, or tossing aside standard education. Linguists of course recognize the utility of teaching standardized writing and speaking for particular contexts (school, job, etc.) for purposes of clarity, versatility, and social mobility. Language is rich and its uses are necessarily multifaceted.

Second, none of the above means that specific words, expressions, or ways of speaking never make linguists cringe. (Enjoy that double negation?) We’re human, after all. Despite knowing the full historical and linguistic context of “literally”, I still grind my teeth hearing it many times in succession. I have other personal struggles with clippings (cray, totes, obvi) as well as with internet chat-cum-speech acronyms and initialisms (lol, idk, wtf, omg). Simultaneously, I view them as fascinating lexical-change phenomena. And I never take my individual tastes to mean that the language is somehow “degrading”. Languages don’t degrade; they change, and they have been changing ever since our ancestors began to talk. If not for such constant metamorphosis, we wouldn’t have the enormous linguistic diversity – the thousands of languages and dialects – that exists today.

 

[1] Where sounds, syllables, or words are switched.

[2] It has been an oft-repeated creed in linguistics over the last few decades to make the stronger claim that “all languages are equal”. However, the statement has not been scientifically proven, as researchers have not yet determined the precise criteria by which languages are to be measured, much less figured out how to measure and compare such enormous complexity. This thought-provoking topic will be the subject of at least one future post.