Voynich: The manuscript that keeps on giving


The Voynich manuscript is one of those marvels that, even in these times of boundless knowledge and incredible technology, eludes continual efforts to understand it.

Not heard of the thing? Welcome to the show. There has been a vigorous little dance of press coverage over the past couple years. It goes something like this:

Step to your left.  “An eternal mystery.”
Step to your right.  “I’ve cracked the code!” – some dude
Step back.  “Nope, you’re full of shit.”
Step forward.  “We’ve solved it this time for sure.” – some other dudes

The manuscript is a hand-written, illustrated codex that’s been shown through carbon dating to have originated in the early fifteenth century (1404–1438). The writing system used throughout its approximately 240 pages has yet to be identified.[1] Cryptographers, historians, computer scientists and others have proposed numerous hypotheses over the decades, including that it’s a hoax. Based on the illustrations, scholars divide the manuscript into five thematic sections: Herbal, Astrological, Biological, Pharmacological, and Recipes.

Below I list links to the (more recent) rhythmic pulse of “discoveries” and rejections, in chronological order. Under each link I’ve pulled out quotes of the more intriguing tidbits.

* * * * *

November 30, 2016: https://www.newyorker.com/books/page-turner/the-unsolvable-mysteries-of-the-voynich-manuscript

“The first half of the book is filled with drawings of plants; scholars call this the “herbal” section. None of the plants appear to be real, although they are made from the usual stuff (green leaves, roots, and so on […]). The next section contains circular diagrams of the kind often found in medieval zodiacal texts; scholars call this part “astrological,” which is generous. Next, the so-called “balneological” section shows “nude ladies,” in Clemens’s words, in pools of liquid, which are connected to one another via a strange system of tubular plumbing that often snakes around whole pages of text. […] Then we get what appear to be instructions in the practical use of those plants from the beginning of the book, followed by pages that look roughly like recipes.”

“The Voynich MS was an early attempt to construct an artificial or universal language of the a priori type. – Friedman.”

* * * * *

September 8, 2017: https://arstechnica.com/science/2017/09/the-mysterious-voynich-manuscript-has-finally-been-decoded/

“Now, history researcher and television writer Nicholas Gibbs appears to have cracked the code, discovering that the book is actually a guide to women’s health that’s mostly plagiarized from other guides of the era.”

“Gibbs realized he was seeing a common form of medieval Latin abbreviations, often used in medical treatises about herbs. ‘From the herbarium incorporated into the Voynich manuscript, a standard pattern of abbreviations and ligatures emerged from each plant entry,’ he wrote. ‘The abbreviations correspond to the standard pattern of words used in the Herbarium Apuleius Platonicus – aq = aqua (water), dq = decoque / decoctio (decoction), con = confundo (mix), ris = radacis / radix (root), s aiij = seminis ana iij (3 grains each), etc.’ So this wasn’t a code at all; it was just shorthand. The text would have been very familiar to anyone at the time who was interested in medicine.”

“Gibbs concluded that it’s likely the Voynich Manuscript was a customized book, possibly created for one person, devoted mostly to women’s medicine.”

* * * * *

September 10, 2017: https://www.theatlantic.com/science/archive/2017/09/has-the-voynich-manuscript-really-been-solved/539310/

“This week, the venerable Times Literary Supplement published as its cover story a ‘solution’ for the Voynich manuscript. The article by Nicholas Gibbs suggests the manuscript is a medieval women’s-health manual copied from several older sources. And the cipher is no cipher at all, but simply abbreviations that, once decoded, turn out to be medicinal recipes.”

“‘Frankly I’m a little surprised the TLS published it,’ says Lisa Fagin Davis, executive director of the Medieval Academy of America. When she was a doctoral student at Yale—whose Beinecke Library holds the Voynich manuscript—Davis read dozens of theories as part of her job. ‘If they had simply sent it to the Beinecke Library, they would have rebutted it in a heartbeat,’ she says.”

“In the second part—only two paragraphs long—Gibbs gets into the meat of his solution: Each character in the manuscript is an abbreviated word, not a letter. This could be a breakthrough, but the TLS presents only two lines decoded using Gibbs’s method. Davis did not find those two lines convincing either. ‘They’re not grammatically correct. It doesn’t result in Latin that makes sense,’ she says.”

* * * * *

February 1, 2018: https://www.atlasobscura.com/articles/voynich-manuscript-artificial-intelligence-solved

“There are two problems with this notoriously difficult puzzle—it’s written in code, and no one knows what language that code enciphers.”

“‘That was surprising,’ Kondrak said, in a statement. ‘And just saying “this is Hebrew” is the first step. The next step is how do we decipher it.’ The scientists think the code used in the manuscript might have been created using alphagrams. (In standard alphagrams, the letters in a word are placed in alphabetical order—the alphagram of ‘alphagram,’ for example, is ‘aaaghlmpr.’) Vowels also seemed to have been dropped. These assumptions made, they tried to come up with an algorithm to decipher this scrambled Hebrew text, to striking effect. ‘It turned out that over 80 percent of the words were in a Hebrew dictionary,’ said Kondrak.”

“Hebrew-speaking data scientist Shlomo Argamon offered some excoriating feedback. ‘They are saying it looks more like Hebrew than other languages,’ he said. ‘In my opinion, that’s not necessarily saying all that much.’ The use of Google Translate, too, struck him as somewhat unscientific. […] Other scholars have raised doubts about the scientists’ use of modern, rather than medieval, Hebrew.”
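To make the alphagram idea from the quotes above concrete, here is a minimal Python sketch. It is not the researchers' actual algorithm: the function names and the tiny English word list are hypothetical stand-ins I made up for illustration. It only shows why "sort the letters and drop the vowels" leaves plenty of room for multiple readings.

```python
VOWELS = set("aeiou")


def alphagram(word: str) -> str:
    """Return the letters of `word` in alphabetical order."""
    return "".join(sorted(word.lower()))


def drop_vowels(word: str) -> str:
    """Remove the vowels a, e, i, o, u from `word`."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def build_index(dictionary):
    """Map each vowel-less alphagram to every dictionary word that produces it."""
    index = {}
    for entry in dictionary:
        key = alphagram(drop_vowels(entry))
        index.setdefault(key, []).append(entry)
    return index


def candidates(scrambled: str, index):
    """Look up the dictionary words that a scrambled, vowel-less token could stand for."""
    return index.get(alphagram(scrambled), [])


# Hypothetical toy dictionary (English, purely for illustration).
toy_dictionary = ["table", "bleat", "built", "stone", "notes"]
toy_index = build_index(toy_dictionary)

print(alphagram("alphagram"))
print(candidates("tlb", toy_index))
```

In this toy setup the three-letter token "tlb" matches table, bleat, and built alike, which is a small-scale version of the skeptics' worry: the looser the decoding rules, the easier it is for a high share of tokens to land on some dictionary word.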

* * * * *

Certain researchers have made a compelling case against the “hoax” hypothesis, in any event. In 2013, Montemurro and Zanette analyzed the Voynich manuscript from an information-theory perspective. They looked at the organizational structure resulting from word distribution over the entire text, and concluded that there was “presence of a genuine linguistic structure”.[2] The full paper is available at the DOI link in that footnote.

A couple of information-theory takeaways:

  1. Highly informative content words occur much more irregularly (and in clusters) throughout a text, while more uninformative function words tend to have a more homogeneous or uniform distribution. So it’s the content words that indicate specific text sections.
  2. Words that are semantically related tend to co-occur in the same sections of a text.
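The first takeaway can be illustrated with a rough sketch. This is a simplified proxy of my own, not the measure used in the paper: it splits a text into equal-sized sections and computes the normalized entropy of where a given word falls. The function name and the toy "text" are assumptions made up for the example. A value near 1 means the word is spread evenly (function-word behavior); a value near 0 means it is clustered in one section (content-word behavior).

```python
import math
from collections import Counter


def section_entropy(tokens, word, n_sections):
    """Normalized entropy (0 to 1) of how `word` is spread over equal-sized sections."""
    size = math.ceil(len(tokens) / n_sections)
    # Count how many times the word falls into each section.
    counts = Counter(i // size for i, tok in enumerate(tokens) if tok == word)
    total = sum(counts.values())
    if total == 0 or n_sections < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Divide by the maximum possible entropy so results are comparable.
    return max(0.0, entropy) / math.log2(n_sections)


# Hypothetical toy text: four six-word "sections" with different topics.
toy_tokens = (
    "the herb and the herb root "
    "the star and the moon sign "
    "the bath and the pool tube "
    "the dose and the salt jar"
).split()

for w in ("the", "herb"):
    print(w, section_entropy(toy_tokens, w, n_sections=4))
```

Here "the" shows up equally in every section while "herb" appears only in the first, so the two words sit at opposite ends of the scale. Real texts are far messier, of course, and the paper compares against shuffled versions of the text rather than using raw entropy like this.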


Who will claim to have cracked the code next? My personal opinion, of course, is that they should throw some linguists on it.


[1] https://en.wikipedia.org/wiki/Voynich_manuscript

[2] Montemurro MA, Zanette DH. (2013). Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis. PLoS ONE 8(6): e66344, 5. https://doi.org/10.1371/journal.pone.0066344

On machine translation, the media, and meaning (a response)


I’m a Douglas Hofstadter fan. I read his book I Am a Strange Loop years ago, and it remains one of my three favorite non-fiction books, period. I highly recommend it to anyone who is at all interested in the nature of consciousness. The cognitive scientist’s Pulitzer Prize-winning Gödel, Escher, Bach: An Eternal Golden Braid has also been on my to-read list for a long time. So I was excited to see this article by him in The Atlantic, on another area that interests me: machine translation and machine “intelligence”.

Early on in the piece, Hofstadter says he has a “longstanding belief that it’s important to combat exaggerated claims about artificial intelligence”. Having worked in the machine learning/AI field for a little under a year now (but what an intense year it has been!), and having read countless popular media articles touting the astonishing advances in natural language processing/understanding, ML, and AI, I heartily agree with his sentiment. Such reporting is as misleading as it is annoying.

I came across a statement of this type the other day, in “Stanford AI researchers make ‘socially inclusive’ NLP”:

“The average person working with NLP today may consider language identification a solved problem.”

I have trouble believing that any researcher working in NLP/NLU/ML/AI thinks anything is a solved problem. Despite much progress, the field is still in its infancy. Doesn’t anyone remember Einstein’s quote (adapted from a similar idea expressed by Socrates) – “The more I learn, the more I realize how much I don’t know”? Where I work, every possible solution to a given problem brings up more questions, and even the “simplest” “facts” cannot always be taken for granted. (Remember when you were taught parts of speech like verb, noun, and preposition in grade school? Working at the level of detail we do, even these fundamental rules are often inadequate, requiring further specification. Turns out it’s hard to throw messy, real language into clean, fixed bins.) So I think the media does the field, its researchers, and the reading public a great disservice by sensationalizing and oversimplifying the challenges.

Hofstadter’s argument about understanding is even more poignant:

“The practical utility of Google Translate and similar technologies is undeniable, and probably it’s a good thing overall, but there is still something deeply lacking in the approach, which is conveyed by a single word: understanding. Machine translation has never focused on understanding language. Instead, the field has always tried to ‘decode’— to get away without worrying about what understanding and meaning are.”

We call the study of meaning and understanding semantics and pragmatics. People’s real world knowledge plays a key role here as well. To my mind, meaning (only complete when tied to real world knowledge) is the last frontier for AI and language. Today’s mobile/home voice assistants have definitely not yet mastered meaning. Technologies have made serious headway in resolving structural patterns (syntax), proper nouns (Named Entity Recognition) and some other aspects of language. But meaning, that great magical beast, eludes its pursuers. It is really, really challenging to computationally model the depth and complexity of human understanding. Because, although language itself is quite complicated, it’s still an impoverished medium for conveying the millions of subtle things we want and are able to convey – it relies heavily on context, implicature, presupposition, entailment, prosody, speaker-listener relationship, etc. I agree again with the author when he says that human-like machines are “not around the corner.”

I do think that Hofstadter seems to be simultaneously recognizing how hard the task of modeling meaning is, while not giving enough credit for things accomplished. Google Translate is way better now than it was at its inception over ten years ago. I also think he glosses over the tool’s main usage – which is more functional than artistic or poetic. Am I wrong to assume people use Translate much more for a quick word or phrase, when traveling or speaking on the fly, than for translating longer passages of literature? If they’re doing the latter… per the author’s experimental results, they clearly shouldn’t be.


What do you think – about media reportage of technology, machine “intelligence”, Hofstadter’s article? Feel free to comment!

What is linguistics, and what do linguists do?


I love patterns. They’re how we learn and evolve, and they’re everywhere.

Here’s a pattern for you.
When I tell someone new that I do linguistics, their response often goes like this:

Nod and/or smile and/or give small verbal acknowledgment.
Slight awkward pause.
“And what is linguistics again exactly?”[1]

People know that linguistics has to do with language, but beyond that, things get fuzzy. My goal with this post is to unfuzz (defuzz? disfuzz?) the basics of the field.

Most succinctly put, “Linguistics is the scientific study of language”. Like all sciences, linguistics is about patterns. Identifying them, analyzing them, making generalizations about them, making predictions (or hypotheses) from the generalizations, and then testing the predictions. What does that mean more specifically?

Well, what is language? Language is a conventionalized and arbitrary pairing of form and meaning. The form is usually sound, but it can also be gesture – in the case of sign language. There are many levels at which we might observe and analyze such form-meaning pairings, and these levels comprise the main subfields of the linguistics discipline. I’ll introduce each subfield through a couple of questions:

  1. How do our mouth, tongue, and throat produce consonants versus vowels? How do we segment a continuous stream of speech into words, so that we may understand it? How do we perceive sounds as belonging to our native language(s) versus other languages?

The study of speech sounds is Phonetics, and speech patterns, Phonology.

  2. What is going on when we add the prefix un- to the word happy, and the resulting word (unhappy) means the opposite of happy? How do words like steampunktoberfest, appletini, or totes come about? Why is the plural for cat cats, while the plural for mouse is mice?

The study of word structure and formation is Morphology.

  3. Why do we say the red car in English (with the adjective before the noun), when French has la voiture rouge and Spanish el coche rojo (both with the adjective after the noun)? Why is the interpretation of John saw the man with a telescope ambiguous?

The study of sentence structure is Syntax.

  4. How do we know that a poodle is a type of dog, or that if something is alive it cannot also be dead, or that Maddie plays the drums like a rock star must imply that Maddie plays the drums?

The study of meaning is Semantics.

  5. Why do we understand that it is annoying to say “Yes” (and take no subsequent action) in answer to your dinner partner’s question “Can you pass the salt”?

The study of discourse in context is Pragmatics.


Once we’ve discussed what linguistics is, the question that inevitably follows looks something akin to: where does studying language patterns get you in the real world? What do linguists actually do for a living? Until fairly recently, linguists were generally constrained to teaching and researching within academia. Many still do follow that path. However, in the last couple of decades, various industrial sectors have realized the necessity of employing people with serious language knowledge. Here is a short list of possible careers outside of academia for those with a linguistics background:

  • Computational Linguist (works on improving computers’ ability to “understand” and generate human language – often in machine learning contexts)
  • Conlanger for Movie/TV Industry (invents new languages based on attested linguistic principles)
  • Data Scientist (statistically analyzes large amounts of data to provide business insights)
  • Field Linguist/Researcher (documents endangered or dying languages – although often from a university position)
  • Forensic Linguist (analyzes legal and judicial language; provides linguistic evidence in legal proceedings)
  • Lexicographer (builds dictionaries)
  • Naming/Branding Consultant
  • Nonprofit sociolinguistic research
  • Second or Foreign Language Instructor
  • Speech-Language Pathologist (diagnoses and treats communication, voice, and swallowing disorders)
  • Translator & Interpreter

Here are a few cool examples of actual people using their linguistics training in the real world:

One of my acquaintances is an interactional sociolinguist at the FrameWorks Institute, a nonprofit organization that conducts research on sociopolitical and scientific topics like aging, criminal justice, and climate change. FrameWorks investigates the language used in talk about these subjects, and teaches ways of reframing each issue. The woman I know manages the Institute’s Learning Unit, where she organizes professional learning events for advocates who want to change particular social dialogues.

Another friend of mine is a Speech-Language Pathologist, or SLP. She works with veterans at the VA Hospital in San Francisco. Her patients have swallowing conditions, aphasia, and other disorders that interfere with speaking or understanding. The SLP path requires a master’s in Communicative Disorders/Speech-Language Pathology. Although it doesn’t require a degree in linguistics, my friend has this too, and she says that it has lent her a deeper understanding of the disorders she’s trying to treat, as well as the subtleties involved in clinician-patient communication.

David Peterson is neither a friend nor an acquaintance, although I wish he was one. He is a conlanger who created Dothraki and Valyrian for the HBO series “Game of Thrones”. Dothraki and Valyrian are not just random sets of made-up words. They are full languages, with their own phonology, morphology, and syntax. For example, to form a question in Dothraki – as in Hash yer ray tih zhavors chiorisi anni (“Have you seen my lady’s dragon?”) – one must include a word whose main purpose is to formulate questions, hash. English lacks a single separate word with just this function; instead, we use multifunctional auxiliary verbs like do, be and have, or rising intonation. French on the other hand does have a word with this unique function: est-ce que (subject-verb inversion and rising intonation are other possible strategies). Conlanging for film basically started in the eighties with Marc Okrand, the inventor of Vulcan and Klingon, used in the Star Trek movies. With sci-fi/fantasy shows becoming more and more involved these days, the opportunity for such constructed language work seems to be growing.

And then, take a watch of these videos. Anna Marie Trester, author of the Career Linguist blog, has interviewed and recorded multiple linguist folks (me among them!) working in different areas of industry.


I’d like to wrap up with some historical and contextual nuggets about the field.

Linguistics termed as such, and as its own independent discipline, is relatively new. It arose at the beginning of the twentieth century; the University of California (Berkeley) formed America’s first “Department of Linguistics” in 1901. Edward Sapir and Leonard Bloomfield were two prominent linguists early on. There was also structuralism or structural linguistics which dealt with signs, syntax, and other formal units of language. Main characters included Ferdinand de Saussure and Roman Jakobson. In the 1950s, Noam Chomsky devised his generative theory of language and Universal Grammar, and the discipline really took off. Chomsky is thus usually known as the “father of modern linguistics”.

Pre-twentieth century, philology (the study of ancient languages and texts), and then comparative philology (studies comparing European languages and language groups) existed from the Middle Ages through the 1800s. The first formal study of language comes from India; in the fifth century BC a man named Pāṇini categorized Sanskrit consonants and vowels, word classes like nouns and verbs, and other patterns.

One curious aspect of linguistics is that it has borrowed a good bit of terminology (and corresponding concepts) from biology. My brother is getting his PhD in lichenology, a little-known subfield of biology, and it’s super fun for us to chat about our respective fields because there’s an immediate overlap of understanding. For instance, linguistics uses the terms root, stem, tree to describe words and phrase structures. It adopts jargon like morphology, genealogy, diachronic, convergent and divergent evolution. A fascinating “language as organism”[2] metaphor appears frequently.

Lastly, linguistics is a small field. Even large university departments usually count no more than twenty to twenty-five graduate students at a time. Meeting another linguist randomly, outside of dedicated school or work contexts, is, for me at least, a rare treat. Meeting people who want to talk about language, however, is wonderfully common! And understandably so – it applies to us all. I hope my post has provided a sprinkling of insight into this universal human subject.


Please check back soon for upcoming content – planned posts include a linguist’s perspective on speech errors, an explanation of the nifty phenomenon of metathesis (where sounds, syllables, or words are switched around), and summaries of Japanese and Korean writing systems.


[1] Another frequent response is: “So how many languages do you speak?” See this great post addressing the topic: http://allthingslinguistic.com/post/48473292525/why-linguists-hate-being-asked-how-many-languages

[2] Janse, M., Verlinden, A., & Uhlenbeck, E.M. (1998). Productivity and Creativity: Studies in General and Descriptive Linguistics in Honor of E.M. Uhlenbeck. Trends in linguistics, 116, 197.