Can “made up” languages help computers translate real ones?


In the late 1880s, the Russian-born doctor L. L. Zamenhof created an easy-to-learn, politically neutral language aimed at transcending nationality and fostering peace between people who do not share a common tongue. Dubbed “Esperanto”, the artificial language still has a significant presence in over 100 countries today, with estimates of fluent Esperanto speakers ranging from 10,000 to 2 million worldwide[1].

The idea of communication via an artificial language, pursued by Zamenhof and others, lives on today in what are known as constructed languages, or “conlangs”. Hollywood, and the film industry in general, is particularly fond of conlangs, which offer a quick way of instilling a certain exoticism in even the most run-of-the-mill scenarios. What do the box-office hits Star Trek, Game of Thrones and Avatar have in common? They all have their own conlangs: the Klingon, Dothraki and Na'vi languages, which were specifically designed to enrich their respective universes with a touch of the strangely unknown. The sounds, vowels and consonants of these languages were chosen exactly as one would choose costumes and makeup to convey the brutality or the softness of these imaginary worlds. These constructed languages are in fact languages in the proper sense of the term: dialogues are written in them and linguistic theories apply to them, albeit without the complexity of our natural languages, since a conlang can be described in extenso. That very property means that conlangs are also an efficient way to help computers translate text into multiple languages, on the fly.

The challenges of machine learning

Despite 60 years of research in computational linguistics, computers still fail to grasp the meaning of the simplest of sentences, and ambiguity still plagues the most refined linguistic theories. We human beings resolve ambiguities with our understanding of the world and our capacity to put utterances back into their context. In contrast, even the most advanced techniques, such as machine learning, can only count words or phrases, relying on co-occurrences and word distances to make sense of texts and documents.

For instance, a very simple sentence such as “The dogs are loud” is correctly translated into French by a well-known translation site as “Les chiens sont bruyants”. However, if the sentence is modified to “The dogs are way too loud”, the same site yields “Les chiens sont beaucoup trop fort”, which means “The dogs are way too strong”: a rather surprising semantic drift from the original English sentence. Furthermore, the agreement between “fort” and “chiens” is lost in translation. Thus, even the most advanced translation systems can be disrupted by a few small modifications.

This is because texts are ambiguous, terribly ambiguous. Natural languages have evolved in a rather organic way, without any actual plan, hence this inextricable fabric with which our computer programs have so many difficulties. At each step, syntactic analyzers, or parsers, are confronted with ambiguous words and ambiguous constructions. The combinatorial nature of natural language is such that a program may end up with thousands of potential analyses for an ordinary sentence of twenty words. Machine-learning techniques have tried to reduce this complexity, weighting words and constructions to feed complex classifiers, but ambiguity cannot always be reduced to correlation. Parsers are also dumb: they analyze sentences one after the other without keeping track of past analyses.

What we need is a non-ambiguous representation that could be used to store not only previous analyses but also data from the real world: an intermediary structure that would be close to a human language yet have the properties of a computer language, a constructed natural language that could be compiled as a program; in other words, a conlang. John McCarthy, the man who coined the term “Artificial Intelligence”, had this idea back in 1976[2], when he proposed to solve natural language processing issues with what he called Artificial Natural Languages, another name for conlangs.
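To make the combinatorial problem concrete, here is a minimal, illustrative sketch (not part of any system described in this article) that uses the open-source NLTK toolkit: a toy grammar in which the classic prepositional-phrase attachment ambiguity already produces two competing analyses for a seven-word sentence.

    import nltk

    # A toy grammar: the prepositional phrase "with the telescope" can attach
    # either to the verb "saw" or to the noun phrase "the man".
    grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | NP PP | 'I'
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the'
    N  -> 'man' | 'telescope'
    V  -> 'saw'
    P  -> 'with'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "I saw the man with the telescope".split()

    # Prints two distinct trees: one analysis per attachment site.
    for tree in parser.parse(sentence):
        print(tree)

With a realistic grammar and a longer sentence, the number of such trees grows very quickly, which is exactly the explosion described above.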

Lingvata is one such language, designed to be free of all forms of ambiguity: an artificial language that can serve as an intermediary step for computers to translate, for instance, from one language into any other. Lingvata uses suffixes, each of which encodes a single part of speech, to avoid category ambiguities such as “drink”, which can be either a noun or a verb. Lingvata provides a unique ending for nouns, pronouns, adjectives, verbs, prepositions, determiners and adverbs, eliminating the risk of ambiguous interpretations and errors. Words are simply created by combining a semantic root[3] with one of these suffixes.

For instance, the root “parole” is related to speech.

  • paroleta, with the noun suffix “ta”, means “speech”
  • paroleiag, with the verb suffix “iag”, means “to speak”

Thanks to this simple mechanism, Lingvata can be enriched with as many words as necessary, without introducing any homonyms or too many synonyms.
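As a rough illustration of this word-formation mechanism, the Python sketch below builds words from a root and a part-of-speech suffix and recovers the category from the ending alone. Only the noun and verb suffixes (“ta” and “iag”) come from the examples above; the code itself is a hypothetical toy, not the actual Lingvata implementation.

    # Only the noun and verb suffixes are taken from the article; other
    # categories would have their own, equally unambiguous endings.
    POS_SUFFIXES = {
        "noun": "ta",
        "verb": "iag",
    }

    def make_word(root: str, pos: str) -> str:
        """Combine a semantic root with the suffix of the requested category."""
        return root + POS_SUFFIXES[pos]

    def category_of(word: str) -> str:
        """Recover the part of speech from the ending alone, with no ambiguity."""
        for pos, suffix in POS_SUFFIXES.items():
            if word.endswith(suffix):
                return pos
        raise ValueError(f"unknown suffix in {word!r}")

    print(make_word("parole", "noun"))   # paroleta  -> "speech"
    print(make_word("parole", "verb"))   # paroleiag -> "to speak"
    print(category_of("paroleiag"))      # verb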

Lingvata also provides a mechanism, modeled on Latin, to avoid syntactic ambiguity. In Latin, the role of the different words in a sentence is governed by their suffixes, or case markers. For instance, the sentence “domina rosam amat” means “the lady loves the rose.” The ending “am” indicates which element in the sentence is the direct object. If you shuffle the suffixes, you also change the meaning, as in “dominam rosa amat”, which now means “the rose loves the lady”. This is a very efficient way to encode syntax, as each case marker conveys only one possible syntactic interpretation. Lingvata therefore has case markers to indicate not only a direct object or an indirect object, as in Latin, but also specific combinations to encode verb and noun complements. These follow a strict word order that makes the whole syntactic process totally deterministic[4]: for instance, the verb always comes at the end of the sentence. Lingvata offers four different endings for case marking (versus six in Latin), which are shared by all categories:

  • Nominative: subject, no specific ending. Example: “paroleta”
  • Accusative: direct object. Ending is always ‘n’. Example: “paroletan”
  • Genitive: noun complement. Ending is always ‘s’. Example: “paroletas”
  • Dative: prepositional phrase. Ending is always ‘d’. Example: “paroletad”

Our previous Latin example, “the lady loves the rose”, would thus translate as “Dameta rosetan ameiag”.
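The following sketch shows how such case-driven analysis can be fully deterministic. It uses only the four endings listed above and the verb-final rule; it is a simplified toy for this one example, not the real Lingvata parser.

    # The three overt case endings from the list above; a word without one of
    # these endings is treated as the nominative subject.
    CASE_ENDINGS = {"n": "direct object", "s": "noun complement", "d": "prepositional phrase"}

    def analyse(sentence):
        """Deterministic role assignment: no search, no backtracking."""
        *arguments, verb = sentence.split()   # the verb is always last
        roles = []
        for word in arguments:
            if word[-1] in CASE_ENDINGS:
                roles.append((word[:-1], CASE_ENDINGS[word[-1]]))
            else:
                roles.append((word, "subject"))
        roles.append((verb, "verb"))
        return roles

    print(analyse("Dameta rosetan ameiag"))
    # [('Dameta', 'subject'), ('roseta', 'direct object'), ('ameiag', 'verb')]

Because each ending licenses exactly one interpretation, there is only ever one analysis per sentence.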

The grammar also provides mechanisms to handle clauses and conjunctions, but most sentences rely on the few rules above. As an experiment, we wrote a short text in Lingvata and checked whether we could automatically translate it into French and English. Thanks to the simplicity of the grammar, the analysis of a Lingvata sentence is straightforward: fewer than 50 rules cover all aspects of the language, which lets us translate each sentence into French or English within a few milliseconds on a basic computer. By comparison, the English grammar comprises more than 3,000 rules.

We have also developed a tool that takes a French sentence as input and translates it into Lingvata. Since the French analysis is not always reliable, we can then fix the errors in the Lingvata output and store the results in a file, which can subsequently be used to translate into any language for which we have a generator. This could be used, for example, for web-site content in multiple languages, where the original text would be written in Lingvata. The system would then translate the content on the fly into the user’s language, removing the need to maintain as many versions of the text as there are languages.
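To give a rough idea of what generation from Lingvata into another language might look like, here is a toy sketch that turns the example sentence into English by stripping the morphology, looking up the roots in a small hand-written lexicon and reordering the verb-final sentence into subject-verb-object. The lexicon and helper functions are hypothetical; a real generator would also handle agreement, tense, determiners and clauses.

    # Hypothetical root-to-English lexicon for the roots used in the example;
    # the roots and suffixes match the article, the lexicon itself is made up.
    LEXICON = {"dame": "lady", "rose": "rose", "ame": "love"}

    def root_of(word):
        """Strip the case ending and the part-of-speech suffix from a word."""
        if word[-1] in "nsd":              # accusative / genitive / dative marker
            word = word[:-1]
        for suffix in ("iag", "ta"):       # verb and noun suffixes
            if word.endswith(suffix):
                return word[: -len(suffix)]
        return word

    def to_english(sentence):
        *arguments, verb = sentence.lower().split()
        subject = next(w for w in arguments if w[-1] not in "nsd")
        direct_object = next(w for w in arguments if w.endswith("n"))
        # Lingvata is verb-final; English output is reordered to subject-verb-object.
        s, v, o = (LEXICON[root_of(w)] for w in (subject, verb, direct_object))
        return f"the {s} {v}s the {o}"

    print(to_english("Dameta rosetan ameiag"))   # the lady loves the rose

The same unambiguous Lingvata analysis could feed a different generator per target language, which is what makes the pivot idea attractive.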

In a certain way, the most compact way to store the semantic representation of a text is... the text itself. Linguists have long tried to formalize languages within strict mathematical frameworks, but languages have proved so elusive that most theories “leak”; that may be why we call them natural. Machine-learning techniques, despite their careful injection of hard science into the problem, did bring some improvement, but the best systems still fail to provide a precise and reliable analysis in too many cases. Today, with the advent of the internet, textual information is everywhere, yet the ambiguity and complexity of natural languages make it quite difficult to draw on these resources efficiently. By contrast, an Artificial Natural Language representation keeps the whole spectrum of linguistic data intact, with little or no loss of information. A paragraph or a sentence written in a conlang is a description as precise as any piece of text and, at the same time, the semantic encoding of that text: a symbolic representation that sits halfway between man and machine.


[1]
[2] McCarthy, J. (1976). An example for natural language understanding and the AI problems it raises. In Formalizing Common Sense: Papers by John McCarthy. Ablex Publishing Corporation, 355.
[3] The Esperanto language was the source of many of the semantic roots that we use in Lingvata. Since most of these roots have already been translated into many natural languages, this proved the most efficient way to bootstrap our own implementation of the Lingvata language.
[4] Word order, case markers and endings in Lingvata are of course arbitrary. One could design a completely different grammar that would still retain the same properties. However, while designing a conlang is great fun, it requires quite a lot of work and experimentation to achieve the right balance between simplicity, conciseness and expressiveness.


About the author: Claude Roux received his Ph.D. in syntactic parsing algorithms from the Université de Montréal (Canada) in 1996. This work was the basis of the Xerox Incremental Parser (XIP), which can be accessed on Xerox’s virtual lab, Open Xerox. His main interest lies in syntactic parsing and formal language theories. He is the creator of the Lingvata conlang.