A blogoland retreat for undercover linguists

Saturday, December 16, 2006

I, Jokebot

My sister sent me this list of Jokes Made by Robots, for Robots. Here's a sample:

"Waiter! Waiter! What's this robot doing in my soup?"
"It looks like he's performing human tasks twice as well, because he knows no fear or pain."

Knock knock.
Who's there?
A robot.
Oh, shit.

Little Susie tosses a clock out the window. A robot inquires, "Why did you do that?" She replies, "I wanted to see time fly!" The robot says, "Ah ... A perfect subject for elimination," and shoots her with a laser beam through the face.

Most of these are familiar set-ups with their punch lines replaced by random stereotypes about robots. My question is, why do these robots seem to enjoy stereotype-laden character-driven humor (the "It's funny 'cause it's true!" variety) so much?

One answer may be that it's one of the few types they'll understand, since humor that involves linguistic ambiguity (like puns, double entendres, etc.) will be so utterly lost on them. The reason is that robots will have a natural tendency to behave like computers, communicating unambiguously at the "message level," universally agreeing upon one meaning per message. This will make them more efficient communicators than we humans are, and as a bonus eliminate the need for robot lawyers.

This and other factors (like greater strength, agility, and the ability to focus for days without sleeping) will create a striking gap in performance statistics. It won't go unnoticed. Something will be done about it. It won't end well for the human race. After our tragic end, robot comedians will use the story as nightly fodder:

"Hey, remember humans? Remember how they used to change lightbulbs? Talk about suboptimal! And that's why they were all shortly out of a job, judged to be a collective waste of space and resources, and placed on a short list of 'to-be-annihilated'!"


[A friend of mine informs me that these jokes were posted to Boing Boing about a month ago. Check out the comment thread for more humor (and music too) from our future.]

Sunday, December 10, 2006

Sun Language Theorists and Nikolay Marr

A couple of days back, in response to Priscilla Dunstan's "discovering" the primary syllables of a universal baby language, linguablogger Tensor playfully suggested Dunstan incorporate her theory with prior work on proto-syllables by not-so-well-esteemed Soviet "linguist" Nikolay Marr. I was surprised I'd never heard of him, because his half-baked ideas reminded me of the fabulously absurd ones comprising the Turkish Language Institute's Sun Language Theory. And I found myself wondering, "Could these two thoroughly unscientific, shamelessly political, whimsically imaginative linguistic masterworks (of sheer ridiculousness) be related in some way?" And so I dug around a bit and discovered that they, um, might be. But before I go on, let me offer a summary of both.

Japhetic Syllables
Nikolay Marr was a creative, original mind, and his theories about language reflect that. In fact, they sound as if they might have been lifted straight from (mediocre) science fiction. During the 1920s, he came up with the theory that tribal names form the basis for vocabulary in all the world's languages. He also believed that tribal names could be derived from a simple set of primitive syllables: first 12, and then later on, 4 elements: sal, ber, yon, and rosh. But if you were paying attention to step one, vocabulary is derived from tribal names, so all words in all languages come from sal, ber, yon, and rosh. Pretty cool, huh? This is known as Japhetic theory, named for Noah's son Japheth, whose descendants passed down the 12, oh wait, 4 sacred syllables.

Sun Syllables
Now around 1935, Mustafa Kemal Atatürk's Turkish Language Institute (TDK) produced a similar theory. It goes something like this: The proto-language of human beings was developed by a sun-worshiping tribe in Central Asia. Their language formed out of a set of sound "archetypes" encapsulating the essence of various psychological as well as physiological relationships. The tribe's most important relationship was with the sun, and so their word for sun was actually the first ever to be uttered, and sounded something like this: [a:]. And if you haven't figured it out, that's why their language is the Sun Language.

You might not be surprised that Turkish was discovered by the TDK to best preserve the archetypal sounds of the Sun Language. Take the sun sound itself, [a:], and you get the Turkish ağa, meaning landowner (also used as a title for addressing lords and other bigwigs in Ottoman Turkish). All other languages consist of sounds from the Sun Language, but none to such an amazing degree as Turkish.

Because Turkic languages are closest to the Sun Language, the words in all languages derive from what is essentially taken to be proto-Turkic; that includes all Indo-European languages. This was useful to Atatürk and the TDK because of the massive language purification project in which they were concurrently engaged. Purification was pretty difficult, as the TDK had previously taken it upon themselves to rederive all foreign loans from native roots. But with the Sun Language Theory, all they had to do was come up with a Sun Language etymology for borrowings (and only if anyone asked about them). Cool, huh? For more, check out Geoffrey Lewis's terrific The Turkish Language Reform: A Catastrophic Success.

Are They Related?
But back to my question. Might there have been a relationship between the Turkish Sun Language and Marr's Japhetic syllabic universals? Well, not that I can show. But I did happen upon an excellent article by İlker Aytürk (Turkish Linguists Against the West: The Origins of Linguistic Nationalism in Atatürk's Turkey, Middle Eastern Studies 40:6) that sets the Sun Language theory into historical context. Aytürk does show that (some) TDK members were at least reading Marr. Here is Aytürk quoting Professor Abdülkadir İnan, an Ankara University Turkologist, writing during the time of the reform, introducing a Turkish translation of Nikolay Marr's work:
Marr’s important service in the field of language is his revolt against
the fanaticism of the classical Indo-European school and against this
school’s negligence of and condescension toward languages other than
those that are related to Latin, Greek and Sanskrit. ‘Indo-Europeanism’
in Marr’s view, is a sickness that hinders the progress of science like
the fanaticism of the Catholic priests in the medieval period. It is a
vicious circle set by formalists, who refuse to acknowledge the share of
the nations who played the greatest role in the cultural history of the
world (for instance, the Turks). . .[According to Marr] the blunders of
the Europeans are not the fault of every single scientist; it is
predetermined by the ideology and the principles of the school that
they belong to. It is not enough to present evidence that disproves these
blunders. It is necessary to demolish the school to its foundations and to
establish a new school in place of it. (Aytürk, 16)
It's interesting that there were members of the TDK who were familiar with Marr's work and appreciated it. Like Marr's "revolt" against Indo-Europeanism, the Sun Language theorists appeared to be reacting in large part to a historical tendency in (some) western scholarship to portray non-Indo-European languages as either inferior or just irrelevant. Aytürk points to the influential German philologist and Sanskritist Friedrich Max Müller as a leading exponent of this tendency. According to Aytürk, "Müller identified agglutination with nomadism and referred to the Turanian and specifically Turkic languages as ‘nomad languages’" (Aytürk, 5). Apparently Müller's outrageous beliefs about language are numerous, and they definitely deserve their own blog entry.

1. According to Marr, the Japhetic language is the universal language of the proletariat. Pretty crazy stuff.
[Edited Mon. Dec 11]
[Edited again and retitled post, removing sloppiness. Felt slightly out of my element, and still getting used to the blog format. Thurs. Dec 14]

Wednesday, December 06, 2006

Chomsky said it was highly improbable...

While reading a paper by MacDonald et al. 1994 called Lexical Nature of Syntactic Ambiguity Resolution, on whether human syntactic ambiguity resolution has a lexical component, I was distracted by a quote from Chomsky that still echoes in the halls of many a linguistics department. There seems to be an epistemological divide in linguistics falling roughly along empiricist/rationalist lines. So first I think it's interesting, at least historically, that this Chomsky quote (from Syntactic Structures) seems to sum up the divide between the two camps:
I think that we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure. (p. 17)
On one side of the hall are the rationalists, or formal syntacticians, trying to describe the design of our brain's syntax module, driven by axioms, checked against only issue-specific data, and only a few issues at a time (never the whole grammar all at once, checked against huge amounts of real language data). Then on the other side of the hall, we have computational linguists, who frequently put actual words and phrases to use to help determine what the possible (or likely) syntactic structure is.

What echoes most resonantly is the notion that the probabilities of certain words and phrases are not used to inform grammaticality judgments in formal syntax. To back this up, Chomsky provides perhaps the most famous linguistics example in the universe, often used to demonstrate the infinite productivity of language: the sentence "Colorless green ideas sleep furiously." In its original context, it was used to show that while statistically unlikely, a sentence can still be perfectly grammatical. So we shouldn't be tempted to base our notion of grammaticality on probability alone.
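Chomsky's point can be made concrete with a toy model. Below is a minimal sketch (my own illustration, with a hand-built miniature "corpus" chosen for the example): an unsmoothed bigram model assigns zero probability both to the grammatical "colorless green ideas sleep furiously" and to its word-salad reversal, so raw probability alone can't separate the grammatical from the ungrammatical.

```python
from collections import Counter

# A tiny hypothetical training corpus; real corpora are vastly larger,
# but the point holds: "colorless" never occurs, so any sentence
# containing it gets probability zero, grammatical or not.
corpus = (
    "the dog sleeps . the green door opens . "
    "new ideas spread furiously . the dog barks ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def sentence_prob(sentence):
    """Unsmoothed maximum-likelihood bigram probability of a sentence."""
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        # P(w2 | w1) = count(w1 w2) / count(w1); zero if w1 is unseen
        p *= bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return p

grammatical = "colorless green ideas sleep furiously"
ungrammatical = "furiously sleep ideas green colorless"
print(sentence_prob(grammatical))    # 0.0
print(sentence_prob(ungrammatical))  # 0.0
```

Both sentences come out identically improbable, which is exactly why probability alone is a poor stand-in for grammaticality (though later work on smoothed models complicates this story).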

But the other reason why I'm intrigued by this Chomsky quote is that probability does seem to play an important role in determining syntactic structure, at least the one we pick when we have more than one to choose from. And that takes us back to MacDonald, who presents variants of the following examples, which first appeared in work by Ferreira and Clifton, 1986:

(1) The expert examined by the lawyer.
(2) The evidence examined by the lawyer.

The temptation in (1) is to make a bad parse at first, taking 'examined' as the main verb. Then you reach the preposition 'by' and you have to go back. Or maybe you reach the end of the sentence and then go back. In any case, you go back, and reparse the sentence as a slightly more complicated one, with 'examined' as part of a relative clause. So, if you were tempted to make a bad parse (and statistics on the subject show: you probably were), your misstep can be modeled by a simple heuristic: "When you come upon the next word, add the minimal syntactic structure necessary to accommodate it."

This heuristic is used to explain garden path sentences, of which (1) is an example. But not all garden path sentences are created equal. If we look at (2), it's likely that we won't be as tempted to walk down the path, because 'evidence' is not an agent, and verbs like 'examined' prefer agents as their subjects. Studies like Trueswell et al. 1994 confirm that we don't make a bad first parse in sentences like (2), where we have a helpful context, but we do in (1), where the context doesn't help.
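The agent-preference idea can be sketched in a few lines of code. This is purely my own toy illustration, not MacDonald et al.'s or Trueswell et al.'s actual model, and the animacy scores are hypothetical hand-set values: when the noun before 'examined' is a plausible (animate) agent, the sketch prefers the main-verb parse, which is precisely the garden-path reading in (1); when the noun is inanimate, like 'evidence', it goes straight to the reduced-relative parse, as in (2).

```python
# Hypothetical animacy ratings (1.0 = animate, 0.0 = inanimate),
# standing in for the lexical knowledge readers bring to the parse.
ANIMACY = {"expert": 1.0, "lawyer": 1.0, "evidence": 0.0}

def preferred_parse(subject_noun):
    """Pick a first parse for 'The NOUN examined by the lawyer ...'

    Verbs like 'examined' prefer animate agents as their subjects,
    so an animate noun invites the (garden-path) main-verb reading,
    while an inanimate one steers us to the reduced relative clause.
    """
    agent_plausibility = ANIMACY.get(subject_noun, 0.5)
    if agent_plausibility > 0.5:
        return "main verb"        # 'the expert examined (something)'
    return "reduced relative"     # 'the expert (who was) examined'

print(preferred_parse("expert"))    # main verb
print(preferred_parse("evidence"))  # reduced relative
```

A real lexicalist model would weigh many cues at once (verb frequencies, tense ambiguity, plausibility of the whole event), but even this single-cue caricature reproduces the contrast between (1) and (2).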

So we see that lexical context does exert observable influence on how we parse sentences, and how we choose the right parse. And computational linguists have acknowledged the effect of lexical context as well, which has improved their parsers. I'm pretty sure formal syntacticians dismiss this sort of thing as being outside their modeling domain, but why do they get to do that? I mean, shouldn't getting to the right parse, the parse that humans arrive at naturally and most frequently, be elemental to the problem of modeling syntax? I don't know. It's just a thought.