While reading a paper by MacDonald et al. 1994, Lexical Nature of Syntactic Ambiguity Resolution, on whether human syntactic ambiguity resolution has a lexical component, I was distracted by a quote from Chomsky that still echoes in the halls of many a Linguistics Department. There seems to be an epistemological divide in Linguistics, falling roughly along empiricist/rationalist lines. So first, I think it's interesting, at least historically, that this Chomsky quote (from Syntactic Structures) seems to sum up the divide between the two camps:

"I think that we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure." (p. 17)

What echoes most resonantly is the notion that the probabilities of particular words and phrases are not used to inform grammaticality judgments in formal syntax. To back this up, Chomsky provides perhaps the most famous linguistics example in the universe, often used to demonstrate the infinite productivity of language: the sentence "Colorless green ideas sleep furiously." In its original context, it was used to show that while statistically unlikely, a sentence can still be perfectly grammatical. So we shouldn't be tempted to base our notion of grammaticality on probability alone.

On one side of the hall are the rationalists, or formal syntacticians, trying to describe the design of our brain's syntax module, driven by axioms, checked only against issue-specific data, and only a few issues at a time (never the whole grammar all at once, checked against huge amounts of real language data). Then on the other side of the hall, we have computational linguists, who frequently put actual words and phrases to use in determining what is a possible (or likely) syntactic structure.
But the other reason why I'm intrigued by this Chomsky quote is that probability does seem to play an important role in determining syntactic structure, at least the one we pick when we have more than one to choose from. And that takes us back to MacDonald, who presents variants of the following examples, which first appeared in work by Ferreira and Clifton, 1986:
(1) The expert examined by the lawyer.
(2) The evidence examined by the lawyer.
The temptation in (1) is to make a bad parse at first, taking 'examined' as the main verb. Then you reach the preposition 'by' and you have to go back. Or maybe you reach the end of the sentence and then go back. In any case, you go back and reparse the sentence as a slightly more complicated one, with 'examined' as part of a relative clause. So, if you were tempted to make a bad parse (and the statistics on the subject show you probably were), your misstep can be modeled by a simple heuristic: "When you come upon the next word, add the minimal syntactic structure necessary to accommodate it."
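The backtracking story above can be sketched in code. This is my own toy illustration, not MacDonald et al.'s model or any real parser; the function name and the crude role labels are invented. It parses word by word, takes the minimal attachment (main-verb reading) first, and counts how many times it is forced to go back and reanalyze.

```python
# Toy sketch of the minimal-attachment heuristic with reanalysis.
# All names and structure here are invented for illustration.

def parse_incrementally(words):
    """Parse a simplified NP-V-by-NP fragment one word at a time.

    Returns (parse, reanalyses): the final labeled parse and the
    number of times a garden path forced us to go back and relabel.
    """
    parse = []
    reanalyses = 0
    for word in words:
        if word == "examined":
            # Minimal structure: treat the verb as the main verb.
            parse.append(("main-verb", word))
        elif word == "by":
            # A 'by'-phrase after a main-verb reading cannot be
            # accommodated: reanalyze 'examined' as heading a
            # reduced relative clause, then attach the preposition.
            for i, (role, w) in enumerate(parse):
                if role == "main-verb":
                    parse[i] = ("relative-clause-verb", w)
                    reanalyses += 1
            parse.append(("prep", word))
        else:
            parse.append(("word", word))
    return parse, reanalyses

parse, reanalyses = parse_incrementally("the expert examined by the lawyer".split())
print(reanalyses)  # 1 -- the garden path forced one reanalysis
```

The point of the sketch is just that the heuristic is cheap and greedy: it never looks ahead, so sentence (1) guarantees one backtrack.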
This heuristic is used to explain garden path sentences, of which (1) is an example. But not all garden path sentences are created equal. If we look at (2), it's likely that we won't be as tempted to walk down the path, because 'evidence' is not a plausible agent, and verbs like 'examined' prefer agents as their subjects. Studies like Trueswell et al. 1994 confirm that we don't first make a bad parse in sentences like (2), where we have a helpful context, but we do in (1), where the context doesn't help.
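The lexical-constraint idea can be sketched as a competition between the two readings, weighted by how plausible the subject noun is as an agent of the verb. The probabilities below are invented for illustration, not estimates from any corpus, and the function is mine, not anything from the papers cited above.

```python
# Hedged sketch: parse preference as a lexically weighted competition.
# The agenthood numbers are made up for illustration only.

AGENT_PLAUSIBILITY = {   # invented P(noun is an agent of 'examine')
    "expert": 0.9,       # experts plausibly examine things
    "evidence": 0.05,    # evidence is examined, it doesn't examine
}

def preferred_parse(noun):
    """Pick the higher-scoring reading for 'The <noun> examined ...'."""
    p_agent = AGENT_PLAUSIBILITY.get(noun, 0.5)
    scores = {
        # The main-verb reading needs the noun to be an agent;
        # the reduced-relative reading treats it as a patient.
        "main-verb": p_agent,
        "reduced-relative": 1.0 - p_agent,
    }
    return max(scores, key=scores.get)

print(preferred_parse("expert"))    # main-verb -> garden path likely
print(preferred_parse("evidence"))  # reduced-relative -> no garden path
```

Under this toy scoring, (1) walks down the garden path and (2) doesn't, which is the pattern the eye-tracking studies report.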
So we see that lexical context does exert an observable influence on how we parse sentences, and how we choose the right parse. And computational linguists have acknowledged the effect of lexical context as well, which has improved their parsers. I'm pretty sure formal syntacticians dismiss this sort of thing as being outside their modeling domain, but why do they get to do that? I mean, shouldn't getting to the right parse, the parse that humans arrive at naturally and most frequently, be elemental to the problem of modeling syntax? I don't know. It's just a thought.