Abstract of Grammar, Uncertainty and Sentence Processing by John T. Hale. A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy. Baltimore, Maryland, August 2003.

INTRODUCTION, SIGNIFICANCE AND OUTLINE

Toward a probabilistic theory of human sentence processing, this dissertation proposes a definition of the computational work done in the course of analyzing sentences generated by formal grammars. It applies the idea of entropy from information theory to the set of derivations compatible with an initial substring of a sentence. Given a probabilistic grammar, this permits the set of such compatible derivations to be viewed as a random variable, and the change in derivational uncertainty from word to word to be calculated. This definition of computational work is examined as a cognitive model of human sentence processing difficulty.

To apply the model, a variety of existing syntactic proposals for English sentences are cast as probabilistic Generalized Phrase Structure Grammars (Gazdar et al., 1985) and probabilistic Minimalist Grammars (Stabler, 1997). It is shown that the amount of predicted processing effort in relative clauses correlates with the Accessibility Hierarchy of relativized grammatical relations (Keenan and Comrie, 1977) on a Kaynian (1994) view of relative clause structure. Results from three new sentence reading experiments confirm the findings of Keenan and Hawkins (1987) by demonstrating effects of the Accessibility Hierarchy (AH) on question-answering accuracy, but find only limited support for the AH in online reading times.

These results carry significance for several different fields. Among them:

o for computer science, a way to calculate the informational contribution of the i-th word in a grammatical sentence. Formerly (Lounsbury, 1954) this was only possible for finite languages, but in this dissertation the method is extended to infinite languages.
o for linguistics, a formalized processing hypothesis that permits the derivation of numerical behavioral predictions from current syntactic proposals. In particular, by applying the mildly context-sensitive Minimalist Grammars, processing implications of two competing transformational analyses of relative clauses can be examined in detail.

o for psycholinguistics, a new kind of cognitive explanation for sentence processing data that explicitly takes into account both categorical and probabilistic knowledge. This explanation is abstract in that it does not presuppose particular parsing operations or search strategies, but rather derives directly from the structure of a probabilistic grammar. This follows the competence hypothesis of Chomsky (1965) that our knowledge of language is directly used in comprehension.

These consequences all follow from the thesis that there exists an information-theoretic notion of disambiguation work on a probabilistic grammar which can be taken as a cognitive model of human sentence processing difficulty. The dissertation argues for this thesis primarily by example. For instance, chapter 2 considers the processing predictions derived from probabilistic context-free grammars for well-studied psycholinguistic phenomena: garden-pathing, center embedding and the subject-object extraction asymmetry. The predictions follow from probabilistic grammars written in the style of Generalized Phrase Structure Grammar (Gazdar et al., 1985). A procedure for calculating these predictions is presented, and the predictions are compared with existing experimental evidence, including on-line reading times.

Chapter 3 generalizes and refines these proposals with a different example. It presents another way of calculating behavioral predictions, and applies it to the more expressive Minimalist Grammars. This method leverages Lang's (1974, 1988) insight that an incomplete sentence specifies a related grammar of all possible grammatical continuations.
In this chapter, sentence-level difficulty predictions are compared with repetition accuracy results on the AH obtained by Keenan and Hawkins (1987). Predictions derived from a grammar incorporating Kayne's (1994) proposal about relative clause structure are found to correlate well with the empirical data, whereas predictions derived from a similar grammar employing the more standard adjunction analysis do not.

Chapter 4 discusses these and other linguistic proposals inherent in a Minimalist Grammar covering the various syntactic phenomena exemplified in Keenan and Hawkins' (1987) experimental stimuli. This discussion shows how the MG formalism can encode heterogeneous ideas from categorial, feature-based, and transformational traditions.

Chapter 5 presents the results of three human sentence understanding experiments that are motivated by the AH and the candidate theory of chapter 3. Among other results, it reports that relativizations from genitive are read more slowly than relativizations from non-genitive, as suggested by the AH. However, this difficulty occurs at the offset, rather than the onset, of the relative clause. This poses a challenge for any ``eager'' probabilistic sentence processing theory, such as Thibadeau, Just and Carpenter (1982), MacDonald (1994), Jurafsky (1996) and Hale (2001).

Chapter 6 situates the technical proposals of chapters 2 and 3 in a space of possible probabilistic sentence processing theories. Chapter 7 concludes that a new kind of grammar-processor relationship has been proposed. Appendix A documents the design and measures the performance of a parser for Minimalist Grammars written in the functional programming language ML.

THE PROPOSED NOTION OF DISAMBIGUATION WORK

The definition of disambiguation work offered in this dissertation is formalized as the reduction in the entropy of possible grammatical continuations brought about by lengthening a sentence fragment by one word.
This proposal is founded on the possibility of viewing the nonterminal symbols in probabilistic grammars as random variables. For instance, in the rules given below,

0.87 NP -> the boy
0.13 NP -> the tall boy

the nonterminal NP can be viewed as a random variable that has two alternative outcomes. Indeed, nonterminals generally in probabilistic context-free phrase structure grammars (PCFGs) can be viewed this way. Since their outcomes are discrete, their entropy is easily calculated.

H(NP) = -( 0.87 * log_2 0.87 + 0.13 * log_2 0.13 ) ≈ 0.56

There is just over half a bit of uncertainty about how NP is going to rewrite, because the outcome is so heavily weighted towards the first alternative. By applying a recursive relation due to Grenander (1967), the uncertainty about an entire language defined by a PCFG can be calculated -- this is simply the entropy of the start symbol. If this start symbol is S, then the entropy of all sentences in the (probabilistic) language is H(S).

Calculating the entropy of possible grammatical continuations requires somehow constraining this sum, H(S), to just the derivations that generate some observed initial substring; the constrained value shall be called H(S|w_1, w_2, ..., w_i). Only this latter value takes into account particular words that have ``been heard.'' Its calculation can be accomplished by calculating the unconditional entropy of a related grammar that describes all and only the grammatical continuations of the string w_1...w_i. Following Lang (1974, 1988) and Billot and Lang (1989), parsing can be viewed as the intersection of a regular language (of which a single string is a particularly simple example) with a context-free language. This view of parsing is depicted below, where L(G) is the language of the grammar G, and the string w is immediately followed by a dot, meaning any terminal symbol, repeated any number of times (notated by the Kleene star).
w(.)* intersect L(G)

The result of this intersection is a new context-free grammar describing just the legal derivations. By generalizing the input from a single string to a regular set of strings, the grammatical continuations can be captured in the new, output grammar. These grammars are easily read off of chart parsers' internal data structures by attaching position indices to nonterminal names, thus distinguishing recognized constituents in different positions. It is the start symbol of this new, resultant grammar whose entropy is H(S|w_1, w_2, ..., w_i). Any reduction observed as the string is lengthened to i+1 words constitutes disambiguation work that has been done, because the comprehender has ruled out possible syntactic analyses.

SOME RELEVANT LINGUISTIC STRUCTURES

The dissertation applies this theory to the empirical domain of relative clauses. Relative clauses constitute a fascinating puzzle for sentence processing theories at least in part because their non-canonical word order suggests a deeper, more grammatically-oriented kind of processing. Much work (Bever, 1970; Wanner and Maratsos, 1978; Gibson, 1998) has focused on these constructions, primarily on relativization from the grammatical relations Subject and Object. These are two points on a scale known as the Accessibility Hierarchy.

The Accessibility Hierarchy (AH) is a cross-linguistic generalization about relative clause formation in natural languages discovered by Keenan and Comrie (1977). The generalization is an implicational markedness hierarchy of grammatical relations that can be `relativized'.

SUBJECT > DIRECT OBJECT > INDIRECT OBJECT > OBLIQUE > GENITIVE > OCOMP

Figure 3.5: The Accessibility Hierarchy of relativizable grammatical relations

This hierarchy (figure 3.5) shows up in a variety of modern syntactic theories that have been influenced by Relational Grammar (Perlmutter and Postal, 1974).
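Stepping back to the formal proposal for a moment, the two computations described above -- the entropy of a nonterminal's rewrite distribution (extended to a whole PCFG by Grenander's recursion) and the prefix-conditioned entropy H(S|w_1, ..., w_i) -- can be sketched in Python. This is an illustrative sketch only: the toy grammar, its probabilities, and all function names are assumptions, and the prefix entropy is obtained here by enumerating a finite toy language rather than by the intersection-grammar construction the dissertation actually uses.

```python
import math

def rule_entropy(probs):
    """Shannon entropy (bits) of a nonterminal's rewrite distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The NP example from the text: 0.87 vs. 0.13 gives just over half a bit.
h_np = rule_entropy([0.87, 0.13])  # ~0.56

# Grenander's (1967) recursion: H(A) is the entropy of A's own rewrite
# choice plus the expected entropies of the nonterminals appearing on the
# chosen right-hand side.  For a consistent PCFG this system of equations
# can be solved by fixed-point iteration; H of the start symbol is then
# the entropy of the whole probabilistic language.
def grammar_entropy(rules, sweeps=100):
    """rules: {nonterminal: [(prob, rhs_symbols)]}; symbols absent from
    `rules` are terminals and contribute zero entropy."""
    H = {nt: 0.0 for nt in rules}
    for _ in range(sweeps):
        for nt, alts in rules.items():
            H[nt] = rule_entropy([p for p, _ in alts]) + sum(
                p * sum(H.get(sym, 0.0) for sym in rhs) for p, rhs in alts)
    return H

# Conditioning on a prefix.  For a toy *finite* language (probabilities
# invented for illustration) the conditional entropy can be had by direct
# enumeration over the complete sentences consistent with the prefix.
sentences = {
    ("the", "boy", "left"):         0.5,
    ("the", "boy", "slept"):        0.3,
    ("the", "tall", "boy", "left"): 0.2,
}

def prefix_entropy(prefix):
    """Entropy over complete sentences consistent with `prefix`."""
    consistent = {s: p for s, p in sentences.items()
                  if s[:len(prefix)] == prefix}
    z = sum(consistent.values())
    return rule_entropy([p / z for p in consistent.values()])

def entropy_reductions(sentence):
    """Per-word disambiguation work: the drop in continuation entropy
    (floored at zero) as each successive word is observed."""
    work, prev = [], prefix_entropy(())
    for i in range(1, len(sentence) + 1):
        h = prefix_entropy(sentence[:i])
        work.append(max(0.0, prev - h))
        prev = h
    return work
```

On this toy language, reading "the" rules out nothing (every sentence begins with it, so no work is done), while "boy" and then "left" each eliminate alternatives and so register measurable disambiguation work -- the word-by-word profile the dissertation compares against reading-time data.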
In Head-driven Phrase Structure Grammar (Pollard and Sag, 1994) the hierarchy corresponds to the order of elements on the SUBCAT list, and interacts with other principles in explanations of binding facts. The hierarchy also figures in Lexical-Functional Grammar (Bresnan, 1982b) where it is known as Syntactic Rank.

Keenan and Comrie (1977) speculated that their typological generalization might have a basis in performance factors. This idea was supported by the results of a psycholinguistic experiment done in 1974 that were not published until 1987. This experiment recorded repetition accuracy scores for repeating back stimulus sentences while under the additional memory load of a digit-memory task. Stimuli were subject-modifying relative clauses embedded in one of four carrier sentence frames such as "they had forgotten that..." Examples of the first type are given below.

subject extracted: they had forgotten that the boy who told the story was so young
direct object extracted: they had forgotten that the letter which Dick wrote yesterday was so long
indirect object extracted: they had forgotten that the man who Ann gave the present to was old
oblique extracted: they had forgotten that the box which Pat brought with apples in was lost
genitive subject extracted: they had forgotten that the girl whose friend bought the cake was waiting
genitive object extracted: they had forgotten that the man whose house Patrick bought was so ill

The results showed that repetition accuracy declined down the AH, tracking the frequency of the relativization type in the world's languages. Another study reported in Keenan (1987) confirms that this type frequency is also reflected in English token frequencies.

Toward an explanation of this result, two Minimalist Grammars were created, one expressing the Promotion Analysis of relative clauses recently defended by Kayne (1994) and Bianchi (1999) and the other expressing the more standard adjunction analysis (Chomsky, 1977).
These grammars encode their respective analyses either through complementation and successive movement to specifier in the Promotion Grammar (as shown schematically below)

[DP the [AgrD [CP I met [DP who [NP boy]]]]]
[DP the [AgrD [CP [DP who [NP boy]]_i [IP I met t_i]]]]
[DP the [AgrP boy [AgrD [CP [DP who t_NP]]_i [IP I met t_i]]]]
[DP AgrD+the [AgrP boy [t_Agr [CP [DP who t_NP]]_i [IP I met t_i]]]]
(Bianchi, 1999, 79)

or through adjunction of a WH-moved modifier in the Adjunction Grammar. Estimating probabilistic versions of these grammars from the token frequency data, their summed word-by-word entropy reductions were compared with the repetition accuracy results collected by Keenan and Hawkins (1987). Predictions derived from the Promotion grammar correlated significantly with these scores, whereas predictions derived from the Adjunction grammar did not reach statistical significance. Nor did significant correlations obtain between

- the log-probability of the sentence on the grammar and the grammatical relation, or
- the repetition accuracy results and predictions derived from an equiprobably-weighted grammar.

These results collectively suggest that both the structure of the grammar and its numerical parameterization are important for deriving particular patterns of human processing difficulty. They also suggest that the configurational representation of grammatical relations in MGs is sufficient to distinguish points on the AH at intermediate parser states.

NEW DATA ON HUMAN SENTENCE PROCESSING

The proposed notion of disambiguation work is a word-by-word processing metric for any parser that recognizes syntactic structure. As such it makes word-by-word predictions, which must be pooled for comparison with whole-sentence measurements like repetition accuracy scores. The dissertation also includes the results of three human sentence processing experiments examining individual implications of the (processing extension of the) AH.
Following Just, Carpenter, and Woolley (1982), word-by-word reading times were collected, which can be compared directly with the predictions of incremental theories such as entropy reduction. At this stage, however, it is quite clear that entropy reduction itself does not fully account for the pattern of results at this more detailed level. In fact, these results challenge any probabilistic sentence processing theory that assumes a principle of eagerness or immediacy by demonstrating a delayed effect of reading a marked (or infrequent) construction.

EXPERIMENT 1

The first experiment examines the role of genitivity in the processing of relative clauses. Does a relative clause's being extracted from a genitive context lead to increased processing difficulty over nongenitive subject- or object-extracted relative clauses, as predicted by the AH?

type  example
SU    The hairdresser's daughter, who insulted the beautician's sister, got in an accident.
DO    The beautician's sister, who the hairdresser's daughter insulted, got in an accident.
GenS  The hairdresser, whose daughter insulted the beautician's sister, got in an accident.
GenO  The beautician, whose sister the hairdresser's daughter insulted, got in an accident.

The empirical finding is that, indeed, this is so. This increased difficulty manifests itself in slower reading times at the main verb, after the entire relative clause has been read. ``Eager'' sentence processing theories like entropy reduction that predict a slowdown at the earliest point a construction can be identified cannot account for this apparent delay. Intuitively, this is because a uniquely-identified construction has no alternatives whose elimination requires information processing work.

EXPERIMENT 2

The second experiment looks at the directness of object-extraction. If relativization is from indirect object, as opposed to direct object, is comprehension more difficult?
type  example
SU    The secretary who sent the student to the administrator talked to the librarian.
DO    The student who the secretary sent to the administrator talked to the librarian.
IO    The administrator to whom the secretary sent the student talked to the librarian.

Although sensitive enough to replicate the well-known subject/object asymmetry, experiment 2, contra the AH, does not find a corresponding direct/indirect object asymmetry.

EXPERIMENT 3

The third experiment examines obliqueness. It compares the comprehension difficulty of relative clauses extracted from oblique as opposed to other grammatical relations.

type  example
SU    The officer who pacified the captor for the hostage held a knife.
DO    The captor who the officer pacified for the hostage held a knife.
OBL   The hostage for whom the officer pacified the captor held a knife.

Indeed, experiment 3 finds that extraction from oblique is significantly harder than extraction from direct object. Interestingly, as in experiment 1, the slowdown appears on the main verb, after the entire relative clause construction has been read.

CONCLUSION

The major conclusion of this work is that a notion of disambiguation work can be defined on probabilistic grammars. If these grammars are taken as models of human language competence, this definition can be used as a cognitive hypothesis.

References

Bever, Thomas G. 1970. The cognitive basis for linguistic structures. In J. R. Hayes, editor, Cognition and the Development of Language. Wiley, New York, pages 279-362.

Bianchi, Valentina. 1999. Consequences of Antisymmetry: headed relative clauses. Mouton de Gruyter.

Billot, Sylvie and Bernard Lang. 1989. The structure of shared forests in ambiguous parsing. In Proceedings of the 1989 Meeting of the Association for Computational Linguistics.

Bresnan, Joan, editor. 1982.
The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA.

Chomsky, Noam. 1977. On Wh-Movement. In Peter Culicover, Thomas Wasow, and Adrian Akmajian, editors, Formal Syntax. Academic Press, New York, pages 71-132.

Gazdar, Gerald, Ewan Klein, Geoffrey Pullum, and Ivan Sag. 1985. Generalized Phrase Structure Grammar. Harvard University Press, Cambridge, MA.

Gibson, Edward. 1998. Linguistic complexity: locality of syntactic dependencies. Cognition, 68:1-76.

Grenander, Ulf. 1967. Syntax-controlled probabilities. Technical report, Brown University Division of Applied Mathematics, Providence, RI.

Hale, John. 2001. A Probabilistic Earley Parser as a Psycholinguistic Model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Jurafsky, Daniel. 1996. A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20:137-194.

Kayne, Richard S. 1994. The Antisymmetry of Syntax. MIT Press.

Keenan, Edward L., editor. 1987. Universal Grammar: 15 Essays. Croom Helm, London.

Keenan, Edward L. and Bernard Comrie. 1977. Noun phrase accessibility and universal grammar. Linguistic Inquiry, 8(1):63-99.

Keenan, Edward L. and Sarah Hawkins. 1987. The psychological validity of the Accessibility Hierarchy. In Edward L. Keenan, editor, Universal Grammar: 15 Essays, pages 60-85. Croom Helm, London.

Lang, Bernard. 1974. Deterministic techniques for efficient non-deterministic parsers. In J. Loeckx, editor, Proceedings of the 2nd Colloquium on Automata, Languages and Programming, number 14 in Springer Lecture Notes in Computer Science, pages 255-269, Saarbrücken.

Lang, Bernard. 1988. Parsing incomplete sentences. In Proceedings of the 12th International Conference on Computational Linguistics, pages 365-371.

Lounsbury, Floyd G. 1954. Transitional probability, linguistic structure and systems of habit-family hierarchies. In C. E. Osgood and T. A.
Sebeok, editors, Psycholinguistics: a survey of theory and research. Indiana University Press.

MacDonald, Maryellen C. 1994. Probabilistic constraints and syntactic ambiguity resolution. Language and Cognitive Processes, pages 157-201.

Perlmutter, David and Paul Postal. 1974. Lectures on Relational Grammar. LSA Linguistic Institute, UMass Amherst.

Pollard, Carl J. and Ivan A. Sag. 1994. Head-driven phrase structure grammar. University of Chicago Press, Chicago.

Stabler, Edward P. 1997. Derivational minimalism. In Christian Retoré, editor, Logical Aspects of Computational Linguistics, pages 68-95. Springer.

Thibadeau, Robert, Marcel A. Just, and Patricia Carpenter. 1982. A model of the time course and content of reading. Cognitive Science, 6:157-203.

Wanner, Eric and Michael Maratsos. 1978. An ATN approach to comprehension. In Morris Halle, Joan Bresnan, and George A. Miller, editors, Linguistic Theory and Psychological Reality. MIT Press, Cambridge, Massachusetts, chapter 3, pages 119-161.