Abstract of Grammar, Uncertainty and Sentence Processing by John T. Hale. A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy. Baltimore, Maryland, August 2003.

INTRODUCTION, SIGNIFICANCE AND OUTLINE

Toward a probabilistic theory of human sentence processing, this dissertation proposes a definition of the computational work done in the course of analyzing sentences generated by formal grammars. It applies the idea of entropy from information theory to the set of derivations compatible with an initial substring of a sentence. Given a probabilistic grammar, this permits the set of such compatible derivations to be viewed as a random variable, and the change in derivational uncertainty from word to word to be calculated. This definition of computational work is examined as a cognitive model of human sentence processing difficulty.

To apply the model, a variety of existing syntactic proposals for English sentences are cast as probabilistic Generalized Phrase Structure Grammars (Gazdar et al., 1985) and probabilistic Minimalist Grammars (Stabler, 1997). It is shown that the amount of predicted processing effort in relative clauses correlates with the Accessibility Hierarchy of relativized grammatical relations (Keenan and Comrie, 1977) on a Kaynian (1994) view of relative clause structure. Results from three new sentence reading experiments confirm the findings of Keenan and Hawkins (1987) by demonstrating effects of the Accessibility Hierarchy (AH) on question-answering accuracy, but find only limited support for the AH in online reading times.

These results carry significance for several different fields. Among them:

o for computer science, a way to calculate the informational contribution of the i-th word in a grammatical sentence. Formerly (Lounsbury, 1954) this was only possible for finite languages, but in this dissertation the method is extended to infinite languages.
o for linguistics, a formalized processing hypothesis that permits the derivation of numerical behavioral predictions from current syntactic proposals. In particular, by applying the mildly context-sensitive Minimalist Grammars, processing implications of two competing transformational analyses of relative clauses can be examined in detail.

o for psycholinguistics, a new kind of cognitive explanation for sentence processing data that explicitly takes into account both categorical and probabilistic knowledge. This explanation is abstract in that it does not presuppose particular parsing operations or search strategies, but rather derives directly from the structure of a probabilistic grammar. This follows the competence hypothesis of Chomsky (1965) that our knowledge of language is directly used in comprehension.

These consequences all follow from the thesis that there exists an information-theoretic notion of disambiguation work on a probabilistic grammar which can be taken as a cognitive model of human sentence processing difficulty. The dissertation argues for this thesis primarily by example. For instance, chapter 2 considers the processing predictions derived from probabilistic context-free grammars for well-studied psycholinguistic phenomena: garden-pathing, center embedding and the subject-object extraction asymmetry. The predictions follow from probabilistic grammars written in the style of Generalized Phrase Structure Grammar (Gazdar et al., 1985). A procedure for calculating these predictions is presented, and the predictions are compared with existing experimental evidence, including on-line reading times.

Chapter 3 generalizes and refines these proposals with a different example. It presents another way of calculating behavioral predictions, and applies it to the more expressive Minimalist Grammars. This method leverages Lang's (1974, 1988) insight that an incomplete sentence specifies a related grammar of all possible grammatical continuations.
In this chapter, sentence-level difficulty predictions are compared with repetition accuracy results on the AH obtained by Keenan and Hawkins (1987). Predictions derived from a grammar incorporating Kayne's (1994) proposal about relative clause structure are found to correlate well with the empirical data, whereas predictions derived from a similar grammar employing the more standard adjunction analysis do not.

Chapter 4 discusses these and other linguistic proposals inherent in a Minimalist Grammar covering the various syntactic phenomena exemplified in Keenan and Hawkins' (1987) experimental stimuli. This discussion shows how the MG formalism can encode heterogeneous ideas from categorial, feature-based, and transformational traditions.

Chapter 5 presents the results of three human sentence understanding experiments that are motivated by the AH and the candidate theory of chapter 3. Among other results, it reports that relativizations from genitive are read more slowly than relativizations from non-genitive, as suggested by the AH. However, this difficulty occurs at the offset, rather than the onset, of the relative clause. This poses a challenge for any ``eager'' probabilistic sentence processing theory, such as Thibadeau, Just and Carpenter (1982), MacDonald (1994), Jurafsky (1996) and Hale (2001).

Chapter 6 situates the technical proposals of chapters 2 and 3 in a space of possible probabilistic sentence processing theories. Chapter 7 concludes that a new kind of grammar-processor relationship has been proposed. Appendix A documents the design and measures the performance of a parser for Minimalist Grammars written in the functional programming language ML.

THE PROPOSED NOTION OF DISAMBIGUATION WORK

The definition of disambiguation work offered in this dissertation is formalized as the reduction in the entropy of possible grammatical continuations brought about by lengthening a sentence fragment by one word.
This proposal is founded on the possibility of viewing the nonterminal symbols in probabilistic grammars as random variables. For instance, in the rules given below,

0.87 NP -> the boy
0.13 NP -> the tall boy

the nonterminal NP can be viewed as a random variable that has two alternative outcomes. Indeed, nonterminals generally in probabilistic context-free phrase structure grammars (PCFGs) can be viewed this way. Since their outcomes are discrete, their entropy is easily calculated.

H(NP) = -( 0.87 * log_2 0.87 + 0.13 * log_2 0.13 ) ≈ 0.56

There is just over half a bit of uncertainty about how NP is going to rewrite, because the outcome is so heavily weighted towards the first alternative. By applying a recursive relation due to Grenander (1967), the uncertainty about an entire language defined by a PCFG can be calculated -- this is simply the entropy of the start symbol. If this start symbol is S, then the entropy of all sentences in the (probabilistic) language is H(S).

Calculating the entropy of possible grammatical continuations requires somehow constraining this sum, H(S), to just the derivations that generate some observed initial substring; the constrained value shall be called H(S|w_1, w_2, ..., w_i). Only this latter value takes into account particular words that have ``been heard.'' Its calculation can be accomplished by calculating the unconditional entropy of a related grammar that describes all and only the grammatical continuations of the string w_1...w_i. Following Lang (1974, 1988) and Billot and Lang (1989), parsing can be viewed as the intersection of a regular language (of which a single string is a particularly simple example) with a context-free language. This view of parsing is depicted below, where L(G) is the language of the grammar G, and the string w is immediately followed by a dot, meaning any terminal symbol, repeated any number of times (notated by the Kleene star).
w(.)* intersect L(G)

The result of this intersection is a new context-free grammar describing just the legal derivations. By generalizing the input from a single string to a regular set of strings, the grammatical continuations can be captured in the new, output grammar. These grammars are easily read off of chart parsers' internal data structures by attaching position indices to nonterminal names, thus distinguishing recognized constituents in different positions. It is the start symbol of this new, resultant grammar whose entropy is H(S|w_1, w_2, ..., w_i). Any reduction observed as the string is lengthened to i+1 words constitutes disambiguation work that has been done, because the comprehender has ruled out possible syntactic analyses.

SOME RELEVANT LINGUISTIC STRUCTURES

The dissertation applies this theory to the empirical domain of relative clauses. Relative clauses constitute a fascinating puzzle for sentence processing theories at least in part because their non-canonical word order suggests a deeper, more grammatically-oriented kind of processing. Much work (Bever, 1970; Wanner and Maratsos, 1978; Gibson, 1998) has focused on these constructions, primarily on relativization from the grammatical relations Subject and Object. These are two points on a scale known as the Accessibility Hierarchy.

The Accessibility Hierarchy (AH) is a cross-linguistic generalization about relative clause formation in natural languages discovered by Keenan and Comrie (1977). The generalization is an implicational markedness hierarchy of grammatical relations that can be `relativized'.

SUBJECT > DIRECT OBJECT > INDIRECT OBJECT > OBLIQUE > GENITIVE > OCOMP

Figure 3.5: The Accessibility Hierarchy of relativizable grammatical relations

This hierarchy (figure 3.5) shows up in a variety of modern syntactic theories that have been influenced by Relational Grammar (Perlmutter and Postal, 1974).
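Stepping back to the formal proposal for a moment, the two computations described above -- the entropy of a nonterminal's rewrite distribution (extended to a whole PCFG by Grenander's recursion) and the prefix-conditioned entropy H(S|w_1, ..., w_i) -- can be sketched in Python. This is an illustrative sketch only: the toy grammar, its probabilities, and all function names are assumptions, and the prefix entropy is obtained here by enumerating a finite toy language rather than by the intersection-grammar construction the dissertation actually uses.

```python
import math

def rule_entropy(probs):
    """Shannon entropy (bits) of a nonterminal's rewrite distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The NP example from the text: 0.87 vs. 0.13 gives just over half a bit.
h_np = rule_entropy([0.87, 0.13])  # ~0.56

# Grenander's (1967) recursion: H(A) is the entropy of A's own rewrite
# choice plus the expected entropies of the nonterminals appearing on the
# chosen right-hand side.  For a consistent PCFG this system of equations
# can be solved by fixed-point iteration; H of the start symbol is then
# the entropy of the whole probabilistic language.
def grammar_entropy(rules, sweeps=100):
    """rules: {nonterminal: [(prob, rhs_symbols)]}; symbols absent from
    `rules` are terminals and contribute zero entropy."""
    H = {nt: 0.0 for nt in rules}
    for _ in range(sweeps):
        for nt, alts in rules.items():
            H[nt] = rule_entropy([p for p, _ in alts]) + sum(
                p * sum(H.get(sym, 0.0) for sym in rhs) for p, rhs in alts)
    return H

# Conditioning on a prefix.  For a toy *finite* language (probabilities
# invented for illustration) the conditional entropy can be had by direct
# enumeration over the complete sentences consistent with the prefix.
sentences = {
    ("the", "boy", "left"):         0.5,
    ("the", "boy", "slept"):        0.3,
    ("the", "tall", "boy", "left"): 0.2,
}

def prefix_entropy(prefix):
    """Entropy over complete sentences consistent with `prefix`."""
    consistent = {s: p for s, p in sentences.items()
                  if s[:len(prefix)] == prefix}
    z = sum(consistent.values())
    return rule_entropy([p / z for p in consistent.values()])

def entropy_reductions(sentence):
    """Per-word disambiguation work: the drop in continuation entropy
    (floored at zero) as each successive word is observed."""
    work, prev = [], prefix_entropy(())
    for i in range(1, len(sentence) + 1):
        h = prefix_entropy(sentence[:i])
        work.append(max(0.0, prev - h))
        prev = h
    return work
```

On this toy language, reading "the" rules out nothing (every sentence begins with it, so no work is done), while "boy" and then "left" each eliminate alternatives and so register measurable disambiguation work -- the word-by-word profile the dissertation compares against reading-time data.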
In Head-driven Phrase Structure Grammar (Pollard and Sag, 1994) the hierarchy corresponds to the order of elements on the SUBCAT list, and interacts with other principles in explanations of binding facts. The hierarchy also figures in Lexical-Functional Grammar (Bresnan, 1982b) where it is known as Syntactic Rank.

Keenan and Comrie (1977) speculated that their typological generalization might have a basis in performance factors. This idea was supported by the results of a psycholinguistic experiment done in 1974 that were not published until 1987. This experiment recorded repetition accuracy scores for repeating back stimulus sentences while under the additional memory load of a digit-memory task. Stimuli were subject-modifying relative clauses embedded in one of four carrier sentence frames such as "they had forgotten that..." Examples of the first type are given below.

subject extracted: they had forgotten that the boy who told the story was so young
direct object extracted: they had forgotten that the letter which Dick wrote yesterday was so long
indirect object extracted: they had forgotten that the man who Ann gave the present to was old
oblique extracted: they had forgotten that the box which Pat brought with apples in was lost
genitive subject extracted: they had forgotten that the girl whose friend bought the cake was waiting
genitive object extracted: they had forgotten that the man whose house Patrick bought was so ill

The results showed that repetition accuracy declined down the AH, tracking the frequency of the relativization type in the world's languages. Another study reported in Keenan (1987) confirms that this type frequency is also reflected in English token frequencies.

Toward an explanation of this result, two Minimalist Grammars were created, one expressing the Promotion Analysis of relative clauses recently defended by Kayne (1994) and Bianchi (1999) and the other expressing the more standard adjunction analysis (Chomsky, 1977).
These grammars encode their respective analyses either through complementation and successive movement to specifier in the Promotion Grammar (as shown schematically below)

[DP the [AgrD [CP I met [DP who [NP boy]]]]]
[DP the [AgrD [CP [DP who [NP boy]]_i [IP I met t_i]]]]
[DP the [AgrP boy [AgrD [CP [DP who t_NP]]_i [IP I met t_i]]]]
[DP AgrD+the [AgrP boy [t_Agr [CP [DP who t_NP]]_i [IP I met t_i]]]]
(Bianchi, 1999, 79)

or through adjunction of a WH-moved modifier in the Adjunction Grammar. Estimating probabilistic versions of these grammars from the token frequency data, their summed word-by-word entropy reductions were compared with the repetition accuracy results collected by Keenan and Hawkins (1987). Predictions derived from the Promotion grammar correlated significantly with these scores, whereas predictions derived from the Adjunction grammar did not reach statistical significance. Nor did significant correlations obtain between

- the log-probability of the sentence on the grammar and the grammatical relation, or
- the repetition accuracy results and predictions derived from an equiprobably-weighted grammar.

These results collectively suggest that both the structure of the grammar and its numerical parameterization are important for deriving particular patterns of human processing difficulty. They also suggest that the configurational representation of grammatical relations in MGs is sufficient to distinguish points on the AH at intermediate parser states.

NEW DATA ON HUMAN SENTENCE PROCESSING

The proposed notion of disambiguation work is a word-by-word processing metric for any parser that recognizes syntactic structure. As such it makes word-by-word predictions, which must be pooled for comparison with whole-sentence measurements like repetition accuracy scores. The dissertation also includes the results of three human sentence processing experiments examining individual implications of the (processing extension of the) AH.
Following Just, Carpenter, and Woolley (1982), word-by-word reading times were collected, which can be compared directly with the predictions of incremental theories such as entropy reduction. At this stage, however, it is quite clear that entropy reduction itself does not fully account for the pattern of results at this more detailed level. In fact, these results challenge any probabilistic sentence processing theory that assumes a principle of eagerness or immediacy by demonstrating a delayed effect of reading a marked (or infrequent) construction.

EXPERIMENT 1

The first experiment examines the role of genitivity in the processing of relative clauses. Does a relative clause's being extracted from a genitive context lead to increased processing difficulty over nongenitive subject- or object-extracted relative clauses, as predicted by the AH?

type  example
SU    The hairdresser's daughter, who insulted the beautician's sister, got in an accident.
DO    The beautician's sister, who the hairdresser's daughter insulted, got in an accident.
GenS  The hairdresser, whose daughter insulted the beautician's sister, got in an accident.
GenO  The beautician, whose sister the hairdresser's daughter insulted, got in an accident.

The empirical finding is that, indeed, this is so. This increased difficulty manifests itself in slower reading times at the main verb, after the entire relative clause has been read. ``Eager'' sentence processing theories like entropy reduction that predict a slowdown at the earliest point a construction can be identified cannot account for this apparent delay. Intuitively, this is because a uniquely-identified construction has no alternatives whose elimination requires information processing work.

EXPERIMENT 2

The second experiment looks at the directness of object-extraction. If relativization is from indirect object, as opposed to direct object, is comprehension more difficult?
type  example
SU    The secretary who sent the student to the administrator talked to the librarian.
DO    The student who the secretary sent to the administrator talked to the librarian.
IO    The administrator to whom the secretary sent the student talked to the librarian.

Although sensitive enough to replicate the well-known subject/object asymmetry, experiment 2, contra the AH, does not find a corresponding direct/indirect object asymmetry.

EXPERIMENT 3

The third experiment examines obliqueness. It compares the comprehension difficulty of relative clauses extracted from oblique as opposed to other grammatical relations.

type  example
SU    The officer who pacified the captor for the hostage held a knife.
DO    The captor who the officer pacified for the hostage held a knife.
OBL   The hostage for whom the officer pacified the captor held a knife.

Indeed, experiment 3 finds that extraction from oblique is significantly harder than extraction from direct object. Interestingly, as in experiment 1, the slowdown appears on the main verb, after the entire relative clause construction has been read.

CONCLUSION

The major conclusion of this work is that a notion of disambiguation work can be defined on probabilistic grammars. If these grammars are taken as models of human language competence, this definition can be used as a cognitive hypothesis.

References

Bever, Thomas G. 1970. The cognitive basis for linguistic structures. In J. R. Hayes, editor, Cognition and the Development of Language. Wiley, New York, pages 279-362.

Bianchi, Valentina. 1999. Consequences of Antisymmetry: headed relative clauses. Mouton de Gruyter.

Billot, Sylvie and Bernard Lang. 1989. The structure of shared forests in ambiguous parsing. In Proceedings of the 1989 Meeting of the Association for Computational Linguistics.

Bresnan, Joan, editor. 1982.
The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA.

Chomsky, Noam. 1977. On Wh-Movement. In Peter Culicover, Thomas Wasow, and Adrian Akmajian, editors, Formal Syntax. Academic Press, New York, pages 71-132.

Gazdar, Gerald, Ewan Klein, Geoffrey Pullum, and Ivan Sag. 1985. Generalized Phrase Structure Grammar. Harvard University Press, Cambridge, MA.

Gibson, Edward. 1998. Linguistic complexity: locality of syntactic dependencies. Cognition, 68:1-76.

Grenander, Ulf. 1967. Syntax-controlled probabilities. Technical report, Brown University Division of Applied Mathematics, Providence, RI.

Hale, John. 2001. A Probabilistic Earley Parser as a Psycholinguistic Model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Jurafsky, Daniel. 1996. A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20:137-194.

Kayne, Richard S. 1994. The Antisymmetry of Syntax. MIT Press.

Keenan, Edward L., editor. 1987. Universal Grammar: 15 Essays. Croom Helm, London.

Keenan, Edward L. and Bernard Comrie. 1977. Noun phrase accessibility and universal grammar. Linguistic Inquiry, 8(1):63-99.

Keenan, Edward L. and Sarah Hawkins. 1987. The psychological validity of the Accessibility Hierarchy. In Edward L. Keenan, editor, Universal Grammar: 15 Essays, pages 60-85. Croom Helm, London.

Lang, Bernard. 1974. Deterministic techniques for efficient non-deterministic parsers. In J. Loeckx, editor, Proceedings of the 2nd Colloquium on Automata, Languages and Programming, number 14 in Springer Lecture Notes in Computer Science, pages 255-269, Saarbrücken.

Lang, Bernard. 1988. Parsing incomplete sentences. In Proceedings of the 12th International Conference on Computational Linguistics, pages 365-371.

Lounsbury, Floyd G. 1954. Transitional probability, linguistic structure and systems of habit-family hierarchies. In C. E. Osgood and T. A.
Sebeok, editors, Psycholinguistics: a survey of theory and research. Indiana University Press.

MacDonald, Maryellen C. 1994. Probabilistic constraints and syntactic ambiguity resolution. Language and Cognitive Processes, pages 157-201.

Perlmutter, David and Paul Postal. 1974. Lectures on Relational Grammar. LSA Linguistic Institute, UMass Amherst.

Pollard, Carl J. and Ivan A. Sag. 1994. Head-driven phrase structure grammar. University of Chicago Press, Chicago.

Stabler, Edward P. 1997. Derivational minimalism. In Christian Retoré, editor, Logical Aspects of Computational Linguistics, pages 68-95. Springer.

Thibadeau, Robert, Marcel A. Just, and Patricia Carpenter. 1982. A model of the time course and content of reading. Cognitive Science, 6:157-203.

Wanner, Eric and Michael Maratsos. 1978. An ATN approach to comprehension. In Morris Halle, Joan Bresnan, and George A. Miller, editors, Linguistic Theory and Psychological Reality. MIT Press, Cambridge, Massachusetts, chapter 3, pages 119-161.