Synonymy and Contextual Disambiguation of Words

Eric Foxley and Godwin M Gwei

Computer Science Department

Nottingham University
NOTTINGHAM NG7 2RD, UK

Abstract

Synonymy occurs when several different words can represent similar meanings. Ambiguity occurs when a single word in a given context may have several different meanings. This paper describes computer developments which provide tools to assist in both of these situations, and from which computer tools can be developed to assist the authoring of text, and the writing of interactive computer systems.

In text authoring, we may wish to vary our vocabulary by the use of synonyms to arouse the interest of the reader, or to add emphasis to a topic; and we will generally wish to avoid ambiguity by the choice of non-polysemous words, or by the addition of enough context clues to resolve the ambiguity.

In interactions with computers, the aspects of input and output are distinct. Where the user gives input to the computer, it should be able to recognise the user's vocabulary, and accept freely generated citations representing the information required. Any ambiguous construction entered by the user should be queried. When giving output to the user, the computer may either use synonyms to make the conversation more varied, or may use only one from any group of synonyms to encourage the user into a more restricted vocabulary; and computer output should be chosen to be non-ambiguous.

The paper describes the development of a suite of computer programs to determine and reduce ambiguity in text, and to enable the computer to correctly relate a variety of synonyms to a single concept.

Keywords

Ambiguity, Concept, Natural language, Polysemy, Roget, Synonymy, Thesaurus.

1. Introduction

We will divide our discussion here between the applications to text authoring, and to consultative computing.

1.1. Synonymy and ambiguity in text authoring

In text authoring, we are concerned with the preparation of natural language documents. These may be for printing, or they may be computer documents, such as lessons for computer assisted learning (CAL) systems. One of the problems encountered when writing documents relates to the choice of vocabulary. It is possible that ideas do not come to mind in the most appropriate terms; one term may come to mind, but the author might have the feeling that it is not necessarily the best term for a particular situation. When this happens, authors will refer to external sources for help, typically synonym dictionaries. These provide the author with alternative terms that are synonymous with the original term, and experience will then help in recognising the most appropriate term from the given alternatives. For less experienced authors, the set of alternative terms might not be very helpful. Each of the terms might have to be looked up in an ordinary dictionary to determine its meaning before a decision can be made. This search is tedious even when the set of alternatives is small.

A related problem in choosing the vocabulary for a document may be that we particularly wish to use an interesting and varied vocabulary. The variety is partly to add interest for the reader, and partly to add emphasis by using more than one word of similar meaning.

In producing any technical document, we would wish to be unambiguous wherever possible. We would thus like a program which could tell us which words in the text are ambiguous (the meaning of some polysemous words may be resolvable by context, so that they are no longer ambiguous), and if possible would suggest less ambiguous alternatives.

1.2. Synonymy and ambiguity in interactive computing

In ordinary conversation, synonyms are usually resolved by experience in a given domain. For example, in a class-room situation, teachers always need to resolve synonyms while handling student queries, or marking student scripts. If a teacher expects the word large as part of a response, for example, either of the words big or huge would in general be an acceptable substitute. In trying to use a computer in activities involving conversation with human users, we also need to include mechanisms for handling synonymy in the way people would expect when participating in a human conversation. In a command system, for example, a user should with equal convenience be able to ask to delete a file, or remove it, or erase it. In computer assisted learning, the answer bigger should be allowed even if the course author had specified the word larger.

For computer input, this implies that the communication interface of any computer package should accept a variety of different terms from the user to represent any given concept. If a given term is ambiguous in spite of any context information which is available, the computer should request clarification. For output, if the computer can also vary its own vocabulary, this will add to its user friendliness compared with any system which repeats and expects exactly the same word every time a given concept is being discussed.

1.3. Synonymy in general

Synonyms arise as an artifact of the evolution of natural language; a variety of terms may often be used to supply a group of similar meanings. Two points need to be made. Firstly, few pure synonyms exist in the sense that two words can represent exactly the same meaning; groups of "synonymous" words usually carry subtle differences of emphasis. Secondly, a pair of words may be synonymous in one context, but not in another. This poses problems in any mechanical detection of synonyms. However, the need to detect and utilise synonyms already plays a role in many computer applications: information retrieval systems [ref 1] (Automatic Information Organisation and Retrieval, G. Salton, McGraw-Hill, New York, 1968), machine translation [ref 2] (Synonymy and Semantic Classification, K. Sparck Jones, Ph.D. Thesis, Cambridge University, England, 1964) [ref 3] (Agricola Terram Dimovit Aratro, M. Masterman, R. Needham, K. Sparck Jones, B. Mayoh, ML 92, Cambridge Language Unit, Cambridge, England, 1957), on-line help systems [ref 4] [ref 5] (Ingredients of intelligent user interfaces, E. L. Rissland, International Journal of Man-Machine Studies 21; pp 377-388, 1984) [ref 6] (Talking to UNIX in English: An overview of UC, R. Wilensky, Y. Arens, D. Chin, Communications of the ACM 27; 6; pp 574-593, 1984), and user interfaces [ref 7] (A Flexible Synonym Interface with application examples in CAL and Help Environments, G. M. Gwei, E. Foxley, The Computer Journal 30; 6; pp 551-557, 1987). The authors would suggest additionally that synonyms can also be helpful in all computing environments which expect any form of information from users. For example, the synonyms to command names can be used to provide users with a more flexible vocabulary than is permitted by a typical command language. Another example is in CAL environments in which script marking is performed by searching for keywords (see, for example, [ref 8] (Towards Automatic Teaching, G. M. Gwei, MSc. Thesis, University of Aston in Birmingham, Sept. 1983)). The effectiveness of such systems can be improved if they search additionally for synonyms of the relevant keywords.

2. Synonym sources

A number of synonym dictionaries exist, typified by those named after Webster, Crabb, and Roget. Webster's Synonym Dictionary [ref 9] takes each word (in a lexicographic order) and lists its synonyms followed by its antonyms. Crabb's English Synonyms [ref 10] similarly lists synonyms to words, but additionally gives examples of appropriate contexts for each synonym in the list. Unlike these two, Roget's Thesaurus [ref 11] is not word-based: it is based on concepts. It takes a concept at a time and lists terms which express that concept. There is one paragraph for each concept; each concept paragraph is sub-divided into different parts of speech, and within a part of speech, words are grouped together by the degree of synonymity between them.

Roget's Thesaurus has a richer vocabulary than either Crabb's or Webster's dictionary. We will concentrate on Roget's, primarily because it is concept-based and contains a rich vocabulary. The work below is based on a machine-readable version of Roget which was obtained from Longman.

3. Using a paper copy of Roget's Thesaurus

Roget's Thesaurus is made up of approximately a thousand paragraphs, each of which is dedicated to a single concept. Each paragraph concentrates on the numerous terms that express the given concept, and is labelled with a headword which gives an instant idea of the concept concerned. The contents list at the front of the thesaurus gives the headword for each concept, and shows the theoretical relationship between them. In theory, one could look up a meaning here, and determine the paragraph in which to find the relevant words. In practice, it is easier to think of a related term (word or phrase), and then to look that word or phrase up in the index which forms the second half of the printed thesaurus. The index contains a lexicographically ordered list of most of the words and phrases that occur in the separate paragraphs of the thesaurus. Each such entry is followed by a list of the numbers and headwords of each concept paragraph in which it occurs. Thus, the first procedure in using Roget's thesaurus manually involves deciding on a word which might be used to express the concept being considered. The user can then obtain help by means of the following steps.

The problems involved in searching for synonyms in this way include the following.

  • (a) Consolidating separate morphologically related variants. The particular part of speech adopted by the user for the original input word may fail to appear in the index. Such circumstances necessitate the recognition of other variants of a given word. Similarly, it may be necessary for a user to be aware of variant spellings for the same word (such as American/British spellings).

    Term      Number of     Number of
              references    distinct concepts
    ------------------------------------------
    charge    40            29
    head      35            29
    round     35            29
    close     35            30
    run       36            30
    drop      40            31
    turn      39            31
    stand     37            32
    clear     36            33
    line      39            33
    catch     39            35
    pass      40            37
    strain    49            41
    cut       53            45
    set       56            47

    Table 1: List of the most ambiguous terms in Roget's Thesaurus

  • (b) Choosing the appropriate paragraph for polysemous words, i.e. words with more than one inherent meaning. Any term that has multiple meanings (polysemous or ambiguous words) will appear in more than one paragraph. The diagram in Figure 1 shows along the horizontal axis the number of different references which may occur in Roget's Thesaurus for a given term; and plotted vertically, the number of terms with that number of references. The result is an approximately hyperbolic curve. It will be seen from the hyperbolic curve (a common characteristic of many linguistic distributions) that most terms occur in more than one paragraph, i.e. have some inherent ambiguity. The number of paragraph references made by a term reflects the number of abstract concepts for which that term can be a possible expression. We will define the ambiguity factor of a given word to be the number of distinct references to it in Roget's Thesaurus. The actual ambiguity in a given occurrence depends, of course, on context. Table 1 shows the words with the highest ambiguity factor in the thesaurus. The values in the column headed "distinct concepts" correspond to the number of separate meanings (i.e. different concept paragraphs) associated with a given word; and those in the "references" column include a count of the different parts of speech for a given meaning. For example, close is associated with 30 distinct concepts, and within some concepts, the different parts of speech for close remain unchanged. The number of references also gives a degree of ambiguity associated with a given term. Seeking synonyms to a word that has a high ambiguity factor can be very tedious. Before locating the proper synonyms, the user needs to select the pertinent meaning of the word by comparing the concept in mind with the headword of each of the referenced paragraphs.
  • (c) Locating the right synonyms. Each paragraph in Roget's Thesaurus is subdivided into parts of speech. Within each part of speech, terms are grouped together by synonymity (the degree of similarity in meaning). Thus, obtaining the best synonyms involves locating the pertinent group within the relevant paragraph.
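
The two counts reported in Table 1 can be sketched in a few lines of Python. This is only an illustration: the index layout (a term mapped to pairs of paragraph number and part of speech), the function name ambiguity_counts, and the sample entries for close are assumptions made for the example and are not taken from the thesaurus.

```python
from typing import Dict, List, Tuple

# Hypothetical index fragment: term -> list of (paragraph number, part of speech).
# The entries are invented for illustration; they are not real Roget data.
Reference = Tuple[int, str]

SAMPLE_INDEX: Dict[str, List[Reference]] = {
    "close": [(200, "adj"), (200, "vb"), (264, "vb"), (69, "n")],
}


def ambiguity_counts(term: str, index: Dict[str, List[Reference]]) -> Tuple[int, int]:
    """Return (number of references, number of distinct concept paragraphs)."""
    references = set(index.get(term, []))
    concepts = {paragraph for paragraph, _pos in references}
    return len(references), len(concepts)


print(ambiguity_counts("close", SAMPLE_INDEX))
```

A term that appears as two parts of speech in one paragraph contributes two references but only one distinct concept, which is why the "references" column of Table 1 is never smaller than the "distinct concepts" column.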

    The computer system described below gives assistance in finding synonyms in a way which represents a considerable improvement on the manual use of the thesaurus mentioned above. We then explore other ways of utilising the semantic classification available in a concept-based thesaurus.

    4. Automatic synonym generation

    We have implemented a synonym-generation program, synonym, based on our thesaurus [ref 12] (The Roget Environment (Synonym Generation), G. M. Gwei, Internal report, Computer Science Group, Nottingham University, Dec. 1984). In this section we describe some design considerations and features of the implementation.

    4.1. Searching mechanism

    By definition, variants of the same stem have similar meanings. Consequently, it is sufficient to establish a similarity in the meanings of any two variants from two separate stems to establish synonymy between the stems concerned. For example, it would be naïve to preclude the synonyms of angular from those of angle. Less obvious examples include leaving out the synonyms of encode while considering codify, or solar while dealing with sun.

    The printed version of Roget's Thesaurus contains an index to facilitate searching. We obtained from publishers Longman the actual body of the thesaurus (a file of size approximately 3 Mbytes) without the index supplied in the printed version. The generation of a full inverted index was non-trivial, but the end result was a much more comprehensive index than that in the printed copy of the thesaurus.
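
As a rough illustration of what a full inverted index involves, the sketch below maps every term to the paragraphs that contain it. The input layout (paragraph number mapped to a list of terms), the function name build_inverted_index and the toy data are assumptions; the real thesaurus file needs considerably more parsing than is shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_inverted_index(paragraphs: Dict[int, Iterable[str]]) -> Dict[str, List[int]]:
    """Map every term to the sorted list of paragraph numbers it occurs in."""
    index = defaultdict(set)
    for number, terms in paragraphs.items():
        for term in terms:
            index[term.lower()].add(number)
    return {term: sorted(numbers) for term, numbers in index.items()}


# Toy thesaurus body: paragraph number -> terms (a tiny invented sample).
toy_body = {
    62: ["code", "codify", "codification"],
    525: ["code", "coded", "encode"],
    520: ["encode", "encoder"],
}
index = build_inverted_index(toy_body)
print(index["code"])
print(index["encode"])
```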

    The search phase involves heuristics for deriving acceptable words and phrases from any given input. The heuristic rules (see [ref 13] for details) include:

  • the use of standard conjugation rules;
  • the manipulation of suffixes and prefixes; and
  • the correction of spellings and the conversion from British to American spelling (and vice-versa).

    For example, given malfunction, the heuristics for prefixes would automatically include nonfunctioning, not functioning etc as search terms. Similarly, prefixes such as be-, en-, a- etc which mainly introduce other parts of speech can simply be added to a given word to form others. The rules for deriving other parts of speech through suffix manipulation (see Appendix) are based on an associative model derived from Porter's algorithm on word conflation [ref 14] (An algorithm for suffix stripping, M. F. Porter, Program 14; 3; pp 130-137, July 1980). After manipulating prefixes and suffixes of an input term, all resulting terms are sorted, and each term is then looked up in the inverted index. Any derived term not found in the search is discarded. As an example, starting with the call synonym code, the prefix routine would suggest the list in Table 2.

    code
    accode
    acode
    aerocode
    aircode
    becode
    concode
    encode
    extracode
    subcode
    supercode
    ultracode
    welcode

    Table 2: Possible derivations of code
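
The prefix step can be pictured with the following sketch. The prefix list is simply read off Table 2 and is not claimed to be the program's full rule set; the function names and the small stand-in index are hypothetical.

```python
# Prefixes read off Table 2; the actual rule set is not listed in the text.
PREFIXES = ["ac", "a", "aero", "air", "be", "con", "en", "extra",
            "sub", "super", "ultra", "wel"]


def prefix_candidates(word):
    """Return the word itself followed by every prefixed form, sorted."""
    return [word] + sorted(prefix + word for prefix in PREFIXES)


def keep_known(candidates, index):
    """Discard any derived term that is not present in the inverted index."""
    return [term for term in candidates if term in index]


# Stand-in for the inverted index: only the keys matter for this sketch.
toy_index = {"code": [523, 525], "encode": [520, 525]}

candidates = prefix_candidates("code")
print(candidates)
print(keep_known(candidates, toy_index))
```

Generating candidates freely and then filtering them against the index keeps the prefix rules simple: forms such as becode or welcode cost nothing because they are dropped as soon as the lookup fails.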

    Using the suffix rules, the system would search and display all acceptable suffix derivations from the terms in Table 2 that can be found in the inverted index. All phrases beginning with any of the acceptable terms would also be given in the display. Table 3 shows the display that would be obtained for the word code as parameter.

    Found word/phrase    Paragraph Number(s)
    --------------------------------------------------------------------
    "code of duty"       917
    "code of honour"     917, 929
    "code"               523, 525, 525, 530, 547, 586, 62, 81, 693, 929
    "coded"              523, 525
    "codification"       62, 953, 953
    "codified law"       953
    "codified"           525
    "codify"             62
    "encode"             520, 525
    "encoder"            520
    --------------------------------------------------------------------
    No. of references    14 (12 unique)

    Table 3: Paragraph references for terms derived from code

    Each number corresponds to a paragraph of the thesaurus in which the term on the left appears; it will be observed that a given term may be found in several paragraphs, and that a given paragraph may appear against several terms.

    Seven of the terms in Table 3 arise from the use of affixes (prefixes and suffixes), and two terms are common phrases beginning with the word code. A summary of the effect of acceptable affixes and phrases on the most ambiguous words in the thesaurus is shown in Table 4.

                Excluding affixes     Including affixes
    Term        words     phrases     words     phrases
    ---------------------------------------------------
    charge      40        53          47        57
    head        35        57          132       184
    round       35        68          56        82
    close       35        82          56        107
    run         36        169         110       280
    drop        40        92          51        98
    turn        39        193         55        208
    stand       37        142         79        190
    clear       36        71          52        79
    line        39        60          64        82
    catch       39        80          44        85
    pass        40        118         96        183
    strain      49        60          55        66
    cut         53        135         70        151
    set         56        216         74        233

    Table 4: Number of references involving the most ambiguous words

    4.2. Additional information to help users

    In manual usage of Roget, a user looks at headwords to decide which paragraph is most relevant. In our computer implementation, we wish to provide users with more help via the use of relevance and confidence measures. These measures depend on the number of occurrences, the degree of suffix/prefix manipulation used in deriving the term for each occurrence, and whether the occurrence is a phrase containing a derivation or the derivation by itself. To compare the relevance measures, we assign a value to each of the derived terms according to the procedure used in its derivation. The value assigned to each derivation depends on the error associated with the affix (prefix or suffix) used; these error values were obtained by experimenting on the separate affix categories [ref 15]. The experiment yielded error values which approximate the probability that the use of a given affix would result in a word with a different morphological root. Given $\mathrm{Error}_{\mathit{aff}}$, the percentage error associated with a given affix category, the value given to a term found after using that affix is given by equation 1. The value assigned to a phrase containing any derivation is half the value associated with the derivation alone; we assumed a 50% chance that other words in the phrase may change the meaning of a given word.

    To guide the user in choosing pertinent paragraphs we opted to summarise the values from the search as follows.

  • (1) Confidence. Since a user specifies a particular variant as input, we should be more confident in paragraphs containing that variant than in paragraphs containing other derivations. Each paragraph is assigned a confidence value corresponding to the best (least error) variant of the given input that is found in that paragraph.
  • (2) Relevance of each paragraph. The paragraphs considered relevant to a particular input can be ordered by defining a relevance value for each one. For example, a paragraph containing several variants of the input is likely to be more relevant to the user than a paragraph with fewer variants. Let us define $V_i$ (from equation 1) as the value of the $i$th variant of the input as found in a given paragraph, and let $n_j$ be the number of different variants of the input found in the $j$th paragraph. From these values we obtain $S_j$, the overall score to be associated with the $j$th paragraph (equation 2). If Max is the highest individual score obtained in a search phase, and $C_j$ is the confidence in the $j$th paragraph, then the relative relevance of the $j$th paragraph, $R_j$, is defined in terms of $S_j$, Max and $C_j$ (equation 3).

    The relevance value for a paragraph determines its ordering among all the paragraphs considered, while the confidence value for a paragraph is a measure of the certainty of its meaning (relative to the word given).

    Label  Para.  Relative   Confidence  Headword
           No.    Relevance
    -----------------------------------------------------------------
    a      525    100%       100%        "Concealment"
    b      62     47%        100%        "Arrangement: reduction to order"
    c      929    44%        100%        "Probity"
    d      523    38%        100%        "Latency"
    e      81     29%        100%        "Rule"
    f      530    29%        100%        "Secret"
    g      547    29%        100%        "Indication"
    h      586    29%        100%        "Writing"
    i      693    29%        100%        "Precept"
    j      520    21%        75%         "Interpretation"
    k      917    14%        50%         "Duty"
    l      953    6%         30%         "Legality"
    -----------------------------------------------------------------
    Choose a letter, from "a" to "l";
    or any combination of the above letters;
    or "A" for all; or "N" for none;
    or "%<x>" for all paragraphs with relative relevance >= <x>%
    or "%c<x>" for all paragraphs with confidence >= <x>%
    where <x> is a number in the range 1 to 100
    Type your choice :

    Table 5: Information to facilitate the choice of pertinent synonyms of code
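
The reply grammar shown at the foot of Table 5 is small enough to sketch. The function below is a hypothetical reading of it, not the program's own parser: it assumes that a "combination of letters" is typed without separators, and the name select_paragraphs is invented for the example. The rows are copied from Table 5.

```python
import re

# Rows of Table 5: (label, paragraph, relative relevance %, confidence %).
ROWS = [
    ("a", 525, 100, 100), ("b", 62, 47, 100), ("c", 929, 44, 100),
    ("d", 523, 38, 100), ("e", 81, 29, 100), ("f", 530, 29, 100),
    ("g", 547, 29, 100), ("h", 586, 29, 100), ("i", 693, 29, 100),
    ("j", 520, 21, 75), ("k", 917, 14, 50), ("l", 953, 6, 30),
]


def select_paragraphs(choice, rows=ROWS):
    """Translate a reply to the prompt into a list of paragraph numbers."""
    choice = choice.strip()
    if choice == "A":
        return [para for _, para, _, _ in rows]
    if choice == "N":
        return []
    match = re.fullmatch(r"%(c?)(\d+)", choice)
    if match:
        threshold = int(match.group(2))
        if not 1 <= threshold <= 100:
            raise ValueError("threshold must be in the range 1 to 100")
        column = 3 if match.group(1) else 2  # confidence or relevance column
        return [row[1] for row in rows if row[column] >= threshold]
    by_label = {label: para for label, para, _, _ in rows}
    if not choice or any(ch not in by_label for ch in choice):
        raise ValueError(f"unrecognised choice: {choice!r}")
    return [by_label[ch] for ch in choice]


print(select_paragraphs("%40"))
print(select_paragraphs("ac"))
```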

    In addition to the factors above, we opted to also display the headwords of all the paragraphs considered. Each headword gives the user an instant idea of the concept covered by the corresponding paragraph. It also allows the user to mentally discard irrelevant paragraphs.

    Continuing with the output of the command synonym code, the information in Table 5 is displayed to the user after the evaluation of the relevance and confidence values. Thus, paragraph 525 on Concealment, with a relative relevance of 100%, and a confidence of 100%, is considered to be the most relevant (for synonyms to code). The user might however see a more pertinent headword that better matches his/her original context. The choice is left open with this interactive method.
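
A minimal sketch of how the two summary values might be computed is given below. It rests on assumptions that the text does not confirm: that a paragraph's score is the sum of the values of the variants found in it, that its confidence is its best single value, and that its relative relevance is its score divided by the highest score and scaled by its confidence. This reading was chosen because it agrees with the description and with the general shape of Table 5. The function name and the sample values are hypothetical.

```python
def summarise(variant_values):
    """variant_values: paragraph number -> values (0-100) of the variants found.

    Assumed reading: score = sum of values, confidence = best single value,
    relative relevance = score / highest score * confidence.
    """
    scores = {para: sum(values) for para, values in variant_values.items()}
    top = max(scores.values())
    summary = {}
    for para, values in variant_values.items():
        confidence = max(values)                      # percent
        relevance = scores[para] / top * confidence   # percent
        summary[para] = (relevance, confidence)
    return summary


# Hypothetical variant values. Only 100 (the exact input) and 50 (a phrase
# containing it) follow from the text; the other numbers are invented.
found = {
    525: [100, 100, 30, 40, 75],
    81: [100],
    917: [50, 50],
}
summary = summarise(found)
for para, (relevance, confidence) in sorted(
        summary.items(), key=lambda item: -item[1][0]):
    print(f"{para:>4}  relevance {relevance:5.1f}%  confidence {confidence}%")
```

Under these assumptions a paragraph that contains only phrases is penalised twice, once through its lower score and once through its lower confidence, which is the behaviour one would want for entries such as "code of duty".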

    4.3. Contextual synonymy

    As depicted in Figure 1 (section 3), many terms have more than one reference corresponding to distinct meanings and separate parts of speech. In requesting synonyms to a term, a user may not be interested in all potential meanings of the term (polysemy!). The headword information in Table 5 can enable a user to choose synonyms of the desired meaning (from the pertinent paragraphs). However, in this mode of operation the amount of information displayed can be bulky, especially when several derivations of the term exist or for words with high ambiguity factors (as shown in Table 4).

    Bulky output can be avoided by specifying the context within which to locate synonyms. Before requesting synonyms to a polysemous term, a user may already know some simple and common characteristics associated with the meaning desired. Such characteristics can include the following.

  • (a) A variant of the original term often associated with the desired meaning. For example, redness for the meaning of red associated with colour.
  • (b) Other terms with meanings similar to the desired meaning (or a guess of one synonym). For example, collection for the meaning of set associated with "assemblage of entities".
  • (c) A typical domain within which the term usually assumes the desired meaning. For example, mathematics for the mathematical meanings for set.
  • (d) Other terms common in the domain wherein the original terms assume the desired meaning. For example, food/drink, for the meaning of glass associated with cup.
  • (e) Common phrases (including the term) in which the original term assumes the desired meaning. For example, red with anger for the meaning of red associated with resentment.

    These characteristics (which are all exploited in Roget's classification) can be used to discard less relevant concepts while searching for synonyms. For example, mathematics can be specified as a context parameter for synonyms to set using the command

    synonym set -c mathematics

    The number of paragraphs considered would now be reduced to 3 (on "NUMERATION", "TRUTH", and "REASONING") as opposed to 233 potential paragraphs (Table 4) for set. In this mode, synonym searches for the input parameter as before (but hidden); assigns weights to found paragraphs as above; carries out a similar procedure for an exact antonym of the input (if one exists and can be derived); performs a similar procedure for the context delimiter and its exact antonym; and combines the weights arising from the separate terms using a set of coefficients. The coefficients approximate the relationships (revealed by Roget's classification) that exist between the given parameter and the context term. The relationships considered are approximated by values as indicated in Table 6.

    Value   Relationship between terms
    ----------------------------------------------------------------------
     4      terms are derivations of each other
     3      terms are synonymous (co-occur in a synonym group)
     2      both terms express aspects of a single (wide!) domain (co-occur in one paragraph)
     1      both terms express aspects of related domains (co-occur in inter-related paragraphs)
     0      unknown relationship
    -1      one term and the antonym of the other express aspects of related domains
    -2      one term and the antonym of the other express aspects of a single domain
    -3      both terms are antonymous (the antonym of one term is synonymous to the other term)
    -4      one term is an exact antonym of the other term (agglutinated)

    Table 6: Values associated with different relationships between terms

    The non-negative values approximate the degree to which the context parameter supports a particular meaning of the given term, and the negative values approximate the degree to which the context parameter rejects a particular meaning. For example, the call

    synonym code -c computer

    would list only synonyms from paragraph 62 on "Arrangement: reduction to order". In the background however, 12 paragraphs would be considered for code (Table 3), 2 for decode, 8 for computers, and none for the antonyms to computers. Using the coefficients above to combine the separate weights would result in the rejection of all the other paragraphs except paragraph 62 (in which a meaning overlap between the terms occurs).
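
The text does not spell out how the coefficients of Table 6 are combined with the paragraph weights, so the sketch below shows only one simple possibility: multiply each paragraph's weight by the coefficient for its relationship to the context term and keep the paragraphs that end up positive. The relationship names, the function name, and the sample weights are all assumptions made for the illustration.

```python
# Coefficients from Table 6, keyed by hypothetical relationship names.
COEFFICIENT = {
    "derivation": 4,
    "synonym": 3,
    "same_paragraph": 2,
    "related_paragraph": 1,
    "unknown": 0,
    "antonym_related_paragraph": -1,
    "antonym_same_paragraph": -2,
    "antonymous": -3,
    "exact_antonym": -4,
}


def filter_by_context(weights, relationships):
    """Keep paragraphs whose weight, scaled by the coefficient, is positive."""
    combined = {}
    for para, weight in weights.items():
        relation = relationships.get(para, "unknown")
        combined[para] = weight * COEFFICIENT[relation]
    return {para: value for para, value in combined.items() if value > 0}


# Invented weights and relationships for the call "synonym code -c computer".
weights = {62: 0.47, 525: 1.0, 929: 0.44}
relationships = {62: "same_paragraph"}  # code and computer co-occur here
kept = filter_by_context(weights, relationships)
print(kept)
```

With an "unknown" coefficient of 0, any paragraph that has no established relationship to the context term drops out, which mirrors the rejection of every paragraph except 62 in the example above.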

    In summary, the heuristics for deriving antonyms operate as follows.

  • (a) If a term is an agglutinated (compound) word containing a negating prefix, then extract the negating prefix. For example, malfunction to function.
  • (b) If a term is an agglutinated word but not containing a negating prefix, then try replacing its prefix by any possible negating prefixes (if any replacement results in a word which can be shown to exist). For example, encode to decode.
  • (c) Affix any possible negating prefixes to a word (if any affixation results in an existing word). For example, common to uncommon.
  • (d) Append/extract the suffix "-less" to/from a word (if the resulting word exists). For example, use to useless and vice-versa.

    In any of the steps above, suffixes can also be manipulated. For example, useless would also be considered as a pure antonym to useful.
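
The four antonym heuristics can be sketched as a generate-and-filter routine. The prefix lists and the small lexicon below are assumptions for the example (the text does not list the prefixes actually used), the function name is hypothetical, and the suffix manipulation mentioned in the last paragraph is left out.

```python
# Assumed prefix lists; the rule set actually used is not given in the text.
NEGATING_PREFIXES = ("un", "non", "mal", "dis", "de", "in")
OTHER_PREFIXES = ("en", "be", "a")


def antonym_candidates(word, lexicon):
    """Apply heuristics (a)-(d) and keep only candidates found in the lexicon."""
    candidates = set()
    # (a) strip a negating prefix: malfunction -> function
    for prefix in NEGATING_PREFIXES:
        if word.startswith(prefix):
            candidates.add(word[len(prefix):])
    # (b) swap a non-negating prefix for a negating one: encode -> decode
    for prefix in OTHER_PREFIXES:
        if word.startswith(prefix):
            stem = word[len(prefix):]
            candidates.update(neg + stem for neg in NEGATING_PREFIXES)
    # (c) affix a negating prefix: common -> uncommon
    candidates.update(neg + word for neg in NEGATING_PREFIXES)
    # (d) append or extract the suffix "-less": use <-> useless
    if word.endswith("less"):
        candidates.add(word[:-len("less")])
    else:
        candidates.add(word + "less")
    candidates.discard(word)
    return sorted(c for c in candidates if c in lexicon)


# Toy lexicon standing in for the inverted index.
lexicon = {"function", "malfunction", "encode", "decode",
           "common", "uncommon", "use", "useless"}
for term in ("malfunction", "encode", "common", "use"):
    print(term, "->", antonym_candidates(term, lexicon))
```

The existence check against the lexicon is what keeps the heuristics usable: most generated strings are not words, and a purely mechanical rule such as (a) can still misfire on words that merely begin with the letters of a negating prefix.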

    5. Disambiguating polysemous words within natural language context

    In a typical utterance, just one of the potential meanings of a polysemous word is intended. Disambiguation is the process of determining that intended meaning.

    Many of the approaches reported in the literature use other words that co-occur with the polysemous word in an utterance to disambiguate. Milne's [ref 16] (Resolving Lexical Ambiguity in a Deterministic Parser, R. Milne, Computational Linguistics 12; 1; 1986) approach makes use of syntactic tokens and morphological analysis to assign the right part of speech to ambiguous words. Lesk [ref 17] (Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone, M. E. Lesk, Proceedings of 1986 ACM SIGDOC conference, Toronto, Canada, June 1986) counts the overlaps in the dictionary definitions of the keywords in an utterance, and chooses the definition(s) with the highest overlaps as the intended meaning for the ambiguous word(s). Schank [ref 18] (Identification of Conceptualizations Underlying Natural Language, R. Schank, Computer Models of Thought and Language, W. H. Freeman and Co, San Francisco, 1973) conceptualises by using any characteristics of other words in a sentence that can discard some of the meanings associated with the ambiguous word in context. Sparck Jones [ref 19] (Synonymy and Semantic Classification, K. Sparck Jones, Ph.D. Thesis, Cambridge University, England, 1964) argues that semantic affinity usually exists between the words used in any discourse, thus disambiguation can be achieved by adopting the meaning(s) that recur(s) among the words used in the discourse.

    5.1. Adopted approach

    Whilst reading, a human reader usually disambiguates word meanings by using the context within which the words appear. We can extend the synonym command to attempt word disambiguation in a similar manner. That is, by developing a method of choosing context parameters from an utterance, synonym can be made to search for the meaning (paragraph headword) of a given term demarcated by the context parameters (as in section 4.3).

    The selection of context parameters from any utterance operates as follows.

    Keyword clues

    Human readers can be assisted in understanding any written utterance by external circumstances; but good writers do not rely on this extra-ordinary ability. Many utterances contain most of the information required for their interpretation, i.e. the words used alongside any inherently ambiguous words. However, some of the words are more likely to be context delimiters than others. For example, in an utterance such as

    time flies like an arrow

    the word an is less likely to be a useful context clue in the disambiguation of the word flies. Thus, such words can be eliminated from the utterance while searching for context clues. After the elimination of unclassified words, each of the remaining words can potentially be used as the context parameter; or some of the keywords may be more significant than others; or a combination of some of the words may jointly prescribe the context parameter. In an utterance such as

    the man with reading glasses loves wine

    for example, reading is more significant in determining the intended meaning of glasses than any of the other words; omitting the word reading from the utterance can have a drastic effect on the disambiguation of the word glasses, even to human readers. From utterances of this type, it seems likely that where more than one potential context parameter prevails, those closer (in sequence) to the ambiguous word may be more significant. This significance is approximated in our model by

    $$Sig_i = \frac{1}{d_i}$$

    where $d_i$ is the distance (word count) of the $i$th term from the ambiguous word in the utterance. In the utterance above, the significance of man, reading, loves, and wine, would respectively be 1/3, 1/1, 1/1 and 1/2. However, only the significance values of reading and wine would be useful (from Roget's classification) in the disambiguation of glasses.
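
The worked values above correspond to taking the reciprocal of each keyword's distance from the ambiguous word. A short sketch reproduces them; the set of unclassified words and the function name are assumptions made for the example.

```python
from fractions import Fraction

# Assumed sample of unclassified words; the program's own list is not given.
UNCLASSIFIED = {"the", "a", "an", "with", "on", "of"}


def significance(utterance, target):
    """Return {keyword: 1/distance} for every classified word except the target.

    If a keyword occurs more than once, the last occurrence wins in this sketch.
    """
    words = utterance.lower().split()
    position = words.index(target)
    return {
        word: Fraction(1, abs(i - position))
        for i, word in enumerate(words)
        if i != position and word not in UNCLASSIFIED
    }


result = significance("the man with reading glasses loves wine", "glasses")
for word, value in result.items():
    print(word, value)
```

Note that the distance is counted over the original word sequence, before unclassified words are removed; that is why man scores 1/3 rather than 1/2.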

    Direct clues

    Many documents (especially technical documents) contain direct clues to help human readers disambiguate. Typical disambiguation clues in Roget's Thesaurus which we have used include the following.

  • (i) The special characters in writing "/", "(" ")". An entry in the thesaurus may contain "archive/library" or "archive (library)" specifying that the separate words are used in the same sense. Such clues can be used directly by synonym; the alternative term given would be taken as the context parameter for the preceding term.
  • (ii) The special conjunctions "or" and "and". Such conjunctions in Roget give the same clues as the special characters above. That is, the term following any such conjunction can be taken as the context parameter for the preceding term. However, more caution is needed in the use of these conjunctions than in the use of the symbols of the preceding paragraph. The degree of caution is indicated in Table 8 later.

    Any of these clues can potentially shorten the disambiguation process: whenever a direct clue is found it takes precedence over all other clues that may be present. However, if no relationship can be established between the clue and any meaning associated with the ambiguous word, all other keywords would be taken as clues (as above).
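
    The two kinds of direct clue can be sketched as a simple pattern search. The regular expressions and the function name below are illustrative assumptions, not the patterns used by the synonym command; each match pairs a term with the alternative term that would serve as its context parameter.

```python
import re

# Hypothetical patterns for the direct clues described above.
SLASH = re.compile(r"(\w+)\s*/\s*(\w+)")           # archive/library
PAREN = re.compile(r"(\w+)\s*\(\s*(\w+)\s*\)")     # archive (library)
CONJUNCTION = re.compile(r"(\w+)\s+(?:or|and)\s+(\w+)")


def direct_clues(text):
    """Return (term, context parameter) pairs found in the text."""
    pairs = []
    for pattern in (SLASH, PAREN, CONJUNCTION):
        pairs.extend(pattern.findall(text))
    return pairs


print(direct_clues("archive/library"))
print(direct_clues("archive (library)"))
print(direct_clues("copy or move files"))
```

    Because the conjunctions are less reliable than the special characters, a fuller version would probably attach a lower confidence to pairs found by the last pattern.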

    The role of unclassified words

    In some utterances, unclassified words can be the main source of semantic clues. Our approach utilises such clues in the following ways.

  • (i) As part of a phrase. Whenever an unclassified word can form part of a phrase with adjacent words, the phrase obtains the significance value for the position of the unclassified word. For example, in an utterance such as
    the house was set on fire
    "set on" and "on fire" would be considered as significant disambiguation clues even though "on" would be eliminated as unclassified.
  • (ii) As determiners. Whenever any of the words the, a, an, some, any precedes a polysemous word, the disambiguation procedure is weighted in favour of paragraphs containing noun/adjectival forms of the portmanteau word.
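
    Rule (i) can be sketched as follows. The phrase lexicon passed in is a hypothetical stand-in for the phrases held in the thesaurus, and the function name is an assumption; each phrase found is reported with the position of the unclassified word, which is the position whose significance value it would inherit.

```python
def phrase_clues(words, unclassified, known_phrases):
    """Find two-word phrases formed by an unclassified word and a neighbour.

    Returns (phrase, position) pairs, where position is the index of the
    unclassified word in the utterance.
    """
    clues = []
    for i, word in enumerate(words):
        if word not in unclassified:
            continue
        if i > 0:
            left = f"{words[i - 1]} {word}"
            if left in known_phrases:
                clues.append((left, i))
        if i + 1 < len(words):
            right = f"{word} {words[i + 1]}"
            if right in known_phrases:
                clues.append((right, i))
    return clues


utterance = "the house was set on fire".split()
clues = phrase_clues(
    utterance,
    unclassified={"the", "was", "on"},
    known_phrases={"set on", "on fire"},  # hypothetical phrase lexicon
)
print(clues)
```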

    5.2. Performance: using the disambiguation model

    The disambiguation program was tested on a sample of 400 sentences, each containing at least one polysemous word. The sentences were chosen from examples cited in the literature on disambiguation and from suggestions by colleagues. Table 7 shows some of the sentences alongside the performance of the program on each. The program diagnosed the intended meaning correctly in 304 sentences (76%); reported the absence of adequate clues in 76 sentences (19%); and reported the wrong meaning in 20 sentences (5%). Of the sentences diagnosed incorrectly, 80% contained at least one other polysemous word. On average, it took 2 seconds to disambiguate a word in a sentence context.

    Sentence                                    Program output                  Time in
                                                meaning (headword)  confidence  secs
    _
    time flies* like an arrow
    fruit flies* like a banana
    house flies* like a banana
    the man with reading glasses* loves wine
    the man with glasses* loves wine
    his glasses* were filled with sherry
    this book is the fruit* of great effort
    they all had fruits* after dinner
    the function* is in honour of his birthday  Celebration         100%        6.7
    the function* has more than one argument
    the function* in his honour had arguments   Celebration         64%         7.0
    the function* has many arguments

    Note: "*" indicates the word being disambiguated in each sentence.

    Table 7: Attributes of the disambiguation model

    To justify the existence of direct clues, we used the collection of all descriptive names for commands in the Unix operating system (in the file /usr/lib/whatis on Unix 4.2BSD). The results obtained are summarised in Table 8. In the light of the experiment, 69% of the occurrences of the special characters "/", "(" and ")", and of the special conjunctions "or" and "and", can justifiably be used as disambiguation clues for polysemous words.

    Symbol  Frequency  % intended as
                       disambiguation clues
    _
    /       36         83
    ()      8          62.5
    or      27         66.67
    and     51         66

    Table 8: Evidence of direct clues in technical documents

    5.3. Comparison of the approach with others

    The approach above relies on the co-occurrence of word derivations in a synonym group; in a concept paragraph; or in any of a set of reciprocally related paragraphs (all on similar abstract concepts) of Roget's Thesaurus. The approach compares with other disambiguation schemes in the following ways.

  • (1) Morphological analysis. The derivation of morphologically related words augments the searching procedure. The assignment of lower weight values to each derivation ensures that peculiar meanings (parts of speech) associated with the affixes actually adopted in an utterance can still facilitate disambiguation, as in Milne's approach [ref 20] (R. Milne, Resolving Lexical Ambiguity in a Deterministic Parser, Computational Linguistics 12(1), 1986).
  • (2) Overlap in word definitions. The headword of each paragraph in Roget's Thesaurus usually gives a near-definition of all the terms in that paragraph. Thus, all the words in any paragraph are likely to contain several overlapping words in their definitions (for the meaning referenced in the particular paragraph). The overlap count is likely to be higher for terms belonging to a synonym group than for terms belonging to one paragraph or for terms belonging to separate but related paragraphs. Thus, the use of coefficients as a measure of contexts closely mimics the approach adopted by Lesk [ref 21] (M. E. Lesk, Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone, Proceedings of 1986 ACM SIGDOC conference, Toronto, Canada, June 1986).
  • (3) Common characteristics of objects. Schank's approach [ref 22] (R. Schank, Identification of Conceptualizations Underlying Natural Language, in Computer Models of Thought and Language, W. H. Freeman and Co, San Francisco, 1973) relies on the collection of objects with corresponding characteristics to ease disambiguation. Each paragraph of the thesaurus is typically dedicated to a wide abstract topic. Ideally, each paragraph is subdivided into synonym groups, with each group concentrating on one aspect of the topic. Thus, the characteristics of any object have a high likelihood of occurring in the same paragraph as the object itself. Common phrases usually arise from the pertinent characteristics of many objects; such common phrases usually occur within the same paragraph as the object they describe. For example, paragraph 301 of the thesaurus (with "Food: eating and drinking" as headword) lists the following synonym groups containing the word glass.

  • The excerpt (by itself) contains much information about the characteristics of the object glass (meaning cup in this case) including the idea of being refilled (filled again). The information is enough to facilitate the disambiguation of the classical quotation

    the old man's glasses were filled with sherry

    (which is also cited in Schank's paper); sherry also occurs in paragraph 301.

  • (4) Semantic classification. The individual synonym groups in Roget's Thesaurus correspond to the semantic classification in Jones's approach [ref 23] (K. Sparck Jones, Synonymy and Semantic Classification, Ph.D. Thesis, Cambridge University, England, 1964). Her approach also involves minimising the distance between the word uses in a given sentence; the location of separate terms in the thesaurus approximates this semantic distance. That is, the semantic distance increases in the order

  • Thus, the assignment of coefficients to measure semantic distance represents a more efficient implementation of the method advocated by Jones.

    Our approach also weights the clues given by each word in an utterance (depending on the position of the word relative to the ambiguous word). Such consideration is absent from any of the approaches cited above.
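
    How positional weights might combine with paragraph membership can be sketched with a toy example. The miniature thesaurus below is invented for illustration (only the headword of paragraph 301 and the presence of glass and sherry in it come from the text above), and the scoring rule, a plain sum of 1/d over clue words sharing a paragraph with the ambiguous word, is a simplifying assumption that omits the coefficients for synonym groups and related paragraphs.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical miniature thesaurus: headword -> words in that paragraph.
TOY_PARAGRAPHS = {
    "Optical instrument": {"glasses", "reading", "lens", "spectacles"},
    "Food: eating and drinking": {"glasses", "wine", "sherry", "cup"},
}


def rank_meanings(words, target, unclassified=frozenset()):
    """Rank candidate headwords for the target by summed positional weight."""
    t = words.index(target)
    scores = defaultdict(Fraction)
    for i, word in enumerate(words):
        if i == t or word in unclassified:
            continue
        weight = Fraction(1, abs(i - t))  # closer clues count for more
        for headword, members in TOY_PARAGRAPHS.items():
            if target in members and word in members:
                scores[headword] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


stop = {"the", "with", "his", "were"}
print(rank_meanings("the man with reading glasses loves wine".split(), "glasses", stop))
print(rank_meanings("his glasses were filled with sherry".split(), "glasses", stop))
```

    In the first utterance the nearer clue reading outweighs wine, so the optical sense is ranked first; in the second, sherry is the only clue and selects the drinking sense.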

    6. Possible applications of the Roget tools

    We use the word environment here to represent a computer system into which the Roget-based commands described above have been built.

    6.1. Helping authors choose appropriate terms

    The synonym generation process described in section 4 above can be used to simplify an author's task of finding the most relevant term to express a given concept. A suitable combination of flags would present an author with a small set of apposite alternatives; it is even possible to accept incorrect English, such as misfunction, and to suggest correct alternatives (such as dysfunction, functionless, malfunction, not function, no function, nonfunctional, nonfunctioning) by utilising the embedded heuristics. Thus the system can diagnose that a given user word is incorrect, can determine what the user probably intended, and can suggest terms which are more likely to correctly express that intention, without losing the root word from the user.

    The disambiguation routines can also be used in an authoring package in the following ways.

  • (1) The supply of contextual synonyms to polysemous words. The occurrence of polysemous words in a text can be reported via the use of a simple search procedure. An author can then be warned by a display of the ambiguity factor of each such word. Using the context of the sentence in which each ambiguous word is found, the Roget-based system can suggest less ambiguous alternatives. For example, in the sentence "the function was for his birthday", the tools above can
  • (2) Warning on persistent ambiguity. Sentences can be analysed as in (1); where no disambiguation clues are found in a sentence, the author can be prompted to modify it. For example, a sentence like "they have his glasses" would cause a warning that glasses has a high ambiguity factor and that there are insufficient disambiguation clues for its resolution.

    6.2. Towards computer understanding of word meanings

    An obvious goal of all natural language systems is to represent word meanings on computers. This process can be an adjunct (or preprocessor) to any package which analyses users' terms. For human understanding, an ordinary dictionary would be used. Unfortunately, it will be very difficult to instruct a computer to consult and analyse meanings from dictionaries in the same way as humans. However, if the universe of discourse can be represented by a set of concepts in a similar way to Roget, and if the set is complete, then meanings can be represented in terms of these concepts.

    6.3. Applications in conversational interfaces

    Most computer systems that provide some form of conversation do so with a limited and predefined vocabulary. Such systems can be improved by building a synonym front end [ref 24] (G. M. Gwei and E. Foxley, A Flexible Synonym Interface with application examples in CAL and Help Environments, The Computer Journal 30(6), pp 551-557, 1987); see also Wilensky et al [ref 25] (R. Wilensky, Y. Arens and D. Chin, Talking to UNIX in English: An overview of UC, Communications of the ACM 27(6), pp 574-593, 1984) for an example. The front end can recognise synonyms from any user by consulting the synonym command to convert the terms in the user's query into terms which are known by the original package. In this way, the user can enjoy a great flexibility in vocabulary in addition to the facilities offered by the package.
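
    A synonym front end of this kind can be sketched in a few lines. The synonym groups here are a hypothetical in-memory stand-in for a call to the synonym command, and the function name and sample vocabulary are assumptions made for illustration.

```python
def translate_query(query, synonym_groups, known_terms):
    """Replace each word of the query with a synonym the package knows, if any."""
    translated = []
    for word in query.lower().split():
        if word in known_terms:
            translated.append(word)
            continue
        # First known term (alphabetically) in any group containing the word;
        # fall back to the word itself when no group helps.
        replacement = next(
            (
                term
                for group in synonym_groups
                if word in group
                for term in sorted(group)
                if term in known_terms
            ),
            word,
        )
        translated.append(replacement)
    return " ".join(translated)


groups = [{"delete", "erase", "remove"}, {"file", "document"}]  # hypothetical
print(translate_query("Erase document", groups, known_terms={"remove", "file"}))
```

    Words that belong to no group pass through unchanged, so the underlying package still sees them and can reject or query them in its usual way.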

    This method can also be used in non-interactive environments. For example, a routine that performs script marking in a tutorial system by searching for keywords, as in Gwei [ref 26] (G. M. Gwei, Towards Automatic Teaching, MSc. Thesis, University of Aston in Birmingham, Sept. 1983), can be modified to consult synonym first. Thus marking would involve searching users' responses for synonyms to keywords of a model answer (for further details see Gwei [ref 27]).
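
    Synonym-aware marking can be sketched as below. The model answer, its synonym sets and the scoring rule (the fraction of keywords matched) are illustrative assumptions rather than the scheme of the tutorial system cited; punctuation handling is omitted for brevity.

```python
def mark_answer(answer, keyword_synonyms):
    """Score an answer by the model-answer keywords it matches.

    keyword_synonyms maps each keyword to a set of acceptable synonyms.
    Returns (fraction of keywords matched, list of matched keywords).
    """
    words = set(answer.lower().split())
    matched = [
        keyword
        for keyword, synonyms in keyword_synonyms.items()
        if words & ({keyword} | synonyms)
    ]
    return len(matched) / len(keyword_synonyms), matched


model = {  # hypothetical model answer
    "remove": {"delete", "erase"},
    "file": {"document"},
    "permanently": {"irreversibly"},
}
score, matched = mark_answer("rm will erase the document", model)
print(score, matched)
```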

    7. Conclusions

    The developments described above open up a wide range of new facilities. The major drawback at the moment in the use of the system as an authoring tool in a technical or research environment is that the thesaurus on which this work has been based is too general purpose for technical use; there is a need for additional thesaurus paragraphs representing technical concepts related to the subject area involved. Many technical areas use words from ordinary English, but with very specific technical meanings; this usage needs to be reflected by a technical thesaurus. Roget was chosen simply because of its availability. Even with the context considerations above, terms used in a specific technical context have little chance of being treated correctly. It is easy to imagine one computer system in which erase and delete were identical operations, but another system in which the first merely erased the contents of the file, and the second actually removed the file itself. It would be necessary to develop domain specific thesauri as replacements for, or as add-ons to, the generalised environment above.

    A lesser drawback of the present system, which could be overcome by a more careful implementation, arises because the searching mechanism above performs prefix and suffix manipulation only on the first and the last word of a phrase respectively. This results in the omission of some potentially acceptable phrases. For example, starting with "pay off", any paragraph containing "pays off" or "paying off" should be given a reasonable measure. Such derivations or inflexions are not considered by the present searching mechanism.
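
    One way the phrase limitation might be addressed is to inflect the head word of a phrase as well. The sketch below is an assumption about a possible extension, not part of the present system; it blindly generates candidates (including invalid ones such as "payed off"), which would then be accepted only if they actually occur in a thesaurus paragraph.

```python
def phrase_variants(phrase, suffixes=("s", "ing", "ed")):
    """Generate candidate inflexions of the first word of a multi-word phrase."""
    head, _, rest = phrase.partition(" ")
    if not rest:
        return []  # single words are handled by the ordinary suffix routine
    return [f"{head}{suffix} {rest}" for suffix in suffixes]


def variants_in_paragraph(phrase, paragraph_phrases):
    """Keep only those candidates that occur in the given paragraph."""
    return [v for v in phrase_variants(phrase) if v in paragraph_phrases]


print(variants_in_paragraph("pay off", {"pays off", "paying off", "settle"}))
```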

    The prefix and suffix manipulation routine does not handle all irregularities of inflexion in English grammar. For example, the system above cannot derive went from go, nor sought from seek. The inability to derive valid but peculiar variants of some words can lead to inaccurate ordering of the paragraphs under consideration. Thus a paragraph containing a word such as went but no other variant of the verb go would wrongly be omitted if go were the input word. Such problems can be tackled by developing more sophisticated heuristics which involve the sources and derivation routes of such words.

    Software availability

    The software developed as part of the above project relies heavily on our on-line copy of Roget's thesaurus. Longman are unwilling at the moment to allow it to be distributed to any other site. If readers know of a more freely available thesaurus, we would be happy to amend the software to use it, and then to distribute the system.

    Acknowledgements

    We wish to express our appreciation to: Longman for supplying us with an on-line version of Roget's Thesaurus; Dave Allsopp who contributed ideas to the options for the package; Andy Cheese and Mary Gwei, who both made useful comments on the original draft of this paper; and to various members of the Computer Science Department who contributed in one form or another.

    Notes

  • This research was supported in part by the Cameroon Government, B S grant (Cameroon Embassy, London).
  • Words with more than one inherent meaning.
  • The choice of the particular words and phrases chosen to appear in the index is an editorial decision, and changes significantly between editions.
  • We will define one word to be a variant of another if the two are morphologically related by inflexion or some other derivation process.
  • When used to express the concept of ending for instance, close can be both a noun similar to conclusion, and a verb similar to terminate.
  • In order to determine the intended meaning of flies in the sentence, an automatic syntax analyser/parser would need to try all possible sentence constructions, and would in addition require a special lexical data base. [ref 28]

    APPENDIX : Deriving other parts of speech from a given root

    The various inflexions (and some suffixal derivations) of a given word can be derived via considerations which include the following.

  • (1) Affixing valid suffixes (like -able; -ing; -er; -ed; -ly; -ness; -s; -y etc) to the original word. For example, bindable and binding are derivable from bind in this way.
  • (2) As in (1) but after the last letter of the original word has been duplicated. For example, transmitter and transmitted are derivable from transmit by this method.
  • (3) As in (1) but after the last letter (typically "y" or "e") has been stripped from the original word. For example, removing and removable are derivable from remove in this way.
  • (4) Replacing a frequently occurring suffix (like -ceed; -eive; -mit etc) with other suitable variants (like -cess; -eptive; -mission etc in this case). For example, successory and succession are derivable from succeed in this way.
  • (5) Replacing the letters which typically provide for variant spellings of most words (like centre and center, localise and localize, colour and color etc). Such substitutions also facilitate the derivation of irregular parts of speech for words in this category. For example, colour to coloration, or meter to metrification.

    This outline can guide the search for a given word (and its variants) from a list of words in a lexicon/dictionary; the search should accept only those terms derivable via any of the schemes above. By storing a list of valid suffixes both requirements (1) and (2) can be achieved. All the considerations above can be facilitated by a procedure involving the details in Table 9 below.
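
    Suffix rules (1) to (3) can be sketched as a candidate generator followed by a lexicon filter. The suffix list is the one quoted in rule (1); the function names and the sample lexicon are assumptions made for illustration.

```python
SUFFIXES = ("able", "ing", "er", "ed", "ly", "ness", "s", "y")


def suffix_candidates(word):
    """Generate candidate derivations of a word using rules (1) to (3)."""
    stems = {word}                  # rule (1): affix directly
    stems.add(word + word[-1])      # rule (2): duplicate the last letter
    if word[-1] in "ye":
        stems.add(word[:-1])        # rule (3): strip a final "y" or "e"
    return {stem + suffix for stem in stems for suffix in SUFFIXES}


def variants_in_lexicon(word, lexicon):
    """Accept only those candidates that occur in the lexicon."""
    return sorted(suffix_candidates(word) & set(lexicon))


lexicon = {  # hypothetical lexicon
    "bindable", "binding", "transmitter", "transmitted",
    "removing", "removable", "table",
}
for root in ("bind", "transmit", "remove"):
    print(root, variants_in_lexicon(root, lexicon))
```

    Generating freely and filtering against the lexicon keeps the rules simple: ill-formed candidates such as "bindness" are produced but never accepted.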

    In conjunction with the above, other forms of a word can be derived by using prefixes as follows:

  • (1) Neutral prefixes (like a-, be-, en-) can be affixed to a word. For example, ablaze, befit, enable are respectively derived from blaze, fit, able in this way.
  • (2) Neutral prefixes can be stripped from a root. This operates in the opposite sense to the above rule.
  • (3) Prefixes of similar meanings can replace each other. This can often permit a suffix change. For example, incessant and unceasing are derivable from each other in this way. Similarly, aeroplane and airplane can be derived from each other via prefix substitution. That is, any set of prefixes that have a similar meaning can substitute for each other to form variants of a word.
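
    Prefix rules (1) and (2) can be sketched in the same generate-and-filter style. The prefix list is the one quoted in rule (1); the function names and sample lexicon are illustrative assumptions, and rule (3), prefix substitution, is not covered because it needs a table of prefixes with similar meanings.

```python
NEUTRAL_PREFIXES = ("a", "be", "en")


def prefix_candidates(word):
    """Generate candidates by adding or stripping a neutral prefix."""
    added = {prefix + word for prefix in NEUTRAL_PREFIXES}       # rule (1)
    stripped = {
        word[len(prefix):]
        for prefix in NEUTRAL_PREFIXES
        if word.startswith(prefix) and len(word) > len(prefix)  # rule (2)
    }
    return added | stripped


def prefix_variants(word, lexicon):
    """Accept only those candidates that occur in the lexicon."""
    return sorted(prefix_candidates(word) & set(lexicon))


lexicon = {"ablaze", "blaze", "befit", "fit", "enable", "able"}  # hypothetical
for root in ("blaze", "fit", "able", "enable"):
    print(root, prefix_variants(root, lexicon))
```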

    Table 9: Suffix substitution rules

    Given Unique Possible Minimum
    suffix substitute suffixes word length
    _
    -ain -an -anation,-anatory 5
    -ain -ic -icit,-icable 5
    -ance,-ancy -abl -able,-ability 6
    -ance,-ancy -ate -ate 6
    -ate -ab -able,-ability 4
    -ate -ac -ace,-acy 4
    -ate -an -ancy,-ant 4
    -ay -aid -aid 3
    -ce,-cy -ct -ctive,-ction 4
    -ce -se -se 4
    -de,-dy -si -sion,-sive 4
    -e,-y -ab -able,-ability 4
    -e,-y -ag -age,-aging 4
    -e,-y -al -ally,-allable 4
    -e,-y -anc -ance,-ancy 4
    -e,-y -ant -antion,-antive 4
    -e,-y -ia -ial,-ian 4
    -e,-y -ib -ible,-ibility 4
    -e,-y -ic -ics,-icy 4
    -e,-y -ie -ier,-ied 4
    -e,-y -il -ily,-ility 4
    -e,-y -in -ing,-iness 4
    -e,-y -io -ion,-ious 4
    -e,-y -is -ist,-ise 4
    -e,-y -it -ity,-itive 4
    -e,-y -iv -ive,-iving 4
    -e,-y -iz -ize,-izing 4
    -e,-y -or -ory,-ors 4
    -e,-y -ou -our,-ous 4
    -e,-y -ua -ual,-uate 4
    -e,-y -ul -ular,-ully 4
    -e,-y -ur -ure,-ury 4
    -eau -aux -aux 6
    -ede,-eed,-edy -ess -essive,-ession 4
    -eem -empt -emptive,-emption 4
    -eive -eip -eipient,-eipt 6
    -eive -eit -eits,-eitive 6
    -eive -ept -eptive,-eption 6
    -er -re -re,-res 4
    -ex -ic -icate,-ices 4
    -f,-fe -ve -ves,-ving 4
    -ia -ic -ic,-icity 6
    -ia -is -ist,-isy 5
    -ia -ite -ites,-ited 5
    -ie -ying -ying 3
    -ind -ound -ound,-oundation 5
    -laim -lam -lamation,-lamatory 5
    -le -il -ility 5
    -le -ul -ular 5
    -mit -miss -mission,-missive 5
    -nce,-ncy -nd -nd,-nds 5
    -nce,-ncy -nt -nt,-nts 5
    -nd,-nt -nc -nce,-ncy 5
    -nd,-nt -ns -nsive,-nsion 5
    -nd -nt -nt,-nts 5
    -nt -nd -nd,-nds 5
    -olve -ub -uble,-ubility 4
    -olve -ut -ute,-ution 4
    -orb -orpt -orption,-orptive 5
    -our -or -oration,-orimetry 6
    -pt -psi -psion,-psions 5
    -re -er -er,-ers 4
    -rer -re -re,-red 5
    -ribe -ript -riptive,-ription 6
    -se,-ze -st -st,-sts 5
    -se,-ze -ti -tive,-tion 5
    -se -ce -ce,-ces 5
    -se -ze -ze,-zed 5
    -tain -ic -ics,-icity 5
    -tain -ten -tenance,-tenatory 5
    -um -a -a 3
    -ume -ump -umption,-umptive 5
    -us -i -i 4
    -vert -version -version,-versions 5
    -x -ct -ction,-ctive 4
    -ze -se -se,-ses 5
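
    The use of Table 9 can be sketched by reading each row as follows: a word of at least the minimum length which ends in the given suffix may have that suffix replaced by each of the possible suffixes. This reading is an assumption, checked only against the four rows copied below, and the function name is hypothetical. As elsewhere in this appendix, the candidates would then be accepted only if they occur in the lexicon.

```python
# (given suffixes, possible suffixes, minimum word length), from Table 9.
RULES = [
    (("ain",), ("anation", "anatory"), 5),
    (("mit",), ("mission", "missive"), 5),
    (("our",), ("oration", "orimetry"), 6),
    (("ribe",), ("riptive", "ription"), 6),
]


def substitute_suffix(word):
    """Generate candidate derivations by suffix substitution."""
    candidates = []
    for given, possible, min_length in RULES:
        if len(word) < min_length:
            continue  # word too short for this rule to apply
        for suffix in given:
            if word.endswith(suffix):
                stem = word[: -len(suffix)]
                candidates.extend(stem + new for new in possible)
    return candidates


for root in ("explain", "transmit", "colour", "describe"):
    print(root, substitute_suffix(root))
```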

    References

