Eric Foxley and Godwin M Gwei
Computer Science Department
Synonymy occurs when several different words can represent similar
In text authoring, we may wish to vary our vocabulary by the use of synonyms to arouse the interest of the reader, or to add emphasis to a topic; and we will generally wish to avoid ambiguity by the choice of non-polysemous words, or by the addition of enough context clues to resolve the ambiguity.
In interactions with computers, the aspects of input and output are distinct. Where the user gives input to the computer, it should be able to recognise the user's vocabulary, and accept freely generated citations representing the information required. Any ambiguous construction entered by the user should be queried. When giving output to the user, the computer may either use synonyms to make the conversation more varied, or may use only one from any group of synonyms to encourage the user into a more restricted vocabulary; and computer output should be chosen to be non-ambiguous.
The paper describes the development of a suite of computer programs to determine and reduce ambiguity in text, and to enable the computer to correctly relate a variety of synonyms to a single concept.
Ambiguity, Concept, Natural language, Polysemy, Roget, Synonymy, Thesaurus.
We will divide our discussion here between the applications to text authoring, and to consultative computing.
In text authoring, we are concerned with the preparation of natural language documents. These may be for printing, or they may be computer documents, such as lessons for computer assisted learning (CAL) systems. One of the problems encountered when writing documents relates to the choice of vocabulary. It is possible that ideas do not come to mind in the most appropriate terms; one term may come to mind, but the author might have the feeling that it is not necessarily the best term for a particular situation. When this happens, authors will refer to external sources for help, typically synonym dictionaries. These provide the author with alternative terms that are synonymous with the original term, and experience will then help in recognising the most appropriate term from the given alternatives. For less experienced authors, the set of alternative terms might not be very helpful. Each of the terms might have to be looked up in an ordinary dictionary to determine its meaning before a decision can be made. This search is tedious even when the set of alternatives is small.
A related problem in choosing the vocabulary for a document may be that we particularly wish to use an interesting and varied vocabulary. The variety is partly to add interest for the reader, and partly to add emphasis by using more than one word of similar meaning.
In producing any technical document, we would wish to be unambiguous wherever possible. We would thus like a program which could tell us which words in the text are ambiguous (the meaning of some polysemous words may be resolvable by context, so that they are no longer ambiguous), and if possible would suggest less ambiguous alternatives.
In ordinary conversation, synonyms are usually resolved by experience in a given domain. For example, in a class-room situation, teachers always need to resolve synonyms while handling student queries, or marking student scripts. If a teacher expects the word large as part of a response, for example, either of the words big or huge would in general be an acceptable substitute. In trying to use a computer in activities involving conversation with human users, we also need to include mechanisms for handling synonymy in the way people would expect when participating in a human conversation. In a command system, for example, a user should with equal convenience be able to ask to delete a file, or remove it, or erase it. In computer assisted learning, the answer bigger should be allowed even if the course author had specified the word larger.
For computer input, this implies that the communication interface of any computer package should accept a variety of different terms from the user to represent any given concept. If a given term is ambiguous in spite of any context information which is available, the computer should request clarification. For output, if the computer can also vary its own vocabulary, this will add to its user friendliness compared with any system which repeats and expects exactly the same word every time a given concept is being discussed.
Synonyms arise as an artifact of the evolution of natural language; a variety of terms may often be used to supply a group of similar meanings. Two points need to be made. Firstly, few pure synonyms exist in the sense that two words can represent exactly the same meaning; groups of "synonymous" words usually carry subtle differences of emphasis. Secondly, a pair of words may be synonymous in one context, but not in another. This poses problems in any mechanical detection of synonyms. However, the need to detect and utilise synonyms already plays a role in many computer applications: information retrieval systems [ref 1: G. Salton, Automatic Information Organisation and Retrieval, McGraw-Hill, New York (1968)], machine translation [ref 2: K. Sparck Jones, Synonymy and Semantic Classification, Ph.D. Thesis, Cambridge University, England (1964)] [ref 3: M. Masterman, R. Needham, K. Sparck Jones, B. Mayoh, Agricola Terram Dimovit Aratro, ML 92, Cambridge Language Unit, Cambridge, England (1957)], on-line help systems [ref 4] [ref 5: E. L. Rissland, Ingredients of intelligent user interfaces, International Journal of Man-Machine Studies 21, pp 377-388 (1984)] [ref 6: R. Wilensky, Y. Arens, D. Chin, Talking to UNIX in English: An overview of UC, Communications of the ACM 27(6), pp 574-593 (1984)], and user interfaces [ref 7: G. M. Gwei, E. Foxley, A Flexible Synonym Interface with application examples in CAL and Help Environments, The Computer Journal 30(6), pp 551-557 (1987)]. The authors would suggest additionally that synonyms can also be helpful in all computing environments which expect any form of information from users. For example, the synonyms to command names can be used to provide users with a more flexible vocabulary than is permitted by a typical command language. Another example is in CAL environments in which script marking is performed by searching for keywords (see, for example, [ref 8: G. M. Gwei, Towards Automatic Teaching, MSc. Thesis, University of Aston in Birmingham (Sept. 1983)]). The effectiveness of such systems can be improved if they search additionally for synonyms of the relevant keywords.
A number of synonym dictionaries exist, typified by those named after Webster, Crabb, and Roget. Webster's Synonym Dictionary [ref 9] takes each word (in a lexicographic order) and lists its synonyms followed by its antonyms. Crabb's English Synonyms [ref 10] similarly lists synonyms to words, but additionally gives examples of appropriate contexts for each synonym in the list. Unlike these two, Roget's Thesaurus [ref 11] is not word-based; it is based on concepts, taking a concept at a time and listing terms which express that concept. There is one paragraph for each concept; each concept paragraph is sub-divided into different parts of speech, and within a part of speech, words are grouped together by the degree of synonymity between them.
Roget's Thesaurus has a richer vocabulary than either Crabb's or Webster's dictionary. We will concentrate on Roget's, primarily because it is concept-based and contains a rich vocabulary. The work below is based on a machine-readable version of Roget which was obtained from Longman.
Roget's Thesaurus is made up of approximately a thousand paragraphs, each of which is dedicated to a single concept. Each paragraph concentrates on the numerous terms that express the given concept, and is labelled with a headword which gives an instant idea of the concept concerned. The contents list at the front of the thesaurus gives the headword for each concept, and shows the theoretical relationship between them. In theory, one could look up a meaning here, and determine the paragraph in which to find the relevant words. In practice, it is easier to think of a related term (word or phrase), and then to look that word or phrase up in the index which forms the second half of the printed thesaurus. The index contains a lexicographically ordered list of most of the words and phrases that occur in the separate paragraphs of the thesaurus. Each such entry is followed by a list of the numbers and headwords of each concept paragraph in which it occurs. Thus, the first procedure in using Roget's thesaurus manually involves deciding on a word which might be used to express the concept being considered. The user can then obtain help by means of the following steps.
The problems involved in searching for synonyms in this way include the following.
| Term | Number of references | Number of distinct concepts |
| _ | | |
| charge | 40 | 29 |
| head | 35 | 29 |
| round | 35 | 29 |
| close | 35 | 30 |
| run | 36 | 30 |
| drop | 40 | 31 |
| turn | 39 | 31 |
| stand | 37 | 32 |
| clear | 36 | 33 |
| line | 39 | 33 |
| catch | 39 | 35 |
| pass | 40 | 37 |
| strain | 49 | 41 |
| cut | 53 | 45 |
| set | 56 | 47 |
The computer system described below gives assistance in finding synonyms in a way which represents a considerable improvement on the manual use of the thesaurus mentioned above. We then explore other ways of utilising the semantic classification available in a concept-based thesaurus.
We have implemented a synonym-generation program synonym based on our thesaurus [ref 12: G. M. Gwei, The Roget Environment (Synonym Generation), Internal report, Computer Science Group, Nottingham University (Dec. 1984)]. In this section we describe some design considerations and features of the implementation.
By definition, variants of the same stem have similar meanings. Consequently, it is sufficient to establish a similarity in the meanings of any two variants from two separate stems to establish synonymy between the stems concerned. For example, it would be naïve to preclude the synonyms of angular from those of angle. Less obvious examples include leaving out the synonyms of encode while considering codify, or solar while dealing with sun.
The printed version of Roget's Thesaurus contains an index to facilitate searching. We obtained from publishers Longman the actual body of the thesaurus (a file of size approximately 3 Mbytes) without the index supplied in the printed version. The generation of a full inverted index was non-trivial, but the end result was a much more comprehensive index than that in the printed copy of the thesaurus.
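The generation of an inverted index can be illustrated with a minimal sketch. It assumes the thesaurus body has already been parsed into a mapping from paragraph number to the terms in that paragraph; this layout and the function name `build_inverted_index` are hypothetical and do not describe the format of the Longman file or the authors' program. The toy entries follow Table 3 for the terms shown.

```python
from collections import defaultdict


def build_inverted_index(paragraphs):
    """Map each term to the sorted list of paragraph numbers containing it.

    `paragraphs` is assumed to be {paragraph_number: iterable of terms};
    this layout is an illustration, not the format of the Longman file.
    """
    index = defaultdict(set)
    for number, terms in paragraphs.items():
        for term in terms:
            index[term.lower()].add(number)
    return {term: sorted(numbers) for term, numbers in index.items()}


# Toy data: a small subset of the memberships listed in Table 3.
toy_paragraphs = {
    62: ["code", "codify", "codification"],
    520: ["encode", "encoder"],
    917: ["code of duty", "code of honour"],
    929: ["code", "code of honour"],
}
index = build_inverted_index(toy_paragraphs)
print(index["code"])            # paragraphs containing "code" in the toy data
print(index["code of honour"])  # phrases are indexed as whole terms
```

Indexing whole phrases as well as single words is one way to make such an index more comprehensive than a printed one, at the cost of a larger file.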
The search phase involves heuristics for deriving acceptable words and phrases from any given input. The heuristics rules (see [ref 13] for details) include:
For example, given malfunction, the heuristics for prefixes would automatically include nonfunctioning, not functioning etc. as search terms. Similarly, prefixes such as be-, en-, a- etc., which mainly introduce other parts of speech, can simply be added to a given word to form others. The rules for deriving other parts of speech through suffix manipulation (see Appendix) are based on an associative model derived from Porter's algorithm on word conflation [ref 14: M. F. Porter, An algorithm for suffix stripping, Program 14(3), pp 130-137 (July 1980)]. After manipulating prefixes and suffixes of an input term, all resulting terms are sorted, and each term is then looked up in the inverted index. Any derived term not found in the search is discarded. As an example, starting with the call synonym code, the prefix routine would suggest the list in Table 2.
| code |
| accode |
| acode |
| aerocode |
| aircode |
| becode |
| concode |
| encode |
| extracode |
| subcode |
| supercode |
| ultracode |
| welcode |
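The prefix step and the subsequent filtering against the inverted index can be sketched as follows. The prefix list is inferred by splitting the entries of Table 2 and is an assumption about the real rule set; `prefix_candidates` and `found_terms` are hypothetical names, and the toy index holds only two of the entries of Table 3.

```python
# Prefixes inferred from Table 2 (the empty string keeps the word itself).
# The rule set actually used by the synonym program may differ.
PREFIXES = ["", "ac", "a", "aero", "air", "be", "con", "en",
            "extra", "sub", "super", "ultra", "wel"]


def prefix_candidates(word, prefixes=PREFIXES):
    """Attach every prefix to the word, valid English or not."""
    return [prefix + word for prefix in prefixes]


def found_terms(candidates, index):
    """Keep only candidates present in the index; the rest are discarded."""
    return sorted(c for c in candidates if c in index)


# Toy index: two entries taken from Table 3.
toy_index = {"code": [62, 81], "encode": [520, 525]}

candidates = prefix_candidates("code")
kept = found_terms(candidates, toy_index)
print(candidates)
print(kept)
```

Generating freely and then discarding whatever the index does not contain keeps the prefix rules simple: the index, rather than the rules, decides which derivations are real words.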
Using the suffix rules, the system would search and display all acceptable suffix derivations from the terms in Table 2 that can be found in the inverted index. All phrases beginning with any of the acceptable terms would also be given in the display. Table 3 shows the display that would be obtained for the word code as parameter.
| Found word/phrase | Paragraph Number(s) |
| _ | |
| "code of duty" | 917 |
| "code of honour" | 917, 929 |
| "code" | 523, 525, 525, 530, 547, 586, 62, 81, 693, 929 |
| "coded" | 523, 525 |
| "codification" | 62, 953, 953 |
| "codified law" | 953 |
| "codified" | 525 |
| "codify" | 62 |
| "encode" | 520, 525 |
| "encoder" | 520 |
| _ | |
| No. of references | 14 (12 unique) |
Each number corresponds to a paragraph of the thesaurus in which the term on the left appears; it will be observed that a given term may be found in several paragraphs, and that a given paragraph may appear against several terms.
Seven of the terms in Table 3 arise from the use of affixes (prefixes and suffixes), and two terms are common phrases beginning with the word code. A summary of the effect of acceptable affixes and phrases on the most ambiguous words in the thesaurus is shown in Table 4.
| Term | Words (excluding affixes) | Phrases (excluding affixes) | Words (including affixes) | Phrases (including affixes) |
| _ | | | | |
| charge | 40 | 53 | 47 | 57 |
| head | 35 | 57 | 132 | 184 |
| round | 35 | 68 | 56 | 82 |
| close | 35 | 82 | 56 | 107 |
| run | 36 | 169 | 110 | 280 |
| drop | 40 | 92 | 51 | 98 |
| turn | 39 | 193 | 55 | 208 |
| stand | 37 | 142 | 79 | 190 |
| clear | 36 | 71 | 52 | 79 |
| line | 39 | 60 | 64 | 82 |
| catch | 39 | 80 | 44 | 85 |
| pass | 40 | 118 | 96 | 183 |
| strain | 49 | 60 | 55 | 66 |
| cut | 53 | 135 | 70 | 151 |
| set | 56 | 216 | 74 | 233 |
In manual usage of Roget, a user looks at headwords to decide which paragraph is most relevant. In our computer implementation, we wish to provide users with more help via the use of relevance and confidence measures. These measures depend on the number of occurrences, the degree of suffix/prefix manipulation used in deriving the term for each occurrence, and whether the occurrence is a phrase containing a derivation or the derivation by itself. To compare the relevance measures, we assign a value to each of the derived terms according to the procedure used in its derivation. The value assigned to each derivation depends on the error associated with the affix (prefix or suffix) used; these error values were obtained by experimenting on the separate affix categories [ref 15]. The experiment yielded error values which approximate the probability that the use of a given affix would result in a word with a different morphological root. Given $\mathrm{Error}_{\mathrm{aff}}$, the percentage error associated with a given affix category, the value given to a term found after using that affix would be [equation missing]. The value assigned to a phrase containing any derivation is half the value associated with the derivation alone; we assumed a 50% chance that other words in the phrase may change the meaning of a given word.
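The halving rule for phrases can be illustrated with a short sketch. The formula relating a term's value to its affix error does not survive in the text, so the sketch assumes, purely for illustration, that the value is 100 minus the percentage error; both that formula and the name `term_value` are assumptions, not the authors' definition.

```python
def term_value(error_percent, in_phrase=False):
    """Value of a derived term given the error of the affix used.

    ASSUMPTION: value = 100 - error_percent. The source formula is missing,
    so this is only one plausible reading. The halving for phrases is
    stated in the text.
    """
    if not 0 <= error_percent <= 100:
        raise ValueError("error_percent must be between 0 and 100")
    value = 100.0 - error_percent
    return value / 2 if in_phrase else value


print(term_value(0))                   # an unmodified word
print(term_value(20, in_phrase=True))  # a derived word found inside a phrase
```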
To guide the user in choosing pertinent paragraphs we opted to summarise the values from the search as follows.
If Max is the highest individual score obtained in a search phase, and
The relevance value for a paragraph determines its ordering among all the paragraphs considered, while the confidence value for a paragraph is a measure of the certainty of its meaning (relative to the word given).
| Label | Para. No. | Relative Relevance | Confidence | Headword |
| _ | | | | |
| a | 525 | 100% | 100% | "Concealment" |
| b | 62 | 47% | 100% | "Arrangement: reduction to order" |
| c | 929 | 44% | 100% | "Probity" |
| d | 523 | 38% | 100% | "Latency" |
| e | 81 | 29% | 100% | "Rule" |
| f | 530 | 29% | 100% | "Secret" |
| g | 547 | 29% | 100% | "Indication" |
| h | 586 | 29% | 100% | "Writing" |
| i | 693 | 29% | 100% | "Precept" |
| j | 520 | 21% | 75% | "Interpretation" |
| k | 917 | 14% | 50% | "Duty" |
| l | 953 | 6% | 30% | "Legality" |
| _ | ||||
| Choose a letter, from "a" to "l"; | ||||
| or any combination of the above letters; | ||||
| or "A" for all; or "N" for none; | ||||
| or "%<x>" for all paragraphs with relative relevance ≥ <x>% | ||||
| or "%c<x>" for all paragraphs with confidence ≥ <x>% | ||||
| where <x> is a number in the range 1 to 100 | ||||
| Type your choice : | ||||
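The selection menu above can be read as a small input grammar. The sketch below is one possible interpretation of it, using the rows of Table 5 as data; the function name `select_paragraphs`, the treatment of a letter combination as an unseparated string such as "bk", and the error handling are all assumptions rather than the behaviour of the synonym program.

```python
import re

# Rows from Table 5: label -> (paragraph, relative relevance %, confidence %).
ROWS = {
    "a": (525, 100, 100), "b": (62, 47, 100), "c": (929, 44, 100),
    "d": (523, 38, 100), "e": (81, 29, 100), "f": (530, 29, 100),
    "g": (547, 29, 100), "h": (586, 29, 100), "i": (693, 29, 100),
    "j": (520, 21, 75), "k": (917, 14, 50), "l": (953, 6, 30),
}


def select_paragraphs(choice, rows=ROWS):
    """Return the paragraph numbers selected by a menu response."""
    choice = choice.strip()
    if choice == "A":                      # all paragraphs
        return [row[0] for row in rows.values()]
    if choice == "N":                      # none
        return []
    match = re.fullmatch(r"%(c?)(\d+)", choice)
    if match:                              # "%<x>" or "%c<x>" threshold
        threshold = int(match.group(2))
        if not 1 <= threshold <= 100:
            raise ValueError("threshold must be in the range 1 to 100")
        column = 2 if match.group(1) else 1
        return [row[0] for row in rows.values() if row[column] >= threshold]
    if choice and all(letter in rows for letter in choice):
        return [rows[letter][0] for letter in choice]
    raise ValueError(f"unrecognised choice: {choice!r}")


print(select_paragraphs("bk"))
print(select_paragraphs("%40"))
print(select_paragraphs("%c75"))
```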
In addition to the factors above, we opted to also display the headwords of all the paragraphs considered. Each headword gives the user an instant idea of the concept covered by the corresponding paragraph. It also allows the user to mentally discard irrelevant paragraphs.
Continuing with the output of the command synonym code, the information in Table 5 is displayed to the user after the evaluation of the relevance and confidence values. Thus, paragraph 525 on Concealment, with a relative relevance of 100%, and a confidence of 100%, is considered to be the most relevant (for synonyms to code). The user might however see a more pertinent headword that better matches his/her original context. The choice is left open with this interactive method.
As depicted in Figure 1 (section 3), many terms have more than one reference corresponding to distinct meanings and separate parts of speech. In requesting synonyms to a term, a user may not be interested in all potential meanings of the term (polysemy!). The headword information in Table 5 can enable a user to choose synonyms of the desired meaning (from the pertinent paragraphs). However, in this mode of operation the amount of information displayed can be bulky, especially when several derivations of the term exist or for words with high ambiguity factors (as shown in Table 4).
Bulky output can be avoided by specifying the context within which to locate synonyms. Before requesting synonyms to a polysemous term, a user may already know some simple and common characteristics associated with the meaning desired. Such characteristics can include the following.
These characteristics (which are all exploited in Roget's classification) can be used to discard less relevant concepts while searching for synonyms. For example, mathematics can be specified as a context parameter for synonyms to set using the command
synonym set -c mathematics
The number of paragraphs considered would now be reduced to 3 (on "NUMERATION", "TRUTH", and "REASONING") as opposed to 233 potential paragraphs (Table 4) for set. In this mode, synonym searches for the input parameter as before (but hidden); assigns weights to found paragraphs as above; carries out a similar procedure for an exact antonym of the input (if one exists and can be derived); performs a similar procedure for the context delimiter and its exact antonym; and combines the weights arising from the separate terms using a set of coefficients. The coefficients approximate the relationships (revealed by Roget's classification) that exist between the given parameter and the context term. The relationships considered are approximated by values as indicated in Table 6.
| value | relationship between terms | |
| _ | ||
| 4 | terms are derivations of each other | |
| 3 | terms are synonymous (co-occur in a synonym group) | |
| 2 | both terms express aspects of a single (wide!) domain (co-occur in one paragraph) | |
| 1 | both terms express aspects of related domains (co-occur in inter-related paragraphs) | |
| 0 | unknown relationship | |
| -1 | one term and the antonym of the other express aspects of related domains | |
| -2 | one term and the antonym of the other express aspects of a single domain | |
| -3 | both terms are antonymous (the antonym of one term is synonymous to the other term) | |
| -4 | one term is an exact antonym of the other term (agglutinated) | |
The non-negative values approximate the degree to which the context parameter supports a particular meaning of the given term, and the negative values approximate the degree to which the context parameter rejects a particular meaning. For example, the call
synonym code -c computer
would list only synonyms from paragraph 62 on "Arrangement: reduction to order". In the background however, 12 paragraphs would be considered for code (Table 3), 2 for decode, 8 for computers, and none for the antonyms to computers. Using the coefficients above to combine the separate weights would result in the rejection of all the other paragraphs except paragraph 62 (in which a meaning overlap between the terms occurs).
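The effect of a context parameter can be approximated, very crudely, by keeping only the paragraphs in which the term and the context word overlap. The sketch below does just that and ignores the coefficients of Table 6 and the antonym searches, so it is a deliberate simplification and not the weighting scheme described above. The paragraph set for code comes from Table 3; the set for the context word uses placeholder numbers (9001 upwards, outside Roget's range) because the real paragraphs for computer are not listed in the text.

```python
# Paragraphs in which code or its derivations occur (Table 3).
CODE_PARAGRAPHS = {62, 81, 520, 523, 525, 530, 547, 586, 693, 917, 929, 953}

# HYPOTHETICAL context set: only 62 is known from the text to be shared;
# the other numbers are placeholders, not real Roget paragraphs.
COMPUTER_PARAGRAPHS = {62, 9001, 9002}


def filter_by_context(term_paragraphs, context_paragraphs):
    """Keep the term's paragraphs that also contain the context word."""
    return sorted(term_paragraphs & context_paragraphs)


print(filter_by_context(CODE_PARAGRAPHS, COMPUTER_PARAGRAPHS))
```

A plain intersection corresponds only to the non-negative rows of Table 6 where both terms co-occur in one paragraph; related paragraphs and antonym evidence would need the weighted combination.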
In summary, the heuristics for deriving antonyms operate as follows.
In any of the steps above, suffixes can also be manipulated. For example, useless would also be considered as a pure antonym to useful.
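The suffix manipulation mentioned for antonyms can be sketched with the one pair the text names, useful and useless. The swap table below contains only that -ful/-less pair; any further pairs, the function name `suffix_antonyms`, and the use of a plain set as the lexicon are assumptions made for illustration.

```python
# Only the -ful / -less pair is taken from the text (useful, useless).
SUFFIX_SWAPS = [("ful", "less"), ("less", "ful")]


def suffix_antonyms(word, lexicon):
    """Propose antonyms by swapping a suffix, keeping only known words."""
    proposals = []
    for old, new in SUFFIX_SWAPS:
        if word.endswith(old):
            candidate = word[: -len(old)] + new
            if candidate in lexicon:
                proposals.append(candidate)
    return proposals


lexicon = {"useful", "useless", "grateful"}
print(suffix_antonyms("useful", lexicon))
print(suffix_antonyms("grateful", lexicon))  # "grateless" is not in the lexicon
```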
In a typical utterance, just one of the potential meanings of a polysemous word is intended. Disambiguation is the process of determining that intended meaning.
Many of the approaches reported in the literature use other words that co-occur with the polysemous word in an utterance to disambiguate. Milne's approach [ref 16: R. Milne, Resolving Lexical Ambiguity in a Deterministic Parser, Computational Linguistics 12(1) (1986)] makes use of syntactic tokens and morphological analysis to assign the right part of speech to ambiguous words. Lesk [ref 17: M. E. Lesk, Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone, Proceedings of 1986 ACM SIGDOC conference, Toronto, Canada (June 1986)] counts the overlaps in the dictionary definitions of the keywords in an utterance, and chooses the definition(s) with the highest overlaps as the intended meaning for the ambiguous word(s). Schank [ref 18: R. Schank, Identification of Conceptualizations Underlying Natural Language, Computer Models of Thought and Language, W. H. Freeman and Co, San Francisco (1973)] conceptualises by using any characteristics of other words in a sentence that can discard some of the meanings associated with the ambiguous word in context. Sparck Jones [ref 19: K. Sparck Jones, Synonymy and Semantic Classification, Ph.D. Thesis, Cambridge University, England (1964)] argues that semantic affinity usually exists between the words used in any discourse, thus disambiguation can be achieved by adopting the meaning(s) that recur(s) among the words used in the discourse.
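The definition-overlap idea attributed to Lesk can be sketched in a few lines. This is a simplified version of that technique, not the method of this paper: each sense of the ambiguous word is scored by the number of content words its definition shares with the definitions of the neighbouring words. The glosses and the stop-word list below are invented for illustration and are not taken from any dictionary.

```python
STOP_WORDS = {"a", "an", "the", "of", "or", "in", "for", "to", "and", "is"}


def content_words(text):
    """Lower-cased words of a definition, minus a few function words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def best_sense(senses, context_glosses):
    """Pick the sense whose gloss overlaps most with the context glosses."""
    context = set()
    for gloss in context_glosses:
        context |= content_words(gloss)
    scores = {name: len(content_words(gloss) & context)
              for name, gloss in senses.items()}
    return max(scores, key=scores.get), scores


# Invented glosses for "cone", and for the neighbouring word "pine".
CONE_SENSES = {
    "geometry": "solid body which narrows to a point",
    "fruit": "fruit of certain evergreen trees",
    "food": "crisp wafer holding ice cream",
}
PINE_GLOSSES = ["kind of evergreen tree with needle shaped leaves"]

sense, scores = best_sense(CONE_SENSES, PINE_GLOSSES)
print(sense, scores)
```

Note that "tree" and "trees" do not match here; some form of stemming would be needed before counting overlaps on real definitions.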
Whilst reading, a human reader usually disambiguates word meanings by using the context within which the words appear. We can extend the synonym command to attempt word disambiguation in a similar manner. That is, by developing a method of choosing context parameters from an utterance, synonym can be made to search for the meaning (paragraph headword) of a given term demarcated by the context parameters (as in section 4.3).
The selection of context parameters from any utterance operates as follows.
Human readers can be assisted in understanding any written utterance by external circumstances; but good writers do not rely on this extraordinary ability. Many utterances contain most of the information required for their interpretation, i.e. the words used in an utterance help to resolve any inherently ambiguous words in it. However, some words are more likely to be context delimiters than others. For example, in an utterance such as
time flies like an arrow
the word an is less likely to be a useful context clue in the disambiguation of the word flies. Thus, such words can be eliminated from the utterance while searching for context clues. After the elimination of unclassified words, each of the remaining words can potentially be used as the context parameter; or some of the keywords may be more significant than others; or a combination of some of the words may jointly prescribe the context parameter. In an utterance such as
the man with reading glasses loves wine
for example, reading is more significant in determining the intended meaning of glasses than any of the other words; omitting the word reading from the utterance can have a drastic effect on the disambiguation of the word glasses, even to human readers. From utterances of this type, it seems likely that where more than one potential context parameter prevails, those closer (in sequence) to the ambiguous word may be more significant. This significance is approximated in our model by
Many documents (especially technical documents) contain direct clues to help human readers disambiguate. Typical disambiguation clues in Roget's Thesaurus which we have used include the following.
Any of these clues can potentially shorten the disambiguation process: whenever a direct clue is found it takes precedence over all other clues that may be present. However, if no relationship can be established between the clue and any meaning associated with the ambiguous word, all other keywords would be taken as clues (as above).
In some utterances, unclassified words can be the main source of semantic clues. Our approach utilises such clues in the following ways.
The disambiguation program was tested on a sample of 400 sentences, each of which contained at least one polysemous word. The sentences were chosen from examples cited in the literature on disambiguation and suggestions from colleagues. Table 7 shows some of the sentences alongside the performance of the program on each sentence. The program diagnosed the intended meaning correctly in 304 sentences (76%); reported the absence of adequate clues in 76 sentences (19%); and reported the wrong meaning in 20 sentences (5%). Of the meanings diagnosed incorrectly, 80% contained at least one other polysemous word. On average, it took 2 seconds to disambiguate a word in a sentence context.
| Sentence | Program output: meaning (headword) | Program output: confidence | Time in secs |
| _ | | | |
| time flies* like an arrow | |||
| fruit flies* like a banana | |||
| house flies* like a banana | |||
| the man with reading glasses* loves wine | |||
| the man with glasses* loves wine | |||
| his glasses* were filled with sherry | |||
| this book is the fruit* of great effort | |||
| they all had fruits* after dinner | |||
| the function* is in honour of his birthday | Celebration | 100% | 6.7 |
| the function* has more than one argument | |||
| the function* in his honour had arguments | Celebration | 64% | 7.0 |
| the function* has many arguments | |||
Note: "*" indicates the word being disambiguated in each sentence.
To justify the existence of direct clues, we used the collection of all descriptive names for commands in the Unix operating system (in the file /usr/lib/whatis on Unix 4.2BSD). The results obtained are summarised in Table 8. In the light of the experiment, 69% of the occurrences of the special characters "/", "(" and ")", and the special conjunctions "or" and "and" can justifiably be used as disambiguation clues for polysemous words.
| Symbol | Frequency | % intended as disambiguation clues |
| _ | | |
| / | 36 | 83 |
| () | 8 | 62.5 |
| or | 27 | 66.67 |
| and | 51 | 66 |
The approach above relies on the co-occurrence of word derivations in a synonym group; in a concept paragraph; or in any of a set of reciprocally related paragraphs (all on similar abstract concepts) of Roget's Thesaurus. The approach compares with other disambiguation schemes in the following ways.
the old man's glasses were filled with sherry
(which is also cited in Schank's paper); sherry also occurs in paragraph 301.
Our approach also weights the clues given by each word in an utterance (depending on the position of the word relative to the ambiguous word). Such consideration is absent from any of the approaches cited above.
We use the word environment here to represent a computer system into which the Roget-based commands described above have been built.
The synonym generation process described in section 4 above can be used to simplify an author's task of finding the most relevant term to express a given concept. A suitable combination of flags would present an author with a small set of apposite alternatives; and it is even possible to accept incorrect English, as in misfunction, and suggest correct alternatives (such as dysfunction, functionless, malfunction, not function, no function, nonfunctional, nonfunctioning) by utilising the embedded heuristics. Thus the system can diagnose that a given user word is incorrect, can determine what the user probably intended, and can suggest terms which are more likely to correctly express that intention, without losing the root word from the user.
The disambiguation routines can also be used on an authoring package in the following ways.
An obvious goal of all natural language systems is to represent word meanings on computers. This process can be an adjunct (or preprocessor) to any package which analyses users' terms. For human understanding, an ordinary dictionary would be used. Unfortunately, it will be very difficult to instruct a computer to consult and analyse meanings from dictionaries in the same way as humans. However, if the universe of discourse can be represented by a set of concepts in a similar way to Roget, and if the set is complete, then meanings can be represented in terms of these concepts.
Most computer systems that provide some form of conversation usually do so with a limited and predefined vocabulary. Such systems can be improved by building a synonym front end [ref 24: G. M. Gwei, E. Foxley, A Flexible Synonym Interface with application examples in CAL and Help Environments, The Computer Journal 30(6), pp 551-557 (1987)] (also see Wilensky et al [ref 25: R. Wilensky, Y. Arens, D. Chin, Talking to UNIX in English: An overview of UC, Communications of the ACM 27(6), pp 574-593 (1984)] for an example). The front end can recognise synonyms from any user by consulting the synonym command to convert the terms in the user's query into terms which are known by the original package. In this way, the user can enjoy a great flexibility in vocabulary in addition to the facilities offered by the package.
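A synonym front end of this kind can be sketched as a lookup from the user's word to the one term the underlying package knows. The grouping of delete, remove and erase follows the command-system example given earlier in the paper; the table layout, the name `canonical_command` and the choice of "delete" as the canonical term are assumptions. A real front end would consult the thesaurus instead of a hand-written table.

```python
# Hypothetical table: canonical package term -> words accepted for it.
SYNONYM_GROUPS = {
    "delete": {"delete", "remove", "erase"},
}


def canonical_command(word, groups=SYNONYM_GROUPS):
    """Translate a user's word into the term known by the package.

    Returns None when the word is unknown or matches more than one
    group, so that the caller can query the user.
    """
    word = word.lower()
    matches = [canon for canon, members in groups.items() if word in members]
    return matches[0] if len(matches) == 1 else None


print(canonical_command("erase"))
print(canonical_command("print"))
```

Returning None for an ambiguous word, and leaving the query to the caller, reflects the requirement that the computer should request clarification when a term remains ambiguous.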
This method can also be used in non-interactive environments. For example, a routine that performs script marking in a tutorial system by searching for keywords (see, for example, Gwei [ref 26: G. M. Gwei, Towards Automatic Teaching, MSc. Thesis, University of Aston in Birmingham (Sept. 1983)]) can be modified to consult synonym first. Thus marking would involve searching users' responses for synonyms to keywords of a model answer (for further details see Gwei [ref 27]).
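Keyword marking extended with synonyms can be sketched as below, using the larger/bigger example from earlier in the paper. The function name `mark_response` and the hand-written synonym sets are hypothetical; in the system described, the synonyms would come from the synonym command and not from a fixed table.

```python
def mark_response(response, keyword_synonyms):
    """Report, for each model-answer keyword, whether the response
    contains the keyword itself or one of its accepted synonyms."""
    words = set(response.lower().split())
    return {
        keyword: bool(words & ({keyword} | synonyms))
        for keyword, synonyms in keyword_synonyms.items()
    }


# The course author specified "larger"; "bigger" should also be accepted.
model_answer = {"larger": {"bigger"}}
print(mark_response("The second one is bigger", model_answer))
print(mark_response("The second one is red", model_answer))
```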
The possibilities offered by the above developments are wide-ranging and offer many new facilities. The major drawback at the moment in the use of the system as an authoring tool in a technical or research environment is that the thesaurus on which this work has been based is too general purpose for technical use; there is a need for additional thesaurus paragraphs representing technical concepts related to the subject area involved. Many technical areas use words from ordinary English, but with very specific technical meanings; this usage needs to be reflected by a technical thesaurus. Roget was chosen simply because of its availability. Even with the context considerations above, terms used in a specific technical context have little chance of being treated correctly. It is easy to imagine one computer system in which erase and delete were identical operations, but another system in which the first merely erased the contents of the file, and the second actually removed the file itself. It would be necessary to develop domain specific thesauri as replacements for, or as add-ons to, the generalised environment above.
A lesser drawback of the present system, which could be overcome by a more careful implementation, arises because the searching mechanism above performs prefix and suffix manipulation only on the first and the last word of a phrase respectively. This results in the omission of some potentially acceptable phrases. For example, starting with "pay off", any paragraph containing "pays off" or "paying off" should be given a reasonable measure. Such derivations or inflexions are not considered by the present searching mechanism.
The prefix and suffix manipulation routine does not handle all irregularities of inflexion in English grammar. For example, the system above cannot derive "went" from "go", nor "sought" from "seek". The inability to derive valid but peculiar variants of some words can lead to inaccurate ordering of the paragraphs under consideration. Thus a paragraph containing a word such as "went" but no other variant of the verb "go" would wrongly be omitted if "go" were the input word. Such problems can be tackled by developing more sophisticated heuristics which involve the sources and derivation routes of such words.
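A simpler alternative to such heuristics, shown here only as an illustration, is an explicit exception table consulted before the suffix rules. The table contents and the names `IRREGULAR`, `BASE_OF` and `related_forms` are assumptions made for this sketch and are not part of the software described.

```python
from typing import Dict, Set, Tuple

# Hypothetical exception table: base form -> irregular variants.
IRREGULAR: Dict[str, Tuple[str, ...]] = {
    "go": ("went", "gone"),
    "seek": ("sought",),
}

# Reverse index so that an irregular variant leads back to its base form.
BASE_OF: Dict[str, str] = {
    form: base for base, forms in IRREGULAR.items() for form in forms
}


def related_forms(word: str) -> Set[str]:
    """Return the base form plus its irregular variants for the given word.

    Words absent from the exception table are returned unchanged, so the
    regular suffix rules can still be applied to them afterwards.
    """
    base = BASE_OF.get(word, word)
    return {base, *IRREGULAR.get(base, ())}
```

With this table, a search starting from "go" would also consider paragraphs containing "went", and a search starting from "sought" would be related back to "seek".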
The software developed as part of the above project relies heavily on our on-line copy of Roget's thesaurus. Longman are unwilling at the moment to allow it to be distributed to any other site. If readers know of a more freely available thesaurus, we would be happy to amend the software to use it, and then to distribute the system.
We wish to express our appreciation to: Longman for supplying us with an on-line version of Roget's Thesaurus; Dave Allsopp who contributed ideas to the options for the package; Andy Cheese and Mary Gwei, who both made useful comments on the original draft of this paper; and to various members of the Computer Science Department who contributed in one form or another.
The various inflexions (and some suffixal derivations) of a given word can be derived via considerations which include the following.
This outline can guide the search for a given word (and its variants) from a list of words in a lexicon/dictionary; the search should accept only those terms derivable via any of the schemes above. By storing a list of valid suffixes both requirements (1) and (2) can be achieved. All the considerations above can be facilitated by a procedure involving the details in Table 9 below.
In conjunction with the above, other forms of a word can be derived by using prefixes as follows:
| Given suffix | Unique substitute | Possible suffixes | Minimum word length |
| --- | --- | --- | --- |
| -ain | -an | -anation,-anatory | 5 |
| -ain | -ic | -icit,-icable | 5 |
| -ance,-ancy | -abl | -able,-ability | 6 |
| -ance,-ancy | -ate | -ate | 6 |
| -ate | -ab | -able,-ability | 4 |
| -ate | -ac | -ace,-acy | 4 |
| -ate | -an | -ancy,-ant | 4 |
| -ay | -aid | -aid | 3 |
| -ce,-cy | -ct | -ctive,-ction | 4 |
| -ce | -se | -se | 4 |
| -de,-dy | -si | -sion,-sive | 4 |
| -e,-y | -ab | -able,-ability | 4 |
| -e,-y | -ag | -age,-aging | 4 |
| -e,-y | -al | -ally,-allable | 4 |
| -e,-y | -anc | -ance,-ancy | 4 |
| -e,-y | -ant | -antion,-antive | 4 |
| -e,-y | -ia | -ial,-ian | 4 |
| -e,-y | -ib | -ible,-ibility | 4 |
| -e,-y | -ic | -ics,-icy | 4 |
| -e,-y | -ie | -ier,-ied | 4 |
| -e,-y | -il | -ily,-ility | 4 |
| -e,-y | -in | -ing,-iness | 4 |
| -e,-y | -io | -ion,-ious | 4 |
| -e,-y | -is | -ist,-ise | 4 |
| -e,-y | -it | -ity,-itive | 4 |
| -e,-y | -iv | -ive,-iving | 4 |
| -e,-y | -iz | -ize,-izing | 4 |
| -e,-y | -or | -ory,-ors | 4 |
| -e,-y | -ou | -our,-ous | 4 |
| -e,-y | -ua | -ual,-uate | 4 |
| -e,-y | -ul | -ular,-ully | 4 |
| -e,-y | -ur | -ure,-ury | 4 |
| -eau | -aux | -aux | 6 |
| -ede,-eed,-edy | -ess | -essive,-ession | 4 |
| -eem | -empt | -emptive,-emption | 4 |
| -eive | -eip | -eipient,-eipt | 6 |
| -eive | -eit | -eits,-eitive | 6 |
| -eive | -ept | -eption,-eptive | 6 |
| -er | -re | -re,-res | 4 |
| -ex | -ic | -icate,-ices | 4 |
| -f,-fe | -ve | -ves,-ving | 4 |
| -ia | -ic | -ic,-icity | 6 |
| -ia | -is | -ist,-isy | 5 |
| -ia | -ite | -ites,-ited | 5 |
| -ie | -ying | -ying | 3 |
| -ind | -ound | -ound,-oundation | 5 |
| -laim | -lam | -lamation,-lamatory | 5 |
| -le | -il | -ility | 5 |
| -le | -ul | -ular | 5 |
| -mit | -miss | -mission,-missive | 5 |
| -nce,-ncy | -nd | -nd,-nds | 5 |
| -nce,-ncy | -nt | -nt,-nts | 5 |
| -nd,-nt | -nc | -nce,-ncy | 5 |
| -nd,-nt | -ns | -nsive,-nsion | 5 |
| -nd | -nt | -nt,-nts | 5 |
| -nt | -nd | -nd,-nds | 5 |
| -olve | -ub | -uble,-ubility | 4 |
| -olve | -ut | -ute,-ution | 4 |
| -orb | -orpt | -orption,-orptive | 5 |
| -our | -or | -oration,-orimetry | 6 |
| -pt | -psi | -psion,-psions | 5 |
| -re | -er | -er,-ers | 4 |
| -rer | -re | -re,-red | 5 |
| -ribe | -ript | -riptive,-ription | 6 |
| -se,-ze | -st | -st,-sts | 5 |
| -se,-ze | -ti | -tive,-tion | 5 |
| -se | -ce | -ce,-ces | 5 |
| -se | -ze | -ze,-zed | 5 |
| -tain | -ic | -ics,-icity | 5 |
| -tain | -ten | -tenance,-tenatory | 5 |
| -um | -a | -a | 3 |
| -ume | -ump | -umption,-umptive | 5 |
| -us | -i | -i | 4 |
| -vert | -version | -version,-versions | 5 |
| -x | -ct | -ction,-ctive | 4 |
| -ze | -se | -se,-ses | 5 |
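The suffix table above lends itself to a table-driven procedure. The sketch below is one possible reading of it, not the authors' implementation: it assumes that a word of at least the minimum length ending in a "given suffix" has that suffix stripped and each "possible suffix" appended in its place, and that only candidates present in the lexicon are accepted. Only a handful of rows are transcribed, and the names `SuffixRule`, `candidate_variants` and `find_variants` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class SuffixRule(NamedTuple):
    given: Tuple[str, ...]     # endings that trigger the rule
    possible: Tuple[str, ...]  # endings that may replace the given one
    min_length: int            # shortest word the rule applies to


# A few rows transcribed from the table (leading hyphens dropped).
RULES: List[SuffixRule] = [
    SuffixRule(("ain",), ("anation", "anatory"), 5),
    SuffixRule(("mit",), ("mission", "missive"), 5),
    SuffixRule(("ribe",), ("riptive", "ription"), 6),
    SuffixRule(("f", "fe"), ("ves", "ving"), 4),
    SuffixRule(("ume",), ("umption", "umptive"), 5),
]


def candidate_variants(word: str) -> Set[str]:
    """Generate every form derivable from the word under the assumed reading."""
    variants: Set[str] = set()
    for rule in RULES:
        if len(word) < rule.min_length:
            continue
        for given in rule.given:
            if word.endswith(given):
                stem = word[:-len(given)]
                variants.update(stem + suffix for suffix in rule.possible)
    return variants


def find_variants(word: str, lexicon: Iterable[str]) -> List[str]:
    """Keep only the generated candidates that occur in the lexicon."""
    return sorted(candidate_variants(word) & set(lexicon))
```

Under these assumptions "explain" yields "explanation" and "explanatory", while "gain" yields nothing because it is shorter than the minimum length of 5 for the "-ain" row. Filtering against the lexicon discards over-generated forms such as "kniving" from "knife".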