Lexical Realisation Unit
Revision as of 14:07, 7 January 2010
In the UNLarium framework, a lexical realisation unit (or simply LRU) is any discrete, recurring and standardized unit of meaning of a given natural language. It can be a morpheme (a root, an affix), a simple word or a multi-word expression (compounds, collocations, idioms). The set of LRUs constitutes the vocabulary or the lexicon of a language.
LRUs are standardized lexical realisations for concepts
The UNLarium is first and foremost a generation-driven framework, which has been developed mainly to provide resources for generating natural language texts out of UNL documents. In that sense, UNLarium entries should correspond to the most likely realisations, in a given language, of a given concept. The expression “realisation” stands here for a mixture of wording and phrasing, i.e., the manner in which a concept is articulated in a given language. For instance, the concept “the natural satellite of the Earth” is realised, in English, by the word “Moon”; in French, by “lune”; in German, by “Mond”; in Russian, by “луна”; in Spanish, by “luna”; in Chinese, by 月; and so on. Your first task in the UNLarium is precisely to find linguistic realisations for concepts, which will always be presented by their corresponding definitions in English.
LRUs, however, are not simply linguistic realisations; they are lexical realisations. This means that LRUs should correspond to the units of the vocabulary of a language, i.e., to lexical items. Let’s return to our previous example. Apart from “Moon”, the concept “the natural satellite of the Earth” can be realised, in English, by the very expression “the natural satellite of the Earth”, which is indeed very frequent (2,130,000 occurrences in Google). This expression, however, is a “definition” rather than a “lexical realisation” for the concept, and should not correspond to an LRU.
The differences between definitions and lexical items, or between “defining” and “naming” a concept, are fairly subjective, and are normally ascribed to the compositionality (or analyticity) of the candidate term: if the meaning of the compound can be reduced to the combination of the meanings of its components, it is said to be simply a definition; otherwise, i.e., if there is a sort of semantic surplus, a supplementary (or even complementary) sense added to the simple combination, the term is considered a lexical item. Consider, for instance, the case of “sweet and sour”. In this case, we clearly have a lexical item and therefore an LRU, because the phrase now encompasses the sense of “sauce”, which cannot be derived from the combination of the meanings of “sweet”, “and” and “sour”, even if, in the past, the expression was merely a conjunction of two simple adjectives. As a matter of fact, the process of converting definitions into lexical items is very common and is one of the most frequent strategies employed to increase and improve the vocabulary of a language.
Finally, LRUs also have to be standardized lexical realisations for concepts. This means that LRUs must already be consolidated, which can be verified in dictionaries and glossaries, or by the rate of recurrence on the Web. If the structure is reasonably constant and convincingly recurring, it is an LRU; otherwise, it is not. But be careful: the frequency condition applies only after the lexical (combinatorial) one. Notice that “the natural satellite of the Earth” is not an LRU, in spite of its 2,130,000 occurrences, whereas “sweet and sour”, with 1,710,000 occurrences, is an LRU.
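The order of the two conditions can be sketched in a few lines of Python. This is only an illustration: the `Candidate` class, the `is_lru` function and the `MIN_OCCURRENCES` threshold are hypothetical names, and the numeric cut-off is an assumption, since no figure is given for what counts as "convincingly recurring". The compositionality judgement is supplied by hand, because it is, as said above, fairly subjective.

```python
from dataclasses import dataclass

# Hypothetical cut-off: the text gives no numeric threshold for recurrence.
MIN_OCCURRENCES = 100_000


@dataclass(frozen=True)
class Candidate:
    expression: str
    compositional: bool  # True if the meaning is just the sum of its parts
    occurrences: int     # e.g. a Web hit count


def is_lru(candidate: Candidate, min_occurrences: int = MIN_OCCURRENCES) -> bool:
    """Apply the lexical (combinatorial) condition first, then the frequency one."""
    if candidate.compositional:
        # A definition, however frequent, is not a lexical item.
        return False
    return candidate.occurrences >= min_occurrences


moon_definition = Candidate("the natural satellite of the Earth", True, 2_130_000)
sweet_and_sour = Candidate("sweet and sour", False, 1_710_000)

print(is_lru(moon_definition))  # frequent, but compositional
print(is_lru(sweet_and_sour))   # non-compositional and frequent
```

Because the compositionality test comes first, a high hit count alone can never turn a definition into an LRU.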
LRUs are not words
A common misunderstanding in natural language description is that, due to our writing conventions, especially in the Western tradition, we tend to reduce the lexicon of a language to a list of “words”, which are normally understood as the smallest free forms, or the strings of alphabetic characters isolated by blank spaces. Unfortunately, it is not that simple. The vocabulary of a language is made not only of words, but also of parts of words (roots, stems, affixes, particles) and of multi-word expressions (compounds, collocations, idioms). In English, one of the most frequent lexical realisations for the concept “contrary of” is the prefix “un-”, which is a bound morpheme (i.e., a semantic unit that does not have an independent existence); in the same way, the concept “allow or plan for a certain possibility” is frequently realised by the verbal expression “to take (sth) into account”, which is a complex structure that does not figure as a separate entry in most English dictionaries (it is normally listed under the verb “to take”). So, it is important to understand that “lexical realisation unit”, here, means not only “words”, in the common sense, but any part of the vocabulary of a language.
LRUs are not roots
In synthetic (inflected) languages, such as the Indo-European ones, a single LRU may take several different forms in order to express different grammatical categories, such as number, gender, tense and case. The definition “pass from physical life and lose all bodily attributes and functions necessary to sustain life”, for instance, is present, in English, in the forms "to die" (infinitive), "die" (present tense except 3rd person singular), "dies" (3rd person singular of the present tense), "dying" (gerund and present participle), "died" (past tense and past participle), "will die" (future), etc. Only the forms "die" and "dy-", however, actually realise the intended definition, which does not comprise any particular value for tense or mood. Particles (such as "to"), affixes (such as the inflectional suffixes "-s", "-ing", "-ed") and co-verbs (such as the auxiliary "will") convey different notions and should be isolated in order to obtain the real LRU. Nevertheless, this does not mean that LRUs are roots. As "lexical realisations" of definitions, LRUs may correspond either to single-rooted forms (such as “die”) or to multiple-root compounds (such as “kick the bucket” or “give up the ghost”).
LRUs are variable
Apart from inflections, LRUs may vary in many different ways: spelling (such as in the allomorphs “die” and “dy-” above), discontinuity (as in “take […] into account”) and order (as in German separable verbs such as “ankommen”, which also occurs as “kommen […] an”). These variants are said to be simply instances of the same LRU, which should be treated as a class rather than a single element, even in case of radical changes (such as the forms of the irregular verb "to be" in English: “be”, “am”, “are”, “was”, etc.). The possible variations of a given LRU will be specified through inflectional and subcategorization rules in a different field.
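The idea that an LRU is a class of variants rather than a single string can be illustrated with a small Python sketch. The `LRU` class and its regular-expression patterns are hypothetical and serve only to show the principle; they are not the inflectional and subcategorization rules used by the UNLarium, and the limit of four intervening words is an arbitrary assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LRU:
    lemma: str
    # Hypothetical encoding: one regular expression per family of variants.
    patterns: list[str] = field(default_factory=list)

    def matches(self, text: str) -> bool:
        """Return True if any variant of this LRU occurs in the text."""
        return any(re.search(p, text) for p in self.patterns)


# Discontinuous LRU: up to four words may stand between the verb and the rest.
take_into_account = LRU(
    lemma="take into account",
    patterns=[r"\b(?:take|takes|took|taken|taking)\b(?:\s+\w+){0,4}?\s+into account\b"],
)

# Radically different forms that still belong to one LRU.
be = LRU(lemma="be", patterns=[r"\b(?:be|am|is|are|was|were|been|being)\b"])

print(take_into_account.matches("She took the delay into account"))
print(take_into_account.matches("They take a walk"))
print(be.matches("I am here"))
```

All three lookups go through the lemma's entry, so every instance, continuous or not, is referred back to the same LRU.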
How to express an LRU
To ensure readability and to allow reference to all instances of the same LRU, the LRU is represented, in the UNLarium, through a lemma, i.e., a canonical (citation) form, which is the entry form normally given in ordinary dictionaries and glossaries. The lemma is the singular form, for nouns; the masculine singular, for adjectives; and the infinitive, for verbs. The lemma should follow the spelling and capitalization rules of the target language. In English, only proper names should take an initial capital letter, whereas in German all nouns should be written this way.
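The two capitalization rules just mentioned can be written down as a short Python sketch. The function name `capitalise_lemma`, the language codes and the part-of-speech labels are assumptions made for the example, and only the English and German rules stated above are covered; lemmas in any other language are returned unchanged.

```python
def capitalise_lemma(lemma: str, pos: str, language: str, proper_name: bool = False) -> str:
    """Apply the initial-capital rule for English ("en") and German ("de") lemmas."""
    if language not in ("en", "de"):
        return lemma  # no rule stated for other languages: leave as entered
    needs_capital = proper_name or (language == "de" and pos == "noun")
    first = lemma[:1].upper() if needs_capital else lemma[:1].lower()
    return first + lemma[1:]


print(capitalise_lemma("mond", "noun", "de"))                    # German noun
print(capitalise_lemma("moon", "noun", "en", proper_name=True))  # English proper name
print(capitalise_lemma("Die", "verb", "en"))                     # English verb
```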
Lexicalisation processes
As languages have different lexicalisation processes, a single definition may correspond to several different LRUs, which are said to be synonyms. The definition “pass from physical life and lose all bodily attributes and functions necessary to sustain life”, for instance, may be realised in English by several different LRUs: “die”, “croak”, “decease”, “drop dead”, “buy the farm”, “cash in one's chips”, “give up the ghost”, “kick the bucket”, “pass away”, “perish”, “snuff it”, “pop off”, “expire”, “conk”, “exit”, “choke”, “go”, “pass”, etc. In such cases, all realisations should be entered in the UNLarium. There are cases, however, in which the definition cannot be lexically realised [by a single lexical unit] in the target language. This happens in two situations:
- When the concept is underspecified, i.e., too broad (or vague) to be realised. The concept of “red entity”, for instance, may be coextensive with several different English LRUs (“blood”, “cherry”, “ruby”, “ketchup”, “Spiderman”, etc.), but these are rather subordinate terms (or hyponyms), in the sense that they only include and partly match the intended sense. And the expression “red entity” itself is too compositional and too occasional to be considered already lexicalized (Google returns only 8,040 occurrences for this bigram).
- When the concept is overspecified, i.e., too narrow (or specific) to be realised. Consider, for instance, the definition “a person who is ready to forgive any transgression a first time and then to tolerate it for a second time, but never for a third time”. This definition does not lead to any LRU in English, French or Russian, even though it corresponds to a single word (“ilunga”) in Tshiluba, a language spoken in the Democratic Republic of the Congo. We may obviously express the concept in any language, but we have to do it through a periphrasis (as we have done for English) or through a superordinate term (or hypernym), such as “forgiver”, “excuser”, “pardoner”, which are, again, only approximate matches.
In both cases, there will be no realisation to enter, and it is important to indicate, in the UNLarium, that the concept has not been lexicalized yet, which means that it can be expressed in the target language only by means of definitions (periphrases) and other semantically related (and inaccurate) lexical units (such as hyponyms or hypernyms).
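The difference between "several synonyms", "not lexicalized" and "not yet filled in" can be made explicit in a data structure. The following Python sketch is only an illustration: the `ConceptEntry` class, its field names and the status labels are hypothetical and do not describe the actual UNLarium storage format.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptEntry:
    definition: str
    language: str
    realisations: list[str] = field(default_factory=list)  # synonymous lemmas
    not_lexicalised: bool = False  # explicit marker, distinct from an empty list

    def __post_init__(self) -> None:
        if self.not_lexicalised and self.realisations:
            raise ValueError("an entry cannot be both realised and marked as not lexicalised")

    def status(self) -> str:
        if self.not_lexicalised:
            return "not lexicalised"
        return "lexicalised" if self.realisations else "pending"


die = ConceptEntry(
    "pass from physical life and lose all bodily attributes and functions necessary to sustain life",
    "en",
    realisations=["die", "pass away", "kick the bucket"],
)
red_entity = ConceptEntry("red entity", "en", not_lexicalised=True)

print(die.status())
print(red_entity.status())
```

Keeping an explicit marker, instead of relying on an empty list, distinguishes a concept that has been examined and found to have no LRU from one that simply has not been treated yet.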