Lexical Realisation Unit

From UNL Wiki

Revision as of 17:47, 6 January 2010

A lexical unit (or simply LU) is any stable, labelable and recurring unit of meaning in a given natural language. It can be a morpheme (a root, an affix), a simple word or a multi-word expression (compounds, collocations, idioms). The set of LUs constitutes the vocabulary or the lexicon of a language.

From concepts to LUs

The UNLarium is first and foremost a generation-driven framework, developed mainly to provide resources for generating natural language texts out of UNL documents. In that sense, dictionary entries should correspond to the most likely lexical realisation, in a given language, of a definition for a concept. For instance, the definition “the natural satellite of the Earth” is realised, in English, by the word “Moon”; in French, by “lune”; in German, by “Mond”; in Russian, by “луна”; in Spanish, by “luna”; in Chinese, by 月; and so on. Your first task in the UNLarium is precisely to find “lexical realisations” for those concept definitions, which will always be presented in English. These lexical realisations are the lexical units (LUs).

LUs are not only words

The expression "lexical unit" is used here to avoid a common misunderstanding in natural language description. Due to writing conventions, especially in the Western tradition, we tend to reduce the lexicon of a language to a list of “words”, which are normally understood as the smallest free forms, or the strings of alphabetic characters isolated by blank spaces. Unfortunately, it is not that simple. The vocabulary of a language is made up not only of words, but also of parts of words (roots, stems, affixes, particles) and of multi-word expressions (compounds, collocations, idioms). In English, one of the most frequent lexical realisations for the concept “contrary of” is the prefix “un-”, which is a bound morpheme (i.e., a semantic unit that does not have an independent existence); in the same way, the concept “to die” is frequently realised by the idiom “to kick the bucket”, which is a complex structure that does not figure as a separate entry in most English dictionaries (it is normally listed under the verb “to kick”). So, it is important to understand that “lexical realisation”, here, means not only “words”, in the common sense, but any lexical unit, i.e., any reasonably constant unit of a language, regardless of its length and number of morphemes. For us, the most important criterion, although it is still quite subjective, is the rate of recurrence. If the sequence recurs convincingly often, it is a LU; otherwise, it is not.

Lexicalisation processes

As languages have different lexicalisation processes, a single definition may correspond to several different LUs, which are said to be synonyms. The definition “pass from physical life and lose all bodily attributes and functions necessary to sustain life”, for instance, may be realised in English by several different LUs: “to die”, “to croak”, “to decease”, “to drop dead”, “to buy the farm”, “to cash in one's chips”, “to give up the ghost”, “to kick the bucket”, “to pass away”, “to perish”, “to snuff it”, “to pop off”, “to expire”, “to conk”, “to exit”, “to choke”, “to go”, “to pass”, etc. In such cases, all realisations should be entered in the UNLarium.
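
The one-definition-to-many-LUs relation described above can be pictured with a small sketch. It is only an illustration: the storage layout, the language codes and the function names `add_realisation` and `realisations` are assumptions made for this example, not the UNLarium's actual data format or interface.

```python
from collections import defaultdict

# Hypothetical in-memory store: (definition, language code) -> list of LUs.
# Illustration only; this is not how the UNLarium stores its entries.
lexicon = defaultdict(list)

def add_realisation(definition, language, lu):
    """Record one more LU for a definition, skipping duplicates."""
    entries = lexicon[(definition, language)]
    if lu not in entries:
        entries.append(lu)

def realisations(definition, language):
    """Return a copy of the LUs recorded for a definition in a language."""
    return list(lexicon.get((definition, language), []))

MOON = "the natural satellite of the Earth"
DIE = ("pass from physical life and lose all bodily attributes "
       "and functions necessary to sustain life")

add_realisation(MOON, "en", "Moon")
add_realisation(MOON, "fr", "lune")
# The repeated "to die" is ignored, so each synonym is stored once.
for lu in ["to die", "to pass away", "to kick the bucket", "to die"]:
    add_realisation(DIE, "en", lu)
```

A definition with no recorded realisation in a language simply yields an empty list, so `realisations(MOON, "de")` returns nothing until a German LU is entered.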

There are cases, however, in which the definition cannot be lexically realised [by a single lexical unit] in the target language. This happens in two situations:

  • When the concept is underspecified, i.e., too broad (or vague) to be realised. The concept of “red entity”, for instance, may be coextensive with several different English LUs (“blood”, “cherry”, “ruby”, “ketchup”, “Spiderman”, etc.), but these are rather subordinate terms (or hyponyms), in the sense that they are included in the intended sense and match it only partly. And the expression “red entity” itself is too occasional to be considered already lexicalised (Google returns only 8,040 occurrences for this bigram, which is very low compared to the 51,300,000 occurrences of "entity" and the 897,000,000 occurrences of "red").
  • When the concept is overspecified, i.e., too narrow (or specific) to be realised. Consider, for instance, the definition “a person who is ready to forgive any transgression a first time and then to tolerate it for a second time, but never for a third time”. This definition does not lead to any LU in English, French or Russian, even though it corresponds to a single word (“ilunga”) in Tshiluba, a language spoken in the Democratic Republic of the Congo. We may obviously express the concept in any language, but we have to do it through a periphrasis (as we have done for English) or through a superordinate term (or hypernym), such as “forgiver”, “excuser”, “pardoner”, which, again, match the intended sense only partly.

In both cases, there will be no realisation to enter, and it is important to indicate, in the UNLarium, that the concept has not been lexicalised yet, which means that it can be expressed in the target language only by means of periphrases and other semantically related (and inaccurate) lexical units (such as hyponyms or hypernyms).
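
The frequency argument used for “red entity” can be made concrete with a little arithmetic. The sketch below divides the bigram count by the count of each component word, using the figures quoted in the text. Treating this ratio as the measure of recurrence is an assumption made for the example; the text does not prescribe a formula or a threshold, and `relative_frequency` is a hypothetical helper.

```python
def relative_frequency(bigram_count, unigram_count):
    """Share of a word's occurrences that fall inside the bigram."""
    if unigram_count <= 0:
        raise ValueError("unigram_count must be positive")
    return bigram_count / unigram_count

# Counts quoted in the text for "red entity", "entity" and "red".
RED_ENTITY = 8_040
ENTITY = 51_300_000
RED = 897_000_000

share_of_entity = relative_frequency(RED_ENTITY, ENTITY)
share_of_red = relative_frequency(RED_ENTITY, RED)

print(f"{share_of_entity:.6%} of the occurrences of 'entity'")
print(f"{share_of_red:.6%} of the occurrences of 'red'")
```

With these counts the bigram accounts for less than 0.02% of the occurrences of “entity” and less than 0.001% of the occurrences of “red”, which is consistent with calling the expression too occasional to count as lexicalised.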

How to express a LU

In the UNLarium, the LU is expressed by its canonical (citation) form, i.e., the word or expression as it would appear in ordinary dictionaries and glossaries. This will normally be the unmarked (generic, basic, default) form. The concept of "markedness", which was coined by the Prague School in the 1930s, is too technical to be detailed here. In principle, the unmarked form will be the singular, for nouns; the masculine singular, for adjectives; the infinitive, for verbs; and so on. In case of doubt, just take a look in a good general-purpose dictionary. It will list “foot” (for both “foot” and “feet”); “run” (for “run”, “runs”, “ran”, “running”); “beau” (=beautiful, in French) for “beau” (masculine singular), “beaux” (masculine plural), “belle” (feminine singular) and “belles” (feminine plural); etc. In any case, the LU must always realise the concept: the definition "a female lion" should correspond to "lioness", and not to "lion", even if "lion" is generally said to be the unmarked form (it can refer to the general species, which includes both the male and the female).
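
As a rough illustration of the citation-form rule, the sketch below maps the inflected forms mentioned above to the form that should be entered as the LU. The table `CITATION_FORMS` and the function `citation_form` are hypothetical and built by hand for this example; in practice the information would come from a dictionary, as the text recommends.

```python
# Hypothetical lookup table from inflected forms to citation forms,
# filled in by hand with the examples given in the text.
CITATION_FORMS = {
    "foot": "foot", "feet": "foot",
    "run": "run", "runs": "run", "ran": "run", "running": "run",
    "beau": "beau", "beaux": "beau", "belle": "beau", "belles": "beau",
    # "lioness" is not reduced to "lion": the definition "a female lion"
    # is realised by "lioness", so that is the form to enter.
    "lion": "lion", "lioness": "lioness",
}

def citation_form(word_form):
    """Return the citation form, or the input unchanged if it is not listed."""
    return CITATION_FORMS.get(word_form.lower(), word_form)
```

For instance, `citation_form("feet")` gives "foot" and `citation_form("belles")` gives "beau", while `citation_form("lioness")` stays "lioness" because the LU must realise the concept.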

The role of LUs

LUs are not actually essential to the UNLarium. They just provide a human-readable label or reading for base forms, which are the starting points and the keystones for the generation. Base forms on their own can be quite misleading to a human reader, because of the process of morphological analysis.
