UNL2010
Revision as of 14:59, 14 July 2010
The specifications stated here were first derived from the fully manual UNLization of the complete text of Cratylus, by Plato, from English into UNL (the Cratylus Project), and have been continuously extended and amended in order to be as comprehensive as possible.
As a general UNLization policy, we have tried to follow the UNL 2005 Specification (version of June 7, 2005) as closely as possible. This is to say that the whole text has been treated as a semantic network, where paragraphs and sentences are represented as hypernodes, which in turn are represented as sets of binary relations between annotated nodes. These nodes represent words, whether simple, compound or complex (the so-called UWs), and clauses, whether subordinate, embedded or coordinate (the so-called SCOPEs). Nevertheless, these guidelines should not be taken as the UNL Specifications themselves, since 1) they are rather experimental and tentative; 2) they differ, in several points, from the current version of the Specifications; 3) they do not follow some of the existing UNLization policies; and 4) they have not yet been approved by the UNL Community.
Finally, we ought to stress that these standards are tentative and provisional, and that they are subject to improvement and change as soon as they prove not to be the most adequate ones. To that end, we invite UNL Community members and other people interested in UNL to criticize them, to propose alternatives and to help us build a formalism that is as effective as possible.
PREMISES
These specifications are derived from three main premises:
Information conveyed by natural language can be represented by a natural-language-independent hypergraph structure.
Texts can be treated as a set of semantic nodes interlinked by semantic relations and modified by semantic attributes. Nodes can be either simple (UWs) or complex (SCOPES, i.e., sub-graphs, such as clauses).
The UNL representation is an interpretation rather than a translation of a given text.
The main goal of the UNLization process is to represent the knowledge structure of the source text, which should be detached from its verbal structure. This means that the UNL representation should not be committed to replicating the lexical and syntactic choices of the original, but should focus on representing, in a language-independent and unambiguous format, one of its possible readings, preferably the most conventional one.
The UNL representation should be as semantically complete as possible.
This means that, whenever possible, all the semantic valencies of the original text should be saturated, including anaphora, ellipses, presuppositions and implicatures. Pronouns and pro-forms, for instance, are expected to be replaced by their antecedents, and should not be represented in UNL, except in case of exophoric reference (indefinite pronouns, interrogative pronouns and personal pronouns that are not co-indexed to any existing antecedent).
THREE-LAYERED REPRESENTATION
The basic assumption of the UNL approach is that the meaning conveyed by natural language can be formally represented through three different types of semantic units: UWs, attributes and relations. This three-layered representation model is the cornerstone of UNL and its most distinctive feature compared with other semantic networks, which normally propose only two levels: edges and vertices.
UNIVERSAL WORDS (UWs)
Universal Words, or simply UWs, are the words of UNL, and correspond to the nodes (to be interlinked by relations or modified by attributes) in a UNL graph. They are labels for relatively stable units of knowledge (the concepts) that can be associated with the open lexical categories of natural language (noun, verb, adjective and adverb). The set of UWs is relatively open and is listed in the UNL Dictionary. Additionally, UWs are organized in a hierarchy (the UNL Ontology), defined in the UNL Knowledge Base (UNLKB) and exemplified in the UNL Example Base (UNLEB), which are the lexical databases for UNL.
The first commitment, not to imitate English, can be understood in two different senses. The most easily achievable is that XUNL should no longer use English words, or that KVs should be made out of language-independent symbols, such as Arabic numerals. In this case, KVs would not be as readily legible as UWs, but would be shorter, less deceptive and actually universal. Additionally, human readability could easily be provided by editing facilities such as those existing in the very computer where this text is being typed, which automatically convert Roman characters into machine-tractable codes. Indeed, there is no actual need for middle-level representations (such as UWs and MDs) in the current state of the art of human-machine interfaces.
However, the language-independence commitment must also be understood in a far deeper and more intricate way. It is not only a matter of labeling, but of choosing what is supposed to be labeled. Spelling differences (‘color’ and ‘colour’) and synonyms (‘freedom’ and ‘liberty’) should clearly not be represented as different lexical items in XUNL. The set of KVs should be equivalent to the sets of synonyms of a given language rather than to the whole set of words of that language. In this sense, KVs would be very akin to the concept of synset devised by WordNet (Fellbaum, 1998).
Moreover, XUNL should comprise only lexical roots (monomorphemic stems), i.e., the set of atomic lexical items necessary and sufficient to generate the whole set of words of a given language. For instance, there is no need, in XUNL, for a word like “beautiful” or “beautifully”, provided that we have “beauty” and some derivation rules. This is to say that the XUNL lexicon should be generative instead of enumerative.
Finally, XUNL should include only semes (Pottier, 1960), i.e., the elementary semantic particles of lexical meaning. Natural language words should be represented as complex semantic structures to be analyzed in XUNL. Accordingly, a verb like “to fly” should rather be represented as “to travel through air” (or, even more radically, as “to change location through air”), and a noun like “chair” should be represented as “a seat for one person, with a support for the back”. In other words, the XUNL dictionary should be semasiological rather than onomasiological. Natural language lexical items should not simply be translated into XUNL but truly defined in relation to a core minimum vocabulary, as simple and small as possible.
Nevertheless, the three-layered model poses several problems for UNLization, as the distinction between what is supposed to be represented by each unit is not always clear. In order to avoid overlap and to facilitate the enconversion process, we have tried to clearly identify the scope of each unit using the following procedures:
- UWs represent lexemes from open classes:
  - nouns, including proper nouns, abbreviations and acronyms;
  - adjectives;
  - full verbs;
  - adverbs (adjuncts, conjuncts and disjuncts); and
  - numbers (to be always represented as Arabic numerals).
- ATTRIBUTES represent bound morphemes, closed classes and context-dependent information:
  - grammatical categories (gender, number, tense, aspect, mood, voice, etc.);
  - determiners (articles and demonstratives);
  - adpositions (prepositions, postpositions and circumpositions);
  - auxiliary and quasi-auxiliary verbs (auxiliaries, modals, coverbs, preverbs);
  - interjections;
  - conjunctions;
  - adverbs (specifiers);
  - text structure (.@entry, .@topic, .@qfocus, .@emphasis, .@relative, etc.);
  - speech acts (.@request, .@suggestion, .@offer, etc.);
  - other context-dependent information (such as politeness, metaphor, irony, etc.).
- RELATIONS represent syntactic relations (specifier, complement, adjunct) with their corresponding semantic value.
Pronouns and pro-forms are expected to be replaced by their antecedents and not to be represented in UNL, except in case of exophoric reference (indefinite pronouns, interrogative pronouns and personal pronouns that are not coindexed to any existing antecedent).
RELATIONS
The set of relations is exactly the same as that defined in the UNL 2005 Specifications.
ATTRIBUTES
The set of attributes has been substantially increased to represent information concerning grammatical categories, determiners, adpositions and conjunctions. The main additions are the following:
- gender: @male, @female
- degree: @more, @less, @equal, @most, @least, @plus, @minus, etc.
- demonstrative: @proximal, @medial, @distal
- preposition: @under, @below, @above, @after, @before, etc.
- conjunction: @before, @after, etc.
- relative (for the main entry of relative clauses): @relative
The decision to represent closed classes as attributes instead of UWs has led to a different way of representing several natural language phenomena:
- this X
  - UNL Centre: mod(X, this)
  - These guidelines: X.@proximal
- X is under Y
  - UNL Centre: plc(X, under), obj(under, Y)
  - These guidelines: plc(X, Y.@under)
- bigger than Y
  - UNL Centre: man(big, more), bas(big, Y)
  - These guidelines: bas(big.@more, Y)
etc.
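The contrast between the two encodings can be made concrete with a small sketch. The Python below is only an illustration: the class names Node and Relation are hypothetical and are not part of any UNL tool. It shows how "X is under Y" takes two relations when the preposition is a node of its own, and one relation when the preposition is an attribute of Y.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A UW (or a placeholder such as X) plus its attributes (hypothetical class)."""
    uw: str
    attrs: tuple = ()

    def __str__(self):
        # Attributes are appended as ".@name", in alphabetical order.
        return self.uw + "".join(f".@{a}" for a in sorted(self.attrs))


@dataclass(frozen=True)
class Relation:
    """A labeled binary relation between two nodes (hypothetical class)."""
    label: str
    source: Node
    target: Node

    def __str__(self):
        return f"{self.label}({self.source}, {self.target})"


# "X is under Y" in the UNL Centre style: the preposition is a node.
centre = [
    Relation("plc", Node("X"), Node("under")),
    Relation("obj", Node("under"), Node("Y")),
]

# The same phrase under these guidelines: the preposition is an attribute of Y.
guidelines = [Relation("plc", Node("X"), Node("Y", ("under",)))]

print(", ".join(str(r) for r in centre))
print(", ".join(str(r) for r in guidelines))
```

Keeping attributes on the node rather than in separate relations is what shortens the graph: the closed-class word no longer needs a node or an edge of its own.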
Additionally, the following general principles were adopted:
- interjections, filled pauses, phatic expressions and short answers should be represented by the null UW (to be represented as "00") together with the attribute indicating the corresponding speech act (.@confirmation, .@surprise, etc).
- the attribute .@entry (mandatory in every scope, including the main one) should be placed at the left (source) side of at least one relation;
- the difference between mentioning and using a word (which is a quite frequent situation in a metalinguistic text such as Cratylus) should be represented by the attribute .@mention (which is not the same as "quotation");
- attributes should be used in alphabetical order (“.@entry.@past” instead of “.@past.@entry”).
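The alphabetical-order principle is easy to enforce mechanically. The following is a minimal sketch, assuming that a node is written as a head followed by zero or more ".@attribute" suffixes; the function name order_attributes is hypothetical.

```python
def order_attributes(node: str) -> str:
    """Return the node with its .@attributes sorted alphabetically.

    Assumes the node is a head followed by ".@name" suffixes,
    e.g. "say.@past.@entry" (an illustrative node, not from a real corpus).
    """
    head, *attrs = node.split(".@")
    return head + "".join(f".@{a}" for a in sorted(attrs))


print(order_attributes("say.@past.@entry"))
```

Applied to a node that is already ordered, or to a node without attributes, the function returns its input unchanged.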
UNIVERSAL WORDS
The set of Universal Words, i.e., the UNL Dictionary, has undergone the most radical change, as we have been using UNLWN30, a set of UWs automatically extracted from WordNet 3.0. In this dictionary, UWs correspond to sets of synonyms (synsets) of English, and may have several different headwords. They are represented as 9-digit strings with the following format:
<POS><WORDNETID>
where <POS> is one of {1, 2, 3, 4}, with 1 = noun, 2 = verb, 3 = adjective and 4 = adverb;
and <WORDNETID> is the synset ID in WordNet 3.0.
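The format above can be checked and split with a few lines of code. This is a sketch under the assumption that the synset ID fills the remaining eight digits of the 9-digit string; the function name parse_uw and the sample ID are illustrative only and have not been checked against UNLWN30.

```python
# Mapping of the leading digit to the part of speech, as described above.
POS_BY_DIGIT = {"1": "noun", "2": "verb", "3": "adjective", "4": "adverb"}


def parse_uw(uw: str) -> tuple:
    """Split a 9-digit UW into (part of speech, synset ID).

    Raises ValueError if the string is not 9 ASCII digits or if the
    leading digit is not one of 1-4.
    """
    if len(uw) != 9 or not (uw.isascii() and uw.isdigit()):
        raise ValueError(f"expected a 9-digit string, got {uw!r}")
    pos = POS_BY_DIGIT.get(uw[0])
    if pos is None:
        raise ValueError(f"unknown part-of-speech digit in {uw!r}")
    return pos, uw[1:]


# Illustrative ID only; not verified against the dictionary.
print(parse_uw("102084071"))
```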
SCOPES
In order to make knowledge extraction from the UNL document easier, we have restricted the use of scopes to cases involving semantic ambiguity, such as:
- electric [light orchestra], with scope, i.e., a "light orchestra" that is electric; or
- electric light orchestra, without scope, i.e., an orchestra that is both "light" and "electric".
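The two readings can be written out as relation lists. The sketch below is only illustrative: the tuple layout and the render function are hypothetical, the scope notation (a ":01" suffix on the relation label and ":01" as a node reference) is assumed from the UNL 2005 convention rather than stated in these guidelines, and the plain words stand in for the numeric UWs.

```python
def render(relation: tuple) -> str:
    """Render (label, scope_id, source, target) as a UNL-like relation string."""
    label, scope_id, source, target = relation
    scope = f":{scope_id}" if scope_id else ""
    return f"{label}{scope}({source}, {target})"


# Without scope: an orchestra that is both "light" and "electric".
without_scope = [
    ("mod", None, "orchestra.@entry", "electric"),
    ("mod", None, "orchestra.@entry", "light"),
]

# With scope: a "light orchestra" (scope 01) that is electric.
# .@entry appears once in the scope and once in the main graph.
with_scope = [
    ("mod", "01", "orchestra.@entry", "light"),
    ("mod", None, ":01.@entry", "electric"),
]

for relation in without_scope + with_scope:
    print(render(relation))
```

In the scoped reading, "electric" modifies the sub-graph as a whole, which is the only difference between the two representations.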