UNL2010
Revision as of 15:45, 14 July 2010
The specifications stated here are still experimental and tentative, and have been continuously extended and amended in order to be as comprehensive as possible. They follow the general strategies defined in the UNL 2005 Specification (version of June 7th, 2005), but introduce several important changes derived from different UNLization experiences (Cratylus, EOLSS, Le Petit Prince, IGLU) carried out by the UNDL Foundation. Although formally adopted in the UNDL Foundation tools, projects and certificates, they should not yet be taken as the official specs, as they are still under construction and have not been widely discussed with the UNL Community.
PREMISES
These specifications are derived from three main premises:
Information conveyed by natural language can be represented by a natural language independent hyper-graph structure.
Texts can be treated as a set of semantic nodes interlinked by semantic relations and modified by semantic attributes. Nodes can be either simple (UWs) or complex (SCOPES, i.e., sub-graphs, such as clauses).
The UNL representation is an interpretation rather than a translation of a given text.
The main goal of the UNLization process is to represent the knowledge structure of the source text, which should be detached from its verbal structure. This means that the UNL representation should not be committed to replicating the lexical and syntactic choices of the original, but should focus on representing, in a language-independent and non-ambiguous format, one of its possible readings, preferably the most conventional one.
The UNL representation should be as semantically complete as possible.
This means that, whenever possible, all the semantic valencies of the original text should be saturated, including anaphora, ellipses, presuppositions and implicatures. Pronouns and pro-forms, for instance, are expected to be replaced by their antecedents, and should not be represented in UNL, except in case of exophoric reference (indefinite pronouns, interrogative pronouns and personal pronouns that are not co-indexed to any existing antecedent).
THREE-LAYERED REPRESENTATION
The basic assumption of the UNL approach is that the meaning conveyed by natural language can be formally represented through three different types of semantic units: UWs, attributes and relations. This three-layered representation model is the cornerstone of UNL and its most distinctive feature over other semantic networks, which normally propose only two levels: edges and vertices.
UNIVERSAL WORDS (UWs)
Universal Words, or simply UWs, are the words of UNL, and correspond to the nodes - to be interlinked by relations or modified by attributes - in a UNL graph. They are labels for relatively stable units of knowledge (the concepts) that can be associated to natural language open lexical categories (noun, verb, adjective and adverb). The set of UWs is relatively open and is listed in the UNL Dictionary. Additionally, UWs are organized in a hierarchy (the UNL Ontology), are defined in the UNL Knowledge Base (UNLKB) and exemplified in the UNL Example Base (UNLEB), which are the lexical databases for UNL.
UWs can be either simple (atomic) or complex (made out of other UWs). In the latter case, they are represented as hyper-nodes (i.e., sub-graphs). A simple UW is an integer which can also be represented, for better readability, as a unique character-string split into two different parts: a root and a suffix. The root can be a word, an expression, a phrase or even an entire sentence in any language. It should be interpreted as a label for a concept. The suffix, which is always introduced by a UNL relation, is used to disambiguate the root:
UW for the concept of "a piece of furniture with tableware for a meal laid out on it"

UNL Representation | NL Representation |
---|---|
104379964 | table(icl>furniture), table(icl>mobilier), mesa(icl>mobiliario), Tisch(icl>Möbel), стол(icl>мебель), ... |
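The split between root and suffix can be illustrated with a short Python sketch. It assumes that the readable form of a simple UW is always `root(suffix)` with a single relation restriction of the form `relation>concept`, as in the examples above; the function name `split_uw` and the regular expression are illustrative assumptions, not part of the specification.

```python
import re

# Assumption: the root is everything before the first "(" and the suffix
# is the parenthesised remainder, e.g. "table(icl>furniture)".
UW_PATTERN = re.compile(r"^(?P<root>[^()]+)\((?P<suffix>.+)\)$")


def split_uw(uw: str) -> tuple[str, str]:
    """Split a readable UW label into (root, suffix)."""
    match = UW_PATTERN.match(uw)
    if match is None:
        raise ValueError(f"not in root(suffix) form: {uw!r}")
    return match.group("root"), match.group("suffix")


for label in ["table(icl>furniture)", "Tisch(icl>Möbel)", "стол(icl>мебель)"]:
    root, suffix = split_uw(label)
    # The suffix starts with a UNL relation (here "icl") that disambiguates the root.
    relation, _, restriction = suffix.partition(">")
    print(root, relation, restriction)
```

All three labels share the relation `icl` and differ only in the language of the root and of the restriction, which is why they can stand for the same UW.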
As language-independent units, UWs are equivalent to the sets of synonyms of a given language, approaching the concept of "synset" devised by WordNet (Fellbaum, 1998). As a matter of fact, the current UNL Dictionary has been automatically extracted from WordNet 3.0, and UWs have been represented as 9-digit strings with the following format:
<POS><WORDNETID>
where <POS> = {1,2,3,4}, being 1 = noun, 2 = verb, 3 = adjective and 4 = adverb;
and <WORDNETID> is the synset ID in WordNet 3.0.
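As a minimal sketch of this format, the following Python function decomposes a 9-digit UW into its part of speech and its WordNet synset ID. The names `POS_BY_DIGIT` and `split_uw_id` are hypothetical; only the digit-to-category mapping comes from the text above.

```python
# Mapping taken from the format description: 1 = noun, 2 = verb,
# 3 = adjective, 4 = adverb.
POS_BY_DIGIT = {"1": "noun", "2": "verb", "3": "adjective", "4": "adverb"}


def split_uw_id(uw_id: str) -> tuple[str, str]:
    """Split a 9-digit UW identifier into (part of speech, synset ID)."""
    if len(uw_id) != 9 or not (uw_id.isascii() and uw_id.isdigit()):
        raise ValueError(f"expected a 9-digit string, got {uw_id!r}")
    pos_digit, synset_id = uw_id[0], uw_id[1:]
    if pos_digit not in POS_BY_DIGIT:
        raise ValueError(f"unknown <POS> digit {pos_digit!r}")
    # The synset ID is kept as a string so that leading zeros survive.
    return POS_BY_DIGIT[pos_digit], synset_id


print(split_uw_id("104379964"))  # ('noun', '04379964')
```

Applied to the UW from the table example, the function reports a noun whose synset ID is 04379964.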
The current UNL Dictionary is, however, only a starting point, as the set of UWs is supposed to be as comprehensive as the set of individual concepts depicted by different cultures, no matter how specific they are. In that sense, UWs are not to be considered semantic primitives, nor should they represent only common concepts, nor should they be derived from any particular language. They must include culture-dependent information and every relevant variation among similar concepts. Furthermore, the UNL Dictionary constitutes an open set, subject to permanent growth with new UWs, as UNL is supposed to keep incorporating new cultures and cultural changes.
ATTRIBUTES
The three types of semantic units cover the following natural language material:
- UWs represent open lexical categories:
  - nouns, including proper nouns, abbreviations and acronyms;
  - adjectives;
  - full verbs;
  - adverbs (adjuncts, conjuncts and disjuncts); and
  - numbers (to be always represented as Arabic numerals).
- ATTRIBUTES represent bound morphemes, closed classes and context-dependent information:
  - grammatical categories (gender, number, tense, aspect, mood, voice, etc.);
  - determiners (articles and demonstratives);
  - adpositions (prepositions, postpositions and circumpositions);
  - auxiliary and quasi-auxiliary verbs (auxiliaries, modals, coverbs, preverbs);
  - interjections;
  - conjunctions;
  - adverbs (specifiers);
  - text structure (.@entry, .@topic, .@qfocus, .@emphasis, .@relative, etc.);
  - speech acts (.@request, .@suggestion, .@offer, etc.); and
  - other context-dependent information (such as politeness, metaphor, irony, etc.).
- RELATIONS represent syntactic relations (specifier, complement, adjunct) with their corresponding semantic value.
Pronouns and pro-forms are expected to be replaced by their antecedents and not to be represented in UNL, except in case of exophoric reference (indefinite pronouns, interrogative pronouns and personal pronouns that are not coindexed to any existing antecedent).
RELATIONS
The set of relations is exactly the same as defined in the UNL 2005 Specifications.
ATTRIBUTES
The set of attributes has been substantially increased to represent information concerning grammatical categories, determiners, adpositions and conjunctions. The main additions are the following:
- gender: @male, @female
- degree: @more, @less, @equal, @most, @least, @plus, @minus, etc.
- demonstrative: @proximal, @medial, @distal
- preposition: @under, @below, @above, @after, @before, etc.
- conjunction: @before, @after, etc.
- relative (for the main entry of relative clauses): @relative
The decision to represent closed classes as attributes instead of UWs has led to a different way of representing several natural language phenomena:
- this X
  - UNL Centre: mod(X, this)
  - These guidelines: X.@proximal
- X is under Y
  - UNL Centre: plc(X, under), obj(under, Y)
  - These guidelines: plc(X, Y.@under)
- bigger than Y
  - UNL Centre: man(big, more), bas(big, Y)
  - These guidelines: bas(big.@more, Y)
etc.
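The representations proposed in these guidelines can be sketched in Python as nodes that carry their own attributes, so that a closed-class item such as "under" or "more" never becomes a node of its own. The class `Node` and the function `relation` below are hypothetical helpers written for illustration; only the output strings come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A UW together with the attributes attached to it (hypothetical helper)."""
    uw: str
    attributes: tuple[str, ...] = ()

    def render(self) -> str:
        # Attributes are appended to the UW as ".@name" suffixes.
        return self.uw + "".join(f".@{name}" for name in self.attributes)


def relation(label: str, source: Node, target: Node) -> str:
    """Render a binary relation as label(source, target)."""
    return f"{label}({source.render()}, {target.render()})"


print(Node("X", ("proximal",)).render())                   # X.@proximal
print(relation("plc", Node("X"), Node("Y", ("under",))))   # plc(X, Y.@under)
print(relation("bas", Node("big", ("more",)), Node("Y")))  # bas(big.@more, Y)
```

In each case one relation (or none) replaces the two relations of the older representation, because the closed-class word is folded into an attribute of an existing node.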
Additionally, the following general principles were adopted:
- interjections, filled pauses, phatic expressions and short answers should be represented by the null UW (to be represented as "00") together with the attribute indicating the corresponding speech act (.@confirmation, .@surprise, etc.);
- the attribute .@entry (mandatory in every scope, including the main one) should be placed at the left (source) side of at least one relation;
- the difference between mentioning and using a word (a frequent situation in a metalinguistic text such as Cratylus) should be represented by the attribute .@mention (which is not the same as "quotation"); and
- attributes should be listed in alphabetical order (".@entry.@past" instead of ".@past.@entry").
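Two of these principles are mechanical enough to be checked automatically. The Python sketch below assumes that a single scope is given as a list of `(relation, source, target)` triples whose nodes are written as `UW.@attr1.@attr2`; this input format, the function names and the sample relations are assumptions made for illustration, not part of the specification.

```python
def attributes_of(node: str) -> list[str]:
    """Return the attribute names of a node, e.g. 'say.@entry.@past' -> ['entry', 'past']."""
    return node.split(".@")[1:]


def check(relations: list[tuple[str, str, str]]) -> list[str]:
    """Check one scope for alphabetical attribute order and .@entry placement."""
    problems = []
    entry_on_source = False
    for _label, source, target in relations:
        for node in (source, target):
            attrs = attributes_of(node)
            if attrs != sorted(attrs):
                problems.append(f"attributes not in alphabetical order: {node}")
        # .@entry must appear on the left (source) side of at least one relation.
        if "entry" in attributes_of(source):
            entry_on_source = True
    if not entry_on_source:
        problems.append("no relation has .@entry on its source side")
    return problems


print(check([("agt", "say.@entry.@past", "Socrates")]))  # []
print(check([("agt", "say.@past.@entry", "Socrates")]))
print(check([("agt", "say.@past", "Socrates.@entry")]))
```

The first call reports no problems; the second flags the attribute order, and the third flags that .@entry appears only on the target side.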
SCOPES
In order to make knowledge extraction from UNL documents easier, we have restricted the use of scopes to cases involving semantic ambiguity, such as:
- electric [light orchestra], with scope, i.e., a "light orchestra" that is electric; or
- electric light orchestra, without scope, i.e., an orchestra that is both "light" and "electric".