Universal Words

From UNL Wiki

Revision as of 19:54, 20 June 2011

Universal Words, or simply UWs, are the words of UNL. They correspond to the nodes of a UNL graph, which are interlinked by relations or modified by attributes. They are labels for relatively stable units of knowledge (the concepts) that can be associated with natural language open lexical categories (noun, verb, adjective and adverb). The syntax of UWs is defined by the UNL Specs, but the set of UWs is relatively open: it includes permanent UWs (those listed in the UNL Dictionary) and temporary UWs. Additionally, permanent UWs may be organized in a hierarchy (the UNL Ontology), defined in the UNL Knowledge Base and exemplified in the UNL Example Base, which are the lexical databases for UNL.

Types of UWs

There are basically two different types of UWs: permanent and temporary.
Permanent UWs are included in the UNL Dictionary and correspond to concepts of common use (common nouns, adjectives, adverbs and verbs).
Temporary UWs are words that:

  • Are still candidates to be included in the UNL Dictionary ("Barack Obama", "Twitter");
  • Are too specific to be included in the UNL Dictionary ("Universal Networking Digital Language Foundation", "Léon Werth"); or
  • Are not translatable ("3.14159", "H2O", "www.undlfoundation.org").

The difference between permanent and temporary UWs is rather intuitive. Most named entities, for instance, are represented as temporary UWs, because it would not be feasible to include them all in the UNL Dictionary. Nevertheless, some named entities of widespread use (such as "William Shakespeare", "Romeo and Juliet", "Romeo", "Verona", "4th of July", "IBM", etc.) have already been included in the UNL Dictionary and are treated as permanent UWs.

Structure of UWs

Temporary UWs are always represented between double quotes, and observe the source language spelling practices (concerning, for instance, capitalization). For the time being, they are also expected to be transliterated into Roman characters.
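As a rough illustration of the quoting rule, the check below tests whether a string has the shape of a temporary UW. It is only a sketch: the function name `is_temporary_uw` and the `unl` flag are hypothetical, the rule that the body contains no embedded double quote is an assumption, and "Roman characters" is read here as ASCII, following the UNL representation in the formal syntax.

```python
def is_temporary_uw(text: str, unl: bool = True) -> bool:
    """Return True if `text` looks like a temporary UW.

    unl=True  -> UNL representation: quoted, ASCII-only body.
    unl=False -> NL representation: quoted, any Unicode body.
    """
    # A temporary UW needs both quotes and at least one character in between.
    if len(text) < 3 or text[0] != '"' or text[-1] != '"':
        return False
    body = text[1:-1]
    # Assumption (not stated in the source): no embedded double quotes.
    if '"' in body:
        return False
    return body.isascii() if unl else True


print(is_temporary_uw('"Barack Obama"'))
print(is_temporary_uw('"Léon Werth"'))
print(is_temporary_uw('"Léon Werth"', unl=False))
```

Under these assumptions, "Léon Werth" would need transliteration before it is accepted in the UNL representation, while it is accepted as is in the NL representation.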

Permanent UWs can be either simple (atomic) or complex (made out of other UWs). In the latter case, they are represented as hyper-nodes, i.e., sub-hyper-graphs, and follow the syntax for UNL Sentences. A simple UW is an integer which can also be represented, for better readability, as a unique character-string split into two different parts: a root and a suffix. The root can be a word, an expression, a phrase or even an entire sentence in any language. It should be interpreted as a label for a concept. The suffix, which is always introduced by a UNL relation, is used to disambiguate the root.

As language-independent semantic units, UWs are equivalent to the sets of synonyms of a given language, approaching the concept of "synset" devised by WordNet (Fellbaum, 1998). As a matter of fact, the current UNL Dictionary has been automatically extracted from WordNet 3.0, and UWs have been represented as 9-digit strings with the following format:

<POS><WORDNETID>

where <POS> = {1,2,3,4}, with 1 = noun, 2 = verb, 3 = adjective and 4 = adverb;
and <WORDNETID> is the synset ID in WordNet 3.0.
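The 9-digit format can be decoded mechanically. The following is a minimal sketch, assuming the ID is handled as a string so that leading zeros of the synset ID are preserved; the names `POS_NAMES` and `split_uw_id` are illustrative and not part of the UNL Specs.

```python
# <POS> codes as given in the text.
POS_NAMES = {"1": "noun", "2": "verb", "3": "adjective", "4": "adverb"}


def split_uw_id(uw_id: str) -> tuple[str, str]:
    """Split a 9-digit UW into (part of speech, 8-digit WordNet synset ID)."""
    if len(uw_id) != 9 or not (uw_id.isascii() and uw_id.isdigit()):
        raise ValueError(f"expected a 9-digit string, got {uw_id!r}")
    pos_code, synset_id = uw_id[0], uw_id[1:]
    if pos_code not in POS_NAMES:
        raise ValueError(f"unknown <POS> code {pos_code!r} in {uw_id!r}")
    return POS_NAMES[pos_code], synset_id


print(split_uw_id("104379964"))  # ('noun', '04379964')
print(split_uw_id("300217728"))  # ('adjective', '00217728')
```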

The current UNL Dictionary is, however, only a starting point, as the set of UWs is supposed to be as comprehensive as the set of individual concepts depicted by different languages and cultures. In that sense, UWs are not to be considered semantic primitives, nor should they represent only common concepts, nor should they be derived from any particular language. They must include culture-dependent information and every relevant variation among similar concepts. Furthermore, the UNL Dictionary constitutes an open set, subject to permanent increase with new UWs, as UNL is supposed to incessantly incorporate new cultures and cultural changes.

Formal syntax

The syntax for UWs, both permanent and temporary, is defined as follows:

UNL REPRESENTATION
<PERMANENT UW>       ::= <integer>
<TEMPORARY UW>       ::= """<ASCII character>+"""

NL REPRESENTATION
<TEMPORARY UW>       ::= """<UNICODE character>+"""
<PERMANENT UW>       ::= <root>[<suffix>]
<root>               ::= <character>+
<suffix>             ::= "(" <suffix> [ "," <suffix> ]... ")" | <relation> { ">" , "<" } <root>
<relation>           ::= {"agt", "and", "aoj", ...}

where:
+ to be repeated 1 or more times
< > variable
" " terminal symbol
::= ... is defined as ...
| or
[ ] optional element
{ } alternative element
... to be repeated more than 0 times
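
The NL representation of a permanent UW can be read with a small recursive-descent parser that follows the grammar rule by rule. This is a sketch under stated assumptions, not a reference implementation: `<character>` is taken to exclude the delimiters `( ) , < >`, the relation set contains only the three relations named in the grammar plus "icl" from the examples, and the names `Restriction` and `parse_permanent_uw` are hypothetical.

```python
import re
from dataclasses import dataclass

# Relations named in the grammar, plus "icl" from the examples.
# The real inventory is longer; the grammar elides it with "...".
RELATIONS = frozenset({"agt", "and", "aoj", "icl"})

# Assumption: <character> is anything except the grammar's own delimiters.
ROOT_RE = re.compile(r"[^(),<>]+")
LINK_RE = re.compile(r"([a-z]+)([<>])")


@dataclass(frozen=True)
class Restriction:
    relation: str   # e.g. "icl"
    direction: str  # ">" or "<"
    root: str       # e.g. "furniture"


def parse_permanent_uw(text, relations=RELATIONS):
    """Parse '<root>[<suffix>]' and return (root, [Restriction, ...])."""
    pos = 0

    def fail(message):
        raise ValueError(f"{message} at offset {pos} in {text!r}")

    def parse_root():
        nonlocal pos
        match = ROOT_RE.match(text, pos)
        if match is None:
            fail("expected a root")
        pos = match.end()
        return match.group()

    def parse_suffix():
        nonlocal pos
        # First alternative: "(" <suffix> [ "," <suffix> ]... ")"
        if text.startswith("(", pos):
            pos += 1
            items = parse_suffix()
            while text.startswith(",", pos):
                pos += 1
                items += parse_suffix()
            if not text.startswith(")", pos):
                fail("expected ')'")
            pos += 1
            return items
        # Second alternative: <relation> { ">" , "<" } <root>
        match = LINK_RE.match(text, pos)
        if match is None or match.group(1) not in relations:
            fail("expected a known relation followed by '<' or '>'")
        pos = match.end()
        return [Restriction(match.group(1), match.group(2), parse_root())]

    root = parse_root()
    restrictions = parse_suffix() if pos < len(text) else []
    if pos != len(text):
        fail("unexpected trailing text")
    return root, restrictions


print(parse_permanent_uw("table(icl>furniture)"))
```

The parser returns the root and a flat list of restrictions, and raises `ValueError` for unknown relations or unbalanced parentheses.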

Examples

The UW for the concept of "a piece of furniture with tableware for a meal laid out on it" may be represented as follows:

UNL Representation    NL Representation
104379964             table(icl>furniture)
                      table(icl>mobilier)
                      mesa(icl>mobiliario)
                      Tisch(icl>Möbel)
                      стол(icl>мебель)
                      ...
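
The one-to-many relationship in this example, a single integer with several NL labels, can be sketched as a small lookup table. The language tags ("en", "fr", "es", "de", "ru") are hypothetical annotations added for illustration, and the names `UW_LABELS`, `to_unl` and `to_nl` are not part of UNL; only the ID and the five labels come from the example.

```python
# One UNL representation mapped to its NL representations.
# Language tags are an assumption made for this sketch.
UW_LABELS = {
    104379964: {
        "en": "table(icl>furniture)",
        "fr": "table(icl>mobilier)",
        "es": "mesa(icl>mobiliario)",
        "de": "Tisch(icl>Möbel)",
        "ru": "стол(icl>мебель)",
    },
}

# Reverse index: every NL label points back to the same integer.
LABEL_TO_ID = {
    label: uw_id
    for uw_id, labels in UW_LABELS.items()
    for label in labels.values()
}


def to_unl(label: str) -> int:
    """Return the UNL representation (integer) for an NL label."""
    return LABEL_TO_ID[label]


def to_nl(uw_id: int, lang: str) -> str:
    """Return the NL label of a UW for a given language tag."""
    return UW_LABELS[uw_id][lang]


print(to_unl("mesa(icl>mobiliario)"))  # 104379964
print(to_nl(104379964, "de"))          # Tisch(icl>Möbel)
```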

Semantics

The basic assumption of the UNL approach is that the information conveyed by natural languages can be formally represented through three different types of semantic units: concepts, concept modifiers and binary relations between concepts. This three-layered representation model is the cornerstone of UNL and its most distinctive feature compared with other semantic networks, which normally propose only two levels: edges and vertices. Nevertheless, it poses several problems for UNL-ization, as the distinction between what is supposed to be represented by each unit is not always clear.

The main difficulty concerns what is to be represented as a concept (and therefore as a UW) and what is to be represented as a relation between concepts. How many concepts (UWs) are there, for instance, in the sentence "Charles Dickens was the author of Oliver Twist"? Should "author" be represented as a concept or as a relation between "Charles Dickens" and "Oliver Twist"? Should the verb "to be" be represented as a concept or as a relation between "Charles Dickens" and "author"? Should the preposition "of" be represented as a concept or as a relation between "author" and "Oliver Twist"?

In order to avoid what can be an endless discussion, the UNL assumes that UWs must correspond to, and only to, concepts referred to by natural language open lexical categories (noun, verb, adjective and adverb). Any other semantic content (such as that conveyed by articles, prepositions, conjunctions, etc.) should be represented either as attributes of UWs or as relations between UWs. This criterion is not language-biased: if a given semantic value proves to be conveyed, in any language, by a closed class, it should not be represented as a UW, regardless of its realisation in other languages.

Categories of UWs

Permanent UWs are classified into four different categories, depending on their semantic values: nominal, verbal, adjectival and adverbial.

It should be stressed that these categories are semantic rather than syntactic or morphological. They are related to the UWs and are not oriented to any particular language. In that sense, adjectival UWs (such as "300217728" = "delighting the senses or exciting intellectual or emotional admiration") tend to be associated with English adjectives ("beautiful"), but they can also be realised as prepositional phrases ("with beauty"), verbal phrases ("possessing beauty"), etc.

Additionally, it should be emphasized that the set of UWs is not derived from any particular language. In that sense, there will be many UWs that do not correspond to a single lexical item and will have to be represented by periphrases. The concept "a state of torment created by the sudden sight of one's own misery", for instance, is lexicalized in Czech ("litost"), but not in English. In principle, the set of UWs, which is the UNL Dictionary, is supposed to be as comprehensive as the set of individual concepts depicted by different cultures, no matter how specific they are. In that sense, UWs are not to be considered semantic primitives, nor should they represent only common concepts. They must include culture-dependent information and every relevant variation among similar concepts. Furthermore, the UNL Dictionary constitutes an open set, subject to permanent increase with new UWs, as UNL is supposed to incessantly incorporate new cultures and cultural changes.

Software