Dictionary

From UNL Wiki


Revision as of 16:43, 11 February 2010

The UNL-NL dictionaries are bilingual dictionaries linking Universal Words (UWs) to natural language (NL) words. They can be unidirectional (UNL-to-NL or NL-to-UNL) or bidirectional (NL-to-UNL-to-NL). UNL-to-NL dictionaries are used for NL-ization (generation), while NL-to-UNL dictionaries are used for UNL-ization (analysis). The current specifications for UNL-NL dictionaries are presented below. They are not mandatory in general, but must be followed by anyone who wants to use the tools of the UNL Centre and the UNDL Foundation. The features marked with an * are supported only by the UNDL Foundation's tools.

General syntax

In the UNL System, the UNL-NL dictionaries are plain text files with a single entry per line in the following format:

[NLW]  {ID}  "UW"  (ATTR , ... )  < FLG , FRE , PRI >; COMMENTS

Where:

NLW

The lexical item of the natural language. Its format should be decided by the dictionary builder. It can be:

a multiword expression: [United States of America]
a compound: [hot-dog]
a simple word: [happiness]
a simple morpheme: [happ]
a non-motivated linguistic entity: [g]
a complex structure (see below)*: [[bring] [back]]
a regular expression, enclosed between a pair of "/" (see below)*: [/colou{0,1}r/]

ID: The unique identifier (primary key) of the entry.

UW: The Universal Word of UNL. This field can be empty if a word does not need a UW. It can also be a regular expression.

ATTR

The list of features of the NLW. It can be:

a list of simple features: NOU, MCL, SNG
a list of attribute-value pairs*: pos=NOU, gen=MCL, num=SNG
a list of inflection rules (see below)*: IFX(PLR:="oo":"ee")

Attributes should be separated by ",".

FLG: The three-character language code according to ISO 639-3.

FRE: The frequency of the NLW in natural texts. Used for natural language analysis (NL-UNL). It can range from 0 (least frequent) to 255 (most frequent).

PRI: The priority of the NLW. Used for natural language generation (UNL-NL). It can range from 0 to 255.

COMMENTS: Any comment necessary to clarify the mapping between NL and UNL entries. It ends at the line break (return code) that closes the entry.

The features marked with * are not supported by the UNL Centre's tools.
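
The entry format above can be illustrated with a small parser. This is only a sketch and not part of the UNL tools: the names `Entry`, `ENTRY_RE` and `parse_entry` are hypothetical, the pattern handles simple NLWs only (no nested brackets), and it treats {ID} as optional because some examples on this page omit it.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    nlw: str
    entry_id: Optional[int]
    uw: str
    features: str
    flg: str
    fre: int
    pri: int
    comment: str


# Hypothetical pattern for: [NLW] {ID} "UW" (ATTR, ...) <FLG, FRE, PRI>; COMMENTS
# Assumption: simple NLWs only; complex structures need a bracket-aware parser.
ENTRY_RE = re.compile(
    r'''^\s*\[(?P<nlw>[^\[\]]*)\]     # NLW between square brackets
        \s*(?:\{(?P<id>\d+)\})?       # optional ID between braces
        \s*"(?P<uw>[^"]*)"            # UW between double quotes, may be empty
        \s*\((?P<attrs>.*)\)          # feature list, kept as raw text
        \s*<\s*(?P<flg>[a-z]{3})\s*,\s*(?P<fre>\d+)\s*,\s*(?P<pri>\d+)\s*>\s*;
        \s*(?P<comment>.*)$''',
    re.VERBOSE,
)


def parse_entry(line: str) -> Entry:
    """Parse one dictionary line; raise ValueError if it does not fit the format."""
    m = ENTRY_RE.match(line)
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    fre, pri = int(m["fre"]), int(m["pri"])
    if fre > 255 or pri > 255:
        raise ValueError("FRE and PRI must be in the range 0..255")
    entry_id = int(m["id"]) if m["id"] else None
    return Entry(m["nlw"], entry_id, m["uw"], m["attrs"].strip(),
                 m["flg"], fre, pri, m["comment"].strip())


entry = parse_entry('[Peter]{177}"Peter(iof>person)"(NOU)<eng,10,30>;')
print(entry.nlw, entry.entry_id, entry.flg, entry.fre, entry.pri)
```

The feature list is returned as raw text on purpose, since rules and sub-NLW feature lists nest parentheses and need a separate splitting step.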

Formal syntax

<dictionary entry> ::= <NLW><ID><UW><FEATURE LIST>"<"<FLG>","<FRE>","<PRI>">;"

<NLW> ::= "["(<SIMPLE NLW>|<COMPOUND NLW>|<REGULAR EXPRESSION>)"]"
<SIMPLE NLW> ::= <text>
<COMPOUND NLW> ::= ("["<text>"]")+
<ID> ::= "{"<positive integer>"}"
<UW> ::= '"'(<text>|<REGULAR EXPRESSION>)'"'
<FEATURE LIST> ::= "("<FEATURE>(","<FEATURE>)*")"
<FEATURE> ::= (<VALUE>|<ATTRIBUTE>"="<VALUE>|<RULE LIST>|<SUBNLWID>"#"<FEATURE LIST>)
<SUBNLWID> ::= [01..99]
<RULE LIST> ::= <RULE>(";"<RULE>)*
<RULE> ::= <ATTRIBUTE>"("<VALUE>":="<a-rule>(";"<VALUE>":="<a-rule>)*")"
<ATTRIBUTE> ::= <text>
<VALUE> ::= <text>("&"<text>)*
<FLG> ::= ISO 639-3 language codes
<FRE> ::= [0..255]
<PRI> ::= [0..255]
<REGULAR EXPRESSION> ::= "/"<PERL COMPATIBLE REGULAR EXPRESSIONS>"/"

Where:
+ = 1 or more times
* = 0 or more times
| = alternative
Horizontal blank spaces are allowed and ignored except inside quoted text (string literals).

Complex structures as NLW*

In order to deal with multiple word expressions, the NLW can be represented as a complex structure comprising several sub-NLW entries. The syntax for complex NLWs is:

[[sub-NLW][sub-NLW]...[sub-NLW]]  {ID}  "UW"  (ATTR , ..., 01#(ATTR, ...), 02#(ATTR, ...), ...)  < FLG , FRE , PRI >; COMMENTS

Where:
[sub-NLW] is a part of the NLW;
01#(ATTR, ...) are the specific features for the first sub-NLW to appear in the NLW;
02#(ATTR, ...) are the specific features for the second sub-NLW to appear in the NLW;
and so on.
The first sub-NLW to appear in an NLW is always 01, the second is 02, and so on.
The feature list preceded by <number># will apply only to the corresponding sub-NLW.
The features outside the sub-NLW feature lists are shared by all sub-NLWs.

Example

[[bring] [back]] {12343} "to bring back(icl>to bring)" (pos=VER, 01#(IFX(ET0:=3>"ought")), 02#(pos=PRE)) <eng, 0, 0>;

In the entry above, the NLW has been split into two sub-NLWs ([bring] and [back], with a blank space in between). Each sub-NLW has its own features, given in the embedded parentheses inside the feature list. The sub-NLW [bring], which appears first, has the feature IFX(ET0:=3>"ought"), while the sub-NLW [back], which appears second, has the feature pos=PRE. The feature pos=VER, which is outside the specific feature lists, is shared by both of them.
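
The distribution of shared and specific features described above can be sketched in a few lines. The helper names `split_top_level` and `features_by_sub_nlw` are hypothetical, and the sketch assumes that sub-NLW feature lists always have the form NN#(...) with a two-digit index.

```python
import re


def split_top_level(text: str) -> list[str]:
    """Split on commas that sit outside parentheses and double quotes."""
    parts, depth, in_quote, start = [], 0, False, 0
    for i, ch in enumerate(text):
        if ch == '"':
            in_quote = not in_quote
        elif in_quote:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return [p for p in parts if p]


# Assumed shape of a sub-NLW feature list: two digits, "#", parenthesised list.
SUB_FEATURES = re.compile(r"^(\d{2})#\((.*)\)$")


def features_by_sub_nlw(nlw: str, feature_list: str) -> list[tuple[str, list[str]]]:
    """Pair each sub-NLW with the shared features plus its own NN#(...) features."""
    sub_nlws = re.findall(r"\[([^\[\]]+)\]", nlw)
    shared: list[str] = []
    specific: dict[int, list[str]] = {}
    for feature in split_top_level(feature_list):
        m = SUB_FEATURES.match(feature)
        if m:
            specific.setdefault(int(m.group(1)), []).extend(split_top_level(m.group(2)))
        else:
            shared.append(feature)
    return [(sub, shared + specific.get(i, []))
            for i, sub in enumerate(sub_nlws, start=1)]


for sub, feats in features_by_sub_nlw(
        "[[bring] [back]]", 'pos=VER, 01#(TEN(PAS:=3>"ought")), 02#(pos=PRE)'):
    print(sub, feats)
```

Splitting only at top-level commas matters here, because both rules and sub-NLW lists contain commas and parentheses of their own.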

Inflection rules inside dictionary entries*

In order to deal with exceptions and irregular forms, dictionaries may contain rules, which must be included inside the feature list, as follows:

<RULE>             ::= <ATTRIBUTE>"("<VALUE>":="<a-rule>(";"<VALUE>":="<a-rule>)*")"
<ATTRIBUTE>        ::= <HYPER-ATTRIBUTE>|<SIMPLE ATTRIBUTE>
<HYPER-ATTRIBUTE>  ::= <TEXT>
<SIMPLE ATTRIBUTE> ::= <TEXT>
<VALUE>            ::= <VALUE LIST>|<SIMPLE VALUE>
<VALUE LIST>       ::= <SIMPLE VALUE>("&"<SIMPLE VALUE>)*
<SIMPLE VALUE>     ::= <TEXT>
<TEXT>             ::= [a-zA-Z0-9]+

Where:
<ATTRIBUTE> is the attribute that will be used to call the rule
<VALUE> is the value of the attribute that will trigger the rule
<a-rule> is an affixation rule (described in a-rule)
"" = constant
* = to be repeated 0 or more times
+ = to be repeated 1 or more times

Hyper-attributes and value lists

Inflection rules may be introduced by attributes or by hyper-attributes, i.e., attributes that take other attributes as values. Hyper-attributes are used in cases of morpheme overlapping (amalgam), i.e., when the inflectional morphology does not allow a clear separation between specific attributes, as in English verbal morphology, where tense, aspect and mood are generally conflated. Values of hyper-attributes are hence complex structures that may comprise several different values concatenated with "&", as in FLX(3PS&PRS&IND:=0>"s"). These value lists must nevertheless remain analysable, as their components may be referred to as separate entities inside the generation grammar.
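
One way to keep a value list analysable is to split it on "&" and compare its components with the values currently active on the word. The sketch below assumes that a value list triggers its rule only when all of its components are active; this reading is an assumption, and the function names are hypothetical.

```python
def parse_value_list(value: str) -> frozenset[str]:
    """Split a value list such as 3PS&PRS&IND into its separate values."""
    return frozenset(part for part in value.split("&") if part)


def rule_is_triggered(trigger: str, active_values: set[str]) -> bool:
    """Assumption: every component of the value list must be active."""
    return parse_value_list(trigger) <= set(active_values)


print(sorted(parse_value_list("3PS&PRS&IND")))
print(rule_is_triggered("3PS&PRS&IND", {"3PS", "PRS", "IND"}))
print(rule_is_triggered("3PS&PRS&IND", {"3PS", "PAS", "IND"}))
```

A simple value such as PAS is handled by the same code, since it splits into a one-element set.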

Scope of inflection rules

Inflection rules always apply over the field <NLW> and are used only in UNL-to-NL (i.e., generation) dictionaries. In NL-to-UNL (analysis) dictionaries, the recognition of irregular forms and allographs is made either by listing variants, by hyper-regularisation or by regular expressions, as indicated below:

UNL-to-NL Dictionary
- [bring] "bring" (POS=VER, TEN(PAS:="brought")) <eng,0,0>;
NL-to-UNL Dictionary
- First option: listing variants (recommended)
  - [bring] "bring" (POS=VER, TEN=PRS) <eng,0,0>;
  - [brought] "bring" (POS=VER, TEN=PAS) <eng,0,0>;
- Second option: hyper-regularisation
  - [br] "bring" (POS=VER) <eng,0,0>;
- Third option: regular expression
  - [/br(ing|ought)/] "bring" (POS=VER) <eng,0,0>;

Examples of inflection rules

NUM(PLR:="men")

If the value of the attribute "number" (NUM) is "plural" (PLR) then replace the whole natural language word by "men"

POS(ORD:="1">"1st","2">"2nd","3">"3rd")

If the value of the attribute "part of speech" (POS) is "ordinal" (ORD) then:

if the last character of the string is "1", then replace "1" by "1st"; and

if the last character of the string is "2", then replace "2" by "2nd"; and

if the last character of the string is "3", then replace "3" by "3rd".
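
The behaviour described in these examples can be sketched as follows. Only the rule shapes that appear on this page are covered: a constant ("men") and a suffix replacement ("1">"1st"). The third shape, N>"X", is read here as "delete the last N characters and append X"; that reading is an assumption inferred from entries such as 0>"ed" and is not a definition of the a-rule formalism. All names are hypothetical.

```python
import re

WHOLE = re.compile(r'^"([^"]*)"$')                  # "men"
SUFFIX_SWAP = re.compile(r'^"([^"]*)">"([^"]*)"$')  # "1">"1st"
TRIM_APPEND = re.compile(r'^(\d+)>"([^"]*)"$')      # 0>"ed" (assumed meaning)


def apply_a_rule(word: str, rule: str) -> str:
    """Apply one affixation rule to a word; return the word unchanged if it does not fit."""
    rule = rule.strip()
    if m := WHOLE.match(rule):
        return m.group(1)                           # replace the whole word
    if m := SUFFIX_SWAP.match(rule):
        old, new = m.groups()
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
        return word
    if m := TRIM_APPEND.match(rule):
        count, new = int(m.group(1)), m.group(2)
        return word[: max(0, len(word) - count)] + new
    raise ValueError(f"unsupported rule: {rule!r}")


def apply_alternatives(word: str, rules: list[str]) -> str:
    """Try comma-separated alternatives in order; the first one that changes the word wins."""
    for rule in rules:
        result = apply_a_rule(word, rule)
        if result != word:
            return result
    return word


print(apply_a_rule("man", '"men"'))
print(apply_alternatives("22", ['"1">"1st"', '"2">"2nd"', '"3">"3rd"']))
print(apply_a_rule("kill", '0>"ed"'))
```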

Regular expressions inside dictionary entries*

Both the NLW and the UW may be replaced by regular expressions. In both cases, regular expressions must be enclosed between a pair of "/" and should comply with PCRE (Perl Compatible Regular Expressions). They should be represented as follows:

Regular expression in the field <NLW> (used in NL-to-UNL)

[/<RegEx>/] "<UW>" (<FEATURE LIST>) <FLG,FRE,PRI>;

Regular expression in the field <UW> (used in UNL-to-NL)

[<NLW>] "/<RegEx>/" (<FEATURE LIST>) <FLG,FRE,PRI>;

Examples

Regular expressions in the field <NLW>:
- [/colo(u)?r/] "color" (POS=NOU) <eng,0,0>; (NLW = {color, colour})
- [/cit(y|ies)/] "city" (POS=NOU) <eng,0,0>; (NLW = {city, cities})
- [/(\d){4}/] "" (ENT=YEAR) <eng,0,0>; (NLW = any sequence of four digits)
Regular expressions in the field <UW>:
- [city] "/city(.)*/" (POS=NOU) <eng,0,0>; (UW = any UW that starts with the string "city")
- [city] "/(.)\(iof\>city\)/" (POS=NOU) <eng,0,0>; (UW = any UW that ends with the string "(iof>city)")
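
The NLW patterns above can be tried out with a short script. Two assumptions apply: Python's `re` module stands in for PCRE (the patterns used here behave the same in both), and a pattern is taken to match the whole token. The helper name `matches_field` is hypothetical.

```python
import re


def matches_field(pattern: str, text: str) -> bool:
    """Match text against a pattern written with its "/" delimiters."""
    if len(pattern) < 2 or not (pattern.startswith("/") and pattern.endswith("/")):
        raise ValueError('a regular expression must be enclosed between a pair of "/"')
    # Assumption: the whole token (or the whole UW) has to match.
    return re.fullmatch(pattern[1:-1], text) is not None


for pattern, token in [
    ("/colo(u)?r/", "colour"),
    ("/cit(y|ies)/", "cities"),
    (r"/(\d){4}/", "2010"),
    (r"/(\d){4}/", "201"),
    ("/city(.)*/", "city(icl>region)"),
]:
    print(pattern, token, matches_field(pattern, token))
```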

Examples of dictionary entries

[China]{24} "China(iof>Asian country)" (NOU, WRD, SNG, P0, F0) <eng,0,0>;
[choose]{106} "to choose(icl>to decide)" (POS=VER, LEX=WRD, INF=P1, FRA=F76, FLX(3PS&PRS&IND:=0>"s"; PAS:="chose"; PTP:="chosen"; GER:="choosing")) <eng,0,0>;
[clear-eyed]{25} "clear-eyed(icl>discerning)" (POS=ADJ, LEX=WRD, INF=P0, FRA=F0) <eng,0,0>;
[Peter]{177}"Peter(iof>person)"(NOU)<eng,10,30>;
[kill]{5987}"kill(icl>do)"(TEN(PAS:=0>"ed"))<eng,70,80>;
[[bring] [back]]{2345}"bring back"(POS=VER,VA(01>02),01#(POS=VER,TEN(pas:=3>"ought")),02#(POS=PRE))<eng,50,34>;
[/br(ing|ought)/] "bring(icl>do)" (POS=VER) <eng,0,0>;
[[/br(ing|ought)/] [back]]{2345} "bring back(icl>do)" (POS=VER,01#(POS=VER),02#(POS=PRE))<eng,50,34>;
[/colo(u)?r/] "color" (POS=NOU) <eng,0,0>; (NLW = {color, colour})
[/cit(y|ies)/] "city" (POS=NOU) <eng,0,0>; (NLW = {city, cities})
[/(\d){4}/] "" (ENT=YEAR) <eng,0,0>; (NLW = any sequence of four digits)
[city] "/city(.)*/" (POS=NOU) <eng,0,0>; (UW = any UW that starts with the string "city")
[city] "/(.)\(iof\>city\)/" (POS=NOU) <eng,0,0>; (UW = any UW that ends with the string "(iof>city)")

Components of the hyper-attribute INFLEXION (FLX), as referred to under "Hyper-attributes and value lists" ("..." marks omitted entries):

INFLEXION (FLX) (attribute)
- aspect (ASP) (attribute)
  - perfective (PFV) (value)
  - imperfective (NPFV) (value)
  - ...
  - perfect of result (RES) (value)
  - ...
- mood (MOO) (attribute)
  - assumptive (AUM) (value)
  - conditional (CON) (value)
  - ...
  - deductive (DED) (value)
  - ...
- person (PER) (attribute)
  - first person singular (1PS) (value)
  - first person plural (1PP) (value)
  - second person singular (2PS) (value)
  - ...
- tense (TNS) (attribute)
  - present (PRS) (value)
  - past (PAS) (value)
  - ...
