English Disambiguation Grammar

Latest revision as of 22:51, 29 October 2012

The English disambiguation grammars, or English d-grammars, are part of the English grammar. They are used to improve the results of tokenization and to control the application of t-rules. They follow the formalism described in the UNL Grammar Specs and are used both in natural language analysis (UNLization) and in natural language generation (NLization).

ENG->UNL Disambiguation Grammar

In natural language analysis, the d-grammar is used to control the tokenization of English sentences, i.e., to prevent wrong lexical choices and to induce the best matches. The d-grammar comprises two types of disambiguation rules, or d-rules:

  • Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices.
    For instance, the rule (D)(BLK)(V)=0; states that the sequence determiner + blank space + verb is not allowed, i.e., a determiner cannot come immediately before a verb.
  • Positive rules, where the probability is greater than 0, force lexical choices.
    For instance, the rule (['s],V)(BLK)(GER)=1; states that the entry ['s] as a verb is to be preferred before a gerund. There are three entries ['s] in the dictionary: the contracted form of "is", the particle used to form the genitive, and a plural suffix. Without this rule, the system would simply select the first one appearing in the dictionary with the highest frequency.
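
The interplay of the two rule types can be illustrated with a small Python sketch. It is not the engine's actual algorithm, which this page does not describe; it only assumes that a rule with probability 0 removes a candidate, that a rule with a higher probability wins among the remaining candidates, and that frequency decides when no rule applies. The feature names GEN and PLR, the frequencies, the NEGATIVE rule and the function choose are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    form: str
    features: frozenset
    frequency: int


# Three hypothetical entries for 's; GEN, PLR and the frequencies are made up.
CANDIDATES = [
    Entry("'s", frozenset({"GEN"}), 3),
    Entry("'s", frozenset({"V"}), 2),
    Entry("'s", frozenset({"PLR"}), 1),
]

# Simplified rules: (features of the candidate, features of the next word, probability).
# The blank node (BLK) of the real rules is left out to keep the sketch short.
POSITIVE = (frozenset({"V"}), frozenset({"GER"}), 1.0)    # mirrors (['s],V)(BLK)(GER)=1;
NEGATIVE = (frozenset({"GEN"}), frozenset({"GER"}), 0.0)  # invented blocking rule


def choose(candidates, next_features, rules):
    """Pick one candidate entry given the features of the following word."""
    allowed, preferred = [], []
    for cand in candidates:
        # Probabilities of every rule whose two nodes both match.
        probs = [p for cf, nf, p in rules
                 if cf <= cand.features and nf <= next_features]
        if any(p == 0 for p in probs):
            continue  # a negative rule blocks this candidate
        allowed.append(cand)
        if probs:
            preferred.append((max(probs), cand))
    if preferred:
        # A positive rule applies: take the candidate with the highest probability.
        return max(preferred, key=lambda pair: pair[0])[1]
    # No positive rule applies: fall back to the most frequent allowed entry.
    return max(allowed, key=lambda c: c.frequency, default=None)
```

Calling choose(CANDIDATES, frozenset({"GER"}), [POSITIVE]) selects the verb entry, whereas an empty rule list falls back to the most frequent entry.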

How to use d-grammars

D-grammars must be uploaded to, or entered directly in, the d-rules tab in IAN.

Examples of disambiguation rules

TOKENIZATION OF TEMPORARY WORDS (used to control hyper-segmentation)
(TEMP,^DIGIT,^W)(^BLK,^PUT,^STAIL)=0;
there must be a blank, a punctuation sign or the end of the sentence after a temporary word, i.e., a temporary word cannot be followed by another word, except for digits, as in "1st"
(^BLK,^PUT,^SHEAD)(TEMP,^W)=0;
there must be a blank, a punctuation sign or the beginning of the sentence before a temporary word, i.e., a temporary word cannot be preceded by another word

(TEMP)(PUT)(TEMP)=0;
there cannot be two temporary words separated by a punctuation mark
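
To see why these rules matter, the following Python sketch enumerates every way of splitting an unknown string and discards the splits in which a temporary word touches another token. It is a toy model, not the tokenizer used by IAN: the one-word dictionary, the string "asdfg" and both function names are assumptions made for the example, and the exception for digits is ignored.

```python
def segmentations(text, dictionary):
    """Yield every split of text into (chunk, kind) pairs.

    kind is 'W' for a chunk found in the dictionary and 'TEMP' otherwise.
    """
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        chunk = text[:end]
        kind = "W" if chunk in dictionary else "TEMP"
        for rest in segmentations(text[end:], dictionary):
            yield [(chunk, kind)] + rest


def violates_temp_rules(tokens):
    """True if a temporary word is directly adjacent to another token.

    The input contains no blanks or punctuation, so any neighbour of a
    TEMP chunk is another word, which the two rules above forbid.
    """
    return any("TEMP" in (left[1], right[1])
               for left, right in zip(tokens, tokens[1:]))


dictionary = {"as"}  # hypothetical one-entry dictionary
candidates = list(segmentations("asdfg", dictionary))
survivors = [c for c in candidates if not violates_temp_rules(c)]
print(survivors)
```

Without the filter, the split [as][dfg] is one of the candidates; with it, only the reading of the whole string as a single temporary word remains.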

DETERMINERS X PRONOUNS (used to disambiguate pronouns from determiners, which come first in the dictionary)
(D,^AFT)({PUT,^BLK|STAIL})=0; 
determiners may not come at the end of the sentence or before a punctuation mark, except if their distribution is AFT, like "enough"
(D,^AFT)(BLK)({V|P|AAV})=0; 
determiners may not precede verbs, prepositions or adjunct adverbs, except if their distribution is AFT

AUXILIARY VERBS X MAIN VERBS (have, be)
(AUX)(BLK)(^V,^[not])=0; 
an auxiliary verb must be followed by a verb or by the word "not" (the rule as written lists no exception for "to")
(AUX)(BLK)([not])(BLK)(^V)=0; 
if the auxiliary is followed by "not", a verb must come after "not"
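
The way such a rule is read against a sentence can be sketched in Python. The sketch assumes that the conditions inside one node are combined with "and", that ^ negates a condition, and that [x] matches the word form x; it handles only the simple node syntax and not the {A|B} alternatives used in the determiner rules. The token representation and the function names parse_rule, condition_holds and rule_matches are hypothetical, as are the features given to "not" and "book".

```python
import re


def parse_rule(text):
    """Parse a simple d-rule such as '(AUX)(BLK)(^V,^[not])=0;'."""
    body, _, prob = text.strip().rstrip(";").rpartition("=")
    nodes = [[cond.strip() for cond in node.split(",")]
             for node in re.findall(r"\(([^()]*)\)", body)]
    return nodes, float(prob)


def condition_holds(cond, token):
    """Check one condition (feature, [form], or either negated with ^)."""
    negated = cond.startswith("^")
    if negated:
        cond = cond[1:]
    if cond.startswith("[") and cond.endswith("]"):
        result = token["form"] == cond[1:-1]
    else:
        result = cond in token["features"]
    return result != negated


def rule_matches(nodes, tokens):
    """True if the node sequence matches some run of adjacent tokens."""
    n = len(nodes)
    return any(
        all(condition_holds(c, tokens[i + j]) for j in range(n) for c in nodes[j])
        for i in range(len(tokens) - n + 1)
    )


def tok(form, *features):
    return {"form": form, "features": set(features)}


nodes, prob = parse_rule("(AUX)(BLK)(^V,^[not])=0;")
blank = tok(" ", "BLK")
has_gone = [tok("has", "AUX", "V"), blank, tok("gone", "V")]
has_not = [tok("has", "AUX", "V"), blank, tok("not", "ADV")]
has_book = [tok("has", "AUX", "V"), blank, tok("book", "N")]
```

With these hypothetical tokens, rule_matches(nodes, has_book) is true, so reading "has" as an auxiliary would be blocked there, while the sequences ending in "gone" or "not" do not match the rule.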