English Disambiguation Grammar
Revision as of 15:08, 28 July 2012
The English disambiguation grammars, or English d-grammars, are part of the English grammar. They are used to improve the results of tokenization and to control the application of t-rules. They follow the formalism described at UNL Grammar Specs and are used both in natural language analysis (UNLization) and in natural language generation (NLization).
UNLization
In natural language analysis, the d-grammar is used to control the tokenization of English sentences, i.e., to prevent wrong lexical choices and to induce the best matches. The d-grammar comprises two different types of rules:
- Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices.
- For instance, the rule (D)(BLK)(V)=0; states that the sequence determiner + blank space + verb is not allowed, i.e., there cannot be a determiner before a verb.
- Positive rules, where the probability is greater than 0, force lexical choices.
- For instance, the rule (['s],V)(BLK)(GER)=1; states that the verb entry ['s] is to be preferred before a gerund. There are three entries ['s] in the dictionary: the contracted form of "is", the particle used to form the genitive, and a plural suffix. Without this rule, the system would simply select the first entry appearing in the dictionary with the highest frequency.
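The interplay between blocking and positive rules can be illustrated with a small Python sketch. This is only a toy model, not the way IAN is implemented: the `Token` class, the `score` and `best` functions, and the feature tags `PRO` and `GEN` are assumptions made for illustration, and the headword ['s] is simplified to a pseudo-feature named `'s`. A candidate tokenization is discarded as soon as a rule with probability 0 matches; otherwise the probabilities of the matching positive rules are summed, and ties go to the candidate listed first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    feats: frozenset


def tok(text, *feats):
    return Token(text, frozenset(feats))


# Each rule: (sequence of required-feature sets, probability).
RULES = [
    # (D)(BLK)(V)=0;
    ((frozenset({"D"}), frozenset({"BLK"}), frozenset({"V"})), 0.0),
    # (['s],V)(BLK)(GER)=1;  (headword modelled as the pseudo-feature "'s")
    ((frozenset({"'s", "V"}), frozenset({"BLK"}), frozenset({"GER"})), 1.0),
]


def matches_at(pattern, tokens, i):
    """True if every node of the pattern matches the tokens starting at i."""
    if i + len(pattern) > len(tokens):
        return False
    return all(req <= tokens[i + k].feats for k, req in enumerate(pattern))


def score(tokens, rules):
    """None if a blocking rule matches; otherwise the sum of matched probabilities."""
    total = 0.0
    for pattern, prob in rules:
        for i in range(len(tokens)):
            if matches_at(pattern, tokens, i):
                if prob == 0:
                    return None
                total += prob
    return total


def best(candidates, rules):
    """Pick the highest-scoring unblocked candidate (first one wins on ties)."""
    allowed = [(score(c, rules), c) for c in candidates]
    allowed = [(s, c) for s, c in allowed if s is not None]
    if not allowed:
        return None
    return max(allowed, key=lambda sc: sc[0])[1]


# "He's going": genitive reading listed first, verb reading second.
cand_gen = [tok("He", "PRO"), tok("'s", "'s", "GEN"), tok(" ", "BLK"), tok("going", "GER")]
cand_verb = [tok("He", "PRO"), tok("'s", "'s", "V"), tok(" ", "BLK"), tok("going", "GER")]

# "that is": determiner reading versus pronoun reading.
cand_det = [tok("that", "D"), tok(" ", "BLK"), tok("is", "V")]
cand_pro = [tok("that", "PRO"), tok(" ", "BLK"), tok("is", "V")]
```

In this sketch, `best([cand_gen, cand_verb], RULES)` selects the verb reading, whereas with an empty rule list the first candidate is kept; `best([cand_det, cand_pro], RULES)` drops the determiner reading because the blocking rule matches it.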
File
The English d-grammar for the Corpus500 may be downloaded from eng_ana_dgrammar.txt. The complete English d-grammar may be exported from the UNLarium: UNLWEB>UNLARIUM>GRAMMAR>ENGLISH>EXPORT.
How to use d-grammars
The d-grammar must be uploaded to or provided directly at the tab d-rules in IAN.
Examples of disambiguation rules
- Preventing hyper-segmentation of temporary entries
- "asdfg" must be tokenized as [asdfg] (one single temporary entry) instead of [as][dfg], which would otherwise happen because [as] is in the dictionary.
- (^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
- This rule states that words must be delimited by blank spaces, punctuation signs, or the sentence boundaries (SHEAD and STAIL), except in the case of subwords, prefixes and suffixes.
- (^DIGIT)({[st]|[nd]|[rd]|[th]})=0;
- The subwords [st], [nd], [rd] and [th] may only appear after a number ("1st" or "10th"). This prevents hyper-segmentation as in "asdfgst" = [asdfg][st], which would not be blocked by the rule above, because [st] is a suffix.
- Preventing the generation of two temporary words in sequence
- "asdfg hijkl" must be represented as a single temporary word [asdfg hijkl] instead of [asdfg][ ][hijkl], which would otherwise happen because [ ] is in the dictionary.
- (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
- A temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence.
- (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
- A temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence.
- Determiners vs. pronouns
- There are many ambiguities in the dictionary between determiners and pronouns. The string "that", for instance, is represented in the dictionary as a demonstrative determiner ("that book") and as a demonstrative pronoun ("that is the book"). The following rules help to differentiate them:
- (D,^AFT)({PUT|STAIL})=0;
- Determiners cannot come before a punctuation sign or at the end of the sentence, unless their distribution is AFT (after). In "He said that", "that" is therefore not a determiner, whereas "There are books enough" is allowed.
- (D)(BLK)(V)=0;
- Determiners cannot come before a verb (in "That is the book", "that" is a pronoun).
- Auxiliary verbs vs. main verbs
- Many auxiliary verbs may also play the role of main verbs: in "He has done that", "has" is an auxiliary, whereas in "He has a car", "has" is the main verb. The following rule helps to differentiate them:
- (AUX,^COP)(BLK)(^V,^[not])=0;
- Auxiliary verbs that are not copulas must be followed by a verb or by the adverb [not].
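To make the notation in the examples above more concrete, the following Python sketch evaluates a small subset of it. It rests on an assumed reading of the rules as described on this page, not on the official specification: commas inside a node are taken as conjunction, `^` as negation, `{A|B}` as alternation, and `[x]` or `"x"` as a match on the token text. The functions `parse_rule`, `term_matches`, `node_matches` and `violates`, the helper `t`, and the token layout (a dict with `text` and `feats`) are hypothetical names introduced for illustration, and the sentence end is modelled as an empty token carrying the feature STAIL.

```python
import re


def parse_rule(rule):
    """Parse '(A,^B)({C|D})=0;' into (list of nodes, probability)."""
    body, prob = rule.strip().rstrip(";").rsplit("=", 1)
    nodes = re.findall(r"\(([^()]*)\)", body)
    return [node.split(",") for node in nodes], float(prob)


def term_matches(term, token):
    """Evaluate one term of a node against one token."""
    term = term.strip()
    if term.startswith("^"):                       # negation
        return not term_matches(term[1:], token)
    if term.startswith("{") and term.endswith("}"):  # alternation
        return any(term_matches(alt, token) for alt in term[1:-1].split("|"))
    if (term[0], term[-1]) in {("[", "]"), ('"', '"')}:  # headword or string
        return token["text"] == term[1:-1]
    return term in token["feats"]                  # plain feature


def node_matches(terms, token):
    """All comma-separated terms of a node must hold for the token."""
    return all(term_matches(term, token) for term in terms)


def violates(rule, tokens):
    """True if a blocking rule (probability 0) matches anywhere in the tokens."""
    nodes, prob = parse_rule(rule)
    if prob != 0:
        return False
    return any(
        all(node_matches(node, tokens[i + k]) for k, node in enumerate(nodes))
        for i in range(len(tokens) - len(nodes) + 1)
    )


def t(text, *feats):
    return {"text": text, "feats": set(feats)}


AUX_RULE = "(AUX,^COP)(BLK)(^V,^[not])=0;"
DET_RULE = '(D,^AFT)({PUT|STAIL})=0;'

has_aux_car = [t("has", "AUX"), t(" ", "BLK"), t("a", "D")]
has_verb_car = [t("has", "V"), t(" ", "BLK"), t("a", "D")]
has_aux_done = [t("has", "AUX"), t(" ", "BLK"), t("done", "V")]
has_aux_not = [t("has", "AUX"), t(" ", "BLK"), t("not", "ADV")]
that_det_end = [t("that", "D"), t("", "STAIL")]
enough_end = [t("enough", "D", "AFT"), t("", "STAIL")]
```

Under these assumptions, `violates(AUX_RULE, has_aux_car)` is true, so the auxiliary reading of "has" is rejected in "has a car", while the main-verb reading and the sequences "has done" and "has not" pass. Likewise, `violates(DET_RULE, that_det_end)` is true, but the AFT determiner in `enough_end` is not blocked.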