English Disambiguation Grammar
The English disambiguation grammars, or English d-grammars, are part of the English grammar. They are used to improve the results of tokenization and to control the application of t-rules. They follow the formalism described in the UNL Grammar Specs and are used both in natural language analysis (UNLization) and in natural language generation (NLization).
UNLization
In natural language analysis, the d-grammar is used to control the tokenization of English sentences, i.e., to prevent wrong lexical choices and to induce the best matches. The d-grammar comprises two types of disambiguation rules, or d-rules:
- Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices
- For instance, the rule (D)(BLK)(V)=0; states that the sequence determiner + blank space + verb is not allowed, i.e., a determiner cannot come immediately before a verb.
- Positive rules, where the probability is more than 0, force lexical choices
- For instance, the rule (['s],V)(BLK)(GER)=1; states that the entry ['s] as a verb is to be preferred before a gerund. There are three entries ['s] in the dictionary: the contracted form of "is", the particle used to form the genitive, and a plural suffix. Without this rule, the system would simply select the first entry in the dictionary with the highest frequency.
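The interplay of the two rule types can be pictured with a small Python sketch. This is only an illustration under simplifying assumptions, not the algorithm IAN actually uses: each candidate tokenization is modelled as a list of feature sets, a rule as a list of node patterns plus a probability, and ties are broken by dictionary order. The feature names GEN, PLR and PRON, as well as all function names, are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Token = Set[str]                       # features of one lexical choice
Rule = Tuple[List[List[str]], float]   # (node patterns, probability)

def node_matches(pattern: List[str], token: Token) -> bool:
    """A node matches if all its features are present and all ^features are absent."""
    for feat in pattern:
        if feat.startswith("^"):
            if feat[1:] in token:
                return False
        elif feat not in token:
            return False
    return True

def rule_matches(patterns: List[List[str]], tokens: List[Token]) -> bool:
    """True if the patterns match some run of adjacent tokens."""
    n = len(patterns)
    return any(
        all(node_matches(p, t) for p, t in zip(patterns, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )

def score(tokens: List[Token], rules: List[Rule]) -> Optional[float]:
    """None if a blocking rule (probability 0) applies, else the sum of matched probabilities."""
    total = 0.0
    for patterns, prob in rules:
        if rule_matches(patterns, tokens):
            if prob == 0:
                return None
            total += prob
    return total

def best(candidates: List[List[Token]], rules: List[Rule]) -> Optional[List[Token]]:
    """Pick the highest-scoring unblocked candidate (assumed tie-break: earliest candidate)."""
    allowed = [(s, -i) for i, c in enumerate(candidates)
               if (s := score(c, rules)) is not None]
    if not allowed:
        return None
    _, neg_index = max(allowed)
    return candidates[-neg_index]

RULES: List[Rule] = [
    ([["D"], ["BLK"], ["V"]], 0.0),            # (D)(BLK)(V)=0;
    ([["['s]", "V"], ["BLK"], ["GER"]], 1.0),  # (['s],V)(BLK)(GER)=1;
]

# Three readings of ['s] before a gerund; GEN and PLR are hypothetical tags.
genitive = [{"['s]", "GEN"}, {"BLK"}, {"GER"}]
plural = [{"['s]", "PLR"}, {"BLK"}, {"GER"}]
verb = [{"['s]", "V"}, {"BLK"}, {"GER"}]

# Two readings of "that" before a verb; PRON is a hypothetical tag.
determiner = [{"D"}, {"BLK"}, {"V"}]
pronoun = [{"PRON"}, {"BLK"}, {"V"}]
```

In this toy model, `best([genitive, plural, verb], RULES)` selects the verb reading because only it matches the positive rule, while `best([determiner, pronoun], RULES)` selects the pronoun reading because the determiner reading is blocked.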
File
The English d-grammar for the Corpus500 may be downloaded from eng_ana_dgrammar.txt. The complete English d-grammar may be exported from the UNLarium: UNLWEB>UNLARIUM>GRAMMAR>ENGLISH>EXPORT.
How to use d-grammars
D-grammars must be uploaded to, or entered directly in, the d-rules tab in IAN.
Examples of disambiguation rules
- Preventing hyper-segmentation of temporary entries
- "asdfg" must be tokenized as [asdfg] (one single temporary entry) instead of [as][dfg], which would otherwise be chosen because [as] is in the dictionary
- (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence)
- (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence)
- (^DIGIT)({[st]|[nd]|[rd]|[th]})=0;
- The subwords [st], [nd], [rd] and [th] may only appear after a number ("1st" or "tenth"). This prevents hyper-segmentation as in "asdfgst" = [asdfg][st], which would not be blocked by the rule above, because [st] is a suffix.
- Determiners x pronouns
- The dictionary contains many ambiguities between determiners and pronouns. The string "that", for instance, is represented in the dictionary both as a demonstrative determiner ("that book") and as a demonstrative pronoun ("that is the book"). The following rules help differentiate them:
- (D,^AFT)({PUT|STAIL})=0;
- Determiners cannot come at the end of the sentence unless their distribution is AFT (after): in "He said that", "that" is therefore not a determiner, whereas in "There are books enough", "enough" is
- (D)(BLK)(V)=0;
- Determiners cannot come before a verb ("That is the book")
- Auxiliary verbs x main verbs
- Many auxiliary verbs may also play the role of main verbs: "He has done that" ("has" is an auxiliary) x "He has a car" ("has" is the main verb). The following rule helps differentiate them:
- (AUX,^COP)(BLK)(^V,^[not])=0;
- Auxiliary verbs that are not copulas must be followed by a verb or the adverb [not]
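The effect of the hyper-segmentation rules in the first example above can be sketched in Python. This is a simplified illustration, not the way IAN evaluates d-rules: it assumes an isolated word delimited by blanks or sentence boundaries, a hypothetical one-entry dictionary, and it ignores the DIGIT, W and PUT exceptions. Under these assumptions, any temporary chunk with a neighbour inside the word violates one of the two TEMP rules.

```python
DICTIONARY = {"as"}  # hypothetical dictionary with a single entry

def segmentations(word):
    """Yield every way of splitting `word` into consecutive chunks."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        for rest in segmentations(word[end:]):
            yield [head] + rest

def is_blocked(chunks):
    """A temporary chunk (one not in the dictionary) must span the whole word.

    Inside a word there are no blanks, so a temporary chunk with a
    neighbour on either side is rejected, mirroring the two TEMP rules.
    """
    for i, chunk in enumerate(chunks):
        if chunk in DICTIONARY:
            continue
        if i > 0 or i < len(chunks) - 1:
            return True
    return False

def allowed_segmentations(word):
    """Return the segmentations that survive the blocking check."""
    return [s for s in segmentations(word) if not is_blocked(s)]
```

With this sketch, `allowed_segmentations("asdfg")` keeps only the single temporary entry `["asdfg"]`, because every split that contains [as] leaves a temporary chunk next to it.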