English Disambiguation Grammar
Revision as of 19:04, 1 August 2012
The English disambiguation grammars, or English d-grammars, are part of the English grammar. They are used to improve the results of tokenization and to control the application of t-rules. They follow the formalism described in the UNL Grammar Specs and are used both in natural language analysis (UNLization) and in natural language generation (NLization).
UNLization
In natural language analysis, the d-grammar is used to control the tokenization of English sentences, i.e., to prevent wrong lexical choices and to induce the best matches. The d-grammar comprises two types of disambiguation rules, or d-rules:
- Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices
- For instance, the rule (D)(BLK)(V)=0; states that the sequence determiner + blank space + verb is not allowed, i.e., a determiner cannot come before a verb.
- Positive rules, where the probability is greater than 0, force lexical choices
- For instance, the rule (['s],V)(BLK)(GER)=1; states that the entry ['s] as a verb is to be preferred before a gerund. There are three entries ['s] in the dictionary: the contracted form of "is", the particle used to form the genitive, and a plural suffix. If this rule were not stated, the system would simply select the first entry in the dictionary with the highest frequency.
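The difference between the two rule types can be illustrated with a minimal Python sketch. It is not the engine that actually applies d-rules: the `Token`, `DRule`, `matches_at`, `score` and `tok` names are hypothetical, the tags `PRO` and `GEN` are placeholders invented for the example, and summing the probabilities of matching positive rules is an assumption made only to show the idea that a probability of 0 rejects a candidate while a higher probability favours it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    features: frozenset  # tags such as "D", "V", "BLK", or a bracketed string like "['s]"


@dataclass(frozen=True)
class DRule:
    pattern: tuple      # one tuple of required features per adjacent node
    probability: float  # 0 blocks the sequence; a value above 0 prefers it


def matches_at(rule, tokens, start):
    """True if every node of the rule pattern is satisfied from position start."""
    window = tokens[start:start + len(rule.pattern)]
    if len(window) < len(rule.pattern):
        return False
    return all(set(required) <= token.features
               for required, token in zip(rule.pattern, window))


def score(tokens, rules):
    """Return None if a negative rule blocks the candidate; otherwise return
    the summed probability of the positive rules that match it (assumed scoring)."""
    total = 0.0
    for rule in rules:
        for start in range(len(tokens)):
            if matches_at(rule, tokens, start):
                if rule.probability == 0:
                    return None
                total += rule.probability
    return total


def tok(text, *features):
    """Hypothetical helper that builds a Token from a string and its tags."""
    return Token(text, frozenset(features))


RULES = [
    DRule((("D",), ("BLK",), ("V",)), 0),           # (D)(BLK)(V)=0;
    DRule((("['s]", "V"), ("BLK",), ("GER",)), 1),  # (['s],V)(BLK)(GER)=1;
]

# "that is": the determiner reading is blocked, the other reading survives.
that_as_determiner = [tok("that", "D"), tok(" ", "BLK"), tok("is", "V")]
that_as_pronoun = [tok("that", "PRO"), tok(" ", "BLK"), tok("is", "V")]

# "'s going": the verb reading of ['s] is preferred over the genitive reading.
s_as_verb = [tok("'s", "['s]", "V"), tok(" ", "BLK"), tok("going", "GER")]
s_as_genitive = [tok("'s", "['s]", "GEN"), tok(" ", "BLK"), tok("going", "GER")]

for candidate in (that_as_determiner, that_as_pronoun, s_as_verb, s_as_genitive):
    print([t.text for t in candidate], score(candidate, RULES))
```

In this sketch the determiner reading of "that" scores `None` (blocked), and the verb reading of ['s] scores higher than the genitive reading, so a selector that picks the highest-scoring surviving candidate would choose the readings the two rules are meant to enforce.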
File
The English d-grammar for the Corpus500 may be downloaded from eng_ana_dgrammar.txt. The complete English d-grammar may be exported from the UNLarium: UNLWEB>UNLARIUM>GRAMMAR>ENGLISH>EXPORT.
How to use d-grammars
D-grammars must be uploaded to, or entered directly in, the d-rules tab in IAN.
Examples of disambiguation rules
- Preventing hyper-segmentation of temporary entries
- "asdfg" must be tokenized as [asdfg] (a single temporary entry) rather than as [as][dfg], which would otherwise be the result, because [as] is in the dictionary
- (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence)
- (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence)
- (^DIGIT)({[st]|[nd]|[rd]|[th]})=0;
- The subwords [st], [nd], [rd] and [th] may only appear after a number ("1st" or "10th"). This prevents hyper-segmentation as in "asdfgst" = [asdfg][st], which would not be blocked by the rule above, because [st] is a suffix.
- Determiners vs. pronouns
- There are many ambiguities in the dictionary between determiners and pronouns. The string "that", for instance, is represented in the dictionary both as a demonstrative determiner ("that book") and as a demonstrative pronoun ("that is the book"). The following rules help differentiate them:
- (D,^AFT)({PUT|STAIL})=0;
- Determiners cannot come at the end of the sentence or before a punctuation sign, unless their distribution is AFT (after). In "He said that", "that" therefore cannot be a determiner, whereas "There are books enough" is allowed.
- (D)(BLK)(V)=0;
- Determiners cannot come before a verb, so in "That is the book" the string "that" cannot be a determiner
- Auxiliary verbs vs. main verbs
- Many auxiliary verbs may also play the role of main verbs: in "He has done that", "has" is an auxiliary, whereas in "He has a car", "has" is the main verb. The following rule helps differentiate them:
- (AUX,^COP)(BLK)(^V,^[not])=0;
- Auxiliary verbs that are not copulas must be followed by a verb or the adverb [not]
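The blocking rules listed above rely on negated conditions (^). The following Python sketch shows one way to read them; it is an illustration under stated assumptions, not the actual rule engine. The functions `node_matches` and `is_blocked` are hypothetical, SHEAD and STAIL are modelled as pseudo-nodes at the sentence boundaries, the disjunction {[st]|[nd]|[rd]|[th]} is expanded into four separate rules, and the feature sets given to each node are invented for the example. The sketch does not model how suffixes are treated by the real system, so the ordinal rule is checked on its own.

```python
def node_matches(conditions, features):
    """A node matches when every plain condition is present in its feature
    set and every ^-negated condition is absent from it."""
    for condition in conditions:
        if condition.startswith("^"):
            if condition[1:] in features:
                return False
        elif condition not in features:
            return False
    return True


def is_blocked(nodes, rules):
    """True if any blocking (=0) rule matches a run of adjacent nodes."""
    for rule in rules:
        width = len(rule)
        for start in range(len(nodes) - width + 1):
            window = nodes[start:start + width]
            if all(node_matches(c, f) for c, f in zip(rule, window)):
                return True
    return False


BLANK = '" "'  # the blank-space condition as written in the rules

# (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;  and  (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
BOUNDARY_RULES = [
    [("TEMP", "^" + BLANK, "^DIGIT", "^W"), ("^" + BLANK, "^STAIL")],
    [("^" + BLANK, "^PUT", "^SHEAD"), ("TEMP", "^" + BLANK, "^W")],
]

# (^DIGIT)({[st]|[nd]|[rd]|[th]})=0;  expanded into one rule per subword
ORDINAL_RULES = [[("^DIGIT",), (suffix,)]
                 for suffix in ("[st]", "[nd]", "[rd]", "[th]")]

# (AUX,^COP)(BLK)(^V,^[not])=0;
AUX_RULES = [[("AUX", "^COP"), ("BLK",), ("^V", "^[not]")]]

SHEAD, STAIL = {"SHEAD"}, {"STAIL"}

# "asdfg" as one temporary entry vs. split into [as][dfg]
asdfg_whole = [SHEAD, {"TEMP"}, STAIL]
asdfg_split = [SHEAD, {"[as]"}, {"TEMP"}, STAIL]

# "asdfgst" split as [asdfg][st] vs. "1st" split as [1][st]
asdfgst_split = [SHEAD, {"TEMP"}, {"[st]"}, STAIL]
first_split = [SHEAD, {"DIGIT"}, {"[st]"}, STAIL]

# "has a" vs. "has done", with "has" read as a non-copula auxiliary
has_a = [{"AUX"}, {"BLK"}, {"D"}]
has_done = [{"AUX"}, {"BLK"}, {"V"}]

print(is_blocked(asdfg_whole, BOUNDARY_RULES), is_blocked(asdfg_split, BOUNDARY_RULES))
print(is_blocked(asdfgst_split, ORDINAL_RULES), is_blocked(first_split, ORDINAL_RULES))
print(is_blocked(has_a, AUX_RULES), is_blocked(has_done, AUX_RULES))
```

Under these assumptions, the split [as][dfg] is blocked while the single temporary entry [asdfg] is not, [asdfg][st] is blocked while [1][st] is not, and the auxiliary reading of "has" is blocked before a determiner but allowed before a verb.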