English Disambiguation Grammar
From UNL Wiki
The English Disambiguation Grammar controls the tokenization of English sentences, i.e., it prevents wrong lexical choices and induces the best matches. It follows the formalism described at UNL Grammar Specs and comprises two types of rules:
- Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices
- For instance, the rule (D)(BLK)(V)=0; states that the sequence determiner + blank space + verb is not allowed, i.e., a determiner cannot be directly followed by a verb.
- Positive rules, where the probability is greater than 0, force lexical choices
- For instance, the rule (['s],V)(BLK)(GER)=1; states that the entry ['s] as a verb is to be preferred before a gerund. There are three entries ['s] in the dictionary: the contracted form of "is", the particle used to form the genitive, and a plural suffix. Without this rule, the system would select the first entry in the dictionary with the highest frequency.
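The effect of the two rule types on competing lexical choices can be illustrated with a minimal Python sketch. It is a simplified model, not the UNL engine: the names Token, Rule, matches_at and score are hypothetical, the headword ['s] is treated as one more feature, the tags N and GEN are invented for the example, and summing the probabilities of matching positive rules is an assumption made only to rank candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    features: frozenset  # e.g. {"D"}, {"BLK"}, {"V"} (toy feature sets)


@dataclass(frozen=True)
class Rule:
    pattern: tuple       # one frozenset of required features per node
    probability: float   # 0 blocks the sequence, > 0 favors it


def matches_at(rule, tokens, start):
    """True if every node of the rule matches the tokens beginning at start."""
    if start + len(rule.pattern) > len(tokens):
        return False
    return all(required <= tokens[start + k].features
               for k, required in enumerate(rule.pattern))


def score(tokens, rules):
    """Return None if a blocking rule matches, else the sum of matched probabilities.

    Summing is an assumption of this sketch, used only to compare candidates.
    """
    total = 0.0
    for rule in rules:
        for start in range(len(tokens)):
            if matches_at(rule, tokens, start):
                if rule.probability == 0:
                    return None  # negative rule: this candidate is rejected
                total += rule.probability
    return total


def tok(text, *features):
    return Token(text, frozenset(features))


# (D)(BLK)(V)=0;
BLOCK_D_V = Rule((frozenset({"D"}), frozenset({"BLK"}), frozenset({"V"})), 0.0)
# (['s],V)(BLK)(GER)=1;  (the headword is modeled as an extra feature)
PREFER_S_VERB = Rule((frozenset({"['s]", "V"}), frozenset({"BLK"}), frozenset({"GER"})), 1.0)
RULES = [BLOCK_D_V, PREFER_S_VERB]

# Two readings of "the walk": "walk" as a verb or as a noun (N is a toy tag).
the_walk_verb = [tok("the", "D"), tok(" ", "BLK"), tok("walk", "V")]
the_walk_noun = [tok("the", "D"), tok(" ", "BLK"), tok("walk", "N")]

# Two readings of "'s running": verb or genitive particle (GEN is a toy tag).
s_as_verb = [tok("'s", "['s]", "V"), tok(" ", "BLK"), tok("running", "GER")]
s_as_genitive = [tok("'s", "['s]", "GEN"), tok(" ", "BLK"), tok("running", "GER")]

print(score(the_walk_verb, RULES))   # None
print(score(the_walk_noun, RULES))   # 0.0
print(score(s_as_verb, RULES))       # 1.0
print(score(s_as_genitive, RULES))   # 0.0
```

In this model the verb reading of "walk" after a determiner is discarded outright, while the verb reading of ['s] before a gerund simply outranks the genitive reading.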
Negative rules
The most important negative rules govern temporary entries, i.e., words not found in the dictionary:
- Preventing the hyper-segmentation of temporary entries
- "asdfg" must be tokenized as [asdfg] (a single temporary entry) instead of [as][dfg], which would otherwise be the result because [as] is in the dictionary
- (^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
- This rule states that words must be isolated by blank spaces, punctuation signs, or SHEAD and STAIL (the beginning and the end of the sentence), except in the case of subwords, prefixes and suffixes.
- Preventing the generation of two temporary words in sequence
- "asdfg hijkl" must be represented as a single temporary word [asdfg hijkl] instead of [asdfg][ ][hijkl], which would otherwise be the result because [ ] is in the dictionary
- (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence)
- (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence)
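How the negated conditions in these rules behave can be sketched in Python. This is only an illustration under stated assumptions, not the UNL engine: parse_rule, node_matches and is_blocked are hypothetical helpers, a quoted item such as " " is taken to match the token text, every other item is taken to be a feature, ^ is taken to negate a condition, and the features given to the toy tokens (for example, [as] carrying none of the listed features) are invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    features: frozenset = frozenset()


def parse_rule(rule):
    """Parse a rule such as '(A,^B)(C)=0;' into (nodes, probability).

    Each node is a list of (negated, is_text, value) conditions.
    """
    body, probability = rule.strip().rstrip(";").rsplit("=", 1)
    nodes = []
    for group in re.findall(r"\(([^()]*)\)", body):
        conditions = []
        for item in group.split(","):
            negated = item.startswith("^")
            name = item[1:] if negated else item
            is_text = len(name) >= 2 and name[0] == '"' and name[-1] == '"'
            conditions.append((negated, is_text, name[1:-1] if is_text else name))
        nodes.append(conditions)
    return nodes, float(probability)


def node_matches(conditions, token):
    """A node matches when every condition holds (or fails, if negated)."""
    for negated, is_text, value in conditions:
        holds = (token.text == value) if is_text else (value in token.features)
        if holds == negated:
            return False
    return True


def is_blocked(tokens, rules):
    """True if any rule with probability 0 matches somewhere in the sequence."""
    for nodes, probability in rules:
        if probability != 0:
            continue
        for i in range(len(tokens) - len(nodes) + 1):
            if all(node_matches(n, tokens[i + k]) for k, n in enumerate(nodes)):
                return True
    return False


HYPER = parse_rule('(^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)'
                   '(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;')
TEMP_RIGHT = parse_rule('(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;')
TEMP_LEFT = parse_rule('(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;')
RULES = [HYPER, TEMP_RIGHT, TEMP_LEFT]

# Toy tokens; their feature sets are assumptions of this sketch.
SHEAD = Token("", frozenset({"SHEAD"}))
STAIL = Token("", frozenset({"STAIL"}))
BLANK = Token(" ")
AS = Token("as")


def temp(text):
    return Token(text, frozenset({"TEMP"}))


split_reading = [SHEAD, AS, temp("dfg"), STAIL]   # [as][dfg]
whole_reading = [SHEAD, temp("asdfg"), STAIL]     # [asdfg]

print(is_blocked(split_reading, RULES))  # True
print(is_blocked(whole_reading, RULES))  # False
```

Under these assumptions the first rule rejects [as][dfg] because two non-separator tokens are adjacent, whereas [asdfg] survives because it is bounded by SHEAD and STAIL. Likewise, the second and third rules reject a temporary word that directly touches a dictionary word and accept it when a blank space intervenes.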