English Disambiguation Grammar
From UNL Wiki
The English Disambiguation Grammar controls the tokenization of English sentences. It comprises two types of rules:
- Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices
- Positive rules, where the probability is more than 0, force lexical choices
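As a rough illustration of this distinction, the following Python sketch classifies a rule by the probability that follows its `=` sign. The function name `rule_kind` and the rule `(A)(B)=1;` are hypothetical and exist only for this example; the other rule is one of the negative rules listed on this page.

```python
import re

# Matches the trailing "=<probability>;" of a rule such as "(A)(B)=0;".
_PROBABILITY = re.compile(r"=\s*(\d+(?:\.\d+)?)\s*;\s*$")


def rule_kind(rule: str) -> str:
    """Classify a disambiguation rule as 'negative' or 'positive'."""
    match = _PROBABILITY.search(rule)
    if match is None:
        raise ValueError(f"no probability found in rule: {rule!r}")
    probability = float(match.group(1))
    # Probability 0 blocks a lexical choice; anything above 0 favours one.
    return "negative" if probability == 0 else "positive"


print(rule_kind('(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;'))  # negative
print(rule_kind("(A)(B)=1;"))  # positive (hypothetical rule)
```

The sketch only reads the probability. It does not interpret the conditions inside the parentheses.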
Negative rules
The most important negative rules prevent two problems involving temporary entries (words not found in the dictionary):
- Preventing the hyper-segmentation of temporary entries
- "asdfg" must be tokenized as [asdfg] (a single temporary entry) rather than [as][dfg], which would otherwise be produced because [as] is in the dictionary
- (^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
- This rule states that words must be delimited by blank spaces, punctuation signs, or the sentence boundaries (SHEAD and STAIL), except in the case of subwords, prefixes and suffixes.
- Preventing the generation of two temporary words in sequence
- "asdfg hijkl" will be represented as a single temporary word "asdfg hijkl" instead of two temporary words, "asdfg" and "hijkl", separated by a blank space
- (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence)
- (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence)
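The blocking effect of the three rules above can be sketched in Python. This is a minimal simulation under stated assumptions, not the actual UNL engine. It assumes that the comma-separated conditions within a pair of parentheses must all hold, that `^` negates a condition, and that a quoted condition matches the token text. The `Token` class, the helper names and the `LEX` feature given to dictionary words are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    text: str
    feats: frozenset = field(default_factory=frozenset)


def holds(cond: str, tok: Token) -> bool:
    """Check one condition, e.g. 'TEMP', '^PUT' or '^" "', against a token."""
    negated = cond.startswith("^")
    if negated:
        cond = cond[1:]
    if cond.startswith('"'):
        result = tok.text == cond.strip('"')  # quoted: compare the text
    else:
        result = cond in tok.feats            # unquoted: look up a feature
    return result != negated


# The three negative rules from this page as (left, right) condition lists.
RULES = [
    (['^W', '^" "', '^PUT', '^SBW', '^DIGIT', '^PFX', '^SHEAD'],
     ['^W', '^" "', '^PUT', '^SBW', '^DIGIT', '^SFX', '^STAIL']),
    (['TEMP', '^" "', '^DIGIT', '^W'], ['^" "', '^STAIL']),
    (['^" "', '^PUT', '^SHEAD'], ['TEMP', '^" "', '^W']),
]


def blocked(tokens: list) -> bool:
    """True if any adjacent token pair matches a rule with probability 0."""
    return any(
        all(holds(c, left_tok) for c in left)
        and all(holds(c, right_tok) for c in right)
        for left_tok, right_tok in zip(tokens, tokens[1:])
        for left, right in RULES
    )


SHEAD = Token("", frozenset({"SHEAD"}))
STAIL = Token("", frozenset({"STAIL"}))


def temp(text: str) -> Token:
    return Token(text, frozenset({"TEMP"}))


def word(text: str) -> Token:
    return Token(text, frozenset({"LEX"}))  # LEX is a hypothetical feature


print(blocked([SHEAD, word("as"), temp("dfg"), STAIL]))   # True
print(blocked([SHEAD, temp("asdfg"), STAIL]))             # False
print(blocked([SHEAD, temp("asd"), temp("fg"), STAIL]))   # True
```

The sketch only checks adjacent pairs against the blocking rules. It does not model how the engine chooses among the candidates that remain unblocked.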