English Disambiguation Grammar
From UNL Wiki
Revision as of 14:44, 27 July 2012
The English Disambiguation Grammar controls the tokenization of English sentences. It comprises two types of rules:
- Negative (blocking) rules, whose probability is equal to 0, prevent lexical choices
- Positive rules, whose probability is greater than 0, force lexical choices
Negative rules
The most important negative rules prevent the hyper-segmentation of temporary entries, i.e., words not found in the dictionary:
- Preventing the hyper-segmentation of temporary entries
- "asdfg" must be tokenized as [asdfg] (a single temporary entry) instead of [as][dfg], which would otherwise be produced because [as] is in the dictionary
- (^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
- This rule states that words must be isolated by blank spaces, punctuation signs, or the sentence boundaries (SHEAD and STAIL), except in the case of subwords, prefixes and suffixes.
- Preventing the generation of two temporary words in sequence
- "asdfg hijkl" must be represented as a single temporary word [asdfg hijkl] instead of [asdfg][ ][hijkl], which would otherwise be produced because [ ] is in the dictionary
- (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
- This rule states that a temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence (STAIL).
- (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
- This rule states that a temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign (PUT) or the beginning of the sentence (SHEAD).
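
The blocking behaviour of the negative rules above can be illustrated with a minimal Python sketch. It is not the actual UNL engine. It assumes, based on the explanations given with each rule, that each parenthesised group is a conjunction of conditions on one node, that `^` negates a condition, and that a quoted string such as `" "` matches the node's surface text. The names `Token`, `holds`, `is_blocked`, `temp`, `entry` and `sentence` are hypothetical, and so is the feature label `LEX`, used here only to mark a token found in the dictionary. The sketch checks adjacent token pairs only and does not model how the real tokenizer builds or merges candidate tokens.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    text: str
    features: frozenset = field(default_factory=frozenset)


def holds(condition, token):
    """Evaluate one condition such as 'TEMP', '^PUT' or '^" "' against a token."""
    negated = condition.startswith("^")
    atom = condition[1:] if negated else condition
    if len(atom) >= 2 and atom[0] == '"' and atom[-1] == '"':
        result = token.text == atom[1:-1]   # quoted atom: compare surface text
    else:
        result = atom in token.features     # bare atom: feature lookup
    return result != negated


# The three negative rules from the text, each as (left conditions, right conditions).
# All of them carry probability 0, so a match blocks the tokenization.
NEGATIVE_RULES = [
    (('^W', '^" "', '^PUT', '^SBW', '^DIGIT', '^PFX', '^SHEAD'),
     ('^W', '^" "', '^PUT', '^SBW', '^DIGIT', '^SFX', '^STAIL')),
    (('TEMP', '^" "', '^DIGIT', '^W'), ('^" "', '^STAIL')),
    (('^" "', '^PUT', '^SHEAD'), ('TEMP', '^" "', '^W')),
]


def is_blocked(tokens):
    """Return True if any adjacent pair of tokens matches a negative rule."""
    for left, right in zip(tokens, tokens[1:]):
        for left_conds, right_conds in NEGATIVE_RULES:
            if (all(holds(c, left) for c in left_conds)
                    and all(holds(c, right) for c in right_conds)):
                return True
    return False


# Hypothetical helpers for building candidate tokenizations.
SHEAD = Token("", frozenset({"SHEAD"}))
STAIL = Token("", frozenset({"STAIL"}))


def temp(text):
    return Token(text, frozenset({"TEMP"}))   # word not found in the dictionary


def entry(text):
    return Token(text, frozenset({"LEX"}))    # assumed label for a dictionary word


def sentence(*tokens):
    return [SHEAD, *tokens, STAIL]


print(is_blocked(sentence(entry("as"), temp("dfg"))))  # True: [as][dfg] is rejected
print(is_blocked(sentence(temp("asdfg"))))             # False: [asdfg] survives
print(is_blocked(sentence(temp("asd"), temp("fg"))))   # True: two adjacent temporary words
```

Under these assumptions, the hyper-segmented candidate [as][dfg] is rejected because neither token is a blank, a punctuation sign, a subword, a digit, an affix or a sentence boundary, while the single temporary entry [asdfg] passes because it sits directly between SHEAD and STAIL.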