N-rule
An N-rule, or normalization rule, is a special type of transformation rule used to prepare natural language input for automatic processing. N-rules constitute the pre-processing module, which applies to the input as a string and runs prior to tokenization. The set of N-rules forms the Normalization Grammar, or N-Grammar.
Syntax
Normalization Rules are a special type of L-rule and observe the same syntax, i.e.:
<CONDITION> := <ACTION>;
Where:
- <CONDITION> is a single form or a sequence of forms over which actions will take place; and
- <ACTION> is the action to be performed over each form or sequence of forms of the CONDITION.
CONDITION and ACTION may be expressed as:
- a character or string of characters, between quotes: ("a");
- a regular expression, between / /: (/a[bcd]e/)
Examples:
- ("Mr."):=("Mister"); (replace "Mr." by "Mister")
- ("doctor"):=("dr."); (replace "doctor" by "dr.")
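The difference between a quoted constant and a regular expression in the CONDITION can be illustrated with a small Python emulation. This is only a sketch of the behaviour described above, not the UNL engine itself: the function name `apply_n_rule` and the `is_regex` flag are hypothetical, and the assumption is that a constant matches literally while a regular expression is interpreted as a pattern.

```python
import re

def apply_n_rule(text, condition, action, is_regex=False):
    """Replace every match of `condition` in `text` with `action`.

    Hypothetical helper: emulates <CONDITION> := <ACTION>; for one node.
    """
    # A quoted constant is matched literally, so its metacharacters are escaped.
    pattern = condition if is_regex else re.escape(condition)
    # A callable replacement keeps the action literal (no backslash expansion).
    return re.sub(pattern, lambda _match: action, text)

# ("Mr."):=("Mister");  the dot is a literal dot, not "any character"
print(apply_n_rule("Mr. Smith met Mr. Jones", "Mr.", "Mister"))
# Mister Smith met Mister Jones

# (/a[bcd]e/) as a condition, with a hypothetical target "X"
print(apply_n_rule("abe ace ade afe", "a[bcd]e", "X", is_regex=True))
# X X X afe
```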
Difference between N-rules and L-rules
Unlike L-rules, N-rules cannot deal with dictionary features, because they run prior to tokenization, i.e., before any dictionary search. This means that N-rules may only deal with strings or regular expressions:
- L-rule
- ("I")(BLK)("am"):=("I'm"); (I am>I'm)
- ("a",PRE)(BLK)("a",ART):=("à",+ART,+CTC); (a a>à)
- ("de",PRE)(BLK)("le",ART):=("du",+ART,+CTC); (de le>du)
- N-rule
- ("I")(" ")("am"):=("I'm"); (replace "I am" by "I'm")
Note, in the above, that we may use dictionary features (such as BLK, PRE, ART) in L-rules, but we cannot use any dictionary feature in N-rules. The only features available in N-rules are the system-defined features, such as SHEAD (beginning of the sentence) and STAIL (end of the sentence).
Roles of Normalization Rules
Normalization rules have two roles:
- to normalize the input text (to replace abbreviations by their extended forms, to expand contractions, etc.) before tokenization
- to segment the natural language text into sentences (i.e., to create the tags <SHEAD> (beginning of a sentence), <STAIL> (end of a sentence), <CHEAD> (beginning of a scope) and <CTAIL> (end of a scope) inside the input text). These tags are used as sentence and clause boundaries, and define the units of processing of the Transformation and Disambiguation grammars.
Types of Normalization Rules
Normalization rules are string replacement rules. They are used to replace existing strings by new strings. They constitute the preprocessing module of natural language analysis, and apply prior to tokenization and to any dictionary search, when no attribute other than the string itself is available. The string to be replaced may be referred to by a constant (between "double quotes") or by a regular expression (between /forward slashes/).
ACTION | RULE | DESCRIPTION | EXAMPLE |
---|---|---|---|
REPLACE | ("source string"):=("target string"); | All the instances of the source string will be replaced by the target string | ("x"):=("y"); axbxcxd will become aybycyd |
APPEND (RIGHT) | ("source string",%x):=(%x)(%y,"target string"); | The target string will be appended to the right of all instances of the source string. | ("x",%x):=(%x)("y",%y); axbxcxd will become axybxycxyd |
APPEND (LEFT) | ("source string",%x):=(%y,"target string")(%x); | The target string will be appended to the left of all instances of the source string. | ("x",%x):=("y",%y)(%x); axbxcxd will become ayxbyxcyxd |
DELETE | ("source string"):=; | All the instances of the source string will be deleted. | ("x"):=; axbxcxd will become abcd |
- Indexes (%x, %y, etc.) are used in appending rules in order to define the direction (to the left or to the right).
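The four actions in the table can be emulated with plain string operations. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes that every instance of the source string is affected, as the descriptions in the table state.

```python
def replace(text, source, target):
    """REPLACE: ("source"):=("target");"""
    return text.replace(source, target)

def append_right(text, source, target):
    """APPEND (RIGHT): ("source",%x):=(%x)("target",%y);"""
    return text.replace(source, source + target)

def append_left(text, source, target):
    """APPEND (LEFT): ("source",%x):=("target",%y)(%x);"""
    return text.replace(source, target + source)

def delete(text, source):
    """DELETE: ("source"):=;"""
    return text.replace(source, "")

sample = "axbxcxd"
print(replace(sample, "x", "y"))       # aybycyd
print(append_right(sample, "x", "y"))  # axybxycxyd
print(append_left(sample, "x", "y"))   # ayxbyxcyxd
print(delete(sample, "x"))             # abcd
```

In this emulation the index %x corresponds to keeping the source string in the output, and its position relative to the new node decides whether the target lands on the right or on the left.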
Segmentation
In the UNL framework, natural language segmentation is done through the following tags:
- <SHEAD> indicates the beginning of a sentence
- <STAIL> indicates the end of a sentence
- <CHEAD> indicates the beginning of a scope (any portion of text smaller than a sentence)
- <CTAIL> indicates the end of a scope (any portion of text smaller than a sentence)
The tags <SHEAD> and <STAIL> define the sentence boundaries and are automatically assigned by the system according to line breaks and paragraph breaks. No punctuation sign is used as a sentence boundary by default. In order to break the input text into other portions, the corresponding N-rules must be provided. This is done by appending empty nodes with the features SHEAD, STAIL, CHEAD or CTAIL to the left or to the right of existing strings.
- Original text: <SHEAD>abcde<STAIL>
- Rule: ("c",%x):=(%x)(STAIL);
- Modified text: <SHEAD>abc<STAIL><SHEAD>de<STAIL>
- Observations
- The tag <SHEAD> is assigned automatically after <STAIL>
- The tag <STAIL> is assigned automatically before <SHEAD>
- The tag <CHEAD> is assigned automatically after <CTAIL>
- The tag <CTAIL> is assigned automatically before <CHEAD>
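The effect of the segmentation rule above, including the automatic pairing of <STAIL> with a following <SHEAD>, can be sketched in Python. This is a simplified emulation under stated assumptions: the function name `segment` is hypothetical, it works on the text between the original <SHEAD> and <STAIL> tags, and it assumes that no empty sentence is created when the source string is the last thing in the text.

```python
def segment(body, source):
    """Emulate ("source",%x):=(%x)(STAIL); on the text inside <SHEAD>...<STAIL>."""
    if not source:
        raise ValueError("source must be a non-empty string")
    pieces, start = [], 0
    while True:
        hit = body.find(source, start)
        if hit == -1:
            break
        end = hit + len(source)
        # Close the current sentence right after the matched string.
        pieces.append(body[start:end])
        start = end
    if start < len(body):
        # Whatever follows the last match forms the final sentence.
        pieces.append(body[start:])
    # Each piece gets its own boundary pair, so every <STAIL> that is not
    # final is immediately followed by a new <SHEAD>.
    return "".join(f"<SHEAD>{piece}<STAIL>" for piece in pieces)

print(segment("abcde", "c"))
# <SHEAD>abc<STAIL><SHEAD>de<STAIL>
```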
Examples of Normalization rules
- Segmentation
- ("/.*\./",%x):=(%x)(+STAIL,%y); (creates an STAIL node after any sequence of characters followed by "." (/.*\./))
- ("/\(/",%x):=(+CHEAD,%y)(%x); (creates a CHEAD node before an opening parenthesis (/\(/))
- Normalization
- ("an "):=("a "); ("an apple" > "a apple")
- ("don't"):=("do not"); ("I don't see" > "I do not see")