N-rule

From UNL Wiki

Latest revision as of 16:25, 16 July 2014

An N-rule, or normalization rule, is a special type of transformation rule used to prepare natural language input for automatic processing. N-rules constitute the pre-processing module, which applies to the input as a plain string and runs prior to tokenization. The set of N-rules forms the Normalization Grammar, or N-Grammar.


When to use N-rules

N-rules are used to normalize the input string prior to any other processing, i.e., before any dictionary search. They have two roles:

  • to normalize the input text (to replace abbreviations by their extended forms, to expand contractions, etc.)
  • to segment the natural language text into sentences (i.e., to create the tags <SHEAD> (beginning of a sentence), <STAIL> (end of a sentence), <CHEAD> (beginning of a scope) and <CTAIL> (end of a scope) inside the input text). These tags are used as sentence and clause boundaries, and define the units of processing of the Transformation and Disambiguation grammars.

When not to use N-rules

N-rules cannot be used when the rule depends on information extracted from the dictionary (such as part-of-speech, number, gender, etc.), because that information is not yet available when N-rules run.

Where to use N-rules

N-rules are used in IAN and SEAN, i.e., in UNLization systems. They must be provided in the N-rules tab.

Syntax

N-rules comply with the syntax below:

(<NODE>)(<NODE>)...(<NODE>) := (<NODE>)(<NODE>)...(<NODE>);

Where:

  • <NODE> is a string or a regular expression. Strings are always represented between "quotes"; regular expressions (for strings) between "/forward slashes inside quotes/".
  • the left side of the operator := states the condition
  • the right side of the operator := states the action to be performed over each string of the condition.

Types

N-rules are used to:

  • replace strings: "axb" > "ayb"
  • delete strings: "axb" > "ab"
  • create strings: "ab" > "axb"
  • reorder strings: "ab" > "ba"
  • assign sentence boundaries: "ab" > "a"<STAIL>"b"

Example

  • English N-Grammar (for IAN): http://www.unlweb.net/resources/english/eng_ngrammar_ian.txt

Replacement of strings
  • ("Mr."):=("Mister"); (replace "Mr." by "Mister")
  • ("Mr")("."):=("Mister"); (the same as above)
  • ("doctor"):=("dr."); (replace "doctor" by "dr.")
  • ("an "):=("a "); (replace "an " by "a ")
  • ("don't"):=("do not"); (replace "don't" by "do not")
  • ("don't"):=("do")(" ")("not"); (the same as above)
Deletion of strings
  • ("/[A-Z]/",%x)(".",%y):=(%x); (deletes the "." after capital letters)
Creation of strings
  • (SHEAD,%x)(^" ",%y):=(%x)(" ",%z)(%y); (add a blank space after the beginning of the sentence)
Reordering of strings
  • ("Am",%x)(" ",%y)("I",%z):=(%z)(%y)(%x); (reorder "Am I" as "I Am")
Segmentation (see below)
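  • (".",%x):=(%x)(+STAIL,%y); (creates an STAIL node after a "."). Note: this rule contains an infinite loop and is used here only to illustrate the creation of nodes. The correct rule would be (".",%x)(^STAIL,%z):=(%x)(+STAIL,%y)(%z);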
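
The replacement, deletion and reordering examples above can be imitated outside IAN or SEAN with ordinary string operations. The following is only a minimal Python sketch of the intended effect, not the N-rule engine itself; the function names are hypothetical, and the regular expressions are an assumed approximation of how the corresponding rules match.

```python
import re

# Literal replacements, mirroring ("Mr."):=("Mister"); and ("don't"):=("do not");
LITERAL_RULES = [("Mr.", "Mister"), ("don't", "do not")]


def apply_literal_rules(text, rules=LITERAL_RULES):
    """Replace every instance of each source string by its target string."""
    for source, target in rules:
        text = text.replace(source, target)
    return text


def delete_period_after_capital(text):
    """Approximates ("/[A-Z]/",%x)(".",%y):=(%x); only the capital letter is kept."""
    return re.sub(r"([A-Z])\.", r"\1", text)


def reorder_am_i(text):
    """Approximates ("Am",%x)(" ",%y)("I",%z):=(%z)(%y)(%x); the three nodes are reversed."""
    return re.sub(r"(Am)( )(I)", r"\3\2\1", text)


print(apply_literal_rules("Mr. Smith, I don't know"))
print(delete_period_after_capital("J. R. R. Tolkien"))
print(reorder_am_i("Am I late?"))
```

Each captured group plays the role of an indexed node (%x, %y, %z): a group that is not written back in the replacement is deleted, and groups written back in a different order are reordered.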

Segmentation

In the UNL framework, natural language segmentation is done through the following tags:

  • <SHEAD> indicates the beginning of a sentence
  • <STAIL> indicates the end of a sentence
  • <CHEAD> indicates the beginning of a scope (any portion of text smaller than a sentence)
  • <CTAIL> indicates the end of a scope (any portion of text smaller than a sentence)

The tags <SHEAD> and <STAIL> define the sentence boundaries and are automatically assigned by the system according to line breaks and paragraph breaks. No punctuation sign is used as a sentence boundary by default. In order to break the input text into other portions, the corresponding N-rules must be provided. This is done by appending empty nodes with the features SHEAD, STAIL, CHEAD or CTAIL to the left or to the right of existing strings.

  • Original text: <SHEAD>abcde<STAIL>
  • Rule: ("c",%x)(^STAIL,%y):=(%x)(STAIL)(%y);
  • Modified text: <SHEAD>abc<STAIL><SHEAD>de<STAIL>
Observations
  • The tag <SHEAD> is assigned automatically after <STAIL>
  • The tag <STAIL> is assigned automatically before <SHEAD>
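
The "abcde" example and the two observations can be traced with a short Python sketch. It is a simplification under stated assumptions: the tags are treated as plain text inside the string, the function name split_after is hypothetical, and the automatic insertion of <SHEAD> after a new <STAIL> is emulated by hand.

```python
import re

SHEAD, STAIL = "<SHEAD>", "<STAIL>"


def split_after(text, boundary):
    """Close the sentence after each `boundary` string that is not already
    followed by <STAIL>, imitating ("c",%x)(^STAIL,%y):=(%x)(STAIL)(%y);

    The matching <SHEAD> is added here by hand; in the real system it is
    assigned automatically. Assumes `boundary` does not occur inside a tag name.
    """
    pattern = re.escape(boundary) + "(?!" + re.escape(STAIL) + ")"
    return re.sub(pattern, lambda m: m.group(0) + STAIL + SHEAD, text)


original = SHEAD + "abcde" + STAIL
print(split_after(original, "c"))
```

The negative lookahead corresponds to the (^STAIL,%y) node: a boundary that is already followed by <STAIL> is left alone, so applying the function a second time changes nothing.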

Properties

  1. N-rules can only manipulate strings or regular expressions. Features (such as N, NOU, MCL, etc.) cannot be used in N-rules.
    ("Mr."):=("Mister"); (string manipulation)
    ("/[A-Z]/",%x)(".",%y):=(%x); (regular expression manipulation)
    Invalid: ("Mr.",ABB):=("Mister"); (this is not an N-rule, because it involves a non-string element, i.e., ABB)
  2. Regular expressions may only be used in the left side.
    ("/[A-Z]/",%x)(".",%y):=(%x);
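    Invalid: ("/[A-Z]/")("."):=("/[A-Z]/"); (the regular expression is repeated in the right side instead of being referred to by an index)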
  3. N-rules are recursive: a rule keeps applying for as long as its condition is true:
    The rule "(" "):=("-");" will transform "a b c d e" into "a-b-c-d-e" (and not only into "a-b c d e")
  4. The symbol ^ is used for negation and may be used to prevent infinite loops:
    The rule (".",%x):=(%x)(+STAIL,%y); contains a loop, and will lead to (".")(STAIL)(STAIL)(STAIL)(STAIL).... In order to prevent that, we have to indicate that STAIL must be added if it does not exist yet, i.e.: (".",%x)(^STAIL,%z):=(%x)(+STAIL,%y)(%z);
  5. In the right side, changes may be expressed by the right side of A-rules inside each form. The default is replacement.
    The rule "("a")(" ")("/[aeiou].+/"):=("an")( )( );" could also be expressed as "("a")(" ")("/[aeiou].+/"):=(0>"n")( )( );", i.e., the change from "a" to "an" could be expressed either by "an" or by 0>"n".
  6. Rules apply only if all conditions are true.
    The rule "("a")(" ")("/[aeiou].+/"):=("an")( )( );" will apply only in case of "a" before a blank and a node starting with "a", "e", "i", "o" or "u".
  7. Nodes may be deleted through replacement by zero:
    (" "):=; (deletes all the blank spaces)
  8. Nodes in the left side that are not coindexed to nodes in the right side are deleted (see Indexation)
    Wrong: (" ")("don't")(" "):=("do not"); provides "I don't know">"Ido notknow"
    Correct: (" ")("don't")(" "):=()("do not")(); provides "I don't know">"I do not know"
  9. N-rules manipulate any string meeting the conditions, including substrings of longer words
    Wrong: ("art"):=("article"); provides "art 20">"article 20", but also "My name is Bart">"My name is Barticle", "I love Sartre">"I love Sarticlere"
    Correct: ({SHEAD|" "})("art")({STAIL|" "}):=()("article")(); (i.e., replace "art" by "article" only if it stands between blank spaces or sentence boundaries)
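
Properties 3, 4 and 9 can be illustrated with a small Python sketch. It assumes, as a simplification, that a rule is a function rewriting one match per pass and that the engine repeats it until nothing changes; apply_until_stable and the other names are hypothetical, and the pass limit exists only so that the looping rule fails visibly instead of running forever.

```python
import re


def apply_until_stable(text, rewrite, max_passes=100):
    """Re-apply `rewrite` while it still changes the text (property 3)."""
    for _ in range(max_passes):
        rewritten = rewrite(text)
        if rewritten == text:
            return text
        text = rewritten
    raise RuntimeError("rule never stabilises: it keeps re-creating its own condition")


def one_blank_to_dash(text):
    # One application of (" "):=("-");
    return text.replace(" ", "-", 1)


def unguarded_stail(text):
    # (".",%x):=(%x)(+STAIL,%y); the "." still matches after every pass
    return text.replace(".", ".<STAIL>", 1)


def guarded_stail(text):
    # (".",%x)(^STAIL,%z):=(%x)(+STAIL,%y)(%z); skips a "." already followed by STAIL
    return re.sub(r"\.(?!<STAIL>)", ".<STAIL>", text, count=1)


def replace_art(text):
    # ({SHEAD|" "})("art")({STAIL|" "}):=()("article")(); the context is kept, only "art" changes
    return re.sub(r"(^| )art( |$)", r"\1article\2", text)


print(apply_until_stable("a b c d e", one_blank_to_dash))
print(apply_until_stable("a.", guarded_stail))
print(replace_art("art 20"), "|", replace_art("My name is Bart"))
```

Passing unguarded_stail to apply_until_stable raises the RuntimeError, which is the sketch's stand-in for the endless (".")(STAIL)(STAIL)... sequence described in property 4.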

Indexes

see Indexation

Common mistakes

  • Wrong: "Mr":="Mister";
    • Conditions and actions must always come between parentheses: ("Mr"):=("Mister");
  • Wrong: (Mr):=(Mister);
    • Strings must come between quotes (inside the parentheses): ("Mr"):=("Mister");
  • Wrong: ("Mr"):=("Mister")
    • Rules must end in a semicolon: ("Mr"):=("Mister");
  • Wrong: ("a")(" ")("/[aeiou].*/"):=("an");
    • "a adjective">"a": the blank and the following form are deleted because they are not present in the right side
  • Wrong: ("de")(" ")("/[aeiou].*/"):=("d'")("/[aeiou].*/");
    • "de avoir">"d' ": coindexation is based on ordering and not on features. The third form is deleted because it is not present in the right side; the second form, which is BLK, receives the feature VOW.
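
The first three mistakes are purely syntactic and can be caught before a grammar is loaded. The checker below is a hypothetical Python sketch, not part of IAN or SEAN. It assumes that features are written in upper case, as in the examples on this page, so any unquoted lower-case text other than an index such as %x is reported as a missing pair of quotes. It does not detect the last two mistakes, which are well-formed rules with an unintended result.

```python
import re

# One parenthesised node: quoted strings may contain parentheses, bare text may not.
NODE = r'\((?:"[^"]*"|[^()"])*\)'
NODES = re.compile(rf"(?:{NODE})*")


def check_rule(rule):
    """Return a list of problems found in one N-rule (empty if none are found)."""
    problems = []
    body = rule.strip()
    if body.endswith(";"):
        body = body[:-1]
    else:
        problems.append("rule must end in a semicolon")
    if ":=" not in body:
        problems.append("missing := operator")
        return problems
    left, right = (side.strip() for side in body.split(":=", 1))
    if not left or not NODES.fullmatch(left):
        problems.append("condition must be a sequence of parenthesised nodes")
    if not NODES.fullmatch(right):  # an empty action is allowed: (" "):=;
        problems.append("action must be a sequence of parenthesised nodes")
    for node in re.finditer(NODE, body):
        # Drop quoted strings and indexes; lower-case leftovers look like unquoted strings.
        bare = re.sub(r'"[^"]*"|%\w+', "", node.group(0))
        if re.search(r"[a-z]", bare):
            problems.append("strings must be written between quotes")
            break
    return problems


for candidate in ['"Mr":="Mister";', '(Mr):=(Mister);', '("Mr"):=("Mister")', '("Mr"):=("Mister");']:
    print(candidate, "->", check_rule(candidate) or "no problem found")
```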

N-rules and L-rules

N-rules and L-rules are basically the same. The only difference is that L-rules are part of the Transformation Grammar and, therefore, apply after tokenization, whereas N-rules constitute the N-Grammar and apply before tokenization. This means that N-rules may only deal with strings or regular expressions, whereas L-rules may also deal with other elements (such as features and UWs):

  • L-rule
    • ("I")(BLK)("am"):=("I'm"); (I am>I'm)
    • ("a",PRE)(BLK)("a",ART):=("à",+ART,+CTC); (a a>à)
    • ("de",PRE)(BLK)("le",ART):=("du",+ART,+CTC); (de le>du)
  • N-rule
    • ("I")(" ")("am"):=("I'm"); (replace "I am" by "I'm")

Note, in the above, that we may use dictionary features (such as BLK, PRE, ART) in L-rules, but we cannot use any dictionary feature in N-rules. The only features available in N-rules are the system-defined features, such as SHEAD (beginning of the sentence) and STAIL (end of the sentence).
