N-rule

From UNL Wiki
Latest revision as of 16:25, 16 July 2014

An N-rule (normalization rule) is a special type of transformation rule used to prepare natural language input for automatic processing. N-rules constitute the pre-processing module, which applies to the input as a string and runs before tokenization. The set of N-rules forms the Normalization Grammar, or N-Grammar.

When to use N-rules

N-rules normalize the input string before any other processing, i.e., before any dictionary search. They have two roles:

  • to normalize the input text (to replace abbreviations by their extended forms, to expand contractions, etc.)
  • to segment the natural language text into sentences, i.e., to create the tags <SHEAD> (beginning of a sentence), <STAIL> (end of a sentence), <CHEAD> (beginning of a scope) and <CTAIL> (end of a scope) inside the input text. These tags mark sentence and clause boundaries and define the units processed by the Transformation and Disambiguation grammars.

When not to use N-rules

N-rules cannot be used when a rule depends on information extracted from the dictionary (such as part of speech, number or gender), because they run before any dictionary search.

Where to use N-rules

N-rules are used in IAN and SEAN, i.e., in UNLization systems. They must be provided in the N-rules tab.

Syntax

N-rules comply with the syntax below:

(<NODE>)(<NODE>)...(<NODE>) := (<NODE>)(<NODE>)...(<NODE>);

Where:

  • <NODE> is a string or a regular expression. Strings are always represented between "quotes"; regular expressions (for strings) between "/forward slashes inside quotes/".
  • the left side of the operator := states the condition
  • the right side of the operator := states the action to be performed over each string of the condition.
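
The syntax above can be illustrated with a small Python sketch that splits a rule into its condition nodes and its action nodes. This is not part of IAN or SEAN; the function name parse_rule and the regular expression NODE are hypothetical, and the sketch assumes that parentheses only occur unquoted as node delimiters and that the first := in the rule is the operator.

```python
import re

# Hypothetical helper, not part of any UNL tool.
# A node is "(...)"; quoted strings inside it may contain parentheses.
NODE = re.compile(r'\(((?:"[^"]*"|[^()"])*)\)')

def parse_rule(rule):
    """Split an N-rule into (condition nodes, action nodes)."""
    rule = rule.strip()
    if not rule.endswith(";"):
        raise ValueError("rule must end in a semicolon")
    # Everything before := is the condition, everything after is the action.
    left, sep, right = rule[:-1].partition(":=")
    if not sep:
        raise ValueError("rule must contain the operator :=")
    return NODE.findall(left), NODE.findall(right)

if __name__ == "__main__":
    print(parse_rule('("Mr")("."):=("Mister");'))
```

Each returned item is the raw content of one node, so a regular expression node such as "/[A-Z]/" and an index such as %x are kept together exactly as written in the rule.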

Types

N-rules are used to:

  • replace strings: "axb" > "ayb"
  • delete strings: "axb" > "ab"
  • create strings: "ab" > "axb"
  • reorder strings: "ab" > "ba"
  • assign sentence boundaries: "ab" > "a"<STAIL>"b"

Example

  • English N-Grammar (for IAN): http://www.unlweb.net/resources/english/eng_ngrammar_ian.txt

Replacement of strings
  • ("Mr."):=("Mister"); (replace "Mr." by "Mister")
  • ("Mr")("."):=("Mister"); (the same as above)
  • ("doctor"):=("dr."); (replace "doctor" by "dr.")
  • ("an "):=("a "); (replace "an " by "a ")
  • ("don't"):=("do not"); (replace "don't" by "do not")
  • ("don't"):=("do")(" ")("not"); (the same as above)
Deletion of strings
  • ("/[A-Z]/",%x)(".",%y):=(%x); (deletes the "." after capital letters)
Creation of strings
  • (SHEAD,%x)(^" ",%y):=(%x)(" ",%z)(%y); (add a blank space after the beginning of the sentence)
Reordering of strings
  • ("Am",%x)(" ",%y)("I",%z):=(%z)(%y)(%x); (reorder "Am I" as "I Am")
Segmentation (see below)
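  • (".",%x):=(%x)(+STAIL,%y); (creates an STAIL node after a "."; this rule loops forever and is shown only to illustrate the creation of nodes. The correct rule would be (".",%x)(^STAIL,%z):=(%x)(+STAIL,%y)(%z);)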
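
For readers more familiar with general-purpose languages, the effect of three of the rules above can be approximated with Python string operations. This is only a rough analogy under the assumption that each rule is applied once over the whole input; the function names are hypothetical and do not reflect how IAN or SEAN apply N-rules internally.

```python
import re

def replace_mr(text):
    # Analogue of ("Mr."):=("Mister");
    return text.replace("Mr.", "Mister")

def delete_dot_after_capital(text):
    # Analogue of ("/[A-Z]/",%x)(".",%y):=(%x);
    # The capital letter (%x) is kept, the "." (%y) is dropped.
    return re.sub(r"([A-Z])\.", r"\1", text)

def reorder_am_i(text):
    # Analogue of ("Am",%x)(" ",%y)("I",%z):=(%z)(%y)(%x);
    # The three captured groups are emitted in reverse order.
    return re.sub(r"(Am)( )(I)", r"\3\2\1", text)

if __name__ == "__main__":
    print(replace_mr("Mr. Smith"))
    print(delete_dot_after_capital("U.S.A."))
    print(reorder_am_i("Am I late?"))
```

The captured groups play the role of the indexes %x, %y and %z: whatever is not copied to the replacement is deleted, and the order of the groups in the replacement determines the order of the output.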

Segmentation

In the UNL framework, natural language segmentation is done through the following tags:

  • <SHEAD> indicates the beginning of a sentence
  • <STAIL> indicates the end of a sentence
  • <CHEAD> indicates the beginning of a scope (any portion of text smaller than a sentence)
  • <CTAIL> indicates the end of a scope (any portion of text smaller than a sentence)

The tags <SHEAD> and <STAIL> define the sentence boundaries and are assigned automatically by the system according to line breaks and paragraph breaks. No punctuation sign is used as a sentence boundary by default. To break the input text into other portions, the corresponding N-rules must be provided. This is done by appending empty nodes with the features SHEAD, STAIL, CHEAD or CTAIL to the left or to the right of existing strings.

  • Original text: <SHEAD>abcde<STAIL>
  • Rule: ("c",%x)(^STAIL,%y):=(%x)(STAIL)(%y);
  • Modified text: <SHEAD>abc<STAIL><SHEAD>de<STAIL>
Observations
  • The tag <SHEAD> is assigned automatically after <STAIL>
  • The tag <STAIL> is assigned automatically before <SHEAD>
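
The behaviour described above can be sketched in Python as follows. The function segment_after is a hypothetical illustration, not the system's implementation: it assumes a single input line, closes a sentence after every occurrence of a given character, and wraps each resulting sentence in <SHEAD> and <STAIL>, which mimics the automatic pairing of the two tags.

```python
def segment_after(text, marker):
    """Close a sentence after each occurrence of marker and tag the result."""
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch == marker:
            # End of sentence: store it and start a new one.
            sentences.append(current)
            current = ""
    if current:
        # Remaining text after the last marker forms the final sentence.
        sentences.append(current)
    # Every sentence gets its own <SHEAD> ... <STAIL> pair.
    return "".join(f"<SHEAD>{s}<STAIL>" for s in sentences)

if __name__ == "__main__":
    print(segment_after("abcde", "c"))
```

When the marker is the last character of the input, no extra boundary is produced, which corresponds to the (^STAIL,%y) condition in the rule above.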

Properties

  1. N-rules can only manipulate strings or regular expressions. Features (such as N, NOU, MCL, etc.) cannot be used in N-rules.
    ("Mr."):=("Mister"); (string manipulation)
    ("/[A-Z]/",%x)(".",%y):=(%x); (regular expression manipulation)
    ("Mr.",ABB):=("Mister"); (wrong: this is not an N-rule, because it involves a non-string element, i.e., ABB)
  2. Regular expressions may only be used in the left side.
    ("/[A-Z]/",%x)(".",%y):=(%x);
    ("/[A-Z]/")("."):=("/[A-Z]/"); (wrong: the right side contains a regular expression)
  3. N-rules are recursive: rules will apply while conditions are true:
    The rule "(" "):=("-");" will transform "a b c d e" into "a-b-c-d-e" (and not only into "a-b c d e")
  4. The symbol ^ is used for negation and may be used to prevent infinite loops:
    The rule (".",%x):=(%x)(+STAIL,%y); contains a loop, and will lead to (".")(STAIL)(STAIL)(STAIL)(STAIL).... In order to prevent that, we have to indicate that STAIL must be added if it does not exist yet, i.e.: (".",%x)(^STAIL,%z):=(%x)(+STAIL,%y)(%z);
  5. In the right side, changes may be expressed by the right side of A-rules inside each form. The default is replacement.
    The rule "("a")(" ")("/[aeiou].+/"):=("an")( )( );" could also be expressed as "("a")(" ")("/[aeiou].*/"):=(0>"n")( )( );", i.e., the change from "a" to "an" could be expressed either by "an" or 0>"n".
  6. Rules apply only if all conditions are true.
    The rule "("a")(" ")("/[aeiou].+/"):=("an")( )( );" will apply only in case of "a" before a blank and a node starting with "a", "e", "i", "o" or "u".
  7. Nodes may be deleted through replacement by zero:
    (" "):=; (deletes all the blank spaces)
  8. Nodes in the left side that are not coindexed to nodes in the right side are deleted (see Indexation)
    (" ")("don't")(" "):=("do not"); (wrong) provides "I don't know">"Ido notknow", because the two blanks are not coindexed and are therefore deleted
    (" ")("don't")(" "):=()("do not")(); provides "I don't know">"I do not know"
  9. N-rules manipulate any strings meeting the conditions
    ("art"):=("article"); (wrong) provides "art 20">"article 20", but also "My name is Bart">"My name is Barticle", "I love Sartre">"I love Sarticlere"
    ({SHEAD|" "})("art")({STAIL|" "}):=()("article")(); (i.e., replace "art" by "article" only if it is between blank spaces or sentence boundaries)
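
Properties 3, 4 and 9 can be illustrated with a short Python sketch. It is an approximation only: apply_until_stable and expand_art are hypothetical names, regular-expression substitution stands in for the N-rule engine, a negative lookahead stands in for the ^ negation, and the anchors ^ and $ stand in for SHEAD and STAIL under the assumption that the input is a single sentence.

```python
import re

def apply_until_stable(text, pattern, repl, max_passes=1000):
    """Reapply one substitution until the text stops changing (property 3)."""
    for _ in range(max_passes):
        new_text = re.sub(pattern, repl, text, count=1)
        if new_text == text:
            return text
        text = new_text
    # Reached only if the substitution keeps matching its own output.
    raise RuntimeError("rule did not converge (possible infinite loop)")

def expand_art(text):
    # Analogue of ({SHEAD|" "})("art")({STAIL|" "}):=()("article")();
    # The surrounding boundaries are captured and copied back unchanged.
    return re.sub(r"(^| )art( |$)", r"\1article\2", text)

if __name__ == "__main__":
    # Property 3: every blank is replaced, not only the first one.
    print(apply_until_stable("a b c d e", " ", "-"))
    # Property 4: the lookahead plays the role of (^STAIL,%z).
    print(apply_until_stable("Hi. Bye.", r"\.(?!<STAIL>)", ".<STAIL>"))
    # Property 9: "art" is expanded only between boundaries.
    print(expand_art("art 20"), "|", expand_art("My name is Bart"))
```

Without the lookahead, the pattern for "." would match again after every replacement, and apply_until_stable would raise the RuntimeError once max_passes is exhausted. This is the same loop that the unguarded rule (".",%x):=(%x)(+STAIL,%y); produces.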

Indexes

see Indexation

Common mistakes

  • "Mr":="Mister";
    • Conditions and actions must always come between parentheses: ("Mr"):=("Mister");
  • (Mr):=(Mister);
    • Strings must come between quotes (inside the parentheses): ("Mr"):=("Mister");
  • ("Mr"):=("Mister")
    • Rules must end in semicolon: ("Mr"):=("Mister");
  • ("a")(" ")("/[aeiou].*/"):=("an");
    • "a adjective">"a": the blank and the following form are deleted because they are not present at the right side
  • ("de")(" ")("/[aeiou].*/"):=("d'")("/[aeiou].*/");
    • "de avoir">"d' ": coindexation is based on ordering and not on features. The third form is deleted because it is not present at the right side; the second form, which is BLK, receives the feature VOW.
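
The first three mistakes above are purely syntactic and could be caught before a grammar is loaded. The following Python sketch shows one way to do that; lint_rule, NODE and SYSTEM_TAGS are hypothetical names, and the check for unquoted strings is a simple heuristic that treats any node consisting only of letters as a missing-quotes error unless it is one of the four segmentation tags mentioned on this page.

```python
import re

# Hypothetical checker, not part of IAN or SEAN.
NODE = re.compile(r'\(((?:"[^"]*"|[^()"])*)\)')
SYSTEM_TAGS = {"SHEAD", "STAIL", "CHEAD", "CTAIL"}

def lint_rule(rule):
    """Return a list of syntactic problems found in one N-rule."""
    problems = []
    body = rule.strip()
    if body.endswith(";"):
        body = body[:-1]
    else:
        problems.append("rule must end in a semicolon")
    left, sep, right = body.partition(":=")
    if not sep:
        return problems + ["rule must contain the operator :="]
    sides = (left, right)
    # Anything left over after removing all (...) nodes is unparenthesized.
    if any(NODE.sub("", side).strip() for side in sides):
        problems.append("conditions and actions must come between parentheses")
    for side in sides:
        for content in NODE.findall(side):
            # Heuristic: a bare word that is not a system tag lacks quotes.
            if re.fullmatch(r"[A-Za-z]+", content) and content not in SYSTEM_TAGS:
                problems.append(f"string {content} must come between quotes")
    return problems

if __name__ == "__main__":
    for rule in ['"Mr":="Mister";', '(Mr):=(Mister);', '("Mr"):=("Mister")']:
        print(rule, lint_rule(rule))
```

The last two mistakes in the list are not flagged, since a right side with fewer nodes than the left side is also how intentional deletions are written, e.g. ("Mr")("."):=("Mister");.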

N-rules and L-rules

N-rules and L-rules are basically the same. The only difference is that L-rules are part of the Transformation Grammar and therefore apply after tokenization, whereas N-rules constitute the N-Grammar and apply before tokenization. This means that N-rules may only deal with strings or regular expressions, whereas L-rules may also deal with other elements (such as features and UWs):

  • L-rule
    • ("I")(BLK)("am"):=("I'm"); (I am>I'm)
    • ("a",PRE)(BLK)("a",ART):=("à",+ART,+CTC); (a a>à)
    • ("de",PRE)(BLK)("le",ART):=("du",+ART,+CTC); (de le>du)
  • N-rule
    • ("I")(" ")("am"):=("I'm"); (replace "I am" by "I'm")

Note, in the above, that we may use dictionary features (such as BLK, PRE, ART) in L-rules, but we cannot use any dictionary feature in N-rules. The only features available in N-rules are the system-defined features, such as SHEAD (beginning of the sentence) and STAIL (end of the sentence).

Software