N-rule

From UNL Wiki
Revision as of 16:25, 16 July 2014 by Martins (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

N-rule or normalization rule is a special type of transformation rule used to prepare the natural language input for automatic processing. They constitute the pre-processing module that applies over the input as a string and runs prior to the tokenization. The set of N-rules forms the Normalization Grammar, or N-Grammar.

Contents

When to use N-rules

N-rules are used to normalize the input string PRIOR to the processing, i.e., before any dictionary search. They have two roles:

  • to normalize the input text (to replace abbreviations by their extend forms, to extend contractions, etc.)
  • to segment the natural language text into sentences (i.e., to create the tags <SHEAD> (beginning of a sentence), <STAIL> (end of a sentence), <CHEAD> (beginning of a scope) and <CTAIL> (end of a scope) inside the input text). These tags are used as sentence and clause boundaries, and define the units of processing of the Transformation and Disambiguation grammars.

When not to use N-rules

N-rules cannot be used when we depend on information extracted from the dictionary (such as part-of-speech, number, gender, etc.)

Where to use N-rules

N-rules are used in IAN and SEAN, i.e., in UNLization systems. They must be provided at the N-rules tab.

Syntax

N-rules comply with the syntax below:

(<NODE>)(<NODE>)...(<NODE>) := (<NODE>)(<NODE>)...(<NODE>);

Where:

  • <NODE> is a string or a regular expression. Strings are always represented between "quotes"; regular expressions (for strings) between "/forward slashes inside quotes/".
  • the left side of the operator := states the condition
  • the right side of the operator := states the action to be performed over each string of the condition.

Types

N-rules are used to:

  • replace strings: "axb" > "ayb"
  • delete strings: "axb" > "ab"
  • create strings: "ab" > "axb"
  • reorder strings: "ab" > "ba"
  • assign sentence boundaries: "ab" > "a"<STAIL>"b"

Example

Replacement of strings
  • ("Mr."):=("Mister"); (replace "Mr." by "Mister")
  • ("Mr")("."):=("Mister"); (the same as above)
  • ("doctor"):=("dr."); (replace "doctor" by "dr.")
  • ("an "):=("a "); (replace "an " by "a ")
  • ("don't"):=("do not"); (replace "don't" by "do not")
  • ("don't"):=("do")(" ")("not"); (the same as above)
Deletion of strings
  • ("/[A-Z]/",%x)(".",%y):=(%x); (deletes the "." after capital letters)
Creation of strings
  • (SHEAD,%x)(^" ",%y):=(%x)(" ",%z)(%y); (add a blank space after the beginning of the sentence)
Reordering of strings
  • ("Am",%x)(" ",%y)("I",%z):=(%z)(%y)(%x); (reorder "Am I" as "I Am")
Segmentation (see below)
  • (".",%x):=(%x)(+STAIL,%y); (creates an STAIL node after a ".";[1])

Segmentation

In the UNL framework, natural language segmentation is done through the following tags:

  • <SHEAD> indicates the beginning of a sentence
  • <STAIL> indicates the end of a sentence
  • <CHEAD> indicates the beginning of a scope (any portion of text smaller than a sentence)
  • <CTAIL> indicates the end of a scope (any portion of text smaller than a sentence)

The tags <SHEAD> and <STAIL> defines the sentence boundaries and are automatically assigned by the system according to line breaks and paragraph breaks. No punctuation sign is used as a sentence boundary by default. In order to break the input text into other portions, the corresponding N-rules must be provided. This is done by appending empty nodes with the features SHEAD, STAIL, CHEAD or CTAIL to the left or to the right of existing strings.

  • Original text: <SHEAD>abcde<STAIL>
  • Rule: ("c",%x)(^STAIL,%y):=(%x)(STAIL)(%y);
  • Modified text: <SHEAD>abc<STAIL><SHEAD>de<STAIL>
Observations
  • The tag <SHEAD> is assigned automatically after <STAIL>
  • The tag <STAIL> is assigned automatically before <SHEAD>

Properties

  1. N-rules can only manipulate strings or regular expressions. Features (such as N, NOU, MCL, etc.) cannot be used in N-rules.
    ("Mr."):=("Mister"); (string manipulation)
    ("/[A-Z]/",%x)(".",%y):=(%x); (regular expression manipulation)
    ("Mr.",ABB):=("Mister"); (this is not a N-rule, because it involves a non-string element, i.e., ABB)
  2. Regular expressions may only be used in the left side.
    ("/[A-Z]/",%x)(".",%y):=(%x);
    ("/[A-Z]/")("."):=("/[A-Z]/");
  3. N-rules are recursive: rules will apply while conditions are true:
    The rule "(" "):=("-");" will transform "a b c d e" into "a-b-c-d-e" (and not only in "a-b c d e")
  4. The symbol ^ is used for negation and may be used to prevent infinite loops:
    The rule (".",%x):=(%x)(+STAIL,%y); contains a loop, and will lead to (".")(STAIL)(STAIL)(STAIL)(STAIL).... In order to prevent that, we have to indicate that STAIL must be added if it does not exist yet, i.e.: (".",%x)(^STAIL,%z):=(%x)(+STAIL,%y)(%z);
  5. In the right side, changes may be expressed by the right side of A-rules inside each form. The default is replacement.
    The rule "("a")(" ")("/[aeiou].+/"):=("an")( )( );" could also be expressed as "("a")(" ")("/[aeiou].*/"):=(0>"n")( )( );", i.e., the change from "a" to "an" could be expressed either by "an" or 0>"n".
  6. Rules apply only if all conditions are true.
    The rule "("a")(" ")("/[aeiou].+/"):=("an")( )( );" will apply only in case of "a" before a blank and a node starting with "a", "e", "i", "o" or "u".
  7. Nodes may be deleted through replacement by zero:
    (" "):=; (deletes all the blank spaces)
  8. Nodes in the left side that are not coindexed to nodes in the right side are deleted (see Indexation)
    (" ")("don't")(" "):=("do not"); provides "I don't know">"Ido notknow"
    (" ")("don't")(" "):=()("do not")(); provides "I don't know">I do not know"
  9. N-rules manipulate any strings meeting the conditions
    ("art"):=("article"); provides "art 20">"article 20", but also "My name is Bart">"My name is Barticle", "I love Sartre">"I love Sarticlere"
    ({SHEAD|" "})("art")({STAIL|" "}):=()("article")(); (i.e., replace "art" by "article" if inbetween blank spaces or sentence boundaries

Indexes

see Indexation

Common mistakes

  • "Mr":="Mister";
    • Conditions and actions must always come between parentheses: ("Mr"):=("Mister");
  • (Mr):=(Mister);
    • Strings must come between quotes (inside the parentheses): ("Mr"):=("Mister");
  • ("Mr"):=("Mister")
    • Rules must end in semicolon: ("Mr"):=("Mister");
  • ("a")(" ")("/[aeiou].*/"):=("an");
    • "a adjective">"a": the blank and the following form are deleted because they are not present at the right side
  • ("de")(" ")("/[aeiou].*/"):=("d'")("/[aeiou].*/");
    • "de avoir">"d' ": coindexation is based on ordering and not on features. The third form is deleted because it's not present at the right side; the second form, which is BLK, receives the feature VOW;

N-rules and L-rules

N-rules and L-rules are basically the same. The only difference is that L-rules are part of the Transformation Grammar and, therefore, applies after tokenization, whereas N-rules constitute the N-grammar, and apply before tokenization. This means that N-rules may only deal with strings or regular expressions, whereas L-rules may also deal with other elements (such as features and UW's):

  • L-rule
    • ("I")(BLK)("am"):=("I'm"); (I am>I'm)
    • ("a",PRE)(BLK)("a",ART):=("à",+ART,+CTC); (a a>à)
    • ("de",PRE)(BLK)("le",ART):=("du",+ART,+CTC); (de le>du)
  • N-rule
    • ("I")(" ")("am"):=("I'm"); (replace "I am" by "I'm")

Note, in the above, that we may use dictionary features (such as BLK, PRE, ART) in L-rules, but we cannot use any dictionary feature in N-rules. The only features available in N-rules are the system-defined features, such as SHEAD (beginning of the sentence) and STAIL (end of the sentence).

Software