L-rule

From UNL Wiki
Revision as of 12:24, 14 October 2010 by Martins (Talk | contribs)

L-rule (linear rule) is the formalism used for applying transformations over ordered sequences of isolated words in the UNLarium framework.

When to use L-rules

L-rules are mainly used for generating spelling changes (as in contraction, elision, assimilation, etc.). They are also used to enforce other spelling conventions, such as the use of capital letters and punctuation marks.

When not to use L-rules

L-rules are not to be used for sound changes that do not affect spelling.

Syntax

The general syntax for L-rules is the following:

(CONDITION) := (ACTION);

Where:

  • CONDITION is a single form or a sequence of forms over which actions will take place; and
  • ACTION is the action to be performed over each form or sequence of forms of the CONDITION.

CONDITION and ACTION may be expressed as:

  • a character or string of characters, between quotes: ("a");
  • a regular expression, between / /: (/a[bcd]e/);
  • a tag or list of tags, extracted from the UNDL Foundation tagset: (VOW);
  • a combination of characters and tags: ("a",PRE).

Examples:

  • ("Mr."):=("Mister"); (replace "Mr." by "Mister")
  • ("doctor"):=("dr."); (replace "doctor" by "dr.")

L-rules are normally sensitive to the context and apply over a set of conditions rather than over isolated word forms. In this case, each separate word form must be isolated between parentheses and described as a different condition.

  • ("I")(BLK)("am"):=("I'm"); (replace "I am" by "I'm")
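As a rough illustration of how a contextual rule such as the one above scans a sentence, the following Python sketch models the sentence as a list of word forms in which blanks are separate forms. The function name `contract` and the list representation are assumptions made for this example only; they are not part of the UNLarium.

```python
def contract(forms):
    """Sketch of ("I")(BLK)("am"):=("I'm"); over a list of word forms.

    Blank spaces are separate forms, as in the rule itself.
    """
    out = []
    i = 0
    while i < len(forms):
        # All three conditions must hold on consecutive forms.
        if forms[i:i + 3] == ["I", " ", "am"]:
            out.append("I'm")   # the first form is replaced
            i += 3              # the second and third forms are deleted
        else:
            out.append(forms[i])
            i += 1
    return out


sentence = ["I", " ", "am", " ", "here"]
print("".join(contract(sentence)))  # I'm here
```

A sequence that does not satisfy all three conditions, such as "I was", is left unchanged.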

Types of L-rules

There are basically three types of L-rules:

  • replacement, when the number of parentheses in the CONDITION field is equal to the number of parentheses in the ACTION field;
  • addition, when the number of parentheses in the CONDITION field is lower than the number of parentheses in the ACTION field;
  • deletion, when the number of parentheses in the CONDITION field is greater than the number of parentheses in the ACTION field.
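The three types can be told apart simply by counting parentheses on each side of ":=". The Python sketch below does exactly that; the function name `classify` is hypothetical, and the counting assumes that no parentheses occur inside quoted strings or regular expressions.

```python
import re


def count_forms(side):
    """Count "(...)" groups; assumes no parentheses inside the forms."""
    return len(re.findall(r"\([^()]*\)", side))


def classify(rule):
    """Classify an L-rule by comparing the number of parenthesised
    forms in the CONDITION and ACTION fields."""
    condition, action = rule.strip().rstrip(";").split(":=", 1)
    left, right = count_forms(condition), count_forms(action)
    if left == right:
        return "replacement"
    return "addition" if left < right else "deletion"


print(classify('("Mr."):=("Mister");'))                          # replacement
print(classify('("de",PRE)(BLK)("le",ART):=("du",+ART,+CTC);'))  # deletion
print(classify('("a")("b")("c"):=("d")("g")(%02)(%03);'))        # addition
```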

Parentheses are automatically co-indexed between the CONDITION and the ACTION fields, so that the first pair of parentheses of the CONDITION field corresponds to the first pair of parentheses of the ACTION field, and so on. This means that parentheses must be repeated on the right side of an L-rule if their corresponding forms are not to be deleted. To control adding, deleting and reordering, forms may be referred to by indexes (%01, %02, etc.):

Examples
RULE | BEFORE > AFTER | DESCRIPTION
("a")("b")("c"):=("d")("e")("f"); | abc > def | "a" will be replaced by "d"; "b" by "e"; and "c" by "f"
("a")("b")("c"):=("d")( )( ); | abc > dbc | "a" will be replaced by "d"; "b" and "c" will be preserved
("a")("b")("c"):=("d")("")(""); | abc > d | "a" will be replaced by "d"; "b" and "c" will be replaced by "" (i.e., blank)
("a")("b")("c"):=("d")( ); | abc > db | "a" will be replaced by "d"; "b" will be preserved; "c" will be deleted
("a")("b")("c"):=("d"); | abc > d | "a" will be replaced by "d"; "b" and "c" will be deleted
("a")("b")("c"):=(%03)(%02)(%01); | abc > cba | "a", "b" and "c" will be preserved, but reordered
("a")("b")("c"):=("d")(%03); | abc > dc | "a" will be replaced by "d"; "b" will be deleted; "c" will be preserved
("a")("b")("c"):=("d")("g")( )( ); | abc > dgc | "a" will be replaced by "d"; "b" will be replaced by "g"; "c" will be preserved; and a new form will be generated after it
("a")("b")("c"):=("d")("g")(%02)(%03); | abc > dgbc | "a" will be replaced by "d"; "g" will be generated after it, followed by "b" and "c", which will be preserved
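The co-indexing behaviour described above can be modelled in a few lines of Python. This is only a sketch under assumed conventions, not the UNLarium implementation: the ACTION field is encoded as a list with one entry per pair of parentheses, where `None` stands for an empty pair `( )`, a string stands for `("text")`, and an integer `n` stands for the index `%0n`. The function name `apply_action` is hypothetical.

```python
def apply_action(matched, action):
    """Build the output for the forms matched by the CONDITION field.

    matched: the matched forms, in order, e.g. ["a", "b", "c"]
    action:  one entry per pair of parentheses in the ACTION field
    """
    result = []
    for position, item in enumerate(action):
        if isinstance(item, int):    # (%NN): copy the form matched at that index
            result.append(matched[item - 1])
        elif item is None:           # ( ): keep the co-indexed form, if there is one
            result.append(matched[position] if position < len(matched) else "")
        else:                        # ("text"): replace with this string
            result.append(item)
    # Matched forms with no counterpart in the action are dropped.
    return "".join(result)


abc = ["a", "b", "c"]
print(apply_action(abc, ["d", None, None]))   # dbc
print(apply_action(abc, ["d"]))               # d
print(apply_action(abc, [3, 2, 1]))           # cba
print(apply_action(abc, ["d", 3]))            # dc
print(apply_action(abc, ["d", "g", 2, 3]))    # dgbc
```

Each call mirrors one rule applied to "abc": the position of an entry in the list plays the role of the automatic co-indexing, and integers override it.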

Examples

RULE | BEFORE > AFTER | DESCRIPTION
("a",ART)(BLK)(VOW):=("an")( )( ); | a adjective > an adjective | replace the article (ART) "a" by "an" before a blank space (BLK) and a vowel (VOW); preserve the second (BLK) and the third form (VOW) without any change
("a",PRE)(BLK)("a",ART):=("à",+ART,+CTC); | a a > à | replace the preposition (PRE) "a" in front of blank (BLK) and the article (ART) "a" by "à"; add the features ART (article) and CTC (contraction) to the first form; and delete the second (BLK) and the third form ("a",ART)
("de",PRE)(BLK)("le",ART):=("du",+ART,+CTC); | de le > du | replace the preposition (PRE) "de" in front of blank (BLK) and the article (ART) "le" by "du"; add the features ART and CTC to the first form; and delete the second (BLK) and the third form ("le",ART)
("a",VER)(BLK)("il",PPR):=( )("-t-",-BLK)( ); | a il > a-t-il | replace the blank space (BLK) between the verb (VER) "a" and the pronoun (PPR) "il" by "-t-"; remove the feature BLK from the second form; preserve the first and the third form without any change
("de",PRE)(BLK)(VOW):=("d'")(%03); | de avoir > d'avoir | replace the preposition (PRE) "de" before a blank space (BLK) and a vowel (VOW) by "d'"; delete the second form (BLK); and preserve the third form (%03) without any change

Observations

In L-rules, nodes may have the following arguments:
  • strings (between quotes or brackets): "de", "d'", [sense], etc.
  • features: PRE, BLK, VOW, etc.
  • indexes (preceded by "%"): %03, %head, etc.
  • regular expressions (between / /, only in the left side): /a[bcde]/, /a*/, etc.
  • A-rules (only in the right side): 0>"s", "a":"b"
Arguments may be combined (but strings, regular expressions and A-rules are mutually exclusive):
  • ("X",%x,X)("Y",%y,Y):=("Z",%x,-X,+A)(0>"s",%y,+B);
A node may contain one single string
  • ("a"):=("b"); (valid)
  • ("a","b"):=("c"); (not valid: the first node contains two strings)
Strings in the right side always replace strings in the left side
In the rule ("x"):=("y"); the string "x" is replaced by the string "y".
Strings are represented between "quotes" while lemmas are represented between [brackets].
The UNLarium distinguishes between strings (to be represented between "quotes") and lemmas (to be represented between [brackets]). The difference between strings and lemmas has to do with variance and dictionary status: if the constituent is expected to figure as an entry in the dictionary (e.g., "in", "the", "after", "love", "sense", etc.) or if it may vary (e.g., if it may be inflected, or further composed by specification, adjunction or complementation), it must be represented between brackets; if it is a full phrase whose internal structure is not relevant, because it is invariant, it must come between quotes:
  • ("into account"); (the string "into account" does not vary: take > take into account, but not take into more account)
  • ([sense]); (the term "sense" may be further specified: make > make sense, make any sense, make no sense, etc).
Features are added through "+" and deleted through "-"
  • (X):=(+Y); (= add the feature Y to the node containing the feature X)
  • (X):=(-X); (= delete the feature X from any node containing the feature X)
L-rules are recursive: rules will apply while conditions are true
The rule "(BLK):=("-");" will transform "a b c d e" into "a-b-c-d-e" (and not only into "a-b c d e").
The rule "(X):=(+Y);" will never stop (i.e., it contains an infinite loop): the feature Y will keep being added indefinitely (X,Y,Y,Y,Y,Y,Y,Y,...).
The symbol ^ is used for negation and may be used to prevent infinite loops
  • (X,^Y):=(+Y); (= add the feature Y to a node containing the feature X that does not contain the feature Y yet)
  • (^".")(STAIL):=(".")(%02); (Add a period before the end of the sentence if there is not a period yet)
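The effect of the negation guard can be sketched in Python. The model below is an assumption made for illustration: each node is a plain list of features, a rule is a pair of functions (condition, action), and the hypothetical helper `apply_until_stable` stops after `max_passes` passes so that the unguarded rule can be observed without actually looping forever.

```python
def apply_until_stable(nodes, condition, action, max_passes=10):
    """Re-apply a rule while some node satisfies the condition.

    Returns the number of passes made; reaching max_passes is taken
    here as a sign that the rule would never stop.
    """
    passes = 0
    while passes < max_passes:
        targets = [node for node in nodes if condition(node)]
        if not targets:
            break
        for node in targets:
            action(node)
        passes += 1
    return passes


def add_y(node):
    node.append("Y")   # (+Y)


# (X):=(+Y);  the condition is still true after the action
looping = [["X"]]
print(apply_until_stable(looping, lambda n: "X" in n, add_y))   # 10
print(len(looping[0]))                                          # 11

# (X,^Y):=(+Y);  the condition becomes false once Y is present
guarded = [["X"], ["Z"]]
print(apply_until_stable(guarded, lambda n: "X" in n and "Y" not in n, add_y))  # 1
print(guarded)                                                  # [['X', 'Y'], ['Z']]
```

The unguarded rule exhausts the pass limit and keeps accumulating Y, while the guarded rule applies once and then finds no node left to match.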
Rules are conservative. No feature is changed or deleted unless explicitly indicated through "-".
In the rule ("x",FEA):=("y"); the string "x" is replaced by the string "y", but the feature FEA is not altered (i.e., the final state will be ("y",FEA)).
The rule "("a",ART)(BLK)(VOW):=("an")( )( );" does not affect the status of the second and the third word forms, which continue to be BLK and VOW. On the other hand, the rule "("a",VER)(BLK)("il",PPR):=( )("-t-",-BLK)( );" alters the status of the second form by deleting the feature BLK.
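A small Python sketch may help to picture this conservative behaviour. It assumes a node is a pair (string, set of features) and uses a hypothetical helper `apply_node`; anything the action does not mention is carried over untouched.

```python
def apply_node(node, new_string=None, add=(), remove=()):
    """Return a copy of node = (string, features) after an action.

    Only what the action mentions is changed: a new string replaces
    the old one, "+" features are added and "-" features are removed.
    """
    string, features = node
    if new_string is not None:
        string = new_string
    features = (set(features) | set(add)) - set(remove)
    return string, features


# ("x",FEA):=("y");  FEA survives the replacement
print(apply_node(("x", {"FEA"}), new_string="y"))                     # ('y', {'FEA'})

# (BLK) rewritten by ("-t-",-BLK): the feature is removed explicitly
print(apply_node((" ", {"BLK"}), new_string="-t-", remove=["BLK"]))   # ('-t-', set())

# ("a",PRE) rewritten by ("à",+ART,+CTC): PRE is kept, ART and CTC are added
string, features = apply_node(("a", {"PRE"}), "à", add=["ART", "CTC"])
print(string, sorted(features))                                       # à ['ART', 'CTC', 'PRE']
```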
Indexes are used to control rules
  • (%a)(%b)(%c):=(%b); (delete the first and the third nodes, and keep the second)
  • (%a)(%b)(%c):=(%c)(%b)(%a); (reverse the order)
In the ACTION field, changes may be expressed by the right side of A-rules (i.e., by prefixation, infixation, suffixation or replacement) inside each form. The default is replacement.
The rule "("a",ART)(BLK)(VOW):=("an")( )( );" could also be expressed as "("a",ART)(BLK)(VOW):=(0>"n")( )( );", i.e., the change from "a" to "an" could be expressed either by "an" or 0>"n".
Rules apply only if all conditions are true.
The rule "("a")(BLK)(VOW):=("an")( )( );" will apply only in case of "a" before a blank and a vowel.
To make rules more powerful, conditions (but not actions) may be expressed as regular expressions between / /.
(/a[bcd]e/):=(""); (Delete the words "abe", "ace" and "ade")
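The same pattern can be tried out with Python's `re` module. The sketch below is not how the UNLarium evaluates regular expressions; in particular, it assumes that the expression must match the whole word form, and the function name `delete_matching` is hypothetical.

```python
import re


def delete_matching(forms, pattern):
    """Sketch of (/pattern/):=(""); over a list of word forms.

    Forms that match the pattern as a whole are replaced by "".
    """
    regex = re.compile(pattern)
    return ["" if regex.fullmatch(form) else form for form in forms]


print(delete_matching(["abe", "ace", "ade", "afe", "abed"], "a[bcd]e"))
# ['', '', '', 'afe', 'abed']
```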

Indexes

Nodes are always indexed in L-rules
Indexes (%) are used for indexing nodes, attributes and values between the left (condition) and the right side of rules.
  • (%a)(%b):=(%b)(%a); (change the order of the constituents)
If omitted, indexes are assigned by default, according to the position
  • (A)(B):=(C)(D); is the same as (A,%01)(B,%02):=(C,%01)(D,%02);
Indexes can be replaced by user-defined labels made of any sequence of alphabetic characters and underscore
(A,%a)(B,%b):=(C,%a)(D,%b);
Numeric characters cannot be used in user-defined indexes; numeric indexes are always assigned by position
(A,%03)(B,%05):=(C,%03)(D,%05);
In this rule, %01 = A and %02 = B (there is neither %03 nor %05)
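The indexing conventions above can be summarised in a short Python sketch. It is only an approximation under stated assumptions: nodes are given as lists of arguments, the helper name `index_table` is hypothetical, and keeping the positional index available alongside a user-defined label is an assumption that the text does not confirm.

```python
import re


def index_table(nodes):
    """Map each index label to a 1-based node position.

    nodes: one list of arguments per node, e.g. [["A", "%a"], ["B"]].
    """
    table = {}
    for position, args in enumerate(nodes, start=1):
        # Default index, assigned by position: %01, %02, ...
        table[f"%{position:02d}"] = position
        for arg in args:
            # User-defined labels: alphabetic characters and underscore only.
            if re.fullmatch(r"%[A-Za-z_]+", arg):
                table[arg] = position
    return table


print(index_table([["A", "%a"], ["B", "%b"]]))
# {'%01': 1, '%a': 1, '%02': 2, '%b': 2}
print(index_table([["A", "%03"], ["B", "%05"]]))
# {'%01': 1, '%02': 2}
```

In the second call the numeric labels are not accepted as user-defined indexes, so only the positional indexes %01 and %02 exist.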
Indexes may also be used to transfer attribute values expressed in the format ATTRIBUTE=VALUE
(A,%a,ATT1=VAL1)(B,%b):=()(B,ATT1=%a); (the value "VAL1" of "ATT1" of %a is copied to the node %b)

Common mistakes

  • "Mr":="Mister";
    • Conditions and actions must always come between parentheses: ("Mr"):=("Mister");
  • (Mr):=(Mister);
    • Constants must come between quotes (inside the parentheses): ("Mr"):=("Mister");
  • ("Mr"):=("Mister")
    • Rules must end in semicolon: ("Mr"):=("Mister");
  • ("I am"):=("I'm");
    • Each separate word form must be isolated between parentheses and described as a different condition: ("I")(BLK)("am"):=("I'm");
  • ("a",ART)(BLK)(VOW):=("an");
    • "a adjective">"an": the blank and the following form are deleted because they are not present on the right side
  • ("de",PRE)(BLK)(VOW):=("d'")(VOW);
    • "de avoir">"d' ": coindexation is based on ordering and not on features. The third form is deleted because it's not present at the right side; the second form, which is BLK, receives the feature VOW;
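The first three mistakes are purely a matter of shape and can be caught mechanically. The Python sketch below checks only that a rule consists of parenthesised forms, ":=", more parenthesised forms, and a final semicolon. The name `has_rule_shape` is hypothetical, parentheses inside quoted strings are not handled, and the check cannot detect the remaining mistakes (for instance, a constant written without quotes looks like a tag).

```python
import re

# One parenthesised form, e.g. ("Mr"), (BLK) or ( ).
FORM = r"\([^()]*\)"
RULE_SHAPE = re.compile(rf"(?:{FORM})+:=(?:{FORM})+;")


def has_rule_shape(rule):
    """True if the rule looks like (..)(..):=(..)(..); and nothing else."""
    return RULE_SHAPE.fullmatch(rule.strip()) is not None


print(has_rule_shape('"Mr":="Mister";'))       # False (no parentheses)
print(has_rule_shape('("Mr"):=("Mister")'))    # False (no final semicolon)
print(has_rule_shape('("Mr"):=("Mister");'))   # True
print(has_rule_shape('(Mr):=(Mister);'))       # True (shape only: Mr is read as a tag)
```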

Formal syntax

L-rules comply with the following formal syntax:

<L-RULE>          ::= ( "("<CONDITION>")" )+ ":=" ( "("<ACTION>")" )+ ";"
<CONDITION>        ::= """<STRING>""" ("," <TAGLIST> )* | "["<STRING>"]" ("," <TAGLIST> )* | <TAGLIST>
<ACTION>           ::= (<INDEX>)? ( <AFFIXATION> ("," <AFFIXATION>)* )* ( <ATT_CHANGE> ("," <ATT_CHANGE>)* )*
<AFFIXATION>       ::= <PREFIXATION> | <SUFFIXATION> | <INFIXATION> | <REPLACEMENT> (cf. A-rule)
<ATT_CHANGE>       ::= { "+" | "-" } <TAG> 
<TAGLIST>          ::= <INDEX> | (<INDEX> ",")? <TAG> ("," <TAG>)* 
<INDEX>            ::= "%"[01..99]
<TAG>              ::= {one of the tags defined in the UNDLF Tagset}
<STRING>           ::= [a-Z]+
<INTEGER>          ::= [0-9]+

where

<a> = a is a non-terminal symbol
"a" = a is a constant
a | b = a or b
{ a | b } = either a or b
(a)? = a can occur 0 or 1 time
(a)* = a can be repeated 0 or more times
(a)+ = a can be repeated 1 or more times

Software