English grammar

From UNL Wiki

Revision as of 23:15, 29 October 2012

The English grammars follow, in general, the X-bar approach, with some adaptations. They are used for transforming English sentences into UNL (UNLization) and for generating English sentences out of UNL graphs (NLization). They follow the syntax defined at the UNL Grammar Specs and the tags described at the Tagset.

Structure

The English grammars are unidirectional. There is a grammar for UNLization (the ENG->UNL Analysis Grammar) and another grammar for NLization (the UNL->ENG Generation Grammar). The former takes natural language sentences as inputs and provides the corresponding UNL graphs as outputs; the latter takes UNL graphs as inputs and provides the corresponding English sentences as outputs.

The English grammars are of two types: the transformation grammar, or simply t-grammar, which is used to manipulate data structures (i.e., to convert lists into trees, trees into networks, networks into trees, trees into lists); and the disambiguation grammar, or simply d-grammar, which is used to control the behavior of the t-grammar (by prohibiting or inducing some of its possibilities).

The English grammars are divided into two parts: the English Grammar itself, which contains rules that are specific to English, and the Default Grammar, which contains language-independent rules and may be used by any language. The English Grammar applies first (i.e., the rules of the English Grammar have higher priority); the Default Grammar applies when no rule from the English Grammar can be fired.
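The priority between the two parts can be pictured with a minimal Python sketch. It is not how EUGENE or IAN are implemented; the function names and the two sample rules are hypothetical and only illustrate "English Grammar first, Default Grammar as fallback".

```python
from typing import Callable, List, Optional

# A rule receives a token list and returns a new list if it fires, else None.
Rule = Callable[[List[str]], Optional[List[str]]]

def apply_once(tokens: List[str],
               english_rules: List[Rule],
               default_rules: List[Rule]) -> Optional[List[str]]:
    """Fire the first applicable rule, trying English-specific rules first."""
    for rule in list(english_rules) + list(default_rules):
        result = rule(tokens)
        if result is not None:
            return result
    return None  # no rule can be fired

def run_grammar(tokens: List[str],
                english_rules: List[Rule],
                default_rules: List[Rule]) -> List[str]:
    """Keep firing rules until none applies."""
    while True:
        result = apply_once(tokens, english_rules, default_rules)
        if result is None:
            return tokens
        tokens = result

def split_cannot(tokens: List[str]) -> Optional[List[str]]:
    # Hypothetical English-specific rule: "cannot" -> "can" "not".
    if "cannot" in tokens:
        i = tokens.index("cannot")
        return tokens[:i] + ["can", "not"] + tokens[i + 1:]
    return None

def drop_final_period(tokens: List[str]) -> Optional[List[str]]:
    # Hypothetical language-independent rule: remove a sentence-final period.
    if tokens and tokens[-1] == ".":
        return tokens[:-1]
    return None

result = run_grammar(["I", "cannot", "go", "."],
                     [split_cannot], [drop_final_period])
```

In this sketch the default rule only gets a chance once no English-specific rule matches the current state.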

Features

The grammars operate on a set of features that come from three different sources:

  • Dictionary features are the features ascribed to the entries in the dictionary, and appear as attribute-value pairs (LEX=N,GEN=MCL,NUM=SNG).
  • System-defined features are features automatically assigned by EUGENE and IAN during the processing. They are the following:
    • SHEAD = beginning of the sentence (system-defined feature assigned automatically by the machine)
    • CHEAD = beginning of a scope (system-defined feature assigned automatically by the machine)
    • STAIL = end of the sentence (system-defined feature assigned automatically by the machine)
    • CTAIL = end of a scope (system-defined feature assigned automatically by the machine)
    • TEMP = temporary entry (system-defined feature assigned to the strings that are not present in the dictionary)
    • SCOPE = scope entry (system-defined feature assigned to hyper-nodes)
    • DIGIT = digits (system-defined feature assigned to digits)
  • Grammar features are features created inside the grammar in any of its intermediate states between the input and the output.

The dictionary and system-defined features are described at the Tagset.
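As an illustration of the attribute-value notation, the following Python sketch parses a dictionary feature string such as LEX=N,GEN=MCL,NUM=SNG into a mapping, and lists the system-defined features named above. The function names are hypothetical, and the sketch assumes every dictionary feature is written as a NAME=VALUE pair.

```python
from typing import Dict

# System-defined features, as listed in this section.
SYSTEM_FEATURES = {"SHEAD", "CHEAD", "STAIL", "CTAIL", "TEMP", "SCOPE", "DIGIT"}

def parse_features(spec: str) -> Dict[str, str]:
    """Parse 'LEX=N,GEN=MCL' into {'LEX': 'N', 'GEN': 'MCL'}."""
    features: Dict[str, str] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        name, sep, value = pair.partition("=")
        if not sep:
            # Assumption of this sketch: only attribute-value pairs are accepted.
            raise ValueError("expected NAME=VALUE, got %r" % pair)
        features[name.strip()] = value.strip()
    return features

def is_system_feature(name: str) -> bool:
    """True for features assigned by the machine rather than the dictionary."""
    return name in SYSTEM_FEATURES

entry = parse_features("LEX=N,GEN=MCL,NUM=SNG")
```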

UNLization (ENG->UNL)

The UNLization process is performed in three different steps:

  1. Segmentation of English sentences is done automatically by the machine. It uses some punctuation signs (such as ".","?","!") and special characters (end of line, end of paragraph) as sentence boundaries. As the sentences are provided one per line, this step does not require any action from the grammar developer.
  2. Tokenization of each sentence is done against the dictionary entries, from left to right, following the principle of the longest first. As there are several lexical ambiguities, some disambiguation rules are required to induce the correct lexical choice. The tokenization is done with the English Disambiguation Grammar.
  3. Transformation applies after tokenization and is divided into two modules:
    1. English-specific transformation is performed by the ENG->UNL T-Grammar and is divided into two steps:
      1. Morphology, where English features (such as PLR, PAS and [not]) are mapped into attributes (@pl, @past and @not, respectively).
      2. Syntax, where structures that are specific to English (such as determiners, compounds and coordination) are mapped into UNL.
    2. General transformation is performed by the Default grammar and is divided into seven steps:
      1. Pre-processing (prepares the input for the processing)
      2. Normalization (standardizes the feature structure)
      3. Parsing (converts the input list structure into a tree structure)
      4. Transformation (converts the surface tree structure into the deep tree structure)
      5. Dearborization (converts the tree structure into a network structure)
      6. Interpretation (converts the syntactic network into a semantic network)
      7. Post-processing (adjusts the final output)
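The tokenization step described above (left to right, longest first) can be sketched in Python as follows. This is a simplified illustration, not the actual tokenizer: the dictionary is a hypothetical set of strings, disambiguation rules are ignored, and the "DICT" tag is an invention of the sketch. Strings not found in the dictionary are tagged TEMP, following the system-defined feature of that name.

```python
from typing import List, Set, Tuple

def tokenize(sentence: str, dictionary: Set[str]) -> List[Tuple[str, str]]:
    """Greedy left-to-right tokenization, preferring the longest entry."""
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate starting at position i first.
        match = next(
            (sentence[i:j] for j in range(len(sentence), i, -1)
             if sentence[i:j] in dictionary),
            None,
        )
        if match is not None:
            tokens.append((match, "DICT"))
            i += len(match)
        elif tokens and tokens[-1][1] == "TEMP":
            # Extend the current run of unknown characters.
            tokens[-1] = (tokens[-1][0] + sentence[i], "TEMP")
            i += 1
        else:
            tokens.append((sentence[i], "TEMP"))
            i += 1
    return tokens

# Hypothetical mini-dictionary; the blank space is an entry of its own.
sample_dictionary = {"New York", "New", "York", "is", " "}
sample_tokens = tokenize("New York is big", sample_dictionary)
```

Because the longest candidate is tried first, "New York" is chosen over "New" followed by "York", and the unknown string "big" ends up as a single temporary entry.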

Examples of ENG->UNL Transformation Rules

(N,PLR,^@pl,^@multal,^@paucal,^@all):=(+att=@pl); 
assigns the attribute @pl to plural nouns (books > book.@pl). To avoid redundancy, the rule applies only if the word does not already carry another plural attribute (such as @multal, @paucal or @all).
(MOV,%x)(V,%y):=(%y,+att=%x); 
copies the attributes from the modal verb (%x) to the main verb (%y) and deletes the modal verb (must.@obligation kill > kill.@obligation). Attributes of modal verbs are assigned in the dictionary.
(VB,%x)(FPR):=(%x,+att=@reflexive);
assigns the attribute @reflexive to the verb if it is followed by a reflexive pronoun, and deletes the reflexive pronoun (kill himself > kill.@reflexive).
(D,att,%x)(NB,%y)({^N|PUT|STAIL|CTAIL},%right):=(%y,+att=%x)(%right); 
copies the attributes of the determiner to the noun phrase (the.@def book > book.@def). Attributes of determiners are assigned in the dictionary. The rule applies only if the noun phrase is followed by something other than a noun, or by a punctuation sign, the end of the sentence or the end of a scope.
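
To make the effect of the first two rules concrete, here is a rough Python sketch of what they do to a node list. The Node class and the function names are hypothetical; the real rules are interpreted by the grammar engine, and this sketch only mirrors the behavior described in the text above.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Node:
    lemma: str
    features: Set[str] = field(default_factory=set)  # e.g. N, PLR, MOV, V
    attrs: Set[str] = field(default_factory=set)      # e.g. @pl, @obligation

PLURAL_ATTRS = {"@pl", "@multal", "@paucal", "@all"}

def mark_plural(node: Node) -> bool:
    """Mirror of (N,PLR,^@pl,^@multal,^@paucal,^@all):=(+att=@pl);"""
    if {"N", "PLR"} <= node.features and not (node.attrs & PLURAL_ATTRS):
        node.attrs.add("@pl")
        return True
    return False  # rule does not fire

def absorb_modal(nodes: List[Node]) -> List[Node]:
    """Mirror of (MOV,%x)(V,%y):=(%y,+att=%x); for the first matching pair."""
    for i in range(len(nodes) - 1):
        modal, verb = nodes[i], nodes[i + 1]
        if "MOV" in modal.features and "V" in verb.features:
            verb.attrs |= modal.attrs          # copy the modal's attributes
            return nodes[:i] + nodes[i + 1:]   # drop the modal node
    return nodes

books = Node("book", {"N", "PLR"})
mark_plural(books)  # books.attrs now contains "@pl"

must = Node("must", {"MOV"}, {"@obligation"})
kill = Node("kill", {"V"})
reduced = absorb_modal([must, kill])  # only the "kill" node remains
```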

NLization (UNL->ENG)

The NLization process is performed in three different steps:

  1. Segmentation of UNL sentences is done automatically by the machine. It uses the UNL document structure to split the input UNL document into a set of sentences to be processed one at a time.
  2. Tokenization of each sentence is done against the dictionary entries, following the principle of the highest priority first. As there are several lexical ambiguities, some disambiguation rules are required to induce the correct lexical choice. The tokenization is done with the English Disambiguation Grammar.
  3. Transformation applies after tokenization and is divided into two modules:
    1. English-specific transformation is performed by the UNL->ENG T-Grammar and is divided into three steps:
      1. Semantics, where relations and attributes of UNL are mapped into English structures.
      2. Morphology, where the paradigms are copied from the grammar to each entry.
      3. Post-processing, where the output list is adjusted to the English standards.
    2. General transformation is performed by the Default grammar and is divided into seven steps:
      1. Pre-processing (prepares the input for the processing)
      2. Normalization (standardizes the feature structure)
      3. Arborization (converts the syntactic network into a syntactic tree)
      4. Transformation (converts the deep syntactic structure into the surface syntactic structure)
      5. Linearization (converts the syntactic structure into a list structure)
      6. Morphological generation (inflects the words that need to be inflected)
      7. Post-processing (adjusts the final output)
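
The "highest priority first" principle of the tokenization step above can be illustrated with a small Python sketch. It rests on several assumptions that are not stated in this page: each dictionary entry is represented as a mapping with hypothetical keys "uw", "word" and "priority", and a larger priority value means the entry is preferred. Disambiguation rules are left out.

```python
from typing import Dict, List, Optional

Entry = Dict[str, object]

def choose_entry(uw: str, dictionary: List[Entry]) -> Optional[Entry]:
    """Pick the candidate entry with the highest priority for a given UW."""
    candidates = [e for e in dictionary if e["uw"] == uw]
    if not candidates:
        return None  # nothing in the dictionary matches this UW
    # max() keeps the first entry among equal priorities.
    return max(candidates, key=lambda e: e["priority"])

# Hypothetical entries; the UW labels and priority values are made up.
sample_entries: List[Entry] = [
    {"uw": "kill", "word": "slay", "priority": 10},
    {"uw": "kill", "word": "kill", "priority": 200},
    {"uw": "book", "word": "book", "priority": 100},
]
chosen = choose_entry("kill", sample_entries)
```

In the sketch, the entry "kill" wins over "slay" because of its higher priority value; in the actual grammar, disambiguation rules can still override this default choice.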
Software