English grammar
Revision as of 21:39, 28 July 2012
The English grammars presented here target the Corpus500 and are provided as a didactic sample that may help users build their own grammars. They are used for representing English sentences in UNL (UNLization) and for generating English sentences from UNL graphs (NLization). They follow the syntax defined in the UNL Grammar Specs and have been used with IAN and EUGENE.
Requisites
The grammars presented here depend heavily on the structure of the dictionary presented at English dictionary. You have to be acquainted with the formalism described in the UNL Dictionary Specs and the Tagset in order to fully understand how the grammars deal with the dictionary entry structure. You should also understand the process of tokenization performed by the machine.
Features
The grammars play with a set of features that come from three different sources:
- Dictionary features are the features ascribed to the entries in the dictionary, and appear either as simple attributes (LEX, GEN, NUM), simple values (N, MCL, SNG) or attribute-value pairs (LEX=N, GEN=MCL, NUM=SNG).
- System-defined features are features automatically assigned by EUGENE and IAN during the processing. They are the following:
- SHEAD = beginning of the sentence (system-defined feature assigned automatically by the machine)
- CHEAD = beginning of a scope (system-defined feature assigned automatically by the machine)
- STAIL = end of the sentence (system-defined feature assigned automatically by the machine)
- CTAIL = end of a scope (system-defined feature assigned automatically by the machine)
- TEMP = temporary entry (system-defined feature assigned to the strings that are not present in the dictionary)
- Grammar features are features created inside the grammar in any of its intermediate states between the input and the output.
All the features are described at the Tagset.
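The expansion of simple values into attribute-value pairs (e.g. reading "N" as LEX=N) can be sketched as follows. This is an illustrative data model only, not the actual IAN/EUGENE implementation; the attribute and value names come from the Tagset, but the `ATTRIBUTE_OF` table and the function name are assumptions.

```python
# Hypothetical model: a node's features as a Python dict.
# Simple values (N, MCL, SNG) are expanded to attribute-value pairs;
# bare attributes and system features (e.g. SHEAD) are kept as flags.
ATTRIBUTE_OF = {"N": "LEX", "V": "LEX", "MCL": "GEN", "SNG": "NUM"}

def normalize_features(features):
    """Expand simple values into attribute-value pairs."""
    pairs = {}
    for f in features:
        if "=" in f:                      # already an attribute-value pair
            attr, value = f.split("=", 1)
            pairs[attr] = value
        elif f in ATTRIBUTE_OF:           # simple value, e.g. "N" -> LEX=N
            pairs[ATTRIBUTE_OF[f]] = f
        else:                             # bare attribute or system feature
            pairs[f] = True
    return pairs

print(normalize_features(["N", "GEN=MCL", "SNG", "SHEAD"]))
# {'LEX': 'N', 'GEN': 'MCL', 'NUM': 'SNG', 'SHEAD': True}
```

This is also the kind of standardization that the normalization step performs when it records, for instance, that "N" is a value of the attribute "LEX".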
UNLization (ENG-UNL)
The UNLization process is performed in three different steps:
- Segmentation of English sentences is done automatically by the machine. It uses some punctuation marks (such as ".", "?", "!") and special characters (end of line, end of paragraph) as sentence boundaries. As the sentences of Corpus500 are provided one per line, this step does not require any action from the grammar developer.
- Tokenization of each sentence is done against the dictionary entries, from left to right, following the principle of the longest first. As there are several lexical ambiguities, even in as simple a corpus as Corpus500, some disambiguation rules are required to induce the correct lexical choice.
- Transformation applies after tokenization and is divided into five different steps:
- Normalization prepares the input for the transformation rules. In the normalization step, we delete blank spaces, replace some words by symbols (such as "point" by ".", when between numbers), process numbers and temporary words (such as proper nouns) and standardize the feature structure of the nodes (by informing, for instance, that words having the feature "SNGT" (singulare tantum) are also "SNG" (singular); that "N" is a value of the attribute "LEX"; etc).
- Parsing performs the syntactic analysis of the normalized input. The parsing follows some general procedures coming from the X-bar theory and results in a tree structure with binary branching with the following configuration:
        XP
       /  \
   spec    XB
          /  \
        XB    adjt
       /  \
      X    comp
      |
    head
- Where X is the category of any of the heads (N,V,J,A,P,D,I,C), XB is any of the intermediate projections (there can be as many intermediate projections as there are complements (comp) and adjuncts (adjt) in a phrase) and XP is the maximal projection, always linking the topmost intermediate projection to the specifier (spec).
- Dearborization rewrites the tree structure as a graph structure, replacing intermediate (XB) and maximal projections (XP) by head-driven binary syntactic relations: XS(head,spec), XC(head,comp) and XA(head,adjt), where X is the category of any of the heads (e.g.,VC means complement to the verb).
- Interpretation replaces syntactic binary relations by the UNL semantic binary relations (e.g., VC(head,comp) may be rewritten as obj(head,comp)).
- Rectification adjusts the output graph to the UNL Standards.
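The dearborization step above can be sketched as a small tree-to-relations rewrite. The tree encoding (tuples, with dependents as plain strings and a head with no complement represented by the head string itself) and the function names are assumptions made for illustration; the actual step is written as UNL transformation rules.

```python
# Hypothetical sketch of dearborization: intermediate (XB) and maximal
# (XP) projections are rewritten as head-driven binary relations
# XS(head, spec), XC(head, comp) and XA(head, adjt).
# Node encoding (an assumption): ("XP", spec, xb) for maximal
# projections, ("XB", inner, dep, role) for intermediate ones, where
# role is "C" (complement) or "A" (adjunct); a bare head is a string.

def head_of(node):
    """Find the lexical head by descending the projection line."""
    if isinstance(node, str):
        return node
    if node[0] == "XP":
        return head_of(node[2])      # the head sits inside XB
    return head_of(node[1])          # XB: the head is in the left child

def dearborize(node, cat, rels=None):
    """Rewrite a tree into XS/XC/XA relations for head category cat."""
    if rels is None:
        rels = []
    if isinstance(node, str):
        return rels
    if node[0] == "XP":
        _, spec, xb = node
        rels.append((cat + "S", head_of(xb), spec))
        dearborize(xb, cat, rels)
    else:
        _, inner, dep, role = node
        rels.append((cat + role, head_of(inner), dep))
        dearborize(inner, cat, rels)
    return rels

# "the red ball": NP with spec "the", adjunct "red", head "ball"
tree = ("XP", "the", ("XB", "ball", "red", "A"))
print(dearborize(tree, "N"))
# [('NS', 'ball', 'the'), ('NA', 'ball', 'red')]
```

The interpretation step would then map these syntactic relations to UNL semantic relations (e.g. NA(ball, red) to mod(ball, red)), but that mapping depends on the dictionary features of each node and is not modeled here.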
Tokenization
The tokenization is done with the English Disambiguation Grammar.
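The left-to-right, longest-first principle used in tokenization can be sketched as follows. The toy dictionary and the function name are assumptions; the actual matching is done by the machine against the English dictionary, and ambiguous matches are resolved by the English Disambiguation Grammar rather than by this greedy fallback.

```python
# A sketch of left-to-right, longest-first tokenization against a toy
# dictionary (an assumption: real entries carry feature structures,
# not just strings).
DICTIONARY = {"the", "new", "york", "new york", "is", "big"}

def tokenize(sentence, dictionary=DICTIONARY):
    """Scan left to right, always taking the longest dictionary match."""
    words = sentence.lower().split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first, shrinking until a match.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in dictionary:
                tokens.append(candidate)
                i = j
                break
        else:
            tokens.append(words[i])  # unknown string: becomes a TEMP entry
            i += 1
    return tokens

print(tokenize("New York is big"))
# ['new york', 'is', 'big']
```

Note how "new york" is preferred over "new" because it is the longer match starting at the same position; strings absent from the dictionary would be the ones the machine marks with the system-defined feature TEMP.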