English grammar
The English grammars presented here target the Corpus500 and are provided as a didactic sample that may help users to build their own grammars. They are used for representing English sentences in UNL (UNLization) and for generating English sentences from UNL graphs (NLization). They follow the syntax defined in the UNL Grammar Specs and have been used with IAN and EUGENE.
Requisites
The grammars presented here depend heavily on the structure of the dictionary presented at English dictionary. You have to be acquainted with the formalism described in the UNL Dictionary Specs and with the Tagset in order to fully understand how the grammars deal with the structure of the dictionary entries. You should also understand the tokenization process performed by the machine.
Features
The grammars play with a set of features that come from three different sources:
- Dictionary features are the features ascribed to the entries in the dictionary. They appear either as simple attributes (LEX, GEN, NUM), as simple values (N, MCL, SNG), or as attribute-value pairs (LEX=N, GEN=MCL, NUM=SNG).
- System-defined features are features automatically assigned by EUGENE and IAN during the processing. They are the following:
- SHEAD = beginning of the sentence
- CHEAD = beginning of a scope
- STAIL = end of the sentence
- CTAIL = end of a scope
- TEMP = temporary entry (assigned to strings that are not found in the dictionary)
- Grammar features are features created inside the grammar in any of its intermediate states between the input and the output.
All the features are described in the Tagset.
UNLization (ENG-UNL)
The UNLization process is performed in three different steps:
- Segmentation of English sentences is done automatically by the machine. It uses some punctuation marks (such as ".", "?", "!") and special characters (end of line, end of paragraph) as sentence boundaries. As the sentences of the Corpus500 are provided one per line, this step does not require any action from the grammar developer.
- Tokenization of each sentence is done against the dictionary entries, from left to right, following the principle of the longest match first. As there are several lexical ambiguities, even in a corpus as simple as the Corpus500, some disambiguation rules are required to induce the correct behavior of the system.
- Transformation applies after tokenization and is divided into five different steps:
- Normalization prepares the input for the transformation rules. In this step, blank spaces are deleted, some words are replaced by symbols (e.g., "point" becomes "." when between numbers), numbers and temporary words (such as proper nouns) are processed, and the feature structure of the nodes is standardized (stating, for instance, that words having the feature "SNGT" (singulare tantum) are also "SNG" (singular), that "N" is a value of the attribute "LEX", etc.).
- Parsing performs the syntactic analysis of the normalized input. The parsing follows some general procedures coming from the X-bar theory and results in a tree structure with binary branching with the following structure:
        XP
       /  \
   spec    XB
          /  \
        XB    adjt
       /  \
      X    comp
      |
    head
- Where X is the category of any of the heads (N,V,J,A,P,D,I,C), XB is any of the intermediate projections (there can be as many intermediate projections as complements (comp) and adjuncts (adjt) in a phrase) and XP is the maximal projection, always linking the topmost intermediate projection to the specifier (spec).
- Dearborization rewrites the tree structure as a graph structure, replacing intermediate (XB) and maximal projections (XP) by head-driven binary syntactic relations: XS(head,spec), XC(head,comp) and XA(head,adjt), where X is the category of any of the heads (e.g.,VC means complement to the verb).
- Interpretation replaces syntactic binary relations by the UNL semantic binary relations (e.g., VC(head,comp) may be rewritten as obj(head,comp)).
- Rectification adjusts the output graph to the UNL Standards.
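The longest-first tokenization described above can be sketched as follows. This is a simplified illustration, not IAN's actual implementation: the dictionary is modeled as a plain mapping from strings to feature sets, and consecutive unmatched characters are merged into a single temporary (TEMP) node:

```python
def tokenize(sentence, dictionary):
    """Greedy left-to-right tokenization, longest match first.
    Characters not starting any dictionary entry become TEMP material."""
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary entry starting at position i.
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in dictionary:
                tokens.append((sentence[i:j], dictionary[sentence[i:j]]))
                i = j
                break
        else:
            # No entry found: accumulate the character into a temporary node,
            # merging it with a preceding TEMP node if there is one.
            if tokens and "TEMP" in tokens[-1][1]:
                tokens[-1] = (tokens[-1][0] + sentence[i], {"TEMP"})
            else:
                tokens.append((sentence[i], {"TEMP"}))
            i += 1
    return tokens
```

With a dictionary containing [as] but not [asdfg], this sketch yields [as][dfg], which is precisely the hyper-segmentation that the blocking rules of the disambiguation grammar prevent.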
Tokenization
The tokenization is driven by the English Disambiguation Grammar (to be provided in the D-rules tab of IAN). It comprises two different types of rules:
- Negative (blocking) rules, in which the probability is equal to 0, prevent lexical choices
- Positive rules, in which the probability is greater than 0, force lexical choices
The most important negative rules are used to avoid hyper-segmentation of temporary entries:
- Preventing the hyper-segmentation of temporary entries
- "asdfg" must be tokenized as [asdfg] (one single temporary entry) instead of [as][dfg], which would otherwise be the case, because [as] is in the dictionary
- (^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
- This rule states that words must be isolated by blank spaces, punctuation marks or the sentence boundaries (SHEAD and STAIL), except in the case of subwords, prefixes and suffixes.
- Preventing the generation of two temporary words in sequence
- "asdfg hijkl" will be represented as a single temporary word "asdfg hijkl" instead of two temporary words "asdfg" and "hijkl" isolated by blank space
- (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence)
- (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence)
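The blocking rules above can be read as constraints over pairs of adjacent nodes: a rule with probability 0 discards any tokenization in which the left node matches the first feature bundle and the right node matches the second. A minimal sketch of this matching logic, with nodes modeled as Python sets of features (hypothetical function names, not the actual rule engine; the blank-space feature written `" "` in the rules is represented here by the literal one-character string):

```python
def matches(node, bundle):
    """A node (set of features) matches a bundle (list of features) if it
    carries every plain feature and lacks every negated (^) feature."""
    for f in bundle:
        if f.startswith("^"):
            if f[1:] in node:
                return False
        elif f not in node:
            return False
    return True

def blocks(rule, left, right):
    """A negative rule (=0) fires when two adjacent nodes match
    its two feature bundles, discarding that tokenization."""
    return matches(left, rule[0]) and matches(right, rule[1])

# (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
# i.e., a temporary word must be followed by a blank space or the sentence tail
rule = (["TEMP", "^ ", "^DIGIT", "^W"], ["^ ", "^STAIL"])
```

Under this reading, two temporary words in sequence are blocked (the left TEMP node is followed by something that is neither a blank space nor STAIL), while a TEMP node followed by a blank space or by the end of the sentence is allowed.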