Grammar

From UNL Wiki

Latest revision as of 09:39, 27 May 2014

In the UNL framework, a grammar is a set of rules used to generate UNL out of natural language, and natural language out of UNL. Along with dictionaries, grammars constitute the basic resource for UNLization and NLization.


Basic Symbols

Basic symbols used in the UNL framework

Symbol     Definition                                  Example
( )        node                                        (%a)
" "        string                                      "went"
[ ]        natural language entry (headword)           [go]
[[ ]]      UW                                          [[to go(icl>to move)]]
/ /        regular expression                          /a{2,3}/ = aa,aaa
rel(x;y)   relation                                    agt(kill;Peter)
^          not                                         ^a = not a
{ | }      or                                          {a|b} = a or b
%          index for nodes, attributes and values      %x
:          scope ID                                    :01
#          index for sub-NLWs                          #01
=          attribute-value assignment                  POS=NOU
!          rule trigger                                !PLR
&          merge operator                              %x&%y
?          dictionary lookup operator                  ?[a]

Basic Concepts

Node
A node is the most elementary unit in the graph. It is the result of the tokenization process, and corresponds to the notion of "lexical item". At the surface level, a natural language sentence is considered a list of nodes, and a UNL graph a set of relations between nodes.
Relation
In order to form a natural language sentence or a UNL graph, nodes are inter-related by relations. In the UNL framework, there are three different types of relations: the linear (list) relation, syntactic relations and semantic relations.
Hyper-Node
A hyper-node is a sub-graph, i.e., a scope: a node containing relations between nodes.
Hyper-Relation
A hyper-relation is a relation between relations.
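
As a rough illustration of these units, the sketch below models a relation as a labelled pair of nodes and prints it in the rel(x;y) notation from the symbol table. The Relation type and the format_relation helper are hypothetical names chosen for this example; they are not part of any UNL tool.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A labelled link between two nodes (hypothetical helper type)."""
    label: str
    source: str
    target: str


def format_relation(rel: Relation) -> str:
    """Render a relation in the rel(x;y) notation, e.g. agt(kill;Peter)."""
    return f"{rel.label}({rel.source};{rel.target})"


# A graph is modelled here simply as a set of relations between nodes.
graph = {Relation("agt", "kill", "Peter")}

for rel in graph:
    print(format_relation(rel))
```

Hyper-nodes and hyper-relations are left out of this sketch; representing them would require nodes that can themselves contain relations.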

Rules

Grammars are sets of rules used to go from UNL into natural language, or from natural language into UNL. In the UNL framework, there are two types of rules:

  • T-rules, or transformation rules, are used to perform changes to nodes or relations
  • D-rules, or disambiguation rules, are used to control changes over nodes or relations

T-rules

Main article: T-rule

T-rules are used to perform actions and follow the general formalism:

α:=β;

where the left side α is a condition statement, and the right side β is an action to be performed over α.

There are several special types of T-rules:

  • A-rule is a specific type of T-rule used for affixation (prefixation, infixation, suffixation)
  • C-rule is a specific type of T-rule used for composition (word formation in case of compounds and multiword expressions)
  • L-rule is a specific type of T-rule used for handling word order
  • N-rule is a specific type of T-rule used for segmenting sentences and normalizing the input text
  • S-rule is a specific type of T-rule used for handling syntactic structures

Examples of T-rules

  • PLR:=0>"s"; (A-rule: add "s" in case of plural, as in book>books)
  • MTW:=+VA("into account",PP); (C-rule: add the prepositional phrase "into account" as an adjunct to the verbal phrase (VA) in order to form the multiword expression, as in take>take into account)
  • (ART,%x)(QUA,%y):=(%y)(%x); (L-rule: reverse the order ART+QUA to QUA+ART, as in the all>all the)
  • ("don't"):=("do not"); (N-rule: replace the contraction "don't" by "do not")
  • (V,%x)(N,%y):=VC(%x;%y); (S-rule: replace the linear relation between a verb and a noun by the syntactic relation VC between them)
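
To make the condition/action reading of these examples concrete, here is a minimal Python sketch that imitates three of them (the A-rule, the N-rule and the L-rule) on a toy node list. It is not the UNL rule engine: the Node class, the function names and the representation of features as plain string sets are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy node: a surface string plus a set of feature labels."""
    text: str
    features: set = field(default_factory=set)


def a_rule_plural(node):
    """Toy analogue of PLR:=0>"s"; appends "s" if the node carries PLR."""
    if "PLR" in node.features:
        return Node(node.text + "s", set(node.features))
    return node


def n_rule_expand(nodes):
    """Toy analogue of ("don't"):=("do not"); rewrites matching strings."""
    return [Node("do not", set(n.features)) if n.text == "don't" else n
            for n in nodes]


def l_rule_swap(nodes):
    """Toy analogue of (ART,%x)(QUA,%y):=(%y)(%x); swaps adjacent ART+QUA."""
    out = list(nodes)
    i = 0
    while i < len(out) - 1:
        if "ART" in out[i].features and "QUA" in out[i + 1].features:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just reordered
        else:
            i += 1
    return out


print(a_rule_plural(Node("book", {"PLR"})).text)
print([n.text for n in l_rule_swap([Node("the", {"ART"}), Node("all", {"QUA"})])])
print([n.text for n in n_rule_expand([Node("I"), Node("don't"), Node("know")])])
```

In each function the test on features or strings plays the role of the left side α, and the returned structure plays the role of the right side β.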

D-rules

Main article: D-rule

D-rules are used to control the action of T-rules. They are used to control the dictionary retrieval (in tokenization) and to prevent or to induce the application of rules in transformation.

D-rules follow the syntax:

α=P;

where the left side α is a statement and the right side P is an integer from 0 to 255 that indicates the probability of occurrence of α.

Examples of D-rules

  • (ART)(VER)=0; (there cannot be any article before a verb)
  • agt(^V,^J;)=0; (the source node of an agent relation must be either a verb or an adjective)
  • (D)(N)=1; (determiners may come before nouns)
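
The blocking effect of a D-rule with value 0 can be sketched in Python as follows. This is only an illustration under stated assumptions: nodes are reduced to sets of feature names, a pattern is a tuple of features matched against adjacent nodes, and any value other than 0 is simply ignored. How the actual engine uses the values from 1 to 255 is not modelled, and the names matches, is_blocked and D_RULES are hypothetical.

```python
def matches(pattern, nodes):
    """Return True if `pattern` matches some run of adjacent nodes.

    `pattern` is a tuple of feature names; `nodes` is a list of feature sets.
    """
    width = len(pattern)
    for start in range(len(nodes) - width + 1):
        window = nodes[start:start + width]
        if all(feature in node for feature, node in zip(pattern, window)):
            return True
    return False


def is_blocked(nodes, d_rules):
    """Return True if any pattern with value 0 matches the node sequence."""
    return any(value == 0 and matches(pattern, nodes)
               for pattern, value in d_rules)


# Toy encodings of (ART)(VER)=0; and (D)(N)=1; from the examples above.
D_RULES = [(("ART", "VER"), 0), (("D", "N"), 1)]

print(is_blocked([{"ART"}, {"VER"}], D_RULES))  # article before verb
print(is_blocked([{"D"}, {"N"}], D_RULES))      # determiner before noun
```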

Modules

In the UNL framework there are three types of grammar:

  • N-Grammar, or Normalization Grammar, is a set of T-rules used to segment the natural language text into sentences and to prepare the input for processing.
  • T-Grammar, or Transformation Grammar, is a set of T-rules used to transform natural language into UNL or UNL into natural language.
  • D-Grammar, or Disambiguation Grammar, is a set of D-rules used to improve the performance of transformation rules by constraining or forcing their applicability.

Direction

In the UNL framework, grammars are not bidirectional, although they share the same syntax:

  • UNLization (NL>UNL)
    • The N-Grammar contains the normalization rules for natural language analysis
    • The Analysis T-Grammar contains the transformation rules used for natural language analysis
    • The Analysis D-Grammar contains the disambiguation rules used for tokenization and for improving the results of the NL-UNL T-Grammar
  • NLization (UNL>NL)
    • The Generation T-Grammar contains the transformation rules used for natural language generation
    • The Generation D-Grammar contains the disambiguation rules used for improving the results of the UNL-NL T-Grammar

Processing Units

In the UNL framework, grammars may target different processing units:

  • Text-driven grammars process the source document as a single unit (i.e., without any internal subdivision)
  • Sentence-driven grammars process each sentence or graph separately
  • Word-driven grammars process words in isolation

Text-driven grammars are normally used in summarization and simplification, when the rhetorical structure of the source document is important. Sentence-driven grammars are used mostly in translation, when the source document can be treated as a list of non-semantically related units, to be processed one at a time. Word-driven grammars are used in information retrieval and opinion mining, when each word or node can be treated in isolation.

Recall

Grammars may target the whole source document or only parts of it (e.g. main clauses):

  • Chunk grammars target only a part of the source document
  • Full grammars target the whole source document

Precision

Grammars may target the deep or the surface structure of the source document:

  • Deep grammars focus on the deep dependency relations of the source document and normally have three levels (network, tree and list)
  • Shallow grammars focus only on the surface dependency relations of the source document and normally have only two levels (network and list)

Assessment

Main article: F-measure

Grammars are evaluated through a weighted average of precision and recall, the F-measure.
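
For reference, the F-measure as commonly defined is the weighted harmonic mean of precision and recall. The sketch below implements that standard formula; the function name and the beta parameter (beta = 1 weights precision and recall equally) are assumptions, since this page does not state which weighting is used for UNL grammars.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (standard F-beta)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


print(f_measure(0.5, 0.5))
print(f_measure(1.0, 0.5))
```

Values of beta above 1 give more weight to recall, and values below 1 give more weight to precision.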

Software