Grammar

From UNL Wiki
 
In the UNL framework, a '''grammar''' is a set of rules used to generate UNL out of natural language, and natural language out of UNL. Together with the [[Dictionary|dictionaries]], grammars constitute the basic resources for [[UNLization]] and [[NLization]].
 
== Modules ==
In the UNL framework there are three types of grammar:
*N-Grammar, or Normalization Grammar, is a set of [[N-rule]]s used to segment the natural language text into sentences and to prepare the input for processing.
*T-Grammar, or Transformation Grammar, is a set of T-rules used to transform natural language into UNL or UNL into natural language.
*D-Grammar, or Disambiguation Grammar, is a set of D-rules used to improve the performance of transformation rules by constraining or forcing their applicability.
 
== Direction ==
 
In the UNL framework, grammars are not bidirectional: analysis ([[UNLization]]) and generation ([[NLization]]) use separate grammars, although all of them share the same syntax:
*[[UNLization]]
**The '''N-Grammar''' contains the normalization rules for natural language analysis
**The '''Analysis (NL>UNL) T-Grammar''' contains the transformation rules used for natural language analysis
**The '''Analysis (NL>UNL) D-Grammar''' contains the disambiguation rules used for [[tokenization]] and for improving the results of the NL-UNL T-Grammar
*[[NLization]]
**The '''Generation (UNL>NL) T-Grammar''' contains the transformation rules used for natural language generation
**The '''Generation (UNL>NL) D-Grammar''' contains the disambiguation rules used for improving the results of the UNL-NL T-Grammar
 
  
 
 
== Structure ==
 
 
=== Basic symbols ===
 
 
{| border="1" cellpadding="2" align=center
 
|+Basic symbols used in UNL grammar rules
 
!Symbol
 
!Definition
 
!Example
 
|-
 
|align=center|<nowiki>^</nowiki>
 
|not
 
|^a = not a
 
|-
 
|align=center|<nowiki>{ | }</nowiki>
 
|or
 
|<nowiki>{a|b}</nowiki> = a or b
 
|-
 
|align=center|%
 
|index for nodes, attributes and values
 
|%x (see [[#Indexes|below]])
 
|-
 
|align=center|#
 
|index for sub-NLWs
 
|#01 (see [[#Indexes|below]])
 
|-
 
|align=center|=
 
|attribute-value assignment
 
|POS=NOU
 
|-
 
|align=center|!
 
|rule trigger
 
|!PLR
 
|-
 
|align=center|&
 
|merge operator
 
|%x&%y
 
|-
 
|align=center|?
 
|dictionary lookup operator
 
|?[a]
 
|-
 
|align=center|" "
 
|string
 
|"went"
 
|-
 
|align=center|[ ]
 
|natural language entry (headword)
 
|[go]
 
|-
 
|align=center|<nowiki>[[ ]]</nowiki>
 
|UW
 
|<nowiki>[[to go(icl>to move)]]</nowiki>
 
|-
 
|align=center|( )
 
|node
 
|(a)
 
|-
 
|align=center|//
 
|regular expression
 
|/a{2,3}/ = aa,aaa
 
|}
 
 
;The differences between "", [] and [[]]
 
:Double quotes are always used to represent strings: "a" will match only the string "a"
 
:Single square brackets are always used to represent natural language entries (headwords) in the dictionary: [a] will match the node associated with the entry [a] retrieved from the dictionary, regardless of its current realization, which may have been changed by other rules (the original [a] may have been replaced, for instance, by "b", but the node will still be indexed to the entry [a])
 
:Double square brackets are always used to represent UWs: <nowiki>[[a]]</nowiki> will match the node associated to the UW <nowiki>[[a]]</nowiki>
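The distinction between the three delimiters can be illustrated with a minimal Python sketch. The function name <code>classify_element</code> and the returned labels are hypothetical and are not part of any UNL tool; the sketch only mirrors the notation described above and treats anything else (e.g. NOU, POS=NOU) as a feature.

```python
def classify_element(token: str) -> str:
    """Return the kind of a single node element written in rule notation.

    Hypothetical helper: the labels below are illustrative only.
    """
    token = token.strip()
    # Double square brackets must be tested before single ones.
    if len(token) >= 4 and token.startswith("[[") and token.endswith("]]"):
        return "uw"
    if len(token) >= 2 and token.startswith("[") and token.endswith("]"):
        return "headword"
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return "string"
    if token.startswith("%"):
        return "index"
    if len(token) >= 2 and token.startswith("/") and token.endswith("/"):
        return "regex"
    # Everything else is assumed to be a feature or an attribute-value pair.
    return "feature"


if __name__ == "__main__":
    for sample in ['"went"', "[go]", "[[to go(icl>to move)]]", "%x", "POS=NOU"]:
        print(sample, "->", classify_element(sample))
```

The order of the tests matters: a UW also starts with a single square bracket, so the double-bracket case has to be checked first.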
 
 
;Predefined values (assigned by default)
 
:SCOPE - Scope
 
:SHEAD - Sentence head (the beginning of a sentence)
 
:STAIL - Sentence tail (the end of a sentence)
 
:CHEAD - Scope head (the beginning of a scope)
 
:CTAIL - Scope tail (the end of a scope)
 
:TEMP - Temporary entry (entry not found in the dictionary)
 
:DIGIT - Any sequence of digits (i.e.: 0,1,2,3,4,5,6,7,8,9)
 
 
=== Basic concepts ===
 
==== Nodes ====
 
''main article: [[node]]s''
 
 
==== Relations ====
 
In order to form a natural language sentence or a UNL graph, nodes are inter-related by relations. In the UNL framework, there can be three different types of relations:
 
*the '''linear''' relation L expresses the surface structure of natural language sentences
 
*'''syntactic''' relations express the deep (tree) structure of natural language sentences
 
*'''semantic''' relations express the structure of UNL graphs
 
===== Properties of relations =====
 
;The linear relation is always binary and is represented in two possible formats:
 
*L(%x;%y), where L is the invariant name of the linear relation, and %x and %y are nodes; or
 
*(%x)(%y)
 
;Syntactic relations are not predefined, although we have been using a set of binary relations based on the [[X-bar theory]].
 
;Semantic relations constitute a predefined and closed set that can be found [[relations|here]].
 
;Syntactic and semantic relations are represented in the same way:
 
*rel(%x;%y), where "rel" is the name of the relation, %x is the source node, and %y is the target node
 
;Arguments of linear, syntactic and semantic relations are not commutative.
 
:The order of the elements in a relation affects the result:
 
::(%x)(%y) is different from (%y)(%x)
 
::relation(%x;%y) is different from relation(%y;%x)
 
;Linear and semantic relations are always binary; syntactic relations may be n-ary:
 
:L(%x;%y) - linear relation
 
:agt(%x;%y) - semantic relation
 
:VH(%x) - unary syntactic relation
 
:VC(%x;%y) - binary syntactic relation
 
:XX(%x;%y;%z) - possible ternary syntactic relation
 
;Inside each relation, nodes are separated by semicolons (;).
 
:VC(%x;%y)
 
:<strike>VC(%x,%y)</strike>
 
;Inside each relation, nodes may be referenced by any of their elements, separated by commas (,):
 
:("a")([b]) - linear relation between a node where string = "a" and another node where headword = [b]
 
:L(<nowiki>[[c]]</nowiki>;D) - linear relation between a node where UW = <nowiki>[[c]]</nowiki> and another node having the feature D
 
:VC(%a;%b) - syntactic relation between a node where index = %a and another node where index = %b
 
:agt("a",[a],<nowiki>[[a]]</nowiki>,A;"b",[b],<nowiki>[[b]]</nowiki>,B) - semantic relation between a node having the feature A where string = "a" AND headword = [a] AND UW = <nowiki>[[a]]</nowiki>, and another node having the feature B where string = "b" AND headword = [b] AND UW = <nowiki>[[b]]</nowiki>
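The two separators can be illustrated with a minimal Python sketch that splits a flat relation written as rel(...) into its arguments (semicolon) and each argument into its elements (comma). The names <code>split_top_level</code> and <code>parse_relation</code> are hypothetical; the sketch assumes the explicit rel(...) notation, no nested relations, and balanced quotes and brackets.

```python
from __future__ import annotations


def split_top_level(text: str, sep: str) -> list[str]:
    """Split text on sep, ignoring separators inside quotes, () or []."""
    parts, current, depth, in_string = [], [], 0, False
    for ch in text:
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "([":
                depth += 1
            elif ch in ")]":
                depth -= 1
            elif ch == sep and depth == 0:
                parts.append("".join(current))
                current = []
                continue
        current.append(ch)
    parts.append("".join(current))
    return parts


def parse_relation(text: str) -> tuple[str, list[list[str]]]:
    """Return (relation name, list of arguments, each a list of elements)."""
    name, _, rest = text.partition("(")
    if not rest.endswith(")"):
        raise ValueError("expected the form rel(arg;arg)")
    body = rest[:-1]
    # Semicolons separate nodes; commas separate the elements of one node.
    args = [split_top_level(arg, ",") for arg in split_top_level(body, ";")]
    return name, args


if __name__ == "__main__":
    print(parse_relation('agt("a",[a],[[a]],A;"b",[b],[[b]],B)'))
    print(parse_relation("VC(%x;%y)"))  # two nodes
    print(parse_relation("VC(%x,%y)"))  # one node with two elements
```

Under these assumptions, VC(%x,%y) is read as a single argument with two elements, which is why the comma cannot be used to separate nodes. Because arguments are kept in order, agt(%x;%y) and agt(%y;%x) also produce different results.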
 
;Relations may be conjoined through juxtaposition:
 
:("a")("b")("c") - two linear relations: one between ("a") and ("b"), and another between ("b") and ("c")
:agt(%x;%y)obj(%x;%z) - two semantic relations: one between (%x) and (%y), and another between (%x) and (%z)
:<strike>VC([a];[b]),VC([a];[c])</strike> - conjoined relations must not be separated by commas
 
;Relations may be disjoined through {braces}
 
:{("a")|("b")}("c") - either ("a")("c") or ("b")("c")
 
:{agt(%x;%y)|exp(%x;%y)}obj(%x;%z) - either agt(%x;%y)obj(%x;%z) or exp(%x;%y)obj(%x;%z)
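A disjunction can be read as shorthand for the list of patterns it stands for. The following Python sketch expands non-nested {a|b} groups into their alternatives; the function name <code>expand_disjunctions</code> is hypothetical, and the sketch assumes that braces are not nested and that strings contain no braces. Only brace groups that contain a vertical bar are expanded, so regular expression quantifiers such as {2,3} are left untouched.

```python
from __future__ import annotations

import itertools
import re


def expand_disjunctions(pattern: str) -> list[str]:
    """Expand every non-nested {a|b|...} group into its alternatives."""
    # The capturing group makes re.split return literal text at even
    # positions and the content of each {...|...} group at odd positions.
    pieces = re.split(r"\{([^{}]*\|[^{}]*)\}", pattern)
    options = [p.split("|") if i % 2 else [p] for i, p in enumerate(pieces)]
    return ["".join(combo) for combo in itertools.product(*options)]


if __name__ == "__main__":
    print(expand_disjunctions('{("a")|("b")}("c")'))
    print(expand_disjunctions("{agt(%x;%y)|exp(%x;%y)}obj(%x;%z)"))
```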
 
;Syntactic and semantic relations may be replaced by regular expressions
 
:/.{2,3}/(%x;%y) - any relation made of two or three characters between %x and %y
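As a rough illustration of the last property, the Python sketch below tests a relation name against either a literal name or a regular expression written between slashes. The function name <code>relation_name_matches</code> is hypothetical, and two assumptions are made that the text above does not confirm: the expression must match the whole relation name, and Python's <code>re</code> dialect stands in for whatever dialect the UNL engines use.

```python
import re


def relation_name_matches(pattern: str, name: str) -> bool:
    """Check a relation name against a literal name or a /regex/ pattern."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        # Assumption: the expression has to cover the whole relation name.
        return re.fullmatch(pattern[1:-1], name) is not None
    return pattern == name


if __name__ == "__main__":
    for candidate in ["L", "VC", "agt", "XXXX"]:
        print(candidate, relation_name_matches("/.{2,3}/", candidate))
```

With the whole-name assumption, /.{2,3}/ accepts VC and agt but rejects L (one character) and any name longer than three characters.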
 
 
==== Hyper-nodes ====
 
Nodes may contain one or more relations. In this case, they are called "hyper-nodes", and represent scopes or sub-graphs. Like any node, a hyper-node has a string, a headword, a UW, an index and features; its internal relations are a special type of feature. Examples of hyper-nodes are the following:
 
*(("a")("b")) - a hyper-node containing a linear relation between the nodes ("a") and ("b")
 
*(VC(%x;%y)VA(%x;%z)) - a hyper-node containing two syntactic relations: VC(%x;%y) AND VA(%x;%z)
 
*(agt([a];[b])obj([a];[c])) - a hyper-node containing two semantic relations: agt([a];[b]) AND obj([a];[c])
 
*(([kick],V)([the],D)([bucket],N),V,NTST) - a hyper-node having the features V and NTST and containing two linear relations: one between the nodes ([kick],V) and ([the],D), and another between ([the],D) and ([bucket],N)
 
*(([kick],V)([the],D)([bucket],N),"kick the bucket",<nowiki>[[die]]</nowiki>,V,NTST) - the same as before, except for the fact that the hyper-node has string = "kick the bucket" and UW = <nowiki>[[die]]</nowiki>
 
Hyper-nodes may also contain internal hyper-nodes:
 
*((("a")("b"))("c")) - a hyper-node containing a linear relation between the hyper-node (("a")("b")) and the node ("c")
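The nesting of hyper-nodes written in the linear notation can be made explicit with a minimal Python sketch that extracts the nodes directly contained in a hyper-node and pairs the adjacent ones into linear relations. The names <code>child_nodes</code> and <code>linear_relations</code> are hypothetical. The sketch assumes the juxtaposed (a)(b) notation only, so it does not handle hyper-nodes whose internal relations are written as rel(...), and it assumes that strings and UWs contain no parentheses.

```python
from __future__ import annotations


def child_nodes(hyper_node: str) -> list[str]:
    """Return the nodes directly contained in a hyper-node, in order."""
    if not (hyper_node.startswith("(") and hyper_node.endswith(")")):
        raise ValueError("a node must be written between parentheses")
    inner = hyper_node[1:-1]
    children, depth, start = [], 0, 0
    for pos, ch in enumerate(inner):
        if ch == "(":
            if depth == 0:
                start = pos  # a directly contained node opens here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                children.append(inner[start:pos + 1])
    return children


def linear_relations(hyper_node: str) -> list[tuple[str, str]]:
    """Pair adjacent children: (a)(b)(c) gives (a,b) and (b,c)."""
    children = child_nodes(hyper_node)
    return list(zip(children, children[1:]))


if __name__ == "__main__":
    print(child_nodes('((("a")("b"))("c"))'))
    print(linear_relations("(([kick],V)([the],D)([bucket],N),V,NTST)"))
```

For the "kick the bucket" example, the sketch reports two linear relations, matching the description above; the features V and NTST of the hyper-node itself are ignored because they are not enclosed in parentheses.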
 
===== Properties of hyper-nodes =====
 
;As any node, hyper-nodes are expressed between (parentheses)
 
:(("a")("b"))
 
;As any node, hyper-nodes may have one single string, one single headword and one single UW, but may have as many features and internal relations as necessary
 
:(([kick],V)([the],D)([bucket],N),"kick the bucket",[kick the bucket],<nowiki>[[die]]</nowiki>,V,NTST)
 
;As any node, hyper-nodes may be referenced by any of their elements, including internal relations
 
:(([kick],V)) - refers to any hyper-node containing the node ([kick],V)
 
:(([the],D)([bucket],N)) - refers to any hyper-node containing a linear relation between ([the],D) AND ([bucket],N)
 
:(([kick],V),([bucket],N)) - refers to any hyper-node containing the nodes ([kick],V) AND ([bucket],N)
 
;When a hyper-node is deleted, all its internal relations are deleted as well
 
:(([kick],V)([the],D)([bucket],N)):=; (the hyper-node is deleted, as well as the relations ([kick],V)([the],D) AND ([the],D)([bucket],N))
 
 
==== Hyper-relations ====
 
Relations may have relations as arguments. In this case, they are said to be "hyper-relations". Examples of hyper-relations are the following:
 
*XP(XB(%a;%b);%c) - a syntactic relation XP between the syntactic relation XB(%a;%b) and the node %c
 
*and(agt([a];[b]);agt([a];[c])) - a semantic relation "and" between the semantic relations agt([a];[b]) AND agt([a];[c])
 
===== Properties of hyper-relations =====
 
;A hyper-relation may have one single relation as each argument
 
*XP(XB(%a;%b);%c) - the source argument of the hyper-relation XP is a relation
 
*XP(%a;XB(%b;%c)) - the target argument of the hyper-relation XP is a relation
 
*XP(VC(%a;%b);VA(%a;%c)) - the source and the target argument of the hyper-relation XP are relations
 
*<strike>XP(VC(%a;%b)VA(%a;%c);VS(%a;%d))</strike> - a hyper-relation may not have more than one relation as one single argument (in this case, the hyper-relation XP contains two relations as the source argument)
 
;Relations do not have strings, UWs, headwords or any features
 
*<strike>XP(XB(%a;%b),"ab",[ab],<nowiki>[[ab]]</nowiki>,A,B;%c)</strike> (the relation XB(%a;%b) may not have strings, UWs, headwords or any features)
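The two properties above can be checked mechanically. The Python sketch below is a hypothetical validator, not part of any UNL tool: it accepts a hyper-relation only if each argument holds at most one relation and that relation carries no extra elements. It assumes balanced parentheses and that strings and UWs inside the arguments contain no parentheses or semicolons.

```python
from __future__ import annotations

import re


def split_arguments(body: str) -> list[str]:
    """Split the body of a relation on semicolons that are not nested."""
    args, depth, start = [], 0, 0
    for pos, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == ";" and depth == 0:
            args.append(body[start:pos])
            start = pos + 1
    args.append(body[start:])
    return args


def count_relations(argument: str) -> int:
    """Count the parenthesised groups that open at the top level."""
    count, depth = 0, 0
    for ch in argument:
        if ch == "(":
            if depth == 0:
                count += 1
            depth += 1
        elif ch == ")":
            depth -= 1
    return count


def is_valid_hyper_relation(text: str) -> bool:
    name, _, rest = text.partition("(")
    if not name or not rest.endswith(")"):
        return False
    for argument in split_arguments(rest[:-1]):
        relations = count_relations(argument)
        if relations > 1:
            return False  # more than one relation in a single argument
        if relations == 1 and not re.fullmatch(r"\w+\(.*\)", argument.strip()):
            return False  # the relation carries strings, UWs or features
    return True


if __name__ == "__main__":
    print(is_valid_hyper_relation("XP(XB(%a;%b);%c)"))
    print(is_valid_hyper_relation("XP(VC(%a;%b)VA(%a;%c);VS(%a;%d))"))
```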
 

Revision as of 14:55, 16 August 2013

In the UNL framework, a grammar is a set of rules used to generate UNL out of natural language, and natural language out of UNL. Together with the dictionaries, grammars constitute the basic resources for UNLization and NLization.


Modules

In the UNL framework there are three types of grammar:

  • N-Grammar, or Normalization Grammar, is a set of N-rules used to segment the natural language text into sentences and to prepare the input for processing.
  • T-Grammar, or Transformation Grammar, is a set of T-rules used to transform natural language into UNL or UNL into natural language.
  • D-Grammar, or Disambiguation Grammar, is a set of D-rules used to improve the performance of transformation rules by constraining or forcing their applicability.

Direction

In the UNL framework, grammars are not bidirectional, although they share the same syntax:

  • UNLization
    • The N-Grammar contains the normalization rules for natural language analysis
    • The Analysis (NL>UNL) T-Grammar contains the transformation rules used for natural language analysis
    • The Analysis (NL>UNL) D-Grammar contains the disambiguation rules used for tokenization and for improving the results of the NL-UNL T-Grammar
  • NLization
    • The Generation (UNL>NL) T-Grammar contains the transformation rules used for natural language generation
    • The Generation (UNL>NL) D-Grammar contains the disambiguation rules used for improving the results of the UNL-NL T-Grammar

Processing Units

In the UNL framework, grammars may target different processing units:

  • Text-driven grammars process the source document as a single unit (i.e., without any internal subdivision)
  • Sentence-driven grammars process each sentence or graph separately
  • Word-driven grammars process words in isolation

Text-driven grammars are normally used in summarization and simplification, when the rhetorical structure of the source document is important. Sentence-driven grammars are used mostly in translation, when the source document can be treated as a list of non-semantically related units, to be processed one at a time. Word-driven grammars are used in information retrieval and opinion mining, when each word or node can be treated in isolation.

Recall

Grammars may target the whole source document or only parts of it (e.g. main clauses):

  • Chunk grammars target only a part of the source document
  • Full grammars target the whole source document

Precision

Grammars may target the deep or the surface structure of the source document:

  • Deep grammars focus on the deep dependency relations of the source document and normally have three levels (network, tree and list)
  • Shallow grammars focus only on the surface dependency relations of the source document and normally have only two levels (network and list)

Assessment

Main article: F-measure

Grammars are evaluated through the F-measure, a weighted harmonic mean of precision and recall.
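For illustration, the F-measure in its usual general form can be computed as in the Python sketch below. The function name <code>f_measure</code> and the parameter <code>beta</code> are assumptions made for this example: the text above does not state which weighting is used for UNL grammars, so beta = 1 (precision and recall weighted equally) is only a default.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 gives more weight to recall, beta < 1 to precision.
    The default beta = 1 is an assumption, not a documented UNL setting.
    """
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta > 0")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    print(f_measure(0.5, 0.5))
    print(f_measure(1.0, 0.5))
```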

Software