Grammar

From UNL Wiki
(Difference between revisions)
(Undo revision 7682 by Domtheo (talk))
 
(73 intermediate revisions by 2 users not shown)
Line 1: Line 1:
In the UNL framework, a '''grammar''' is a set of rules that are used to generate UNL out of natural language, and natural language out of UNL. Along with the [[Lexica|UNL-NL dictionaries]], they constitute the basic resource for [[UNLization]] and [[NLization]].
+
In the UNL framework, a '''grammar''' is a set of rules that is used to generate UNL out of natural language, and natural language out of UNL. Along with [[Dictionary|dictionaries]], grammars constitute the basic resource for [[UNLization]] and [[NLization]].
  
== Networks, Trees and Lists ==
+
== Basic Symbols ==
Natural language sentences and UNL graphs are supposed to convey the same amount of information in different structures: whereas the former arranges data as an ordered list of words, the latter organizes it as a network. In that sense, going from natural language into UNL and from UNL into natural language is ultimately a matter of transforming lists into networks and vice-versa.
+
{{:Basic Symbols}}
+
The UNL framework assumes that such transformation can be carried out progressively, i.e., through a transitional data structure: the tree, which could be used as an interface between lists and networks. Accordingly, there are seven different types of rules (LL, TT, NN, LT, TL, TN, NT), as indicated below:
+
  
*'''ANALYSIS''' (NL-UNL)
+
== Basic Concepts ==
**LL - List Processing (list-to-list)
+
{{:Grammar units}}
**LT - Surface-Structure Formation (list-to-tree)
+
**TT - Syntactic Processing (tree-to-tree)
+
**TN - Deep-Structure Formation (tree-to-network)
+
**NN - Semantic Processing (network-to-network)
+
  
*'''GENERATION''' (UNL-NL)
+
== Rules ==
**NN - Semantic Processing (network-to-network)
+
{{:Rule}}
**NT - Deep-Structure Formation (network-to-tree)
+
**TT - Syntactic Processing (tree-to-tree)
+
**TL - Surface-Structure Formation (tree-to-list)
+
**LL - List Processing (list-to-list)
+
  
The '''NL original sentence''' is supposed to be preprocessed, by the LL rules, in order to become an ordered list. Next, the resulting '''list structure''' is parsed with the LT rules, so as to unveil its '''surface syntactic structure''', which is already a tree. The tree structure is further processed by the TT rules in order to expose its inner organization, the '''deep syntactic structure''', which is supposed to be more suitable to the semantic interpretation. Then, this deep syntactic structure is projected into a semantic network by the TN rules. The resultant '''semantic network''' is then post-edited by the NN rules in order to comply with UNL standards and generate the '''UNL Graph'''.
+
== Modules ==
 
+
In the UNL framework there are three types of grammar:
The reverse process is carried out during natural language generation. The '''UNL graph''' is preprocessed by the NN rules in order to become a more easily tractable semantic network. The resulting '''network structure''' is converted, by the NT rules, into a syntactic structure, which is still distant from the surface structure, as it is directly derived from the semantic arrangement. This '''deep syntactic structure''' is subsequently transformed into a '''surface syntactic structure''' by the TT rules. The surface syntactic structure undergoes further changes according to the TL rules, which generate an NL-like '''list structure'''. This list structure is finally realized as a '''natural language sentence''' by the LL rules.
+
*[[N-Grammar]], or Normalization Grammar, is a set of T-rules used to segment the natural language text into sentences and to prepare the input for processing.
 
+
*[[T-Grammar]], or Transformation Grammar, is a set of T-rules used to transform natural language into UNL or UNL into natural language.
As sentences are complex structures that may contain nested or embedded phrases, both the analysis and the generation processes may be '''interleaved''' rather than pipelined. This means that the natural flow described above is only the "normal" one, not a "necessary" one. During natural language generation, an LL rule may apply prior to a TT rule, or an NN rule may be applied after a TL rule. Rules are recursive and must be applied in the order defined in the grammar as long as their conditions are true, regardless of the state.
+
*[[D-Grammar]], or Disambiguation Grammar, is a set of D-rules used to improve the performance of transformation rules by constraining or forcing their applicability.
 
+
== Types of rules ==
+
''Main article: [[Grammar Specs]]''
+
 
+
In the UNL framework there are two basic types of rules:
+
*Transformation rules, or [[T-rule]]s, are used to manipulate data structures, i.e., to transform lists into trees, trees into lists, trees into networks, networks into trees, etc. They follow the very general formalism
+
α:=β;
+
where the left side α is a condition statement, and the right side β is an action to be performed over α.  
+
*Disambiguation rules, or [[D-rule]]s, are used to improve the performance of transformation rules by constraining or forcing their applicability. Disambiguation rules follow the formalism:
+
α=P;
+
where the left side α is a statement and the right side P is an integer from 0 to 255 that indicates the probability of occurrence of α. 
+
  
 
== Direction ==
 
In the UNL framework, we distinguish between analysis and generation grammars:
+
In the UNL framework, grammars are not bidirectional, although they share the same syntax:
*The '''UNL-NL T-G Grammar''' is used to generate natural language out of UNL
+
*[[UNLization]] (NL>UNL)
*The '''NL-UNL (Analysis) Grammar''' is used to generate UNL out of natural language
+
**The '''N-Grammar''' contains the normalization rules for natural language analysis
 
+
**The '''Analysis T-Grammar''' contains the transformation rules used for natural language analysis
== Units ==
+
**The '''Analysis D-Grammar''' contains the disambiguation rules used for [[tokenization]] and for improving the results of the NL-UNL T-Grammar
The process of UNLization may have different representation units, as follows:
+
*[[NLization]] (UNL>NL)
*Word-driven UNLization (the source document is represented as a single network of individual concepts)
+
**The '''Generation T-Grammar''' contains the transformation rules used for natural language generation
*Sentence-driven UNLization (the source document is represented as a list of non-semantically related networks of individual concepts)
+
**The '''Generation D-Grammar''' contains the disambiguation rules used for improving the results of the UNL-NL T-Grammar
*Text-driven UNLization (the source document is represented as a network of semantically related networks of individual concepts)
+
In word-driven UNLization, the sentence boundaries and the structure of the source document are ignored, and the source document is represented as a single graph, i.e., as a simple network of individual concepts. In sentence-driven UNLization, the source document is analyzed, sentence by sentence, as a list of non-semantically related hyper-graphs. Each sentence is represented separately, and the only relation standing between sentences is their order in the source document. Finally, text-driven UNLization targets the rhetorical structure of the source document, i.e., it analyzes the source document as a network of semantically related hyper-graphs. Word-driven UNLization is used mainly for information retrieval and extraction, whereas sentence- and text-driven UNLization are normally used for translation.
+
  
== Paradigms ==
+
== Processing Units ==
The process of UNLization may follow several different paradigms, as follows:
+
In the UNL framework, grammars may target different processing units:
*Language-based UNLization (based mainly on a [[UNL Dictionary|NL-UNL dictionary]] and [[Grammar Specs|NL-UNL grammar]])
+
*'''Text-driven grammars''' process the source document as a single unit (i.e., without any internal subdivision)
*Knowledge-based UNLization (based mainly on the [[UNL Knowledge Base]])
+
*'''Sentence-driven grammars''' process each sentence or graph separately
*Example-based UNLization (based mainly on the [[UNL Example Base]])
+
*'''Word-driven grammars''' process words in isolation
*Memory-based UNLization (based mainly on the [[UM Specs|UNLization Memory]])
+
Text-driven grammars are normally used in summarization and simplification, when the rhetorical structure of the source document is important. Sentence-driven grammars are used mostly in translation, when the source document can be treated as a list of non-semantically related units, to be processed one at a time. Word-driven grammars are used in information retrieval and opinion mining, when each word or node can be treated in isolation.
*Statistical-based UNLization (based mainly on statistical predictions derived from UNL-NL corpora)
+
*Dialogue-based UNLization (based mainly on the interaction with the user)
+
The actual UNLization is normally hybrid and may combine several of the strategies above.
+
  
 
== Recall ==  
 
The process of UNLization may target the whole source document or only parts of it (e.g. main clauses):
+
Grammars may target the whole source document or only parts of it (e.g. main clauses):
*Full UNLization (the whole source document is UNLized)
+
*'''Chunk grammars''' target only a part of the source document
*Partial (or chunk) UNLization (only a part of the source document is UNLized)
+
*'''Full grammars''' target the whole source document
;Peter killed Mary with a knife yesterday morning.
+
:Full UNLization: Peter killed Mary with a knife yesterday morning.
+
:Partial UNLization: Peter killed Mary.
+
  
 
== Precision ==
 
The process of UNLization may target the deep semantic structure of the source document (i.e., the resulting semantic structure replicates the syntactic structure of the original) or only its surface structure (the resulting semantic structure does not preserve the syntactic structure of the original):
+
Grammars may target the deep or the surface structure of the source document:
*Deep UNLization (the UNLization focuses on the deep semantic structure of the source document)
+
*'''Deep grammars''' focus on the deep dependency relations of the source document and normally have three levels (network, tree and list)
*Shallow UNLization (the UNLization focuses on the surface semantic structure of the source document)
+
*'''Shallow grammars''' focus only on the surface dependency relations of the source document and normally have only two levels (network and list)
Syntactic structures are preserved in the UNL document by the use of syntactic attributes (such as @passive, @topic, etc.) or by hyper-nodes (i.e., [[scope]]s). For some purposes, such as translation, UNLization may require syntactic details; for others, such as information retrieval, syntactic structures at this level are not normally necessary:
+
;Mary was killed by Peter
+
:Shallow UNLization: Peter killed Mary
+
:Deep UNLization: [Peter killed Mary].@passive
+
;Mary saw Peter going to Paris.
+
:Shallow UNLization: Mary saw Peter & Peter was going to Paris
+
:Deep UNLization: Mary saw [Peter going to Paris].
+
;As for the little girl, the dog licked her.
+
:Shallow UNLization: the dog licked the little girl
+
:Deep UNLization: the dog licked [the little girl].@topic
+
  
== Level ==
+
== Assessment ==
The process of UNLization may target literal meanings (locutionary content) or non-literal meanings (illocutionary content).
+
''Main article: [[F-measure]]''
*Locutionary (the UNLization represents only the literal meaning)
+
*Illocutionary (the UNLization also represents non-literal meanings, including speech acts)
+
The illocutionary force may be represented by figure-of-speech and speech-act attributes:
+
;It is as soft as concrete
+
:Locutionary level: it is as soft as concrete
+
:Illocutionary level: [it is as soft as concrete].@irony
+
;Can you pass me the salt?
+
:Locutionary level: can you pass me the salt?
+
:Illocutionary level: [you pass me the salt].@request
+
  
== Methods ==
+
Grammars are evaluated through the F-measure, a weighted harmonic mean of precision and recall.
Humans and machines may play different roles in UNLization methods:
+
*Fully automatic UNLization (the whole process is carried out by the machine, without any intervention of the human user)
+
*Human-aided machine UNLization (the process is carried out mainly by the machine, with some intervention of the human user, either as a pre-editor or as a post-editor, or during the UNLization itself, as in dialogue-based UNLization)
+
*Machine-aided human UNLization (the process is carried out mainly by the human user, with some help from the machine, as in dictionary or memory lookup)
+
*Fully human UNLization (the whole process is carried out by the human user, without any intervention of the machine)
+

Latest revision as of 09:39, 27 May 2014

In the UNL framework, a grammar is a set of rules that is used to generate UNL out of natural language, and natural language out of UNL. Along with dictionaries, grammars constitute the basic resource for UNLization and NLization.


Basic Symbols

Basic symbols used in the UNL framework
Symbol    Definition                              Example
( )       node                                    (%a)
" "       string                                  "went"
[ ]       natural language entry (headword)       [go]
[[ ]]     UW                                      [[to go(icl>to move)]]
//        regular expression                      /a{2,3}/ = aa,aaa
rel(x;y)  relation                                agt(kill;Peter)
^         not                                     ^a = not a
{ | }     or                                      {a|b} = a or b
%         index for nodes, attributes and values  %x
:         scope ID                                :01
#         index for sub-NLWs                      #01
=         attribute-value assignment              POS=NOU
!         rule trigger                            !PLR
&         merge operator                          %x&%y
?         dictionary lookup operator              ?[a]

Basic Concepts

Node
A node is the most elementary unit in the graph. It is the result of the tokenization process, and corresponds to the notion of "lexical item". At the surface level, a natural language sentence is considered a list of nodes, and a UNL graph a set of relations between nodes.
Relation
In order to form a natural language sentence or a UNL graph, nodes are inter-related by relations. In the UNL framework, there are three different types of relations: the linear (list) relation, syntactic relations and semantic relations.
Hyper-Node
A hyper-node is a sub-graph, i.e., a scope: a node containing relations between nodes.
Hyper-Relation
A hyper-relation is a relation between relations.
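
The four concepts above can be pictured as plain data structures. The following is only an illustrative sketch in Python, not the representation used by any UNL engine: the class names, the `features` field and the relation label `rel` are assumptions made for the example, while `agt(kill;Peter)` and the scope ID `:01` follow the notation in the Basic Symbols table.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A lexical item produced by tokenization."""
    text: str
    features: set = field(default_factory=set)


@dataclass
class Relation:
    """A labelled link between two arguments, e.g. agt(kill;Peter)."""
    label: str
    source: object  # a Node, a HyperNode or another Relation
    target: object


@dataclass
class HyperNode:
    """A scope: a node that contains relations between nodes."""
    scope_id: str
    relations: list = field(default_factory=list)


kill = Node("kill", {"V"})
peter = Node("Peter", {"N"})
mary = Node("Mary", {"N"})

agt = Relation("agt", kill, peter)
obj = Relation("obj", kill, mary)

# Hyper-node: the sub-graph for "Peter killed Mary" wrapped as scope :01.
scope = HyperNode(":01", [agt, obj])

# Hyper-relation: a relation whose arguments are themselves relations.
# The label "rel" is a placeholder, not an actual UNL relation.
hyper = Relation("rel", agt, obj)
```

Letting `source` and `target` hold any of the three classes is what allows the same `Relation` type to express both ordinary relations and hyper-relations.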

Rules

Grammars are sets of rules used to go from UNL into natural language, or from natural language into UNL. In the UNL framework, there can be two different types of rules:

  • T-rules, or transformation rules, are used to perform changes to nodes or relations
  • D-rules, or disambiguation rules, are used to control changes over nodes or relations

T-rules

Main article: T-rule

T-rules are used to perform actions and follow the very general formalism

α:=β;

where the left side α is a condition statement, and the right side β is an action to be performed over α.

There are several special types of T-rules:

  • A-rule is a specific type of T-rule used for affixation (prefixation, infixation, suffixation)
  • C-rule is a specific type of T-rule used for composition (word formation in case of compounds and multiword expressions)
  • L-rule is a specific type of T-rule used for handling word order
  • N-rule is a specific type of T-rule used for segmenting sentences and normalizing the input text
  • S-rule is a specific type of T-rule used for handling syntactic structures

Examples of T-rules

  • PLR:=0>"s"; (A-rule: add "s" in case of plural, as in book>books)
  • MTW:=+VA("into account",PP); (C-rule: add the prepositional phrase "into account" as an adjunct to the verbal phrase (VA) in order to form the multiword expression, as in take>take into account)
  • (ART,%x)(QUA,%y):=(%y)(%x); (L-rule: reverse the order ART+QUA to QUA+ART, as in the all>all the)
  • ("don't"):=("do not"); (N-rule: replace the contraction "don't" by "do not")
  • (V,%x)(N,%y):=VC(%x;%y); (S-rule: replace the linear relation between a verb and a noun by the syntactic relation VC between them)
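
To make the condition-action reading of α:=β; concrete, the sketch below mimics two of the examples above in Python: the L-rule that reorders ART+QUA and the N-rule that expands "don't". It is a simplified illustration under stated assumptions. Nodes are modelled as hypothetical `(text, feature)` tuples, the function names are invented for the example, and each rule is applied in a single left-to-right pass rather than by an actual UNL rule engine.

```python
def apply_l_rule(nodes, left_feat, right_feat):
    """Swap every adjacent pair matching (left_feat)(right_feat).

    Mirrors the shape of (ART,%x)(QUA,%y):=(%y)(%x); on a list of
    (text, feature) tuples. The list is scanned once, left to right.
    """
    out = list(nodes)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == left_feat and out[i + 1][1] == right_feat:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just rewritten
        else:
            i += 1
    return out


def apply_n_rule(text, old, new):
    """Replace a surface string, as in ("don't"):=("do not");"""
    return text.replace(old, new)


sentence = [("the", "ART"), ("all", "QUA"), ("books", "NOU")]
reordered = apply_l_rule(sentence, "ART", "QUA")
normalized = apply_n_rule("I don't know", "don't", "do not")

print(reordered)   # [('all', 'QUA'), ('the', 'ART'), ('books', 'NOU')]
print(normalized)  # I do not know
```

In both functions the arguments that select what to match play the role of α, and the rewrite performed on the match plays the role of β.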

D-rules

Main article: D-rule

D-rules are used to control the action of T-rules. They are used to control the dictionary retrieval (in tokenization) and to prevent or to induce the application of rules in transformation.

D-rules follow the syntax:

α=P;

where the left side α is a statement and the right side P is an integer from 0 to 255 that indicates the probability of occurrence of α.

Examples of D-rules

  • (ART)(VER)=0; (there cannot be any article before a verb)
  • agt(^V,^J;)=0; (the source node of an agent relation must be either a verb or an adjective)
  • (D)(N)=1; (determiners may come before nouns)
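
One way to picture how D-rules constrain tokenization is to score competing analyses and discard those that a rule sets to 0. The Python sketch below does this for the word sequence "the book", where "book" could be read as a verb or as a noun. The dictionary encoding, the `score` function, the default value for pairs not covered by any rule, and the use of the minimum as the combined score are all assumptions made for illustration; the source only states that P is an integer from 0 to 255.

```python
# Hypothetical encoding of D-rules: adjacent feature pair -> value P.
D_RULES = {
    ("ART", "VER"): 0,  # (ART)(VER)=0;  an article never precedes a verb
    ("D", "N"): 1,      # (D)(N)=1;      determiners may precede nouns
}


def score(features, rules=D_RULES, default=1):
    """Return the lowest P over all adjacent feature pairs.

    A result of 0 means that some D-rule rules the sequence out.
    Pairs not covered by any rule receive the assumed default value.
    """
    pairs = zip(features, features[1:])
    return min((rules.get(pair, default) for pair in pairs), default=default)


# Two candidate analyses of "the book".
candidates = [["ART", "VER"], ["ART", "NOU"]]
allowed = [c for c in candidates if score(c) > 0]

print(allowed)  # [['ART', 'NOU']]
```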

Modules

In the UNL framework there are three types of grammar:

  • N-Grammar, or Normalization Grammar, is a set of T-rules used to segment the natural language text into sentences and to prepare the input for processing.
  • T-Grammar, or Transformation Grammar, is a set of T-rules used to transform natural language into UNL or UNL into natural language.
  • D-Grammar, or Disambiguation Grammar, is a set of D-rules used to improve the performance of transformation rules by constraining or forcing their applicability.

Direction

In the UNL framework, grammars are not bidirectional, although they share the same syntax:

  • UNLization (NL>UNL)
    • The N-Grammar contains the normalization rules for natural language analysis
    • The Analysis T-Grammar contains the transformation rules used for natural language analysis
    • The Analysis D-Grammar contains the disambiguation rules used for tokenization and for improving the results of the NL-UNL T-Grammar
  • NLization (UNL>NL)
    • The Generation T-Grammar contains the transformation rules used for natural language generation
    • The Generation D-Grammar contains the disambiguation rules used for improving the results of the UNL-NL T-Grammar

Processing Units

In the UNL framework, grammars may target different processing units:

  • Text-driven grammars process the source document as a single unit (i.e., without any internal subdivision)
  • Sentence-driven grammars process each sentence or graph separately
  • Word-driven grammars process words in isolation

Text-driven grammars are normally used in summarization and simplification, when the rhetorical structure of the source document is important. Sentence-driven grammars are used mostly in translation, when the source document can be treated as a list of non-semantically related units, to be processed one at a time. Word-driven grammars are used in information retrieval and opinion mining, when each word or node can be treated in isolation.

Recall

Grammars may target the whole source document or only parts of it (e.g. main clauses):

  • Chunk grammars target only a part of the source document
  • Full grammars target the whole source document

Precision

Grammars may target the deep or the surface structure of the source document:

  • Deep grammars focus on the deep dependency relations of the source document and normally have three levels (network, tree and list)
  • Shallow grammars focus only on the surface dependency relations of the source document and normally have only two levels (network and list)

Assessment

Main article: F-measure

Grammars are evaluated through the F-measure, a weighted harmonic mean of precision and recall.
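
The F-measure is conventionally computed as the weighted harmonic mean of precision and recall. The Python sketch below uses the standard F-beta formulation, in which `beta` expresses how much more weight recall receives than precision; the weighting actually adopted for UNL grammars is not stated here, so the default `beta=1.0` (equal weight) is an assumption.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # avoid dividing by zero when both terms vanish
    return (1 + b2) * precision * recall / denominator


print(f_measure(0.5, 0.5))            # 0.5
print(round(f_measure(0.8, 0.6), 4))  # 0.6857
```

With `beta` above 1 the score moves toward recall, and with `beta` below 1 it moves toward precision.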

Software