NL Memory

From UNL Wiki
Revision as of 14:46, 24 September 2012 by Martins (Talk | contribs)

The NL Memory is a list of syntactic (subcategorization) frames holding between natural language words or terms that co-occur more often than would be expected by chance. These frames are used to represent collocations, i.e., partly or fully fixed expressions that become established through repeated context-dependent use.

The NL Memory may be provided in two different formats: extended and simplified.


Extended format

NL Memory entries in extended format must have the following structure:

<relation name="RNAME" frequency="RFREQ">
  <source id="SID" attribute="ATT" lang="LID">SOURCE</source>
  <target id="TID" attribute="ATT" lang="LID">TARGET</target>
</relation>

Where:
RNAME is the name of a syntactic relation ("NA", "NC", "NS", etc.);
RFREQ is the frequency of the relation RNAME between the SOURCE and the TARGET in the corpus;
SID is a number used to identify the SOURCE;
TID is a number used to identify the TARGET;
ATT is a set of attribute-value pairs that apply to the SOURCE or to the TARGET ("POS=NOU", "GEN=NEU", etc.);
SOURCE is the source node of the syntactic relation;
TARGET is the target node of the syntactic relation;
LID is the ISO 639-2 three-character code for the language of the SOURCE or the TARGET.
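
As an illustration of the structure above, the following Python sketch builds a single entry with the standard library. The function name make_relation, the ids, the frequency and the attribute values (including "POS=DET") are invented for the example and are not taken from any actual NL Memory; "eng" is the ISO 639-2 code for English.

```python
import xml.etree.ElementTree as ET


def make_relation(name, frequency, source, target):
    """Build one <relation> element.

    `source` and `target` are dicts with the keys
    'id', 'attribute', 'lang' and 'text'.
    """
    relation = ET.Element("relation", {"name": name, "frequency": str(frequency)})
    for tag, node in (("source", source), ("target", target)):
        child = ET.SubElement(
            relation,
            tag,
            {
                "id": str(node["id"]),
                "attribute": node["attribute"],
                "lang": node["lang"],
            },
        )
        # The word or term itself is the text content of the element.
        child.text = node["text"]
    return relation


# Hypothetical values, for illustration only.
entry = make_relation(
    "NS",
    42,
    {"id": 1, "attribute": "POS=NOU", "lang": "eng", "text": "United States"},
    {"id": 2, "attribute": "POS=DET", "lang": "eng", "text": "the"},
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```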

XML Schema

<?xml version="1.0" encoding="utf-16"?>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="nlm">
   <xsd:complexType>
     <xsd:sequence>
       <xsd:element maxOccurs="unbounded" name="relation">
         <xsd:complexType>
           <xsd:sequence>
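             <xsd:element name="source">
               <xsd:complexType>
                 <!-- simpleContent lets the element carry the SOURCE text alongside its attributes -->
                 <xsd:simpleContent>
                   <xsd:extension base="xsd:string">
                     <xsd:attribute name="id" type="xsd:unsignedLong" use="required" />
                     <xsd:attribute name="attribute" type="xsd:string" use="optional" />
                     <xsd:attribute name="lang" type="xsd:string" use="optional" />
                   </xsd:extension>
                 </xsd:simpleContent>
               </xsd:complexType>
             </xsd:element>
             <xsd:element name="target">
               <xsd:complexType>
                 <!-- simpleContent lets the element carry the TARGET text alongside its attributes -->
                 <xsd:simpleContent>
                   <xsd:extension base="xsd:string">
                     <xsd:attribute name="id" type="xsd:unsignedLong" use="required" />
                     <xsd:attribute name="attribute" type="xsd:string" use="optional" />
                     <xsd:attribute name="lang" type="xsd:string" use="optional" />
                   </xsd:extension>
                 </xsd:simpleContent>
               </xsd:complexType>
             </xsd:element>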
           </xsd:sequence>
           <xsd:attribute name="name" type="xsd:string" use="required"/>
           <xsd:attribute name="frequency" type="xsd:int" use="optional"/>
         </xsd:complexType>
       </xsd:element>
     </xsd:sequence>
   </xsd:complexType>
 </xsd:element>
</xsd:schema>
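
The attribute and ordering constraints expressed by the schema can be approximated with a few lines of Python. This is only a sketch that uses the standard library instead of a real XSD validator: the function name check_nlm and the sample document are hypothetical, and the numeric checks test only for ASCII digits, not for the value ranges of xsd:unsignedLong and xsd:int.

```python
import re
import xml.etree.ElementTree as ET

UNSIGNED = re.compile(r"[0-9]+")      # rough stand-in for xsd:unsignedLong
INTEGER = re.compile(r"[+-]?[0-9]+")  # rough stand-in for xsd:int


def check_nlm(xml_text):
    """Return a list of problems found in an <nlm> document (empty if none)."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "nlm":
        problems.append("root element must be <nlm>")
    relations = root.findall("relation")
    if not relations:
        problems.append("<nlm> must contain at least one <relation>")
    for i, rel in enumerate(relations, start=1):
        if rel.get("name") is None:
            problems.append(f"relation {i}: missing required attribute 'name'")
        freq = rel.get("frequency")
        if freq is not None and not INTEGER.fullmatch(freq):
            problems.append(f"relation {i}: 'frequency' is not an integer")
        # The schema requires exactly one <source> followed by one <target>.
        if [child.tag for child in rel] != ["source", "target"]:
            problems.append(f"relation {i}: expected <source> then <target>")
        for child in rel:
            node_id = child.get("id")
            if node_id is None or not UNSIGNED.fullmatch(node_id):
                problems.append(f"relation {i}: <{child.tag}> needs a numeric 'id'")
    return problems


# Hypothetical sample document, for illustration only.
sample = """<nlm>
  <relation name="NS" frequency="42">
    <source id="1" lang="eng">United States</source>
    <target id="2" lang="eng">the</target>
  </relation>
</nlm>"""
print(check_nlm(sample))  # prints []
```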

Simplified format

NL Memory entries in simplified format must have the structure of network disambiguation rules, as follows:

RELATION(SOURCE;TARGET)=DC;

Where:
RELATION is the name of a syntactic relation ("NA", "NC", "NS", etc.);
SOURCE is the source node of the syntactic relation, and the corresponding attributes, if necessary;
TARGET is the target node of the syntactic relation, and the corresponding attributes, if necessary;
DC is the degree of certainty (i.e., the likelihood of the relation between the SOURCE and the TARGET), ranging from 0 (impossible) to 255 (necessary).
The SOURCE and the TARGET nodes may be expressed as:

  • constants (i.e., specific natural language words), represented between square brackets, if lemmas, or between quotes, if strings: [United States] and "United States"
  • a feature (attribute, value, or attribute-value pair) or a set of features shared by a group of natural language words: LEX=NOU, GEN=MCL, etc.

Examples

NS([United States];[the])=255; (The lemma [United States] requires the specifier [the].)
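
A rule in the simplified format can be split into its parts with a regular expression. The sketch below is one possible reading of the format, not a reference parser: the names parse_rule and classify are hypothetical, the two sample rules are invented for illustration, and nodes that themselves contain a semicolon are not handled.

```python
import re

# RELATION(SOURCE;TARGET)=DC;
RULE = re.compile(
    r"^\s*(?P<relation>\w+)\((?P<source>[^;]*);(?P<target>[^;]*)\)"
    r"\s*=\s*(?P<dc>[0-9]+)\s*;\s*$"
)


def classify(node):
    """Tell lemmas ([...]), strings ("...") and features apart."""
    node = node.strip()
    if node.startswith("[") and node.endswith("]"):
        return ("lemma", node[1:-1])
    if len(node) >= 2 and node.startswith('"') and node.endswith('"'):
        return ("string", node[1:-1])
    return ("feature", node)


def parse_rule(line):
    """Parse one simplified-format rule into a dict."""
    match = RULE.match(line)
    if match is None:
        raise ValueError(f"not a valid rule: {line!r}")
    dc = int(match.group("dc"))
    if not 0 <= dc <= 255:
        raise ValueError(f"degree of certainty out of range: {dc}")
    return {
        "relation": match.group("relation"),
        "source": classify(match.group("source")),
        "target": classify(match.group("target")),
        "dc": dc,
    }


# Hypothetical rules, for illustration only.
print(parse_rule("NS(LEX=NOU;[the])=200;"))
print(parse_rule('NA("United States";GEN=MCL)=0;'))
```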

Software