NL Memory
The NL Memory is a list of syntactic (subcategorization) frames holding between natural language words or terms that co-occur more often than would be expected by chance. These frames are used to represent collocations, i.e., partly or fully fixed expressions that become established through repeated context-dependent use.
The NL Memory may be provided in two different formats:
- Extended, in XML; or
- Simplified, as a set of network disambiguation rules
Extended format
NL Memory entries in extended format must have the following structure:
<relation name="RNAME" frequency="RFREQ">
  <source id="SID" attribute="ATT" lang="<LID>">SOURCE</source>
  <target id="TID" attribute="ATT" lang="<LID>">TARGET</target>
</relation>
Where:
RNAME is the name of a syntactic relation ("NA", "NC", "NS", etc.);
RFREQ is the frequency of the relation RNAME between the SOURCE and the TARGET in the corpus;
SID is a number used to identify the SOURCE;
TID is a number used to identify the TARGET;
ATT is a set of attribute-value pairs that apply to the SOURCE or to the TARGET ("POS=NOU", "GEN=NEU", etc.);
SOURCE is the source node of the syntactic relation;
TARGET is the target node of the syntactic relation;
<LID> is the ISO 639-2 three-character code for the language.
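As an illustration of the structure above, the following minimal sketch reads one entry with Python's standard xml.etree.ElementTree module. The sample entry is hypothetical: the frequency, the identifiers and the "POS=ART" value are invented for the example, and the function name parse_relation is not part of any NL Memory tool. The ATT string is kept as-is, since the separator between attribute-value pairs is not specified here.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: all values below are invented for illustration only.
ENTRY = """
<relation name="NS" frequency="42">
  <source id="1" attribute="POS=NOU" lang="eng">United States</source>
  <target id="2" attribute="POS=ART" lang="eng">the</target>
</relation>
"""


def parse_relation(xml_text):
    """Turn one <relation> entry into a plain dictionary."""
    rel = ET.fromstring(xml_text.strip())

    def node(tag):
        # <source> and <target> share the same attributes.
        el = rel.find(tag)
        return {
            "id": int(el.get("id")),
            "attribute": el.get("attribute"),  # raw ATT string, may be None
            "lang": el.get("lang"),            # may be None
            "text": (el.text or "").strip(),
        }

    freq = rel.get("frequency")
    return {
        "name": rel.get("name"),
        "frequency": int(freq) if freq is not None else None,
        "source": node("source"),
        "target": node("target"),
    }


entry = parse_relation(ENTRY)
print(entry["name"], entry["source"]["text"], entry["target"]["text"])
```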
XML Schema
<?xml version="1.0" encoding="utf-16"?>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="nlm">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element maxOccurs="unbounded" name="relation">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="source">
                <xsd:complexType>
                  <xsd:attribute name="id" type="xsd:unsignedLong" use="required" />
                  <xsd:attribute name="attribute" type="xsd:string" use="optional" />
                  <xsd:attribute name="lang" type="xsd:string" use="optional" />
                </xsd:complexType>
              </xsd:element>
              <xsd:element name="target">
                <xsd:complexType>
                  <xsd:attribute name="id" type="xsd:unsignedLong" use="required" />
                  <xsd:attribute name="attribute" type="xsd:string" use="optional" />
                  <xsd:attribute name="lang" type="xsd:string" use="optional" />
                </xsd:complexType>
              </xsd:element>
            </xsd:sequence>
            <xsd:attribute name="name" type="xsd:string" use="required" />
            <xsd:attribute name="frequency" type="xsd:int" use="optional" />
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
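The structural constraints stated by the schema (an nlm root, one or more relation elements, a required name, an optional integer frequency, and a required unsigned id on source and target) can be approximated with the Python standard library alone. The sketch below is not a schema validator and does not replace one; the function names and the sample document are hypothetical, and the text content of source and target is not checked.

```python
import re
import xml.etree.ElementTree as ET

UNSIGNED_LONG_MAX = 2**64 - 1          # largest xsd:unsignedLong
INT_MIN, INT_MAX = -2**31, 2**31 - 1   # range of xsd:int


def check_relation(rel):
    """Return a list of problems found in one <relation> element."""
    problems = []
    if rel.get("name") is None:
        problems.append("relation: missing required attribute 'name'")
    freq = rel.get("frequency")
    if freq is not None:
        if (not re.fullmatch(r"[+-]?[0-9]+", freq.strip())
                or not INT_MIN <= int(freq) <= INT_MAX):
            problems.append("relation: 'frequency' is not an xsd:int")
    # The schema's sequence allows exactly one <source>, then one <target>.
    if [child.tag for child in rel] != ["source", "target"]:
        problems.append("relation: expected one <source> followed by one <target>")
    for child in rel:
        if child.tag in ("source", "target"):
            raw = child.get("id")
            if raw is None:
                problems.append(f"{child.tag}: missing required attribute 'id'")
            elif (not re.fullmatch(r"[0-9]+", raw.strip())
                    or int(raw) > UNSIGNED_LONG_MAX):
                problems.append(f"{child.tag}: 'id' is not an xsd:unsignedLong")
    return problems


def check_nlm(xml_text):
    """Return a list of problems found in a whole <nlm> document."""
    root = ET.fromstring(xml_text)
    if root.tag != "nlm":
        return ["root element must be <nlm>"]
    problems = []
    if len(root) == 0:
        problems.append("nlm: at least one <relation> is required")
    for rel in root:
        if rel.tag != "relation":
            problems.append(f"nlm: unexpected element <{rel.tag}>")
        else:
            problems.extend(check_relation(rel))
    return problems


# Hypothetical document: values are invented for illustration only.
SAMPLE = """<nlm>
  <relation name="NS" frequency="42">
    <source id="1" attribute="POS=NOU" lang="eng">United States</source>
    <target id="2" lang="eng">the</target>
  </relation>
</nlm>"""

print(check_nlm(SAMPLE))
```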
Simplified format
NL Memory entries in simplified format must have the structure of network disambiguation rules, as follows:
RELATION(SOURCE;TARGET)=DC;
Where:
RELATION is the name of a syntactic relation ("NA", "NC", "NS", etc.);
SOURCE is the source node of the syntactic relation, and the corresponding attributes, if necessary;
TARGET is the target node of the syntactic relation, and the corresponding attributes, if necessary;
DC is the degree of certainty (i.e., the likelihood of the relation between the SOURCE and the TARGET), ranging from 0 (impossible) to 255 (necessary).
The SOURCE and the TARGET nodes may be expressed as:
- constants (i.e., specific natural language words), written between square brackets if lemmas, or between quotes if strings: [United States] and "United States"; or
- a feature (attribute, value, or attribute-value pair) or a set of features describing a group of natural language words: LEX=NOU, GEN=MCL, etc.
Examples
NS([United States];[the])=1; (The lemma [United States] requires the specifier [the])
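A rule in the simplified format can be split into its four parts with a regular expression. The following is a minimal sketch under stated assumptions: the names Node, Rule, parse_node and parse_rule are hypothetical, each node is assumed to be either a single constant or a feature expression (constants combined with attributes are not handled, since their syntax is not described here), and nodes are assumed not to contain a semicolon.

```python
import re
from typing import NamedTuple

# RELATION(SOURCE;TARGET)=DC;
RULE = re.compile(
    r"^\s*(?P<relation>\w+)\((?P<source>[^;]*);(?P<target>[^;]*)\)"
    r"\s*=\s*(?P<dc>\d+)\s*;\s*$"
)


class Node(NamedTuple):
    kind: str   # "lemma", "string" or "feature"
    value: str


class Rule(NamedTuple):
    relation: str
    source: Node
    target: Node
    dc: int


def parse_node(text):
    """Classify a node as a lemma, a string, or a feature expression."""
    text = text.strip()
    if len(text) >= 2 and text[0] == "[" and text[-1] == "]":
        return Node("lemma", text[1:-1])
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return Node("string", text[1:-1])
    return Node("feature", text)  # e.g. LEX=NOU; kept as raw text


def parse_rule(line):
    """Parse one rule and check that DC lies between 0 and 255."""
    match = RULE.match(line)
    if match is None:
        raise ValueError(f"not a simplified NL Memory rule: {line!r}")
    dc = int(match.group("dc"))
    if not 0 <= dc <= 255:
        raise ValueError(f"degree of certainty out of range: {dc}")
    return Rule(
        match.group("relation"),
        parse_node(match.group("source")),
        parse_node(match.group("target")),
        dc,
    )


rule = parse_rule("NS([United States];[the])=1;")
print(rule.relation, rule.source, rule.target, rule.dc)
```

Rejecting values outside 0 to 255 at parse time keeps malformed rules from reaching later processing; whether a real implementation does the same is an assumption.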