NL Reference Corpus

From UNL Wiki

Revision as of 00:10, 18 September 2012 by Martins (Talk | contribs)


The NL Reference Corpus (NC) is the corpus used to prepare and assess grammars for sentence-based UNLization. It is divided into six levels according to the Framework of Reference for UNL (FRAU):

NC-A1: NL Reference Corpus A1
NC-A2: NL Reference Corpus A2
NC-B1: NL Reference Corpus B1
NC-B2: NL Reference Corpus B2
NC-C1: NL Reference Corpus C1
NC-C2: NL Reference Corpus C2

Methodology

As a natural language corpus, the NC differs from language to language. For each language, it must be derived from a Base Corpus, which is to be processed according to the following criteria:

The Base Corpus must have at least 5,000,000 tokens (any sequence of alphanumeric characters isolated by blank space and other word boundary markers). The Base Corpus must be as representative as possible of the standard use of the written language, and should include documents from different genres and domains.
The Base Corpus must be segmented according to the usual set of sentence boundary markers (punctuation marks and end of paragraph).
The Average Sentence Length will be calculated from the number of segmented sentences, and will be used to differentiate between the three main levels of reference (A, B and C).
All sentences must be
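
The first three criteria can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in this page: tokens are matched with a simple regular expression, sentence boundaries are limited to ".", "!", "?" and line breaks, and the Average Sentence Length is taken to be the mean number of tokens per sentence. All function names are hypothetical.

```python
import re

# Minimum size of the Base Corpus, as stated in the criteria above.
MIN_BASE_CORPUS_TOKENS = 5_000_000


def tokenize(text):
    """Return every maximal run of alphanumeric characters in text."""
    # [^\W_] matches any Unicode letter or digit (a word character
    # other than the underscore).
    return re.findall(r"[^\W_]+", text)


def segment(text):
    """Split text into sentences.

    Assumption: the boundary markers are '.', '!', '?' and line breaks
    (standing in for the end of a paragraph).
    """
    parts = re.split(r"[.!?]+|\n+", text)
    # Discard fragments that contain no token at all.
    return [part.strip() for part in parts if tokenize(part)]


def average_sentence_length(text):
    """Mean number of tokens per sentence (assumed definition)."""
    sentences = segment(text)
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def is_large_enough(text, minimum=MIN_BASE_CORPUS_TOKENS):
    """Check the size requirement for the Base Corpus."""
    return len(tokenize(text)) >= minimum
```

A real Base Corpus would need a more careful segmenter (abbreviations, decimal numbers and similar cases are not handled here); the sketch is only meant to make the definitions concrete.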

The levels of the NC are defined according to the following criteria:

NC-A1 must correspond to exemplars of the 500 most frequent syntactic structures among the shortest ones
NC-A2 must correspond to exemplars of the 1,000 most frequent syntactic structures among the shortest ones
NC-B1 must correspond to exemplars of the 1,500 most frequent syntactic structures among those below the average length
NC-B2 must correspond to exemplars of the 2,000 most frequent syntactic structures among those below the average length
NC-C1 must correspond to exemplars of the 2,500 most frequent syntactic structures among all
NC-C2 must correspond to exemplars of the 3,000 most frequent syntactic structures among all
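
The level criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used to build the NC: each sentence is assumed to arrive already paired with a label for its syntactic structure and with its length in tokens, "shortest" and "below the average length" are read as properties of sentences, and the cut-off for the "shortest" sentences is passed in as a hypothetical parameter (`shortest_max_length`), since this page does not define it.

```python
from collections import Counter

# Level name -> (number of most frequent structures, candidate pool),
# following the six criteria listed above.
LEVELS = {
    "NC-A1": (500, "shortest"),
    "NC-A2": (1000, "shortest"),
    "NC-B1": (1500, "below_average"),
    "NC-B2": (2000, "below_average"),
    "NC-C1": (2500, "all"),
    "NC-C2": (3000, "all"),
}


def select_level(items, level, average_length, shortest_max_length):
    """Return (structure, exemplar) pairs for one level.

    items: iterable of (sentence, structure, length) triples, where
    `structure` is any hashable label for the syntactic structure
    (how it is obtained is outside the scope of this sketch).
    shortest_max_length: hypothetical cut-off for "the shortest ones".
    """
    top_n, pool_name = LEVELS[level]
    items = list(items)
    if pool_name == "shortest":
        pool = [it for it in items if it[2] <= shortest_max_length]
    elif pool_name == "below_average":
        pool = [it for it in items if it[2] < average_length]
    else:
        pool = items

    # Frequency of each syntactic structure within the pool.
    counts = Counter(structure for _, structure, _ in pool)

    # Keep the first sentence seen for each structure as its exemplar.
    exemplars = {}
    for sentence, structure, _ in pool:
        exemplars.setdefault(structure, sentence)

    return [(s, exemplars[s]) for s, _ in counts.most_common(top_n)]
```

For instance, `select_level(items, "NC-B1", average_length, shortest_max_length)` would keep only sentences below the average length, rank their structures by frequency, and return up to 1,500 structures with one exemplar each.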

In the above:

A token is any sequence of alphanumeric characters isolated by blank space and other word boundary markers (such as punctuation marks).

The set of "shortes
