NL Reference Corpus

From UNL Wiki

Revision as of 12:51, 18 September 2012

The NL Reference Corpus (NC) is the corpus used to prepare and to assess grammars for sentence-based UNLization. It is divided into six levels according to the Framework of Reference for UNL (FRAU):

  • NC-A1: NL Reference Corpus A1
  • NC-A2: NL Reference Corpus A2
  • NC-B1: NL Reference Corpus B1
  • NC-B2: NL Reference Corpus B2
  • NC-C1: NL Reference Corpus C1
  • NC-C2: NL Reference Corpus C2

Methodology

As a natural language corpus, the NC varies for each language. It is derived from a base corpus processed according to the following criteria:

  1. The Base Corpus must have at least 5,000,000 tokens (strings delimited by blank spaces or other word boundary markers). It must be representative of the contemporary standard use of the written language, and should include documents from as many different genres and domains as possible.
  2. The Base Corpus must be segmented (in sentences) and tagged for POS.
  3. The segmented corpus is used to calculate the average sentence length (ASL), defined here as the median length (in words) of all sentences.
  4. The tagged corpus is used to extract the syntactic surface structures (SSS), which are sequences of POS.
  5. The average sentence length (ASL) and the syntactic surface structures are used to generate the NC templates, as follows:
    • NC-A1 = the 500 most frequent SSSs with length < ASL/2 (the 500 most frequent shortest syntactic structures)
    • NC-A2 = the 1,000 most frequent SSSs with length < ASL/2 (the 1,000 most frequent shortest syntactic structures)
    • NC-B1 = the 2,000 most frequent SSSs with length < ASL (the 2,000 most frequent short syntactic structures)
    • NC-B2 = the 3,000 most frequent SSSs with length < ASL (the 3,000 most frequent short syntactic structures)
    • NC-C1 = the 4,000 most frequent SSSs, with no length restriction
    • NC-C2 = the 5,000 most frequent SSSs, with no length restriction
  6. The NC templates are used to compile the NC corpora: the training corpora and the testing corpora. Each training corpus consists of 1 exemplar of each SSS and is used to prepare the grammar. Each testing corpus consists of 4 exemplars of each SSS, randomly selected from the Base Corpus.
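
Steps 3 to 6 can be illustrated with a minimal Python sketch. It is not the tooling used to build the NC. It assumes the Base Corpus is already segmented and tagged, with each sentence given as a list of (word, POS) pairs. The function names, the LEVELS table layout, the choice to count every tagged token as a word, and the choice to keep training and testing exemplars disjoint are all assumptions made for illustration; the page does not specify them.

```python
import random
from collections import Counter, defaultdict
from statistics import median

# (level, number of templates, length limit as a fraction of ASL; None = no limit)
LEVELS = [
    ("NC-A1", 500, 0.5),
    ("NC-A2", 1000, 0.5),
    ("NC-B1", 2000, 1.0),
    ("NC-B2", 3000, 1.0),
    ("NC-C1", 4000, None),
    ("NC-C2", 5000, None),
]


def average_sentence_length(tagged_sentences):
    """Step 3: ASL, taken as the median sentence length.

    Assumption: every tagged token counts as one word.
    """
    return median(len(sentence) for sentence in tagged_sentences)


def surface_structure(sentence):
    """Step 4: the SSS of a sentence is its sequence of POS tags."""
    return tuple(pos for _word, pos in sentence)


def build_templates(tagged_sentences, levels=LEVELS):
    """Step 5: pick the most frequent SSSs under each level's length limit."""
    asl = average_sentence_length(tagged_sentences)
    counts = Counter(surface_structure(s) for s in tagged_sentences)
    ranked = [sss for sss, _count in counts.most_common()]
    templates = {}
    for name, size, factor in levels:
        if factor is None:
            eligible = ranked
        else:
            eligible = [sss for sss in ranked if len(sss) < asl * factor]
        templates[name] = eligible[:size]
    return templates


def compile_corpora(tagged_sentences, template, n_test=4, seed=0):
    """Step 6: 1 training exemplar and up to n_test testing exemplars per SSS.

    Assumption: exemplars are drawn at random and the two sets do not overlap.
    """
    rng = random.Random(seed)
    by_sss = defaultdict(list)
    for sentence in tagged_sentences:
        by_sss[surface_structure(sentence)].append(sentence)
    training, testing = [], []
    for sss in template:
        pool = list(by_sss.get(sss, []))
        if not pool:
            continue  # no sentence in this corpus realizes the structure
        rng.shuffle(pool)
        training.append(pool[0])
        testing.extend(pool[1:1 + n_test])
    return training, testing
```

In this sketch an SSS attested fewer than 5 times yields fewer than 4 testing exemplars; how such cases should be handled is not stated in the methodology.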
Software