NL Reference Corpus
From UNL Wiki
Revision as of 20:58, 18 March 2014
The NL Reference Corpus (NC) is the corpus used to prepare and assess grammars for sentence-based UNLization. It is divided into 6 levels according to the Framework of Reference for UNL (FoR-UNL):
- NC-A1: NL Reference Corpus A1
- NC-A2: NL Reference Corpus A2
- NC-B1: NL Reference Corpus B1
- NC-B2: NL Reference Corpus B2
- NC-C1: NL Reference Corpus C1
- NC-C2: NL Reference Corpus C2
Methodology
As a natural language corpus, the NC differs from language to language. It is derived from a Base Corpus, which is compiled and processed according to the following criteria:
- The Base Corpus must have at least 5,000,000 tokens (strings isolated by blank space and other word boundary markers). It must be representative of the contemporary standard use of the written language, and should include documents from as many different genres and domains as possible.
- The Base Corpus must be segmented into sentences and tagged for part of speech (POS).
- The segmented corpus is used to calculate the average sentence length (ASL), defined here as the median length (in words) of all sentences.
- The tagged corpus is used to extract the linear sentence structures (LSS), which are sequences of POS, and to calculate their frequency of occurrence.
- The average sentence length (ASL) and the linear sentence structures (LSS) are used to generate the NC templates, as follows:
  - NC-A1 = (ASL*0.0) < length <= (ASL*0.5)
  - NC-A2 = (ASL*0.5) < length <= (ASL*1.0)
  - NC-B1 = (ASL*1.0) < length <= (ASL*1.5)
  - NC-B2 = (ASL*1.5) < length <= (ASL*2.0)
  - NC-C1 = (ASL*2.0) < length <= (ASL*2.5)
  - NC-C2 = (ASL*2.5) < length <= (ASL*3.0)
- The NC templates are used to compile the NC corpora: the training corpora and the testing corpora. The training corpora consist of 1 exemplar of each LSS and are used to prepare the grammar. The testing corpora consist of 4 exemplars of each LSS, randomly selected from the Base Corpus. The whole NC corpora (i.e., 5 exemplars of each LSS) are used to calculate the F-measure, which is the parameter for assessing the precision and recall of the grammars.
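The steps above can be sketched in Python. This is only an illustrative sketch, not the tooling actually used for the NC: the function names, the toy corpus and its POS tags are hypothetical, sentences are assumed to be lists of (word, tag) pairs, and the F-measure is assumed to be the balanced F1, which the page does not specify.

```python
from collections import Counter
from statistics import median

LEVELS = ["NC-A1", "NC-A2", "NC-B1", "NC-B2", "NC-C1", "NC-C2"]


def average_sentence_length(tagged_sentences):
    """ASL as defined above: the median sentence length, in words."""
    return median(len(sentence) for sentence in tagged_sentences)


def nc_level(length, asl):
    """Return the NC level whose band contains `length`, or None if no band does."""
    for k, name in enumerate(LEVELS):
        # Band k covers the interval (ASL*0.5*k, ASL*0.5*(k+1)].
        if asl * 0.5 * k < length <= asl * 0.5 * (k + 1):
            return name
    return None


def lss_frequencies(tagged_sentences):
    """Count each linear sentence structure (the sequence of POS tags)."""
    return Counter(
        tuple(tag for _, tag in sentence) for sentence in tagged_sentences
    )


def f_measure(precision, recall):
    """Balanced F-measure (F1); the balanced weighting is an assumption."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus with made-up tags, for illustration only.
toy_corpus = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("birds", "NOUN"), ("sing", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("man", "NOUN"),
     ("reads", "VERB"), ("books", "NOUN")],
]

asl = average_sentence_length(toy_corpus)
for sentence in toy_corpus:
    print(len(sentence), nc_level(len(sentence), asl))
print(lss_frequencies(toy_corpus).most_common(1))
```

Sentences longer than ASL*3.0 fall outside every band, so nc_level returns None for them.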
Files
- Arabic
  - Source: Wikipedia
  - Total number of distinct sentences: 801,258
  - ASL = 16
  - Corpus
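As a worked example, the length bands of the NC templates can be computed from the ASL reported for Arabic. The snippet below is a minimal sketch; level_bands is a hypothetical helper name, not part of any UNL tool.

```python
LEVELS = ["NC-A1", "NC-A2", "NC-B1", "NC-B2", "NC-C1", "NC-C2"]


def level_bands(asl):
    """Map each NC level to its (lower, upper] sentence-length bounds."""
    return {
        name: (asl * 0.5 * k, asl * 0.5 * (k + 1))
        for k, name in enumerate(LEVELS)
    }


ASL = 16  # value listed above for the Arabic corpus
for name, (low, high) in level_bands(ASL).items():
    print(f"{name}: {low:g} < length <= {high:g}")
```

With ASL = 16, NC-A1 covers sentences of 1 to 8 words, each further level adds another 8 words, and NC-C2 ends at 48 words.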