BRUNO
From UNL Wiki
Revision as of 16:41, 24 September 2013
The project BRUNO (Basic Resources for UNLizatiOn) is devoted to the creation of NL-UNL (analysis) dictionaries.
Goal
The project BRUNO has two main goals:
- To provide several word-to-concept monolingual databases (i.e., encoding or reader's dictionaries). These dictionaries are expected to be used in UNLization, i.e., in generating UNL graphs out of natural language documents, especially through IAN.
- To find concepts that are not included in WordNet 3.0 and should be incorporated into the UNL Dictionary.
Repository
BRUNO is language-dependent: every language has its own set of entries to be addressed. The repository is divided into six subprojects according to the frequency of use of the lemmas.
- BRUNO-A1 contains the list of the 2,000 most frequent lemmas of the language (including articles, prepositions, conjunctions, auxiliary verbs, etc.);
- BRUNO-A2 contains the next 3,000 most frequent lemmas of the language;
- BRUNO-B1 contains the next 5,000 most frequent lemmas of the language;
And so on: BRUNO-B2, BRUNO-C1 and BRUNO-C2 each contain the next 5,000 most frequent lemmas of the language.
Repository | # of lemmas |
---|---|
BRUNO-A1 | 2,000 |
BRUNO-A2 | 3,000 |
BRUNO-B1 | 5,000 |
BRUNO-B2 | 5,000 |
BRUNO-C1 | 5,000 |
BRUNO-C2 | 5,000 |
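The sizes in the table imply a range of frequency ranks for each subproject. The following is a minimal sketch of that arithmetic, assuming ranks are 1-based and that each subproject starts right after the previous one ends; the function names are hypothetical and are not part of the UNLarium.

```python
# Subproject sizes, copied from the table above.
SUBPROJECT_SIZES = [
    ("BRUNO-A1", 2000),
    ("BRUNO-A2", 3000),
    ("BRUNO-B1", 5000),
    ("BRUNO-B2", 5000),
    ("BRUNO-C1", 5000),
    ("BRUNO-C2", 5000),
]


def rank_ranges():
    """Return {subproject: (first_rank, last_rank)}, 1-based and inclusive."""
    ranges = {}
    start = 1
    for name, size in SUBPROJECT_SIZES:
        ranges[name] = (start, start + size - 1)
        start += size
    return ranges


def subproject_for_rank(rank):
    """Return the subproject covering a frequency rank, or None if out of range."""
    for name, (first, last) in rank_ranges().items():
        if first <= rank <= last:
            return name
    return None
```

Under these assumptions, `subproject_for_rank(2001)` returns `"BRUNO-A2"`, and the six subprojects together cover ranks 1 to 25,000.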
Requisites
The project BRUNO is open to all languages complying with the following requisites:
- MIR-A1 and NADIA-A1 are required for BRUNO-A1;
- MIR-A2 and NADIA-A2 are required for BRUNO-A2;
- MIR-B1 and NADIA-B1 are required for BRUNO-B1;
- MIR-B2 and NADIA-B2 are required for BRUNO-B2;
- MIR-C1 and NADIA-C1 are required for BRUNO-C1;
- MIR-C2 and NADIA-C2 are required for BRUNO-C2;
- In all cases, the language must have a reasonable number of inflectional paradigms and subcategorization frames already registered in the UNLarium.
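The level-by-level requisites follow a regular pattern: each BRUNO subproject requires the MIR and NADIA projects of the same level. The sketch below encodes only that pattern; the function names and the idea of passing a set of completed projects are assumptions made for illustration, and the last requisite (a reasonable number of paradigms and frames) is not modeled because the text gives no threshold for it.

```python
LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


def required_resources(subproject):
    """Return the MIR and NADIA projects required for a BRUNO subproject."""
    prefix, sep, level = subproject.partition("-")
    if prefix != "BRUNO" or not sep or level not in LEVELS:
        raise ValueError(f"unknown BRUNO subproject: {subproject!r}")
    return (f"MIR-{level}", f"NADIA-{level}")


def missing_resources(subproject, completed):
    """List the required projects that are not in the `completed` collection."""
    return [name for name in required_resources(subproject) if name not in completed]
```

For example, `missing_resources("BRUNO-A2", {"MIR-A2"})` returns `["NADIA-A2"]`.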
Methodology
- List of entries
  - Participants are expected to provide a list of entries according to the following criteria:
    - The list of entries can be extracted from prestigious monolingual dictionaries or from a corpus considered to be representative of the standard written language[1].
    - The list of entries must be ordered by frequency of occurrence (the most frequent entries must come first)[2].
    - The list of entries must be lemmatized[3].
    - Entries must be provided in a plain text file (.txt) with UTF-8 encoding, with one entry per line, each followed by the corresponding value of the lexical category LEX, in the following format:
      - lemma:LEX[4]
- Verification
  - The list of entries is verified by a language manager or, if there is no language manager for the target language, by the Language Resources Manager of the UNDL Foundation. If approved, it is uploaded to the UNLarium, and the corresponding BRUNO project is opened.
- Dictionary
  - Entries become available in the UNLarium to all registered users of a given language, in the case of open projects, or to the approved candidates, in the case of closed projects. Users are expected to provide all the morphological, syntactic and semantic information for each entry.
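Before a list is submitted for verification, its format can be checked mechanically. The following is a rough sketch of such a check for the `lemma:LEX` format described above. It is not the UNLarium's own validator: the function names are hypothetical, the LEX value is treated as an opaque non-empty string, blank lines are assumed to be ignorable, and the line is split at its last colon on the assumption that the LEX value itself contains none.

```python
def parse_entry_line(line, line_number):
    """Split one 'lemma:LEX' line into a (lemma, lex) pair."""
    # Split at the last colon (assumption: the LEX value contains no colon).
    lemma, sep, lex = line.rpartition(":")
    lemma, lex = lemma.strip(), lex.strip()
    if not sep or not lemma or not lex:
        raise ValueError(f"line {line_number}: expected 'lemma:LEX', got {line!r}")
    return lemma, lex


def parse_entries(text):
    """Parse the whole list, one entry per line, keeping the frequency order."""
    entries = []
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            continue  # assumption: blank lines are ignored
        entries.append(parse_entry_line(raw, number))
    return entries


def load_entries(path):
    """Read a UTF-8 plain text file and parse its entries."""
    with open(path, encoding="utf-8") as handle:
        return parse_entries(handle.read())
```

With placeholder tags `N` and `V` (used here only for illustration, not as confirmed LEX values), `parse_entries("book:N\nbook:V\n")` returns `[("book", "N"), ("book", "V")]`, and a line without a colon raises `ValueError` with its line number.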
Notes
- ↑ This corpus can be either an existing reputable corpus or a new corpus compiled according to the criteria defined at NC.
- ↑ The frequency of use is seldom indicated in ordinary dictionaries, but it may be inferred from the several distributions of the same dictionary (basic, intermediate or advanced, for instance).
- ↑ There should be as many lemmas as there are different morphological behaviors (part of speech, gender, number, inflections, etc.). The word "book", in English, should correspond to two lemmas: "book" as a noun, and "book" as a verb. Note that the many different meanings of "book" as a noun do not lead to different lemmas, because all of them have the same morphological behavior, i.e., they are singular and form the plural in -s. On the other hand, the noun "livre", in French, should correspond to two lemmas: "livre" as a masculine noun (= "book"), and "livre" as a feminine noun (= "pound"). This difference does not derive from the different meanings, but from the different morphological behavior: one is masculine and the other is feminine.
- ↑ See an example at [1]
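The lemmatization rule in the third note (one lemma per distinct morphological behavior, regardless of meaning) can be illustrated with a small sketch. The `Lemma` class, its fields, and the sample glosses are hypothetical; real entries carry more morphological detail (number, inflectional paradigm, and so on) than the three fields used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lemma:
    """A lemma identified by its form and morphological behavior only."""
    form: str
    pos: str
    gender: str = ""


def distinct_lemmas(senses):
    """Collapse (form, pos, gender, gloss) senses into distinct lemmas.

    The gloss is ignored on purpose: different meanings with the same
    morphological behavior map to the same lemma.
    """
    lemmas = (Lemma(form, pos, gender) for form, pos, gender, _gloss in senses)
    return list(dict.fromkeys(lemmas))  # de-duplicate, keeping first-seen order


# Illustrative data based on the examples in the note.
english = [
    ("book", "noun", "", "bound volume"),
    ("book", "noun", "", "record of accounts"),
    ("book", "verb", "", "to reserve"),
]
french = [
    ("livre", "noun", "masculine", "book"),
    ("livre", "noun", "feminine", "pound"),
]
```

Here `distinct_lemmas(english)` yields two lemmas (the noun and the verb), and `distinct_lemmas(french)` also yields two, because the genders differ.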