LACEhpc
Latest revision as of 12:05, 16 September 2013
The project LACEhpc is part of the project LACE and aims to design and implement efficient high-performance computing methods for extracting monolingual and multilingual resources from comparable non-parallel corpora.
Goals
The project LACEhpc is divided into four main tasks:
- extracting n-grams from monolingual corpora;
- aligning n-grams in bilingual corpora;
- building monolingual and multilingual language models;
- minimizing and indexing the resulting databases for use in the UNL framework.
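The second task, aligning n-grams across languages, can be illustrated with a minimal sketch that scores candidate translation pairs by how often they co-occur in document-aligned article pairs, here using the Dice coefficient. The scoring measure and all names are illustrative assumptions; the source does not specify the project's actual alignment algorithm.

```python
from collections import Counter

def dice_score(cooc, freq_src, freq_tgt):
    """Dice association between a source and a target n-gram."""
    return 2.0 * cooc / (freq_src + freq_tgt)

def align_ngrams(doc_pairs):
    """doc_pairs: one (source n-gram set, target n-gram set) tuple per
    document-aligned article pair. Returns Dice scores for every
    co-occurring source/target candidate pair."""
    f_src, f_tgt, f_both = Counter(), Counter(), Counter()
    for src, tgt in doc_pairs:
        f_src.update(src)                                 # document frequency, source side
        f_tgt.update(tgt)                                 # document frequency, target side
        f_both.update((s, t) for s in src for t in tgt)   # co-occurrence in the same pair
    return {(s, t): dice_score(c, f_src[s], f_tgt[t])
            for (s, t), c in f_both.items()}
```

With more aligned documents, pairs that consistently co-occur (likely translations) approach a score of 1.0, while incidental pairings are diluted.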
The proposal includes the adaptation and implementation of existing algorithms; the evaluation, revision and optimization of extraction and alignment methods; and studies on the sustainability of the resulting techniques, especially their scalability and portability.
In addition to HPC-oriented algorithms, the project is expected to deliver several different monolingual and bilingual databases, as well as aligned corpora and translation memories, which are important assets for natural language processing and fundamental resources for research in Linguistics and Computational Linguistics.
Corpus
To extract the data, we have proposed using Wikipedia as our corpus.
The choice of Wikipedia rests on five main reasons:
- Relevance: Wikipedia is one of the largest reference web sites, attracting nearly 68 million visitors monthly;
- Multilinguality: Wikipedia comprises more than 15,000,000 articles in more than 270 languages, many of which are inter-related and may be used to constitute a document-aligned multilingual comparable (non-parallel) corpus;
- Comprehensiveness: Wikipedia is not constrained in domain;
- Openness: Wikipedia texts are available under the Creative Commons Attribution-Share Alike License, which would avoid copyright issues concerning the distribution and use of the derived material;
- Accessibility: Wikipedia is easily and freely downloadable.
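The document-level alignment that makes Wikipedia usable as a comparable corpus can be sketched as follows, assuming (hypothetically) that interlanguage-link data is already available as a mapping from article titles to their per-language counterparts. The data layout and function names are illustrative, not the project's actual pipeline.

```python
def align_documents(interlanguage_links, languages):
    """Keep only articles with a counterpart in every requested language."""
    aligned = []
    for translations in interlanguage_links.values():
        if all(lang in translations for lang in languages):
            aligned.append({lang: translations[lang] for lang in languages})
    return aligned

# Illustrative interlanguage-link data (not real dump contents):
links = {
    "Computer": {"en": "Computer", "fr": "Ordinateur", "ja": "コンピュータ"},
    "Cheese": {"en": "Cheese", "fr": "Fromage"},  # no Japanese counterpart
}
print(align_documents(links, ["en", "fr", "ja"]))
# only the "Computer" entry survives
```

Requiring a counterpart in every language is what shrinks the experimental corpus to articles shared by English, French and Japanese.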
The raw corpus is available in two distributions at http://cadmos.undlfoundation.ch:8080/corpus/:
- The experimental corpus contains 10K documents from 3 languages (English, French and Japanese) aligned at the document level.
- The abridged corpus contains 100K documents from 10 languages (Dutch, English, French, German, Italian, Japanese, Polish, Portuguese, Russian and Spanish) aligned at the document level.
N-grams
main article: N-gram
The n-grams are presented in two different sets: continuous n-grams and discontinuous n-grams. Each set is further organized in four different subsets:
- 0. raw data (n-grams extracted from the corpus)
- 1. frequency filtered (n-grams whose frequency is equal to or higher than the token/type ratio over all n-grams in the corpus)
- 2. redundancy filtered (frequency-filtered n-grams that cannot be subsumed by any other existing frequency-filtered n-gram)
- 3. constituency scores (the results of applying constituency scores to the redundancy-filtered n-grams)
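As a rough illustration of these subsets, the sketch below extracts continuous and discontinuous n-grams and applies the frequency and redundancy filters. The reading of "subsumed" (a shorter n-gram contained in a longer kept n-gram that is at least as frequent) is an assumption, as are all names; constituency scoring (subset 3) is omitted.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Subset 0: raw continuous n-grams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_discontinuous(tokens, gap):
    """Discontinuous bigrams: token pairs separated by exactly `gap` tokens."""
    return [(tokens[i], tokens[i + gap + 1])
            for i in range(len(tokens) - gap - 1)]

def frequency_filter(counts):
    """Subset 1: keep n-grams whose frequency is at least the
    token/type ratio (total occurrences / distinct n-grams)."""
    threshold = sum(counts.values()) / len(counts)
    return {g: c for g, c in counts.items() if c >= threshold}

def redundancy_filter(kept):
    """Subset 2: drop n-grams subsumed by a longer kept n-gram that is
    at least as frequent (one plausible reading of 'subsumed')."""
    def contained(g, h):
        return len(g) < len(h) and any(
            h[i:i + len(g)] == g for i in range(len(h) - len(g) + 1))
    return {g: c for g, c in kept.items()
            if not any(contained(g, h) and kept[h] >= c for h in kept)}
```

Subset 3 would then rank the surviving n-grams with an association measure; which measure the project applies is not stated here.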
The latest release of the n-grams extracted in the project LACEhpc may be downloaded from http://cadmos.undlfoundation.ch:8080/n-grams/.
Anchors
main article: Anchor
MWE
main article: MWE
Participants
The project LACEhpc has been developed by the UNDL Foundation in collaboration with the Centre for Advanced Modelling Science (CADMOS), which includes researchers from the University of Geneva (UNIGE) and from the École Polytechnique Fédérale de Lausanne (EPFL).
- Project Managers
- Bastien CHOPARD (CADMOS)
- Gilles FALQUET (UNIGE)
- Ronaldo MARTINS (UNDL Foundation)
- Participants
- Kamal CHICK ECHIOUK (UNDL Foundation)
- Meghdad FAHRAMAND (PhD student at UNIGE)
- Jean-Luc FALCONE (UNIGE)
- Jacques GUYOT (Simple Shift)
Files
- Original project: http://www.undlfoundation.org/lace/docs/LACEhpc.pdf
Support
The LACEhpc project is supported by a grant from the Hans Wilsdorf Foundation.