LACE

From UNL Wiki
(Difference between revisions)
(Files)
 
(10 intermediate revisions by one user not shown)
Line 6: Line 6:
 
The project LACE aims at compiling, replicating and extending techniques that have been widely used in statistical natural language processing, and evaluating their results in UNL-based applications. As a long-term enterprise, the project has been divided into three subsidiary projects, devoted to three different types of corpus and therefore involving three different extraction strategies:
 
*LACE<sup>pc</sup> - To extract data from parallel corpora (proceedings from the United Nations and from the European Parliament);
*LACE<sup>hpc</sup> - To extract data from comparable semi-parallel corpora (Wikipedia) using high-performance computing; and
+
*[[LACEhpc|LACE<sup>hpc</sup>]] - To extract data from comparable semi-parallel corpora (Wikipedia) using high-performance computing; and
 
*LACE<sup>npc</sup> - To extract data from comparable non-parallel corpora (newspapers) using linguistically-motivated models of automatic language acquisition.
 
== LACE<sup>hpc</sup> ==
 
The project LACE<sup>hpc</sup> aims at designing and implementing efficient high-performance computing methods for extracting monolingual and multilingual resources from comparable non-parallel corpora.
 
 
=== Methodology ===
 
The project LACE<sup>hpc</sup> is divided into four main tasks:
 
*extracting n-grams from monolingual corpora;
 
*aligning n-grams in bilingual corpora;
 
*building monolingual and multilingual language models;
 
*minimizing and indexing the resulting databases for use in the UNL framework.
 
The proposal includes the adaptation and implementation of existing algorithms; the evaluation, revision and optimization of extraction and alignment methods; and studies of the sustainability of the resulting techniques, especially their scalability and portability. In addition to HPC-oriented algorithms, the project is expected to deliver several monolingual and bilingual databases, as well as aligned corpora and translation memories, which are important assets for natural language processing and fundamental resources for research in Linguistics and Computational Linguistics.
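
As an illustration of the first task (extracting n-grams from a monolingual corpus), the following is a minimal single-machine sketch in Python. It is not the project's implementation: the whitespace tokenizer, the function names (<code>ngrams</code>, <code>count_ngrams</code>, <code>filter_by_frequency</code>), the thresholds and the toy corpus are all assumptions made for the example, and a high-performance version would distribute the counting over many nodes.

```python
from collections import Counter
from typing import Iterable, Iterator


def ngrams(tokens: list[str], n: int) -> Iterator[tuple[str, ...]]:
    """Yield every contiguous n-gram of a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(sentences: Iterable[str], max_n: int = 3) -> Counter:
    """Count all n-grams of length 1..max_n over a set of sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Naive lowercase/whitespace tokenization (an assumption of this sketch).
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


def filter_by_frequency(counts: Counter, min_count: int = 2) -> dict:
    """Keep only the n-grams seen at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus, purely illustrative.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
counts = count_ngrams(corpus, max_n=2)
frequent = filter_by_frequency(counts, min_count=2)
```

On real data the n-gram tables grow quickly with <code>max_n</code> and corpus size, which is why a frequency threshold of this kind is typically applied before any later alignment or indexing step.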
 
 
=== Participants ===
 
The project LACE<sup>hpc</sup> has been developed by the UNDL Foundation in collaboration with the Centre for Advanced Modelling Science (CADMOS), which includes researchers from the University of Geneva (UNIGE) and from the École Polytechnique Fédérale de Lausanne (EPFL).
 
*Project Managers
 
**Bastien CHOPARD (CADMOS)
 
**Gilles FALQUET (UNIGE)
 
**Ronaldo MARTINS (UNDL Foundation)
 
**Martin RAJMAN (EPFL)
 
*Participants
 
**Kamal CHICK ECHIOUK (UNDL Foundation)
 
**Meghdad FAHRAMAND (PhD student at UNIGE)
 
**Jean-Luc FALCONE (UNIGE)
 
**Jacques GUYOT (Simple Shift)
 
 
=== Files ===
 
*Documents
 
**[http://www.undlfoundation.org/lace/docs/LACEhpc.pdf Original project]
 
**Partial report
 
*Corpus (Wikipedia)
 
**[http://www.undlfoundation.ch/cadmos/corpus/wikipedia_exp_version.rar Experimental] (10K documents in 3 languages = 30K documents in total)
 
**Abridged
 
***[http://www.undlfoundation.ch/cadmos/corpus/wikipedia_abr_version.rar Aligned at the document level] (100K documents in 10 languages = 1,000K documents in total)
 
***[http://www.undlfoundation.ch/cadmos/corpus/wikipedia_ali_version.rar Aligned at the sentence level] (French-English only)
 
**Unabridged (the whole Wikipedia)
 
*N-grams
 
**[http://www.undlfoundation.ch/cadmos/ngrams/raw/ Raw data]
 
**[http://www.undlfoundation.ch/cadmos/ngrams/freq/ Filtered for frequency]
 
**[http://www.undlfoundation.ch/cadmos/ngrams/red/ Filtered for redundancy]
 
**[http://www.undlfoundation.ch/cadmos/ngrams/const/ Filtered for constituency]
 

Latest revision as of 14:12, 11 July 2013

The main goal of the project LACE (Language Acquisition from Comparable tExts) is to build language modules out of data automatically extracted from comparable corpora. The results are expected to be incorporated in the architecture of UNL-based systems as supplementary resources for natural language disambiguation, both in analysis and generation, and will be used for improving the performance of applications in machine translation, summarization, information retrieval and semantic reasoning.

Motivation

UNL-based systems have been built on lexical resources compiled largely by hand, mainly because current word sense disambiguation technology has not yet reached the level of maturity that would make human intervention unnecessary. The increasing availability of natural language data in digital format nevertheless encourages the exploration of strategies for extracting supplementary lexical information from comparable corpora, which could extend the coverage of the current resources and may ultimately provide a less expensive alternative for populating lexical databases in the UNL framework.

The project LACE aims at compiling, replicating and extending techniques that have been widely used in statistical natural language processing, and evaluating their results in UNL-based applications. As a long-term enterprise, the project has been divided into three subsidiary projects, devoted to three different types of corpus and therefore involving three different extraction strategies:

  • LACEpc - To extract data from parallel corpora (proceedings from the United Nations and from the European Parliament);
  • LACEhpc - To extract data from comparable semi-parallel corpora (Wikipedia) using high-performance computing; and
  • LACEnpc - To extract data from comparable non-parallel corpora (newspapers) using linguistically-motivated models of automatic language acquisition.
Software