Localization

From UNL Wiki

Revision as of 15:48, 29 July 2012

Localization is the process of adapting dictionaries and grammars to a specific language, referred to as the "locale". In the UNL framework, localizing existing resources is one strategy for creating the basic dictionaries and grammars required to process simple corpora, such as Corpus500. What follows are instructions for localizing the English dictionary and the English grammar.

Localization of the English Dictionary

In order to localize the English Dictionary, you have to consider the following:

  • The localization affects only the natural language entry (the field in square brackets), the feature list (the field in parentheses) and, as the examples show, the language code. Do not localize the "UW" field: although written in English, it is not English; it is UNL.
    For instance: given the entry
    [book]{}"book" (LEX=N,POS=NOU,NUM=SNG)<eng,0,0>;
    The localization of this entry to French, Spanish, Portuguese, German and Russian should be as follows:
    [livre]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=MCL)<fra,0,0>;
    [libro]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=MCL)<esp,0,0>;
    [livro]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=MCL)<por,0,0>;
    [Buch]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=NEU)<deu,0,0>;
    [книга]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=FEM,CAS=NOM)<rus,0,0>;
  • The feature list depends on the locale
    In English, for instance, there is no gender information, because English has no grammatical gender. This is not the case in French, where gender must be provided. In Russian, in addition to gender, case must be added as well, because Russian has morphological case. You have to add to the feature list the inflectional categories available in your language. Note, however, that you may only use the tags available in the Tagset.
  • Values of attributes depend on the locale
    Note also that the value of an attribute may vary from language to language. In Spanish, the natural language word corresponding to "book" is masculine (GEN=MCL); in German, it is neuter (GEN=NEU); in Russian, it is feminine (GEN=FEM).
  • The localization should be corpus-driven: your dictionary should contain all and only the words appearing in your translated version of the corpus.
    The word "book" may have several different meanings, and may be translated by different words in your locale. You don't have to consider all these possibilities: address only the use that the word actually has in the corpus. Likewise, you don't have to treat forms that do not appear in the corpus. In the English Dictionary, for instance, there are both "book" and "books", because both forms appear in the corpus; but there is only the singular "girl", because the plural "girls" does not appear in the corpus. In short: your dictionary must reflect your corpus. You don't have to include words or word forms that are not there. You are not creating a generic dictionary, but a rather small, corpus-driven one, because the main goal here is not the dictionary, but the grammars.
  • The set of natural language entries depends on the locale:
    In English, for instance, we may have different modal auxiliaries for ability ("can") and possibility ("may"). In French, there is only one: "pouvoir". On the other hand, in French there are three different definite articles ("le","la","les"), whereas in English there is only one ("the"). These differences do affect the dictionary structure:
    English dictionary
    [can]{}"" (LEX=V,POS=AUX,POS=MOV,att=@ability)<eng,255,0>;
    [may]{}"" (LEX=V,POS=AUX,POS=MOV,att=@possibility)<eng,255,0>;
    French dictionary
    [peut]{}"" (LEX=V,POS=AUX,POS=MOV,ATE=PRS,PER=3PS,att=@ability,att=@possibility)<fra,255,0>; (note that "pouvoir", in French, has several other forms, such as "peux", "pouvons" and "pouvez", but they do not appear in the corpus)
    English dictionary
    [the]{}"" (LEX=D,POS=ART,att=@def)<eng,255,0>;
    French dictionary
    [le]{}"" (LEX=D,POS=ART,NUM=SNG,GEN=MCL,att=@def)<fra,255,0>; (note that number and gender, which were not relevant in English, are now required in French)
    [la]{}"" (LEX=D,POS=ART,NUM=SNG,GEN=FEM,att=@def)<fra,255,0>;
    [les]{}"" (LEX=D,POS=ART,NUM=PLR,att=@def)<fra,255,0>;
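
The rules above can be sketched in a few lines of Python. This is only an illustration, not part of the UNL tools: the regular expression is inferred from the entries shown on this page, and the names Entry, parse_entry, format_entry, localize and coverage are hypothetical helpers. The labels given to the two numbers after the language code (freq, prio) are likewise assumptions, and the token "fille" in the usage lines is an invented corpus word.

```python
import re
from dataclasses import dataclass, replace

# Layout inferred from the examples on this page (an assumption):
# [natural language entry]{id}"UW" (feature list)<language,number,number>;
ENTRY_RE = re.compile(
    r'^\[(?P<nlw>[^\]]*)\]\{(?P<id>[^}]*)\}"(?P<uw>[^"]*)"\s*'
    r'\((?P<features>[^)]*)\)<(?P<lang>[^,]*),(?P<freq>\d+),(?P<prio>\d+)>;$'
)

@dataclass(frozen=True)
class Entry:
    nlw: str         # natural language entry: localized
    entry_id: str    # content of {}: empty in all examples here
    uw: str          # UW: written in English but UNL, never localized
    features: tuple  # feature list: localized
    lang: str        # language code, e.g. "eng" or "fra"
    freq: int        # first number (label is an assumption)
    prio: int        # second number (label is an assumption)

def parse_entry(line):
    """Split one dictionary line into its fields."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    feats = tuple(f.strip() for f in m["features"].split(",") if f.strip())
    return Entry(m["nlw"], m["id"], m["uw"], feats,
                 m["lang"], int(m["freq"]), int(m["prio"]))

def format_entry(e):
    """Write an Entry back in the layout used on this page."""
    return (f'[{e.nlw}]{{{e.entry_id}}}"{e.uw}" '
            f'({",".join(e.features)})<{e.lang},{e.freq},{e.prio}>;')

def localize(entry, nlw, lang, extra_features=()):
    """Change the headword and language code, append locale-specific
    features, and leave the UW field untouched."""
    return replace(entry, nlw=nlw, lang=lang,
                   features=entry.features + tuple(extra_features))

def coverage(entries, corpus_tokens):
    """Return (tokens missing from the dictionary, headwords absent
    from the corpus); both sets are empty for a corpus-driven dictionary."""
    headwords = {e.nlw for e in entries}
    tokens = set(corpus_tokens)
    return tokens - headwords, headwords - tokens

english = parse_entry('[book]{}"book" (LEX=N,POS=NOU,NUM=SNG)<eng,0,0>;')
french = localize(english, "livre", "fra", ["GEN=MCL"])
print(format_entry(french))
# prints: [livre]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=MCL)<fra,0,0>;

missing, extra = coverage([french], ["livre", "fille"])
print(missing, extra)
```

Keeping the UW out of the arguments of localize makes the "do not localize the UW" rule hard to break by accident, and the coverage check gives a quick way to confirm that the dictionary contains all and only the words of the translated corpus.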