How to create inflectional paradigms

From UNL Wiki

Revision as of 11:16, 30 January 2014

Inflectional paradigms are sets of rules used to generate inflections from base forms. In the dictionary, we store only the base forms (e.g., "book" and "explain"); the inflections ("book/books", "explain/explains/explained/explaining") are generated through rules. These rules are of the A-rule (affixation rule) type.

Warnings

Before starting, consider the following:

Do not duplicate paradigms.
Before creating a paradigm, check whether it is really necessary, i.e., whether there is no existing paradigm that may be used in order to generate the intended inflections.
Do not create paradigms for a single word.
Paradigms are used to describe the behavior of several words. If the behavior is irregular, i.e., restricted to a single word, it should be described as an inflectional rule instead of an inflectional paradigm. For instance, the plural of the English word "foot" is better generated by an inflectional rule than by an inflectional paradigm. Inflectional rules are not included in the grammar; they are added directly to the corresponding dictionary entry.
Do not include compound forms in your paradigm.
Paradigms must deal only with simple forms, i.e., forms that can be generated by prefixation, infixation or suffixation. In many cases, inflections are also generated by adding auxiliary or supporting words. These compound forms must not be included in the paradigm, but should be handled by the grammar. For instance, in English, the simple present ("explain">"explain"/"explains") is defined inside the paradigm, but the present progressive and the future are not ("explain">"is explaining", "explain">"will explain") because they cannot be formed through suffixation. They require more complex structures and should not be treated as simple string manipulations (note that the negation, for instance, comes between the auxiliary and the main verb: "is NOT explaining", "will NOT explain", and this rules out treating "will explain" as a single string formed from "explain" by prefixing "will ").
Avoid redundant forms in your paradigm.
Paradigms must generate the MINIMUM SET of different word forms that can be associated with the same base form. By "word form", we understand all the possible variants that a base form may have "at the string level".
For instance, the verbal morphology of English is represented, in most regular cases (such as "to love"), only by 4 rules:
  • INF
  • PAS
  • GER
  • 3PS&PRS&IND
because this is the MINIMUM SET of simple word forms that the verb may assume (as in "love, loved, loving, loves").
Note that the base form "love" cannot generate any other word form. Obviously, some of these forms are used to convey different information (e.g., "love" is infinitive, but it is also first person present indicative, second person present indicative, ..., first person present subjunctive, second person imperative, etc.). But all this information is conveyed by the same string as the infinitive, and there is no reason to include the alternatives in the dictionary. If we did so, we would have a serious person/tense disambiguation problem during tokenization. Even in a very simple sentence such as "I love Paris", there would be many different candidates for "love", and this would make the analysis very difficult. In order to be sure that we are picking "love" as 1PS&PRS&IND, we would need so many disambiguation rules that processing would become very expensive. It is much easier to have one single form "love" as an infinitive, and to calculate the tense, person and aspect inside the grammar.
Therefore, the English verbal paradigm, which could be expressed through many different rules (most of which generate the same word forms), such as:
  • INF:=0>"";
  • GER:=1>"ing";
  • PTP:=0>"d";
  • 1PS&PRS&IND:=0>"";
  • 2PS&PRS&IND:=0>"";
  • 3PS&PRS&IND:=0>"s";
  • 1PP&PRS&IND:=0>"";
  • 2PP&PRS&IND:=0>"";
  • 3PP&PRS&IND:=0>"";
  • 1PS&PRS&SUB:=0>"";
  • 2PS&PRS&SUB:=0>"";
  • 3PS&PRS&SUB:=0>"";
  • 1PP&PRS&SUB:=0>"";
  • 2PP&PRS&SUB:=0>"";
  • 3PP&PRS&SUB:=0>"";
  • 1PS&PAS&IND:=0>"d";
  • 2PS&PAS&IND:=0>"d";
  • 3PS&PAS&IND:=0>"d";
  • 1PP&PAS&IND:=0>"d";
  • 2PP&PAS&IND:=0>"d";
  • 3PP&PAS&IND:=0>"d";
can be reduced to 4 rules (in case of "love"[1]):
  • INF:=0>"";
  • GER:=1>"ing";
  • PAS:=0>"d";
  • 3PS&PRS&IND:=0>"s";
All the other forms are calculated directly in the grammar.
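The reduction can be checked mechanically. The following Python sketch is only an illustration: it assumes that a suffixing A-rule written as CONDITION:=N>"SUFFIX"; removes N characters from the end of the base form and then appends SUFFIX (a reading inferred from the "love" examples, not a full account of A-rule syntax). The names parse_rule and apply_rule are hypothetical helpers, not part of any UNL tool.

```python
import re

# Assumed reading of a suffixing A-rule: CONDITION:=N>"SUFFIX";
# N trailing characters are removed, then SUFFIX is appended.
RULE = re.compile(r'^(?P<cond>[^:]+):=(?P<cut>\d+)>"(?P<suffix>[^"]*)";$')

def parse_rule(text):
    """Split a rule string into (condition, characters to cut, suffix)."""
    match = RULE.match(text.strip())
    if match is None:
        raise ValueError(f"not a suffixing A-rule: {text!r}")
    return match["cond"], int(match["cut"]), match["suffix"]

def apply_rule(base, cut, suffix):
    """Remove `cut` trailing characters from `base` and append `suffix`."""
    stem = base[:len(base) - cut]
    return stem + suffix

# The reduced paradigm for "love".
reduced = ['INF:=0>"";', 'GER:=1>"ing";', 'PAS:=0>"d";', '3PS&PRS&IND:=0>"s";']

forms = {}
for rule in reduced:
    cond, cut, suffix = parse_rule(rule)
    forms[cond] = apply_rule("love", cut, suffix)
print(forms)

# Adding person/number/mood conditions (here: the present subjunctive)
# adds rules but no new strings.
persons = ("1PS", "2PS", "3PS", "1PP", "2PP", "3PP")
expanded = reduced + [f'{p}&PRS&SUB:=0>"";' for p in persons]
distinct = set()
for rule in expanded:
    _, cut, suffix = parse_rule(rule)
    distinct.add(apply_rule("love", cut, suffix))
print(sorted(distinct))  # ['love', 'loved', 'loves', 'loving']
```

Under these assumptions, ten rules still yield only four distinct strings, which is why the extra conditions are left to the grammar.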

Steps

In order to create inflectional paradigms, follow the steps below:

  1. Create the inflectional schema for the intended part-of-speech, if it has not been created yet
  2. Create the paradigm
    1. Name the paradigm
    2. Define the paradigm
    3. Create the rules
    4. Provide an exemplar base form (to test the paradigm)
    5. Provide examples

1. Create the inflectional schema

The inflectional schema is a template used to build paradigms and to ensure that they follow the same structure.
The inflectional schema is a list of inflectional categories for each part-of-speech. It describes the differences between the possible forms of the same lemma.
Consider the examples below for English, French and Latin.

English

In English, inflections concern only two parts of speech: nouns and verbs. The others (determiners, adjectives, adverbs, etc.) are treated as non-inflecting.

Nouns
English nouns may have two forms: singular (SNG) and plural (PLR). Therefore, the inflectional schema for English nouns is the following:
  • SNG (singular): table, man, foot
  • PLR (plural): tables, men, feet
Verbs
English verbs may have several forms, but there are only 5 simple distinctive forms: infinitive (INF), gerund (GER), participle (PTP), simple past (PAS) and third person present indicative (3PS&PRS&IND). Therefore, the inflectional schema for English verbs is the following:
  • INF (infinitive): love, do
  • GER (gerund): loving, doing
  • PAS (past): loved, did
  • PTP (participle): loved, done
  • 3PS&PRS&IND (third person singular present indicative): loves, does
Note, in the above, that the inflectional schema does not include the simple present (PRS) because it uses the same form as the infinitive. Note, also, that the only person specified is the third person singular (3PS) in the present indicative (PRS&IND), because this is the only one that has a distinct form. Note, finally, that the schema does not include any compound tense (such as future, present progressive, present perfect, past perfect, etc.), because these cannot be generated through simple affixation.

French

In French, inflections affect nouns, adjectives and verbs.

Nouns
There are two types of French nouns: those that have only number inflection, and those that have number and gender. There will be, therefore, two inflectional schemes:
  • Nouns inflecting only in number (such as "table" (=table), "ville" (=city), "voiture" (=car), "père" (=father), "dentiste" (=dentist), etc.)
    • SNG (singular): table, ville, voiture, père, dentiste
    • PLR (plural): tables, villes, voitures, pères, dentistes
  • Nouns inflecting in number and gender (such as "ami" (=friend), "chien" (=dog), "danseur" (=dancer), etc.)
    • MCL&SNG (masculine singular): ami, chien, danseur
    • FEM&SNG (feminine singular): amie, chienne, danseuse
    • MCL&PLR (masculine plural): amis, chiens, danseurs
    • FEM&PLR (feminine plural): amies, chiennes, danseuses
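To make the relation between a schema and its paradigms concrete, here is a small Python sketch. It treats the number-and-gender schema as a fixed list of slots and fills it with two paradigms. The suffixes in PARADIGM_AMI and PARADIGM_CHIEN are assumptions chosen to reproduce the forms listed above, and the (cut, suffix) pair is a hypothetical stand-in for an A-rule of the form N>"SUFFIX".

```python
# Schema for French nouns inflecting in number and gender (from the list above).
SCHEMA = ("MCL&SNG", "FEM&SNG", "MCL&PLR", "FEM&PLR")

# Two hypothetical paradigms filling the same schema.
# Each slot holds (characters to remove from the end, suffix to append).
PARADIGM_AMI = {"MCL&SNG": (0, ""), "FEM&SNG": (0, "e"),
                "MCL&PLR": (0, "s"), "FEM&PLR": (0, "es")}
PARADIGM_CHIEN = {"MCL&SNG": (0, ""), "FEM&SNG": (0, "ne"),
                  "MCL&PLR": (0, "s"), "FEM&PLR": (0, "nes")}

def inflect(base, paradigm):
    """Generate every form in the schema for one base form."""
    missing = [slot for slot in SCHEMA if slot not in paradigm]
    if missing:
        raise ValueError(f"paradigm leaves schema slots empty: {missing}")
    return {slot: base[:len(base) - cut] + suffix
            for slot, (cut, suffix) in paradigm.items()}

print(inflect("ami", PARADIGM_AMI))
print(inflect("chien", PARADIGM_CHIEN))
```

Both paradigms have exactly the slots of the schema; only the affixes differ, which is the sense in which the schema acts as a template.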
Adjectives
In French, adjectives vary regularly in number and gender, according to the following inflectional schema:
  • MCL&SNG (masculine singular): beau
  • FEM&SNG (feminine singular): belle
  • MCL&PLR (masculine plural): beaux
  • FEM&PLR (feminine plural): belles
Verbs
In French, verbs may have 51 different simple forms, as described in the following inflectional schema:
  • INF (infinitive): aimer
  • PTP&MCL&SNG (participle masculine singular): aimé
  • PTP&MCL&PLR (participle masculine plural): aimés
  • PTP&FEM&SNG (participle feminine singular): aimée
  • PTP&FEM&PLR (participle feminine plural): aimées
  • 1PS&PRS&IND (first person singular present indicative): aime
  • 2PS&PRS&IND (second person singular present indicative): aimes
  • 3PS&PRS&IND (third person singular present indicative): aime
  • 1PP&PRS&IND (first person plural present indicative): aimons
  • 2PP&PRS&IND (second person plural present indicative): aimez
  • 3PP&PRS&IND (third person plural present indicative): aiment
  • 1PS&PAS&NPFV&IND (first person singular past imperfective indicative): aimais
  • 2PS&PAS&NPFV&IND (second person singular past imperfective indicative): aimais
  • 3PS&PAS&NPFV&IND (third person singular past imperfective indicative): aimait
  • 1PP&PAS&NPFV&IND (first person plural past imperfective indicative): aimions
  • 2PP&PAS&NPFV&IND (second person plural past imperfective indicative): aimiez
  • 3PP&PAS&NPFV&IND (third person plural past imperfective indicative): aimaient
  • etc. (see the complete list at French grammar)

Latin

In Latin, inflections affect nouns, adjectives and verbs.

Nouns
Latin nouns may inflect in number and case (or in number, gender and case, in the special case of some words that refer to animals and humans, as in French):
  • NOM&SNG (nominative singular): rosa
  • NOM&PLR (nominative plural): rosae
  • VOC&SNG (vocative singular): rosa
  • VOC&PLR (vocative plural): rosae
  • ACC&SNG (accusative singular): rosam
  • ACC&PLR (accusative plural): rosas
  • GNT&SNG (genitive singular): rosae
  • GNT&PLR (genitive plural): rosarum
  • DAT&SNG (dative singular): rosae
  • DAT&PLR (dative plural): rosis
  • ABL&SNG (ablative singular): rosa
  • ABL&PLR (ablative plural): rosis
Adjectives
Latin adjectives inflect in gender, number and case:
  • MCL&NOM&SNG (masculine nominative singular): bonus
  • MCL&NOM&PLR (masculine nominative plural): boni
  • MCL&VOC&SNG (masculine vocative singular): bone
  • MCL&VOC&PLR (masculine vocative plural): boni
  • MCL&ACC&SNG (masculine accusative singular): bonum
  • MCL&ACC&PLR (masculine accusative plural): bonos
  • MCL&GNT&SNG (masculine genitive singular): boni
  • MCL&GNT&PLR (masculine genitive plural): bonorum
  • MCL&DAT&SNG (masculine dative singular): bono
  • MCL&DAT&PLR (masculine dative plural): bonis
  • MCL&ABL&SNG (masculine ablative singular): bono
  • MCL&ABL&PLR (masculine ablative plural): bonis
  • FEM&NOM&SNG (feminine nominative singular): bona
  • FEM&NOM&PLR (feminine nominative plural): bonae
  • FEM&VOC&SNG (feminine vocative singular): bona
  • FEM&VOC&PLR (feminine vocative plural): bonae
  • FEM&ACC&SNG (feminine accusative singular): bonam
  • FEM&ACC&PLR (feminine accusative plural): bonas
  • FEM&GNT&SNG (feminine genitive singular): bonae
  • FEM&GNT&PLR (feminine genitive plural): bonarum
  • FEM&DAT&SNG (feminine dative singular): bonae
  • FEM&DAT&PLR (feminine dative plural): bonis
  • FEM&ABL&SNG (feminine ablative singular): bona
  • FEM&ABL&PLR (feminine ablative plural): bonis
  • NEU&NOM&SNG (neuter nominative singular): bonum
  • NEU&NOM&PLR (neuter nominative plural): bona
  • NEU&VOC&SNG (neuter vocative singular): bonum
  • NEU&VOC&PLR (neuter vocative plural): bona
  • NEU&ACC&SNG (neuter accusative singular): bonum
  • NEU&ACC&PLR (neuter accusative plural): bona
  • NEU&GNT&SNG (neuter genitive singular): boni
  • NEU&GNT&PLR (neuter genitive plural): bonorum
  • NEU&DAT&SNG (neuter dative singular): bono
  • NEU&DAT&PLR (neuter dative plural): bonis
  • NEU&ABL&SNG (neuter ablative singular): bono
  • NEU&ABL&PLR (neuter ablative plural): bonis
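The 36 categories above are every combination of three genders, six cases and two numbers. As a rough sketch (the variable names are illustrative only), the same list of tags can be produced in Python with a Cartesian product, which is a convenient way to check that no slot has been forgotten when a schema is written by hand:

```python
from itertools import product

GENDERS = ("MCL", "FEM", "NEU")
CASES = ("NOM", "VOC", "ACC", "GNT", "DAT", "ABL")
NUMBERS = ("SNG", "PLR")

# One tag per combination, in the same order as the list above.
schema = ["&".join(tags) for tags in product(GENDERS, CASES, NUMBERS)]

print(len(schema))  # 36
print(schema[:4])   # ['MCL&NOM&SNG', 'MCL&NOM&PLR', 'MCL&VOC&SNG', 'MCL&VOC&PLR']
```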
Verbs
Latin verbs inflect in many different simple forms (see the complete list at Latin grammar)

Observations

  1. The same part-of-speech may involve different inflectional schemes.
    In French, for instance, some nouns, such as "livre" (= book), only inflect in number (SNG and PLR); other nouns, such as "ami" (= friend), inflect in number and in gender (MCL&SNG,MCL&PLR,FEM&SNG,FEM&PLR). In these cases, there can be more than one inflectional schema for the same part-of-speech.
  2. Inflectional schemes must only include simple forms (i.e., those that are formed by affixation).
    Do not include categories in an inflectional schema if they involve auxiliary or supporting words (such as the future, in English, or the passé composé, in French).
  3. Rules are not cumulative.
    You have to combine inflectional categories in a single condition because rules cannot be applied sequentially. For instance, in French it is not possible to write simply FEM:=0>"e"; and PLR:=0>"s";. It is necessary to write FEM&PLR:=0>"es";. This is because, for the time being, there is no way to tell the machine in which order the rules should be applied: if number and gender were defined separately, we could get "amise" instead of "amies".
  4. Rules must be mutually exclusive.
    Inside the same paradigm, all conditions must be different: there cannot be two rules with the same condition, or a rule whose condition includes the condition of another rule:
    • SNG:=0>"";MCL&SNG:=0>""; (the condition SNG, of the first rule, is included in the condition of the second rule)
    • PLR:=0>"";PLR:=0>"es"; (the condition of the first rule is the same as the second rule)
  5. In order to deal with possible variants for the same lemma, the features ALT1, ALT2, ALT3, etc. must be used:
    For instance, the English word "fish" may have two different plurals: "fish" and "fishes". This is to be represented by
    • SNG:=0>"";PLR&ALT1:=0>"";PLR&ALT2:=0>"es"; instead of
    • SNG:=0>"";PLR:=0>"";PLR:=0>"es";
  6. Inflectional schemes are created inside the UNLarium (UNLARIUM>GRAMMAR>[LOCALE]>SETTINGS>INFLECTIONAL SCHEMES)
    Inflectional schemes, as templates, are created only once. After being created, they become available inside the UNLarium and are used as templates to create new paradigms.
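The mutual-exclusivity requirement (observations 4 and 5) lends itself to an automatic check. The Python sketch below assumes that a condition is simply a set of tags joined by "&", and flags two conditions when they are identical or when one is contained in the other. The function name check_paradigm is hypothetical and the check is not part of the UNLarium.

```python
def check_paradigm(conditions):
    """Return the pairs of conditions that conflict within one paradigm.

    Two conditions conflict when they are equal or when the tags of one
    are a subset of the tags of the other.
    """
    tag_sets = [(cond, frozenset(cond.split("&"))) for cond in conditions]
    conflicts = []
    for i, (cond_a, tags_a) in enumerate(tag_sets):
        for cond_b, tags_b in tag_sets[i + 1:]:
            if tags_a <= tags_b or tags_b <= tags_a:
                conflicts.append((cond_a, cond_b))
    return conflicts

print(check_paradigm(["SNG", "MCL&SNG"]))               # [('SNG', 'MCL&SNG')]
print(check_paradigm(["SNG", "PLR", "PLR"]))            # [('PLR', 'PLR')]
print(check_paradigm(["SNG", "PLR&ALT1", "PLR&ALT2"]))  # []
```

The last call corresponds to the "fish" example: adding ALT1 and ALT2 makes the two plural conditions distinct, so no conflict is reported.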

2. Create the paradigm

Paradigms are created inside the UNLarium (at UNLWEB>UNLARIUM>GRAMMAR>[LOCALE]>INFLECTIONAL PARADIGMS>ADD). They are normally based on inflectional schemes (templates).

2.1 Name the paradigm

The first field to be provided in the paradigm form is "name". Paradigm names must be unique. The following standards have been used to name paradigms:

  • a common name (such as "first declension", "first group"), in case of well-established reference;
  • the rule itself, in case of single-rule paradigms;
  • the most distinctive rule, if any; or
  • a "leading form", i.e., a typical example (a prototype) representative of the whole category, otherwise.

2.2 Define the paradigm

The paradigm definition must state clearly what the paradigm does and when it is applied.

2.3 Create the rules

Rules are normally created by filling in the inflectional schema (which may be selected with the specific button at the right side of the form). Paradigm rules are always of the A-rule type, i.e., they are always affixation rules.

2.4 Provide an exemplar base form

The exemplar base form is used to test the paradigm, by pressing the right button after it. It must be a base form, over which the rules will be applied.

2.5 Provide examples

The examples illustrate other uses of the paradigm (in addition to the exemplar base form). Examples of the same category must be separated by commas, and examples of different categories by semicolons. For instance:

  • PAR: M2
  • Examples: book,books; table,tables;
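As a small illustration of this format, the following Python sketch splits an examples string into groups at semicolons and into individual forms at commas. The function name parse_examples is hypothetical; how the UNLarium actually reads this field is not described here.

```python
def parse_examples(text):
    """Split an examples field into groups (';') of individual forms (',')."""
    groups = []
    for chunk in text.split(";"):
        forms = [form.strip() for form in chunk.split(",") if form.strip()]
        if forms:  # skip the empty chunk after a trailing semicolon
            groups.append(forms)
    return groups

print(parse_examples("book,books; table,tables;"))
# [['book', 'books'], ['table', 'tables']]
```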