Tokenization

From UNL Wiki
(Difference between revisions)
Jump to: navigation, search
(Example)
Line 1: Line 1:
 
Tokenization is the process of segmenting the input into "tokens" (processing units).  
 
Tokenization is the process of segmenting the input into "tokens" (processing units).  
 
  
 
== Tokenization of natural language input ==
 
== Tokenization of natural language input ==
 
In the UNL framework, natural language input is tokenized according to five principles:
 
In the UNL framework, natural language input is tokenized according to five principles:
*The tokenization algorithm goes '''from left to right'''
+
 
*The tokenization algorithm tries to '''match first the longest entries in the dictionary'''
+
The tokenization algorithm goes '''from left to right'''
*The tokenization algorithm assigns the feature '''TEMP''' (temporary) to the strings that were not found in the dictionary
+
The tokenization algorithm tries to '''match first the longest entries in the dictionary'''
*The tokenization algorithm '''blocks sequences of tokens prohibited by [[Grammar_Specs#Disambiguation_Rules]]'''
+
The tokenization algorithm assigns the feature '''TEMP''' (temporary) to the strings that were not found in the dictionary
*In case of several possible candidates, the tokenization algorithm picks the ones induced by [[Grammar_Specs#Disambiguation_Rules]], if any
+
The tokenization algorithm '''blocks sequences of tokens prohibited by [[Grammar_Specs#Disambiguation_Rules]]'''
 +
In case of several possible candidates, the tokenization algorithm picks the ones induced by [[Grammar_Specs#Disambiguation_Rules]], if any
  
 
=== Example ===
 
=== Example ===
Line 22: Line 22:
 
[cd]{}""(...)<...>;<br />
 
[cd]{}""(...)<...>;<br />
 
[de]{}""(...)<...>;<br />
 
[de]{}""(...)<...>;<br />
[a]{}""(...)<...>;<br />
+
[a]{}""(A1,...)<...>;<br />
[b]{}""(...)<...>;<br />
+
[a]{}""(A2,...)<...>;<br />
[c]{}""(...)<...>;<br />
+
[b]{}""(B1,...)<...>;<br />
[d]{}""(...)<...>;<br />
+
[b]{}""(B2,...)<...>;<br />
[e]{}""(...)<...>;<br />
+
[c]{}""(C1,...)<...>;<br />
 +
[c]{}""(C2,...)<...>;<br />
 +
[d]{}""(D1,...)<...>;<br />
 +
[d]{}""(D2,...)<...>;<br />
 +
[e]{}""(E1,...)<...>;<br />
 +
[e]{}""(E2,...)<...>;<br />
 
<br />
 
<br />
  
Line 53: Line 58:
 
:disambiguation rule: NONE
 
:disambiguation rule: NONE
 
:tokenization output: [ab][X][c][Y][de] (where [X] and [Y] are temporary entries)
 
:tokenization output: [ab][X][c][Y][de] (where [X] and [Y] are temporary entries)
 +
 +
=== Properties of tokenization ===
 +
Positive disambiguation rules only apply over candidates to the same position with the same length
 +
;Case 7
 +
:input: "abcde"
 +
:disambiguation rule: ("abcd")("e")=100; (induces the sequence [abcd][e])
 +
:tokenization output: [abcde] (positive disambiguation rules do not prevail over the principle of longest first)
 +
;Case 8
 +
:input: "abcde"
 +
:disambiguation rule: ("/.{2,5}/")=0; (prohibits any token made of 2, 3, 4 or 5 characters)
 +
:tokenization output: [a,A1][b,B1][c,C1][d,D1][e,E1] (among the candidates, these entries come first in the dictionary)
 +
;Case 9
 +
:input: "abcde"
 +
:disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (induces the sequence [A2][B2])
 +
:tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (A2 and B2 were chosen instead of A1 and B1)
 +
Positive disambiguation rules apply according to their probability and, inside the same probability, according to their order in the grammar
 +
;Case 10
 +
:input: "abcde"
 +
:disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (A2)(B1)=1; (induces the sequence [A2][B1])
 +
:tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (the rule [A2][B2] has the some probability as [A2][B1] and comes first in the grammar)
 +
;Case 11
 +
:input: "abcde"
 +
:disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (A2)(B1)=2; (induces the sequence [A2][B1])
 +
:tokenization output: [a,A2][b,B1][c,C1][d,D1][e,E1] (the priority of the rule [A2][B1] is higher than [A2][B2])
 +
;Case 12
 +
:input: "abcde"
 +
:disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (B1)(C2)=2; (induces the sequence [A2][B1])
 +
:tokenization output: [a,A1][b,B1][c,C2][d,D1][e,E1] (the priority of the rule [B1][C2] is higher than [A2][B2])
 +
The attribute #FINAL is used to indicate context
 +
:input: "abcde"
 +
:disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (B1,#FINAL)(C2)=5; (if B1 is fixed, C2 must be chosen)
 +
:tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (the same as case 11; compare with case 12 - note that the rule (B1,#FINAL)(C2) does not induce the choice of B1; it applies only to C2, i.e., it indicates that, if B1, then C2)

Revision as of 23:49, 4 April 2012

Tokenization is the process of segmenting the input into "tokens" (processing units).

Tokenization of natural language input

In the UNL framework, natural language input is tokenized according to five principles:

The tokenization algorithm goes from left to right
The tokenization algorithm tries to match first the longest entries in the dictionary
The tokenization algorithm assigns the feature TEMP (temporary) to the strings that were not found in the dictionary
The tokenization algorithm blocks sequences of tokens prohibited by Grammar_Specs#Disambiguation_Rules
In case of several possible candidates, the tokenization algorithm picks the ones induced by Grammar_Specs#Disambiguation_Rules, if any

Example

Dictionary:
[abcde]{}""(...)<...>;
[abcd]{}""(...)<...>;
[bcde]{}""(...)<...>;
[abc]{}""(...)<...>;
[bcd]{}""(...)<...>;
[cde]{}""(...)<...>;
[ab]{}""(...)<...>;
[bc]{}""(...)<...>;
[cd]{}""(...)<...>;
[de]{}""(...)<...>;
[a]{}""(A1,...)<...>;
[a]{}""(A2,...)<...>;
[b]{}""(B1,...)<...>;
[b]{}""(B2,...)<...>;
[c]{}""(C1,...)<...>;
[c]{}""(C2,...)<...>;
[d]{}""(D1,...)<...>;
[d]{}""(D2,...)<...>;
[e]{}""(E1,...)<...>;
[e]{}""(E2,...)<...>;

Case 1
input: "abcde"
disambiguation rule: NONE
tokenization output: [abcde] (the longest entry in the dictionary)
Case 2
input: "abcde"
disambiguation rule: ("abcde")=0; (prohibits the sequence "abcde")
tokenization output: [abcd][e] (the second longest possibility from left to right according to the dictionary)
Case 3
input: "abcde"
disambiguation rules: ("abcde")=0;("abcd")("e")=0;
tokenization output: [a][bcde] (the third longest possibility from left to right according to the dictionary)
Case 4
input: "abcde"
disambiguation rules: ("abcde")=0;("abcd")("e")=0;("a")("bcde")=0;
tokenization output: [abc][de] (the fourth longest possibility from left to right according to the dictionary)
Case 5
input: "abcde"
disambiguation rules: ("/.{3,5}/")=0; (prohibits any token made of 3, 4 or 5 characters)
tokenization output: [ab][cd][e]
Case 6
input: "abXcYde"
disambiguation rule: NONE
tokenization output: [ab][X][c][Y][de] (where [X] and [Y] are temporary entries)

Properties of tokenization

Positive disambiguation rules only apply over candidates to the same position with the same length
Case 7
input: "abcde"
disambiguation rule: ("abcd")("e")=100; (induces the sequence [abcd][e])
tokenization output: [abcde] (positive disambiguation rules do not prevail over the principle of longest first)
Case 8
input: "abcde"
disambiguation rule: ("/.{2,5}/")=0; (prohibits any token made of 2, 3, 4 or 5 characters)
tokenization output: [a,A1][b,B1][c,C1][d,D1][e,E1] (among the candidates, these entries come first in the dictionary)
Case 9
input: "abcde"
disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (induces the sequence [A2][B2])
tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (A2 and B2 were chosen instead of A1 and B1)
Positive disambiguation rules apply according to their probability and, inside the same probability, according to their order in the grammar
Case 10
input: "abcde"
disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (A2)(B1)=1; (induces the sequence [A2][B1])
tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (the rule [A2][B2] has the some probability as [A2][B1] and comes first in the grammar)
Case 11
input: "abcde"
disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (A2)(B1)=2; (induces the sequence [A2][B1])
tokenization output: [a,A2][b,B1][c,C1][d,D1][e,E1] (the priority of the rule [A2][B1] is higher than [A2][B2])
Case 12
input: "abcde"
disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (B1)(C2)=2; (induces the sequence [A2][B1])
tokenization output: [a,A1][b,B1][c,C2][d,D1][e,E1] (the priority of the rule [B1][C2] is higher than [A2][B2])
The attribute #FINAL is used to indicate context
input: "abcde"
disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (B1,#FINAL)(C2)=5; (if B1 is fixed, C2 must be chosen)
tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (the same as case 11; compare with case 12 - note that the rule (B1,#FINAL)(C2) does not induce the choice of B1; it applies only to C2, i.e., it indicates that, if B1, then C2)
Software