Tokenization


Tokenization is the process of segmenting the input into "tokens" (processing units).


Tokenization of natural language input

In the UNL framework, natural language input is tokenized according to five principles (illustrated by the sketch after this list):

  • The tokenization algorithm proceeds from left to right
  • The tokenization algorithm tries to match the longest entries in the dictionary first
  • The tokenization algorithm assigns the feature TEMP (temporary) to strings that are not found in the dictionary
  • The tokenization algorithm blocks sequences of tokens prohibited by Grammar_Specs#Disambiguation_Rules
  • In case of several possible candidates, the tokenization algorithm picks the ones induced by Grammar_Specs#Disambiguation_Rules, if any
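
As a rough illustration of these principles, here is a sketch in Python; it is not the UNL reference implementation. It assumes that the dictionary is a plain set of headwords and that each disambiguation rule is a predicate that rejects a candidate token sequence; the ranking key is merely one ordering that reproduces the cases in the example below.

def segmentations(text, dictionary, pos=0):
    """Enumerate every segmentation of text from left to right.
    Tokens are (string, feature) pairs; characters that cannot start any
    dictionary entry become one-character tokens with the TEMP feature."""
    if pos == len(text):
        yield []
        return
    matched = False
    for end in range(len(text), pos, -1):          # longest entries first
        piece = text[pos:end]
        if piece in dictionary:
            matched = True
            for rest in segmentations(text, dictionary, end):
                yield [(piece, None)] + rest
    if not matched:                                # nothing found here: TEMP
        for rest in segmentations(text, dictionary, pos + 1):
            yield [(text[pos], "TEMP")] + rest


def tokenize(text, dictionary, rules=()):
    """Return the preferred segmentation that no disambiguation rule blocks,
    or None if every candidate is blocked. The key prefers longer tokens
    overall, then longer tokens further to the left."""
    def key(tokens):
        lengths = tuple(len(t) for t, _ in tokens)
        return (tuple(sorted(lengths, reverse=True)), lengths)

    for tokens in sorted(segmentations(text, dictionary), key=key, reverse=True):
        if not any(rule(tokens) for rule in rules):
            return tokens
    return None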

Example

Dictionary:
[abcde]{}""(...)<...>;
[abcd]{}""(...)<...>;
[bcde]{}""(...)<...>;
[abc]{}""(...)<...>;
[bcd]{}""(...)<...>;
[cde]{}""(...)<...>;
[ab]{}""(...)<...>;
[bc]{}""(...)<...>;
[cd]{}""(...)<...>;
[de]{}""(...)<...>;
[a]{}""(...)<...>;
[b]{}""(...)<...>;
[c]{}""(...)<...>;
[d]{}""(...)<...>;
[e]{}""(...)<...>;
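
For the sketch above, this dictionary can be reduced to the set of its headwords; the UNL-specific fields are irrelevant to segmentation and are omitted here (the name DICTIONARY is ours):

# Headwords of the example dictionary above; the {}""(...)<...> fields
# play no role in segmentation and are left out of this sketch.
DICTIONARY = {
    "abcde", "abcd", "bcde", "abc", "bcd", "cde",
    "ab", "bc", "cd", "de", "a", "b", "c", "d", "e",
}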

Case 1
input: "abcde"
disambiguation rule: NONE
tokenization output: [abcde] (the longest entry in the dictionary)
Case 2
input: "abcde"
disambiguation rule: ("abcde")=0; (prohibits the sequence "abcde")
tokenization output: [abcd][e] (the second longest possibility from left to right according to the dictionary)
Case 3
input: "abcde"
disambiguation rules: ("abcde")=0;("abcd")("e")=0;
tokenization output: [a][bcde] (the third longest possibility from left to right according to the dictionary)
Case 4
input: "abcde"
disambiguation rules: ("abcde")=0;("abcd")("e")=0;("a")("bcde")=0;
tokenization output: [abc][de] (the fourth longest possibility from left to right according to the dictionary)
Case 5
input: "abcde"
disambiguation rules: ("/.{3,5}/")=0; (prohibits any token made of 3, 4 or 5 characters)
tokenization output: [ab][cd][e]
Case 6
input: "abXcYde"
disambiguation rule: NONE
tokenization output: [ab][X][c][Y][de] (where [X] and [Y] are temporary entries)
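
Assuming the sketch and the headword set above, the six cases can be replayed by expressing each disambiguation rule as a rejection predicate; the helpers forbid_token, forbid_pair and forbid_length are hypothetical stand-ins for the rule notation, not part of the UNL grammar formalism.

def forbid_token(s):
    """Rule of the form ("s")=0: reject any candidate containing the token s."""
    return lambda tokens: any(t == s for t, _ in tokens)

def forbid_pair(s1, s2):
    """Rule of the form ("s1")("s2")=0: reject the adjacent pair s1 s2."""
    return lambda tokens: any(a[0] == s1 and b[0] == s2
                              for a, b in zip(tokens, tokens[1:]))

def forbid_length(lo, hi):
    """Rule of the form ("/.{lo,hi}/")=0: reject tokens of lo to hi characters."""
    return lambda tokens: any(lo <= len(t) <= hi for t, _ in tokens)

print(tokenize("abcde", DICTIONARY))                              # Case 1: [abcde]
print(tokenize("abcde", DICTIONARY, [forbid_token("abcde")]))     # Case 2: [abcd][e]
print(tokenize("abcde", DICTIONARY, [forbid_token("abcde"),
                                     forbid_pair("abcd", "e")]))  # Case 3: [a][bcde]
print(tokenize("abcde", DICTIONARY, [forbid_token("abcde"),
                                     forbid_pair("abcd", "e"),
                                     forbid_pair("a", "bcde")]))  # Case 4: [abc][de]
print(tokenize("abcde", DICTIONARY, [forbid_length(3, 5)]))       # Case 5: [ab][cd][e]
print(tokenize("abXcYde", DICTIONARY))                            # Case 6: [ab][X][c][Y][de], X and Y carry TEMP
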
Software