Tokenization
From UNL Wiki
Tokenization is the process of segmenting the input into "tokens" (processing units).
Tokenization of natural language input
In the UNL framework, natural language input is tokenized according to five principles:
- The tokenization algorithm proceeds from left to right
- The tokenization algorithm tries to match the longest dictionary entries first
- The tokenization algorithm assigns the feature TEMP (temporary) to strings that were not found in the dictionary
- The tokenization algorithm blocks sequences of tokens prohibited by Grammar_Specs#Disambiguation_Rules
- When several candidate segmentations are possible, the tokenization algorithm picks the ones induced by Grammar_Specs#Disambiguation_Rules, if any
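The five principles above can be sketched as a greedy longest-match scan that backtracks to a shorter match when a disambiguation rule blocks a token sequence. The sketch below is illustrative only: it represents blocked sequences as a set of adjacent token pairs, whereas real prohibitions are expressed as disambiguation rules in Grammar_Specs; the function name and data structures are assumptions.

```python
def tokenize_with_rules(text, dictionary, blocked_pairs):
    """Left-to-right, longest-match-first tokenization (principles 1-2),
    backtracking past sequences prohibited by blocked_pairs (principle 4),
    and falling back to a single TEMP character when nothing matches
    (principle 3). Returns the list of tokens, or None on dead ends."""
    def helper(i, acc):
        if i == len(text):
            return acc
        # Try the longest candidate starting at position i first.
        for length in range(len(text) - i, 0, -1):
            cand = text[i:i + length]
            if cand in dictionary:
                if acc and (acc[-1], cand) in blocked_pairs:
                    continue  # prohibited sequence: try a shorter match
                result = helper(i + length, acc + [cand])
                if result is not None:
                    return result
        # No dictionary entry usable here: take one character as TEMP.
        ch = text[i]
        if acc and (acc[-1], ch) in blocked_pairs:
            return None
        return helper(i + 1, acc + [ch])
    return helper(0, [])
```

With a toy dictionary {"ab", "a", "bc", "c"} and no blocked pairs, "abc" tokenizes as ["ab", "c"]; blocking the pair ("ab", "c") forces the algorithm to backtrack and pick ["a", "bc"] instead, illustrating principle 5.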
Example
Given the dictionary:
[abcde]{}""(...)<...>;
[abcd]{}""(...)<...>;
[bcde]{}""(...)<...>;
[abc]{}""(...)<...>;
[bcd]{}""(...)<...>;
[cde]{}""(...)<...>;
[ab]{}""(...)<...>;
[bc]{}""(...)<...>;
[cd]{}""(...)<...>;
[de]{}""(...)<...>;
[a]{}""(...)<...>;
[b]{}""(...)<...>;
[c]{}""(...)<...>;
[d]{}""(...)<...>;
[e]{}""(...)<...>;
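The example dictionary can be exercised with a minimal left-to-right, longest-match-first tokenizer. This is a sketch, not the UNL implementation: the function name and the "DICT"/"TEMP" tagging are assumptions, and disambiguation rules (principles 4-5) are omitted.

```python
# Headwords of the example dictionary entries above.
DICTIONARY = {"abcde", "abcd", "bcde", "abc", "bcd", "cde",
              "ab", "bc", "cd", "de", "a", "b", "c", "d", "e"}

def tokenize(text, dictionary=DICTIONARY):
    """Scan left to right, matching the longest dictionary entry at each
    position; characters not found at any length are tagged TEMP."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:  # longest match wins
                tokens.append((candidate, "DICT"))
                i += length
                break
        else:
            # Not in the dictionary: temporary token (principle 3).
            tokens.append((text[i], "TEMP"))
            i += 1
    return tokens
```

For instance, "abcde" is tokenized as the single entry [abcde], since the longest match is tried first; an unknown character such as "z" in "abcdez" would come out tagged TEMP.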