Tokenization

From UNL Wiki
Revision as of 23:07, 4 April 2012 by Martins (Talk | contribs)

Tokenization is the process of segmenting the input into "tokens" (processing units).


Tokenization of natural language input

In the UNL framework, natural language input is tokenized according to five principles:

  • The tokenization algorithm proceeds from left to right
  • The tokenization algorithm matches the longest dictionary entries first
  • The tokenization algorithm assigns the feature TEMP (temporary) to strings not found in the dictionary
  • The tokenization algorithm blocks sequences of tokens prohibited by Grammar_Specs#Disambiguation_Rules
  • When several candidates are possible, the tokenization algorithm picks the ones induced by Grammar_Specs#Disambiguation_Rules, if any
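The first three principles can be sketched as a greedy left-to-right, longest-match-first scanner. This is a minimal illustration, not the actual UNL implementation: the last two principles, which depend on the Disambiguation Rules, are omitted, and the unmatched-string fallback here tags single characters with TEMP.

```python
def tokenize(text, dictionary):
    """Left-to-right, longest-match-first tokenization.

    Returns (token, feature) pairs. Substrings not found in the
    dictionary are emitted character by character, tagged TEMP.
    """
    max_len = max((len(entry) for entry in dictionary), default=0)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append((candidate, None))
                i += length
                break
        else:
            # No dictionary entry matched: tag the character TEMP (temporary).
            tokens.append((text[i], "TEMP"))
            i += 1
    return tokens
```

For instance, with a dictionary containing "abcde", the input "abcdex" yields the single longest token plus one TEMP-tagged leftover: [("abcde", None), ("x", "TEMP")].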

Example

Given the dictionary:
[abcde]{}""(...)<...>;
[abcd]{}""(...)<...>;
[bcde]{}""(...)<...>;
[abc]{}""(...)<...>;
[bcd]{}""(...)<...>;
[cde]{}""(...)<...>;
[ab]{}""(...)<...>;
[bc]{}""(...)<...>;
[cd]{}""(...)<...>;
[de]{}""(...)<...>;
[a]{}""(...)<...>;
[b]{}""(...)<...>;
[c]{}""(...)<...>;
[d]{}""(...)<...>;
[e]{}""(...)<...>;
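Running a longest-match-first scan over this dictionary can be checked with a short sketch (the entries are reduced to their headwords; this is an illustration, not the UNL tool itself). The input "abcde" is captured whole by the longest entry, while "abcdde" falls back to "abcd" followed by "de":

```python
# Headwords of the dictionary entries listed above.
DICTIONARY = {"abcde", "abcd", "bcde", "abc", "bcd", "cde",
              "ab", "bc", "cd", "de", "a", "b", "c", "d", "e"}

def tokenize(text):
    """Greedy left-to-right, longest-match-first scan over DICTIONARY.

    Assumes every character of the input is covered by some entry,
    which holds here since all single letters a-e are listed.
    """
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(5, len(text) - i), 0, -1):
            if text[i:i + length] in DICTIONARY:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens

print(tokenize("abcde"))   # ['abcde']       -- the longest entry wins
print(tokenize("abcdde"))  # ['abcd', 'de']  -- back off, then match again
```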

Software