Tokenization
Revision as of 15:36, 18 April 2012
Tokenization is the process of segmenting the input into "tokens" (processing units).
General principles
In the UNL framework, tokenization follows the principles below:
- The tokenization algorithm goes from left to right.
- Except for digits, the tokenization algorithm is strictly dictionary-based: spaces and punctuation signs have to be inserted in the dictionary in order to be treated as non-temporary entries.
- The tokenization algorithm tries to match the longest entries in the dictionary first.
- The tokenization algorithm observes the frequency of the entries informed in the dictionary (the most frequent entries come first).
- The tokenization algorithm observes the order of the entries in the dictionary (in case of equal frequency, the system selects the first to appear).
- The tokenization algorithm assigns the feature TEMP (temporary) to the strings that were not found in the dictionary.
- The tokenization algorithm blocks tokens or sequences of tokens prohibited by disambiguation rules.
- In case of several possible candidates, the tokenization algorithm picks the ones induced by disambiguation rules, if any.
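The principles above can be sketched as a toy tokenizer. The following Python sketch is illustrative only, not the UNL engine: it implements left-to-right, longest-match, dictionary-based tokenization with the TEMP/DIGIT fallback described in the sections below; entry frequencies and disambiguation rules are omitted. The function name and the headword-to-features dictionary representation are assumptions of this sketch.

```python
def tokenize(text, dictionary):
    """Left-to-right, longest-match tokenization (simplified sketch).

    `dictionary` maps headwords to feature tuples. Characters not
    covered by any dictionary entry accumulate into temporary spans:
    a span made only of digits gets (TEMP, DIGIT), any other span
    gets (TEMP) with an empty UW.
    """
    tokens, pending, i = [], "", 0

    def flush():
        # Emit the accumulated unknown span, if any, as a TEMP token.
        nonlocal pending
        if pending:
            feats = ("TEMP", "DIGIT") if pending.isdigit() else ("TEMP",)
            tokens.append((pending, feats))
            pending = ""

    while i < len(text):
        # Try the longest dictionary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                flush()
                tokens.append((text[i:j], dictionary[text[i:j]]))
                i = j
                break
        else:
            pending += text[i]  # unknown character: grow the TEMP span
            i += 1
    flush()
    return tokens
```

For instance, `tokenize("1,234", {",": ("COMMA",)})` splits the input into a digit token, the comma, and another digit token, while `tokenize("abc123", {})` yields a single TEMP token, matching the tables in the following sections.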
Tokenization of natural language input
Tokenization of spaces, punctuation signs and symbols
The tokenization algorithm does not have any system-defined tokens except for digits. Any token has to be inserted in the dictionary in order to be recognized as a non-temporary entry.
input | dictionary | tokenization output
---|---|---
a b | empty | ["a b"], where "a b" = TEMP
a b | [a] {} "a" (A) <,,>; | [a][" b"], where [" b"] = TEMP
a b | [b] {} "b" (B) <,,>; | ["a "][b], where ["a "] = TEMP
a b | [a] {} "a" (A) <,,>; [b] {} "b" (B) <,,>; | [a][" "][b], where [" "] = TEMP
a b | [a] {} "a" (A) <,,>; [b] {} "b" (B) <,,>; [ ] {} "" (BLK) <,,>; | [a][ ][b] (no temporary entries)
Tokenization of digits
Any isolated sequence of digits (i.e., one not adjacent to temporary tokens) is considered one single token and receives the features TEMP and DIGIT. The UW is considered to be the digit sequence itself.
input | dictionary | tokenization output
---|---|---
1234 | empty | [1234] {} "1234" (TEMP,DIGIT) <,,>;
1,234 | empty | [1,234] {} "" (TEMP) <,,>;
1,234 | [,] {} "" (COMMA) <,,>; | [1] {} "1" (TEMP,DIGIT) <,,>; [,] {} "" (COMMA) <,,>; [234] {} "234" (TEMP,DIGIT) <,,>;
abc123 | empty | ["abc123"] {} "" (TEMP) <,,>;
abc123 | [abc] {} "abc" (A) <,,>; | [abc] {} "abc" (A) <,,>; [123] {} "123" (TEMP,DIGIT) <,,>;
Tokenization of temporary entries
Any sequence of characters other than an isolated digit sequence (including special symbols, such as spaces and punctuation signs) not found in the dictionary is considered a temporary entry and receives the feature TEMP. The UW is considered to be empty.
input | dictionary | tokenization output
---|---|---
asdfg | empty | ["asdfg"] {} "" (TEMP) <,,>;
asdfg hijkl | empty | ["asdfg hijkl"] {} "" (TEMP) <,,>;
asdfg hijlk | [ ] {} "" (BLK) <,,>; | ["asdfg"] {} "" (TEMP) <,,>; [ ] {} "" (BLK) <,,>; ["hijlk"] {} "" (TEMP) <,,>;
Tokenization of natural language entries
- Dictionary
- [abcde]{}""(...)<...>;
- [abcd]{}""(...)<...>;
- [bcde]{}""(...)<...>;
- [abc]{}""(...)<...>;
- [bcd]{}""(...)<...>;
- [cde]{}""(...)<...>;
- [ab]{}""(...)<...>;
- [bc]{}""(...)<...>;
- [cd]{}""(...)<...>;
- [de]{}""(...)<...>;
- [a]{}""(A1,...)<...>;
- [a]{}""(A2,...)<...>;
- [b]{}""(B1,...)<...>;
- [b]{}""(B2,...)<...>;
- [c]{}""(C1,...)<...>;
- [c]{}""(C2,...)<...>;
- [d]{}""(D1,...)<...>;
- [d]{}""(D2,...)<...>;
- [e]{}""(E1,...)<...>;
- [e]{}""(E2,...)<...>;
- Case 1
- input: "abcde"
- disambiguation rule: NONE
- tokenization output: [abcde] (the longest entry in the dictionary)
- Case 2
- input: "abcde"
- disambiguation rule: ("abcde")=0; (prohibits the sequence "abcde")
- tokenization output: [abcd][e] (the second longest possibility from left to right according to the dictionary)
- Case 3
- input: "abcde"
- disambiguation rules: ("abcde")=0;("abcd")("e")=0;
- tokenization output: [a][bcde] (the third longest possibility from left to right according to the dictionary)
- Case 4
- input: "abcde"
- disambiguation rules: ("abcde")=0;("abcd")("e")=0;("a")("bcde")=0;
- tokenization output: [abc][de] (the fourth longest possibility from left to right according to the dictionary)
- Case 5
- input: "abcde"
- disambiguation rules: ("/.{3,5}/")=0; (prohibits any token made of 3, 4 or 5 characters)
- tokenization output: [ab][cd][e]
- Case 6
- input: "abXcYde"
- disambiguation rule: NONE
- tokenization output: [ab][X][c][Y][de] (where [X] and [Y] are temporary entries)
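Cases 1 to 5 can be reproduced with a small candidate-ranking sketch. This is not the actual UNL algorithm: the ranking heuristic (longest token anywhere in the sequence wins first, with leftmost-longest as tie-breaker) is inferred from the outputs of Cases 1-5, and blocking rules are given as explicit prohibited sequences rather than grammar rules. Function and parameter names are hypothetical; TEMP fallback and frequencies are omitted.

```python
def tokenize(text, entries, blocked=()):
    """Longest-first tokenization with blocking disambiguation rules
    (sequences whose probability is set to 0). Sketch only."""
    def segment(rest):
        # Enumerate every segmentation of `rest` into dictionary entries.
        if not rest:
            yield ()
            return
        for e in entries:
            if e and rest.startswith(e):
                for tail in segment(rest[len(e):]):
                    yield (e,) + tail

    blocked = {tuple(b) for b in blocked}

    def rank(cand):
        # Longest tokens win first; leftmost-longest breaks ties.
        lens = tuple(len(t) for t in cand)
        return (tuple(sorted(lens, reverse=True)), lens)

    for cand in sorted(set(segment(text)), key=rank, reverse=True):
        if cand not in blocked:
            return list(cand)
    return None
```

With the dictionary above, blocking [abcde] yields [abcd][e]; blocking that as well yields [a][bcde], then [abc][de], mirroring Cases 1-4.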
Properties of tokenization
Positive disambiguation rules apply only over candidates for the same position and with the same length.
- Case 7
- input: "abcde"
- disambiguation rule: ("abcd")("e")=100; (induces the sequence [abcd][e])
- tokenization output: [abcde] (positive disambiguation rules do not prevail over the principle of longest first)
- Case 8
- input: "abcde"
- disambiguation rule: ("/.{2,5}/")=0; (prohibits any token made of 2, 3, 4 or 5 characters)
- tokenization output: [a,A1][b,B1][c,C1][d,D1][e,E1] (among the candidates, these entries come first in the dictionary)
- Case 9
- input: "abcde"
- disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (induces the sequence [A2][B2])
- tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (A2 and B2 were chosen instead of A1 and B1)
Positive disambiguation rules apply according to their probability and, within the same probability, according to their order in the grammar.
- Case 10
- input: "abcde"
- disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (A2)(B1)=1; (induces the sequence [A2][B1])
- tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (the rule (A2)(B2) has the same probability as (A2)(B1) and comes first in the grammar)
- Case 11
- input: "abcde"
- disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (A2)(B1)=2; (induces the sequence [A2][B1])
- tokenization output: [a,A2][b,B1][c,C1][d,D1][e,E1] (the priority of the rule [A2][B1] is higher than [A2][B2])
- Case 12
- input: "abcde"
- disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (B1)(C2)=2; (induces the sequence [B1][C2])
- tokenization output: [a,A1][b,B1][c,C2][d,D1][e,E1] (the priority of the rule [B1][C2] is higher than [A2][B2])
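The behavior of Cases 8 to 12 can be sketched as follows. This is an inferred model, not the actual engine: positive rules are tried by priority (highest first) and, within the same priority, in grammar order; a rule fires only if every sense it names is still free or already fixed to that sense, and firing fixes those senses. All names and the list-based candidate representation are assumptions of the sketch.

```python
def choose_senses(tokens, candidates, rules):
    """Pick one dictionary sense per token.

    `candidates` maps each token to its senses in dictionary order
    (the first is the default); `rules` is a list of (sequence,
    priority) pairs such as (("A2", "B2"), 1)."""
    fixed = {}  # token index -> chosen sense
    # Stable sort: grammar order is preserved within equal priority.
    for seq, _prio in sorted(rules, key=lambda r: -r[1]):
        for start in range(len(tokens) - len(seq) + 1):
            span = range(start, start + len(seq))
            ok = all(
                seq[k - start] in candidates[tokens[k]]
                and fixed.get(k, seq[k - start]) == seq[k - start]
                for k in span
            )
            if ok:  # the rule fires here and fixes these senses
                for k in span:
                    fixed[k] = seq[k - start]
                break
    return [fixed.get(i, candidates[t][0]) for i, t in enumerate(tokens)]
```

Under this model, Case 12's higher-priority rule (B1)(C2)=2 fires first and fixes B1 and C2, which then prevents (A2)(B2)=1 from firing, so A keeps its default sense A1.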
The attribute #FINAL is used to indicate context
- Case 13
- input: "abcde"
- disambiguation rule: ("/.{2,5}/")=0; (A2)(B2)=1; (B1,#FINAL)(C2)=5; (if B1 is fixed, C2 must be chosen)
- tokenization output: [a,A2][b,B2][c,C1][d,D1][e,E1] (the same output as in case 9; compare with case 12: the rule (B1,#FINAL)(C2) does not induce the choice of B1; it applies only to C2, i.e., it indicates that, if B1 has been fixed, then C2 must be chosen)
The tokenization algorithm backtracks in case of dead ends.
When backtracking, it merges nodes from left to right, considering their size: the shortest combinations are tried first.
- Case 14
- input: "abcde"
- dictionary:
[a]{}""(...)<...>;
[b]{}""(...)<...>;
[c]{}""(...)<...>;
[d]{}""(...)<...>;
[e]{}""(...)<...>;
- disambiguation rule: ("a")("b")("c")("d")("e")=0;
- tokenization output: [ab][c][d][e] (the first combination of shortest nodes from left to right after [a][b][c][d][e], which was blocked by the disambiguation rule; note that [ab] is not in the dictionary and will therefore be treated as TEMP)
Retokenization
Only temporary nodes (i.e., nodes having the feature TEMP) may be split, but the feature TEMP may be assigned to any node.
Examples:
- Infixation
("abcd",INFIXED,%x,^TEMP):=(%x,+TEMP); (assigns TEMP to the node %x if it is not TEMP and makes it available for retokenization)
After that, the node may be retokenized by rules such as:
("ab",INFIXED,%x)("cd",INFIXED,%y):=(%x)("1",%z)(%y); (the node ("abcd") becomes ("ab")("1")("cd"))
In order to merge the node again, we have to use a merge rule:
("ab",INFIXED,%x)(%z)("cd",INFIXED,%y):=(%x&%z&%y,-INFIXED,+REMOVE_TEMP); (the three nodes ("ab")("1")("cd") become a single node ("ab1cd"))
Then, we may remove the feature TEMP:
(REMOVE_TEMP,%x,TEMP):=(%x,-TEMP);
- Duplication of the last character
("abcd",DLC,%x,^TEMP):=(%x,+TEMP); (assigns TEMP to the node %x if it is not TEMP and makes it available for retokenization)
After that, the node may be retokenized by rules such as:
("/.+/",DLC,%x)("/./",DLC,%y):=(%x)(%y)(%y); (the node ("abcd") becomes ("abc")("d")("d"))
In order to merge the node again, we have to use a merge rule:
(DLC,%x)(DLC,%z)(DLC,%y):=(%x&%y&%z,-DLC,+REMOVE_TEMP); (the three nodes ("abc")("d")("d") become a single node ("abcdd"))
Then, we may remove the feature TEMP:
(REMOVE_TEMP,%x,TEMP):=(%x,-TEMP);
Observations:
- Splitting operations affect only strings. All the features of the source node, including headword and UW, are preserved in the target nodes.
Given the node ("abcd",[abcd],[[abcd]],A),
the retokenization rule ("ab",%x)("cd",%y):=(%x)("1",%z)(%y);
does not affect the original headword, UW and feature, i.e.:
the nodes ("ab") and ("cd") still have, each one, the features [abcd],[[abcd]],A.
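The observation above can be sketched as a split operation over a node record. The dict-based node representation and the function name are assumptions of this sketch, not the actual data structure: only the string is divided, while headword, UW, and features are copied to every part.

```python
def split_node(node, parts):
    """Split a temporary node into several nodes (sketch).

    Only TEMP nodes may be split; the parts may include inserted
    material (e.g. "abcd" -> "ab", "1", "cd"). Headword, UW, and
    features of the source node are preserved in every target node."""
    assert "TEMP" in node["features"], "only TEMP nodes may be split"
    return [{**node, "string": p, "features": set(node["features"])}
            for p in parts]
```

For example, splitting ("abcd",[abcd],[[abcd]],A) into "ab", "1", "cd" yields three nodes that each still carry the headword [abcd], the UW [[abcd]], and the feature A.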
Tokenization of UNL input
- Case 15
- Dictionary:
[x]{}"x"(...)<...>;
[xa]{}"x.@a"(...)<...>;
[xab]{}"x.@a.@b"(...)<...>;
[y]{}"y"(...)<...>;
- Input:
agt(x.@a.@b,y)
- Tokenization output
agt([xab],[y])
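Case 15 can be sketched as matching each UW argument of a relation against the dictionary. This is a simplified illustration, not the actual engine: the dictionary is modeled as a plain mapping from full UW strings (with attributes) to headwords, and only a single relation is handled rather than a whole UNL graph. The function name and interface are assumptions.

```python
import re

def tokenize_unl(relation, dictionary):
    """Replace each UW argument of a UNL relation with the headword
    whose dictionary UW matches it exactly (sketch)."""
    # Split "agt(x.@a.@b,y)" into the relation name and its arguments.
    name, args = re.fullmatch(r"(\w+)\((.*)\)", relation).groups()
    return "%s(%s)" % (name, ",".join("[%s]" % dictionary[a]
                                      for a in args.split(",")))
```

With the dictionary of Case 15 (UW "x.@a.@b" listed under headword [xab]), the input agt(x.@a.@b,y) tokenizes to agt([xab],[y]).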