Example of a full canonical decomposition (Unicode)
Last updated at 7:53 pm UTC on 9 December 2015
http://www.unicode.org/reports/tr15/tr15-18.html#Introduction
1. Take the string with the characters "ác´¸" (a-acute, c, acute, cedilla)
testString :=
16r00E1 asCharacter asString,
'c',
16r0301 asCharacter asString,
16r0327 asCharacter asString.
testString size
4
2. The Unicode data file (UnicodeData.txt) contains the following relevant information:
code; name; ... canonical combining class; ... decomposition mapping; ...
0061;LATIN SMALL LETTER A;...0;...
0063;LATIN SMALL LETTER C;...0;...
00E1;LATIN SMALL LETTER A WITH ACUTE;...0;...0061 0301;...
0107;LATIN SMALL LETTER C WITH ACUTE;...0;...0063 0301;...
0301;COMBINING ACUTE ACCENT;...230;...
0327;COMBINING CEDILLA;...202;...
3. Applying the canonical decomposition mappings, we get "a´c´¸" (a, acute, c, acute, cedilla).
This is because 00E1 (a-acute) has a canonical decomposition mapping to 0061 0301 (a, acute). The other three characters have no decomposition mapping and are left unchanged.
testString asDecomposedUnicode asOrderedCollection collect: [:code | code asInteger printStringRadix: 16]
an OrderedCollection('16r61' '16r301' '16r63' '16r301' '16r327')
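The mapping step can also be traced by hand. The following is a minimal workspace sketch, not the implementation behind asDecomposedUnicode: the mappings dictionary is a hypothetical stand-in that holds only the single mapping needed here, copied from the data file excerpt in step 2. A full decomposition applies the mappings recursively; one pass is enough in this case because 0061 and 0301 have no further mappings.

```smalltalk
"Hypothetical, hand-built mapping table; a real implementation reads UnicodeData.txt."
| mappings input output |
mappings := Dictionary new.
mappings at: 16r00E1 put: #(16r0061 16r0301).

"Code points of the test string: a-acute, c, acute, cedilla."
input := #(16r00E1 16r0063 16r0301 16r0327).

"Replace each code point by its mapping, or keep it when it has none."
output := OrderedCollection new.
input do: [:code |
	output addAll: (mappings at: code ifAbsent: [Array with: code])].

"Evaluates to true: a, acute, c, acute, cedilla."
output asArray = #(16r0061 16r0301 16r0063 16r0301 16r0327)
```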
4. Applying the canonical ordering, we get "a´c¸´" (a, acute, c, cedilla, acute).
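This is because cedilla has a lower canonical combining class (202) than acute (230), so the adjacent pair acute, cedilla is exchanged. The positions of 'a' and 'c' are not affected, since they are starters (canonical combining class 0), and no combining mark is ever moved across a starter.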
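The reordering in step 4 can be sketched as a repeated exchange of adjacent code points. This is an illustrative workspace snippet under stated assumptions, not the code Squeak itself uses: the classes dictionary is a hypothetical table filled in by hand from the data file excerpt in step 2, and the input is the decomposed sequence from step 3.

```smalltalk
"Hypothetical, hand-built table of canonical combining classes (from step 2)."
| classes ordered swapped |
classes := Dictionary new.
classes
	at: 16r0061 put: 0;
	at: 16r0063 put: 0;
	at: 16r0301 put: 230;
	at: 16r0327 put: 202.

"Decomposed sequence from step 3: a, acute, c, acute, cedilla."
ordered := OrderedCollection withAll: #(16r0061 16r0301 16r0063 16r0301 16r0327).

"Exchange an adjacent pair when the second is not a starter and has a
 lower combining class than the first; repeat until nothing changes."
[swapped := false.
1 to: ordered size - 1 do: [:i |
	| first second |
	first := classes at: (ordered at: i).
	second := classes at: (ordered at: i + 1).
	(second > 0 and: [first > second]) ifTrue: [
		ordered swap: i with: i + 1.
		swapped := true]].
swapped] whileTrue.

"Evaluates to true: a, acute, c, cedilla, acute."
ordered asArray = #(16r0061 16r0301 16r0063 16r0327 16r0301)
```

Only the last two code points change places. The first acute stays attached to 'a' because the following 'c' is a starter, which the condition second > 0 refuses to move.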