Multilingual support - Character objects


	links to this page:

Last updated at 10:56 am UTC on 6 December 2015

Character class comment (Squeak 5.0)

I represent a character by storing its associated Unicode as an unsigned 30-bit value. Characters are created uniquely, so that all instances of a particular Unicode are identical. My instances are encoded in tagged pointers in the VM, so called immediates, and therefore are pure immutable values.

The code point is based on Unicode. Since Unicode is 21-bit wide character set, we have several bits available for other information. As the Unicode Standard states, a Unicode code point doesn't carry the language information. This is going to be a problem with the languages so called CJK (Chinese, Japanese, Korean. Or often CJKV including Vietnamese). Since the characters of those languages are unified and given the same code point, it is impossible to display a bare Unicode code point in an inspector or such tools. To utilize the extra available bits, we use them for identifying the languages. Since the old implementation uses the bits to identify the character encoding, the bits are sometimes called "encoding tag" or neutrally "leading char", but the bits rigidly denotes the concept of languages.

The other languages can have the language tag if you like. This will help to break the large default font (font set) into separately loadable chunk of fonts. However, it is open to the each native speakers and writers to decide how to define the character equality, since the same Unicode code point may have different language tag thus simple #= comparison may return false.