Last updated at 11:21 am UTC on 27 January 2017
(Squeak 5.0, adapted from the class comment)
I represent a character by storing its associated Unicode code point as an unsigned 30-bit value. Characters are created uniquely, so that all instances for a particular Unicode code point are identical. My instances are encoded in tagged pointers in the VM, so-called immediates, and are therefore pure immutable values.
The code point (see the Unicode.org glossary and this wiki) is based on Unicode. Since Unicode is a 21-bit wide character set, several bits are available for other information. As the Unicode Standard states, a Unicode code point does not carry language information.
This is a problem for the so-called CJK languages. Since the characters of those languages are unified and given the same code point, it is impossible to pick the right glyph when displaying a bare Unicode code point in an inspector or similar tools. We therefore use the extra available bits to identify the language.
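The arithmetic behind the spare bits can be sketched as follows. This is a minimal illustration in Python, not Squeak code. The split into a 22-bit code field and an 8-bit language tag, the helper names `pack`, `char_code`, and `language_tag`, and the tag numbers 5 and 6 are all assumptions made for the example.

```python
# Sketch: storing a language tag above a Unicode code point in a 30-bit value.
# Assumption: the low 22 bits hold the code point, the high 8 bits hold the tag.
CODE_BITS = 22
TAG_BITS = 8
CODE_MASK = (1 << CODE_BITS) - 1


def pack(tag: int, code_point: int) -> int:
    """Combine a language tag and a code point into one 30-bit integer."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("language tag does not fit in the spare bits")
    return (tag << CODE_BITS) | code_point


def char_code(value: int) -> int:
    """Extract the bare Unicode code point."""
    return value & CODE_MASK


def language_tag(value: int) -> int:
    """Extract the language tag stored in the high bits."""
    return value >> CODE_BITS


# U+76F4 lies in the CJK Unified Ideographs block.
# The tag numbers 5 and 6 are hypothetical placeholders for two languages.
tagged_a = pack(5, 0x76F4)
tagged_b = pack(6, 0x76F4)
print(hex(char_code(tagged_a)), language_tag(tagged_a), language_tag(tagged_b))
```

Both values carry the same code point, but a display tool could consult the tag to choose a language-appropriate font.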
Since the old implementation used the bits to identify the character encoding, the bits are sometimes called the "encoding tag" or, neutrally, the "leading character", but strictly speaking they denote the language.
Characters of other languages can carry a language tag as well, if you like. This helps to break the large default font (font set) into separately loadable chunks of fonts. However, it is up to the native speakers and writers of each language to decide how to define character equality, since the same Unicode code point may carry different language tags, and thus a simple #= comparison may return false.
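The equality caveat can be illustrated with a small Python sketch. It is a model of the idea, not the Squeak implementation. The 22-bit code field, the function name `same_code_point`, and the tag numbers are assumptions chosen for the example.

```python
# Sketch: tagged values with the same code point compare unequal as integers.
# Assumption: a 30-bit value with the code point in the low 22 bits.
CODE_MASK = (1 << 22) - 1


def same_code_point(a: int, b: int) -> bool:
    """Compare two tagged values while ignoring their language tags."""
    return (a & CODE_MASK) == (b & CODE_MASK)


# Hypothetical tags 5 and 6 applied to the same code point U+76F4.
first = (5 << 22) | 0x76F4
second = (6 << 22) | 0x76F4

print(first == second)                 # False: the tags differ
print(same_code_point(first, second))  # True: the code points match
```

Whether a tag-insensitive comparison like this is the right notion of equality is exactly the decision left to the users of each language.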
The new Spur memory model supports immediate Characters.
Unicode e acute example