Character


Last updated at 12:11 pm UTC on 23 March 2020

(Squeak 5.0, adapted from the class comment)

I represent a character by storing its associated Unicode code point as an unsigned 30-bit value. Characters are created uniquely, so that all instances with the same value are identical. My instances are encoded in tagged pointers in the VM, so-called immediates, and are therefore pure immutable values.
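The uniqueness described above can be checked in a workspace. This is a minimal sketch, assuming a Spur-based image such as Squeak 5.0; the answers shown in the comments are what such an image is expected to return.

```smalltalk
"Asking for the same code point twice answers the very same object."
(Character value: 65) == $A.            "true: identical, not merely equal"
$A asInteger.                           "65"
(Character value: 16rE9) asInteger.     "233, the code point of e acute"
```

Because the value lives in the object pointer itself, #== and #= agree for Characters on such an image.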

The code point (see the Unicode.org glossary and this wiki) is based on Unicode. Since Unicode is a 21-bit wide character set, several bits are available for other information. As the Unicode Standard states, a Unicode code point does not carry language information.

This is a problem for the so-called CJK languages. Since the characters of those languages are unified and given the same code point, it is impossible to display a bare Unicode code point correctly in an inspector or similar tools. We therefore use the extra available bits to identify the language.

Since the old implementation used the bits to identify the character encoding, they are sometimes called the "encoding tag" or, neutrally, the "leading character", but the bits strictly denote the language.

Other languages can have a language tag as well, if you like. This helps to break the large default font (font set) into separately loadable chunks. However, it is up to the native speakers and writers of each language to decide how to define character equality: the same Unicode code point may carry different language tags, so a simple #= comparison may return false.
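The effect of the language tag on equality can be sketched as follows. The example assumes the selectors #leadingChar:code:, #leadingChar and #charCode as found in Squeak 5.0, and assumes that the tag occupies the 8 bits above a 22-bit code field. The tag value 5 is an arbitrary illustration and is not meant to name a particular language.

```smalltalk
"Same code point, two different leading characters (language tags)."
| plain tagged |
plain := Character leadingChar: 0 code: 16r4E00.
tagged := Character leadingChar: 5 code: 16r4E00.   "5 is an arbitrary, hypothetical tag"

plain charCode = tagged charCode.    "true: the code point is the same"
plain leadingChar.                   "0"
tagged leadingChar.                  "5"
plain = tagged.                      "false: the tags differ"

"Assumed layout: the tag sits above 22 bits of code point."
tagged asInteger = ((5 bitShift: 22) + 16r4E00).    "true under that assumption"
```

Code that should treat such characters as equal can compare #charCode instead of the characters themselves.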

The new Spur memory model supports immediate Characters (for details, see below).

Examples

Unicode e acute example

Character cr "answers the carriage return character"
$9 isDigit "true"
$K isAlphaNumeric "true"

On Fri, Feb 28, 2020 at 08:56:46PM +0200, Vaidotas Didžbalis wrote:
> Hello,
> Character allInstances size=0. Why?
> regards,
> Vaidotas

Hi Vaidotas,

If you are accustomed to Character as represented in earlier Squeak
(or other Smalltalk) versions, then this result will be surprising.

The reason is that the internal representation of Character has
changed in recent versions of Squeak. We now use an object memory
design (called "Spur") developed by Eliot Miranda that enables a
number of improvements and optimizations.

One important optimization is that instances of Character can now
be represented as "immediate" objects, in which the data for an
instance of a class is actually hidden directly in the "object
pointer" that refers to the instance. That instance is in every
way a real object, but is optimized in such a way that the information
(in this case the character value) is encoded directly ("immediately")
inside the object pointer that specifies the instance.

This optimization allows certain simple objects to be encoded very
efficiently. And it also has the somewhat surprising side effect
of making it appear that there are no "real" instances of the class
when you use #allInstances to scan the object memory looking for
instances of the class.

In earlier versions of Squeak, only class SmallInteger was able
to benefit from this optimization. So in older versions, you will
see this for the optimized SmallInteger:

      SmallInteger allInstances size ==> 0

But for Character, which did not benefit from an internal representation
as an "immediate" class, you might see something like this:

      Character allInstances size ==> 257

Since that time, Character has now been updated to be an immediate
class, and it now behaves much like SmallInteger. Thus on modern Squeak
you will now have:

      Character allInstances size ==> 0

Spur enabled another similar optimization for floating point objects.
On older Squeak versions, you will see that all instances of Float
are ordinary non-immediate objects. So you might see something like
this in an older image:

  Float allInstances size ==> 8362

But on newer Squeak images with the Spur object format, most floating
point objects can now be optimized with an immediate representation.
It depends on the actual numeric values involved, but most Float
objects can be represented by class SmallFloat64, which is an optimized
"immediate" representation that (like SmallInteger) appears to have
no instances when you scan the object memory:

  SmallFloat64 allInstances size ==> 0

However, some floating point values cannot be packed inside an object
pointer, and these need to be represented by "normal" objects. This
is done with class BoxedFloat64, so your image might show a number
of instances of these non-immediate objects:

  BoxedFloat64 allInstances size ==> 34

Modern Squeak images on Spur still have a class called Float. But
all of the actual instances of Float are represented by two concrete
subclasses. Floating point values that can be packed into an object
pointer are now instances of SmallFloat64, and values that cannot
fit into that optimized format are represented by the "normal"
class BoxedFloat64.

And of course the original class Float is now abstract, so it has
no instances at all:

  Float allInstances size ==> 0

Dave.

Examples
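The observations from the message above can be reproduced in a workspace. This is a sketch that assumes a Spur image and the selector #isImmediateClass as found in Squeak 5.0. Note that immediate floats are assumed to exist only on 64-bit Spur images, so the class answered for 1.5 depends on the image.

```smalltalk
"Which classes keep their instances inside the object pointer?"
Character isImmediateClass.       "true"
SmallInteger isImmediateClass.    "true"
BoxedFloat64 isImmediateClass.    "false: ordinary heap objects"

"Immediates are not found by scanning the object memory."
Character allInstances size.      "0"

"Float itself is abstract; literals are instances of a concrete subclass."
1.5 class.                        "SmallFloat64 on 64-bit Spur, BoxedFloat64 otherwise"
1.0e300 class.                    "BoxedFloat64: the exponent does not fit an immediate"
```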

See also