Multilingual support - UTF-8
Last updated at 9:49 am UTC on 11 December 2015
UTF-8 is a character encoding capable of encoding all possible characters, or code points, in Unicode.
The encoding is variable-length and uses 8-bit code units. It was designed for backward compatibility with ASCII, and to avoid the complications of endianness and byte order marks in the alternative UTF-16 and UTF-32 encodings. The name is derived from: Universal Coded Character Set + Transformation Format - 8-bit. (http://Unicode.org)
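As a rough illustration of the variable-length property and the ASCII compatibility described above, the following Python sketch encodes one sample character from each length class. The helper name `utf8_length` and the sample characters are chosen here for illustration only; they do not come from the Unicode standard or from Squeak.

```python
def utf8_length(ch: str) -> int:
    """Return how many 8-bit code units UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# One illustrative sample per length class (escapes keep this file pure ASCII).
samples = {
    "A": "Latin capital A, U+0041",
    "\u00e9": "e with acute accent, U+00E9",
    "\u4e2d": "CJK ideograph, U+4E2D",
    "\U0001F600": "emoji outside the Basic Multilingual Plane, U+1F600",
}

for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{description}: {utf8_length(ch)} byte(s), hex {encoded.hex()}")

# Plain ASCII text is byte-for-byte identical in ASCII and in UTF-8.
assert "Squeak".encode("utf-8") == "Squeak".encode("ascii")
```

The four samples need 1, 2, 3 and 4 bytes respectively, which is the full range of lengths UTF-8 uses.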
UTF-8 is the dominant character encoding for the World Wide Web, accounting for 85.1% of all Web pages in September 2015 (Source).
XML is, by default, encoded as UTF-8.
Discussion in 1999 about implementing UTF-8 in Squeak
Date: Tue, 14 Sep 1999 12:24:38 -0700
From: Duane Maxwell
Subject: Re: Unicode support
I would suggest instead looking to implement one of the useful
transformations of Unicode, such as UTF-8. It's a variable-length encoding
which could still use the current ByteArray character string
representation, still be able to encode the entire Unicode space if
necessary, as well as be efficient for the extremely common 7-bit ASCII
case. The Unicode specification describes various algorithms for
conversion and manipulation of the various transformations, as well as
mappings to platform specific extended character sets.
Both XML and BeOS use UTF-8 as their default encoding.
From: "John Duncan"
Subject: Re: Unicode support
Date: Tue, 14 Sep 1999 21:39:07 -0400
UTF-8 is what is known as an output transformation. It is used to put
whatever is in memory into some other form that is more readily
digestible by other devices that expect 7-bit ASCII and its associated
zero-null-byte convention. The UTF-8 format specifies ways of storing
up to 4-byte characters without any nulls aligned on bytes.
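The point about null bytes can be checked with a small Python sketch. The helper name `contains_nul` is made up for this note, and UTF-16 (little-endian) stands in here for a fixed-width 2-byte representation such as UCS-2.

```python
def contains_nul(data: bytes) -> bool:
    """Return True if the byte string contains a zero byte."""
    return 0 in data

text = "Squeak \u4e2d\u6587"  # ASCII letters followed by two CJK ideographs

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# In UTF-8, only the character U+0000 itself produces a zero byte,
# so ordinary text stays safe for NUL-terminated string conventions.
print("UTF-8 contains NUL: ", contains_nul(utf8_bytes))

# In a 2-byte encoding, every ASCII character carries a zero high byte.
print("UTF-16 contains NUL:", contains_nul(utf16_bytes))
```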
UTF-8 is also more compact for the European languages, but it is very
lengthy for traditional Chinese, as all characters require 2 bytes and
some characters inevitably require 3 bytes.
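The size trade-off can be measured directly. The Python sketch below compares UTF-8 with a 2-byte encoding for a European and a Chinese sample; the helper name `encoded_sizes` and both sample strings are illustrative assumptions. Note that in UTF-8 as currently specified, code points from U+0800 to U+FFFF, which include the common CJK ideographs, take three bytes each.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Illustrative samples, written with escapes so the file stays pure ASCII.
european = "Gr\u00fc\u00dfe aus K\u00f6ln"  # German text with three non-ASCII letters
chinese = "\u6f22\u5b57"                    # two traditional Chinese characters

for label, text in (("European", european), ("Chinese", chinese)):
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: {len(text)} characters, "
          f"UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

For these samples UTF-8 is smaller for the European text and larger for the Chinese text, which matches the direction of the argument in the message.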
The problem with UTF-8 is that it is non-indexable. If you have a
string of characters, you can't make an assumption about where the nth
character is. To find out, you have to do a linear search. That makes
string indexing O(n) instead of O(1), which is unacceptable. If you
were to sort a UTF-8 string for some reason, the bubble sort would
actually have a lower order of magnitude than the quicksort.
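The linear search mentioned here can be sketched as follows. This is a minimal Python illustration that assumes well-formed UTF-8 input; the function name `byte_offset_of_char` is hypothetical and is not part of Squeak or of any library.

```python
def byte_offset_of_char(data: bytes, n: int) -> int:
    """Return the byte offset at which the n-th (0-based) code point starts.

    Assumes `data` is well-formed UTF-8. The scan is linear in the length
    of the data, because character boundaries are not at fixed positions.
    """
    if n < 0:
        raise IndexError("code point index must be non-negative")
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")

# 'a' (1 byte), U+00E9 (2 bytes), U+4E2D (3 bytes), 'b' (1 byte)
data = "a\u00e9\u4e2db".encode("utf-8")
offsets = [byte_offset_of_char(data, i) for i in range(4)]
print(offsets)
```

A fixed-width representation would compute the same offset as `n * width` with no scan, which is the O(1) behaviour the message contrasts with.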
So UTF-8 is not a very good memory format for characters. NT uses
UCS-2 (the 2-octet character set) in its native encoding, and UTF-8
for a lot of transfers to disk and the network. I'm not so sure that
Be uses UTF-8 in memory. I think you'd actually find they use UCS-2.
So I don't think it'd be good for someone to go through the hassle of
implementing a UTF-8 set of string methods. I like the idea of
bringing Unicode into Squeak. But there's a lot more involved than
just adding 2-byte arrays.
For example, you will want to store method strings in UTF-8, because
they aren't allowed to carry characters larger than 7 bits. But you'd
have to make sure that they get transformed properly for other
purposes. You will have to provide alternate input/output routines for
files because you shouldn't store text files in UCS-2. There are many
considerations and I recommend that you read the standard, and all,
before going ahead and doing it.