Universal Character Set (UCS) standard


Last updated at 10:12 am UTC on 7 December 2015

ISO/IEC 10646, the Universal Character Set (UCS) standard, defines two forms of encoding.

The more capacious form requires 31 bits per character, permitting the definition of a very large repertoire. Because a 31-bit character occupies four octets, this form is known as UCS-4.

The other form requires 16 bits (two octets) per character; hence it is called UCS-2. The 65,536 values that can be represented in UCS-2 are enough to encompass most of the characters used in contemporary languages, and UCS code values for them have been assigned in that range. The set of possible UCS-2 values therefore has another name, the Basic Multilingual Plane (BMP) of the UCS.
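
As a small illustration of the 16-bit limit, the following Python sketch counts the values two octets can hold and checks whether a code point falls inside the BMP. The names BMP_SIZE and in_bmp are invented for this example and are not part of ISO/IEC 10646 or of Squeak.

```python
# Hypothetical helper names, chosen only for this illustration.
BMP_SIZE = 2 ** 16  # number of distinct values in 16 bits (two octets)


def in_bmp(code_point: int) -> bool:
    """Return True if the code point fits in 16 bits (U+0000 to U+FFFF)."""
    return 0 <= code_point < BMP_SIZE


if __name__ == "__main__":
    print(BMP_SIZE)           # size of the 16-bit code space
    print(in_bmp(ord("A")))   # Latin capital A lies in the BMP
    print(in_bmp(0x10000))    # first code point beyond the BMP
```

A code point for which in_bmp returns False cannot be stored in a single UCS-2 value and needs UCS-4 or a transformation format.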

When this description was written, no UCS character assignments outside the BMP had been made, so the character repertoires and code value assignments in the BMP and in Unicode were the same.

In this sense Unicode and the UCS BMP were effectively synonymous. Unicode includes a mechanism, "surrogates," that provides access to roughly a million non-BMP code points (1,024 x 1,024 = 1,048,576). Characters have since been assigned in that range, beginning with Unicode 3.1 in 2001, as coverage of ideographs and other scripts became more comprehensive.
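
The surrogate mechanism can be sketched in Python as follows. The function names to_surrogates and from_surrogates are made up for this example; the constants 0xD800, 0xDC00 and 0x10000 are the ones used by UTF-16. Each surrogate carries 10 bits, which is where the figure of roughly a million (1,024 x 1,024) comes from.

```python
# Hypothetical function names; a sketch of the UTF-16 surrogate arithmetic.

def to_surrogates(code_point: int) -> tuple:
    """Split a non-BMP code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the surrogate-addressable range")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(hex(high), hex(low))
    print(hex(from_surrogates(high, low)))
```

The round trip through from_surrogates returns the original code point, so two 16-bit values together address one character outside the BMP.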

Another ISO/IEC 10646 concept is the UCS Transformation Format (UTF). UTFs are alternative representations of UCS-4 and UCS-2. They are designed to enable communication protocols to transfer UCS data without confusion or loss. A feature of variable-length UTFs such as UTF-8 and UTF-16 is that not all characters require the same number of bits.
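
To make the difference in size concrete, here is a minimal Python sketch that encodes a few characters with the standard codecs "utf-8", "utf-16-be" and "utf-32-be" and reports how many octets each needs. The helper name encoded_lengths and the choice of sample characters are assumptions made for this example.

```python
# Hypothetical helper; uses only Python's built-in str.encode codecs.

def encoded_lengths(ch: str) -> tuple:
    """Return the octet counts of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        len(ch.encode("utf-32-be")),  # big-endian, no byte order mark
    )


if __name__ == "__main__":
    # A, e with acute accent, euro sign, and one character outside the BMP.
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        utf8, utf16, utf32 = encoded_lengths(ch)
        print(f"U+{ord(ch):04X}: UTF-8 {utf8}, UTF-16 {utf16}, UTF-32 {utf32} octets")
```

In this sketch UTF-8 and UTF-16 vary with the character, while UTF-32 uses four octets for every character.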

https://en.wikipedia.org/wiki/Universal_Coded_Character_Set

Squeak uses UCS.

https://en.wikipedia.org/wiki/UTF-32

Check if this page needs to be updated.