Unicode collation

Last updated at 10:34 am UTC on 16 December 2015

As of December 2015, no Unicode collation sequences have been implemented in Squeak.

Unicode collation charts
http://unicode.org/charts/collation/

Unicode collation algorithm
http://unicode.org/reports/tr10/

Note by Dale H. about implementation problems:
(Mailing list December 2015)

I think that the issue (from a performance perspective) is that you can't depend upon the value of the code point when doing collation: the main algorithm[5] is pretty much table based. In addition to the different sort orders based on characters, there are even more arcane sort rules where characters at the end of a word can affect the sort order of the word (for more info see [4]).

It is worth looking at the Conformance section of the Unicode spec[1] as there are different levels of collation conformance ...

ICU conforms[2] to UTS #10[3], the highest level of conformance ...

It looks like TwitterCLDR[6] uses the Main Algorithm[5] with tailoring[7]. They don't claim to be conformant to the Unicode Collation Algorithm[3], but they are covering a big chunk of the standard use cases ...

Dale

[1] http://unicode.org/reports/tr10/#Conformance
[2] http://userguide.icu-project.org/collation
[3] http://www.unicode.org/reports/tr10/
[4] http://www.unicode.org/reports/tr10/#Introduction
[5] http://www.unicode.org/reports/tr10/#Main_Algorithm
[6] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
[7] http://unicode.org/reports/tr10/#Tailoring
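
Illustration (not part of Dale's note)

The point in the note above, that collation cannot rely on code point values and is instead table based, can be sketched roughly as follows. This is a minimal Python sketch, not Squeak code and not a conformant implementation of UTS #10. The weight table TOY_WEIGHTS and the function toy_sort_key are hypothetical names, and the weights are invented for this example rather than taken from the Unicode default collation table. The only idea borrowed from the main algorithm is that each character is looked up in a table of multi-level weights and the sort key lists all primary weights first, then all secondary weights, then all tertiary weights.

```python
# Toy illustration only: these weights are invented for this sketch and are
# NOT taken from the Unicode default collation element table (DUCET).
# Each entry is (primary, secondary, tertiary):
#   primary   = base letter
#   secondary = accent
#   tertiary  = case
TOY_WEIGHTS = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),
    "\u00e4": (1, 1, 0),  # a with diaeresis
    "b": (2, 0, 0),
    "B": (2, 0, 1),
    "z": (3, 0, 0),
    "Z": (3, 0, 1),
}


def toy_sort_key(word):
    """Build a multi-level key: all primary weights, then secondary, then tertiary."""
    weights = [TOY_WEIGHTS[ch] for ch in word]
    primary = tuple(w[0] for w in weights)
    secondary = tuple(w[1] for w in weights)
    tertiary = tuple(w[2] for w in weights)
    return (primary, secondary, tertiary)


words = ["Zb", "ab", "\u00e4b", "Ab", "za"]

# Plain sorted() compares code point values, so all uppercase letters come
# before all lowercase letters, and the accented letter comes last.
by_code_point = sorted(words)

# Table-based ordering: base letters decide first, accents and case only
# break ties between otherwise equal words.
by_toy_table = sorted(words, key=toy_sort_key)

print(by_code_point)
print(by_toy_table)
```

Every comparison in the second sort goes through a table lookup instead of a direct integer comparison, which is the performance concern raised in the note. A real implementation would also have to handle contractions, expansions, normalization and tailoring, none of which this sketch attempts.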