Multilingual Support - Implementation strategy


Last updated at 2:56 pm UTC on 23 September 1999

Which strategy should be chosen for implementing multilingual support?

The following two emails show opposing views:

Title Unicode support
Author Dean_Swan@mitel.ca

From: Dean_Swan@mitel.ca
Date: Wed, 22 Sep 1999 18:37:05 -0400
Subject: Re: Re: Unicode support

From: Dean Swan@MITEL on 09/22/99 06:37 PM

Ok, I've quietly read through enough of this thread that I feel compelled to
add my $.02 here. John's message just pushed me over the edge. I need to say
this loud and clear because this whole thread has swerved so far off into la la
land that I just can't be quiet any longer.

1. Complex systems are EVOLUTIONARY - they grow incrementally. They DO NOT
spring fully formed from Zeus's forehead.

2. Nothing useful ever gets done by committee.

3. Just DO something.

4. Don't be a Ted Nelson. Ted Nelson, while a fairly bright guy, spent over
twenty years tripping over himself with the Xanadu Project because he kept going
full circle on himself, much like this thread. (i.e. I need a GeneralizedString
class.... Now, if only I had a GeneralizedString class, I'd know what one should
be..... and on, and on.)

5. Anybody ever hear the term "Exploratory Programming" in conjunction with
Smalltalk? Try it, you'll like it!

Just like the whole nonsense thread about eliminating assignments a while back,
if there is a problem out there that needs solving, build a solution and share
it with the rest of us. (BTW, has anybody actually implemented any code that
eliminates assignments in any of the fashions discussed previously? What
problem(s) is it helping you solve?).

And as for the "Squeak Word Processor", even with BookMorphs, and all the other
cool stuff in Squeak, I don't think Squeak even does a good job as a word
processor for our old-fashioned-non-object-oriented-byte-encoded Strings as it
is (too much code, not enough UI).

Ok, enough of my cynical rant for now. I don't mean to take the wind out of
anybody's sails, but I've seen too many projects start out with "Let's really do
this right and show 'em all!" only to end up going nowhere fast, because
they're trying to do too much all at once. Small, incremental improvements
always have, and always will rule. Revolution is the result of undersampling the
observed system.

-Dean Swan
dean_swan@mitel.com

"John Duncan" <"jddst19+"@pitt.edu> on 09/22/99 12:50:40 AM

Please respond to squeak@cs.uiuc.edu

To: squeak@cs.uiuc.edu
cc: (bcc: Dean Swan/Ogd/Mitel)

Subject: Re: Re: Unicode support

> If, for instance, one needed to convey a
> language which wasn't at all character based –
> say, Egyptian hieroglyphics perhaps? – don't
> you think the string model would just break?
> I don't think it could be used for general
> glyphs, because they would suffer kerning and
> other ills a good deal.

Hmm. I don't know. Oddly enough, there is a proposed script to fit in
ISO-10646-2, UCS-4, Plane 1, which would provide Basic Egyptian
Hieroglyphics, containing 798 glyphs. It does not describe
implementation details for writing in BEH, but it shows that there is
consideration. Linear B is also proposed for Plane 1. Plane 1 is where
many characters that are not used in popular communication go. Most
people will only need BEH for academic stuff.

Someone on the list already suggested developing a fast cache
algorithm, and they suggested 256 characters. I suggest that this is
an implementation detail, and we may well find that (1) the ISO-8859-1
characters should be cached separately in 256 positions, and (2)
another cache should be developed that's expandable at least to 4
32K, one character per 32-bit word. In most Japanese communication,
there should be no use for more than 2400 characters; 100 kana and
approx. 2200 standardized kanji. But in Chinese and especially Korean
communication, that number could easily grow to 20,000. The system
should cache the characters actually in use.
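
Note added for illustration (not part of the original message): the two-cache idea above might be sketched roughly as follows. This is only a sketch under assumptions. The class name `TwoTierGlyphCache`, the `render` callback, the default capacity, and the least-recently-used eviction policy are all hypothetical choices; the message itself only asks for a fixed 256-entry table for ISO-8859-1 and a second, expandable cache for the characters actually in use.

```python
from collections import OrderedDict


class TwoTierGlyphCache:
    """Hypothetical two-tier cache keyed by code point (illustration only)."""

    def __init__(self, render, capacity=32 * 1024):
        self._render = render          # assumed callback: code point -> glyph
        self._latin1 = [None] * 256    # fixed table for ISO-8859-1 (0..255)
        self._others = OrderedDict()   # grows with the characters actually used
        self._capacity = capacity      # arbitrary placeholder upper bound

    def glyph(self, code_point):
        # Tier 1: direct array lookup for the 256 Latin-1 code points.
        if code_point < 256:
            cached = self._latin1[code_point]
            if cached is None:
                cached = self._latin1[code_point] = self._render(code_point)
            return cached

        # Tier 2: everything else, kept in most-recently-used order.
        if code_point in self._others:
            self._others.move_to_end(code_point)
            return self._others[code_point]

        glyph = self._render(code_point)
        self._others[code_point] = glyph
        if len(self._others) > self._capacity:
            self._others.popitem(last=False)   # evict the least recently used
        return glyph
```

With this shape, a text that is mostly Latin-1 never touches the second tier, while a Japanese or Chinese text fills it only with the characters it really uses.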

Indexing of text in languages other than English is a sticky business
in any encoding, because most languages other than French, German and
Spanish use composed characters that should be treated as one. Thus,
the indexing can fail.
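
Note added for illustration (not part of the original message): the indexing problem can be seen with a word whose letters are written as a base character followed by combining marks. The sketch below uses Python's standard `unicodedata` module; the helper name `clusters` is hypothetical, and the grouping rule (attach every character with a non-zero combining class to the preceding character) is a deliberate simplification that does not cover every script.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified: only characters with a non-zero canonical combining class
    are attached to the preceding character.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch
        else:
            result.append(ch)
    return result


# "Viet" where the e carries COMBINING DOT BELOW (U+0323)
# and COMBINING CIRCUMFLEX ACCENT (U+0302).
word = "Vie\u0323\u0302t"

code_point_count = len(word)             # counts every code point separately
user_visible_count = len(clusters(word)) # counts what a reader sees as letters
third_letter = clusters(word)[2]         # the e together with both marks
```

Indexing `word[2]` by code point returns a bare "e" and splits the letter from its marks, which is the kind of failure the paragraph above describes.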

Developing algorithms to properly collate languages is another
concern. Unicode specifies an algorithm, and it requires normalizing
the string and then collating it. This is because there are at least
two ways of encoding "office" in Unicode, o-f-f-i-c-e, and o-ffi-c-e.
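
Note added for illustration (not part of the original message): the "office" example can be reproduced with Python's standard `unicodedata` module. The function name `comparison_key` is hypothetical, and the sketch covers only the normalization step, not the collation algorithm that Unicode specifies. Compatibility normalization (NFKC) is assumed here because the ffi ligature U+FB03 is left unchanged by canonical normalization (NFC).

```python
import unicodedata

plain = "office"         # o-f-f-i-c-e
ligature = "o\uFB03ce"   # o-ffi-c-e, using U+FB03 LATIN SMALL LIGATURE FFI


def comparison_key(text):
    """Fold compatibility variants such as ligatures before comparing."""
    return unicodedata.normalize("NFKC", text)


raw_equal = plain == ligature
normalized_equal = comparison_key(plain) == comparison_key(ligature)

# Sorting on the normalized key keeps both spellings next to each other.
words = sorted(["o\uFB03ce", "offer", "office"], key=comparison_key)
```

Without the normalization step the two spellings compare as different strings, so a search or sort would treat them as different words.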

As I said, this is not a situation where someone should "get his hands
dirty" and get something out the door. There's too much prior research
to ignore. Let's not talk implementation until we have a Wiki sub-site
that discusses the important features of communication encoding and
text implementation. I think that text is relatively broken in all
systems, and here we have a chance to do it right, open source, for
all to see. Then, we can come out with a Squeak word processor in
which 25 experts from 25 countries produce the demo in their paper
about archaic languages. And we can make it fast, beautiful,
easy-to-use, and everything, as long as we have our heads on straight
while we do the work.

(Having the Swiki up and running would greatly aid this measure:) )

John