Multilingual Support - Notion of a word
Last updated at 3:53 pm UTC on 3 September 2004
Question: Aug 6, 2004 I'm trying to write a parser for binary files (Excel). As I read from the stream byte by byte, how can I 'parse' the byte, i.e. give it its meaning or create the proper object? Does it really come down to if byte = xxx ... else ... else ...? I mean, this doesn't seem very OO to me.

Answer: Markus Gaelli Maybe you want to use a parser generator like SmaCC, available on SqueakMap (SmaCC Smalltalk Compiler-Compiler-Development). Info about SmaCC is at http://www.refactory.com/Software/SmaCC; in particular, http://www.refactory.com/Software/SmaCC/Scanner.html covers how to scan binary data (hex chars).

But maybe the easiest approach for you would be to export the Excel files as comma-separated files, and then just parse them (assuming you have them in a string) a la:
  ('a,b c,d' findTokens: (String with: Character cr))
      collect: [:aLine | aLine findTokens: ',']
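To sketch that on a string with more than one line (the CSV content below is made up for illustration; note that findTokens: drops empty tokens, so empty cells simply vanish):

```smalltalk
"Split a CSV string into rows of fields, one collection of tokens per line."
| csv rows |
csv := 'name,qty', (String with: Character cr),
       'apple,3', (String with: Character cr),
       'pear,5'.
rows := (csv findTokens: (String with: Character cr))
    collect: [:aLine | aLine findTokens: ','].
((rows at: 2) at: 1)
"=> 'apple'"
```

If empty cells matter, a real CSV reader that respects quoting and consecutive commas would be needed; findTokens: is only a quick tokenizer.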

[23-Sep-1999 / hjh] The following email states that the notion of a word might not be obvious in all languages. Todd Blanchard

S:Some languages do not have delimiters between words at all.
Q: What languages apart from ideographic languages don't use delimiters?
A: German has the lovely habit of running multiple words together to make bigger and bigger words. You often want to navigate on the constituent words in the superword. There are software text editors that do this correctly. Some East Asian languages also have different ways of busting up things into words that don't relate to delimiters.

Further, such a class doesn't need the notion of tokenization, by definition, I suppose, unless there are END-OF-TOKEN forms of letters, in which case the model of tokens suggested would not be applicable. They would, of course, resolve that issue in a subclass implementation that ignores the token parameter. But rather than get pedantic about that - let's divorce "tokenization" from the concept of word-spotting.
Why? I think the point that was made here by others, and with which I agree, is that tokenization is an appropriate string operation, and semantic "word-spotting" is probably not.
I don't agree.

String is a mechanism for representing language, and languages typically have words. Tokens are something else - more arbitrary. We got here because of this:
A newbie recently asked how to compute the equivalent of:
  word 4 of line 7
  set word 4 of line 7 to "foobar"
Which I think is certainly an appropriate operation for String.
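As a minimal sketch of that operation in Squeak, using the same delimiter-based findTokens: approach as the snippet above (the text, line number, and word number are just sample values):

```smalltalk
"Delimiter-based sketch of 'word 4 of line 2'; simpleminded on purpose."
| text line |
text := 'alpha beta gamma delta', (String with: Character cr),
        'one two three four'.
line := (text findTokens: (String with: Character cr)) at: 2.
(line findTokens: ' ') at: 4
"=> 'four'"
```

This is exactly the kind of simpleminded delimiter tokenization whose limits are discussed next.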

OTOH, this operation cannot always be as simpleminded as delimiter-based tokenization.

Although - by your definition of what a string is - perhaps that's not appropriate.

Q: Take: hebr3 "This is a fine mess" hebr2 hebr1 hebr0
If you iterate through the tokens, in what order would you expect to get the tokens?
A: You are imposing a particular string-based sequencing of information onto the semantics of an underlying language, and then asking me to describe how the tokens are derived. My suggestion is that if you want the tokens to be semantically meaningful, then your program (or a subclass of String) must first organize the sequence of characters so that the purely mechanical, non-semantic sequencing will yield a meaningful result. Thus, the answer is this: they would be the tokens, read left to right or right to left in sequence, as defined by the delimiters. Which is probably useless for anything but single-direction languages.

It is not the duty of the String object to understand the semantics of the underlying language in which characters are represented, but only to provide underlying operations with which most reasonable operations (including semantics-based operations) might be accomplished.

Well, which is it? String is either a class for representing chunks of languages, or it's a mechanism for representing arrays of characters (whatever those are - a whole other topic). I think String is implemented as the latter but used as the former, and we English speakers are lucky in that these just happen to coincide. Unfortunately, the coincidence is a rather lucky fluke with English and not something you can rely on globally.

Peter William Lount
Arrays are not suitable for GeneralStrings because they are not easily expandable in size. Strings must be growable in size.

OrderedCollection would be a better bet for behaving like a GeneralString, or even for a GeneralString superclass.

Q: Am I misunderstanding Peter's suggestions, or did he mean to say 4 bytes per CHARACTER in a string as opposed to 1 byte per character? As I understood it, all string objects are themselves objects: it is only their contents that are byte data.

Yes, I meant to clearly say 4 bytes per Character object in a string, contrasted with 1 byte per byte character code in a string. The 4 bytes are for the object pointer to the character object (or any other object that understands the character protocols, which can make things interesting). The character objects themselves would take up the normal space that any object takes with its instance variables.

It's interesting that Mike Klein mentions "words". One of the reasons to move to a fully object-based model for strings is that it would allow "objects" of many kinds to be put into strings, as long as they understand "character object protocols". For example, a "word object" that represents "hi", a character object that represents the "space" character, and a word object that represents "there" would occupy only three 32-bit pointers, or 12 bytes, in an object-based string. In a byte string these would occupy 8 characters. Byte encoding wins the space race with short strings. But as you add words to the "symbol set" or "dictionary of words", the efficiency of storing "words" in strings would mean that "generic strings" could use less storage than "byte encoded strings". The storage space saved could be quite large.
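The arithmetic for that "hi there" example, as a back-of-envelope Squeak expression (assuming 32-bit pointers, and ignoring the storage of the word objects themselves, which amortizes across every string that reuses them):

```smalltalk
"Byte string: one byte per character. Object string: one pointer per element."
| byteCost pointerCost |
byteCost := 'hi there' size.   "8 bytes, one per character"
pointerCost := 3 * 4.          "three 32-bit pointers: 'hi', space, 'there'"
pointerCost - byteCost
"=> 4; the byte string wins here, until the word objects are shared"
```

The crossover comes when the same word objects appear in many strings: each reuse costs one pointer instead of one byte per character.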

In a sense strings or objects that behave as strings or characters could be nested within strings. One of the protocols that "character objects" would need to respond to is "bottom out in characters" - that is replace yourself with the characters that represent you (thus eliminating the nesting and flattening out the string). The concept of a word is just a group of characters that may or may not have some meaning to us humans.

Simply put, by using "word objects" a generic string object can be made to be much more space efficient than any potential character-based string - byte or object oriented. Thanks for reminding me of this one, Mike.

Furthermore, word objects work fine with any human language that forms "characters" into words. Why stop at words when clumps of words or phrases and entire sentences and paragraphs could be nested and nested and ....

Another advantage of storing words in strings is that they are already in "token" format and might provide some time savings for parsing and lexical processing.

Yet another idea for words is the "dynamic words" I mentioned in an earlier message on this topic. A placeholder "word" object could be put into a string. This placeholder object is actually a "variable word" that is linked to some "source" which supplies the "characters" that make up the "dynamic variable word". Say a "total amount" for an "invoice". So when the string is displayed, it shows the "current total amount" as number characters provided by the "accounting source object" when the "dynamic variable word" is asked for its characters. Formatting information could also be present in an "environment" that is passed into the "display" method. This would allow for different formatting based on "preferences or localizations based on the country or language choices of the user". If the user clicks on the "characters" that make up this "dynamic variable word", the system could find that they are really clicking on the "total amount" word object, which is linked into the accounting objects in the system. This allows the user and the system to quickly get to the objects behind the "graphical user interface" and activate appropriate windows or whatever....
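The "dynamic word" idea could be sketched in Squeak roughly as follows. Everything here - the class name, the selectors, the protocol - is a hypothetical illustration, not existing Squeak code:

```smalltalk
"Hypothetical sketch: DynamicWord and its protocol are invented for illustration."
Object subclass: #DynamicWord
	instanceVariableNames: 'source'
	classVariableNames: ''
	category: 'GeneralString-Sketch'.

DynamicWord >> source: anObject
	"Link this placeholder word to the object that supplies its value,
	e.g. an invoice answering its current total amount."
	source := anObject.

DynamicWord >> characters
	"Bottom out in characters: ask the source for its current value,
	rendered as a plain string of characters."
	^ source asString
```

A display method walking the string would send characters to every element; ordinary character and word objects answer themselves, while a DynamicWord answers whatever its source currently holds.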

There are also advantages to storing string information in a hierarchical format in a general object based string. This nesting is what XML and HTML essentially do. The string form is their flattened form, while a hierarchy form exists that lets you manipulate the structure represented by the "flattened string". Some food for thought as this could impact Smalltalk's success as an internet solution tool.

Actually, reflecting on this it might even be more space efficient for a PDA minimal footprint version of Smalltalk to use object based strings with "character and word objects" rather than just a byte encoded string approach. We will have to get our calculators out to test this idea.

Certainly Perl is powerful and is often touted as a "powerful string manipulation language". Its name comes from "Practical Extraction and Report Language". Extraction of what? Text. Almost all of the Web technologies like HTTP, HTML, SGML, XML, FTP, SMTP, etc. are "text" or "string" based technologies. String manipulation languages and sub-languages like Perl and its "regular expressions" are very effective in dealing with these Internet technologies. I would consider these essential to any powerful string object.

Text parsing is also an area that can assist with Web technologies.

By strengthening and expanding Smalltalk's abilities to work concisely with "text" information, we can improve its success and usefulness in implementing web and internet solutions. One of the reasons for Perl's success is concise string manipulation.