Squeak
  links to this page:    
View this PageEdit this PageUploads to this PageHistory of this PageTop of the SwikiRecent ChangesSearch the SwikiHelp Guide
TextFiles Package
Last updated at 1:26 am UTC on 17 January 2006

Using TextFiles

[Note: This documentation can be obtain by evaluating TextFile openDoc].

TextFiles is a small package that simplifies dealing with text files in Squeak, in particular in the presence of diverging line end conventions as well as different encodings.

Table of Contents:

TextFile features


TextFile provides the following features:

TextFile creation


There are various ways of creating text files. One is to use the protocol support by file stream directly, for example:

	myFile := TextFile readOnlyFileNamed: 'test.txt'.

will open the file named 'test.txt' in read-only mode. All of the common file open operations of FileStream are supported by TextFile including
Class TextFile provide an interface that is quite similar to that of FileStream, namely:

The second way of creating a TextFile is to give it a stream to operate on.

Example:

	| myStream outFile inFile |
	myStream := ReadWriteStream on: String new.

	"Create a TextFile on myStream"
	outFile := TextFile on: myStream.

	"Set the encoding to ISO-8859-1"
	outFile encoding: #iso8859.

	"Write out some umlauts"
	outFile nextPutAll: 'ÄÖÜ'.
	outFile flush.

	"show the ISO8859-1 encoded contents (looks like garbage in Squeak)"
	Transcript cr; show: myStream contents.

	"Create another text file to read it back in"
	inFile := TextFile on: myStream.

	"Again, set the encoding to ISO-8859-1"
	inFile encoding: #iso8859.

	"And print the contents to the Transcript which will be umlauts again."
	Transcript cr; show: inFile contents.


The third way is to simply ask some stream for it's TextFile representation, e.g., instead of using 'TextFile on: someStream' use 'someStream asTextFile'.

Example:
	| myFile |
	"Ask FileList2 to give us a file and make sure it's a TextFile"
	myFile := FileList2 modalFileSelector asTextFile.
	"Print out each line of the file"
	[myFile atEnd] 
		whileFalse:[Transcript cr; show: myFile nextLine].
	myFile close.

TextFile encodings


TextFile is designed to support a variety of conversions, the most important of which are line end translations and character conversions.

By default, TextFile handles line end conversions by mapping all of Cr, Lf, and CrLf to simple Cr (which is the line end terminator in Squeak). When reading a text file, you will NEVER see a Lf in the input. When writing a TextFile will use the line end convention of the existing stream, or, if none can be determined, it will use the default platform convention (CrLf on Windows, Lf on Unix, etc). The line end convention for writing can be set manually, by providing either one of #cr, #lf, or #crlf respectively. For the rare case of not wanting any line end translation whatsoever the line end convention #none can be used. When using #none, clients must expect to see all combinations of Cr, Lf, or CrLf in the input stream (this is essentially binary mode).

Character encodings can be defined by supplying an appropriate encoding type to the TextFile, as seen in the above. The currently supported encodings include:

Hopefully, more encodings (such as UTF-8 etc) will be added in the future. Support for the encoders requires only a minimal interface so writing your own kinds of encoders for text is reasonably simple. See class TextEncoder and its subclasses for examples.

TextFile positioning


When dealing with (potentially encoded) text files, changing the 'file position' is a somewhat tricky problem since the correlation between the position of the TextFile and the position of the underlying stream is not necessarily 1:1. For example, when we have a file from Windows containing CrLf, the sequence CrLf will be condensed into a single character (Cr). Some encodings (such as UTF-8) have similar effects, splitting a character into a sequence of bytes, or merging a sequence of bytes into a single character.

TextFile therefore provides an explicit TextFilePosition when asked for its position. While numbers can be used to position a text file, this is typically (unless you position it to the very beginning) much less efficient than using a position obtained from the text file.

Example:
	| myFile pos1 pos2 index1 index2 quickSeek slowSeek |

	"Read 10k lines out of the Sources file"
	myFile := TextFile readOnlyFileNamed: Smalltalk sourcesName.
	10000 timesRepeat:[myFile nextLine].
	pos1 := myFile position. "a TextFilePosition"

	"Read another 10k lines out of the sources file"
	10000 timesRepeat:[myFile nextLine].
	pos2 := myFile position. "a TextFilePosition"

	"Now position it efficiently between pos1 and pos2"
	quickSeek := [10 timesRepeat:[myFile position: pos1; position: pos2]] timeToRun.
	Transcript cr; show: 'Time to seek using TextFilePosition: ', quickSeek printString, ' msecs'.

	"The integer index into the file stream can be obtained from a TextFilePosition"
	index1 := pos1 globalPosition.
	index2 := pos2 globalPosition.

	"See what happens when we use integer indexing"
	slowSeek := [10 timesRepeat:[myFile position: index1; position: index2]] timeToRun.
	Transcript cr; show: 'Time to seek using Integer: ', slowSeek printString, ' msecs'.

Note that the time for "slow seek" operations is dependent on the size of the file - the larger your files get the slower the operation will be. So in general, you should use positions obtained from the file instead of some integer index.

There is one degenerate case when using TextFilePosition which mostly occurs when seeking 'backwards', e.g., using something like 'myFile skip: -1'. If the backwards positioning operation crosses a buffer boundary, then positioning will be as slow as if specifying an integer index. Even though the statistic probability for this to happen is less than 0.025% (using the default buffer size) you should generally avoid to use #skip: with negative values and instead remember the file position explicitly.

TextFile chunking


A common situation for positioning text files can not be captured by merely using a TextFilePosition, namely that of 'chunking'. Chunking means that we have a file which consists of 'chunks' of text (such as Email-messages, or log file entries) which we want to index by chunk positions. For example, if we have a database of Email-messages, we will want to seek efficiently to each message, then read and convert it. One solution to this problem is to use a binary stream and do all the conversions internally.

Example:
	"Position the binary file to the start of the message"
	emailFile := FileStream readOnlyFileNamed: 'MyEmail.db'.
	emailFile position: emailMsgStart.
	"Read the raw (untranslated) messag
	msgRaw := binaryFile next: emailMsgLength.
	"Now convert the raw message into readable text"
	converter := TextFile on: msgRaw.
	converter encoding: #iso8859.
	msg := converter contents.

While the above is usable in all cases where the exact size of the binary data is known, it can be problematic when the size of the data is not known (for example, a Unix mbox file uses a 'From' at the beginning of a line to indicate a new message start, log files may use a certain number of lines in the file etc.) For reading a text file, this can be handled by the following code.

Example:
	"Open and position the binary file to the start of the message"
	emailFile := FileStream readOnlyFileNamed: 'MyEmail.db'.
	emailFile position: emailMsgStart.

	"Get us a TextFile at the current position of the emailFile"
	textFile := TextFile on: emailFile.

	"Now read out the text lines until we hit 'From'"
	[line := textFile nextLine.
	(line beginsWith: 'From ') or:[file atEnd] whileFalse:[
		"process one line of message contents"
	].

About the only problem we're left with here is how to determine the 'emailMsgStart' to begin with when using a TextFile. For this purpose we can ask the TextFile for a 'seekable file position' which will (albeit relatively slow) ensure that we are able to seek within the underlying stream to the exact position where the current message starts. Here is an example which will compute the indexes in an email database.

Example:
	"Open the text email file"
	emailFile := TextFile readOnlyFileNamed: 'MyEmail.db'.
	emailIndex := OrderedCollection new.

	[emailFile atEnd] whileFalse:[

		"remember the seekable position"
		emailIndex add: emailFile seekableFilePosition.

		"Skip forward until we find the next 'From' line"
		[textPos := emailFile position. "need to remember it so we can seek back quickly"
		(emailFile nextLine beginsWith: 'From') or:[file atEnd]] whileFalse.

		"If we weren't at the end of the file, seek back to before the line started"
		emailFile atEnd ifFalse:[emailFile position: textPos].
	].
	"we remember the last position too, so we know where to append stuff"
	emailIndex nextPut: emailFile seekableFilePosition.

Finally, since we already have a TextFile in the above, we can also use the indexes obtained to position our text file (contiued from the above).

Example:
	"position to the first message"
	emailFile seekableFilePosition: emailIndex first.

	"read its contents"
	emailFile nextLine. "first line contains 'From' so skip it"
	[(emailFile nextLine beginsWith: 'From') or:[emailFile atEnd]] whileFalse.

	"position after the last message"
	emailFile seekableFilePosition: emailIndex last.

	"Append a new message"
	emailFile nextPutAll: newMessage.

	"And remember the new end position"
	emailIndex add: emailFile seekableFilePosition.

More information


Here are some other places to look at:



That's it folks. Enjoy!