Chunky Squeak


Last updated at 11:23 pm UTC on 20 February 2009

This is the initial draft of a technical proposal for a change in the memory system of the Squeak VM. The idea is that breaking up the image into large-grained "chunks" will take less effort than full modularization while giving us some of its advantages with minimal changes to current community practices. The goal is not to eliminate the need for full modularization, but to offer an incremental step that will actually make it easier to achieve. In particular, any of the proposals for New Modules should be fully compatible with what is described below.

Chunks

A typical use case supposes that a computer has several Squeak images on its disk, and sometimes even in its memory. Many bits are repeated among these images, and it would be nice to factor them out into shared "chunks". This is similar in spirit to the factoring of many .changes files into a common .sources file to save disk space.

As run-time structures, these chunks would be similar to project (.pr) files and ImageSegments, but they would be far less flexible. Each chunk would have a global identifier which would be a hash of its contents, so a chunk can never be changed; you can only create new ones. Each chunk also includes the identifier of one other chunk that must be loaded into memory before it, though that identifier can be zero for chunks that should be loaded into a completely empty memory (we can call these "base chunks"). Note that this single "parent identifier" is sufficient to completely specify every single bit of memory before the chunk is loaded, so the load always happens in a totally controlled environment (unlike ImageSegments, which must adapt to different conditions).
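The relationship between contents, identifier and parent can be sketched as follows. This is only an illustration: the choice of SHA-256, the 32-byte all-zero value standing in for the "zero" parent, and the names `Chunk`, `chunk_id` and `is_base` are assumptions, not part of the proposal.

```python
import hashlib
from dataclasses import dataclass

BASE_PARENT = bytes(32)  # assumption: an all-zero identifier marks a base chunk


@dataclass(frozen=True)
class Chunk:
    parent_id: bytes  # identifier of the chunk that must be loaded first
    payload: bytes    # encoded description of how memory changes

    @property
    def chunk_id(self) -> bytes:
        # The identifier is derived from the contents, so changing the
        # parent reference or the payload yields a different chunk.
        return hashlib.sha256(self.parent_id + self.payload).digest()

    def is_base(self) -> bool:
        return self.parent_id == BASE_PARENT


base = Chunk(BASE_PARENT, b"kernel objects")
child = Chunk(base.chunk_id, b"end-user project")
```

Because the dataclass is frozen and the identifier is computed from the fields, "editing" a chunk can only mean building a new `Chunk` with a new identifier.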

The bulk of the contents of a chunk describes how memory should be changed when the chunk is loaded. A rather simple way of doing that would be to encode the XOR of the memory contents before and after loading the chunk and then to compress the result using some standard algorithm. More efficient schemes can easily be devised.
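A minimal sketch of the XOR scheme, assuming zlib as the "standard algorithm" and treating memory as a flat byte string. The function names and the 8-byte length prefix (used so that a memory that shrinks can be restored) are illustrative assumptions.

```python
import zlib


def make_delta(before: bytes, after: bytes) -> bytes:
    """Encode the change from `before` to `after` as a compressed XOR."""
    size = max(len(before), len(after))
    a = before.ljust(size, b"\0")
    b = after.ljust(size, b"\0")
    xored = bytes(x ^ y for x, y in zip(a, b))
    # Store the final length so that a shrinking memory can be restored.
    return len(after).to_bytes(8, "big") + zlib.compress(xored)


def apply_delta(before: bytes, delta: bytes) -> bytes:
    """Rebuild the memory contents after loading a chunk."""
    final_size = int.from_bytes(delta[:8], "big")
    xored = zlib.decompress(delta[8:])
    a = before.ljust(len(xored), b"\0")
    return bytes(x ^ y for x, y in zip(a, xored))[:final_size]
```

Unchanged regions XOR to runs of zero bytes, which is why even a general-purpose compressor makes the delta much smaller than the memory it patches. For a base chunk, `before` is simply the empty byte string.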

Though chunks are stored as files, the names of these files should not be taken into account by Chunky Squeak. Only the identifier included in the first few bytes matters (it could also be recalculated on demand from the rest of the contents). Some mechanism to translate identifiers to local file names or URLs must exist, of course, but it shouldn't be hardwired into the design of Chunky Squeak.
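One possible translation mechanism for local files is to scan a directory and index whatever is found by the identifier stored in each file. The sketch below assumes a 32-byte SHA-256 identifier at the start of the file; the file layout and all function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

ID_SIZE = 32  # assumption: a SHA-256 identifier stored at the start of the file


def write_chunk_file(path: Path, body: bytes) -> bytes:
    """Write identifier + body and return the identifier."""
    chunk_id = hashlib.sha256(body).digest()
    path.write_bytes(chunk_id + body)
    return chunk_id


def verify_chunk_file(path: Path) -> bool:
    """Recompute the identifier from the rest of the file and compare."""
    data = path.read_bytes()
    return hashlib.sha256(data[ID_SIZE:]).digest() == data[:ID_SIZE]


def build_index(directory: Path) -> dict:
    """Map identifiers to local files, whatever the files are called."""
    index = {}
    for path in sorted(directory.iterdir()):
        if path.is_file() and verify_chunk_file(path):
            index[path.read_bytes()[:ID_SIZE]] = path
    return index


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    wanted = write_chunk_file(folder / "any-name-at-all.bin", b"chunk body")
    (folder / "notes.txt").write_text("not a chunk")
    index = build_index(folder)
    found_name = index[wanted].name
```

The lookup goes from identifier to path, never the other way round, so renaming or moving a chunk file only requires rebuilding the index.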

Multiple Images in Memory

A chunk can be as small or as large as needed. In particular, a chunk can be exactly the same thing as a current image. In that case it would contain everything needed and have no parent chunk, so just loading it into an empty memory would make everything work as it does now. In the case of smaller chunks, you might have one that represents an end-user project; to load it you first have to load its parent, and the grandparent before that, all the way up to a base chunk. After loading, however, everything would work exactly as it does now. The exception is that "save image" would generate a new chunk, which would normally be pretty small compared to a typical image file.
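The loading order described above can be sketched as a walk up the parent links followed by applying each delta from the base chunk downwards. The sketch makes several simplifying assumptions: small integers stand in for the hash identifiers, deltas are uncompressed XORs, and memory only grows from parent to child.

```python
from typing import NamedTuple

BASE = 0  # the text uses zero as the parent of a base chunk


class Chunk(NamedTuple):
    parent_id: int
    delta: bytes  # assumption: plain XOR against the parent's memory


def xor(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, padding the shorter one with zero bytes."""
    size = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(size, b"\0"), b.ljust(size, b"\0")))


def ancestry(store: dict, chunk_id: int) -> list:
    """Return chunk ids from the base chunk down to chunk_id."""
    chain = []
    while chunk_id != BASE:
        chain.append(chunk_id)
        chunk_id = store[chunk_id].parent_id
    chain.reverse()
    return chain


def load(store: dict, chunk_id: int) -> bytes:
    """Start from empty memory and apply every chunk in the ancestry."""
    memory = b""
    for cid in ancestry(store, chunk_id):
        memory = xor(memory, store[cid].delta)
    return memory


kernel, tools, project = b"kernel", b"kernel+tools", b"kernel+tools+project"
store = {
    1: Chunk(BASE, kernel),               # base chunk: delta against empty memory
    2: Chunk(1, xor(kernel, tools)),
    3: Chunk(2, xor(tools, project)),
}
```

Loading chunk 3 applies chunks 1, 2 and 3 in that order and ends with the same bytes a monolithic image would have contained.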

A more interesting case is when more than one image is loaded into memory. In that case it might be common for a given chunk to be already present in memory when it is supposed to be loaded. If each image is loaded into its own address space, then page sharing can be used to save duplication. And "copy on write" can deal with the fact that later chunks might want to make (possibly incompatible) changes to a given page.

An alternative is to load all the chunks into a single address space and run the images as separate threads (as in the Hydra VM) rather than as separate tasks/processes. This will greatly reduce the switching overhead but will require replacing the direct pointers that have been a traditional feature of Squeak with old-style object tables. These object tables will both allow the "shared but differently patched" parent chunks and compensate for the different object addresses in each run.
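The "shared but differently patched" idea can be illustrated with a toy object table in which each image keeps a private overlay on top of entries that come from common parent chunks. This is a conceptual sketch only; the class and method names are assumptions, and a real VM would store addresses rather than Python values.

```python
class ObjectTable:
    """Per-image table mapping object indices to object bodies."""

    def __init__(self, shared: dict):
        self._shared = shared  # entries from the common parent chunks, never mutated
        self._private = {}     # entries this image has added or patched

    def lookup(self, index: int):
        if index in self._private:
            return self._private[index]
        return self._shared[index]

    def store(self, index: int, body) -> None:
        # Writes never touch the shared entries, so other images are unaffected.
        self._private[index] = body


shared = {1: "Object class", 2: "String class"}
image_a = ObjectTable(shared)
image_b = ObjectTable(shared)
image_a.store(2, "String class, patched by image A")
image_a.store(3, "object only in image A")
```

Both images refer to object 2 by the same index, yet each sees its own version, which is the indirection that direct pointers cannot provide.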

This strict tree structure is very limiting. If you want to combine packages that live in unrelated trunks, you have to use traditional alternatives (fileOut/fileIn, Monticello, project files, etc.) to move the needed code and objects around. An alternative is to develop a different style of using Squeak where each piece can remain in its own image and yet be combined into a common result. This would need something like the far references in Islands or the wormholes in Spoon. One option would be for each image to "think" it had the system screen all to itself but then have its graphical output diverted to a master "gui image" that would be the only one really interacting with the user. Projects such as Nebraska probably already include most of the needed code.

Chunk Operations

The most basic chunk operation is "save image", as mentioned above. Many options are possible:

it could create a new chunk that has as its parent the original chunk that was loaded (great for full version control)
it could have the same parent as that chunk (so it would be an independent replacement, just like images are now)
it could not have a parent at all but represent a full image
it could have as its parent some ancestor of the original chunk (so it would include more basic code itself)
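The four options differ only in which parent the newly saved chunk records. A hedged sketch of that choice, using small integers in place of hash identifiers and hypothetical names throughout:

```python
from enum import Enum

BASE = 0  # the text uses zero as the parent of a base chunk


class SaveMode(Enum):
    CHILD = "child"        # parent is the chunk that was loaded
    SIBLING = "sibling"    # same parent as the loaded chunk
    FULL = "full"          # no parent: a complete image
    ANCESTOR = "ancestor"  # parent is some ancestor of the loaded chunk


def ancestors(parents: dict, chunk_id: int) -> list:
    """List the ancestors of chunk_id, nearest first, excluding BASE."""
    found = []
    current = parents[chunk_id]
    while current != BASE:
        found.append(current)
        current = parents[current]
    return found


def parent_for_save(parents: dict, loaded: int, mode: SaveMode, ancestor=None) -> int:
    """Pick the parent identifier to record in the newly saved chunk."""
    if mode is SaveMode.CHILD:
        return loaded
    if mode is SaveMode.SIBLING:
        return parents[loaded]
    if mode is SaveMode.FULL:
        return BASE
    if ancestor not in ancestors(parents, loaded):
        raise ValueError("not an ancestor of the loaded chunk")
    return ancestor


parents = {1: BASE, 2: 1, 3: 2}  # chunk id -> parent id
```

Whatever parent is chosen, the saved delta has to be computed against the memory that parent produces, so choosing a more distant ancestor yields a larger chunk.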

Other interesting operations might allow us to "reparent" a chunk or to split one into several. These would be needed when moving from the current image system to Chunky Squeak. In this case it would be important to have a good memory visualization tool.
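Reparenting can be sketched as rebuilding the memory a chunk produces and then re-encoding it against the memory of the new parent. The sketch assumes uncompressed XOR deltas, integer identifiers, and memory that only grows from parent to child; since real identifiers are content hashes, the result would be a new chunk with a new identifier.

```python
BASE = 0  # the text uses zero as the parent of a base chunk


def xor(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, padding the shorter one with zero bytes."""
    size = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(size, b"\0"), b.ljust(size, b"\0")))


def materialize(store: dict, chunk_id: int) -> bytes:
    """Rebuild the memory produced by loading chunk_id and its ancestors."""
    if chunk_id == BASE:
        return b""
    parent_id, delta = store[chunk_id]
    return xor(materialize(store, parent_id), delta)


def reparent(store: dict, chunk_id: int, new_parent_id: int) -> tuple:
    """Return a (parent_id, delta) pair that rebuilds the same memory on a new parent."""
    target = materialize(store, chunk_id)
    return (new_parent_id, xor(materialize(store, new_parent_id), target))


store = {
    1: (BASE, b"kernel"),
    2: (1, xor(b"kernel", b"kernel+tools")),
    3: (2, xor(b"kernel+tools", b"kernel+tools+app")),
}
store[4] = reparent(store, 3, 1)  # same memory as chunk 3, but hangs off chunk 1
```

Reparenting onto `BASE` turns the chunk into a full image, which is one way an existing image could be brought into the scheme before being split.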

Goals

to allow breaking up the image into reloadable pieces even in the presence of circular dependencies
to allow project based development style to continue relatively unchanged
to allow image based development style to continue relatively unchanged
to allow code repository development style to continue relatively unchanged
to gradually migrate bits which change at different rates into different chunks, which can then evolve at their natural pace (no need for a new base every six months)
to allow forking while sharing as much as possible
to rescue neat Squeak projects from the past

Technical Problems

the "exact same bits" approach doesn't take into account endianness or the difference between 32-bit and 64-bit images
backwards compatibility at the level shown in the figure would depend on efforts to convert older images to chunks