Squeak's threading architecture -- why can't Squeak do SMP yet
Last updated at 8:35 pm UTC on 4 July 2007
See also: Squeak Threading Model, Thinking about massively parallel Smalltalk, Using Multiple OS Threads
Ed Boyce, August 04, 2004 8:44 AM
I have a few questions about Squeak VM architecture. My motivation has to do with wanting to share physical simulations and visualizations over distributed hardware with Croquet when available, but my questions about scalability seem to be more Squeak VM related.
What is stopping Squeak from doing SMP (or async MP, for that matter)? The Squeak VM classes seem to be pretty finely threaded and increasingly modularized. I understand some of the potential risks to stability when you tell a computer to try to walk and chew gum at the same time, but some threads (and processes, for that matter) should be parallelizable when they have no data interdependencies. Why can't one make a copy of the OS/platform-specific guts of the Squeak bytecode interpreter or JIT compiler object for each physical processor (maybe make it live in the processor cache, since it isn't THAT big) and have the scheduler serve threads (or whole processes, if they're independent enough) to available processors using a priority scheme of one's choice? This being Squeak, that scheme should be able to be changed out or altered on the fly if available hardware or loads change during execution.
Of course, locks are necessary around structures whose shared memory components may be in flux when another object tries to interrogate them. But Squeak's message-passed requests are more polite than conventional function calls to object methods, and this can facilitate a more polite "be with you in a few microseconds" response if something is locked, rather than a simple fail. Not all classes, methods, and Squeak VM threads are suitable for forking into OS and processor threads. Those that are or aren't can be tagged with one bit (maybe call it "isSMPThreadable").
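A minimal sketch of the tagging idea, written in Python rather than Squeak and not taken from any existing VM. The `Task` record, its `is_smp_threadable` flag, the `priority` field, and the `schedule` function are all hypothetical names; the sketch only shows the dispatch shape (tagged work goes to a pool of OS threads, everything else stays on the calling thread), and in CPython the global interpreter lock means CPU-bound tasks would not actually run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    run: Callable[[], int]
    is_smp_threadable: bool = False  # hypothetical one-bit tag from the post
    priority: int = 0                # higher runs (or is submitted) first


def schedule(tasks: List[Task], workers: int = 4) -> Dict[str, int]:
    """Send tagged tasks to a thread pool; run the rest serially, by priority."""
    ordered = sorted(tasks, key=lambda t: -t.priority)
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tasks marked as safe are handed to OS threads.
        futures = {t.name: pool.submit(t.run)
                   for t in ordered if t.is_smp_threadable}
        # Everything else stays on the calling ("VM") thread.
        for t in ordered:
            if not t.is_smp_threadable:
                results[t.name] = t.run()
        # Collect the pooled results once they are done.
        for name, future in futures.items():
            results[name] = future.result()
    return results
```

For example, `schedule([Task("sim", lambda: sum(range(10)), True, 5), Task("ui", lambda: 1)])` runs "sim" on the pool and "ui" on the calling thread. Because the priority ordering is just a sort key here, swapping in a different scheme would only mean changing that one line.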
Has this been tried and failed? Another question: Has anyone tried to port Squeak to IBM Power5 processors (under Linux or anything else)? What are the broad engineering challenges in the way if I want to run a computationally intensive simulation object (e.g. a large-N molecular dynamics simulation with Squeak object hooks) on, say, 32 processors of my IBM p690, and let that object communicate the gist of its state (by passing messages over IP) to clients which do the (not so computationally intensive) final rendering? I know that the client-server bit is exactly what the Croquet team is working on. But one can't involve high performance computing in the mix if threads can't be distributed over multiple processors. And that part seems to rest with the Squeak VM architecture. (Please point out any misconceptions in any of the above.)
Anyone who has used a pervasively threaded microkernel operating system, such as BeOS and QNX, understands the substantial power for users and their programs of multithreading on multiprocessors done right. There are clearly lots of parallels to be drawn here between Squeak and various OS architectures.
[... make a copy of the OS/platform specific guts of the Squeak bytecode interpreter ...and have the scheduler serve threads ..] First problem is that in general processor caches are not under our control; the cpu has hardware that caches lines of some size, and as process switches occur, different cache lines get filled with memory from many places, and some cpus may flush lines as part of a context switch while others don't. Basically, the idea that "hey, our little interpreter can fit in the cache" is not real. Except, interestingly enough, in my favourite, the ARM. The latest ARM architecture allows for a sort-of cache that IS under application control and WOULD allow for the VM to be loaded along with crucial data and kept there. Up to 4Mb of it, which is certainly enough for a lot of useful stuff. Of course, there are then issues of which application gets control of this TCM etc. And it's a bit tricky to actually go out and buy an ARM v6 cpu right now.
Next problem is sharing the object memory efficiently and reliably. Address spaces? Garbage collection? Referential integrity? Is there a single object space or many? Do all cpus think of themselves as sharing the same actual memory or do they have separate memory?
Even if you could have multiple execution units sharing exactly the same memory space (hmm, another ARM, the MPCore springs to mind) I think it would be a goodly bit of work.
To some extent you could easily benefit from 'normal' multithreading in the VM (the OS X and Windows VMs certainly do some) to handle user input, socket signals, stuff like that. Perhaps you could dedicate a cpu to tracking memory usage and modifying GC policy, to watching code usage and asynchronously heavily optimising some chunks of translated code, or even to cleaning up memory left behind by compaction (so object allocation could avoid having to scan the area).
HP used to sell a distributed Smalltalk (I think they passed it back to Cincom but I'm not sure), but that was more a case of multiple Smalltalks talking to each other via something like CORBA.
If you want it to work with shared memory, then I don't think anyone is working on that in Squeak. It could be done if you want to invest some engineering into it. The main thing that is missing is a thread-safe object memory.
[..locks are necessary ..] In practice, it is VERY hard to achieve proper locking for shared memory threading, even in tiny programs. So don't go down that path without a good reason – after all, do you care how fast the answer is computed if it may have been garbled? Polite message passing is much easier to get right.
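To illustrate why polite message passing is easier to get right, here is a small sketch in Python (not Squeak, and not code from the thread). The `CounterActor` class, its message names, and the `demo` function are invented for this example. The state has exactly one owner thread, and everyone else talks to it through a mailbox, so the client code needs no explicit locks around the counter.

```python
import queue
import threading


class CounterActor:
    """Owns its state; other threads only reach it through the mailbox."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        # Only this thread ever touches self._count.
        while True:
            msg, reply = self._mailbox.get()
            if msg == "increment":
                self._count += 1
            if reply is not None:
                reply.put(self._count)
            if msg == "stop":
                return

    def send(self, msg):
        """Fire-and-forget message; the sender does not wait."""
        self._mailbox.put((msg, None))

    def ask(self, msg):
        """Send a message and wait for the actor's answer."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((msg, reply))
        return reply.get()


def demo(n_threads=4, per_thread=1000):
    """Several threads send increments; the final count is exact."""
    actor = CounterActor()

    def worker():
        for _ in range(per_thread):
            actor.send("increment")

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return actor.ask("stop")
```

With the defaults, `demo()` returns 4000: every increment is queued before the final "stop" request, and the mailbox is processed in order by a single thread. The cost is that all access to the counter is serialized through that one thread.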
More effort would then be required to make the platform code thread safe, since some of it uses locally scoped or static variables, but you could start with a limited set excluding things involving the UI.
However, if you wanted to run a Smalltalk process on a different thread within the same image then:
a) the interp.c code isn't thread safe, assuming each thread uses the same foo structure.
b) Smalltalk code within a single image isn't thread safe. This is a bit harder to figure out, but I think there are places where the Smalltalk process switching rules prevent race conditions which you would uncover with multiple processes running (as others have pointed out).
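Point (b) can be made concrete with a toy simulation in Python (not Squeak code). It assumes, as the post suggests, that image code may silently rely on process switches happening only at well-defined points. The account dictionary, the `preempt` flag, and all function names are made up for illustration; each `yield` stands for a place where another process may run.

```python
def withdraw(account, amount, preempt):
    """One simulated process; every `yield` marks a possible process switch."""
    if account["balance"] >= amount:       # check
        if preempt:
            yield                          # switch injected between check and act
        account["balance"] -= amount       # act
    yield                                  # voluntary yield once the work is done


def run_round_robin(processes):
    """Resume each process in turn until all of them have finished."""
    pending = list(processes)
    while pending:
        proc = pending.pop(0)
        try:
            next(proc)
            pending.append(proc)
        except StopIteration:
            pass


def simulate(preempt):
    """Two processes each try to withdraw the full balance of 100."""
    account = {"balance": 100}
    run_round_robin([withdraw(account, 100, preempt) for _ in range(2)])
    return account["balance"]
```

`simulate(False)` leaves the balance at 0, because the check and the update are never separated by a switch. `simulate(True)` ends at -100: both processes pass the check before either one updates the balance, which is the kind of latent race that truly concurrent processes could expose.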
Reinout Heeck: You may want to look into the Merlin runtime for Self. Adaptive Compilation in the Merlin System for Parallel Machines (10 pages) http://www.merlintec.com/lsi/jpaper1.ps.gz
To whet the appetite: "So, even with top level messages defined as sequential, one in every four sends potentially adds to the parallelism of a program"
Jecel Assumpcao: [.. the Merlin runtime for Self.] Thanks for mentioning this - now I am forced to add some details ;-). The paper you cited is rather old, so interested people might want to look at the tinySelf 1 effort in http://www.merlintec.com/lsi/tiny.html#rel1
I got sidetracked by other things and never did finish debugging this so that full applications could be tested, but the results for simple expressions show just how much potential parallelism there is in Smalltalk.
The project is currently called Neo Smalltalk (software) and Plurion (hardware) and the focus is on making the best use of a set of slow processors. The FPGA versions will have between 2 and 6 processors running at 100MHz and there are three interesting possibilities for doing custom chip versions which would be larger and faster.
Future compatibility with Squeak is a very high priority goal, but it is likely that the first efforts in this direction won't make use of multiple processors.