Thinking about massively parallel Smalltalk
Last updated at 8:44 pm UTC on 4 July 2007
Tim Rowledge January 16, 2005 Interesting article a friend pointed out to me: http://www.gotw.ca/publications/concurrency-ddj.htm
Given that current high-end cpus like P4 and G5 take many, many watts (50 - 100) to achieve a 'mere' 2-3GHz and high-end ARM type cpus take about 50mA for 800MHz, I think there might be some interesting options for say a hundred ARMs working together. Even if one only achieves 20% effectiveness you'd get serious compute power for 5A or so.
Michael Latta The main question is whether to have the threading model inside or outside the VM. Is the VM responsible for maintaining a pool of threads that can run on separate CPUs, and the Smalltalk program just
sees Processors for each OS thread? This would probably be the lowest effort and lowest overhead way to use multiple CPUs. The threads in the image remain in the image space and are coordinated in their access to heap etc. The Smalltalk application would also need to use various control structures to ensure synchronization of accesses. The other approach is to change the Smalltalk semantic to be more concurrent. This can be done with proxy objects that defer computations until needed. A prototype in the Self community saw up to 5 way concurrency in simple usage with no change in coding.
The question is do programmers want a concurrent language, or a sequential one? If multi-core computers are where all vendors are going, we software types should get on board. Rather than treating concurrency as an exception case to the nice sequential way of viewing a problem, we should explore turning it on its head. If we embrace concurrency we get the performance benefits from any concurrent hardware, we just need to readjust our thinking.
On the practical side of things, since the Squeak interpreter is basically one big C loop, how would we make use of concurrency? Would we have to fork() the interpreter for each processor, and have the interpreter access block contexts using synchronized accesses? If we could ensure that each block context is only handled by one C thread we
could avoid that overhead. But, to get highly concurrent systems we would need some way to context-switch between Smalltalk executions within a C thread. This would seem to imply that we will not be able to guarantee that various segments of the context execute on the same C thread.
Finalization is not resource management [Agreeing that Smalltalk is basically one big C loop, adds] I believe Smalltalk also runs the GC in its own OS thread.
Michael Latta Does the current image format associate active threads with processors? Does the Squeak image even allow for more than one Processor? When the image starts up it would need to assign each Processor in the image
to a thread, and deal with the cases where the image has more/less Processor objects for the number of CPUs in the executing computer. This could require that threads be re-assigned to a different Processor than they were using before. Or the image file could be independent of the number of processors, and assignment done on startup in all cases. Running the GC in a separate OS thread makes sense, if the synchronization overhead is not higher than the GC overhead. For example you can not do heap copying with other threads concurrently accessing those objects. It might make more sense to have all C threads coordinate pauses for GC so that you get GC activity concurrently on all processors when no "user" logic is running. Either that or you would need to lock each object while it is in motion, and each access would need to check for the lock. With spin locks that may be acceptable if you can control the collision rate.
Alan Grimes [since the Squeak interpreter is basically one big C loop, how would we make use of concurrency?]
Have multiple seperate loops, one per processor pluss others as needed for IO...
[Would we have to fork() the interpreter for each processor, and have the interpreter access block contexts using synchronized accesses?]
Fork(), AFAIK, creates two completely independant OS-level hardware VM's (not to be confused with smalltalk VMs). They would not be able to cooperate after that except through horribly primitive IPC mechanisms. A problem arising here is that unixoid OSes use the POSIX threading library (pthreads), while Windows has a different (inferior) threading model...
[to get highly concurrent systems we would need some way to context-switch between Smalltalk > executions within a C thread.]
Well, I was thinking 1 thread per physical processor (as opposed to class Processor) and a multi-threaded scheduler to allaocate Squeak tasks to them. (see a very old post I made on this subject.)On the interpreter level, the current system was not designed to do this at all.
When I attempted to paralellize Squeak last spring, my first instinct was to divorce class Interpreter from class ObjectMemory.
Right now, there are many variables in structure "foo" that are obviously "global" to all threads, and many others that obviously need to be unique per thread. My first effort was intended to trim down both Interpreter and ObjectMemory as much as possible and then place all global information in ObjectMemory and all thread-local information in "Interpreter".
One obfuscating factor is that it is not at all obvious from the squeak/slang sourcecode what is constant and what is variable. Many of the variables you see in the class definitions are compiled as #defines! =P I started attacking that by taking the class variable typeMask, and creating a method: "typeMask:" defined as:
^ bits bitAnd: 3.
Which is somewhat cleaner and is unequivocally constant...
Dave Hylands The threading model used in Windows NT/2K/XP is VERY similar to that used in pthreads. Having previously written abstraction layers (OK I was programming in C - not Smalltalk) for both OSes, they are quite similar.
Tim Rowledge You're all thinking much to small - or maybe too big, depending. I think for _massive_ parallelism we want to be able to schedule each method invocation to any available compute resource. Prims would potentially be parallelisable as well in some cases. Forget all this crap about "one cpu for garbage collection and another for running the interpreter" - I'm talking 100, a 1000 a million cpus. The current sort of VM would be utterly pointless in such a world. Quite possibly the current sort of Smalltalk would be useless.
John M McIntosh Actually JP Morgan uses Smalltalk to distribute problem solving to lots of machines to solve complex
problems. http://www.whysmalltalk.com/articles/JPMorgan.pdf Much of the Smalltalk code itself was written
assuming there was only one CPU. If you have two real CPUs attempting to run two Smalltalk threads I think interesting things would happen. An easier solution as Tim points out is to run 2 or 5 or 1000 images and distribute/share data via sockets or even shared memory. I recall once hearing about a Gemstone solution that used shared memory, to enable the sharing of mostly static data 100MB+ across lots of images. That reduced the overall footprint on the server by lots.
This of course moves the whole parallelism issue out of the VM and base Smalltalk code into the application domain...
Michael Latta [Responding to Tim's redirecting the discussion to massively parallel approaches] That was where my comment about the Self community came from. One of the people on that list took the Self language and made every method return a "future" object immediately. This future object would only do the message send when asked for its value (by being the target of another message send or primitive). Passing a future as an argument does NOT evaluate it. This simple change in semantics allows each future to be on its own concurrency path. The computation is done in
the background between the time it is scheduled and when it is needed. The result was as much as 5 way concurrency with the existing code base. For code designed to be concurrent it can be much higher. For example, by allowing the result of a select: collect: or detect: to be evaluated on the same basis would greatly increase concurrency.
These are the type of things that Smalltalk can address that C or Java could never hope to deal with. We can change the semantics of these control structures, because they are in the language, not in the VM. Of course in the Squeak case we can change anything! Now if we take the bytecode speed up from the recent JIT project and the new message semantics, we are getting there. Making ObjectMemory concurrent will be the biggest part of the effort.
Thinking about massively parallel Smalltalk [Regarding the proposal tochange the Smalltalk semantic to be more concurrent using proxy objects] I assume the "proxy objects" here are as following:
A method is "invoked". Instead of actually invoking that method, a proxy object is given in return, which will eventually be replaced with the real result of the method. The method can then be executed on a separate
CPU or deferred until later.
Alternatively, separate O/S threads (one on each CPU) would run the interpreter loop. They would use memory locking to prevent data corruption. When one is idle, it could steal an invocation from somewhere in the call stack, replacing it immediately with a return-value proxy that the other process would eventually fill with the return value. ...except that it wouldn't work because executing Smalltalk code has side-effects. Hmm.. more thinking required.
[Regarding the question of a concurrent or asequential language], Have a look at erlang, http://www.erlang.org. Also, it wouldn't hurt to use more [[do something] fork] in code, especially for long-running calculations such as downloads, compilations or software updates. This is one of my pet peeves - Smalltalkers have everything there to do concurrent programming but tend not to think outside of their sequential square too much.
[Regarding, how would we make use of concurrency?] One idea would be to make better use of pipelining. An experiment would be to rewrite the interpreter a bit, doubling up each step and using two stacks. I.e:
Load bytecode A
Load bytecode B
Process bytecode A using stack A
Process bytecode B using stack B
Jump to the next bytecode for A
Jump to the next bytecode for B
My theory is that this would be slightly faster because the CPU's pipeline would be more full, and more cache would be used.
JecelAssumpcaoJr [Regarding,I'm talking 100, a 1000 a million cpus.] Like this? http://www.merlintec.com:8080/hardware/19
[Regarding, ...possibly the current sort of Smalltalk would be useless.] No, but it would be nice to program in a more APL-like (Fscript-like?) style instead of the Fortran/C one that people are used to.
Michael Latta mentioned my tinySelf 1 work http://www.merlintec.com/lsi/tiny.html#rel1 where the idea was to see how well a large amount of code written in a very sequential style could be parallelized. Sadly this was interrupted by another project before it was debugged enough to run anything more than trivial expressions. It was probably some simple bug - contexts would sometimes come back to life after having been returned from. If I had used a more OO programming style (less explicit queue manipulations) it would have taken me about two hours for fix this. As it was, it probably would have taken me two days.
While the trivial expressions did have a parallelism of four or five, I believe that real applications would have reached several dozens or more. The idea of treating each object as a complete computer that talks to others over some network can get us very far.
Michael Latta Jecel,Thanks for jumping in. The reference was interesting. In particular it unifies why I like APL and Smalltalk. What I like about APL is the ability to operate on a large number of items as a unit. I like
similar constructs in Smalltalk (collect:, select:, etc). The C/Java languages still make me deal with each item individually. While the difference in syntax is annoying because of the wordiness of Java, the conceptual difference has an amazing ability to drain productivity. I think it ties back to the studies that programmers got the same number
of lines coded in any language. I can produce the same number of conceptual operations per unit time, and in APL/Smalltalk that produces more results than in C/Java because I only deal with one conceptual operation to manipulate a whole structure.
If we approach collections using future proxies we can get a lot of concurrency. For example collect/select/detect would result in a proxy that binds the input collection to the operator (block). This use of proxies will also require copy on modify semantics or a radical head rewiring for the programmer. I am not sure which I would vote for at
this time. The do: proxy would do the same with binding a collection to a block. The result would be proxies holding onto proxies onto proxies. These chains could then be linear or not as you chose. The result would be pipelined chains of operations that could be used to efficiently fill CPU pipelines.
JecelAssumpcaoJr [Matej Kosik I find the idea of the computation based on asynchronous messages also appealing. I would like to ask if objects in your system (TinySelf) were able to change their state or they were immutable.]
Note that these were not really asynchronous messages, but rather synchronous ones where using the reply was decoupled from sending the message. You don't get the kinds of programming errors with this that you do with true asynchronous messages because we keep the semantics as close to the original as possible (the goal is to run unmodified Smalltalk/Self applications). See "Process Model" in http://www.merlintec.com/lsi/selfdiff.html
TinySelf 1 objects were able to change state, but defining exactly what "state" is has been the major problem I have faced so far. Each object had its own thread and the state that was protected was only its local "instance variables". That is not good enough (what about instance variables in the objects it points to or that are passed as arguments?) and at the same time it is too much (recursive methods will cause deadlocks).
The Eiffel system I got some of these ideas from solved the first problem by locking the receiver and all arguments before executing a method, and also by having non active objects that can't be shared among the active ones. They ignored the second problem (since you were supposed to program active objects differently from the normal style: don't use recursion) and I solved it by detecting deadlocks and allowing the new method to interrupt the old, blocked one. The idea is that this hack wouldn't cause any errors that wouldn't also be present in a sequential execution of the code, so this was acceptable.
My current idea is to divide the objects into "groups" and have one thread per group. Then we can define an object's "state" to be the value of the instance variables of every object in that group. This is a lot like vats in E. I no longer think that migration among groups is important, so when you create an object you can use #newLocal to have it in the same group as the sender, #newRemote to have it be the first object in an entirely new group or just #new if you don't care either way (different kinds of objects will have this default to one of the previous two).
[Michael Latta This use of proxies will also require copy on modify semantics or a radical head rewiring for the programmer.] It is certainly worth looking at what people have already done in this area. There were two interesting projects, both called ConcurrentSmalltalk:
This has a lot in common with E and with the various Actor languages and is very worth studying for its mailboxes and secretaries and the ^^ "return result but continue" trick.
This is the place to look if you want to see Smalltalk (with Lisp syntax... nobody's perfect) running on a 1024 processor machine. I don't see any papers specifically about "concurrent aggregates", which are what you are talking about, but they are a major factor in getting such a high level of parallelism from normal looking code. The idea is that when you create some ConcurrentArray with 4096 elements, four are stored in each node. When you send #collect: to this array, each processor gets a copy of the message and only has to execute the block for its 4 local elements. The result is a new ConcurrentArray.
It is interesting to contrast this with Croquet. In that case each node would have a TArray with all 4094 elements and when receiving the #collect: message there would be no speed up at all (which isn't the idea, after all). But both have this in common: they allow an object to be addressed from any node as if it were local, even though it is made up from pieces present in all nodes. A common framework that would allow Croquet, concurrent aggregates and other stuff (remember LindaTalk?) would be very nice to have.
firstname.lastname@example.org GemStone has a VM plus a whole bunch of other processes running in a shared-memory-pages-scheme with a transaction model keeping it all in synch.The net effect is that multiple VMs share (through transactional
replication) "an image" or what is often referred to as an object memory. This means there is no "automatic" parallellism in actual computation going on - I mean, you still have to turn your Smalltalk code into multiple processes/threads - which your post also explained, just wanted to stress it once more. But at least a multitude of VMs can run (each on its own CPU and possibly own box) sharing the same object space in a coordinated fashion. This is of course also what Magma/GOODS etc can give you - a shared transactional memory.
Given my limited knowledge in the area it seems that the semantics of message passing would have to be changed in order to get "automatic full scale" parallellism. Typically each object would execute in its own Process and messages would be asynchronous. I also assume these lines of thought is what SqueakELib http://map1.squeakfoundation.org/sm/packagebyname/squeakelib www.erights.org, Croquet in some way and also Tweak is experimenting with. (Milan Zimmermann There is E-on-Squeak effort but http://www.erights.org/e-impls/e-on-squeak/index.html)
Matej Kosik Concerning E I have come to the following conclusion:
"E objects do not migrate (for security reasons). It makes it unscalable. Also the existence of ''vat''-s is awful. It does not seem to be suitable for fine-grained parallel computation because E's constructs for asynchronous message passing are not powerful enough in order for the programmer to be able to express everything solely by means of them. Furthermore, the computation inside the ''vat''-s happens in so called ''turns''. These turns cannot execute concurrently because inside each turn programmers usually do normal call/return-s and thus they would need to use additional synchronization mechanisms on the shared mutable data. E seems to be a good platform for creating loosly coupled (information) systems." (my personal opinion)
Brian Rice While E has a narrow focus on security, the concurrency model may be taken without the security/crypto design implications that go with it. In my case, I am working with such an option for Slate as a concurrent
language extension, and eventually cryptographic security can be added as an optional transport layer. I also want to focus on Erlang-style signalling between processes, which goes a step further than broken promise resolution in E.
Of course, the E fellows aren't directly helpful with this kind of effort, simply because their whole focus is security and their example of E. It's difficult to bring up an issue with them and not get just the security slant, so I've had to work it out myself.
Also, from what I understand from following the Croquet and E projects, it seems that the interest from Croqueteers towards E has not seen a real follow-through, at least from the E perspective, so don't count on them to deliver a full solution to the integration issues. It's a tough task and requires a lot of focus.
Lex Spoon I disagree with [Matej Kosik's] analysis. E's networking approach does not solve the problem and tie a bow on it, but it moves you in the right direction.
First, vats are a practical necessity for making parallel programs that actually give correct answers. If you have multiple user-visible threads munging through the same heap, you are very likely to screw things up. Call them aweful if you like, but if you want to write programs that are correct, then you need something like vats. (Or, you need to dip into formal verification....)
Second, a vat can be arbitrarily small. It is a unit of organization, much like a thread in a pthreads program. If a programmer wants to have lots of parallelism, and is willing to micro-optimize, then they can create zillions of tiny vats which can then perfectly and safely run in parallel.
Third, Smalltalk semantics is that bytecodes happen one after another. Smalltalk implementations are allowed to cheat, however, and run bytecodes in parallel so long as they don't interfere. Likewise, a sophisticated E VM could run multiple turns in parallel, if it can determine that they won't interfere. In general, automatic parallelization in E is likely to be easier than in Smalltalk, because the division into vats gives you a head start. In particular, vats give you free alias analysis.
Of course, someone should really go ask on e-lang. I have found the E guys to be very imaginative and to be ready to discuss a variety of topics. They aren't just security nerds; if they were, I doubt they would have come up with the architecture they have.
Anyway, as far as I can tell on the subject of the thread, you want to have a special language to get massive parallelism, not just a spiffy VM. If there were good compilers or VM's around that automatically parallelized, then parallel processing would be much more popular than it is. Someone mentioned that Self got automatic parallelization to about 5 processors. There's a long way to go to get 100. So, try thinking of cool parallel operations, especially ones that avoid side effects and thus can safely happen in arbitrary orders.
Matej Kosik [vats are a practical necessity] I am thinking of a computational model without processes/threads/vats.
It is based on asynchronous message passing where necessary synchronization happens differently than in normal concurrent languages. My current attempt (not yet finished, I did not get to lot of important things and there are no good examples) http://altair.dcs.elf.stuba.sk/wiki/Kosik/V2 makes me believe that I am on the right way. It is an attempt to provide the programmer with a concurrent computational model which is safe (inherently interference-free) and live (inherently dead-lock free). I have noticed that the philosophy of the Smalltalk could be transformed
from invoke/return model to eventual-send/future-like-return. The system I am thinking of will have similar beauty as Smalltalk (everything will be expresed by eventual-sends, everything is an object). That was my previous attempt http://altair.dcs.elf.stuba.sk/~kosik/V1.pdf which moved me in the right direction. Its main problems were that
parallel activities/activations of objects were not able to cooperate with shared object automatically safely (without interference). My contemporary attempt solves that major problem.
The proposed computational model is not more difficult to use (to make programs for it). It introduces fine-grain parallelity of the computation (which could make the programs scalable on proper hardware) and that it may cure the main Smalltalk's disseas. Smalltalk's images were not able to communicate in normal ways safely and lively. It
inherited all the problems of sequential languages.My computational model is suitable for general-purpose programming. It is no tougher than normal sequential programming. It is currently immature.
[... a vat can be arbitrarily small.] I do not agree. Did you try it? The mechanisms provided by E are not
powerful enough to do so. For example, you will not be able express branching by eventual-sends safely.
[ automatic parallelization in E is likely to be easier than in Smalltalk,]I do not think that the Smalltalk's virtual machine could be patched here and there to provide what I need. The whole model of computation must change.
[ try thinking of cool parallel operations, especially ones that avoid side effects and thus can safely happen in arbitrary orders.] The past showed two such directions (for fine-grained parallelism).
I think that the second approach (though with some modifications) will be able to:
- Data-parallel computation (the Cm-LISP running on Connection-Machine http://en.wikipedia.org/wiki/Connection_Machine) is perhaps the most beautiful specimen). Its main disadvantage is that it is inherently not suitable for general-purpose programming.Solutions of some problems could be naturally expressed Cm-LISP but most of them not.
- Hewitt's Actors inspired many systems. The problem is how to extend it to be able to deal with shared mutable objects safely and lively. That is a problem I am solving and I believe that it is solvable.
swallow the first approach conquer back the lost areas in personal computing
conquer yet actually uncharted areas of distributed computing (for example explicitely cooperating information systems)