Server Profiling
Last updated at 2:15 am UTC on 4 March 2009
Server profiling is tricky with MessageTally. There are several issues to keep in mind, some of which you can fix, some of which you can't. Here we go:

Observing all Processes

First, MessageTally only observes the block you are running, not the entire system. For example, if you were to profile this:
done := Semaphore new.
worker := [
    10 timesRepeat:[World changed; displayWorld].
    done signal.
] fork.
MessageTally spyOn:[done wait].
This is what you'd get:
Tree
100.0% {2367ms} primitives

Leaves
100.0% {2367ms} UndefinedObject>>DoIt
Obviously, it isn't actually measuring what is going on during that period of time, which is a real problem if you are trying to measure server load in general. But it's fixable.

Priority issues

MessageTally runs at userInterruptPriority and therefore won't see anything running at (for example) lowIOPriority or higher, which is quite common in server images for handling incoming requests etc.
The good news is, both of the above issues can be fixed easily. Just file in the attached patch for MessageTally (I will also post this on Mantis) and it will report the work being done accurately when you use it via:
  MessageTally spyAt: Processor timingPriority - 1 on:[done wait].
The priority argument allows you to specify at which priority to run your spy so that you can measure what you need to measure (a word of warning though: You mustn't try to run at #timingPriority as this will interfere with the timing mechanism itself).
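For reference, these expressions show the relevant priorities in your image (the comments describe their typical roles, as discussed above):
  Processor userInterruptPriority.    "where the stock spy runs"
  Processor lowIOPriority.            "typical for server request handling processes"
  Processor timingPriority.           "the timer itself; never spy at this level"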
With these changes you can run a tally on, e.g.,
  MessageTally spyOn:[(Delay forSeconds: 10) wait].
and see what happens over the next 10 seconds. This is effectively what we do for profiling on our server images.
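To illustrate, here is the first example again using the patched entry point (assuming the patch above has been filed in; the worker is forked at lowIOPriority here just to make the priority point, whereas the original simply used fork):
done := Semaphore new.
[
    10 timesRepeat:[World changed; displayWorld].
    done signal.
] forkAt: Processor lowIOPriority.
"spy just below timingPriority so the lowIOPriority worker is visible"
MessageTally spyAt: Processor timingPriority - 1 on:[done wait].
With the patch, the tally should now attribute the time to the actual work being done instead of lumping it all into UndefinedObject>>DoIt.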

VM timing problems

Unfortunately, recent VMs have a problem with spy accuracy since an important timer check was removed from the VM (I didn't notice that at the time it happened). This heavily affects the profiler. For example, try this:
    MessageTally spyOn:[
        1 to: 100 do:[:i|
            Display fillWhite.
            'Hello', ' World'.
        ].
    ].
If you have a recent VM (I haven't tracked how far back this goes) the result is this:
Tree
100.0% {534ms} ByteString(SequenceableCollection)>>,

Leaves
100.0% {534ms} ByteString(SequenceableCollection)>>,
The problem is caused by the VM no longer doing a timer check after a primitive has been run. Consequently, the time spent in any long-running primitive (like BitBlt) is attributed to the next real send at which an interrupt check occurs (as you can see above). This can be extremely confusing. If you want even remotely accurate profiles, I recommend going back a few generations of VMs and picking one that behaves properly in this regard. For example, running the above in 3.6 I get:
Tree
100.0% {218ms} DisplayScreen(DisplayMedium)>>fillWhite
  100.0% {218ms} DisplayScreen(DisplayMedium)>>fill:fillColor:
    100.0% {218ms} DisplayScreen(Form)>>fill:rule:fillColor:
      100.0% {218ms} BitBlt>>copy:from:in:fillColor:rule:

Leaves
100.0% {218ms} BitBlt>>copy:from:in:fillColor:rule:
Obviously, this is the correct tally. [Hm ... but this points out another issue. For some reason 3.6 runs this in half the total time??? What the heck... I'll have to look at this]
You must use a VM that doesn't have the above problem or the results will be completely confusing.

Measuring load with (sub-)msec responses

Lastly, server images can have a particular issue if they are set up to do relatively fast request handling. For example, our routers generally respond in sub-msec times. When you start putting load on servers that react that quickly, the whole msec-based sampling approach goes out of the window.
What's worse is that if (some of) your server activities are driven by the clock (using Delay etc.), running under load can produce utterly garbage results. Another example:
blamedProcess := [
  [true] whileTrue:[(Delay forMilliseconds: 1) wait].
] forkAt: Processor userInterruptPriority.

[MessageTally spyAt: Processor timingPriority - 1 on:[
    (Delay forSeconds: 5) wait.
]] ensure:[blamedProcess terminate]
When you run this, MessageTally will tell you that 100% of the CPU was spent in Delay wait by the blamedProcess. Yet if you look at actual CPU utilization you should see practically no CPU activity. How come? When the timer event fires, it puts both MessageTally's profiling process and the blamed process into the runnable queue. MessageTally (being at higher priority) goes first and sees that the currently active process is the blamed process. It assumes that the entire last msec was spent in it, since it has no other timing base to use.
Be aware that the above can happen in many, many variations and depends heavily on your system's timer accuracy (if you want to play with this, increase the blamedProcess' wait time step by step and compare the reported CPU utilization on various platforms). In general, when you see Delay wait show up in your tallies, you can be sure there is something wrong with the measurement.
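If you want to run that experiment, here is one way to do it (just a sketch; the step values are arbitrary and each iteration produces its own spy report):
| blamedProcess |
#(1 2 5 10 50) do:[:waitMillis |
    blamedProcess := [
        [true] whileTrue:[(Delay forMilliseconds: waitMillis) wait].
    ] forkAt: Processor userInterruptPriority.
    [MessageTally spyAt: Processor timingPriority - 1 on:[
        (Delay forSeconds: 5) wait.
    ]] ensure:[blamedProcess terminate]].
Depending on your timer accuracy, the bogus Delay wait attribution should change noticeably as the wakeups of the two processes align less often.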
Unfortunately, there is not much you can do about this other than be aware of it and wait. Because we ran into the same issues, I have written a better profiler at Qwaq (which is both sub-msec and primitive-aware), and I think we'll be releasing the necessary support for this in Cog, but this will still take a while. For comparison, when you run our profiler on the above code you get:
Reporting - 45,975 tallies, 5,000 msec.

Tree
98.9 (4,945)  ProcessorScheduler class  startUp
  98.9 (4,945)  ProcessorScheduler class  idleProcess
    98.9 (4,945)  ProcessorScheduler class relinquishProcessorForMicroseconds:

Leaves
98.9 (4,945)  ProcessorScheduler class  relinquishProcessorForMicroseconds:
So it correctly reports that the code was doing nothing (executing the idle process).

Summary

In order to profile your server you definitely need to take care of the first three issues above or else you won't be able to find out anything. With these issues taken care of, just add a remotely executable action that runs MessageTally for a given number of seconds (we have this set up on our servers so that when you go to /profile?secs=xxx it runs the profiler for that amount of time and spits out the result).
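For illustration, here is a minimal sketch of what such an action could call (the method name and the hookup into your web server are hypothetical; it uses the stock instance-side spyEvery:on: and report: to answer the report as a string rather than opening a window, and it leaves out the priority handling from the patch above):
profileForSeconds: secs
    "Hypothetical handler behind /profile?secs=xxx: spy on the running image
    for secs seconds and answer the report as a string."
    | tally |
    tally := MessageTally new.
    tally spyEvery: 1 on: [(Delay forSeconds: secs) wait].
    ^String streamContents: [:s | tally report: s]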
Unless you do work that has very fast responses and is designed to work under high load, this should be good enough to get you a pretty good understanding of what is going on.