Improving Kom performance under Linux


Last updated at 1:23 pm UTC on 1 November 2006

I (C. David Shaffer) have been working on optimizing Squeak as a web server (both for files and Kom "servlets"). Here's a log of my findings and progress. If you're interested in the work, drop me a line. Contact info is on my home page.

Software/hardware Tools:

autobench (which is really just an httperf driver...please, no comments, I have used most open source web load testing tools...this is the one I chose for this work)
Dell Inspiron 8500 running Gentoo Linux (2.6.12 kernel)
Squeak-3.8a1 VM from Ian
Squeak 3.8 image (#6665) with KomHttpServer-gk.6
Jigsaw 2.2.5
Ruby 1.8.2 WEBrick 1.12
Python 2.3.5 BaseHTTPServer (servlets) 0.3 and SimpleHTTPServer (file server)
VisualWorks 7.3.1 WebToolkit (bundled with VW7.3.1)

Yes, these benchmarks were run on a single machine. I get far more done that way than waiting until I'm at work and have access to a network of PCs. These results will be skewed in many ways but, frankly, if I can improve performance in this configuration it will help in networks as well.

Configuration and whatnot

The Python server can be run single-threaded (select-based), multi-threaded or multi-process (forking). I used the single-threaded mode because, well, it was easy to set up. If Squeak ever outperforms this mode I'll try the others to see if they're faster.

The Jigsaw server has a ton of configuration options. I didn't play with any of them. It is a multithreaded server where each thread services possibly multiple connections using a select architecture. (Note to self: confirm this, it's been a while).

In the Squeak case I ran most of the benchmarks with Goran's FastSocketStream although it made little difference in the cases where the generated output was small.

Logging was disabled in all cases.

What I compared

I found that average response time (time from the initial request being sent to the complete HTTP response being received) was a good indicator on several accounts:

Servers with large values of this response time seemed more "sluggish" in interactive use than those with smaller values. One of my goals is to lower this number.
As a function of request rate, large increases in response time signaled the beginning of the "death" of a server. That is, when the response times begin to climb they usually do so rapidly and the server begins timing out on many connections. My other goal is to increase the request rate at which this happens, hopefully making it comparable to other web servers.

It was also useful to keep an eye on the rate at which connections were being accepted. Sometimes the response time remained reasonably fast on average even though many connections were not being serviced at all.

First observations

serving a 1k file (no caching on any of the servers)

Server	Typ. response time	Max request rate	Figures (NEED DESCRIPTIONS)
Jigsaw	1-2ms	>400/s	jigsaw1k_rates.ps jigsaw1k_times.ps
Python	1-2ms	290/s	python1k_rates.ps python1k_times.ps
Ruby	3-10ms	110/s	ruby1k_rates.ps ruby1k_times.ps
Squeak	>14ms	130/s	squeakOrig1k_rates.ps squeakOrig1k_times.ps
Squeak (FastSocketStream)	>14ms	150/s	fss1k_rates.ps fss1k_times.ps
VisualWorks	2-3ms	270/s	vw1k_rates.ps vw1k_times.ps

serving approx. 3k from a "servlet"

Server	Typ. response time	Max request rate	Figures
Python	0.1ms	310-350/s	pythonServlet1_rates.ps pythonServlet1_times.ps
VisualWorks	2-3ms	270/s	vwServlet_rates.ps vwServlet_times.ps
Squeak	7-10ms	170-210/s	squeakOrigServlet_rates.ps squeakOrigServlet_times.ps

Squeak VM: the first pass

Squeak's socket I/O plugin uses Ian's aio (asynchronous I/O) library. Basically, registered file descriptors are polled for readiness (using select()) at "convenient times". This polling is done via the function aioPoll in aio.c. All socket I/O is done in non-blocking mode...this keeps the single-threaded Squeak VM from blocking on a read or a write but means we need to periodically check whether the socket is ready to be read from or written to. The problem is, when do we check? If the VM is busy executing bytecodes, the "checkForInterrupts" routine is called at 1ms intervals. In the default configuration this routine will poll for socket I/O every 200ms. Wow...this seems too long. Still, if you look further you can see that aioPoll() is also called at various other times in the display driver code. The net effect is that it is called pretty often. Also, aioPoll() takes an argument indicating how long it should sit in select(). Most of the time this argument is zero but there are times when it isn't. In the cases where it isn't, Squeak is dedicated to socket I/O for the specified interval. So, the question is, will adjusting the 200ms interval used by checkForInterrupts affect anything?
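
To make the polling idea concrete, here is a minimal sketch of a select()-based readiness check with a caller-supplied wait time. It is not the VM's aioPoll(); the name poll_readable and its parameters are made up for illustration. Passing 0 microseconds corresponds to the "don't sit in select()" case described above.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper (not the VM's aioPoll): check `count` descriptors
   for readability, waiting at most `microseconds` (0 = return at once).
   ready[i] is set to 1 if fds[i] is readable, else 0.
   Returns the number of ready descriptors, or -1 on error.
   Descriptors must be smaller than FD_SETSIZE. */
int poll_readable(const int *fds, int count, long microseconds, int *ready)
{
  fd_set readSet;
  struct timeval timeout;
  int i, n, maxFd = -1;

  FD_ZERO(&readSet);
  for (i = 0; i < count; i++)
    {
      FD_SET(fds[i], &readSet);
      if (fds[i] > maxFd)
        maxFd = fds[i];
    }
  timeout.tv_sec  = microseconds / 1000000;
  timeout.tv_usec = microseconds % 1000000;

  n = select(maxFd + 1, &readSet, NULL, NULL, &timeout);
  if (n < 0)
    return -1;
  for (i = 0; i < count; i++)
    ready[i] = FD_ISSET(fds[i], &readSet) ? 1 : 0;
  return n;
}
```

Anything that becomes ready between two such calls simply waits until the next one, which is why the interval between polls shows up directly in the response time.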

There was some activity on the vm-dev list indicating that it does. They also indicated that setting TCP_NODELAY helps as well. I made my own patches (none were supplied on vm-dev) and got a significant improvement in Squeak's I/O in the servlet case:

Application	poll delay	Typ. response time	Max request rate	Figures (NEED DESCRIPTIONS)
servlet	5ms	7-10ms	250-290/s	delayFive_rates.ps delayFive_times.ps
servlet	1ms	6-10ms	230-270/s	delayOne1_rates.ps delayOne1_times.ps
1k file	5ms	>12ms	110/s	delayFive1k_rates.ps delayFive1k_times.ps
1k file	1ms	>11ms	110/s	delayOne1k_rates.ps delayOne1k_times.ps

This improves things in the servlet case (but at the expense of more polling). Take a look at the graphs in the 5ms case and compare them to the original VM graphs above. You'll see that this version makes Squeak much more responsive at higher request rates.
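
The patches themselves are not reproduced on this page. As a rough sketch of the TCP_NODELAY part only, this is how the option is typically set on a connected socket with the standard sockets API. The helper name set_tcp_nodelay is an assumption for illustration and is not taken from the VM sources.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: disable Nagle's algorithm so that small writes
   are sent immediately instead of being held back for coalescing.
   Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_tcp_nodelay(int sock)
{
  int one = 1;
  return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

In a server this would normally be called once per accepted connection, before the first write.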

Async file I/O: proof of concept

Still, file I/O remains a problem. Squeak's standard file I/O is blocking...at the VM level. That is, the whole VM blocks when a file read or write blocks. This is (obviously) a bad thing in the case of the 1k file example. The Linux VM comes with AsyncFilePlugin, which makes it pretty easy to use the same mechanism used with sockets (Ian's aio library). I whipped up a quick file server (hardcoded file name...sorry) just to prove to myself that it would perform better.

CDSAsyncFileKomPlug.st

I also tried with and without the O_NOATIME (don't update the UNIX atime, or access time, when opening files read-only) and O_NDELAY flags added to the open call in sqUnixAsyncFile.c. I added O_NDELAY to prevent the open from blocking. It probably has no effect on a regular file. All of these were run with a "poll delay" of 5ms. Here are the results:

Application	options	Typ. response time	Max request rate	Figures (NEED DESCRIPTIONS)
1k file	none	9-16ms	250/s	asyncDelayOne1k_rates.ps asyncDelayOne1k_times.ps (files are misnamed, sorry)
1k file	O_NOATIME O_NDELAY	5-20ms	270/s	ndelay1_rates.ps ndelay1_times.ps

Clearly a significant improvement. Just compare these graphs to the ones in the previous section. We are serving files much more quickly and remaining responsive at much higher request rates. Still, we can't even compete with Python here. We've got to do better...

From now on I'm going to use the 5ms poll delay and the O_NOATIME and O_NDELAY flags as described here unless I note otherwise.
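
For readers who want to try the same flags outside the VM, here is a minimal sketch of an open call using them. The helper name open_for_serving and the EPERM fallback are my assumptions, not part of sqUnixAsyncFile.c. The fallback is there because Linux refuses O_NOATIME with EPERM when the caller neither owns the file nor is privileged.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* O_NOATIME is a Linux extension */
#endif
#include <errno.h>
#include <fcntl.h>

#ifndef O_NOATIME
#define O_NOATIME 0          /* flag unavailable: treat it as a no-op */
#endif

/* Hypothetical helper: open `path` read-only and non-blocking without
   updating its access time. Retries without O_NOATIME if the kernel
   rejects the flag for this file.
   Returns a file descriptor, or -1 on error. */
int open_for_serving(const char *path)
{
  int fd = open(path, O_RDONLY | O_NDELAY | O_NOATIME);
  if (fd < 0 && errno == EPERM)
    fd = open(path, O_RDONLY | O_NDELAY);
  return fd;
}
```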

Async files: second pass

I've now taken this a bit further and written a "near ModFile" replacement (with some caveats, of course). In the process I developed a buffering Stream that is based on the AsyncFilePlugin. This stream might be more generally useful. Unfortunately the AsyncFilePlugin gave me some problems. In particular there didn't seem to be a way to detect end of file. It also didn't include methods for determining file size. Here are patches which add these to the plugin. You'll need to rebuild the plugin with this patch:

AsyncFilePlugin Changes.cs

and add this function to sqUnixAsyncFile.c:

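/*CDS added function*/
/* Needs <sys/stat.h>. Returns the file size in bytes, or -1 on failure. */
squeakFileOffsetType asyncFileSize(AsyncFile *f)
{
  struct stat s;
  FilePtr fp;
  validate(f);
  if ((fp = (FilePtr)f->state))
    {
      if (fstat(fp->fd, &s) < 0)
	{
	  fprintf(stderr, "Error stating file.\n");
	  return -1; /* s is not valid if fstat failed */
	}
      return s.st_size;
    }
  return -1; /* no file state attached */
}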

and modify the following function in that same file:

int asyncFileReadResult(AsyncFile *f, int bufferPtr, int bufferSize)
{
  FilePtr fp= 0;
  int n= 0;
  validate(f);
  fp= (FilePtr)f->state;
  n= read(fp->fd, (void *)bufferPtr, bufferSize);
  if      ((n < 0) && (errno == EWOULDBLOCK))
    return fp->rd.status= Busy;
  else if (n < 0) /*CDS changed from <= to < */
    return fp->rd.status= Error;
  else /* (n >= 0) */  /*CDS changed comment */
    fp->rd.pos += n;

  return fp->rd.status= n;
}

Once you get a new VM up and running with these patches you'll need to load the AsyncFile patches:

AsyncFile Changes.cs

into any image which uses this stream. Here's the stream and Comanche module itself:

AsyncFileStream.st

Finally, here are my benchmark results:

Application	Typ. response time	Max request rate	Figures (NEED DESCRIPTIONS)
1k file	8-20ms	290/s	asyncStreamWithIAF2_rates.ps asyncStreamWithIAF2_times.ps

Very good news. Seems relatively performant. Note, though, that this example is quite biased: I used exactly a 1k buffer to read a 1k file. It will probably need some optimization for files of other sizes.

Squeak VM: second pass

What's next? Clearly async file I/O is needed for a performant server. I'm at a bit of a juncture. I could rewrite the file I/O code (which is currently cross-platform) so that the UNIX version uses aio (leaving me to deal with parts of Squeak which expect it to block the image).

There's something missing here still, though. I don't think the current aio library is sufficient. Polling is a waste of CPU time, especially if there are better things to do. Simply increasing the polling frequency is probably not going to make Squeak a more responsive server (in fact, my evidence above shows that it doesn't as one goes from 5ms to 1ms). I'm now starting to think about reworking Ian's aio code to use some other mechanism:

I/O threads
POSIX aio support
signal on readable/writeable support in Linux

The last two would mean that Squeak's I/O is signal driven. Research seems to indicate that select (poll or epoll) systems are faster but I don't think that really applies here. The problem we're having is that there is work to be done (connection to be accepted, socket ready to be read or written to) but we're not seeing it. Then, when we do see it we get a burst of activity to deal with. I don't think we've reached the performance threshold where massive signaling will become a problem.

No matter which we choose we need a way to signal Squeak semaphores in a thread-safe or signal-safe way in UNIX. Maybe we can mimic the Windows code here?
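
One possible shape for this, sketched under my own assumptions (it is not the Windows VM code, and every name here is hypothetical): the signal handler only records which semaphore indices need signaling, and the interpreter drains that record from its normal interrupt check, where it is safe to touch Squeak objects.

```c
#include <signal.h>

#define MAX_SEMAPHORES 64   /* arbitrary limit for this sketch */

/* One flag per semaphore index. Writing a sig_atomic_t is safe from a
   signal handler. Repeated requests for one index are coalesced. */
static volatile sig_atomic_t pendingSignals[MAX_SEMAPHORES];

/* Safe to call from a signal handler: just marks the index pending. */
void request_semaphore_signal(int index)
{
  if (index >= 0 && index < MAX_SEMAPHORES)
    pendingSignals[index] = 1;
}

/* Called from the interpreter's own thread. Copies up to `capacity`
   pending indices into `indices`, clears their flags, and returns how
   many were copied. The caller then signals each semaphore itself. */
int drain_semaphore_signals(int *indices, int capacity)
{
  int i, found = 0;
  for (i = 0; i < MAX_SEMAPHORES && found < capacity; i++)
    if (pendingSignals[i])
      {
        pendingSignals[i] = 0;
        indices[found++] = i;
      }
  return found;
}
```

Because the flags coalesce, a semaphore requested twice before a drain is signaled once. That is fine for "something is ready, go look" notifications but would lose counts if each signal had to be delivered individually.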

In order to continue to make forward progress I am currently working on a somewhat hybrid approach to this last problem. I'm leaving file I/O as it is for now but I'm going to use my simplistic AsyncFilePlugin-based server. I'm arranging for signals on any readable, writeable, error-able? or acceptable file descriptor and the signals will force an interrupt check as well as an aioPoll. This is where I am today (9/10/2005).