Last updated at 3:50 am UTC on 22 March 2007
NuBlt is a plan by Eddie Cottongim for an optimized BitBlt. I'm not working on this heavily right now, but would be willing to help if others have a particular interest in further enhancements.
The motivation for this was that the LookEnhancements in 3.9 put a heavy load on BitBlt for window resizing and other common operations that people expect to feel very fluid.
The system is implemented in Slang, similar to BitBlt. Substantial parts are generated from templates to cut down the maintenance required. The main thing done to help performance is the removal of all possible per-pixel branches and function calls. This turns BitBlt "inside-out": all those branches move to the "outer" levels instead of the innermost ones.
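To make the "inside-out" idea concrete, here is a minimal sketch in C (the language Slang is translated to). It is not NuBlt's code: the function names, the three rules shown, and the 32-bit word-at-a-time layout are assumptions chosen for illustration. The first function tests the rule for every word; the second tests it once and then runs a loop with no rule branch in it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative rule numbers; only three rules are sketched here. */
enum { RULE_AND = 1, RULE_COPY = 3, RULE_OR = 7 };

/* Classic shape: the rule is re-tested for every word in the row. */
static void combine_generic(int rule, const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        switch (rule) {
        case RULE_AND:  dst[i] = src[i] & dst[i]; break;
        case RULE_COPY: dst[i] = src[i];          break;
        case RULE_OR:   dst[i] = src[i] | dst[i]; break;
        default:        break; /* unknown rule: leave destination alone */
        }
    }
}

/* Inside-out shape: one tight loop per rule, with no rule test inside. */
static void row_and(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) dst[i] = src[i] & dst[i];
}

static void row_copy(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) dst[i] = src[i];
}

static void row_or(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) dst[i] = src[i] | dst[i];
}

/* The rule is examined once, at the outer level, then a loop is chosen. */
static void combine_inside_out(int rule, const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    switch (rule) {
    case RULE_AND:  row_and(src, dst, n);  break;
    case RULE_COPY: row_copy(src, dst, n); break;
    case RULE_OR:   row_or(src, dst, n);   break;
    default:        break; /* unknown rule: leave destination alone */
    }
}
```

Both versions produce the same pixels; the difference is only where the decision is made. The cost is the one the page describes: every rule needs its own loop, which is why templates are attractive for generating them.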
Here's the current state, with estimated difficulty on a scale of 1 to 5 for ToDo items:
17 of 34(ish) combination rules implemented. (Adding any one rule: (1); finishing them all: (2))
All depth conversion code is implemented
Constant color fills ("nosource") not implemented (2)
WarpBlt not implemented (5)
A few bit-twiddling bugs remain (2)
Special case performance enhancements (2)
Halftoning not implemented (2)
A few crash bugs remain (3)
NuBlt is a multipass design that tries to have an optimal routine for each pass. There's a single, simple "main loop" that calls optimized versions of each function (for example, depth conversion, alignment, combination).
The classic BitBlt has the entire logic combined into one loop, which tends to have very general code. Several copies of the big loop exist with various optimizations (for example, one copy has the source=null optimization).
The multipass design benefits from the assumption of a sizeable L1 cache, where memory latency is seen only on the first pass, with subsequent passes operating on data already in the cache.
The multipass design can also add additional passes relatively easily; a halftone pass should be trivial, and a warp pass should be possible (though not as trivial).
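As a rough illustration of the multipass shape, the C sketch below runs two passes per scanline through a small temporary row buffer: a depth conversion, then a combination. Everything here is hypothetical: the names, the 8-bit gray source format, the fixed-size buffer, and the choice of OR as the combination are assumptions for the example, and the alignment pass is left out. The point is only that each pass is a simple specialized loop and that the second pass reads data the first pass has just written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_ROW_PIXELS = 1024 }; /* hypothetical limit for the sketch */

/* Each pass is a plain function over one row; both types are assumptions. */
typedef void (*convert_fn)(const uint8_t *src_row, uint32_t *tmp, size_t width);
typedef void (*combine_fn)(const uint32_t *tmp, uint32_t *dst_row, size_t width);

/* Pass 1 example: expand 8-bit gray pixels to 32-bit 0x00RRGGBB. */
static void convert_gray8_to_rgb32(const uint8_t *src_row, uint32_t *tmp, size_t width)
{
    for (size_t x = 0; x < width; x++)
        tmp[x] = (uint32_t)src_row[x] * 0x010101u;
}

/* Pass 2 example: OR the converted pixels into the destination row. */
static void combine_or32(const uint32_t *tmp, uint32_t *dst_row, size_t width)
{
    for (size_t x = 0; x < width; x++)
        dst_row[x] |= tmp[x];
}

/* The "main loop": no pixel logic of its own, it only sequences passes.
   Pitches are in elements (bytes for the source, words for the destination). */
static void multipass_blt(const uint8_t *src, size_t src_pitch,
                          uint32_t *dst, size_t dst_pitch,
                          size_t width, size_t height,
                          convert_fn convert, combine_fn combine)
{
    uint32_t tmp[MAX_ROW_PIXELS]; /* small enough to stay cache-resident */
    assert(width <= MAX_ROW_PIXELS);
    for (size_t y = 0; y < height; y++) {
        convert(src + y * src_pitch, tmp, width);
        combine(tmp, dst + y * dst_pitch, width);
    }
}
```

In this shape, adding a pass (a halftone step, say) would mean adding one more function pointer and one more call in the main loop, without touching the existing passes.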
Special attention has been paid to small blts (such as those typical of font rendering). In these cases, only a single dereference of each optimized function is required per blt. This was done by moving the 'y loop' inside the worker functions.
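A minimal C sketch of what moving the 'y loop' into the worker could look like is given below. The worker signature, the names, and the rule number used for selection are assumptions, not NuBlt's actual interface. The worker receives the whole rectangle, so the caller goes through the function pointer once per blt rather than once per scanline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical worker type: handles all rows of the blt in one call.
   Pitches are measured in 32-bit words. */
typedef void (*blt_worker)(const uint32_t *src, size_t src_pitch,
                           uint32_t *dst, size_t dst_pitch,
                           size_t width, size_t height);

/* Copy worker with the y loop inside it. */
static void copy32_worker(const uint32_t *src, size_t src_pitch,
                          uint32_t *dst, size_t dst_pitch,
                          size_t width, size_t height)
{
    assert(width <= src_pitch && width <= dst_pitch);
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++)
            dst[x] = src[x];
        src += src_pitch; /* advance both pointers to the next scanline */
        dst += dst_pitch;
    }
}

/* Selection happens once per blt; only one illustrative rule is wired up.
   Returns NULL for rules this sketch does not cover. */
static blt_worker select_worker(int rule)
{
    return rule == 3 ? copy32_worker : NULL;
}
```

A caller would fetch the worker with `select_worker(rule)` and invoke it once with the full width and height. For a glyph a few rows tall, that replaces several indirect calls with one.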
The downside of this is that there is a lot of code to write, including about 30 depth conversions. The code for some of these can be generated. More code for the combination rules may also be required; for example, I did 3 'inner loops' for Rule 25 and 2 for Rule 3.
Writing all the specific fast code, as well as handling the large number of unusual modes (halftones, constant color fills, warp blt), has been more than I felt like dealing with. It should all be possible with this design, though.