Optimizing a screen scaling routine  (September 3rd, 2007 at 8:49 pm)

I can reveal two things about the game I’m currently working on:

  • It’s awesome
  • It has several user-selectable video modes

I’ve been working on the former for a couple of months now, but I only put some real work into the latter a couple of days ago.

My game presents a unique challenge because it runs in a really small window, 400×300 to be exact. While it looks great as it is, I know many people (myself included to an extent) like playing games in fullscreen, or at least in substantially larger windows. To keep everybody happy, I included two video modes apart from the default window mode:

  • Window 2x, which scales the screen by 200%
  • Fullscreen

These video modes work by drawing the game to a 400×300 rendering surface (called screen) and then scaling it up to the 800×600 backbuffer/screen (called fs_screen). Sounds easy, right? I mean, all you'd have to do is write a simple routine that takes each pixel in screen and turns it into four pixels in fs_screen. Kind of like this:

Iteration #1: the naive routine

sys_lockSurface(fs_screen);
sys_lockSurface(screen);
 
Uint16* src  = (Uint16*)   screen->pixels;
Uint16* dest = (Uint16*)fs_screen->pixels;
 
for (unsigned int y=0; y<SCR_HEIGHT; y++)
{
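   /* write every source row twice to double the height */
   for (int t=0; t<2; t++)
   {
      for (unsigned int x=0; x<SCR_WIDTH; x++)
      {
         /* write every source pixel twice to double the width */
         *(dest+(x<<1)  ) = *(src+x);
         *(dest+(x<<1)+1) = *(src+x);
      }
 
      /* pitch is in bytes; >>1 converts it to 16-bit pixels */
      dest += (fs_screen->pitch>>1);
   }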
   }
 
   src += (screen->pitch>>1);
}
 
sys_unlockSurface(screen);
sys_unlockSurface(fs_screen);

This worked fine in fullscreen, but it was a bit sluggish in window 2x mode. My rudimentary profiling method (read: SDL_GetTicks) told me that the routine took 3-4 milliseconds per call. Cue for me to put on my “optimization hat” and get cracking! The first improvement I saw was a rather obvious one.

Iteration #2: the memcpy

sys_lockSurface(fs_screen);
sys_lockSurface(screen);
 
Uint16* src  = (Uint16*)   screen->pixels;
Uint16* dest = (Uint16*)fs_screen->pixels;
 
unsigned int dest_scanline = (fs_screen->pitch>>1);
 
for (unsigned int y=0; y<SCR_HEIGHT; y++)
{
   for (unsigned int x=0; x<SCR_WIDTH; x++)
   {
      *(dest+(x<<1)  ) = *(src+x);
      *(dest+(x<<1)+1) = *(src+x);
   }
 
   /* the next destination row is identical, so copy it in one go */
   memcpy( (void*)(dest + dest_scanline), (void*)dest, fs_screen->pitch );
 
   dest += (dest_scanline << 1);
   src  += (screen->pitch>>1);
}
 
sys_unlockSurface(screen);
sys_unlockSurface(fs_screen);

That is, why bother writing the same row (or scanline) pixel-by-pixel twice? It only needs to be done once, with the second scanline being filled using the rather fast memcpy function. That got me down to 2-3 ms. A good start, but not good enough.

“Wait a minute … 32-bit processors! Of course!” And with that thought …

Iteration #3: the 32-bit integer

sys_lockSurface(fs_screen);
sys_lockSurface(screen);
 
Uint16* src  = (Uint16*)   screen->pixels;
Uint32* dest = (Uint32*)fs_screen->pixels;
 
unsigned int dest_scanline = (fs_screen->pitch >> 2);
 
for (unsigned int y=0; y<SCR_HEIGHT; y++)
{
   for (unsigned int x=0; x<SCR_WIDTH; x++)
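   {
      /* widen to 32 bits before shifting so the high pixel cannot overflow a signed int */
      Uint32 pixel = *(src+x);
      *(dest+x) = pixel | (pixel << 16);
   }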
 
   memcpy( (void*)(dest + dest_scanline), (void*)dest, fs_screen->pitch );
 
   dest += (dest_scanline << 1);
   src  += (screen->pitch >> 1);
}
 
sys_unlockSurface(screen);
sys_unlockSurface(fs_screen);

I honestly didn’t think this one would help much, but it was surprisingly effective. By writing a single 32-bit integer instead of two 16-bit integers for every two pixels, I shaved another millisecond off the routine’s time. I was now pushing 1-2 ms.

In hindsight this makes sense. As most modern processors are 32-bit, they can write 32 bits of memory in one go rather than as two separate 16-bit writes, so even if it takes a bit of extra work to build that 32-bit value, copying it ends up being faster.
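
To make the packing step concrete, here is a minimal standalone sketch. It is an illustration rather than the code from the game: it uses the plain uint16_t/uint32_t types from stdint.h instead of SDL's Uint16/Uint32, and the helper names double_pixel and double_row are made up for this example.

```c
#include <stdint.h>

/* Duplicate one 16-bit pixel into both halves of a 32-bit word.
   Widening first matters: a bare 16-bit value is promoted to a signed
   int, and shifting a value >= 0x8000 left by 16 would overflow it. */
uint32_t double_pixel(uint16_t p)
{
    uint32_t wide = p;
    return wide | (wide << 16);
}

/* Expand `width` source pixels into `width` 32-bit words,
   which is 2 * width destination pixels. */
void double_row(const uint16_t *src, uint32_t *dest, unsigned int width)
{
    for (unsigned int x = 0; x < width; x++)
        dest[x] = double_pixel(src[x]);
}
```

Because both halves of the word hold the same pixel, the result is the same regardless of the machine's byte order.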

My final optimization was pretty neat. It started with me reasoning that copying from one surface to another would be slower than copying data within a surface. The plan, then, was to get rid of the 400×300 surface (screen) entirely. Instead, I would draw the game onto the 800×600 surface (fs_screen), and then do an in-place scale from bottom to top.

Iteration #4: the one I wasn’t arsed to write

/* ... errr ... */

Umm, so I played around with this for a while until I was tired and went to bed instead. I did manage to get some hilariously incorrect output but in the process I noticed that my performance still wasn’t good enough despite the fact that the routine was going incredibly fast at this point. Curiously, the game was smooth in fullscreen mode, which uses the exact same routine. I’m going to dig deeper into this, but I suspect the blame lies in something silly like the window priority level for the different video modes.
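
For reference, here is a minimal sketch of how the in-place idea could work. It is not the routine from the game (that one never got finished): it operates on a plain 16-bit pixel buffer instead of an SDL surface, the function name scale2x_in_place is hypothetical, and it assumes the small image sits in the top-left corner of the big buffer. The trick is the iteration order: rows go bottom to top and pixels go right to left, so nothing is overwritten before it has been read.

```c
#include <stdint.h>
#include <string.h>

/* Scale the top-left w x h region of a 16-bit buffer to 2w x 2h in place.
   `pitch` is the row length in pixels (not bytes) and must be >= 2 * w;
   the buffer must hold at least 2 * h rows. */
void scale2x_in_place(uint16_t *pixels, unsigned int pitch,
                      unsigned int w, unsigned int h)
{
    /* Bottom to top: output rows 2y and 2y+1 never land on a source
       row above y that has not been read yet. */
    for (unsigned int y = h; y-- > 0; )
    {
        const uint16_t *src  = pixels + (size_t)y * pitch;
        uint16_t       *dest = pixels + (size_t)(2 * y) * pitch;

        /* Right to left: for y == 0 src and dest are the same row, and
           writing 2x and 2x+1 never touches an unread pixel left of x. */
        for (unsigned int x = w; x-- > 0; )
        {
            uint16_t p = src[x];
            dest[2 * x]     = p;
            dest[2 * x + 1] = p;
        }

        /* The second output row is a straight copy of the first. */
        memcpy(dest + pitch, dest, 2 * w * sizeof(uint16_t));
    }
}
```

With an SDL surface the pitch is given in bytes, so for 16-bit pixels it would be passed as fs_screen->pitch >> 1, the same conversion the routines above use.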

So my game didn’t end up gaining significant performance boosts despite all the optimizations above. Was it a waste of time then? I’d say not. Apart from being very rewarding, optimizing code is a skill that every programmer should have. And there’s no better way to build up a skill than by practicing it.

Plus, I got a blog post out of it.

Posted in Game development.

Site and contents © Mobeen Fikree. Blog powered by Wordpress.