Draw circles in C

Today, we are going to draw circles. To be more specific, we are going to study algorithms designed to fill a rasterized circle in a given buffer. If you already have some experience with C or C++, this article is for you.

How to draw a circle

Simple: two for loops and a square root, right? Sure, but that's the easy way. We want it fast. As fast as possible. And the naive algorithm checks an entire square just to draw a circle.

#include <math.h>
#include <stdint.h>

/* buf must hold (2 * radius) * (2 * radius) cells, zero-initialized */
void draw_slow(uint8_t* buf, int radius)
{
	int i;
	int k;

	for (i = 0; i < radius; ++i)
	{
		for (k = 0; k < radius; ++k)
		{
			if (ceil(sqrt(k * k + i * i)) < radius)
			{
				buf[(radius + i) * radius * 2 + radius + k] = 1;
				buf[(radius + i) * radius * 2 + radius - k] = 1;
				buf[(radius - i) * radius * 2 + radius + k] = 1;
				buf[(radius - i) * radius * 2 + radius - k] = 1;
			}
		}
	}
}

For convenience, we will make each cell of the buffer two characters wide when showing a circle:

          ##
      ##########
    ##############
    ##############
  ##################
    ##############
    ##############
      ##########
          ##
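A throwaway helper for producing that kind of output could look like this (render_row is a hypothetical name, not something used later in the article):

```c
#include <stdint.h>
#include <string.h>

/* writes one row of the buffer into out as two characters per cell,
 * "##" for a filled cell and "  " for an empty one;
 * out must hold 2 * width + 1 bytes */
void render_row(const uint8_t* row, int width, char* out)
{
	int i;

	for (i = 0; i < width; ++i)
	{
		out[2 * i] = row[i] ? '#' : ' ';
		out[2 * i + 1] = row[i] ? '#' : ' ';
	}

	out[2 * width] = '\0';
}
```

Calling it once per row of the buffer reproduces the figures shown in this article.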

Obvious optimizations

The first idea one has when looking at this result is to check only a quarter of the circle and use symmetry.

          #O
      #####OOOOO
    #######OOOOOOO
    #######OOOOOOO
  #########OOOOOOOOO
    ##############
    ##############
      ##########
          ##

If we take a closer look, however, we notice that an eighth is enough. And that's where things become interesting.

          ##
      ##########
    ############OO
    ##########OOOO
  #########OOOOOOOOO
    ##############
    ##############
      ##########
          ##

Looping through a rectangle containing an eighth of a circle is pretty cheap, as the covered area is only (sin(PI/4) * radius) * radius, which translates to (sqrt(2)/2) * radius * radius. Doing that, we already save some computing power, but there are still some cells being checked for nothing inside the circle.

          ##
      ##########
    #######.....OO
    #######...OOOO
  #########OOOOOOOOO
    ##############
    ##############
      ##########
          ##

Easy optimizations

And the fix for that is both easy and helpful for further optimization. It's time for some clever loop tweaking. Instead of checking every cell inside the rectangle, we can start each row on the diagonal edge of the circle's section and walk toward the right border.

          ##
      ##########
    #######.....*O
    #######...*OOO
  #########*OOOOOOOO
    ##############
    ##############
      ##########
          ##

To do that, we need two things: the coordinates of the starting cell of each row, which conveniently sits on the diagonal (its x and y coordinates are equal), and a way to tell when a cell falls outside of the circle, which the square root already gives us.

Now we simply need three steps for each row: start on the diagonal, fill cells while they pass the distance check, and move down to the next row as soon as the check fails.

Tricky optimizations

But something should shock us here: on every row, we keep paying for a square root on cells that lie beyond the border (the dotted cells below), and inside the circle every single cell still pays for one too.

          ##
      ##########
    #######     *O..
    #######   *OOO..
  #########*OOOOOOOO
    ##############
    ##############
      ##########
          ##

To these issues there is one simple, single fix: adding an ending cell index. This index will be computed with the mandatory, expensive square root, but all the other cells of the row will then be filled directly, until the index is re-computed for the next row.

          ##
      ##########
    #######     ** 
    #######   *OO* 
  #########*OOOOOOO* 
    ##############
    ##############
      ##########
          ##

Beautiful, right? There is an issue, though: with the eight-fold symmetry, each "diagonal" cell would be filled twice, and this is unacceptable. To prevent that, we will fill the diagonal cells after the rows, so we still benefit from the first loop without performing extra operations.

Cleaning the code

Now, remember the starting cell of each row has equal x and y coordinates. This is very useful, as it saves us a dedicated first loop index: the row index doubles as the starting column. Just make it start at 0 (as usual), increment it at each loop, and stop the algorithm when it reaches the ending bound. Voilà!

#include <math.h>
#include <stdint.h>

/* buf must hold (2 * radius) * (2 * radius) cells, zero-initialized */
void draw_fast(uint8_t* buf, int radius)
{
	int i;
	int l;
	int k = radius;

	for (i = 0; i < k; ++i)
	{
		for (l = i + 1; l < k; ++l)
		{
			buf[((radius + i) * radius * 2) + radius + l] = 1; // fill sector ESE
			buf[((radius + i) * radius * 2) + radius - l] = 1; // fill sector WSW
			buf[((radius - i) * radius * 2) + radius + l] = 1; // fill sector ENE
			buf[((radius - i) * radius * 2) + radius - l] = 1; // fill sector WNW
			buf[((radius + l) * radius * 2) + radius + i] = 1; // fill sector SSE
			buf[((radius + l) * radius * 2) + radius - i] = 1; // fill sector SSW
			buf[((radius - l) * radius * 2) + radius + i] = 1; // fill sector NNE
			buf[((radius - l) * radius * 2) + radius - i] = 1; // fill sector NNW
		}

		// fill diagonals only once; cell (i, i) belongs to the circle
		// exactly when 2 * i * i <= (radius - 1) * (radius - 1)
		if (2 * i * i <= (radius - 1) * (radius - 1))
		{
			buf[((radius + i) * radius * 2) + radius + i] = 1;
			buf[((radius + i) * radius * 2) + radius - i] = 1;
			buf[((radius - i) * radius * 2) + radius + i] = 1;
			buf[((radius - i) * radius * 2) + radius - i] = 1;
		}

		// shrink the bound when the last cell of the next row leaves the circle
		if (ceil(sqrt((i + 1) * (i + 1) + (k - 1) * (k - 1))) >= radius)
		{
			--k;
		}
	}
}

Give me numbers

Of course I ran benchmarks :) - (times are in microseconds)

10 million rounds with a radius of 20 units
without compiler optimizations (-O0)
better algorithm: 15934912
simple algorithm: 45828194

with compiler optimizations (-O2)
better algorithm: 5510307
simple algorithm: 25197102

1 million rounds with a radius of 100 units
without compiler optimizations (-O0)
better algorithm: 39195045
simple algorithm: 110287512

with compiler optimizations (-O2)
better algorithm: 11948697
simple algorithm: 64951124

Wait a minute...

As you noticed, instead of computing the ending cell index (k) for every row, I made it start at the same value as the radius, and chose to decrement it whenever it falls out of the circle. This works because each row's ending bound is shifted by at most one cell compared to the previous row's. And of course it opens the door to even more optimization: we could compute exactly how precise the square root has to be for the algorithm to work properly.

We are not going to do that, however. It would be quite complex, and not really worth it. What we are about to do next, however, definitely is. And you're not ready.

Carmack's touch

Some of you might have guessed where I was going as soon as you read the words "square root". Upon the release of Quake III's source code, some people noticed a weird piece of utter genius:

float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                       // evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

	return y;
}

We could discuss these few lines for hours, but others have done it before. In short, it is an incredibly fast inverse square root routine, which uses a very good approximation as the initial value of a single round of Newton's method. That approximation relies on a few binary tricks, hence the magic number and the commenting distress.

Why not use this with our algorithm and check the difference? We need to re-enable the commented line, because that extra precision is needed here. But otherwise: pre-compute an inverse radius, change the comparison signs, and go!

What's the point?

This article is about hacking. And here is a benchmark for you.

10 million rounds with a radius of 20 units
without compiler optimizations (-O0)
better + Q_rsqrt: 19034944
better algorithm: 15934912

with compiler optimizations (-O2)
better + Q_rsqrt: 7500707
better algorithm: 5510307

1 million rounds with a radius of 100 units
without compiler optimizations (-O0)
better + Q_rsqrt: 40792434
better algorithm: 39195045

with compiler optimizations (-O2)
better + Q_rsqrt: 12767433
better algorithm: 11948697

Yes, you read that right. And what you didn't know is that these benchmarks were run under Linux, on a Xeon processor. How is this possible? Did Wikipedia lie to us? Where is the processing time going?

What could make the results worse?!

When I was faced with that, I was pretty disappointed. And then I remembered a book I had read, called "Hacker's Delight". In this book, you will find a lot of simple programming tricks. Among them is the answer to the famous "float or double" question. On some machines, floats, as opposed to doubles, are not handled natively, which makes even the most basic operations very expensive. And this is exactly what happened with this processor. Intel manufactures a lot of variations of their products. On the spec sheet, they look the same. But for hackers, they are different.

You know what? Let's change Quake's inverse square root. Let's make it use doubles.
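A 64-bit magic constant often cited for the double-precision version is 0x5fe6eb50c7b537a9; note that this constant comes from later analyses of the trick, not from the Quake sources. Everything else is the same recipe with 64-bit types, and the second Newton iteration kept for precision.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* double-precision take on the fast inverse square root */
double q_rsqrt_d(double number)
{
	const double threehalfs = 1.5;
	double x2 = number * 0.5;
	double y = number;
	int64_t i;

	memcpy(&i, &y, sizeof(i));           /* bits of the double */
	i = 0x5fe6eb50c7b537a9LL - (i >> 1); /* 64-bit magic initial guess */
	memcpy(&y, &i, sizeof(y));
	y = y * (threehalfs - (x2 * y * y)); /* 1st Newton iteration */
	y = y * (threehalfs - (x2 * y * y)); /* 2nd iteration, for extra precision */

	return y;
}
```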

10 million rounds with a radius of 20 units
without compiler optimizations (-O0)
better + Q_rsqrt: 12573643
better algorithm: 15934912

with compiler optimizations (-O2)
better + Q_rsqrt: 4195733
better algorithm: 5510307

1 million rounds with a radius of 100 units
without compiler optimizations (-O0)
better + Q_rsqrt: 26138690
better algorithm: 39195045

with compiler optimizations (-O2)
better + Q_rsqrt: 7803073
better algorithm: 11948697

We did it. We hacked the code. We understood what was going on inside the computer, and used this knowledge to achieve something. Here, it was performance. But you, what are you looking for?