## TBB Mandelbrot Set

In an earlier post, I created a simple prime-finding program using Intel's TBB (Threading Building Blocks). The main benefit of using TBB is that threading and thread-synchronization mechanisms are abstracted away within the TBB library, so we do not need to deal with threads explicitly. Also, TBB is optimized for performance and scales nicely as the number of processing units increases. In this post, I will show you how to create a Mandelbrot set generator using TBB and how to optimize the algorithm using loop unrolling.

The standard algorithm for generating the Mandelbrot set is extremely easy to adapt to TBB. In fact, the loops look almost identical to those in the single-threaded version, except that the iterations are computed within a 2D range block (blocked_range2d) instead of over the entire two-dimensional space.

```cpp
void operator() (const blocked_range2d<size_t> &r) const
{
    drawing_area drawing(0, 0, screen_size, screen_size);

    for (size_t x = r.rows().begin(); x < r.rows().end(); x++)
    {
        for (size_t y = r.cols().begin(); y < r.cols().end(); y++)
        {
            float zx = 0;
            float zy = 0;

            // Map the pixel coordinate to a point c in the complex plane.
            float cx = (float)x / (float)screen_size * x_range + x_min;
            float cy = (float)y / (float)screen_size * y_range + y_min;
            int i = 0;

            // Standard escape-time iteration: z = z^2 + c.
            while (zx * zx + zy * zy <= 4 && i < max_iteration)
            {
                float xtemp = zx * zx - zy * zy + cx;
                zy = 2 * zx * zy + cy;
                zx = xtemp;
                i++;
            }

            // Map the iteration count to a grayscale color.
            int itr = i % 255;
            color_t c = itr << 16 | itr << 8 | itr;

            drawing.set_pixel(x, y, c);
        }
    }
}
```

Because Xlib by itself is not thread-safe, special attention must be paid when updating the display concurrently. One way to address this issue is to employ a shared memory region (X11/extensions/XShm.h and sys/shm.h): the image is first built in shared memory, and that memory is then attached to the display. In the example above I used code (video.h, xvideo.cpp) from the samples that come with the TBB library, which uses this shared-memory method to make the X11 calls thread-safe.

Many optimization methods can be used to further enhance the performance of the algorithm. One of the most effective is to utilize the SSE instructions found on all modern Intel processors (examples can be found here: Using SSE3 Technology in Algorithms with Complex Arithmetic). This approach, however, can be difficult to implement and debug, since parallel data structures must be used in order to benefit from SSE instructions. Also, explicit assembly-level coding makes porting the code to other machine architectures a daunting task. Modern compilers can already take good advantage of the underlying machine architecture; for example, gcc (4.2.3) already generates SSE instructions for the code snippet above. While hand tuning with SSE instructions might further improve performance, we would certainly sacrifice code simplicity and portability.
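As a rough illustration, auto-vectorization can be encouraged at compile time with flags along these lines (exact flags and defaults vary by gcc version, and the file name here is a placeholder):

```shell
# -msse3 allows SSE3 code generation; -ftree-vectorize asks gcc to
# auto-vectorize suitable loops (implied by -O3 on later gcc versions).
g++ -O2 -msse3 -ftree-vectorize -c mandelbrot.cpp -o mandelbrot.o
```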

The approach I am going to take to further optimize the code is loop unrolling. Since the inner loop of the standard algorithm is quite short, unrolling it should reduce the loop overhead and decrease the chance of stalling the pipeline on mispredicted branches. So a high-level loop unrolling should be able to improve the performance.

Here is the code after the inner loop is unrolled:

```cpp
void operator() (const blocked_range2d<size_t> &r) const
{
    drawing_area drawing(0, 0, screen_size, screen_size);

    // The x loop advances by 2: each pass computes two horizontally
    // adjacent pixels in lockstep.
    for (size_t x = r.rows().begin(); x < r.rows().end(); x += 2)
    {
        for (size_t y = r.cols().begin(); y < r.cols().end(); y++)
        {
            float zx1 = 0;
            float zx2 = 0;
            float zy1 = 0;
            float zy2 = 0;

            float cx1 = (float)x / (float)screen_size * x_range + x_min;
            float cx2 = (float)(x + 1) / (float)screen_size * x_range + x_min;

            float cy = (float)y / (float)screen_size * y_range + y_min;

            int i1 = 0;
            int i2 = 0;
            bool loop_stop1 = false;
            bool loop_stop2 = false;

            // Iterate both points until each one has either escaped or
            // reached the iteration cap.
            while (!(loop_stop1 && loop_stop2))
            {
                float xtemp1;
                float xtemp2;

                if ((zx1 * zx1 + zy1 * zy1 <= 4) && (i1 < max_iteration))
                {
                    xtemp1 = zx1 * zx1 - zy1 * zy1 + cx1;
                    zy1 = 2 * zx1 * zy1 + cy;
                    zx1 = xtemp1;
                    i1++;
                }
                else
                {
                    loop_stop1 = true;
                }

                if ((zx2 * zx2 + zy2 * zy2 <= 4) && (i2 < max_iteration))
                {
                    xtemp2 = zx2 * zx2 - zy2 * zy2 + cx2;
                    zy2 = 2 * zx2 * zy2 + cy;
                    zx2 = xtemp2;
                    i2++;
                }
                else
                {
                    loop_stop2 = true;
                }
            }

            int itr = i1 % 255;
            color_t c = itr << 16 | itr << 8 | itr;
            drawing.set_pixel(x, y, c);

            itr = i2 % 255;
            c = itr << 16 | itr << 8 | itr;
            drawing.set_pixel(x + 1, y, c);
        }
    }
}
```

This code generates results identical to the code shown previously. As you can see, the inner loop is now unrolled to handle two data points at a time.

As it turned out, this algorithm runs almost twice as fast as the code mentioned earlier (280 ms versus 510 ms on an Intel Q9450 @ 3.4 GHz).