I have been playing around with Intel’s Threading Building Block for a while and have started to really appreciate its simplicity and elegance: Instead of thinking in threads and thread synchronizations, one can just simply concentrate on the problem on the hand. Take finding prime numbers for example, while the problem itself (using the most rudimentary algorithm) is quite simple, getting it to work in a multi-threaded fashion does take a little bit of work. In this particular example, the prime finding algorithm can be easily paralleled by utilizing threads and thread synchronization is almost a non-issue since the problem domain can be divided into totally disjoint regions, but in general dividing the problem domain into multiple sub-domains and performing load balancing among them could take significant work. In the following example, I created two C++ classes that both find prime numbers for a given interval (since all prime numbers are odd numbers except 2, 2 is omitted in the calculates below), one sequential and the other parallel. In the main function, both methods are timed and the results are outputted.(download tbbprime.cpp)

#include <stdio.h>
#include <stdlib.h>
 
#include <iostream>
#include <iomanip>
#include <math.h>
 
#include "tbb/task_scheduler_init.h"
#include "tbb/tick_count.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/partitioner.h"
 
using namespace std;
using namespace tbb;
 
static const int MAX_SIZE = 1000000;
 
class prime_single_thread
{
public:
    void run(const blocked_range<size_t> &r)
    {
        bool is_prime = false;
 
        for (size_t x = r.begin(); x < r.end(); x+=2)
        {
            is_prime = true;
 
            for (int i = 3; i <= sqrt((double) x); i+=2)
            {
                if (x % i == 0)
                {
                    is_prime = false;
                    continue;
                }
            }
 
            // Output prime numbers:
//            if (is_prime)
//            {
//                cout << x << endl;
//            }
        }
    }
};
 
class prime_tbb
{
public:
    void test_prime(int num) const
    {
        bool is_prime = true;
 
        for (int i = 3; i <= sqrt((double) num); i+=2)
        {
            if (num % i == 0)
            {
                is_prime = false;
                continue;
            }
        }
 
        // Output prime numbers:
//        if (is_prime)
//        {
//            cout << num << endl;
//        }
    }
 
    void operator() (const blocked_range<size_t> &r) const
    {
        for (size_t i = r.begin(); i < r.end(); i+=2)
        {
            test_prime(i);
        }
    }
 
    void run(blocked_range<size_t> r)
    {
        prime_tbb prime_tbb;
 
        parallel_for(r, prime_tbb);
    }
 
};
 
int main()
{
    task_scheduler_init init;
    static tick_count t_start, t_end;
 
    prime_single_thread p1;
    prime_tbb p2;
 
    cout.setf(ios::fixed);
    cout.setf(ios::showpoint);
    cout.precision(2);
 
    //starting from 3, with a granularity of 100.
    blocked_range<size_t> r(3, MAX_SIZE, 100);
 
    t_start = tick_count::now();
    p1.run(r);
    t_end = tick_count::now();
    cout << (t_end t_start).seconds() * 1000 << " ms" << endl;
 
    t_start = tick_count::now();
    p2.run(r);
    t_end = tick_count::now();
    cout << (t_end t_start).seconds() * 1000 << " ms" << endl;
 
    return 0;
}
 

For a very large interval (e.g. 3~1,000,000), the TBB version of the prime program achieved a 4x speed up given a reasonably large grain size (e.g. 100). Smaller grain size resulted in slightly more overhead. On a quad-core machine (Q9450 @ 3.2GHz), it took 217.83 ms for the single threaded routine to find all the prime numbers within 1,000,000, whereas it only took 58.32 ms for the TBB version, which runs roughly four times as fast. The TBB framework took care of dividing the task according to the number of processors automatically.

Be Sociable, Share!