Matrix Multiplication Performance in C++

A few days ago, I ran across this article by Dmitri Nesteruk. In his article, he compared the performance between C# and C++ in matrix multiplication. From the data he provided, matrix multiplication using C# is two to three times slower than using C++ in comparable situations.

Even though a lot of optimizations have been done in the .Net runtime to make it more efficient, it is apparent that scientific programming still favors C and C++ because that the performance advantage is huge.

In this article, I will examine some matrix multiplication algorithms that are commonly used and illustrate the efficiencies of the various methods. All the tests are done using C++ only and matrices size ranging from 500×500 to 2000×2000. When the matrix sizes are small (e.g. <50), you can pretty much use any matrix multiplication algorithms without observing any significant performance differences. This is largely due to the fact that the typical stable matrix multiplication algorithms are O(n^3) and sometimes array operation overheads outweigh the benefit of algorithm efficiencies. But for matrices of larger dimensions, the efficiency of the multiplication algorithm becomes extremely important.

Since Dmitri’s article has already captured pretty detailed data using the standard matrix multiplication algorithm, I will not repeat his findings in this article. What I intended to show was the performance data of uBLAS, OpenMP, cBLAS and MATLAB.

The following sample code are compiled under Ubuntu 8.10 64 bit (kernel 2.6.24.23) on Intel Q9450@3.2GHz.

Standard Matrix Multiplication (Single Threaded)

This is our reference code. Later on, I will only show the critical portion of the code and not repeat the common portion of code that initializes/finalizes the arrays. Similarly, the timing method used is also the same across all the tests and will be omitted later on.

float **A, **B, **C;

A = new float*[matrix_size];
B = new float*[matrix_size];
C = new float*[matrix_size];

for (int i = 0 ; i < matrix_size; i++)
{
    A[i] = new float[matrix_size];
    B[i] = new float[matrix_size];
    C[i] = new float[matrix_size];
}

for (int i=0; i<matrix_size; i++)
{
    for (int j = 0 ; j < matrix_size; j++)
    {
        A[i][j]=rand();
        B[i][j]=rand();
    }
}

timeval t1, t2, t;
gettimeofday(&t1, NULL);

for (int i = 0 ; i < matrix_size; i++)
{
    for (int j = 0;  j < matrix_size; j++)
    {
        C[i][j] = 0;
        for (int k = 0; k < matrix_size; k++)
        {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

gettimeofday(&t2, NULL);
timersub(&t2, &t1, &t);

cout << t.tv_sec + t.tv_usec/1000000.0 << " Seconds -- Standard" << endl;

for (int i = 0 ; i < matrix_size; i++)
{
    delete A[i];
    delete B[i];
    delete C[i];
}

delete A;
delete B;
delete C;

OpenMP With Two Dimensional Arrays

Using OpenMP, we are able to multiple threads via the #pragma omp directives. For the simple algorithm we used here, the speed increase is almost proportional to the number of available cores within the system.

...
#pragma omp parallel for shared(a,b,c)
for (long i=0; i<matrix_size; i++)
{
    for (long j = 0; j < matrix_size; j++)
    {
        float sum = 0;
        for (long k = 0; k < matrix_size; k++)
        {
            sum +=a[i][k]*b[k][j];
        }
        c[i][j] = sum;
    }
}
...

OpenMP With One Dimensional Arrays

Cache locality is poor using the simple algorithm I showed above. The performance can be easily improved however by improving the locality of the references. One way to achieve better cache locality is to use one dimensional array instead of two dimensional array and as you will see later, the performance of the following implementation has as much as 50% speed gains over the previous OpenMP implementation using two dimensional arrays.

float *a, *b, *c;

a = new float[matrix_size * matrix_size];
b = new float[matrix_size * matrix_size];
c = new float[matrix_size * matrix_size];

for (long i=0; i<matrix_size * matrix_size; i++)
{
    a[i]=rand();
    b[i] = rand();
    c[i] = 0;
}

#pragma omp parallel for shared(a,b,c)
for (long i=0; i<matrix_size; i++)
{
    for (long j = 0; j < matrix_size; j++)
    {
        long idx = i * matrix_size;
        float sum = 0;
        for (long k = 0; k < matrix_size; k++)
        {
            sum +=a[idx + k]*b[k * matrix_size +j];
        }
        c[idx + j] = sum;
    }
}

delete a;
delete b;
delete c;

Boost Library uBLAS (Single Threaded)

Boost library provides a convenient way to perform matrix multiplication. However, the performance is very poor compared to all other approaches mentioned in this article. The performance of the uBLAS implementation is largely on par with that using C# (see benchmarks towards the end of the article). Intel’s Math Kernal Library (MKL) 10.1 does provide functionality to dynamically convert code using uBLAS syntax into highly efficient code using MKL by the inclusion of header file mkl_boost_ublas_matrix_prod.hpp. I have not tried it myself though, but the performance should be comparible to algorithms using the native MKL BLAS interface.

By default (without using MKL’s uBLAS capability) though, uBLAS is single threaded and due to its poor performance and I would strongly suggest avoid using uBLAS in any high performance scientific applications.

matrix<float> A, B, C;

A.resize(matrix_size,matrix_size);
B.resize(matrix_size,matrix_size);

for (int i = 0; i < matrix_size; ++ i)
{
    for (int j = 0; j < matrix_size; ++ j)
    {
        A(i, j) = rand();
        B(i, j) = rand();
    }
}

C =prod(A, B);

Intel Math Kernel Library (MKL) cBLAS

Intel’s Math Kernel Library (MKL) is highly optimized on Intel’s microprocessor platforms. Given that Intel developed this library for its own processor platforms we can expect significant performance gains. I am still surprised at how fast the code runs using cBLAS though. In fact, it was so fast that I doubted the validity of the result at first. But after checking the results against those obtained by other means, those doubts were putting into rest.

The cBLAS matrix multiplication uses blocked matrix multiplication method which further improves cache locality. And it is more than thirty times faster then the fastest OMP 1D algorithm listed above! Another benefit is that by default it automatically detects the number of CPUs/cores available and uses all available threads. This behavior greatly simplifies the code since threading is handled transparently within the library.

float *A, *B, *C;

A = new float[matrix_size * matrix_size];
B = new float[matrix_size * matrix_size];
C = new float[matrix_size * matrix_size];

for (int i = 0; i < matrix_size * matrix_size; i++)
{
    A[i] = rand();
    B[i] = rand();
    C[i] = 0;
}

cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 
    matrix_size,  matrix_size,  matrix_size, 1.0, A,matrix_size, 
    B, matrix_size, 0.0, C, matrix_size);

MATLAB (Single Threaded)

MATLAB is known for its efficient algorithms. In fact it uses BLAS libraries for its own matrix calculation routines. The version of MATLAB I have is a little dated (7.0.1), but nevertheless it would be interesting to see how its performance compares with that of latest MKL’s. MATLAB 7 is single threaded, and given the same matrix size, it runs roughly three times slower than the fastest MKL routine listed above (per core).

    a = rand(i,i);
    b = rand(i,i);
    tic;
    c = a*b;
    t = toc

The following table shows the results I obtained by running the code listed above. The results are time in seconds. (note, S.TH means single threaded and M.TH means multi-threaded).

Size/Algorithm	uBLAS S.TH	STD S.TH	OMP 2D	OMP 1D	MATLAB S.TH	cBLAS M.TH
500×500	3.2435	0.5253	0.1939	0.0536	0.0810	0.0206
600×600	5.7854	0.9349	0.3223	0.1655	0.1410	0.0093
700×700	9.2292	1.2928	0.3529	0.2797	0.2230	0.0122
800×800	13.7711	2.3746	0.7259	0.4135	0.3320	0.0310
900×900	20.3245	3.4983	1.0146	0.7449	0.4740	0.0306
1000×1000	28.8345	3.4983	1.4748	1.0548	0.6530	0.0700
1100×1100	38.2545	7.0240	1.9383	1.6257	0.8620	0.1250
1200×1200	50.4964	9.9319	2.8411	2.1215	1.1170	0.0440
1300×1300	64.5064	12.8344	3.6277	2.9720	1.4250	0.0440
1400×1400	81.1826	17.1119	4.8309	3.5977	1.7760	0.0938
1500×1500	100.1330	21.0622	6.1689	4.8022	2.1870	0.1111
1600×1600	120.3400	26.4316	7.3189	5.0451	2.6490	0.1699
1700×1700	145.8550	31.2706	8.7525	6.8915	3.1870	0.1452
1800×1800	174.6860	38.9293	11.1060	8.1316	3.7940	0.1989
1900×1900	206.0520	45.8589	13.0832	9.9527	4.4450	0.2725
2000×2000	240.7820	55.4392	16.0542	11.0314	5.1820	0.3359

The following figure shows the results. Since uBLAS and single threaded matrix multiplications took significantly longer to compute, I did not include them in the figure below.

The following figure shows the same data but uses log-scale Y axis instead so that all the data can show up nicely. You can get a sense of various algorithms’ efficiencies here:

Be Sociable, Share!