Matrix Multiplication Performance in C++

A few days ago, I ran across this article by Dmitri Nesteruk. In his article, he compared the performance between C# and C++ in matrix multiplication. From the data he provided, matrix multiplication using C# is two to three times slower than using C++ in comparable situations.

Even though a lot of optimizations have been done in the .Net runtime to make it more efficient, it is apparent that scientific programming still favors C and C++ because that the performance advantage is huge.

In this article, I will examine some matrix multiplication algorithms that are commonly used and illustrate the efficiencies of the various methods. All the tests are done using C++ only and matrices size ranging from 500×500 to 2000×2000. When the matrix sizes are small (e.g. <50), you can pretty much use any matrix multiplication algorithms without observing any significant performance differences. This is largely due to the fact that the typical stable matrix multiplication algorithms are O(n^3) and sometimes array operation overheads outweigh the benefit of algorithm efficiencies. But for matrices of larger dimensions, the efficiency of the multiplication algorithm becomes extremely important.

Since Dmitri’s article has already captured pretty detailed data using the standard matrix multiplication algorithm, I will not repeat his findings in this article. What I intended to show was the performance data of uBLAS, OpenMP, cBLAS and MATLAB.

The following sample code are compiled under Ubuntu 8.10 64 bit (kernel 2.6.24.23) on Intel Q9450@3.2GHz.

Standard Matrix Multiplication (Single Threaded)

This is our reference code. Later on, I will only show the critical portion of the code and not repeat the common portion of code that initializes/finalizes the arrays. Similarly, the timing method used is also the same across all the tests and will be omitted later on.

float **A, **B, **C;

A = new float*[matrix_size];
B = new float*[matrix_size];
C = new float*[matrix_size];

for (int i = 0 ; i < matrix_size; i++)
{
A[i] = new float[matrix_size];
B[i] = new float[matrix_size];
C[i] = new float[matrix_size];
}

for (int i=0; i<matrix_size; i++)
{
for (int j = 0 ; j < matrix_size; j++)
{
A[i][j]=rand();
B[i][j]=rand();
}
}

timeval t1, t2, t;
gettimeofday(&t1, NULL);

for (int i = 0 ; i < matrix_size; i++)
{
for (int j = 0;  j < matrix_size; j++)
{
C[i][j] = 0;
for (int k = 0; k < matrix_size; k++)
{
C[i][j] += A[i][k] * B[k][j];
}
}
}

gettimeofday(&t2, NULL);
timersub(&t2, &t1, &t);

cout << t.tv_sec + t.tv_usec/1000000.0 << " Seconds -- Standard" << endl;

for (int i = 0 ; i < matrix_size; i++)
{
delete A[i];
delete B[i];
delete C[i];
}

delete A;
delete B;
delete C;

OpenMP With Two Dimensional Arrays

Using OpenMP, we are able to multiple threads via the #pragma omp directives. For the simple algorithm we used here, the speed increase is almost proportional to the number of available cores within the system.

...
#pragma omp parallel for shared(a,b,c)
for (long i=0; i<matrix_size; i++)
{
for (long j = 0; j < matrix_size; j++)
{
float sum = 0;
for (long k = 0; k < matrix_size; k++)
{
sum +=a[i][k]*b[k][j];
}
c[i][j] = sum;
}
}
...

OpenMP With One Dimensional Arrays

Cache locality is poor using the simple algorithm I showed above. The performance can be easily improved however by improving the locality of the references. One way to achieve better cache locality is to use one dimensional array instead of two dimensional array and as you will see later, the performance of the following implementation has as much as 50% speed gains over the previous OpenMP implementation using two dimensional arrays.

float *a, *b, *c;

a = new float[matrix_size * matrix_size];
b = new float[matrix_size * matrix_size];
c = new float[matrix_size * matrix_size];

for (long i=0; i<matrix_size * matrix_size; i++)
{
a[i]=rand();
b[i] = rand();
c[i] = 0;
}

#pragma omp parallel for shared(a,b,c)
for (long i=0; i<matrix_size; i++)
{
for (long j = 0; j < matrix_size; j++)
{
long idx = i * matrix_size;
float sum = 0;
for (long k = 0; k < matrix_size; k++)
{
sum +=a[idx + k]*b[k * matrix_size +j];
}
c[idx + j] = sum;
}
}

delete a;
delete b;
delete c;

Boost Library uBLAS (Single Threaded)

Boost library provides a convenient way to perform matrix multiplication. However, the performance is very poor compared to all other approaches mentioned in this article. The performance of the uBLAS implementation is largely on par with that using C# (see benchmarks towards the end of the article). Intel’s Math Kernal Library (MKL) 10.1 does provide functionality to dynamically convert code using uBLAS syntax into highly efficient code using MKL by the inclusion of header file mkl_boost_ublas_matrix_prod.hpp. I have not tried it myself though, but the performance should be comparible to algorithms using the native MKL BLAS interface.

By default (without using MKL’s uBLAS capability) though, uBLAS is single threaded and due to its poor performance and I would strongly suggest avoid using uBLAS in any high performance scientific applications.

matrix<float> A, B, C;

A.resize(matrix_size,matrix_size);
B.resize(matrix_size,matrix_size);

for (int i = 0; i < matrix_size; ++ i)
{
for (int j = 0; j < matrix_size; ++ j)
{
A(i, j) = rand();
B(i, j) = rand();
}
}

C =prod(A, B);

Intel Math Kernel Library (MKL) cBLAS

Intel’s Math Kernel Library (MKL) is highly optimized on Intel’s microprocessor platforms. Given that Intel developed this library for its own processor platforms we can expect significant performance gains. I am still surprised at how fast the code runs using cBLAS though. In fact, it was so fast that I doubted the validity of the result at first. But after checking the results against those obtained by other means, those doubts were putting into rest.

The cBLAS matrix multiplication uses blocked matrix multiplication method which further improves cache locality. And it is more than thirty times faster then the fastest OMP 1D algorithm listed above! Another benefit is that by default it automatically detects the number of CPUs/cores available and uses all available threads. This behavior greatly simplifies the code since threading is handled transparently within the library.

float *A, *B, *C;

A = new float[matrix_size * matrix_size];
B = new float[matrix_size * matrix_size];
C = new float[matrix_size * matrix_size];

for (int i = 0; i < matrix_size * matrix_size; i++)
{
A[i] = rand();
B[i] = rand();
C[i] = 0;
}

cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
matrix_size,  matrix_size,  matrix_size, 1.0, A,matrix_size,
B, matrix_size, 0.0, C, matrix_size);

MATLAB is known for its efficient algorithms. In fact it uses BLAS libraries for its own matrix calculation routines. The version of MATLAB I have is a little dated (7.0.1), but nevertheless it would be interesting to see how its performance compares with that of latest MKL’s. MATLAB 7 is single threaded, and given the same matrix size, it runs roughly three times slower than the fastest MKL routine listed above (per core).

a = rand(i,i);
b = rand(i,i);
tic;
c = a*b;
t = toc

The following table shows the results I obtained by running the code listed above. The results are time in seconds. (note, S.TH means single threaded and M.TH means multi-threaded).

 Size/Algorithm uBLAS S.TH STD S.TH OMP 2D OMP 1D MATLAB S.TH cBLAS M.TH 500×500 3.2435 0.5253 0.1939 0.0536 0.0810 0.0206 600×600 5.7854 0.9349 0.3223 0.1655 0.1410 0.0093 700×700 9.2292 1.2928 0.3529 0.2797 0.2230 0.0122 800×800 13.7711 2.3746 0.7259 0.4135 0.3320 0.0310 900×900 20.3245 3.4983 1.0146 0.7449 0.4740 0.0306 1000×1000 28.8345 3.4983 1.4748 1.0548 0.6530 0.0700 1100×1100 38.2545 7.0240 1.9383 1.6257 0.8620 0.1250 1200×1200 50.4964 9.9319 2.8411 2.1215 1.1170 0.0440 1300×1300 64.5064 12.8344 3.6277 2.9720 1.4250 0.0440 1400×1400 81.1826 17.1119 4.8309 3.5977 1.7760 0.0938 1500×1500 100.1330 21.0622 6.1689 4.8022 2.1870 0.1111 1600×1600 120.3400 26.4316 7.3189 5.0451 2.6490 0.1699 1700×1700 145.8550 31.2706 8.7525 6.8915 3.1870 0.1452 1800×1800 174.6860 38.9293 11.1060 8.1316 3.7940 0.1989 1900×1900 206.0520 45.8589 13.0832 9.9527 4.4450 0.2725 2000×2000 240.7820 55.4392 16.0542 11.0314 5.1820 0.3359

The following figure shows the results. Since uBLAS and single threaded matrix multiplications took significantly longer to compute, I did not include them in the figure below.

The following figure shows the same data but uses log-scale Y axis instead so that all the data can show up nicely. You can get a sense of various algorithms’ efficiencies here:

1. Oscar Fraxedas says:

Nice work. Thanks for sharing this.
I was looking for this kind of comparison for a co-worker. I did recommend him to use MKL, but I did not know that the difference was this big.
Have you tried the Matrix Multiplication using intel intrinsics functions (SSE,SSE2,…)?

• kwong says:

To my knowledge, the Intel MKL uses SSE instructions (1,2,3 and 4) extensively, so I wouldn’t be surprised that the matrix multiplication routines utilize these instructions as well.

2. oliver says:

Hi,

I am trying to reproduce the same results on my own Intel machine. I get a significant speedup with the Intel MKL, but not nearly as good as yours. Can you please tell me which compiler you used (gcc? icc?) and which compiler flags you used?

Thanks.

oliver

• kwong says:

I am using the latest MKL 10.1 and gcc 4.3.2 (64 bit) with -O3 optimization and optimization for Intel Core 2 processor families enabled. The data was collected on a quad core PC (Q9450@3.2GHz) with 8GB memory installed. I wouldn’t be surprised that Intel relied on SSE2 SSE3 heavily and that would explain huge performance discrepancies among different processors.

• oliver says:

Thank you. I fixed my problem (my compilation flags were not quite right). Thanks for the great blog entry.

3. Alun Williams says:

Thanks for a very high quality post. I currently do most of my work on an oldish PC with 2*1.4Ghz P4 Xeons running Windows 2000. I have downloaded an evaluation copy of the latest MKL, but also have a proper licence for an old version of it (from about 1999). I tried to evaluate performance using both the old and new MKL and my own multiplication both naive and as good as I could make it, and for both float and double matrices. I don’t have a C++ compiler with #pragma omp (indeed I was not familiar with that at all before seeing your article)
In every case the latest MKL is about 10* faster than my best C++ code. With the older MKL it is only 2* faster, so I could get close to that my multi-threading.
My best C++ code takes about 22 seconds for a 2000*2000 multiply using floats, and 24 seconds using doubles. Since this is on a 1.4GHz machine and I’m not using threads my code might be of interest to anyone who cannot afford MKL and who doesn’t have a multi-core machine or a compiler that supports OMP.
The main thing wrong with “naive” multiplication, as implemented by your standard matrix multiplication code is that it utilises the CPU cache poorly, and this effect becomes very pronounced for large matrices.
Simply interchanging the j and k loops (which means initialisation has to be done separately) will give a *3 performance improvement. I managed to get a further *2 improvement by using a “blocked” algorithm and unrolling the inner loop.
Final code looks like this:

template class Large_Matrix
{
private:
unsigned nr_rows;
unsigned nr_columns;
T * data;
public:
Large_Matrix(unsigned nr_rows_,unsigned nr_columns_) :
nr_rows(nr_rows_),
nr_columns(nr_columns_),
data(new T[nr_rows_*nr_columns_])
{}
~Large_Matrix()
{
delete [] data;
}
T * operator[](unsigned row)
{
return &data[row*nr_columns];
}
const T * operator[](unsigned row) const
{
return &data[row*nr_columns];
}
static bool multiply(Large_Matrix * answer,
const Large_Matrix & lhs,
const Large_Matrix & rhs)
{
if (lhs.nr_columns != rhs.nr_rows)
return false;
if (answer == &lhs)
{
Large_Matrix temp(lhs);
}
if (answer == &rhs)
{
Large_Matrix temp(rhs);
}
unsigned nr_elements = lhs.nr_rows * rhs.nr_columns;
{
answer->data = new T[lhs.nr_rows*rhs.nr_columns];
}

{
for (unsigned i = 0; i data[i] = 0;
}

unsigned i0,j0,k0,i,j,k,imax,jmax,kmax,jmax1;
const unsigned IBLOCK=100;
const unsigned JBLOCK=128;
const unsigned KBLOCK=40;
for (i0 = 0; i0 lhs.nr_rows)
imax = lhs.nr_rows;
for (k0 = 0; k0 lhs.nr_columns)
kmax = lhs.nr_columns;
for (j0 = 0; j0 rhs.nr_columns)
jmax = rhs.nr_columns;
jmax1 = jmax & ~15;
for (i = i0; i <imax;i++,lrow += lhs.nr_columns,arow += rhs.nr_columns)
{
const T * rrow = rhs[k0];
for (k = k0; k < kmax;k++,rrow += rhs.nr_columns)
{
T v = lrow[k];
for (j = j0; j < jmax1;j+=16)
{
arow[j] += v*rrow[j];
arow[j+1] += v*rrow[j+1];
arow[j+2] += v*rrow[j+2];
arow[j+3] += v*rrow[j+3];
arow[j+4] += v*rrow[j+4];
arow[j+5] += v*rrow[j+5];
arow[j+6] += v*rrow[j+6];
arow[j+7] += v*rrow[j+7];
arow[j+8] += v*rrow[j+8];
arow[j+9] += v*rrow[j+9];
arow[j+10] += v*rrow[j+10];
arow[j+11] += v*rrow[j+11];
arow[j+12] += v*rrow[j+12];
arow[j+13] += v*rrow[j+13];
arow[j+14] += v*rrow[j+14];
arow[j+15] += v*rrow[j+15];
}
for (j = jmax1; j < jmax;j++)
arow[j] += v*rrow[j];
}
}
}
}
}
return true;
}
};

The right values for IBLOCK,JBLOCK,KBLOCK will vary from machine to machine, though there doesn’t seem to be a sharp minimum, and probably depend on the size of the matrices as well. JBLOCK should be a multiple of 16. If you know in advance the size of the matrices you will be using making the values a factor of this size is probably a good idea.

4. Alun Williams says:

Unfortunately my code has got mangled by the comment posting process.

i0,j0,k0 should be iterated up to the relevant number of rows and columns as can still be seen in the post, and then either incremented by the relevant XBLOCK variable or equivalently set to the relevant Xmax variable. If anyone is interested I’ll be happy to send the code in an email.

5. wesmo says:

Thanks for the comparision; a useful insight. In your comparison, a very old version of matlab selected was.

Have updated the MATLAB BLAS and LAPACK libraries used
Use more recent AMD and Intel MKL
Allow the user to specify an external Intel MKL
Have introduced multithreaded implementations as default for some mathematical functions

6. nasos says:

Did you try to compile the ublas code with -DNDEBUG? Because I get those figures only if I don’t use -DNDEBUG.

7. dq87jg says:

You may also want to have a look at the Armadillo C++ library:
http://arma.sourceforge.net

It has its own matrix multiply, but optionally links with ATLAS which uses optimised and hardware specific routines.

8. Steve says:

The Intel MKL library is very fast, but a warning to all of the .Net developers out there.

Its a real pig to work with. Intel do provide a sample set of example library calls, but they don’t provide the wrapper. You have to compile it yourself.

I am currently using MKL and wish I’d found a proper .Net compliant version or atleast a provider who does recognise that .Net does exist.

I’ve wasted so much time trying to get the wrapper compiled for 32 bit then 64 bit. It is not easy.

Intel fail to tell you that you must register MKLROOT as a global variable plus the Bin eg Intel\MKL\10…\em64t\bin should be in your path variable along with Intel\MKL\10…

Fast but painful.

9. Amanjit Gill says:

I think the eigen project is even faster
http://eigen.tuxfamily.org/index.php?title=Main_Page

10. Alexander Rau (@AR0x7E7) says:

Thank you very much for your article! It helps me a lot.

11. […] So this seems an awfully complicated way to optimize interpolation. But we save on many levels. First, we save the overhead of calling the function that solves for each section and the overhead of calling the function that evaluate the polynomial. Second, a good matrix/matrix multiply will be able to vectorize (i.e., use SSE+ instructions) and parallelize the matrix/matrix product in a way that is impossible if we perform separate calls for each matrix/vector product. In fact, the speed-ups can be quite impressive. […]

12. badri says:

this is nice!
the analysis that is done on quad core PC (Q9450@3.2GHz) with 8GB memory is interesting.

Thanks,

13. Sergey Kostrov says:

Here are test results for multiplication of two 2048×2048 matrices. A single-threaded Strassen Heap Based Complete ( Strassen HBC ) algorithm is used in a 32-bit test application:

*** BCC – Matrix Size 2048×2048 / Single precision ( float ) data type ***

Strassen HBC
Matrix Size : 2048 x 2048
Matrix Size Threshold: 128 x 128
Matrix Partitions : 2801
ResultSets Reflection: Enabled
Calculating…
Strassen HBC – Pass 1 – Completed: 11.95000 secs
Strassen HBC – Pass 2 – Completed: 11.10700 secs
Strassen HBC – Pass 3 – Completed: 11.10700 secs
Strassen HBC – Pass 4 – Completed: 11.10700 secs
Strassen HBC – Pass 5 – Completed: 11.09200 secs
ALGORITHM_STRASSEN_HBC – Passed

Compiler: Borland C++ v5.5
Library: ScaLib v1.13.02 ( ScaLib stands for Set of Common Algorithms )
OS: Windows 7 Professional
Hardware: Dell Precision M4700 ( CPU Core i7-3840QM / 16GB )

Note to a Moderator: Please delete my previous post. Thanks.

14. ahmed says:

i want the source code for matrix-matrix multiplication
using cBLAS M.TH can you help me