<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Matrix Multiplication Performance in C++</title>
	<atom:link href="http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/</link>
	<description></description>
	<lastBuildDate>Mon, 20 May 2013 23:27:04 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
	<item>
		<title>By: ahmed</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-203598</link>
		<dc:creator>ahmed</dc:creator>
		<pubDate>Tue, 14 May 2013 15:54:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-203598</guid>
		<description><![CDATA[I want the source code for matrix-matrix multiplication
using cBLAS M.TH. Can you help me?]]></description>
		<content:encoded><![CDATA[<p>I want the source code for matrix-matrix multiplication<br />
using cBLAS M.TH. Can you help me?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sergey Kostrov</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-199042</link>
		<dc:creator>Sergey Kostrov</dc:creator>
		<pubDate>Sun, 03 Feb 2013 06:43:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-199042</guid>
		<description><![CDATA[Here are test results for multiplication of two 2048x2048 matrices. A single-threaded Strassen Heap Based Complete ( Strassen HBC ) algorithm is used in a 32-bit test application:

*** BCC - Matrix Size 2048x2048 / Single precision ( float ) data type ***

Strassen HBC
Matrix Size          : 2048 x 2048
Matrix Size Threshold: 128 x 128
Matrix Partitions    : 2801
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass  1 - Completed: 11.95000 secs
Strassen HBC - Pass  2 - Completed: 11.10700 secs
Strassen HBC - Pass  3 - Completed: 11.10700 secs
Strassen HBC - Pass  4 - Completed: 11.10700 secs
Strassen HBC - Pass  5 - Completed: 11.09200 secs
ALGORITHM_STRASSEN_HBC - Passed

Compiler: Borland C++ v5.5
Library: ScaLib v1.13.02 ( ScaLib stands for Set of Common Algorithms )
OS: Windows 7 Professional
Hardware: Dell Precision M4700 ( CPU Core i7-3840QM / 16GB )

Note to a Moderator: Please delete my previous post. Thanks.]]></description>
		<content:encoded><![CDATA[<p>Here are test results for multiplication of two 2048&#215;2048 matrices. A single-threaded Strassen Heap Based Complete ( Strassen HBC ) algorithm is used in a 32-bit test application:</p>
<p>*** BCC &#8211; Matrix Size 2048&#215;2048 / Single precision ( float ) data type ***</p>
<p>Strassen HBC<br />
Matrix Size          : 2048 x 2048<br />
Matrix Size Threshold: 128 x 128<br />
Matrix Partitions    : 2801<br />
ResultSets Reflection: Enabled<br />
Calculating&#8230;<br />
Strassen HBC &#8211; Pass  1 &#8211; Completed: 11.95000 secs<br />
Strassen HBC &#8211; Pass  2 &#8211; Completed: 11.10700 secs<br />
Strassen HBC &#8211; Pass  3 &#8211; Completed: 11.10700 secs<br />
Strassen HBC &#8211; Pass  4 &#8211; Completed: 11.10700 secs<br />
Strassen HBC &#8211; Pass  5 &#8211; Completed: 11.09200 secs<br />
ALGORITHM_STRASSEN_HBC &#8211; Passed</p>
<p>Compiler: Borland C++ v5.5<br />
Library: ScaLib v1.13.02 ( ScaLib stands for Set of Common Algorithms )<br />
OS: Windows 7 Professional<br />
Hardware: Dell Precision M4700 ( CPU Core i7-3840QM / 16GB )</p>
<p>Note to a Moderator: Please delete my previous post. Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: badri</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-194002</link>
		<dc:creator>badri</dc:creator>
		<pubDate>Fri, 06 Jul 2012 15:04:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-194002</guid>
		<description><![CDATA[This is nice!
The analysis done on a quad-core PC (Q9450@3.2GHz) with 8GB memory is interesting.

Thanks,
Badri]]></description>
		<content:encoded><![CDATA[<p>This is nice!<br />
The analysis done on a quad-core PC (Q9450@3.2GHz) with 8GB memory is interesting.</p>
<p>Thanks,<br />
Badri</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fast Interpolation (Interpolation, part V) &#171; Harder, Better, Faster, Stronger</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-193982</link>
		<dc:creator>Fast Interpolation (Interpolation, part V) &#171; Harder, Better, Faster, Stronger</dc:creator>
		<pubDate>Tue, 03 Jul 2012 14:29:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-193982</guid>
		<description><![CDATA[[...] So this seems an awfully complicated way to optimize interpolation. But we save on many levels. First, we save the overhead of calling the function that solves for each section and the overhead of calling the function that evaluates the polynomial. Second, a good matrix/matrix multiply will be able to vectorize (i.e., use SSE+ instructions) and parallelize the matrix/matrix product in a way that is impossible if we perform separate calls for each matrix/vector product. In fact, the speed-ups can be quite impressive. [...]]]></description>
		<content:encoded><![CDATA[<p>[...] So this seems an awfully complicated way to optimize interpolation. But we save on many levels. First, we save the overhead of calling the function that solves for each section and the overhead of calling the function that evaluates the polynomial. Second, a good matrix/matrix multiply will be able to vectorize (i.e., use SSE+ instructions) and parallelize the matrix/matrix product in a way that is impossible if we perform separate calls for each matrix/vector product. In fact, the speed-ups can be quite impressive. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Alexander Rau (@AR0x7E7)</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-193145</link>
		<dc:creator>Alexander Rau (@AR0x7E7)</dc:creator>
		<pubDate>Sun, 29 Apr 2012 08:50:34 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-193145</guid>
		<description><![CDATA[Thank you very much for your article! It helps me a lot.]]></description>
		<content:encoded><![CDATA[<p>Thank you very much for your article! It helps me a lot.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Amanjit Gill</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-80400</link>
		<dc:creator>Amanjit Gill</dc:creator>
		<pubDate>Mon, 29 Nov 2010 08:31:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-80400</guid>
		<description><![CDATA[I think the eigen project is even faster
http://eigen.tuxfamily.org/index.php?title=Main_Page]]></description>
		<content:encoded><![CDATA[<p>I think the eigen project is even faster<br />
<a href="http://eigen.tuxfamily.org/index.php?title=Main_Page" rel="nofollow">http://eigen.tuxfamily.org/index.php?title=Main_Page</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-45361</link>
		<dc:creator>Steve</dc:creator>
		<pubDate>Tue, 02 Feb 2010 16:31:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-45361</guid>
		<description><![CDATA[The Intel MKL library is very fast, but a warning to all of the .Net developers out there.

It&#039;s a real pig to work with.  Intel do provide a sample set of example library calls, but they don&#039;t provide the wrapper.  You have to compile it yourself.

I am currently using MKL and wish I&#039;d found a proper .Net compliant version, or at least a provider who recognises that .Net exists.

I&#039;ve wasted so much time trying to get the wrapper compiled for 32 bit and then 64 bit.  It is not easy.

Intel fail to tell you that you must register MKLROOT as a global variable, and that the bin directory, e.g. Intel\MKL\10...\em64t\bin, should be in your path variable along with Intel\MKL\10...

Fast but painful.]]></description>
		<content:encoded><![CDATA[<p>The Intel MKL library is very fast, but a warning to all of the .Net developers out there.</p>
<p>It&#8217;s a real pig to work with.  Intel do provide a sample set of example library calls, but they don&#8217;t provide the wrapper.  You have to compile it yourself.</p>
<p>I am currently using MKL and wish I&#8217;d found a proper .Net compliant version, or at least a provider who recognises that .Net exists.</p>
<p>I&#8217;ve wasted so much time trying to get the wrapper compiled for 32 bit and then 64 bit.  It is not easy.</p>
<p>Intel fail to tell you that you must register MKLROOT as a global variable, and that the bin directory, e.g. Intel\MKL\10&#8230;\em64t\bin, should be in your path variable along with Intel\MKL\10&#8230;</p>
<p>Fast but painful.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dq87jg</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-39287</link>
		<dc:creator>dq87jg</dc:creator>
		<pubDate>Fri, 09 Oct 2009 05:46:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-39287</guid>
		<description><![CDATA[You may also want to have a look at the Armadillo C++ library: 
http://arma.sourceforge.net

It has its own matrix multiply, but optionally links with ATLAS which uses optimised and hardware specific routines.]]></description>
		<content:encoded><![CDATA[<p>You may also want to have a look at the Armadillo C++ library:<br />
<a href="http://arma.sourceforge.net" rel="nofollow">http://arma.sourceforge.net</a></p>
<p>It has its own matrix multiply, but optionally links with ATLAS which uses optimised and hardware specific routines.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: nasos</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-38903</link>
		<dc:creator>nasos</dc:creator>
		<pubDate>Thu, 01 Oct 2009 22:00:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-38903</guid>
		<description><![CDATA[Did you try to compile the ublas code with -DNDEBUG? Because I get those figures only if I don&#039;t use -DNDEBUG.]]></description>
		<content:encoded><![CDATA[<p>Did you try to compile the ublas code with -DNDEBUG? Because I get those figures only if I don&#8217;t use -DNDEBUG.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: wesmo</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-33644</link>
		<dc:creator>wesmo</dc:creator>
		<pubDate>Mon, 15 Jun 2009 05:39:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-33644</guid>
		<description><![CDATA[Thanks for the comparison; a useful insight. Note that a very old version of MATLAB was selected for it.

Newer Matlab versions:

Have updated the MATLAB BLAS and LAPACK libraries used 
Use more recent AMD and Intel MKL
Allow the user to specify an external Intel MKL
Have introduced multithreaded implementations as default for some mathematical functions]]></description>
		<content:encoded><![CDATA[<p>Thanks for the comparison; a useful insight. Note that a very old version of MATLAB was selected for it.</p>
<p>Newer Matlab versions:</p>
<p>Have updated the MATLAB BLAS and LAPACK libraries used<br />
Use more recent AMD and Intel MKL<br />
Allow the user to specify an external Intel MKL<br />
Have introduced multithreaded implementations as default for some mathematical functions</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Alun Williams</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-31023</link>
		<dc:creator>Alun Williams</dc:creator>
		<pubDate>Sun, 03 May 2009 10:31:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-31023</guid>
		<description><![CDATA[Unfortunately my code has got mangled by the comment posting process.

i0,j0,k0 should be iterated up to the relevant number of rows and columns as can still be seen in the post, and then either incremented by the relevant XBLOCK variable or equivalently set to the relevant Xmax variable. If anyone is interested I&#039;ll be happy to send the code in an email.]]></description>
		<content:encoded><![CDATA[<p>Unfortunately my code has got mangled by the comment posting process.</p>
<p>i0,j0,k0 should be iterated up to the relevant number of rows and columns as can still be seen in the post, and then either incremented by the relevant XBLOCK variable or equivalently set to the relevant Xmax variable. If anyone is interested I&#8217;ll be happy to send the code in an email.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Alun Williams</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-31021</link>
		<dc:creator>Alun Williams</dc:creator>
		<pubDate>Sun, 03 May 2009 09:56:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-31021</guid>
		<description><![CDATA[Thanks for a very high quality post. I currently do most of my work on an oldish PC with 2*1.4GHz P4 Xeons running Windows 2000. I have downloaded an evaluation copy of the latest MKL, but also have a proper licence for an old version of it (from about 1999). I tried to evaluate performance using both the old and new MKL and my own multiplication, both naive and as good as I could make it, and for both float and double matrices. I don&#039;t have a C++ compiler with #pragma omp (indeed I was not familiar with that at all before seeing your article).
In every case the latest MKL is about 10* faster than my best C++ code. With the older MKL it is only 2* faster, so I could get close to that by multi-threading.
My best C++ code takes about 22 seconds for a 2000*2000 multiply using floats, and 24 seconds using doubles. Since this is on a 1.4GHz machine and I&#039;m not using threads, my code might be of interest to anyone who cannot afford MKL and who doesn&#039;t have a multi-core machine or a compiler that supports OMP.
The main thing wrong with &quot;naive&quot; multiplication, as implemented by your standard matrix multiplication code, is that it utilises the CPU cache poorly, and this effect becomes very pronounced for large matrices.
Simply interchanging the j and k loops (which means initialisation has to be done separately) will give a *3 performance improvement. I managed to get a further *2 improvement by using a &quot;blocked&quot; algorithm and unrolling the inner loop.
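To make the loop interchange concrete, here is a minimal sketch of it on its own, without blocking or unrolling. It assumes square n x n matrices stored row-major in a std::vector; the name multiply_ikj and that storage layout are made up for illustration and are not part of the class in this comment.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns C = A * B for n x n row-major matrices.
// The loops run in i, k, j order, so the innermost loop walks one row
// of B and one row of C contiguously.
template <class T>
std::vector<T> multiply_ikj(const std::vector<T>& a,
                            const std::vector<T>& b,
                            std::size_t n)
{
    // C has to be zeroed up front because each element is now
    // accumulated across several passes of the k loop.
    std::vector<T> c(n * n, T(0));
    for (std::size_t i = 0; i < n; ++i)
    {
        for (std::size_t k = 0; k < n; ++k)
        {
            const T v = a[i * n + k];   // A(i,k) is reused for a whole row of B
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += v * b[k * n + j];
        }
    }
    return c;
}
```

With the k loop outside the j loop, the inner loop reads the right-hand matrix and writes the answer along rows, which are contiguous in memory, instead of striding down a column of the right-hand matrix.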
Final code looks like this:

template &lt;class T&gt; class Large_Matrix
{
  private:
    unsigned nr_rows;
    unsigned nr_columns;
    T * data;
  public:
    Large_Matrix(unsigned nr_rows_,unsigned nr_columns_) :
      nr_rows(nr_rows_),
      nr_columns(nr_columns_),
      data(new T[nr_rows_*nr_columns_])
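    {}
    // Deep copy: multiply() copies an operand when it aliases the answer,
    // so the copy must own its own buffer.
    Large_Matrix(const Large_Matrix &amp; other) :
      nr_rows(other.nr_rows),
      nr_columns(other.nr_columns),
      data(new T[other.nr_rows*other.nr_columns])
    {
      for (unsigned i = 0; i &lt; nr_rows*nr_columns; i++)
        data[i] = other.data[i];
    }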
    ~Large_Matrix()
    {
      delete [] data;
    }
    T * operator[](unsigned row)
    {
      return &amp;data[row*nr_columns];
    }
    const T * operator[](unsigned row) const
    {
      return &amp;data[row*nr_columns];
    }
    static bool multiply(Large_Matrix * answer,
                         const Large_Matrix &amp; lhs,
                         const Large_Matrix &amp; rhs)
    {
      if (lhs.nr_columns != rhs.nr_rows)
        return false;
      if (answer == &amp;lhs)
      {
        Large_Matrix temp(lhs);
        return multiply(answer,temp,rhs);
      }
      if (answer == &amp;rhs)
      {
        Large_Matrix temp(rhs);
        return multiply(answer,lhs,temp);
      }
      unsigned nr_elements = lhs.nr_rows * rhs.nr_columns;
      if (answer-&gt;nr_rows * answer-&gt;nr_columns != nr_elements)
      {
        delete [] answer-&gt;data;
        answer-&gt;data = new T[lhs.nr_rows*rhs.nr_columns];
      }
      answer-&gt;nr_rows = lhs.nr_rows;
      answer-&gt;nr_columns = rhs.nr_columns;

      {
        for (unsigned i = 0; i &lt; nr_elements; i++)
          answer-&gt;data[i] = 0;
      }
 
      unsigned i0,j0,k0,i,j,k,imax,jmax,kmax,jmax1;
      const unsigned IBLOCK=100;
      const unsigned JBLOCK=128;
      const unsigned KBLOCK=40;
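      for (i0 = 0; i0 &lt; lhs.nr_rows; i0 = imax)
      {
        imax = i0 + IBLOCK;
        if (imax &gt; lhs.nr_rows)
          imax = lhs.nr_rows;
        for (k0 = 0; k0 &lt; lhs.nr_columns; k0 = kmax)
        {
          kmax = k0 + KBLOCK;
          if (kmax &gt; lhs.nr_columns)
            kmax = lhs.nr_columns;
          for (j0 = 0; j0 &lt; rhs.nr_columns; j0 = jmax)
          {
            jmax = j0 + JBLOCK;
            if (jmax &gt; rhs.nr_columns)
              jmax = rhs.nr_columns;
            jmax1 = jmax &amp; ~15;
            // Row pointers for the current block (these declarations were
            // lost in the mangled post and are reconstructed here).
            const T * lrow = lhs[i0];
            T * arow = (*answer)[i0];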
            for (i = i0; i &lt;imax;i++,lrow += lhs.nr_columns,arow += rhs.nr_columns)
            {
              const T * rrow = rhs[k0];
              for (k = k0; k &lt; kmax;k++,rrow += rhs.nr_columns)
              {
                T v = lrow[k];
                for (j = j0; j &lt; jmax1;j+=16)
                {
                  arow[j] += v*rrow[j];
                  arow[j+1] += v*rrow[j+1];
                  arow[j+2] += v*rrow[j+2];
                  arow[j+3] += v*rrow[j+3];
                  arow[j+4] += v*rrow[j+4];
                  arow[j+5] += v*rrow[j+5];
                  arow[j+6] += v*rrow[j+6];
                  arow[j+7] += v*rrow[j+7];
                  arow[j+8] += v*rrow[j+8];
                  arow[j+9] += v*rrow[j+9];
                  arow[j+10] += v*rrow[j+10];
                  arow[j+11] += v*rrow[j+11];
                  arow[j+12] += v*rrow[j+12];
                  arow[j+13] += v*rrow[j+13];
                  arow[j+14] += v*rrow[j+14];
                  arow[j+15] += v*rrow[j+15];
                }
                for (j = jmax1; j &lt; jmax;j++)
                  arow[j] += v*rrow[j];
              }
            }
          }
        }
      }
      return true;
    }
};

The right values for IBLOCK, JBLOCK and KBLOCK will vary from machine to machine, though there doesn&#039;t seem to be a sharp minimum, and they probably depend on the size of the matrices as well. JBLOCK should be a multiple of 16. If you know the size of your matrices in advance, making the block values factors of that size is probably a good idea.]]></description>
		<content:encoded><![CDATA[<p>Thanks for a very high quality post. I currently do most of my work on an oldish PC with 2*1.4GHz P4 Xeons running Windows 2000. I have downloaded an evaluation copy of the latest MKL, but also have a proper licence for an old version of it (from about 1999). I tried to evaluate performance using both the old and new MKL and my own multiplication, both naive and as good as I could make it, and for both float and double matrices. I don&#8217;t have a C++ compiler with #pragma omp (indeed I was not familiar with that at all before seeing your article).<br />
In every case the latest MKL is about 10* faster than my best C++ code. With the older MKL it is only 2* faster, so I could get close to that by multi-threading.<br />
My best C++ code takes about 22 seconds for a 2000*2000 multiply using floats, and 24 seconds using doubles. Since this is on a 1.4GHz machine and I&#8217;m not using threads, my code might be of interest to anyone who cannot afford MKL and who doesn&#8217;t have a multi-core machine or a compiler that supports OMP.<br />
The main thing wrong with &#8220;naive&#8221; multiplication, as implemented by your standard matrix multiplication code, is that it utilises the CPU cache poorly, and this effect becomes very pronounced for large matrices.<br />
Simply interchanging the j and k loops (which means initialisation has to be done separately) will give a *3 performance improvement. I managed to get a further *2 improvement by using a &#8220;blocked&#8221; algorithm and unrolling the inner loop.<br />
Final code looks like this:</p>
<p>template &lt;class T&gt; class Large_Matrix<br />
{<br />
  private:<br />
    unsigned nr_rows;<br />
    unsigned nr_columns;<br />
    T * data;<br />
  public:<br />
    Large_Matrix(unsigned nr_rows_,unsigned nr_columns_) :<br />
      nr_rows(nr_rows_),<br />
      nr_columns(nr_columns_),<br />
      data(new T[nr_rows_*nr_columns_])<br />
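    {}<br />
    // Deep copy: multiply() copies an operand when it aliases the answer,<br />
    // so the copy must own its own buffer.<br />
    Large_Matrix(const Large_Matrix &amp; other) :<br />
      nr_rows(other.nr_rows),<br />
      nr_columns(other.nr_columns),<br />
      data(new T[other.nr_rows*other.nr_columns])<br />
    {<br />
      for (unsigned i = 0; i &lt; nr_rows*nr_columns; i++)<br />
        data[i] = other.data[i];<br />
    }<br />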
    ~Large_Matrix()<br />
    {<br />
      delete [] data;<br />
    }<br />
    T * operator[](unsigned row)<br />
    {<br />
      return &amp;data[row*nr_columns];<br />
    }<br />
    const T * operator[](unsigned row) const<br />
    {<br />
      return &amp;data[row*nr_columns];<br />
    }<br />
    static bool multiply(Large_Matrix * answer,<br />
                         const Large_Matrix &amp; lhs,<br />
                         const Large_Matrix &amp; rhs)<br />
    {<br />
      if (lhs.nr_columns != rhs.nr_rows)<br />
        return false;<br />
      if (answer == &amp;lhs)<br />
      {<br />
        Large_Matrix temp(lhs);<br />
        return multiply(answer,temp,rhs);<br />
      }<br />
      if (answer == &amp;rhs)<br />
      {<br />
        Large_Matrix temp(rhs);<br />
        return multiply(answer,lhs,temp);<br />
      }<br />
      unsigned nr_elements = lhs.nr_rows * rhs.nr_columns;<br />
      if (answer-&gt;nr_rows * answer-&gt;nr_columns != nr_elements)<br />
      {<br />
        delete [] answer-&gt;data;<br />
        answer-&gt;data = new T[lhs.nr_rows*rhs.nr_columns];<br />
      }<br />
      answer-&gt;nr_rows = lhs.nr_rows;<br />
      answer-&gt;nr_columns = rhs.nr_columns;</p>
<p>      {<br />
        for (unsigned i = 0; i &lt; nr_elements; i++)<br />
          answer-&gt;data[i] = 0;<br />
      }</p>
<p>      unsigned i0,j0,k0,i,j,k,imax,jmax,kmax,jmax1;<br />
      const unsigned IBLOCK=100;<br />
      const unsigned JBLOCK=128;<br />
      const unsigned KBLOCK=40;<br />
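      for (i0 = 0; i0 &lt; lhs.nr_rows; i0 = imax)<br />
      {<br />
        imax = i0 + IBLOCK;<br />
        if (imax &gt; lhs.nr_rows)<br />
          imax = lhs.nr_rows;<br />
        for (k0 = 0; k0 &lt; lhs.nr_columns; k0 = kmax)<br />
        {<br />
          kmax = k0 + KBLOCK;<br />
          if (kmax &gt; lhs.nr_columns)<br />
            kmax = lhs.nr_columns;<br />
          for (j0 = 0; j0 &lt; rhs.nr_columns; j0 = jmax)<br />
          {<br />
            jmax = j0 + JBLOCK;<br />
            if (jmax &gt; rhs.nr_columns)<br />
              jmax = rhs.nr_columns;<br />
            jmax1 = jmax &amp; ~15;<br />
            // Row pointers for the current block (these declarations were<br />
            // lost in the mangled post and are reconstructed here).<br />
            const T * lrow = lhs[i0];<br />
            T * arow = (*answer)[i0];<br />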
            for (i = i0; i &lt;imax;i++,lrow += lhs.nr_columns,arow += rhs.nr_columns)<br />
            {<br />
              const T * rrow = rhs[k0];<br />
              for (k = k0; k &lt; kmax;k++,rrow += rhs.nr_columns)<br />
              {<br />
                T v = lrow[k];<br />
                for (j = j0; j &lt; jmax1;j+=16)<br />
                {<br />
                  arow[j] += v*rrow[j];<br />
                  arow[j+1] += v*rrow[j+1];<br />
                  arow[j+2] += v*rrow[j+2];<br />
                  arow[j+3] += v*rrow[j+3];<br />
                  arow[j+4] += v*rrow[j+4];<br />
                  arow[j+5] += v*rrow[j+5];<br />
                  arow[j+6] += v*rrow[j+6];<br />
                  arow[j+7] += v*rrow[j+7];<br />
                  arow[j+8] += v*rrow[j+8];<br />
                  arow[j+9] += v*rrow[j+9];<br />
                  arow[j+10] += v*rrow[j+10];<br />
                  arow[j+11] += v*rrow[j+11];<br />
                  arow[j+12] += v*rrow[j+12];<br />
                  arow[j+13] += v*rrow[j+13];<br />
                  arow[j+14] += v*rrow[j+14];<br />
                  arow[j+15] += v*rrow[j+15];<br />
                }<br />
                for (j = jmax1; j &lt; jmax;j++)<br />
                  arow[j] += v*rrow[j];<br />
              }<br />
            }<br />
          }<br />
        }<br />
      }<br />
      return true;<br />
    }<br />
};</p>
<p>The right values for IBLOCK, JBLOCK and KBLOCK will vary from machine to machine, though there doesn&#8217;t seem to be a sharp minimum, and they probably depend on the size of the matrices as well. JBLOCK should be a multiple of 16. If you know the size of your matrices in advance, making the block values factors of that size is probably a good idea.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: oliver</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-29920</link>
		<dc:creator>oliver</dc:creator>
		<pubDate>Fri, 17 Apr 2009 03:38:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-29920</guid>
		<description><![CDATA[Thank you. I fixed my problem (my compilation flags were not quite right). Thanks for the great blog entry.]]></description>
		<content:encoded><![CDATA[<p>Thank you. I fixed my problem (my compilation flags were not quite right). Thanks for the great blog entry.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kwong</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-29839</link>
		<dc:creator>kwong</dc:creator>
		<pubDate>Wed, 15 Apr 2009 20:13:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-29839</guid>
		<description><![CDATA[I am using the latest MKL 10.1 and gcc 4.3.2 (64 bit) with -O3 optimization and optimization for Intel Core 2 processor families enabled. The data was collected on a quad-core PC (Q9450@3.2GHz) with 8GB memory installed. I wouldn&#039;t be surprised if Intel relied heavily on SSE2 and SSE3, which would explain the huge performance discrepancies among different processors.]]></description>
		<content:encoded><![CDATA[<p>I am using the latest MKL 10.1 and gcc 4.3.2 (64 bit) with -O3 optimization and optimization for Intel Core 2 processor families enabled. The data was collected on a quad-core PC (Q9450@3.2GHz) with 8GB memory installed. I wouldn&#8217;t be surprised if Intel relied heavily on SSE2 and SSE3, which would explain the huge performance discrepancies among different processors.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: oliver</title>
		<link>http://www.kerrywong.com/2009/03/07/matrix-multiplication-performance-in-c/comment-page-1/#comment-29830</link>
		<dc:creator>oliver</dc:creator>
		<pubDate>Wed, 15 Apr 2009 15:59:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.kerrywong.com/?p=586#comment-29830</guid>
		<description><![CDATA[Hi,

I am trying to reproduce the same results on my own Intel machine. I get a significant speedup with the Intel MKL, but not nearly as good as yours. Can you please tell me which compiler you used (gcc? icc?) and which compiler flags you used?

Thanks.

oliver]]></description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>I am trying to reproduce the same results on my own Intel machine. I get a significant speedup with the Intel MKL, but not nearly as good as yours. Can you please tell me which compiler you used (gcc? icc?) and which compiler flags you used?</p>
<p>Thanks.</p>
<p>oliver</p>
]]></content:encoded>
	</item>
</channel>
</rss>
