I am taking a look at large matrix multiplication and ran the following experiment to form a baseline test:
- Randomly generate two 4096x4096 matrices X, Y from a standard normal distribution (mean 0, stddev 1).
- Z = X*Y
- Sum elements of Z (to make sure they are accessed) and output.
Here is the naïve C++ implementation:
#include <iostream>
#include <random>
using namespace std;
int main()
{
constexpr size_t dim = 4096;
float* x = new float[dim*dim];
float* y = new float[dim*dim];
float* z = new float[dim*dim];
random_device rd;
mt19937 gen(rd());
normal_distribution<float> dist(0, 1);
for (size_t i = 0; i < dim*dim; i++)
{
x[i] = dist(gen);
y[i] = dist(gen);
}
for (size_t row = 0; row < dim; row++)
for (size_t col = 0; col < dim; col++)
{
float acc = 0;
for (size_t k = 0; k < dim; k++)
acc += x[row*dim + k] * y[k*dim + col];
z[row*dim + col] = acc;
}
float t = 0;
for (size_t i = 0; i < dim*dim; i++)
t += z[i];
cout << t << endl;
delete[] x;
delete[] y;
delete[] z;
}
Compile and run:
$ g++ -std=gnu++11 -O3 test.cpp
$ time ./a.out
Here is the Octave/MATLAB implementation:
X = stdnormal_rnd(4096, 4096);
Y = stdnormal_rnd(4096, 4096);
Z = X*Y;
sum(sum(Z))
Run:
$ time octave < test.octave
Octave under the hood is using BLAS (I assume the dgemm function).
The hardware is an i7 3930X on Linux x86-64 with 24 GB of RAM. BLAS appears to be using two cores. Perhaps a hyperthreaded pair?
I found that the C++ version, compiled with GCC 4.7 at -O3, took 9 minutes to execute:
real 9m2.126s
user 9m0.302s
sys 0m0.052s
The octave version took 6 seconds:
real 0m5.985s
user 0m10.881s
sys 0m0.144s
I understand that BLAS is optimized to all hell, and the naïve algorithm is totally ignoring caches and so on, but seriously -- 90 times?
Can anyone explain this difference? What exactly is the architecture of the BLAS implementation? I see it is using Fortran, but what is happening at the CPU level? What algorithm is it using? How is it using the CPU caches? What x86-64 machine instructions does it call? (Is it using advanced CPU features like AVX?) Where does it get this extra speed from?
Which key optimizations to the C++ algorithm could get it on par with the BLAS version? Also, does Strassen's algorithm account for any of the remaining difference?
I ran Octave under gdb and stopped it halfway through the computation a few times. It had started a second thread, and here are the stacks (they looked similar at every stop):
(gdb) thread 1
#0 0x00007ffff6e17148 in pthread_join () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007ffff1626721 in ATL_join_tree () from /usr/lib/libblas.so.3
#2 0x00007ffff1626702 in ATL_join_tree () from /usr/lib/libblas.so.3
#3 0x00007ffff15ae357 in ATL_dptgemm () from /usr/lib/libblas.so.3
#4 0x00007ffff1384b59 in atl_f77wrap_dgemm_ () from /usr/lib/libblas.so.3
#5 0x00007ffff193effa in dgemm_ () from /usr/lib/libblas.so.3
#6 0x00007ffff6049727 in xgemm(Matrix const&, Matrix const&, blas_trans_type, blas_trans_type) () from /usr/lib/x86_64-linux-gnu/liboctave.so.1
#7 0x00007ffff6049954 in operator*(Matrix const&, Matrix const&) () from /usr/lib/x86_64-linux-gnu/liboctave.so.1
#8 0x00007ffff7839e4e in ?? () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#9 0x00007ffff765a93a in do_binary_op(octave_value::binary_op, octave_value const&, octave_value const&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#10 0x00007ffff76c4190 in tree_binary_expression::rvalue1(int) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#11 0x00007ffff76c33a5 in tree_simple_assignment::rvalue1(int) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#12 0x00007ffff76d0864 in tree_evaluator::visit_statement(tree_statement&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#13 0x00007ffff76cffae in tree_evaluator::visit_statement_list(tree_statement_list&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#14 0x00007ffff757f6d4 in main_loop() () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#15 0x00007ffff7527abf in octave_main () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
(gdb) thread 2
#0 0x00007ffff14ba4df in ATL_dJIK56x56x56TN56x56x0_a1_b1 () from /usr/lib/libblas.so.3
(gdb) bt
#0 0x00007ffff14ba4df in ATL_dJIK56x56x56TN56x56x0_a1_b1 () from /usr/lib/libblas.so.3
#1 0x00007ffff15a5fd7 in ATL_dmmIJK2 () from /usr/lib/libblas.so.3
#2 0x00007ffff15a6ae4 in ATL_dmmIJK () from /usr/lib/libblas.so.3
#3 0x00007ffff1518e65 in ATL_dgemm () from /usr/lib/libblas.so.3
#4 0x00007ffff15adf7a in ATL_dptgemm0 () from /usr/lib/libblas.so.3
#5 0x00007ffff6e15e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007ffff6b41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000000000 in ?? ()
It is calling BLAS gemm, as expected.
The first thread appears to be joining the second, so I am not sure whether these two threads account for the 200% CPU usage observed or not.
Which library provides ATL_dgemm in libblas.so.3, and where is its code?
$ ls -al /usr/lib/libblas.so.3
/usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
$ ls -al /etc/alternatives/libblas.so.3
/etc/alternatives/libblas.so.3 -> /usr/lib/atlas-base/atlas/libblas.so.3
$ ls -al /usr/lib/atlas-base/atlas/libblas.so.3
/usr/lib/atlas-base/atlas/libblas.so.3 -> libblas.so.3.0
$ ls -al /usr/lib/atlas-base/atlas/libblas.so.3.0
/usr/lib/atlas-base/atlas/libblas.so.3.0
$ dpkg -S /usr/lib/atlas-base/atlas/libblas.so.3.0
libatlas3-base: /usr/lib/atlas-base/atlas/libblas.so.3.0
$ apt-get source libatlas3-base
It is ATLAS 3.8.4.
Here are the optimizations I later implemented:
Using a tiled approach where I preload 64x64 blocks of X, Y and Z into separate arrays.
Changing the calculation of each block so that the inner loop looks like this:
for (size_t tcol = 0; tcol < block_width; tcol++)
bufz[trow][tcol] += B * bufy[tk][tcol];
This allows GCC to autovectorize to SIMD instructions and also allows for instruction level parallelism (I think).
Turning on -march=corei7-avx. This gains 30% extra speed but is cheating because I think the BLAS library is prebuilt.
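For reference, the tiled version was built and timed the same way as before (the source file name here is only an assumption):
$ g++ -std=gnu++11 -O3 -march=corei7-avx test2.cpp
$ time ./a.out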
Here is the code:
#include <iostream>
#include <random>
using namespace std;
constexpr size_t dim = 4096;
constexpr size_t block_width = 64;
constexpr size_t num_blocks = dim / block_width;
double X[dim][dim], Y[dim][dim], Z[dim][dim];
double bufx[block_width][block_width];
double bufy[block_width][block_width];
double bufz[block_width][block_width];
void calc_block()
{
for (size_t trow = 0; trow < block_width; trow++)
for (size_t tk = 0; tk < block_width; tk++)
{
double B = bufx[trow][tk];
for (size_t tcol = 0; tcol < block_width; tcol++)
bufz[trow][tcol] += B * bufy[tk][tcol];
}
}
int main()
{
random_device rd;
mt19937 gen(rd());
normal_distribution<double> dist(0, 1);
for (size_t row = 0; row < dim; row++)
for (size_t col = 0; col < dim; col++)
{
X[row][col] = dist(gen);
Y[row][col] = dist(gen);
Z[row][col] = 0;
}
for (size_t block_row = 0; block_row < num_blocks; block_row++)
for (size_t block_col = 0; block_col < num_blocks; block_col++)
{
for (size_t trow = 0; trow < block_width; trow++)
for (size_t tcol = 0; tcol < block_width; tcol++)
bufz[trow][tcol] = 0;
for (size_t block_k = 0; block_k < num_blocks; block_k++)
{
for (size_t trow = 0; trow < block_width; trow++)
for (size_t tcol = 0; tcol < block_width; tcol++)
{
bufx[trow][tcol] = X[block_row*block_width + trow][block_k*block_width + tcol];
bufy[trow][tcol] = Y[block_k*block_width + trow][block_col*block_width + tcol];
}
calc_block();
}
for (size_t trow = 0; trow < block_width; trow++)
for (size_t tcol = 0; tcol < block_width; tcol++)
Z[block_row*block_width + trow][block_col*block_width + tcol] = bufz[trow][tcol];
}
double t = 0;
for (size_t row = 0; row < dim; row++)
for (size_t col = 0; col < dim; col++)
t += Z[row][col];
cout << t << endl;
}
All the action is in the calc_block function - over 90% of the time is spent in it.
The new time is:
real 0m17.370s
user 0m17.213s
sys 0m0.092s
Which is much closer.
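(For scale: a 4096x4096 multiply is about 2 * 4096^3 ≈ 1.37e11 floating-point operations, so 17.4 seconds works out to roughly 8 GFLOPS, versus roughly 0.25 GFLOPS for the original 9-minute run and roughly 23 GFLOPS for the 6-second Octave run.)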
The disassembly of the calc_block function is as follows:
0000000000401460 <_Z10calc_blockv>:
401460: b8 e0 21 60 00 mov $0x6021e0,%eax
401465: 41 b8 e0 23 61 00 mov $0x6123e0,%r8d
40146b: 31 ff xor %edi,%edi
40146d: 49 29 c0 sub %rax,%r8
401470: 49 8d 34 00 lea (%r8,%rax,1),%rsi
401474: 48 89 f9 mov %rdi,%rcx
401477: ba e0 a1 60 00 mov $0x60a1e0,%edx
40147c: 48 c1 e1 09 shl $0x9,%rcx
401480: 48 81 c1 e0 21 61 00 add $0x6121e0,%rcx
401487: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40148e: 00 00
401490: c4 e2 7d 19 01 vbroadcastsd (%rcx),%ymm0
401495: 48 83 c1 08 add $0x8,%rcx
401499: c5 fd 59 0a vmulpd (%rdx),%ymm0,%ymm1
40149d: c5 f5 58 08 vaddpd (%rax),%ymm1,%ymm1
4014a1: c5 fd 29 08 vmovapd %ymm1,(%rax)
4014a5: c5 fd 59 4a 20 vmulpd 0x20(%rdx),%ymm0,%ymm1
4014aa: c5 f5 58 48 20 vaddpd 0x20(%rax),%ymm1,%ymm1
4014af: c5 fd 29 48 20 vmovapd %ymm1,0x20(%rax)
4014b4: c5 fd 59 4a 40 vmulpd 0x40(%rdx),%ymm0,%ymm1
4014b9: c5 f5 58 48 40 vaddpd 0x40(%rax),%ymm1,%ymm1
4014be: c5 fd 29 48 40 vmovapd %ymm1,0x40(%rax)
4014c3: c5 fd 59 4a 60 vmulpd 0x60(%rdx),%ymm0,%ymm1
4014c8: c5 f5 58 48 60 vaddpd 0x60(%rax),%ymm1,%ymm1
4014cd: c5 fd 29 48 60 vmovapd %ymm1,0x60(%rax)
4014d2: c5 fd 59 8a 80 00 00 vmulpd 0x80(%rdx),%ymm0,%ymm1
4014d9: 00
4014da: c5 f5 58 88 80 00 00 vaddpd 0x80(%rax),%ymm1,%ymm1
4014e1: 00
4014e2: c5 fd 29 88 80 00 00 vmovapd %ymm1,0x80(%rax)
4014e9: 00
4014ea: c5 fd 59 8a a0 00 00 vmulpd 0xa0(%rdx),%ymm0,%ymm1
4014f1: 00
4014f2: c5 f5 58 88 a0 00 00 vaddpd 0xa0(%rax),%ymm1,%ymm1
4014f9: 00
4014fa: c5 fd 29 88 a0 00 00 vmovapd %ymm1,0xa0(%rax)
401501: 00
401502: c5 fd 59 8a c0 00 00 vmulpd 0xc0(%rdx),%ymm0,%ymm1
401509: 00
40150a: c5 f5 58 88 c0 00 00 vaddpd 0xc0(%rax),%ymm1,%ymm1
401511: 00
401512: c5 fd 29 88 c0 00 00 vmovapd %ymm1,0xc0(%rax)
401519: 00
40151a: c5 fd 59 8a e0 00 00 vmulpd 0xe0(%rdx),%ymm0,%ymm1
401521: 00
401522: c5 f5 58 88 e0 00 00 vaddpd 0xe0(%rax),%ymm1,%ymm1
401529: 00
40152a: c5 fd 29 88 e0 00 00 vmovapd %ymm1,0xe0(%rax)
401531: 00
401532: c5 fd 59 8a 00 01 00 vmulpd 0x100(%rdx),%ymm0,%ymm1
401539: 00
40153a: c5 f5 58 88 00 01 00 vaddpd 0x100(%rax),%ymm1,%ymm1
401541: 00
401542: c5 fd 29 88 00 01 00 vmovapd %ymm1,0x100(%rax)
401549: 00
40154a: c5 fd 59 8a 20 01 00 vmulpd 0x120(%rdx),%ymm0,%ymm1
401551: 00
401552: c5 f5 58 88 20 01 00 vaddpd 0x120(%rax),%ymm1,%ymm1
401559: 00
40155a: c5 fd 29 88 20 01 00 vmovapd %ymm1,0x120(%rax)
401561: 00
401562: c5 fd 59 8a 40 01 00 vmulpd 0x140(%rdx),%ymm0,%ymm1
401569: 00
40156a: c5 f5 58 88 40 01 00 vaddpd 0x140(%rax),%ymm1,%ymm1
401571: 00
401572: c5 fd 29 88 40 01 00 vmovapd %ymm1,0x140(%rax)
401579: 00
40157a: c5 fd 59 8a 60 01 00 vmulpd 0x160(%rdx),%ymm0,%ymm1
401581: 00
401582: c5 f5 58 88 60 01 00 vaddpd 0x160(%rax),%ymm1,%ymm1
401589: 00
40158a: c5 fd 29 88 60 01 00 vmovapd %ymm1,0x160(%rax)
401591: 00
401592: c5 fd 59 8a 80 01 00 vmulpd 0x180(%rdx),%ymm0,%ymm1
401599: 00
40159a: c5 f5 58 88 80 01 00 vaddpd 0x180(%rax),%ymm1,%ymm1
4015a1: 00
4015a2: c5 fd 29 88 80 01 00 vmovapd %ymm1,0x180(%rax)
4015a9: 00
4015aa: c5 fd 59 8a a0 01 00 vmulpd 0x1a0(%rdx),%ymm0,%ymm1
4015b1: 00
4015b2: c5 f5 58 88 a0 01 00 vaddpd 0x1a0(%rax),%ymm1,%ymm1
4015b9: 00
4015ba: c5 fd 29 88 a0 01 00 vmovapd %ymm1,0x1a0(%rax)
4015c1: 00
4015c2: c5 fd 59 8a c0 01 00 vmulpd 0x1c0(%rdx),%ymm0,%ymm1
4015c9: 00
4015ca: c5 f5 58 88 c0 01 00 vaddpd 0x1c0(%rax),%ymm1,%ymm1
4015d1: 00
4015d2: c5 fd 29 88 c0 01 00 vmovapd %ymm1,0x1c0(%rax)
4015d9: 00
4015da: c5 fd 59 82 e0 01 00 vmulpd 0x1e0(%rdx),%ymm0,%ymm0
4015e1: 00
4015e2: c5 fd 58 80 e0 01 00 vaddpd 0x1e0(%rax),%ymm0,%ymm0
4015e9: 00
4015ea: 48 81 c2 00 02 00 00 add $0x200,%rdx
4015f1: 48 39 ce cmp %rcx,%rsi
4015f4: c5 fd 29 80 e0 01 00 vmovapd %ymm0,0x1e0(%rax)
4015fb: 00
4015fc: 0f 85 8e fe ff ff jne 401490 <_Z10calc_blockv+0x30>
401602: 48 83 c7 01 add $0x1,%rdi
401606: 48 05 00 02 00 00 add $0x200,%rax
40160c: 48 83 ff 40 cmp $0x40,%rdi
401610: 0f 85 5a fe ff ff jne 401470 <_Z10calc_blockv+0x10>
401616: c5 f8 77 vzeroupper
401619: c3 retq
40161a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
Answer
Here are three factors responsible for the performance difference between your code and BLAS (plus a note on Strassen’s algorithm).
In your inner loop, on k, you have y[k*dim + col]. Because of the way the memory cache is arranged, consecutive values of k with the same dim and col map to the same cache set. The way the cache is structured, each memory address has one cache set where its contents must be held while it is in cache. Each cache set has several lines (four is a typical number), and each of those lines can hold any of the memory addresses that map to that particular cache set.
Because your inner loop iterates through y in this way, each time it uses an element from y, it must load the memory for that element into the same set as the previous iteration did. This forces one of the previous cache lines in the set to be evicted. Then, in the next iteration of the col loop, all of the elements of y have been evicted from the cache, so they must be reloaded again.
Thus, every time your loop loads an element of y, it must be loaded from memory, which takes many CPU cycles.
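To put rough numbers on that (assuming, for illustration, an L1 data cache of 32 KB with 64-byte lines and 64 sets, as is commonly reported for Sandy Bridge): the stride between y[k*dim + col] and y[(k+1)*dim + col] is dim * sizeof(float) = 4096 * 4 = 16384 bytes. The set index is (address / 64) mod 64, and 16384 / 64 = 256 is a multiple of 64, so every step of k lands in the same cache set; after a handful of loads the set is full and older lines start being evicted.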
High-performance code avoids this in two ways. One, it divides the work into smaller blocks. The rows and the columns are partitioned into smaller sizes, and processed with shorter loops that are able to use all the elements in a cache line and to use each element several times before they go on to the next block. Thus, most of the references to elements of x and elements of y come from cache, often in a single processor cycle. Two, in some situations, the code will copy data out of a column of a matrix (which thrashes cache due to the geometry) into a row of a temporary buffer (which avoids thrashing). This again allows most of the memory references to be served from cache instead of from memory.
Another factor is the use of Single Instruction Multiple Data (SIMD) features. Many modern processors have instructions that load multiple elements (four float elements is typical, but some now do eight) in one instruction, store multiple elements, add multiple elements (e.g., for each of these four, add it to the corresponding one of those four), multiply multiple elements, and so on. Simply using such instructions immediately makes your code four times faster, provided you are able to arrange your work to use those instructions.
These instructions are not directly accessible in standard C. Some optimizers now try to use such instructions when they can, but this optimization is difficult, and it is not common to gain much benefit from it. Many compilers provide extensions to the language that give access to these instructions. Personally, I usually prefer to write in assembly language to use SIMD.
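For illustration only, here is a minimal sketch of your tiled inner loop written with AVX intrinsics (GCC's <immintrin.h>), processing four doubles per 256-bit register. It assumes block_width is a multiple of 4, uses the global buffers from your tiled program, and uses unaligned loads/stores so it makes no assumption about buffer alignment; it is not your code, just one way to write it explicitly.
#include <immintrin.h>
void calc_block_avx()
{
    for (size_t trow = 0; trow < block_width; trow++)
        for (size_t tk = 0; tk < block_width; tk++)
        {
            // Broadcast B = bufx[trow][tk] into all four lanes.
            __m256d b = _mm256_broadcast_sd(&bufx[trow][tk]);
            for (size_t tcol = 0; tcol < block_width; tcol += 4)
            {
                __m256d zv = _mm256_loadu_pd(&bufz[trow][tcol]);
                __m256d yv = _mm256_loadu_pd(&bufy[tk][tcol]);
                // z += B * y, four doubles at a time.
                zv = _mm256_add_pd(zv, _mm256_mul_pd(b, yv));
                _mm256_storeu_pd(&bufz[trow][tcol], zv);
            }
        }
}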
Another factor is using instruction-level parallel execution features on a processor. Observe that in your inner loop, acc is updated. The next iteration cannot add to acc until the previous iteration has finished updating acc. High-performance code will instead keep multiple sums running in parallel (even multiple SIMD sums). The result of this will be that while the addition for one sum is executing, the addition for another sum will be started. It is common on today's processors to support four or more floating-point operations in progress at a time. As written, your code cannot do this at all. Some compilers will try to optimize the code by rearranging loops, but this requires the compiler to be able to see that iterations of a particular loop are independent from each other or can be commuted with another loop, et cetera.
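As a minimal sketch (not your code), the body of the col loop in your original program could keep several independent partial sums, assuming dim is a multiple of 4; note this changes the order of the floating-point additions slightly:
float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
for (size_t k = 0; k < dim; k += 4)
{
    // Four independent accumulators break the serial dependence on a single
    // acc, so several multiply-adds can be in flight at once.
    acc0 += x[row*dim + k + 0] * y[(k + 0)*dim + col];
    acc1 += x[row*dim + k + 1] * y[(k + 1)*dim + col];
    acc2 += x[row*dim + k + 2] * y[(k + 2)*dim + col];
    acc3 += x[row*dim + k + 3] * y[(k + 3)*dim + col];
}
z[row*dim + col] = (acc0 + acc1) + (acc2 + acc3);
(This only addresses the dependency chain; the cache-unfriendly access to y described above is unchanged.)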
It is quite feasible that using cache effectively provides a factor of ten performance improvement, SIMD provides another four, and instruction-level parallelism provides another four, giving 160 altogether.
Here is a very crude estimate of the effect of Strassen’s algorithm, based on this Wikipedia page. The Wikipedia page says Strassen is slightly better than direct multiplication around n = 100. This suggests the ratio of the constant factors of the execution times is 100^3 / 100^2.807 ≈ 2.4. Obviously, this will vary tremendously depending on processor model, matrix sizes interacting with cache effects, and so on. However, simple extrapolation shows that Strassen is about twice as good as direct multiplication at n = 4096 ((4096/100)^(3−2.807) ≈ 2.05). Again, that is just a ballpark estimate.
As for the later optimizations, consider this code in the inner loop:
bufz[trow][tcol] += B * bufy[tk][tcol];
One potential issue with this is that bufz could, in general, overlap bufy. Since you use global definitions for bufz and bufy, the compiler likely knows they do not overlap in this case. However, if you move this code into a subroutine that is passed bufz and bufy as parameters, and especially if you compile that subroutine in a separate source file, then the compiler is less likely to know that bufz and bufy do not overlap. In that case, the compiler cannot vectorize or otherwise reorder the code, because the bufz[trow][tcol] in this iteration might be the same as bufy[tk][tcol] in another iteration.
Even if the compiler can see that the subroutine is called with different bufz and bufy in the current source module, if the routine has extern linkage (the default), then the compiler has to allow for the routine to be called from an external module, so it must generate code that works correctly if bufz and bufy overlap. (One way the compiler can deal with this is to generate two versions of the routine, one to be called from external modules and one to be called from the module currently being compiled. Whether it does that depends on your compiler, the optimization switches, et cetera.) If you declare the routine as static, then the compiler knows it cannot be called from an external module (unless you take its address and there is a possibility the address is passed outside of the current module).
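A minimal sketch of one way to convey non-overlap explicitly, assuming GCC (the __restrict__ keyword is a common compiler extension, not standard C++; the function and parameter names here are illustrative, not from your code):
static void calc_block_kernel(double* __restrict__ bz,
                              const double* __restrict__ bx,
                              const double* __restrict__ by)
{
    // __restrict__ promises the compiler the three buffers never overlap, and
    // static keeps the routine local to this translation unit, so the
    // vectorizer is free to reorder and vectorize the updates.
    for (size_t trow = 0; trow < block_width; trow++)
        for (size_t tk = 0; tk < block_width; tk++)
        {
            double B = bx[trow*block_width + tk];
            for (size_t tcol = 0; tcol < block_width; tcol++)
                bz[trow*block_width + tcol] += B * by[tk*block_width + tcol];
        }
}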
Another potential issue is that, even if the compiler vectorizes this code, it does not necessarily generate the best code for the processor you execute on. Looking at the generated assembly code, it appears the compiler is using only %ymm1 repeatedly. Over and over again, it multiplies a value from memory into %ymm1, adds a value from memory to %ymm1, and stores a value from %ymm1 to memory. There are two problems with this.
One, you do not want these partial sums stored to memory frequently. You want many additions accumulated into a register, and the register will be written to memory only infrequently. Convincing the compiler to do this likely requires rewriting the code to be explicit about keeping partial sums in temporary objects and writing them to memory after a loop has completed.
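A minimal sketch of that idea applied to calc_block (assuming block_width is a multiple of 4; again, the reordering changes the floating-point rounding slightly):
for (size_t trow = 0; trow < block_width; trow++)
    for (size_t tcol = 0; tcol < block_width; tcol += 4)
    {
        // Keep four partial sums in locals (i.e. registers) across the whole
        // tk loop and write them back to bufz only once at the end.
        double s0 = bufz[trow][tcol + 0], s1 = bufz[trow][tcol + 1];
        double s2 = bufz[trow][tcol + 2], s3 = bufz[trow][tcol + 3];
        for (size_t tk = 0; tk < block_width; tk++)
        {
            double B = bufx[trow][tk];
            s0 += B * bufy[tk][tcol + 0];
            s1 += B * bufy[tk][tcol + 1];
            s2 += B * bufy[tk][tcol + 2];
            s3 += B * bufy[tk][tcol + 3];
        }
        bufz[trow][tcol + 0] = s0; bufz[trow][tcol + 1] = s1;
        bufz[trow][tcol + 2] = s2; bufz[trow][tcol + 3] = s3;
    }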
Two, these instructions are nominally serially dependent. The add cannot start until the multiply completes, and the store cannot write to memory until the add completes. The Core i7 has great capabilities for out-of-order execution. So, while it has that add waiting to start execution, it looks at the multiply later in the instruction stream and starts it. (Even though that multiply also uses %ymm1, the processor remaps the registers on the fly, so that it uses a different internal register to do this multiply.) Even though your code is filled with consecutive dependencies, the processor tries to execute several instructions at once. However, a number of things can interfere with this. You can run out of the internal registers the processor uses for renaming. The memory addresses you use might run into false conflicts. (The processor looks at a dozen or so of the low bits of memory addresses to see if the address might be the same as another one that it is trying to load or store from an earlier instruction. If the bits are equal, the processor has to delay the current load or store until it can verify the entire address is different. This delay can bollux up more than just the current load or store.) So, it is better to have instructions that are overtly independent.
That is one more reason I prefer to write high-performance code in assembly. To do it in C, you have to convince the compiler to give you instructions like this, by doing things such as writing some of your own SIMD code (using the language extensions for them) and manually unrolling loops (writing out multiple iterations).
When copying into and out of buffers, there might be similar issues. However, you report 90% of the time is spent in calc_block, so I have not looked at this closely.