c - 32 byte store forwarding on Sandy Bridge

Wednesday, 16 August 2017

In Agner Fog's excellent microarchitecture.pdf (section 9.14) I read that:

Store forwarding works in the following cases: [...] When a write of 128 or 256 bits is followed by a read of the same size and the same address, aligned by 16.

On the other hand, in Intel's Architecture Optimization Reference Manual (2.2.5.2 Intel Sandy Bridge, L1 DCache) I read that:

Stores cannot forward to loads in the following cases: [...] Any load that crosses a 16-byte boundary of a 32-byte store.
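Read literally, the two quoted rules can be written as predicates. The following is only a sketch of the wording, not a model of the hardware: the function names are made up for illustration, and the Intel rule assumes a 32-byte-aligned store (as with vmovapd), so that its inner 16-byte boundary sits at store address + 16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Agner Fog's rule as quoted: a 128- or 256-bit write followed by a read of
 * the same size at the same address, with the address aligned by 16.
 * Hypothetical helper, named for illustration only. */
bool agner_allows_forwarding(uintptr_t store_addr, size_t store_size,
                             uintptr_t load_addr, size_t load_size)
{
    assert(store_size > 0 && load_size > 0);
    return (store_size == 16 || store_size == 32)
        && load_size == store_size
        && load_addr == store_addr
        && store_addr % 16 == 0;
}

/* Intel's rule as quoted, taken literally: a load is blocked if it crosses
 * the 16-byte boundary inside a 32-byte store.
 * Assumption: the store is 32-byte aligned, so the boundary is store_addr + 16. */
bool intel_blocks_forwarding(uintptr_t store_addr, size_t store_size,
                             uintptr_t load_addr, size_t load_size)
{
    assert(store_size > 0 && load_size > 0);
    if (store_size != 32)
        return false;                      /* the quoted rule only covers 32-byte stores */
    uintptr_t boundary = store_addr + 16;  /* midpoint of the 32-byte store */
    return load_addr < boundary && load_addr + load_size > boundary;
}
```

For a 32-byte store and a 32-byte load at the same 32-byte-aligned address, agner_allows_forwarding returns true while intel_blocks_forwarding also returns true, which is exactly the contradiction in question. For the 16-byte case, only the first predicate is true.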

"Any load" would seem to include a 32-byte load as well. I wrote the following simple code to test this, and it suggests that 32-byte stores are not forwarded to subsequent 32-byte loads on the Sandy Bridge architecture. Here is the code:

#include <stdlib.h>
#include <malloc.h>  /* memalign */

int main(){

  long i;

  // aligned memory address
  double *tempa = (double*)memalign(4096, sizeof(double)*4);

  for(i=0; i<4; i++) tempa[i] = 1.0;

  for(i=0; i<1000000000; i++){ // 1e9 iterations

#ifdef TEST_AVX
    __asm__("vmovapd    %%ymm12, (%0)\n\t"
            "vmovapd    (%0), %%ymm12\n\t" 
        : 
        :"r"(tempa));
#else

    __asm__("movapd %%xmm12, (%0)\n\t"
            "movapd (%0), %%xmm12\n\t"
            :
            :"r"(tempa));
#endif
  }
}

The loop does nothing but store a vector register to a 4 KiB-aligned memory location and load it straight back. Compiled to use the AVX instructions (gcc -O3 -DTEST_AVX), the program runs in 3.1 s on my 2.7 GHz i7-2620M; with the SSE2 instructions it runs in 2.5 s. I have also looked at the performance counters. In the AVX case I count one store-forwarding block event per iteration (LD_BLOCKS.STORE_FORWARD, event 03H, umask 02H). The counter reads 0 in the SSE2 case.
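To put the two run times on a per-iteration footing, they can be converted to core cycles. This is a rough sketch that assumes the core ran at its nominal 2.7 GHz for the whole run; the post does not say whether turbo was active, and if it was, the real cycle counts are higher. The function name is hypothetical.

```c
#include <assert.h>

/* Converts a wall-clock time into core cycles per loop iteration.
 * Assumption: the core ran at a constant clock_hz for the whole run
 * (no turbo, no throttling), which the measurement above does not confirm. */
double cycles_per_iteration(double seconds, double clock_hz, double iterations)
{
    assert(seconds >= 0.0 && clock_hz > 0.0 && iterations > 0.0);
    return seconds * clock_hz / iterations;
}
```

With the numbers above, cycles_per_iteration(3.1, 2.7e9, 1e9) comes to about 8.4 cycles for the AVX loop and cycles_per_iteration(2.5, 2.7e9, 1e9) to about 6.75 cycles for the SSE2 loop, a difference of roughly 1.6 cycles per store/load pair under that assumption.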

Can anybody shed some light on this? Does Sandy Bridge really not support forwarding of 32-byte stores to 32-byte loads? If so, spilling ymm registers seems a rather expensive thing to do.
