performance - C: memcpy speed on dynamically allocated arrays

Thursday, 26 April 2018

performance - C: memcpy speed on dynamically allocated arrays

I need help with the performance of the following code. It does a memcpy on two dynamically allocated arrays of arbitrary size:

int main()
{
  double *a, *b;
  unsigned n = 10000000, i;
  a = malloc(n*sizeof(double));
  b = malloc(n*sizeof(double));
  for(i=0; i    a[i] = 1.0;
    /* b[i] = 0.0; */

  }

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero1");

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero2");


  tic();
  memcpy(b, a, n*sizeof(double));
  toc("memcpy");
}

tic/toc measure the execution time.

On my computer it takes 0.035s to memcpy (Linux, gcc version 4.4.6).
If I now uncomment the line which initializes the destination array b, the code is three times faster (!) - 0.011s.

I have observed similar behavior when using a loop instead of memcpy. Usually I do not care about this since it is enough to 'initialize' the memory before using it. However, I now need to perform a simple memory copy, and do it as fast as possible. Initializing the data requires writing e.g. 0 to the memory, which is not necessary and takes time. And I would like to perform a memory copy with all available memory bandwidth.

Is there a solution to this problem? Or is it connected to the way Linux handles dynamic memory (some sort of lazy page allocation?) and can not be worked around? How is it on other systems?

Edit: The same results are obtained with gcc 4.6. I used -O3 to compile.

Edit:
Thank you all for your comments. I do understand that memory mapping takes time. I guess I just have a hard time accepting that it takes so long, much longer than the actual memory access. The code has been modified to include a benchmark of the initialization of array b using two subsequent bzero calls. The timings now show

bzero1 0.273981
bzero2 0.056803
memcpy 0.117934

Clearly, the first bzero call does much more than just stream zeros to memory - that is memory mapping and memory zeroing. The second bzero call on the other hand takes half of the time required to do a memcpy, which is exactly as expected - write only time vs. read and write time. I understand that the overhead of the second bzero call must be there due to OS security reasons. What about the rest? Can I not decrease it somehow, e.g. use larger memory pages? Different kernel settings?

I should mention that I run this on Ubuntu wheeze.

Answer

The first bzero runs longer because of (1) lazy page allocation and (2) lazy page zero-initialization by kernel. While second reason is unavoidable because of security reasons, lazy page allocation may be optimized by using larger ("huge") pages.

There are at least two ways to use huge pages on Linux. Hard way is hugetlbfs. Easy way is Transparent huge pages.

Search khugepaged in the list of processes on your system. If such process exists, transparent huge pages are supported, you can use them in your application if you change malloc to this:

posix_memalign((void **)&b, 2*1024*1024, n*sizeof(double));
madvise((void *)b, n*sizeof(double), MADV_HUGEPAGE);

Blog

Thursday, 26 April 2018