While working on benchmarking some code, I found that its execution time would vary with even the most innocuous code changes.
I have attempted to boil down the code below to the most minimal test case, but it is still rather lengthy (for which I apologize). Changing virtually anything noticeably affects the benchmark results.
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <random>
#include <string>
#include <vector>
constexpr double usec_to_sec = 1000000.0;
// Simple convenience timer
class Timer
{
    std::chrono::high_resolution_clock::time_point start_time;
public:
    Timer() : start_time(std::chrono::high_resolution_clock::now()) { }
    int64_t operator()() const {
        return static_cast<int64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::high_resolution_clock::now() - start_time).count()
        );
    }
};
// Convenience random number generator
template <typename T>
class RandGen
{
    mutable std::default_random_engine generator;
    std::uniform_int_distribution<T> distribution;
    constexpr unsigned make_seed() const {
        return static_cast<unsigned>(std::chrono::system_clock::now().time_since_epoch().count());
    }
public:
    RandGen(T min, T max) : generator(make_seed()), distribution(min, max) { }
    T operator ()() { return distribution(generator); }
};
// Printer class
class Printer
{
    std::string filename;
    template <typename S>
    friend Printer &operator<<(Printer &, S &&s);
public:
    Printer(const char *filename) : filename(filename) {}
};
template <typename S>
Printer &operator<<(Printer &pm, S &&s) {
    std::cout << s;
    return pm;
}
// +------------+
// | Main Stuff |
// +------------+
void runtest(size_t run_length)
{
    static RandGen<size_t> word_sz_generator(10, 20);
    static RandGen<int> rand_char_generator(0, 25);
    size_t total_char_count = 0;
    std::vector<std::string> word_list;
    word_list.reserve(run_length);
    Printer printer("benchmark.dat");
    printer << "Running test... ";
    Timer timer; // start timer
    for (auto i = 0; i < run_length; i++) {
        size_t word_sz = word_sz_generator();
        std::string word;
        for (auto sz = 0; sz < word_sz; sz++) {
            word.push_back(static_cast<char>(rand_char_generator()) + 'a');
        }
        word_list.emplace_back(std::move(word));
        total_char_count += word_sz;
    }
    int64_t execution_time_usec = timer(); // stop timer
    printer << /*run_length*/ word_list.size() << " words, and "
            << total_char_count << " total characters, were built in "
            << execution_time_usec / usec_to_sec << " seconds.\n";
}
int main(int argc, char **argv)
{
    constexpr size_t iterations = 30;
    constexpr size_t run_length = 50000000;
    for (auto i = 0; i < iterations; i++)
        runtest(run_length);
    return EXIT_SUCCESS;
}
The 1st class, Timer, is just a small convenience class (intentionally not well-featured, for brevity) for timing the code.
I tried to do without the 2nd class, RandGen (which just generates random values), but any attempt to exclude it from the test code made the problem auto-magically disappear. So, I suspect the issue has something to do with it. But I can't figure out how.
The 3rd class, Printer, seems entirely unnecessary for this question, but again, including it seems to exacerbate the issue.
So, now we're down to main() (which just runs the test) and runtest().
runtest() is hideous, so please don't look at it from a "clean code" point-of-view. Changing it in any way (ex. moving the inner for loop to its own function) results in a change in benchmark results. The simplest, and most perplexing, example is the last line:
printer << /*run_length*/ word_list.size() << " words, and "
<< total_char_count << " total characters, were built in "
<< execution_time_usec/usec_to_sec << " seconds.\n";
In the line above, run_length and word_list.size() are the same: the size of vector word_list is defined by run_length. But if I run the code as-is, I get an average execution time of 9.8 seconds, whereas if I uncomment run_length and comment out word_list.size(), the execution time actually increases to an average of 10.6 seconds. I can't fathom how such an insignificant code change could affect the timing of the whole program to such an extent.
In other words...
9.8 seconds:
printer << /*run_length*/ word_list.size() << " words, and "
<< total_char_count << " total characters, were built in "
<< execution_time_usec/usec_to_sec << " seconds.\n";
10.6 seconds:
printer << run_length /*word_list.size()*/ << " words, and "
<< total_char_count << " total characters, were built in "
<< execution_time_usec/usec_to_sec << " seconds.\n";
I have repeated the exercise of commenting and uncommenting the variables noted above, and re-running the benchmarks, many times. The benchmarks are repeatable and consistent - i.e. they are consistently 9.8 seconds and 10.6 seconds, respectively.
The code output looks like this, for the two cases:
Running test... 50000000 words, and 750000798 total characters, were built in 9.83379 seconds.
Running test... 50000000 words, and 749978210 total characters, were built in 9.84541 seconds.
Running test... 50000000 words, and 749996688 total characters, were built in 9.87418 seconds.
Running test... 50000000 words, and 749995415 total characters, were built in 9.85704 seconds.
Running test... 50000000 words, and 750017699 total characters, were built in 9.86186 seconds.
Running test... 50000000 words, and 749998680 total characters, were built in 9.83395 seconds.
...
Running test... 50000000 words, and 749988517 total characters, were built in 10.604 seconds.
Running test... 50000000 words, and 749958011 total characters, were built in 10.6283 seconds.
Running test... 50000000 words, and 749994387 total characters, were built in 10.6374 seconds.
Running test... 50000000 words, and 749995242 total characters, were built in 10.6445 seconds.
Running test... 50000000 words, and 749988379 total characters, were built in 10.6543 seconds.
Running test... 50000000 words, and 749969532 total characters, were built in 10.6722 seconds.
...
Any information on what would cause this discrepancy would be greatly appreciated.
Notes:
- Even removing the unused std::string filename member object from the Printer class yields different benchmark results; doing so eliminates (or reduces to insignificant proportions) the difference between the two benchmarks provided above.
- This does not seem to be an issue when compiling with g++ (on Ubuntu). Although, I can't say this definitively; my tests with Ubuntu were in a VM on the same Windows machine, where the VM perhaps did not have access to all of the resources and processor enhancements.
- I am using Visual Studio Community 2017 (version 15.7.4)
- Compiler version: 19.14.26431
- All tests and reported results are Release Build, 64-bit
- System: Win10, i7-6700K @ 4.00 GHz, 32 GB RAM
Answer
You're probably running into some kind of code-alignment effect. Modern x86-64 CPUs are fairly robust with respect to alignment most of the time, but alignment can affect branches aliasing each other in the branch predictors (like @rcgldr mentioned), and various front-end effects.
See https://agner.org/optimize/, and performance links in the x86 tag wiki. But honestly I don't think there's any useful explanation here, other than that you've discovered your loop is sensitive to alignment effects, either from the front-end or from branch prediction. This means that even identical machine code at different alignments in your main program could have different performance.
This is a known phenomenon. An answer on Code alignment in one object file is affecting the performance of a function in another object file has some general comments about how alignment can matter, and see also Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly? There's an article somewhere about how linking object files in a different order can affect performance (and that this is an unexpected effect from the toolchain), but I couldn't find it.
You can use HW performance counters to measure branch misprediction rates to see if that explains why one version is slower than the other. Or if there's some other front-end effect.
But unfortunately there's not much you can do; trivial source differences, if they affect the asm at all, will change alignment for everything.
You can sometimes redesign things to be less sensitive to branch prediction by replacing branches with branchless code, e.g. always generate 16 bytes of random letters and truncate to a random length. (Some branching on size when copying it is probably inevitable, unless creating a 16-byte std::string and then truncating it can be branchless.)
You might speed that up with SIMD, e.g. using a vectorized PRNG such as an SSE2 or AVX2 xorshift+ to generate 16 bytes of random letters at a time. (Efficiently getting a uniform 0..25 distribution with packed-byte operations may be tricky, but maybe the same technique as the 0..9 distribution I used to generate 1 GiB of space-separated random ASCII digits per ~0.03 seconds on a 3.9 GHz Skylake would be useful. It's not perfectly uniformly distributed, because 65536 % 10 has a remainder (as does 65536 % 25), but you can probably change the quality-vs-speed tradeoff and still run fast.)
The asm for both versions of the inner loop in the runtest function is essentially identical, at least if the compiler asm output we see on the Godbolt compiler explorer matches what you're actually getting in the executable from MSVC. (Unlike with gcc/clang, its asm output can't necessarily be assembled into a working object file.) If your real release build does any link-time optimization that could inline some library code, it might make different optimization choices in the final executable.
I put in an #ifdef so I could use -DUSE_RL to get two MSVC 2017 outputs that build the same source the two different ways, and fed those asm outputs to a diff pane. (The diff pane is at the bottom in the messy layout that I linked; click the fullscreen box on it to show just that.)
The only differences in the whole function are:
- Ordering and register choice for a few instructions like mov edx, DWORD PTR _tls_index and mov QWORD PTR run_length$GSCopy$1$[rbp-121], rcx at the top of the function, which only run once (and with no difference in code-size, so they won't affect alignment later). This should have no effect on later code; they end up making the same changes to architectural state, just using a different scratch register which is not used again.
- Stack layout (position of local variables relative to RBP). But all the offsets are under +127, so they can all still use a [rbp + disp8] addressing mode.
- Different code-gen from the actual source difference:
mov rdx, QWORD PTR word_list$[rbp-113]
sub rdx, QWORD PTR word_list$[rbp-121]  ; word_list.size() = end - start
...
sar rdx, 5                              ; >> 5 arithmetic right shift

vs.

mov rdx, rsi                            ; copy run_length from another register
And no, these instructions alone can't possibly explain the speed difference. They're only run once per timing interval, before some I/O.
- An extra npad 7 for alignment before a branch target near the bottom of the function (after a call _Xtime_get_ticks), after the above code difference.
There's a big block of red/green differences, but those are only from different numbering of labels, except for those three instructions at the start of the function.
But before runtest, the word_list.size() version includes code for a ??$?6_K@@YAAEAVPrinter@@AEAV0@$QEA_K@Z PROC function which doesn't appear anywhere in the version using run_length. (C++ name-mangling turns types into funky characters in the asm names of functions.) This is doing something for class Printer.
You said removing the unused std::string filename from Printer removed the code-gen difference. Well, that function probably goes away with that change. IDK why MSVC decided to emit it at all, let alone only in one version and not the other.
Probably g++ -O3 doesn't have that code-gen difference, and that's why you don't see the effect there. (Assuming your VM uses hardware virtualization, g++-generated machine code still runs natively on the CPU. Getting a new page of memory from the OS might take slightly longer in the VM, but the main time in this code is probably spent in user-space inside the loop.)
BTW, gcc warns about the signed/unsigned comparisons in the loops: auto i = 0 deduces int, which is then compared against a size_t loop bound. I didn't look closely at the asm output to see whether that led to worse code-gen with gcc or MSVC, or whether it's just unsafe if you pass large inputs.