assembly - Why does Intel's compiler prefer NEG+ADD over SUB?

Wednesday, 4 April 2018

assembly - Why does Intel's compiler prefer NEG+ADD over SUB?

In examining the output of various compilers for a variety of code snippets, I've noticed that Intel's C compiler (ICC) has a strong tendency to prefer emitting a pair of NEG+ADD instructions where other compilers would use a single SUB instruction.

As a simple example, consider the following C code:

uint64_t Mod3(uint64_t value)
{
    return (value % 3);
}

ICC translates this to the following machine code (regardless of optimization level):

mov       rcx, 0xaaaaaaaaaaaaaaab

mov       rax, rdi
mul       rcx
shr       rdx, 1
lea       rsi, QWORD PTR [rdx+rdx*2]
neg       rsi                            ; \  equivalent to:
add       rdi, rsi                       ; /    sub  rdi, rsi
mov       rax, rdi
ret

Whereas other compilers (including MSVC, GCC, and Clang) will all generate essentially equivalent code, except that the NEG+ADD sequence is replaced by a single SUB instruction.

Like I said, this isn't just a quirk of how ICC compiles this particular snippet. It's a pattern I've observed repeatedly when analyzing the disassembly for arithmetic operations. I normally wouldn't think much of this, except that ICC is known to be a pretty good optimizing compiler and it is developed by folks that have insider information about their microprocessors.

Could there be something that Intel knows about the implementation of the SUB instruction on their processors that makes it more optimal to decompose it into NEG+ADD instructions? Using RISC-style instructions that decode into simpler µops is well-known optimization advice for modern microarchitectures, so is it possible that SUB is broken down internally into individual NEG and ADD µops, and that it is actually more efficient for the front-end decoder to use these "simpler" instructions? Modern CPUs are complicated, so anything is possible.

Agner Fog's comprehensive instruction tables confirm my intuition, though, that this is actually a pessimization. SUB is equally as efficient as ADD on all processors, so the additional required NEG instruction just serves to slow things down.

I also ran the two sequences through Intel's own Architecture Code Analyzer to analyze the throughput. Though the exact cycle counts and port bindings vary from one microarchitecture to another, a single SUB appears to be superior in every respect from Nehalem to Broadwell. Here are the two reports generated by the tool for Haswell:

SUB

Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)


Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.5  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.8  | 1.7  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |

---------------------------------------------------------------------------------
|   1    | 0.1       | 0.2 |           |           |     | 0.3 | 0.4 |     | CP | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.3 |           |           |     | 0.4 | 0.2 |     | CP | sub rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 7

NEG+ADD

Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report

--------------------------
Block Throughput: 2.15 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.1    0.0  | 2.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------


| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.9 |           |           |     | 0.1 | 0.1 |     |    | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 1.0       |     |           |           |     |     |     |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.1 |           |           |     | 0.8 | 0.1 |     | CP | neg rax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx

Total Num Of Uops: 8

So, as far as I can tell, NEG+ADD increases the code size, increases the number of µops, increases pressure for execution ports, and increases the number of cycles, thus resulting in a net decrease in throughput compared to SUB. So why is Intel's compiler doing this?

Is this just some quirk of the code generator that should be reported as a defect, or am I missing some merit in my analysis?

Blog

Wednesday, 4 April 2018