In examining the output of various compilers for a variety of code snippets, I've noticed that Intel's C compiler (ICC) has a strong tendency to prefer emitting a pair of NEG
+ADD
instructions where other compilers would use a single SUB
instruction.
As a simple example, consider the following C code:
uint64_t Mod3(uint64_t value)
{
return (value % 3);
}
ICC translates this to the following machine code (regardless of optimization level):
mov rcx, 0xaaaaaaaaaaaaaaab
mov rax, rdi
mul rcx
shr rdx, 1
lea rsi, QWORD PTR [rdx+rdx*2]
neg rsi ; \ equivalent to:
add rdi, rsi ; / sub rdi, rsi
mov rax, rdi
ret
Whereas other compilers (including MSVC, GCC, and Clang) will all generate essentially equivalent code, except that the NEG
+ADD
sequence is replaced by a single SUB
instruction.
Like I said, this isn't just a quirk of how ICC compiles this particular snippet. It's a pattern I've observed repeatedly when analyzing the disassembly for arithmetic operations. I normally wouldn't think much of this, except that ICC is known to be a pretty good optimizing compiler and it is developed by folks that have insider information about their microprocessors.
Could there be something that Intel knows about the implementation of the SUB
instruction on their processors that makes it more optimal to decompose it into NEG
+ADD
instructions? Using RISC-style instructions that decode into simpler µops is well-known optimization advice for modern microarchitectures, so is it possible that SUB
is broken down internally into individual NEG
and ADD
µops, and that it is actually more efficient for the front-end decoder to use these "simpler" instructions? Modern CPUs are complicated, so anything is possible.
Agner Fog's comprehensive instruction tables confirm my intuition, though, that this is actually a pessimization. SUB
is equally as efficient as ADD
on all processors, so the additional required NEG
instruction just serves to slow things down.
I also ran the two sequences through Intel's own Architecture Code Analyzer to analyze the throughput. Though the exact cycle counts and port bindings vary from one microarchitecture to another, a single SUB
appears to be superior in every respect from Nehalem to Broadwell. Here are the two reports generated by the tool for Haswell:
SUB
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles Throughput Bottleneck: Dependency chains (possibly between iterations)
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.5 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.8 | 1.7 | 0.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | 0.1 | 0.2 | | | | 0.3 | 0.4 | | CP | mov rax, 0xaaaaaaaaaaaaaaab
| 2 | | 1.0 | | | | | 1.0 | | CP | mul rcx
| 1 | 0.9 | | | | | | 0.1 | | CP | shr rdx, 0x1
| 1 | | | | | | 1.0 | | | CP | lea rax, ptr [rdx+rdx*2]
| 1 | | 0.3 | | | | 0.4 | 0.2 | | CP | sub rcx, rax
| 1* | | | | | | | | | | mov rax, rcx
Total Num Of Uops: 7
NEG+ADD
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles Throughput Bottleneck: Dependency chains (possibly between iterations)
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.1 0.0 | 2.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 2.0 | 2.0 | 0.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | 0.1 | 0.9 | | | | 0.1 | 0.1 | | | mov rax, 0xaaaaaaaaaaaaaaab
| 2 | | 1.0 | | | | | 1.0 | | CP | mul rcx
| 1 | 1.0 | | | | | | | | CP | shr rdx, 0x1
| 1 | | | | | | 1.0 | | | CP | lea rax, ptr [rdx+rdx*2]
| 1 | | 0.1 | | | | 0.8 | 0.1 | | CP | neg rax
| 1 | 0.1 | | | | | 0.1 | 0.9 | | CP | add rcx, rax
| 1* | | | | | | | | | | mov rax, rcx
Total Num Of Uops: 8
So, as far as I can tell, NEG
+ADD
increases the code size, increases the number of µops, increases pressure for execution ports, and increases the number of cycles, thus resulting in a net decrease in throughput compared to SUB
. So why is Intel's compiler doing this?
Is this just some quirk of the code generator that should be reported as a defect, or am I missing some merit in my analysis?
No comments:
Post a Comment