Can GCC emit different instruction mnemonics when choosing between multiple alternative operand constraints of inline assembly?

Friday, 1 September 2017

Can GCC emit different instruction mnemonics when choosing between multiple alternative operand constraints of inline assembly?

I am trying to write inline x86-64 assembly for GCC to efficiently use the MULQ instruction. MULQ multiplies the 64-bit register RAX with another 64-bit value. The other value can be any 64-bit register (even RAX) or a value in memory. MULQ puts the high 64 bits of the product into RDX and the low 64 bits into RAX.

Now, it's easy enough to express a correct mulq as inline assembly:

#include 
static inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y)
{
    asm ("mulq %[y]" 
          : "=d" (*high), "=a" (*low)
          : "a" (x), [y] "rm" (y)    
        );
}

This code is correct, but not optimal. MULQ is commutative, so if y happened to be in RAX already, then it would be correct to leave y where it is and do the multiply. But GCC doesn't know that, so it will emit extra instructions to move the operands into their pre-defined places. I want to tell GCC that it can put either input in either location, as long as one ends up in RAX and the MULQ references the other location. GCC has a syntax for this, called "multiple alternative constraints". Notice the commas (but the overall asm() is broken; see below):

asm ("mulq %[y]" 
      : "=d,d" (*high), "=a,a" (*low)
      : "a,rm" (x), [y] "rm,a" (y)    
    );

Unfortunately, this is wrong. If GCC chooses the second alternative constraint, it will emit "mulq %rax". To be clear, consider this function:

uint64_t f()
{
    uint64_t high, low;
    uint64_t rax;
    asm("or %0,%0": "=a" (rax));
    mulq(&high, &low, 7, rax);
    return high;
}

Compiled with gcc -O3 -c -fkeep-inline-functions mulq.c, GCC emits this assembly:

0000000000000010 :
  10:   or     %rax,%rax
  13:   mov    $0x7,%edx
  18:   mul    %rax
  1b:   mov    %rdx,%rax
  1e:   retq

The "mul %rax" should be "mul %rdx".

How can this inline asm be rewritten so it generates the correct output in every case?

Blog

Friday, 1 September 2017