Saturday, 24 March 2018

Performance optimisations of x86-64 assembly - Alignment and branch prediction

I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc, using x86-64 assembly with SSE-2 instructions.



So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.




For instance, adding or even removing a few simple instructions, or simply reorganising some local labels used by jumps, completely degrades the overall performance, even though nothing in the code itself explains it.



So my guess is that there are some issues with code alignment, and/or with branches that get mispredicted.



I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.



But is there any general advice, when developing for high performance on x86-64, about code alignment and branch prediction?



In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?




_func:
    ; ... Some code ...
    test rax, rax
    jz .label
    ; ... Some code ...
    ret
.label:
    ; ... Some code ...
    ret



In the previous code, should I use an align directive before .label:, like:



align 4
.label:


If so, is it enough to align on a DWORD when using SSE-2?
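To make the question concrete: an align directive pads the code (with NOPs by default in NASM) until the current offset is a multiple of the requested value. The C sketch below only illustrates that arithmetic. The function names align_up and align_padding are hypothetical, and the alignment is assumed to be a power of two.

```c
#include <stdint.h>

/* Round `offset` up to the next multiple of `alignment`.
 * Assumption: `alignment` is a power of two (4, 8, 16, ...). */
static uintptr_t align_up(uintptr_t offset, uintptr_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Number of padding bytes an `align` directive would have to insert
 * so that the next instruction starts on an aligned offset. */
static uintptr_t align_padding(uintptr_t offset, uintptr_t alignment)
{
    return align_up(offset, alignment) - offset;
}
```

For a label that would otherwise land at a hypothetical offset of 9, align 4 moves it to 12 (3 padding bytes), while align 8 and align 16 both move it to 16 (7 padding bytes).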



And about branch prediction, is there a «preferred» way to organise the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime by counting the number of times a branch is taken?




EDIT



Ok, here's a concrete example - here's the start of strlen() with SSE-2:



_strlen64_sse2:
    mov rsi, rdi
    and rdi, -16
    pxor xmm0, xmm0
    pcmpeqb xmm0, [ rdi ]
    pmovmskb rdx, xmm0
    ; ...


Running it 10'000'000 times with a 1000 character string gives about 0.48 seconds, which is fine.
But it does not check for a NULL string input. So obviously, I'll add a simple check:



_strlen64_sse2:
    test rdi, rdi
    jz .null
    ; ...



With the same test, it now runs in 0.59 seconds. But if I align the code after this check:



_strlen64_sse2:
    test rdi, rdi
    jz .null
    align 8
    ; ...



The original performance is back. I used 8 for the alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advice about when to align, or not to align, code sections?
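For reference, a timing loop of the kind described above could be sketched in C as follows. This is only an assumption about the setup, not the exact harness used: the names strlen_fn, make_string and bench_strlen are hypothetical, and any function with the strlen() signature (the standard one, or an assembly routine such as _strlen64_sse2) can be passed in.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical function-pointer type for the strlen variant under test. */
typedef size_t (*strlen_fn)(const char *);

/* Build a heap string of `len` non-zero characters plus the terminator.
 * Returns NULL on allocation failure; the caller frees the result. */
static char *make_string(size_t len)
{
    char *s = malloc(len + 1);
    if (s == NULL)
        return NULL;
    memset(s, 'a', len);
    s[len] = '\0';
    return s;
}

/* Call fn(s) `iterations` times and return the elapsed CPU time in seconds.
 * The sum of all returned lengths is stored in *checksum. */
static double bench_strlen(strlen_fn fn, const char *s,
                           unsigned long iterations, size_t *checksum)
{
    /* Reading the pointer through a volatile object on every iteration
     * keeps the compiler from hoisting or folding the call. */
    strlen_fn volatile call = fn;
    size_t sum = 0;

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        sum += call(s);
    clock_t end = clock();

    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling bench_strlen() with a string from make_string(1000) and 10000000 iterations mirrors the test described above. clock() measures CPU time at a coarse resolution, so differences as small as the one between 0.48 and 0.59 seconds are best confirmed over several runs.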



EDIT 2



Of course, it's not as simple as aligning every branch target. If I do that, performance usually gets worse, except in some specific cases like the one above.
