c++ - SSE optimisation for a loop that finds zeros in an array and toggles a flag + updates another array

Sunday, 18 February 2018

c++ - SSE optimisation for a loop that finds zeros in an array and toggles a flag + updates another array

What type are the elements of var[]? int? Or char? Are zeroes frequent?

A SIMD prefix sum aka partial is possible (with log2(vector_width) work per element, e.g. 2 shuffles and 2 adds for a vector of 4 float), but the conditional-store based on the result is the other major problem. (Your array of 1000 elements is probably too small for multi-threading to be profitable.)

An integer prefix-sum is easier to do efficiently, and the lower latency of integer ops helps. NOT is just adding without carry, i.e. XOR, so use _mm_xor_si128 instead of _mm_add_ps. (You'd be using this on the integer all-zero/all-one compare result vector from _mm_cmpeq_epi32 (or epi8 or whatever, depending on the element size of var[]. You didn't specify, but different choices of strategy are probably optimal for different sizes).

But, just having a SIMD prefix sum actually barely helps: you'd still have to loop through and figure out where to store and where to leave unmodified.

I think your best bet is to generate a list of indices where you need to store, and then

for (size_t j = 0 ; j < scatter_count ; j+=2) {
    var_num2[ scatter_element[j+0] ] = 0;

    var_num2[ scatter_element[j+1] ] = 1;
}

You could generate the whole list if indices up-front, or you could work in small batches to overlap the search work with the store work.

The prefix-sum part of the problem is handled by alternately storing 0 and 1 in an unrolled loop. The real trick is avoiding branch mispredicts, and generating the indices efficiently.

To generate scatter_element[], you've transformed the problem into left-packing (filtering) an (implicit) array of indices based on the corresponding _mm_cmpeq_epi32( var[i..i+3], _mm_setzero_epi32() ). To generate the indices you're filtering, start with a vector of [0,1,2,3] and add [4,4,4,4] to it (_mm_add_epi32). I'm assuming the element size of var[] is 32 bits. If you have smaller elements, this require unpacking.

BTW, AVX512 has scatter instructions which you could use here, otherwise doing the store part with scalar code is your best bet. (But beware of Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake when just storing without loading.)

To overlap the left-packing with the storing, I think you want to left-pack until you have maybe 64 indices in a buffer. Then leave that loop and run another loop that left-packs indices and consumes indices, only stopping if your circular buffer is full (then just store) or empty (then just left-pack). This lets you overlap the vector compare / lookup-table work with the scatter-store work, but without too much unpredictable branching.

If zeros are very frequent, and var_num2[] elements are 32 or 64 bits, and you have AVX or AVX2 available, you could consider doing an standard prefix sum and using AVX masked stores. e.g. vpmaskmovd. Don't use SSE maskmovdqu, though: it has an NT hint, so it bypasses and evicts data from cache, and is quite slow.

Also, because your prefix sum is mod 2, i.e. boolean, you could use a lookup table based on the packed-compare result mask. Instead of horizontal ops with shuffles, use the 4-bit movmskps result of a compare + a 5th bit for the initial state as an index to a lookup table of 32 vectors (assuming 32-bit element size for var[]).

Blog

Sunday, 18 February 2018