How to vectorize data_i16[0 to 15]?

Question

I'm on the Intel Intrinsics Guide site and I can't figure out what combination of instructions I want. What I'd like to do is

result = high_table[i8>>4] & low_table[i8&15]

where both tables are 16 bits (or more) wide. A shuffle seems like what I want (_mm_shuffle_epi8), but getting an 8-bit value doesn't work for me. There doesn't seem to be a 16-bit version, and the non-byte versions seem to need the second parameter as an immediate value.

How am I supposed to implement this? Do I call _mm_shuffle_epi8 twice for each table, cast the results to 16 bits, and shift the value by 8? If so, which cast and shift instructions should I look at?

Recommended answer

To split your incoming indices into two vectors of nibbles, you want the usual bit-shift and AND. SSE doesn't have 8-bit shifts, so you have to emulate one with a wider shift and an AND to mask away the bits that shifted into the top of each byte. (The masking is needed because, unfortunately for this use case, _mm_shuffle_epi8 does not ignore the high bits: if the top selector bit is set, it zeros that output element.)
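
As a minimal sketch of that nibble split on its own (the helper names split_nibbles and check_split_nibbles are made up for illustration, and it assumes an x86 target where SSE2 is available):

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2: _mm_srli_epi16, _mm_and_si128
#include <stdint.h>

// Hypothetical helper: split 16 bytes into their low and high nibbles.
static void split_nibbles(__m128i bytes, __m128i *lo, __m128i *hi)
{
    const __m128i mask = _mm_set1_epi8(0x0f);
    *lo = _mm_and_si128(bytes, mask);
    // The 16-bit shift drags bits from the neighbouring byte into the top
    // of each byte; the AND clears them again.
    *hi = _mm_and_si128(_mm_srli_epi16(bytes, 4), mask);
}

// Hypothetical self-check against the scalar expressions b & 15 and b >> 4.
static int check_split_nibbles(void)
{
    uint8_t in[16], lo[16], hi[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)(i * 17 + 0x83);  // includes bytes with the top bit set

    __m128i vlo, vhi;
    split_nibbles(_mm_loadu_si128((const __m128i *)in), &vlo, &vhi);
    _mm_storeu_si128((__m128i *)lo, vlo);
    _mm_storeu_si128((__m128i *)hi, vhi);

    for (int i = 0; i < 16; i++) {
        assert(lo[i] == (in[i] & 15));
        assert(hi[i] == (in[i] >> 4));
    }
    return 1;
}
```

Both outputs have every byte in 0..15, so the top selector bit is never set when they are used as pshufb controls.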

You definitely do not want to widen your incoming i8 vector to 16-bit elements; that would not be usable with _mm_shuffle_epi8.

AVX2 has vpermd: select dwords from a vector of 8x 32-bit elements. (It only has 3-bit indices, so it's not good for your use case unless your nibbles are only 0..7.) AVX512BW has wider shuffles, including vpermi2w to index into a table formed by the concatenation of two vectors, or just vpermw to index words.

But for 128-bit vectors with just SSSE3, yeah, pshufb (_mm_shuffle_epi8) is the way to go. You'll need two separate vectors for high_table: one for the upper byte and one for the lower byte of each word entry. And another two vectors for the halves of low_table.
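
Here is one way that split into byte LUTs could be written, as a sketch only. The table contents, the LUT_BYTE / LUT_VEC macros and check_byte_luts are all made up for illustration, and the target("ssse3") attribute is a GCC/Clang extension used so the example builds without -mssse3:

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// Hypothetical table contents, only here so the check has data.
static const uint16_t table[16] = {
    0x0000, 0xffff, 0xabcd, 0x4321, 0x0f0f, 0xf0f0, 0x3333, 0xcccc,
    0x8080, 0x0101, 0xfedc, 0x1357, 0x2468, 0x9999, 0x6666, 0xbeef };

// Hypothetical macros: pick one byte (shift 0 or 8) out of every table entry.
#define LUT_BYTE(t, i, shift) ((char)(((t)[i] >> (shift)) & 0xff))
#define LUT_VEC(t, shift) _mm_setr_epi8(                          \
    LUT_BYTE(t, 0, shift),  LUT_BYTE(t, 1, shift),                \
    LUT_BYTE(t, 2, shift),  LUT_BYTE(t, 3, shift),                \
    LUT_BYTE(t, 4, shift),  LUT_BYTE(t, 5, shift),                \
    LUT_BYTE(t, 6, shift),  LUT_BYTE(t, 7, shift),                \
    LUT_BYTE(t, 8, shift),  LUT_BYTE(t, 9, shift),                \
    LUT_BYTE(t, 10, shift), LUT_BYTE(t, 11, shift),               \
    LUT_BYTE(t, 12, shift), LUT_BYTE(t, 13, shift),               \
    LUT_BYTE(t, 14, shift), LUT_BYTE(t, 15, shift))

// Hypothetical self-check: two byte lookups together recover table[idx].
static __attribute__((target("ssse3"))) int check_byte_luts(void)
{
    const __m128i lut_lo = LUT_VEC(table, 0);  // low byte of each entry
    const __m128i lut_hi = LUT_VEC(table, 8);  // high byte of each entry

    const uint8_t idx[16] = { 15, 0, 7, 8, 3, 3, 12, 1,
                              9, 14, 2, 5, 10, 4, 6, 11 };
    uint8_t lo[16], hi[16];
    __m128i vidx = _mm_loadu_si128((const __m128i *)idx);
    _mm_storeu_si128((__m128i *)lo, _mm_shuffle_epi8(lut_lo, vidx));
    _mm_storeu_si128((__m128i *)hi, _mm_shuffle_epi8(lut_hi, vidx));

    for (int i = 0; i < 16; i++)
        assert((lo[i] | (hi[i] << 8)) == table[idx[i]]);
    return 1;
}
```

The macro just avoids typing sixteen initializer expressions per vector by hand; the recombination here is done in scalar code purely to check the two lookups.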

Use _mm_unpacklo_epi8 and _mm_unpackhi_epi8 to interleave the low 8 bytes of two vectors, or the high 8 bytes of two vectors. That will give you the 16-bit LUT results you want, with the upper half of each word coming from the high-half vector.

i.e. you're building a 16-bit LUT out of two 8-bit LUTs with this interleave. And you're repeating the process twice for two different LUTs.
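
The interleave step by itself can be sketched like this (interleave_to_words and check_interleave are illustrative names, and the check assumes a little-endian x86 target with SSE2):

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2: _mm_unpacklo_epi8, _mm_unpackhi_epi8
#include <stdint.h>

// Hypothetical helper: combine 16 low bytes and 16 high bytes into
// 16x 16-bit elements. *first gets elements 0..7, *second gets 8..15.
static void interleave_to_words(__m128i lo_bytes, __m128i hi_bytes,
                                __m128i *first, __m128i *second)
{
    // The first operand supplies the low byte of each word, the second the
    // high byte, because x86 is little-endian.
    *first  = _mm_unpacklo_epi8(lo_bytes, hi_bytes);
    *second = _mm_unpackhi_epi8(lo_bytes, hi_bytes);
}

// Hypothetical self-check against lo | (hi << 8) computed in scalar code.
static int check_interleave(void)
{
    uint8_t lo[16], hi[16];
    uint16_t words[16];
    for (int i = 0; i < 16; i++) {
        lo[i] = (uint8_t)(0x10 + i);
        hi[i] = (uint8_t)(0xa0 + i);
    }

    __m128i first, second;
    interleave_to_words(_mm_loadu_si128((const __m128i *)lo),
                        _mm_loadu_si128((const __m128i *)hi),
                        &first, &second);
    _mm_storeu_si128((__m128i *)words, first);
    _mm_storeu_si128((__m128i *)(words + 8), second);

    for (int i = 0; i < 16; i++)
        assert(words[i] == (uint16_t)(lo[i] | (hi[i] << 8)));
    return 1;
}
```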

The code would look something like this:

// UNTESTED, haven't even tried compiling this.
#include <stdint.h>     // uint16_t
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// produces 2 output vectors, you might want to just put this in a loop instead of making a helper function for 1 vector.
// so I'll omit actually returning them.
void foo(__m128i indices)
{
   // these optimize away, only used at compile time for the vector initializers
   static const uint16_t high_table[16] = {...};
   static const uint16_t low_table[16] =  {...};

   // each LUT needs a separate vector of high-byte and low-byte parts
   // don't use SIMD intrinsics to load from the uint16_t tables and deinterleave at runtime, just get the same 16x 2 x 2 bytes of data into vector constants at compile time.
   __m128i high_LUT_lobyte = _mm_setr_epi8(high_table[0]&0xff, high_table[1]&0xff, high_table[2]&0xff, ... );
   __m128i high_LUT_hibyte = _mm_setr_epi8(high_table[0]>>8, high_table[1]>>8, high_table[2]>>8, ... );

   __m128i low_LUT_lobyte = _mm_setr_epi8(low_table[0]&0xff, low_table[1]&0xff, low_table[2]&0xff, ... );
   __m128i low_LUT_hibyte = _mm_setr_epi8(low_table[0]>>8, low_table[1]>>8, low_table[2]>>8, ... );


    // split the input indexes: emulate byte shift with wider shift + AND
    __m128i lo_idx = _mm_and_si128(indices, _mm_set1_epi8(0x0f));
    __m128i hi_idx = _mm_and_si128(_mm_srli_epi32(indices, 4), _mm_set1_epi8(0x0f));

    __m128i lolo = _mm_shuffle_epi8(low_LUT_lobyte, lo_idx);
    __m128i lohi = _mm_shuffle_epi8(low_LUT_hibyte, lo_idx);

    __m128i hilo = _mm_shuffle_epi8(high_LUT_lobyte, hi_idx);
    __m128i hihi = _mm_shuffle_epi8(high_LUT_hibyte, hi_idx);

    // interleave results of LUT lookups into vectors of 16-bit elements
    __m128i low_result_first  = _mm_unpacklo_epi8(lolo, lohi);
    __m128i low_result_second = _mm_unpackhi_epi8(lolo, lohi);
    __m128i high_result_first  = _mm_unpacklo_epi8(hilo, hihi);
    __m128i high_result_second = _mm_unpackhi_epi8(hilo, hihi);

    // first 8x 16-bit high_table[i8>>4] & low_table[i8&15] results
    __m128i and_first = _mm_and_si128(low_result_first, high_result_first);
    // second 8x 16-bit high_table[i8>>4] & low_table[i8&15] results
    __m128i and_second = _mm_and_si128(low_result_second, high_result_second);

    // TODO: do something with the results.
}

You could AND before interleaving, high halves against high halves and low against low. That might be somewhat better for instruction-level parallelism, letting execution of the ANDs overlap with the shuffles. (Intel Haswell through Skylake has only 1/clock throughput for shuffles.)
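
A sketch of that AND-first variant, checked against the scalar expression for all 256 byte values. The table contents, the byte_luts struct and the helper names are made up for illustration; the byte LUTs are built with a plain scalar loop here only for brevity, and target("ssse3") is a GCC/Clang extension so it builds without -mssse3:

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// Hypothetical table contents, only here so the check has data.
static const uint16_t high_table[16] = {
    0xffff, 0x0f0f, 0x1234, 0x8001, 0x00ff, 0xff00, 0x5555, 0xaaaa,
    0x0001, 0x8000, 0x7fff, 0xfffe, 0x3c3c, 0xc3c3, 0x0ff0, 0xf00f };
static const uint16_t low_table[16] = {
    0x0000, 0xffff, 0xabcd, 0x4321, 0x0f0f, 0xf0f0, 0x3333, 0xcccc,
    0x8080, 0x0101, 0xfedc, 0x1357, 0x2468, 0x9999, 0x6666, 0xbeef };

typedef struct { __m128i lo, hi; } byte_luts;  // low bytes, high bytes

// Hypothetical helper: split a 16-entry word table into two byte LUTs.
static byte_luts split_table(const uint16_t t[16])
{
    uint8_t lo[16], hi[16];
    for (int i = 0; i < 16; i++) {
        lo[i] = (uint8_t)(t[i] & 0xff);
        hi[i] = (uint8_t)(t[i] >> 8);
    }
    byte_luts r = { _mm_loadu_si128((const __m128i *)lo),
                    _mm_loadu_si128((const __m128i *)hi) };
    return r;
}

// high_table[i8>>4] & low_table[i8&15] for 16 bytes, ANDing before unpacking.
static __attribute__((target("ssse3")))
void lookup_and(__m128i indices, byte_luts high, byte_luts low,
                __m128i *first, __m128i *second)
{
    const __m128i mask = _mm_set1_epi8(0x0f);
    __m128i lo_idx = _mm_and_si128(indices, mask);
    __m128i hi_idx = _mm_and_si128(_mm_srli_epi16(indices, 4), mask);

    // AND byte-wide results: low bytes with low bytes, high with high.
    __m128i and_lo = _mm_and_si128(_mm_shuffle_epi8(high.lo, hi_idx),
                                   _mm_shuffle_epi8(low.lo, lo_idx));
    __m128i and_hi = _mm_and_si128(_mm_shuffle_epi8(high.hi, hi_idx),
                                   _mm_shuffle_epi8(low.hi, lo_idx));

    *first  = _mm_unpacklo_epi8(and_lo, and_hi);  // results for bytes 0..7
    *second = _mm_unpackhi_epi8(and_lo, and_hi);  // results for bytes 8..15
}

// Hypothetical self-check over every possible input byte.
static int check_lookup_and(void)
{
    byte_luts high = split_table(high_table), low = split_table(low_table);
    for (int base = 0; base < 256; base += 16) {
        uint8_t in[16];
        uint16_t out[16];
        for (int i = 0; i < 16; i++)
            in[i] = (uint8_t)(base + i);

        __m128i first, second;
        lookup_and(_mm_loadu_si128((const __m128i *)in), high, low,
                   &first, &second);
        _mm_storeu_si128((__m128i *)out, first);
        _mm_storeu_si128((__m128i *)(out + 8), second);

        for (int i = 0; i < 16; i++)
            assert(out[i] ==
                   (uint16_t)(high_table[in[i] >> 4] & low_table[in[i] & 15]));
    }
    return 1;
}
```

This version still needs four pshufb and two unpacks per 16 input bytes, but the two ANDs work on the shuffle outputs directly, so they don't have to wait for the unpacks. Whether that is measurably faster would need benchmarking on the target CPU.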

Choosing variable names is a struggle with stuff like this. Some people just give up and use non-meaningful names for some intermediate steps.


10-17 01:10