Testing of 8 subsequent bytes isn't translated into a single compare instruction

Problem description

Motivated by this question, I compared three different functions for checking whether the 8 bytes pointed to by the argument are all zero (note that in the original question, characters are compared with '0', not 0):

    #include <cstring>

    bool f1(const char *ptr)
    {
      for (int i = 0; i < 8; i++)
        if (ptr[i])
          return false;
      return true;
    }

    bool f2(const char *ptr)
    {
      bool res = true;
      for (int i = 0; i < 8; i++)
        res &= (ptr[i] == 0);
      return res;
    }

    bool f3(const char *ptr)
    {
      static const char tmp[8]{};
      return !std::memcmp(ptr, tmp, 8);
    }

Though I would expect the same assembly with optimizations enabled, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a rolled or an unrolled loop. Moreover, this holds for GCC, Clang, and Intel compilers alike with -O3.

Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seems to be a pretty straightforward optimization to me.

Live demo: https://godbolt.org/z/j48366

Recommended answer

First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (See: Is it safe to read past the end of a buffer within the same page on x86 and x64?)
The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.

You can fix that by declaring the function arg as const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)

GCC and clang can't auto-vectorize loops whose trip count isn't known before the first iteration. That rules out all search loops like f1, even if you made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)

Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitation, because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2; thanks @Tharwen for spotting that.

Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need to trade off against when considering writing more code to look for such patterns.
(Maintenance cost of code, and compile-time cost.)

The value depends on how much real-world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.

In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.

Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.

Or in GNU C++, use a typedef to express unaligned may-alias loads:

    #include <cstdint>

    bool f4(const char *ptr) {
      typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
      auto val = *(const aliasing_unaligned_u64*)ptr;  // one unaligned, may-alias 8-byte load
      return val != 0;  // true if any byte is non-zero, i.e. the inverse of f1..f3
    }

On Godbolt with GCC10 -O3:

    f4(char const*):
            cmp     QWORD PTR [rdi], 0
            setne   al
            ret

Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.

And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/

Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?

Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case, because implicit-length strings could end at any point, so you need to do aligned loads to make sure that a chunk containing at least 1 valid string byte doesn't cross a page boundary).
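The answer mentions the memcpy alternative without spelling it out. A minimal sketch of what that might look like follows, assuming a hypothetical function name f5 (not from the original post) and keeping the "true when all 8 bytes are zero" convention of f1 to f3. Whether a particular compiler folds the memcpy into a single 8-byte load and compare depends on the compiler and its flags, and is not verified here.

```cpp
#include <cstdint>
#include <cstring>

// f5 is a hypothetical name used for illustration only.
// Returns true when all 8 bytes starting at ptr are zero (same convention as f1..f3).
bool f5(const char *ptr)
{
    std::uint64_t val;
    // Copying the bytes with memcpy is well-defined for any alignment and any
    // dynamic type of the pointed-to object, so no aliasing attributes are needed.
    // The caller must still guarantee that 8 bytes are readable at ptr.
    std::memcpy(&val, ptr, sizeof val);
    return val == 0;
}
```

Unlike f4, this version uses only standard C++ and no GNU attributes, at the cost of relying on the compiler to treat memcpy as a builtin.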