问题描述
将 reinterpret_cast
一个 float*
转换为一个 __m256*
并通过一个不同的访问 float
对象是否合法?指针类型?
constexpr size_t _m256_float_step_sz = sizeof(__m256)/sizeof(float);alignas(__m256) float stack_store[100 * _m256_float_step_sz]{};__m256&hwvec1 = *reinterpret_cast<__m256*>(&stack_store[0 * _m256_float_step_sz]);使用 arr_t = float[_m256_float_step_sz];arr_t&arr1 = *reinterpret_cast(&hwvec1);
hwvec1
和 arr1
是否依赖于 未定义行为
s?
它们是否违反了严格的别名规则?[basic.lval]/11
或者只有一种定义的内在方式:
__m256 hwvec2 = _mm256_load_ps(&stack_store[0 * _m256_float_step_sz]);_mm256_store_ps(&stack_store[1 * _m256_float_step_sz], hwvec2);
godbolt
ISO C++没有定义__m256
,所以我们需要看看做了什么定义了它们的行为关于支持它们的实现.
Intel 的内在函数定义了向量指针,比如 __m256*
允许为其他任何东西设置别名,就像 ISO C++ 定义 char*
为允许别名一样.
所以是的,取消引用 __m256*
而不是使用 _mm256_load_ps()
对齐加载内部函数是安全的.
但特别是对于 float/double,使用内在函数通常更容易,因为它们也负责从 float*
进行转换.对于整数,AVX512 加载/存储内在函数被定义为采用 void*
,但在此之前,您需要一个额外的 (__m256i*)
,这只是很混乱.
在 gcc 中,这是通过使用 may_alias
属性定义 __m256
来实现的:来自 gcc7.3 的 avxintrin.h
(头文件之一)
包括):
/* Intel API 足够灵活,我们必须允许与其他向量类型及其标量分量.*/typedef float __m256 __attribute__ ((__vector_size__ (32),__may_alias__));typedef long long __m256i __attribute__ ((__vector_size__ (32),__may_alias__));typedef double __m256d __attribute__ ((__vector_size__ (32),__may_alias__));/* 相同类型的未对齐版本.*/typedef float __m256_u __attribute__ ((__vector_size__ (32),__may_alias__,__对齐__ (1)));typedef long long __m256i_u __attribute__ ((__vector_size__ (32),__may_alias__,__对齐__ (1)));typedef double __m256d_u __attribute__ ((__vector_size__ (32),__may_alias__,__对齐__ (1)));
(如果您想知道,这就是为什么取消引用 __m256*
就像 _mm256_store_ps
,而不是 storeu
.)
没有 may_alias
的 GNU C 原生向量被允许为其标量类型设置别名,例如即使没有 may_alias
,您也可以安全地在 float*
和假设的 v8sf
类型之间进行转换.但是 may_alias
可以安全地从 int[]
、char[]
或其他数组中加载.
我谈论 GCC 如何实现 Intel 的内在函数只是因为那是我所熟悉的.我从 gcc 开发人员那里听说他们选择这种实现是因为它是与英特尔兼容的必要条件.
<小时>需要定义英特尔内在函数的其他行为
将英特尔的 API 用于 _mm_storeu_si128( (__m128i*)&arr[i], vec);
要求您创建可能未对齐的指针,如果您尊重它们,这些指针会出错.并且 _mm_storeu_ps
到非 4 字节对齐的位置需要创建一个未对齐的 float*
.
只要创建未对齐的指针或对象外的指针,在 ISO C++ 中就是 UB,即使您不取消引用它们.我想这允许在异国情调上实现在创建指针时对指针进行某种检查的硬件(可能而不是在取消引用时),或者可能无法存储指针的低位.(我不知道是否存在任何特定硬件,因为此 UB 可以提供更高效的代码.)
但是支持 Intel 内在函数的实现必须定义行为,至少对于 __m*
类型和 float*
/double*
.这对于面向任何普通现代 CPU 的编译器来说是微不足道的,包括具有扁平内存模型(无分段)的 x86;asm 中的指针只是与数据保存在相同寄存器中的整数.(m68k 具有地址和数据寄存器,但它永远不会因为在 A 寄存器中保留无效地址的位模式而出错,只要您不取消引用它们.)
另一种方式:向量的元素访问.
注意may_alias
,就像char*
别名规则一样,只有一种方式:它不是保证使用 int32_t*
读取 __m256
是安全的.使用 float*
读取 __m256
甚至可能不安全.就像这样做是不安全的char buf[1024];
int *p = (int*)buf;
.
通过 char*
读/写可以给任何东西取别名,但是当你有一个 char
object 时,严格别名确实使它成为 UB通过其他类型阅读它.(我不确定 x86 上的主要实现是否确实定义了该行为,但您不需要依赖它,因为它们将 4 个字节的 memcpy
优化为 int32_t
.您可以并且应该使用 memcpy
来表示来自 char[]
缓冲区的未对齐加载,因为允许使用更宽类型的自动矢量化假设 2 字节int16_t*
对齐,如果不是,则制作失败的代码:为什么在 AMD64 上对 mmap'ed 内存的未对齐访问有时会出现段错误?)
要插入/提取向量元素,请使用 shuffle 内在函数,SSE2 _mm_insert_epi16
/_mm_extract_epi16
或 SSE4.1 insert/_mm_extract_epi8/32/64
.对于浮点数,没有您应该与标量 float
一起使用的插入/提取内在函数.
或者存储到一个数组并读取该数组.(打印 __m128i 变量).这实际上优化了向量提取指令.
GNU C 向量语法为向量提供了 []
运算符,例如 __m256 v = ...;
v[3] = 1.25;
.MSVC 将向量类型定义为具有 .m128_f32[]
成员的联合,用于按元素访问.
有像 Agner Fog 的(GPL 许可)矢量类库这样的包装库,它们提供了可移植的operator[]
重载它们的向量类型,以及 operator +
/-
/*
/<<
等.这非常好,特别是对于整数类型,其中不同元素宽度的不同类型使 v1 + v2
以正确的大小工作.(GNU C 本机向量语法对浮点/双向量执行此操作,并将 __m128i
定义为带符号 int64_t 的向量,但 MSVC 不提供基本 __m128
类型的运算符.)
您还可以在向量和某种类型的数组之间使用联合类型双关,这在 ISO C99 和 GNU C++ 中是安全的,但在 ISO C++ 中不安全.我认为它在 MSVC 中也是正式安全的,因为我认为他们将 __m128
定义为正常联合的方式.
不过,不能保证您会从任何这些元素访问方法中获得高效的代码.不要使用内部内部循环,如果性能很重要,请查看生成的 asm.
Is it legal to reinterpret_cast
a float*
to a __m256*
and access float
objects through a different pointer type?
constexpr size_t _m256_float_step_sz = sizeof(__m256) / sizeof(float);
alignas(__m256) float stack_store[100 * _m256_float_step_sz ]{};
__m256& hwvec1 = *reinterpret_cast<__m256*>(&stack_store[0 * _m256_float_step_sz]);
using arr_t = float[_m256_float_step_sz];
arr_t& arr1 = *reinterpret_cast<float(*)[_m256_float_step_sz]>(&hwvec1);
Do hwvec1
and arr1
depend on undefined behavior
s?
Do they violate strict aliasing rules? [basic.lval]/11
Or there is only one defined way of intrinsic:
__m256 hwvec2 = _mm256_load_ps(&stack_store[0 * _m256_float_step_sz]);
_mm256_store_ps(&stack_store[1 * _m256_float_step_sz], hwvec2);
godbolt
ISO C++ doesn't define __m256
, so we need to look at what does define their behaviour on the implementations that support them.
Intel's intrinsics define vector-pointers like __m256*
as being allowed to alias anything else, the same way ISO C++ defines char*
as being allowed to alias.
So yes, it's safe to dereference a __m256*
instead of using a _mm256_load_ps()
aligned-load intrinsic.
But especially for float/double, it's often easier to use the intrinsics because they take care of casting from float*
, too. For integers, the AVX512 load/store intrinsics are defined as taking void*
, but before that you need an extra (__m256i*)
which is just a lot of clutter.
In gcc, this is implemented by defining __m256
with a may_alias
attribute: from gcc7.3's avxintrin.h
(one of the headers that <immintrin.h>
includes):
/* The Intel API is flexible enough that we must allow aliasing with other vector types, and their scalar components. */ typedef float __m256 __attribute__ ((__vector_size__ (32), __may_alias__)); typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__)); typedef double __m256d __attribute__ ((__vector_size__ (32), __may_alias__)); /* Unaligned version of the same types. */ typedef float __m256_u __attribute__ ((__vector_size__ (32), __may_alias__, __aligned__ (1))); typedef long long __m256i_u __attribute__ ((__vector_size__ (32), __may_alias__, __aligned__ (1))); typedef double __m256d_u __attribute__ ((__vector_size__ (32), __may_alias__, __aligned__ (1)));
(In case you were wondering, this is why dereferencing a __m256*
is like _mm256_store_ps
, not storeu
.)
GNU C native vectors without may_alias
are allowed to alias their scalar type, e.g. even without the may_alias
, you could safely cast between float*
and a hypothetical v8sf
type. But may_alias
makes it safe to load from an array of int[]
, char[]
, or whatever.
I'm talking about how GCC implements Intel's intrinsics only because that's what I'm familiar with. I've heard from gcc developers that they chose that implementation because it was required for compatibility with Intel.
Other behaviour Intel's intrinsics require to be defined
Using Intel's API for _mm_storeu_si128( (__m128i*)&arr[i], vec);
requires you to create potentially-unaligned pointers which would fault if you deferenced them. And _mm_storeu_ps
to a location that isn't 4-byte aligned requires creating an under-aligned float*
.
Just creating unaligned pointers, or pointers outside an object, is UB in ISO C++, even if you don't dereference them. I guess this allows implementations on exotic hardware which do some kinds of checks on pointers when creating them (possibly instead of when dereferencing), or maybe which can't store the low bits of pointers. (I have no idea if any specific hardware exists where more efficient code is possible because of this UB.)
But implementations which support Intel's intrinsics must define the behaviour, at least for the __m*
types and float*
/double*
. This is trivial for compilers targeting any normal modern CPU, including x86 with a flat memory model (no segmentation); pointers in asm are just integers kept in the same registers as data. (m68k has address vs. data registers, but it never faults from keeping bit-patterns that aren't valid addresses in A registers, as long as you don't deref them.)
Going the other way: element access of a vector.
Note that may_alias
, like the char*
aliasing rule, only goes one way: it is not guaranteed to be safe to use int32_t*
to read a __m256
. It might not even be safe to use float*
to read a __m256
. Just like it's not safe to do char buf[1024];
int *p = (int*)buf;
.
Reading/writing through a char*
can alias anything, but when you have a char
object, strict-aliasing does make it UB to read it through other types. (I'm not sure if the major implementations on x86 do define that behaviour, but you don't need to rely on it because they optimize away memcpy
of 4 bytes into an int32_t
. You can and should use memcpy
to express an unaligned load from a char[]
buffer, because auto-vectorization with a wider type is allowed to assume 2-byte alignment for int16_t*
, and make code that fails if it's not: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?)
To insert/extract vector elements, use shuffle intrinsics, SSE2 _mm_insert_epi16
/ _mm_extract_epi16
or SSE4.1 insert / _mm_extract_epi8/32/64
. For float, there are no insert/extract intrinsics that you should use with scalar float
.
Or store to an array and read the array. (print a __m128i variable). This does actually optimize away to vector extract instructions.
GNU C vector syntax provides the []
operator for vectors, like __m256 v = ...;
v[3] = 1.25;
. MSVC defines vector types as a union with a .m128_f32[]
member for per-element access.
There are wrapper libraries like Agner Fog's (GPL licensed) Vector Class Library which provide portable operator[]
overloads for their vector types, and operator +
/ -
/ *
/ <<
and so on. It's quite nice, especially for integer types where having different types for different element widths make v1 + v2
work with the right size. (GNU C native vector syntax does that for float/double vectors, and defines __m128i
as a vector of signed int64_t, but MSVC doesn't provide operators on the base __m128
types.)
You can also use union type-punning between a vector and an array of some type, which is safe in ISO C99, and in GNU C++, but not in ISO C++. I think it's officially safe in MSVC, too, because I think the way they define __m128
as a normal union.
There's no guarantee you'll get efficient code from any of these element-access methods, though. Do not use inside inner loops, and have a look at the resulting asm if performance matters.
这篇关于硬件 SIMD 向量指针和相应类型之间的“reinterpret_cast"是否是未定义的行为?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!