c# simd 指令
Here’s a quick look at algorithm vectorization capabilities in .NET Framework and .NET Core. This article is for those who know nothing about these techniques. I will also show that .NET doesn’t actually lag behind "real compiled" languages for native development.
快速浏览一下.NET Framework和.NET Core中的算法矢量化功能。 本文适用于对这些技术一无所知的人。 我还将说明,.NET实际上在本机开发方面并不落后于“真正的编译”语言。
I’m just starting to learn vectorization techniques. So, I will appreciate if community members find clear errors or suggest improvements to the described algorithms.
我才刚刚开始学习矢量化技术。 因此,如果社区成员发现明显的错误或对所描述的算法提出改进建议,我将不胜感激。
SIMD appeared in .NET Framework 4.6 in 2015. That's when Matrix3x2, Matrix4x4, Plane, Quaternion, Vector2, Vector3 and Vector4 types were added. They allowed vectorized computations. Next was the Vector
SIMD出现在2015年的.NET Framework 4.6中。那时,添加了Matrix3x2,Matrix4x4,Plane,Quaternion,Vector2,Vector3和Vector4类型。 他们允许向量化计算。 接下来是Vector
I decided to start with a classic task which usually comes first if there’s vectorization involved. It deals with finding the sum of array elements. Let’s write four implementations of this task to sum the elements of Array.
我决定从经典任务开始,如果涉及矢量化,通常会首先执行。 它涉及查找数组元素的总和。 让我们编写此任务的四个实现以汇总Array的元素。
The most obvious implementation:
最明显的实现:
public int Naive() {
int result = 0;
foreach (int i in Array) {
result += i;
}
return result;
}
LINQ-based implementation:
基于LINQ的实现:
public long LINQ() => Array.Aggregate(0, (current, i) => current + i);
The implementation based on vectors from System.Numerics:
基于System.Numerics中的向量的实现:
public int Vectors() {
int vectorSize = Vector.Count;
var accVector = Vector.Zero;
int i;
var array = Array;
for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
var v = new Vector(array, i);
accVector = Vector.Add(accVector, v);
}
int result = Vector.Dot(accVector, Vector.One);
for (; i < array.Length; i++) {
result += array[i];
}
return result;
}
The implementation based on code from System.Runtime.Intrinsics namespace:
基于System.Runtime.Intrinsics命名空间中的代码的实现:
public unsafe int Intrinsics() {
int vectorSize = 256 / 8 / 4;
var accVector = Vector256.Zero;
int i;
var array = Array;
fixed (int* ptr = array) {
for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
var v = Avx2.LoadVector256(ptr + i);
accVector = Avx2.Add(accVector, v);
}
}
int result = 0;
var temp = stackalloc int[vectorSize];
Avx2.Store(temp, accVector);
for (int j = 0; j < vectorSize; j++) {
result += temp[j];
}
for (; i < array.Length; i++) {
result += array[i];
}
return result;
}
I benchmarked those 4 methods on my computer and got the following results:
我在计算机上对这4种方法进行了基准测试,并得到以下结果:
Method | ItemsCount | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Naive | 10 | 3.531 ns | 0.0336 ns | 0.0314 ns | 1.00 |
LINQ | 10 | 76.925 ns | 0.4166 ns | 0.3897 ns | 21.79 |
Vectors | 10 | 2.750 ns | 0.0210 ns | 0.0196 ns | 0.78 |
Intrinsics | 10 | 6.513 ns | 0.0623 ns | 0.0582 ns | 1.84 |
Naive | 100 | 47.982 ns | 0.3975 ns | 0.3524 ns | 1.00 |
LINQ | 100 | 590.414 ns | 3.8808 ns | 3.4402 ns | 12.31 |
Vectors | 100 | 10.122 ns | 0.0747 ns | 0.0699 ns | 0.21 |
Intrinsics | 100 | 14.277 ns | 0.0566 ns | 0.0529 ns | 0.30 |
Naive | 1000 | 569.910 ns | 2.8297 ns | 2.6469 ns | 1.00 |
LINQ | 1000 | 5,658.570 ns | 31.7465 ns | 29.6957 ns | 9.93 |
Vectors | 1000 | 79.598 ns | 0.3498 ns | 0.3272 ns | 0.14 |
Intrinsics | 1000 | 66.970 ns | 0.3937 ns | 0.3682 ns | 0.12 |
Naive | 10000 | 5,637.571 ns | 37.5050 ns | 29.2814 ns | 1.00 |
LINQ | 10000 | 56,498.987 ns | 294.8776 ns | 275.8287 ns | 10.02 |
Vectors | 10000 | 772.900 ns | 2.6802 ns | 2.5070 ns | 0.14 |
Intrinsics | 10000 | 579.152 ns | 2.8371 ns | 2.6538 ns | 0.10 |
Naive | 100000 | 56,352.865 ns | 230.7916 ns | 215.8826 ns | 1.00 |
LINQ | 100000 | 562,610.571 ns | 3,775.7631 ns | 3,152.9332 ns | 9.98 |
Vectors | 100000 | 8,389.647 ns | 165.9590 ns | 227.1666 ns | 0.15 |
Intrinsics | 100000 | 7,261.334 ns | 89.6468 ns | 69.9903 ns | 0.13 |
方法 | ItemsCount | 意思 | 错误 | 标准差 | 比 |
---|---|---|---|---|---|
幼稚 | 10 | 3.531 ns | 0.0336纳秒 | 0.0314纳秒 | 1.00 |
LINQ | 10 | 76.925 ns | 0.4166纳秒 | 0.3897 ns | 21.79 |
向量 | 10 | 2.750纳秒 | 0.0210纳秒 | 0.0196 ns | 0.78 |
本征 | 10 | 6.513 ns | 0.0623 ns | 0.0582 ns | 1.84 |
幼稚 | 100 | 47.982 ns | 0.3975纳秒 | 0.3524 ns | 1.00 |
LINQ | 100 | 590.414 ns | 3.8808 ns | 3.4402纳秒 | 12.31 |
向量 | 100 | 10.122 ns | 0.0747 ns | 0.0699 ns | 0.21 |
本征 | 100 | 14.277 ns | 0.0566 ns | 0.0529 ns | 0.30 |
幼稚 | 1000 | 569.910 ns | 2.8297 ns | 2.6469 ns | 1.00 |
LINQ | 1000 | 5,658.570 ns | 31.7465 ns | 29.6957 ns | 9.93 |
向量 | 1000 | 79.598 ns | 0.3498 ns | 0.3272纳秒 | 0.14 |
本征 | 1000 | 66.970 ns | 0.3937纳秒 | 0.3682纳秒 | 0.12 |
幼稚 | 10000 | 5,637.571 ns | 37.5050 ns | 29.2814 ns | 1.00 |
LINQ | 10000 | 56,498.987 ns | 294.8776 ns | 275.8287 ns | 10.02 |
向量 | 10000 | 772.900 ns | 2.6802纳秒 | 2.5070 ns | 0.14 |
本征 | 10000 | 579.152 ns | 2.8371 ns | 2.6538 ns | 0.10 |
幼稚 | 100000 | 56,352.865 ns | 230.7916 ns | 215.8826 ns | 1.00 |
LINQ | 100000 | 562,610.571 ns | 3,775.7631 ns | 3,152.9332 ns | 9.98 |
向量 | 100000 | 8,389.647 ns | 165.9590 ns | 227.1666 ns | 0.15 |
本征 | 100000 | 7,261.334 ns | 89.6468 ns | 69.9903 ns | 0.13 |
It’s clear that solutions with Vectors and Intrinsics are much faster than the obvious and LINQ-based solutions. Now we need to figure out what goes on in these two methods.
显然,使用向量和内在函数的解决方案比明显的基于LINQ的解决方案要快得多。 现在,我们需要弄清楚这两种方法的情况。
Let’s consider Vectors method more closely:
让我们更仔细地考虑Vectors方法:
public int Vectors() {
int vectorSize = Vector.Count;
var accVector = Vector.Zero;
int i;
var array = Array;
for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
var v = new Vector(array, i);
accVector = Vector.Add(accVector, v);
}
int result = Vector.Dot(accVector, Vector.One);
for (; i < array.Length; i++) {
result += array[i];
}
return result;
}
accVector = Vector.Add(accVector, v); — two vectors are summed.
accVector = Vector.Add(accVector,v); —两个向量相加。
For example, there are 8 numbers in Array: {0, 1, 2, 3, 4, 5, 6, 7} and vectorSize == 4.
例如,数组中有8个数字:{0、1、2、3、4、5、6、7}和vectorSize == 4。
Then during the first cycle iteration accVector = {0, 0, 0, 0}, v = {0, 1, 2, 3} and after addition accVector will hold: {0, 0, 0, 0} + {0, 1, 2, 3} = {0, 1, 2, 3}.
然后在第一个循环迭代期间,accVector = {0,0,0,0},v = {0,1,2,3},加法后,accVector将保持:{0,0,0,0} + {0,1 ,2,3} = {0,1,2,3}。
During the second iteration v = {4, 5, 6, 7} and after addition accVector = {0, 1, 2, 3} + {4, 5, 6, 7} = {4, 6, 8, 10}.
在第二次迭代中,v = {4,5,6,7},加法后accVector = {0,1,2,3} + {4,5,6,7} = {4,6,8,10}。
Now we just need to get the sum of all vector elements. To do this we can use scalar multiplication by a vector filled with ones: int result = Vector.Dot(accVector, Vector
现在我们只需要获取所有向量元素的总和即可。 为此,我们可以使用标量乘以一个填充有一个的向量:int result = Vector.Dot(accVector,Vector
Then we get: {4, 6, 8, 10} * {1, 1, 1, 1} = 4 * 1 + 6 * 1 + 8 * 1 + 10 * 1 = 28.
然后得到:{4,6,8,10} * {1,1,1,1} = 4 * 1 + 6 * 1 + 8 * 1 + 10 * 1 = 28。
Let's look into Intrinsics code:
让我们看一下内部代码:
public unsafe int Intrinsics() {
int vectorSize = 256 / 8 / 4;
var accVector = Vector256.Zero;
int i;
var array = Array;
fixed (int* ptr = array) {
for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
var v = Avx2.LoadVector256(ptr + i);
accVector = Avx2.Add(accVector, v);
}
}
int result = 0;
var temp = stackalloc int[vectorSize];
Avx2.Store(temp, accVector);
for (int j = 0; j < vectorSize; j++) {
result += temp[j];
}
for (; i < array.Length; i++) {
result += array[i];
}
return result;
}
We can see that it’s like Vectors with one exception:
我们可以看到它类似于Vectors,但有一个例外:
vectorSize is specified by a constant. This is because this method explicitly uses Avx2 instructions that operate with 256-bit registers. A real application should include a check of whether a current processor supports Avx2. If not, another code should be called. It looks like this:
vectorSize由常量指定。 这是因为此方法明确使用了与256位寄存器一起操作的Avx2指令。 实际的应用程序应包括检查当前处理器是否支持Avx2。 如果不是,则应调用另一个代码。 看起来像这样:
if (Avx2.IsSupported) {
DoThingsForAvx2();
}
else if (Avx.IsSupported) {
DoThingsForAvx();
}
...
else if (Sse2.IsSupported) {
DoThingsForSse2();
}
...
The summing of vector elements uses the following method:
向量元素的求和使用以下方法:
We need to compare two arrays of bytes. It is exactly this task that made me study SIMD in .NET. Again, let’s write several methods for benchmarking and compare two arrays: ArrayA and ArrayB.
我们需要比较两个字节数组。 正是这一任务使我在.NET中学习SIMD。 再次,让我们编写几种基准测试方法并比较两个数组:ArrayA和ArrayB。
The most obvious solution:
最明显的解决方案:
public bool Naive() {
for (int i = 0; i < ArrayA.Length; i++) {
if (ArrayA[i] != ArrayB[i]) return false;
}
return true;
}
LINQ-based solution:
基于LINQ的解决方案:
public bool LINQ() => ArrayA.SequenceEqual(ArrayB);
The solution based on MemCmp function:
基于MemCmp功能的解决方案:
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
static extern int memcmp(byte[] b1, byte[] b2, long count);
public bool MemCmp() => memcmp(ArrayA, ArrayB, ArrayA.Length) == 0;
The solution based on vectors from System.Numerics:
基于System.Numerics中的向量的解决方案:
public bool Vectors() {
int vectorSize = Vector.Count;
int i = 0;
for (; i <= ArrayA.Length - vectorSize; i += vectorSize) {
var va = new Vector(ArrayA, i);
var vb = new Vector(ArrayB, i);
if (!Vector.EqualsAll(va, vb)) {
return false;
}
}
for (; i < ArrayA.Length; i++) {
if (ArrayA[i] != ArrayB[i])
return false;
}
return true;
}
Intrinsics-based solution:
基于内在的解决方案:
public unsafe bool Intrinsics() {
int vectorSize = 256 / 8;
int i = 0;
const int equalsMask = unchecked((int) (0b1111_1111_1111_1111_1111_1111_1111_1111));
fixed (byte* ptrA = ArrayA)
fixed (byte* ptrB = ArrayB) {
for (; i <= ArrayA.Length - vectorSize; i += vectorSize) {
var va = Avx2.LoadVector256(ptrA + i);
var vb = Avx2.LoadVector256(ptrB + i);
var areEqual = Avx2.CompareEqual(va, vb);
if (Avx2.MoveMask(areEqual) != equalsMask) {
return false;
}
}
for (; i < ArrayA.Length; i++) {
if (ArrayA[i] != ArrayB[i])
return false;
}
return true;
}
}
The results of running the benchmark on my computer:
在计算机上运行基准测试的结果:
Method | ItemsCount | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Naive | 10000 | 7,033.8 ns | 50.636 ns | 47.365 ns | 1.00 |
LINQ | 10000 | 64,841.4 ns | 289.157 ns | 270.478 ns | 9.22 |
Vectors | 10000 | 504.0 ns | 2.406 ns | 2.251 ns | 0.07 |
MemCmp | 10000 | 368.1 ns | 2.637 ns | 2.466 ns | 0.05 |
Intrinsics | 10000 | 283.6 ns | 1.135 ns | 1.061 ns | 0.04 |
Naive | 100000 | 85,214.4 ns | 903.868 ns | 845.478 ns | 1.00 |
LINQ | 100000 | 702,279.4 ns | 2,846.609 ns | 2,662.720 ns | 8.24 |
Vectors | 100000 | 5,179.2 ns | 45.337 ns | 42.409 ns | 0.06 |
MemCmp | 100000 | 4,510.5 ns | 24.292 ns | 22.723 ns | 0.05 |
Intrinsics | 100000 | 2,957.0 ns | 11.452 ns | 10.712 ns | 0.03 |
Naive | 1000000 | 844,006.1 ns | 3,552.478 ns | 3,322.990 ns | 1.00 |
LINQ | 1000000 | 6,483,079.3 ns | 42,641.040 ns | 39,886.455 ns | 7.68 |
Vectors | 1000000 | 54,180.1 ns | 357.258 ns | 334.180 ns | 0.06 |
MemCmp | 1000000 | 49,480.1 ns | 515.675 ns | 457.133 ns | 0.06 |
Intrinsics | 1000000 | 36,633.9 ns | 680.525 ns | 636.564 ns | 0.04 |
方法 | ItemsCount | 意思 | 错误 | 标准差 | 比 |
---|---|---|---|---|---|
幼稚 | 10000 | 7,033.8 ns | 50.636 ns | 47.365 ns | 1.00 |
LINQ | 10000 | 64,841.4 ns | 289.157 ns | 270.478 ns | 9.22 |
向量 | 10000 | 504.0 ns | 2.406纳秒 | 2.251纳秒 | 0.07 |
记忆卡 | 10000 | 368.1 ns | 2.637 ns | 2.466纳秒 | 0.05 |
本征 | 10000 | 283.6 ns | 1.135纳秒 | 1.061纳秒 | 0.04 |
幼稚 | 100000 | 85,214.4 ns | 903.868 ns | 845.478 ns | 1.00 |
LINQ | 100000 | 702,279.4 ns | 2,846.609 ns | 2,662.720 ns | 8.24 |
向量 | 100000 | 5,179.2 ns | 45.337 ns | 42.409 ns | 0.06 |
记忆卡 | 100000 | 4,510.5 ns | 24.292 ns | 22.723 ns | 0.05 |
本征 | 100000 | 2,957.0 ns | 11.452 ns | 10.712 ns | 0.03 |
幼稚 | 1000000 | 844,006.1 ns | 3,552.478 ns | 3,322.990 ns | 1.00 |
LINQ | 1000000 | 6,483,079.3 ns | 42,641.040 ns | 39,886.455 ns | 7.68 |
向量 | 1000000 | 54,180.1 ns | 357.258 ns | 334.180 ns | 0.06 |
记忆卡 | 1000000 | 49,480.1 ns | 515.675 ns | 457.133 ns | 0.06 |
本征 | 1000000 | 36,633.9 ns | 680.525 ns | 636.564 ns | 0.04 |
I guess the code of these methods is clear, except for two lines in Intrinsics:
我猜这些方法的代码很清楚,除了Intrinsics中的两行:
var areEqual = Avx2.CompareEqual(va, vb);
if (Avx2.MoveMask(areEqual) != equalsMask) {
return false;
}
In the first line two vectors are compared for equality and the result is saved in areEqual vector in which all bits in the element at a particular position are set to 1 if the corresponding elements in va and vb are equal. So, it turns out that if byte vectors va and vb are equal, all the elements in areEquals should equal to 255 (11111111b). As Avx2.CompareEqual is a wrapper over _mm256_cmpeq_epi8, we can go to Intel website and see the pseudocode of this operation: MoveMask method makes a 32-bit number from a vector. The top bits of each 32 onebyte elements in a vector are the values of bits in the MoveMask result. The pseudocode is available here.
在第一行中,比较两个向量的相等性,并将结果保存在areEqual向量中,如果va和vb中的对应元素相等,则将特定位置元素中的所有位都设置为1。 因此,事实证明,如果字节向量va和vb相等,则areEquals中的所有元素都应等于255(11111111b)。 由于Avx2.CompareEqual是_mm256_cmpeq_epi8的包装器,因此我们可以访问Intel网站并查看此操作的伪代码:MoveMask方法从向量中得出32位数字。 向量中每32个单字节元素的高位是MoveMask结果中的位的值。 伪代码在此处可用。
Thus, if some bytes in va and vb don’t match, the corresponding bytes in areEqual will be 0. Therefore, the top bits of these bytes will be 0 too. This means the corresponding bits in Avx2.MoveMask response will also be 0 and areEqual will not equal to equalsMask.
因此,如果va和vb中的某些字节不匹配,则areEqual中的相应字节将为0。因此,这些字节的高位也将为0。 这意味着Avx2.MoveMask响应中的相应位也将为0,而areEqual将不等于equalsMask。
Let’s look at one example assuming that vector length is 8 bytes (to write less):
让我们看一个示例,假设向量长度为8字节(以减少写入):
Sometimes you need to count the occurrences of a particular element, e.g. integers, in a collection. We can speed up this algorithm too. For comparison let’s write several methods to search Item element in Array.
有时您需要计算集合中特定元素(例如整数)的出现次数。 我们也可以加快该算法的速度。 为了进行比较,让我们编写几种方法来搜索Array中的Item元素。
The most obvious one:
最明显的一个:
public int Naive() {
int result = 0;
foreach (int i in Array) {
if (i == Item) {
result++;
}
}
return result;
}
Using LINQ:
使用LINQ:
public int LINQ() => Array.Count(i => i == Item);
Using vectors from System.Numerics.Vectors:
使用System.Numerics.Vectors中的向量:
public int Vectors() {
var mask = new Vector(Item);
int vectorSize = Vector.Count;
var accResult = new Vector();
int i;
var array = Array;
for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
var v = new Vector(array, i);
var areEqual = Vector.Equals(v, mask);
accResult = Vector.Subtract(accResult, areEqual);
}
int result = 0;
for (; i < array.Length; i++) {
if (array[i] == Item) {
result++;
}
}
result += Vector.Dot(accResult, Vector.One);
return result;
}
Using Intrinsics:
使用本征:
public unsafe int Intrinsics() {
int vectorSize = 256 / 8 / 4;
var temp = stackalloc int[vectorSize];
for (int j = 0; j < vectorSize; j++) {
temp[j] = Item;
}
var mask = Avx2.LoadVector256(temp);
var accVector = Vector256.Zero;
int i;
var array = Array;
fixed (int* ptr = array) {
for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
var v = Avx2.LoadVector256(ptr + i);
var areEqual = Avx2.CompareEqual(v, mask);
accVector = Avx2.Subtract(accVector, areEqual);
}
}
int result = 0;
Avx2.Store(temp, accVector);
for(int j = 0; j < vectorSize; j++) {
result += temp[j];
}
for(; i < array.Length; i++) {
if (array[i] == Item) {
result++;
}
}
return result;
}
The results of running the benchmark on my computer:
在计算机上运行基准测试的结果:
Method | ItemsCount | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Naive | 10 | 8.844 ns | 0.0772 ns | 0.0603 ns | 1.00 |
LINQ | 10 | 87.456 ns | 0.9496 ns | 0.8883 ns | 9.89 |
Vectors | 10 | 3.140 ns | 0.0406 ns | 0.0380 ns | 0.36 |
Intrinsics | 10 | 13.813 ns | 0.0825 ns | 0.0772 ns | 1.56 |
Naive | 100 | 107.310 ns | 0.6975 ns | 0.6183 ns | 1.00 |
LINQ | 100 | 626.285 ns | 5.7677 ns | 5.3951 ns | 5.83 |
Vectors | 100 | 11.844 ns | 0.2113 ns | 0.1873 ns | 0.11 |
Intrinsics | 100 | 19.616 ns | 0.1018 ns | 0.0903 ns | 0.18 |
Naive | 1000 | 1,032.466 ns | 6.3799 ns | 5.6556 ns | 1.00 |
LINQ | 1000 | 6,266.605 ns | 42.6585 ns | 39.9028 ns | 6.07 |
Vectors | 1000 | 83.417 ns | 0.5393 ns | 0.4780 ns | 0.08 |
Intrinsics | 1000 | 88.358 ns | 0.4921 ns | 0.4603 ns | 0.09 |
Naive | 10000 | 9,942.503 ns | 47.9732 ns | 40.0598 ns | 1.00 |
LINQ | 10000 | 62,305.598 ns | 643.8775 ns | 502.6972 ns | 6.27 |
Vectors | 10000 | 914.967 ns | 7.2959 ns | 6.8246 ns | 0.09 |
Intrinsics | 10000 | 931.698 ns | 6.3444 ns | 5.9346 ns | 0.09 |
Naive | 100000 | 94,834.804 ns | 793.8585 ns | 703.7349 ns | 1.00 |
LINQ | 100000 | 626,620.968 ns | 4,696.9221 ns | 4,393.5038 ns | 6.61 |
Vectors | 100000 | 9,000.827 ns | 179.5351 ns | 192.1005 ns | 0.09 |
Intrinsics | 100000 | 8,690.771 ns | 101.7078 ns | 95.1376 ns | 0.09 |
Naive | 1000000 | 959,302.249 ns | 4,268.2488 ns | 3,783.6914 ns | 1.00 |
LINQ | 1000000 | 6,218,681.888 ns | 31,321.9277 ns | 29,298.5506 ns | 6.48 |
Vectors | 1000000 | 99,778.488 ns | 1,975.6001 ns | 4,252.6877 ns | 0.10 |
Intrinsics | 1000000 | 96,449.350 ns | 1,171.8067 ns | 978.5116 ns | 0.10 |
方法 | ItemsCount | 意思 | 错误 | 标准差 | 比 |
---|---|---|---|---|---|
幼稚 | 10 | 8.844纳秒 | 0.0772 ns | 0.0603 ns | 1.00 |
LINQ | 10 | 87.456 ns | 0.9496纳秒 | 0.8883纳秒 | 9.89 |
向量 | 10 | 3.140纳秒 | 0.0406纳秒 | 0.0380纳秒 | 0.36 |
本征 | 10 | 13.813 ns | 0.0825 ns | 0.0772 ns | 1.56 |
幼稚 | 100 | 107.310 ns | 0.6975纳秒 | 0.6183 ns | 1.00 |
LINQ | 100 | 626.285 ns | 5.7677 ns | 5.3951 ns | 5.83 |
向量 | 100 | 11.844 ns | 0.2113 ns | 0.1873 ns | 0.11 |
本征 | 100 | 19.616纳秒 | 0.1018纳秒 | 0.0903 ns | 0.18 |
幼稚 | 1000 | 1,032.466 ns | 6.3799 ns | 5.6556 ns | 1.00 |
LINQ | 1000 | 6,266.605 ns | 42.6585 ns | 39.9028 ns | 6.07 |
向量 | 1000 | 83.417 ns | 0.5393纳秒 | 0.4780纳秒 | 0.08 |
本征 | 1000 | 88.358 ns | 0.4921纳秒 | 0.4603纳秒 | 0.09 |
幼稚 | 10000 | 9,942.503 ns | 47.9732 ns | 40.0598 ns | 1.00 |
LINQ | 10000 | 62,305.598 ns | 643.8775 ns | 502.6972 ns | 6.27 |
向量 | 10000 | 914.967 ns | 7.2959 ns | 6.8246 ns | 0.09 |
本征 | 10000 | 931.698 ns | 6.3444 ns | 5.9346 ns | 0.09 |
幼稚 | 100000 | 94,834.804 ns | 793.8585 ns | 703.7349 ns | 1.00 |
LINQ | 100000 | 626,620.968 ns | 4,696.9221 ns | 4,393.5038 ns | 6.61 |
向量 | 100000 | 9,000.827 ns | 179.5351 ns | 192.1005 ns | 0.09 |
本征 | 100000 | 8,690.771 ns | 101.7078 ns | 95.1376 ns | 0.09 |
幼稚 | 1000000 | 959,302.249 ns | 4,268.2488 ns | 3,783.6914 ns | 1.00 |
LINQ | 1000000 | 6,218,681.888 ns | 31,321.9277 ns | 29,298.5506 ns | 6.48 |
向量 | 1000000 | 99,778.488 ns | 1,975.6001 ns | 4,252.6877 ns | 0.10 |
本征 | 1000000 | 96,449.350 ns | 1,171.8067 ns | 978.5116 ns | 0.10 |
Vectors and Intrinsics methods completely coincide in logic but differ in the implementation of particular operations. The idea is the following:
向量和本征方法在逻辑上完全重合,但在特定操作的实现上有所不同。 这个想法如下:
The whole code from the article is on GitHub.
本文的全部代码在GitHub上 。
I described only a small part of .NET capabilities for computation vectorization. To see the full updated list of all intrinsics available in .NET Core under x86, turn to the source code. It’s convenient that the summary of each intrinsic in C# files contains its name in C world. This helps either to understand the purpose of this intrinsic and the transfer of existing C++/C algorithms to .NET. System.Numerics.Vector documentation is available on msdn.
我仅描述了用于计算矢量化的.NET功能的一小部分。 若要查看x86下.NET Core中可用的所有内部函数的完整更新列表,请转到源代码 。 C#文件中每个内在函数的摘要都包含其在C world中的名称,这很方便。 这有助于了解此内在功能的目的,以及帮助将现有C ++ / C算法转移到.NET。 msdn上提供了System.Numerics.Vector文档。
I think .NET has a great advantage over C++. As JIT compilation already occurs on a client machine, a compiler can optimize code for a particular client processor, giving maximum performance. At the same time, a programmer can stay within one language and the same technologies to write fast code.
我认为.NET比C ++具有很大的优势。 由于JIT编译已在客户端计算机上进行,因此编译器可以为特定客户端处理器优化代码,从而提供最佳性能。 同时,程序员可以使用一种语言和相同的技术来编写快速代码。
翻译自: https://habr.com/en/post/467689/
c# simd 指令