如何在现代 x86-64 Intel CPU 上实现每个周期 4 个浮点运算(双精度)的理论峰值性能?
据我了解,在大多数现代英特尔 CPU 上完成SSE add需要三个周期,完成一个周期需要五个周期(例如参见Agner Fog 的“指令表”)。由于流水线,如果算法具有至少三个独立的求和,则每个周期可以获得一个吞吐量。由于打包版本和标量版本都是如此,并且 SSE 寄存器可以包含两个’,因此吞吐量可以高达每个周期两个触发器。muladd``addpd``addsd``double
add
mul
add``addpd``addsd``double
此外,似乎(尽管我没有看到任何适当的文档)add‘ 和mul‘ 可以并行执行,理论上每个周期的最大吞吐量为四个触发器。
但是,我无法使用简单的 C/C 程序复制该性能。我最好的尝试导致大约 2.7 次失败/周期。如果有人可以贡献一个简单的 C/C 或汇编程序来展示最佳性能,那将不胜感激。
我的尝试:
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <sys/time.h> double stoptime(void) { struct timeval t; gettimeofday(&t,NULL); return (double) t.tv_sec + t.tv_usec/1000000.0; } double addmul(double add, double mul, int ops){ // Need to initialise differently otherwise compiler might optimise away double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0; double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4; int loops=ops/10; // We have 10 floating point operations inside the loop double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5) + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5); for (int i=0; i<loops; i++) { mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul; sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add; } return sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected; } int main(int argc, char** argv) { if (argc != 2) { printf("usage: %s <num>\n", argv[0]); printf("number of operations: <num> millions\n"); exit(EXIT_FAILURE); } int n = atoi(argv[1]) * 1000000; if (n<=0) n=1000; double x = M_PI; double y = 1.0 + 1e-8; double t = stoptime(); x = addmul(x, y, n); t = stoptime() - t; printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n", t, (double)n/t/1e9, x); return EXIT_SUCCESS; }
编译:
g++ -O2 -march=native addmul.cpp ; ./a.out 1000
在 Intel Core i5-750, 2.66 GHz 上产生以下输出:
addmul: 0.270 s, 3.707 Gflops, res=1.326463
也就是说,每个周期只有大约 1.4 次触发器。用主循环查看汇编代码 g++ -S -O2 -march=native -masm=intel addmul.cpp对我来说似乎是最佳选择。
g++ -S -O2 -march=native -masm=intel addmul.cpp
.L4: inc eax mulsd xmm8, xmm3 mulsd xmm7, xmm3 mulsd xmm6, xmm3 mulsd xmm5, xmm3 mulsd xmm1, xmm3 addsd xmm13, xmm2 addsd xmm12, xmm2 addsd xmm11, xmm2 addsd xmm10, xmm2 addsd xmm9, xmm2 cmp eax, ebx jne .L4
使用打包版本 (addpd和mulpd) 更改标量版本将使 flop 计数翻倍,而不会改变执行时间,因此每个周期我会得到 2.8 次 flop。有没有一个简单的例子可以实现每个周期四次翻转?
addpd
mulpd
Mysticial 的小程序不错;这是我的结果(虽然只运行了几秒钟):
gcc -O2 -march=nocona
cl /O2
这一切似乎有点复杂,但到目前为止我的结论是:
gcc -O2更改独立浮点运算的顺序,目的是 尽可能交替addpd和’s。mulpd同样适用于gcc-4.6.2 -O2 -march=core2。
gcc -O2
gcc-4.6.2 -O2 -march=core2
gcc -O2 -march=nocona似乎保持了 C++ 源代码中定义的浮点运算的顺序。
cl /O2,来自SDK for Windows 7的 64 位编译器 会自动进行循环展开,并且似乎尝试并安排操作,以便三个 ‘ 的组与三个addpd‘ 交替mulpd(嗯,至少在我的系统和我的简单程序上) .
我的Core i5 750(Nehalem 架构)不喜欢交替使用 add 和 mul,而且似乎无法同时运行这两个操作。但是,如果将其分组为 3,它会突然像魔术一样起作用。
如果其他架构(可能是Sandy Bridge和其他架构)在汇编代码中交替执行,它们似乎能够毫无问题地并行执行 add/mul。
虽然很难承认,但在我的系统上,我的系统cl /O2在低级优化操作方面做得更好,并且在上面的小 C++ 示例中实现了接近峰值的性能。我测量了 1.85-2.01 flops/cycle 之间(在 Windows 中使用了 clock() 不是那么精确。我想,需要使用更好的计时器 - 感谢 Mackie Messer)。
我管理的最好gcc方法是手动循环展开并以三人一组的方式安排加法和乘法。我得到 g++ -O2 -march=nocona addmul_unroll.cpp 的充其量0.207s, 4.825 Gflops相当于 1.8 次失败/周期,我现在对此非常满意。
gcc
g++ -O2 -march=nocona addmul_unroll.cpp
0.207s, 4.825 Gflops
在 C++ 代码中,我将for循环替换为:
for
for (int i=0; i<loops/3; i++) { mul1*=mul; mul2*=mul; mul3*=mul; sum1+=add; sum2+=add; sum3+=add; mul4*=mul; mul5*=mul; mul1*=mul; sum4+=add; sum5+=add; sum1+=add; mul2*=mul; mul3*=mul; mul4*=mul; sum2+=add; sum3+=add; sum4+=add; mul5*=mul; mul1*=mul; mul2*=mul; sum5+=add; sum1+=add; sum2+=add; mul3*=mul; mul4*=mul; mul5*=mul; sum3+=add; sum4+=add; sum5+=add; }
程序集现在看起来像:
.L4: mulsd xmm8, xmm3 mulsd xmm7, xmm3 mulsd xmm6, xmm3 addsd xmm13, xmm2 addsd xmm12, xmm2 addsd xmm11, xmm2 mulsd xmm5, xmm3 mulsd xmm1, xmm3 mulsd xmm8, xmm3 addsd xmm10, xmm2 addsd xmm9, xmm2 addsd xmm13, xmm2 ...
我以前做过这个确切的任务。但主要是测量功耗和CPU温度。下面的代码(相当长)在我的 Core i7 2600K 上实现了接近最优。
这里要注意的关键是大量的手动循环展开以及乘法和加法的交错......
完整的项目可以在我的 GitHub 上找到:https ://github.com/Mysticial/Flops
如果您决定编译并运行它,请注意您的 CPU 温度!!! 确保不要过热。并确保 CPU 节流不会影响您的结果!
此外,我对运行此代码可能造成的任何损害不承担任何责任。
笔记:
此代码已经过测试,可在 Visual Studio 2010/2012 和 GCC 4.6 上正常运行。 ICC 11 (Intel Compiler 11) 出人意料地难以很好地编译它。
这些适用于 FMA 之前的处理器。为了在 Intel Haswell 和 AMD Bulldozer 处理器(及更高版本)上实现峰值 FLOPS,将需要 FMA(Fused Multiply Add)指令。这些超出了本基准的范围。
using namespace std;
typedef unsigned long long uint64;
double test_dp_mac_SSE(double x,double y,uint64 iterations){ register __m128d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;
// Generate starting data. r0 = _mm_set1_pd(x); r1 = _mm_set1_pd(y); r8 = _mm_set1_pd(-0.0); r2 = _mm_xor_pd(r0,r8); r3 = _mm_or_pd(r0,r8); r4 = _mm_andnot_pd(r8,r0); r5 = _mm_mul_pd(r1,_mm_set1_pd(0.37796447300922722721)); r6 = _mm_mul_pd(r1,_mm_set1_pd(0.24253562503633297352)); r7 = _mm_mul_pd(r1,_mm_set1_pd(4.1231056256176605498)); r8 = _mm_add_pd(r0,_mm_set1_pd(0.37796447300922722721)); r9 = _mm_add_pd(r1,_mm_set1_pd(0.24253562503633297352)); rA = _mm_sub_pd(r0,_mm_set1_pd(4.1231056256176605498)); rB = _mm_sub_pd(r1,_mm_set1_pd(4.1231056256176605498)); rC = _mm_set1_pd(1.4142135623730950488); rD = _mm_set1_pd(1.7320508075688772935); rE = _mm_set1_pd(0.57735026918962576451); rF = _mm_set1_pd(0.70710678118654752440); uint64 iMASK = 0x800fffffffffffffull; __m128d MASK = _mm_set1_pd(*(double*)&iMASK); __m128d vONE = _mm_set1_pd(1.0); uint64 c = 0; while (c < iterations){ size_t i = 0; while (i < 1000){ // Here's the meat - the part that really matters. r0 = _mm_mul_pd(r0,rC); r1 = _mm_add_pd(r1,rD); r2 = _mm_mul_pd(r2,rE); r3 = _mm_sub_pd(r3,rF); r4 = _mm_mul_pd(r4,rC); r5 = _mm_add_pd(r5,rD); r6 = _mm_mul_pd(r6,rE); r7 = _mm_sub_pd(r7,rF); r8 = _mm_mul_pd(r8,rC); r9 = _mm_add_pd(r9,rD); rA = _mm_mul_pd(rA,rE); rB = _mm_sub_pd(rB,rF); r0 = _mm_add_pd(r0,rF); r1 = _mm_mul_pd(r1,rE); r2 = _mm_sub_pd(r2,rD); r3 = _mm_mul_pd(r3,rC); r4 = _mm_add_pd(r4,rF); r5 = _mm_mul_pd(r5,rE); r6 = _mm_sub_pd(r6,rD); r7 = _mm_mul_pd(r7,rC); r8 = _mm_add_pd(r8,rF); r9 = _mm_mul_pd(r9,rE); rA = _mm_sub_pd(rA,rD); rB = _mm_mul_pd(rB,rC); r0 = _mm_mul_pd(r0,rC); r1 = _mm_add_pd(r1,rD); r2 = _mm_mul_pd(r2,rE); r3 = _mm_sub_pd(r3,rF); r4 = _mm_mul_pd(r4,rC); r5 = _mm_add_pd(r5,rD); r6 = _mm_mul_pd(r6,rE); r7 = _mm_sub_pd(r7,rF); r8 = _mm_mul_pd(r8,rC); r9 = _mm_add_pd(r9,rD); rA = _mm_mul_pd(rA,rE); rB = _mm_sub_pd(rB,rF); r0 = _mm_add_pd(r0,rF); r1 = _mm_mul_pd(r1,rE); r2 = _mm_sub_pd(r2,rD); r3 = _mm_mul_pd(r3,rC); r4 = _mm_add_pd(r4,rF); r5 = _mm_mul_pd(r5,rE); r6 = _mm_sub_pd(r6,rD); r7 = _mm_mul_pd(r7,rC); r8 = _mm_add_pd(r8,rF); r9 = _mm_mul_pd(r9,rE); rA = _mm_sub_pd(rA,rD); rB = _mm_mul_pd(rB,rC); i++; } // Need to renormalize to prevent denormal/overflow. r0 = _mm_and_pd(r0,MASK); r1 = _mm_and_pd(r1,MASK); r2 = _mm_and_pd(r2,MASK); r3 = _mm_and_pd(r3,MASK); r4 = _mm_and_pd(r4,MASK); r5 = _mm_and_pd(r5,MASK); r6 = _mm_and_pd(r6,MASK); r7 = _mm_and_pd(r7,MASK); r8 = _mm_and_pd(r8,MASK); r9 = _mm_and_pd(r9,MASK); rA = _mm_and_pd(rA,MASK); rB = _mm_and_pd(rB,MASK); r0 = _mm_or_pd(r0,vONE); r1 = _mm_or_pd(r1,vONE); r2 = _mm_or_pd(r2,vONE); r3 = _mm_or_pd(r3,vONE); r4 = _mm_or_pd(r4,vONE); r5 = _mm_or_pd(r5,vONE); r6 = _mm_or_pd(r6,vONE); r7 = _mm_or_pd(r7,vONE); r8 = _mm_or_pd(r8,vONE); r9 = _mm_or_pd(r9,vONE); rA = _mm_or_pd(rA,vONE); rB = _mm_or_pd(rB,vONE); c++; } r0 = _mm_add_pd(r0,r1); r2 = _mm_add_pd(r2,r3); r4 = _mm_add_pd(r4,r5); r6 = _mm_add_pd(r6,r7); r8 = _mm_add_pd(r8,r9); rA = _mm_add_pd(rA,rB); r0 = _mm_add_pd(r0,r2); r4 = _mm_add_pd(r4,r6); r8 = _mm_add_pd(r8,rA); r0 = _mm_add_pd(r0,r4); r0 = _mm_add_pd(r0,r8); // Prevent Dead Code Elimination double out = 0; __m128d temp = r0; out += ((double*)&temp)[0]; out += ((double*)&temp)[1]; return out;
}
void test_dp_mac_SSE(int tds,uint64 iterations){
double *sum = (double*)malloc(tds * sizeof(double)); double start = omp_get_wtime();
{ double ret = test_dp_mac_SSE(1.1,2.1,iterations); sum[omp_get_thread_num()] = ret; } double secs = omp_get_wtime() - start; uint64 ops = 48 * 1000 * iterations * tds * 2; cout << "Seconds = " << secs << endl; cout << "FP Ops = " << ops << endl; cout << "FLOPs = " << ops / secs << endl; double out = 0; int c = 0; while (c < tds){ out += sum[c++]; } cout << "sum = " << out << endl; cout << endl; free(sum);
int main(){ // (threads, iterations) test_dp_mac_SSE(8,10000000);
system("pause");
输出(1 个线程,10000000 次迭代) - 使用 Visual Studio 2010 SP1 编译 - x64 版本:
Seconds = 55.5104 FP Ops = 960000000000 FLOPs = 1.7294e+010 sum = 2.22652
该机器是 Core i7 2600K @ 4.4 GHz。理论 SSE 峰值为 4 flops * 4.4 GHz = 17.6 GFlops 。这段代码达到了 17.3 GFlops - 不错。
输出(8 个线程,10000000 次迭代) - 使用 Visual Studio 2010 SP1 编译 - x64 版本:
Seconds = 117.202 FP Ops = 7680000000000 FLOPs = 6.55279e+010 sum = 17.8122
理论 SSE 峰值为 4 触发器 * 4 个内核 * 4.4 GHz = 70.4 GFlops。 实际是 65.5 GFlops 。
#include <immintrin.h> #include <omp.h> #include <iostream> using namespace std; typedef unsigned long long uint64; double test_dp_mac_AVX(double x,double y,uint64 iterations){ register __m256d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF; // Generate starting data. r0 = _mm256_set1_pd(x); r1 = _mm256_set1_pd(y); r8 = _mm256_set1_pd(-0.0); r2 = _mm256_xor_pd(r0,r8); r3 = _mm256_or_pd(r0,r8); r4 = _mm256_andnot_pd(r8,r0); r5 = _mm256_mul_pd(r1,_mm256_set1_pd(0.37796447300922722721)); r6 = _mm256_mul_pd(r1,_mm256_set1_pd(0.24253562503633297352)); r7 = _mm256_mul_pd(r1,_mm256_set1_pd(4.1231056256176605498)); r8 = _mm256_add_pd(r0,_mm256_set1_pd(0.37796447300922722721)); r9 = _mm256_add_pd(r1,_mm256_set1_pd(0.24253562503633297352)); rA = _mm256_sub_pd(r0,_mm256_set1_pd(4.1231056256176605498)); rB = _mm256_sub_pd(r1,_mm256_set1_pd(4.1231056256176605498)); rC = _mm256_set1_pd(1.4142135623730950488); rD = _mm256_set1_pd(1.7320508075688772935); rE = _mm256_set1_pd(0.57735026918962576451); rF = _mm256_set1_pd(0.70710678118654752440); uint64 iMASK = 0x800fffffffffffffull; __m256d MASK = _mm256_set1_pd(*(double*)&iMASK); __m256d vONE = _mm256_set1_pd(1.0); uint64 c = 0; while (c < iterations){ size_t i = 0; while (i < 1000){ // Here's the meat - the part that really matters. r0 = _mm256_mul_pd(r0,rC); r1 = _mm256_add_pd(r1,rD); r2 = _mm256_mul_pd(r2,rE); r3 = _mm256_sub_pd(r3,rF); r4 = _mm256_mul_pd(r4,rC); r5 = _mm256_add_pd(r5,rD); r6 = _mm256_mul_pd(r6,rE); r7 = _mm256_sub_pd(r7,rF); r8 = _mm256_mul_pd(r8,rC); r9 = _mm256_add_pd(r9,rD); rA = _mm256_mul_pd(rA,rE); rB = _mm256_sub_pd(rB,rF); r0 = _mm256_add_pd(r0,rF); r1 = _mm256_mul_pd(r1,rE); r2 = _mm256_sub_pd(r2,rD); r3 = _mm256_mul_pd(r3,rC); r4 = _mm256_add_pd(r4,rF); r5 = _mm256_mul_pd(r5,rE); r6 = _mm256_sub_pd(r6,rD); r7 = _mm256_mul_pd(r7,rC); r8 = _mm256_add_pd(r8,rF); r9 = _mm256_mul_pd(r9,rE); rA = _mm256_sub_pd(rA,rD); rB = _mm256_mul_pd(rB,rC); r0 = _mm256_mul_pd(r0,rC); r1 = _mm256_add_pd(r1,rD); r2 = _mm256_mul_pd(r2,rE); r3 = _mm256_sub_pd(r3,rF); r4 = _mm256_mul_pd(r4,rC); r5 = _mm256_add_pd(r5,rD); r6 = _mm256_mul_pd(r6,rE); r7 = _mm256_sub_pd(r7,rF); r8 = _mm256_mul_pd(r8,rC); r9 = _mm256_add_pd(r9,rD); rA = _mm256_mul_pd(rA,rE); rB = _mm256_sub_pd(rB,rF); r0 = _mm256_add_pd(r0,rF); r1 = _mm256_mul_pd(r1,rE); r2 = _mm256_sub_pd(r2,rD); r3 = _mm256_mul_pd(r3,rC); r4 = _mm256_add_pd(r4,rF); r5 = _mm256_mul_pd(r5,rE); r6 = _mm256_sub_pd(r6,rD); r7 = _mm256_mul_pd(r7,rC); r8 = _mm256_add_pd(r8,rF); r9 = _mm256_mul_pd(r9,rE); rA = _mm256_sub_pd(rA,rD); rB = _mm256_mul_pd(rB,rC); i++; } // Need to renormalize to prevent denormal/overflow. r0 = _mm256_and_pd(r0,MASK); r1 = _mm256_and_pd(r1,MASK); r2 = _mm256_and_pd(r2,MASK); r3 = _mm256_and_pd(r3,MASK); r4 = _mm256_and_pd(r4,MASK); r5 = _mm256_and_pd(r5,MASK); r6 = _mm256_and_pd(r6,MASK); r7 = _mm256_and_pd(r7,MASK); r8 = _mm256_and_pd(r8,MASK); r9 = _mm256_and_pd(r9,MASK); rA = _mm256_and_pd(rA,MASK); rB = _mm256_and_pd(rB,MASK); r0 = _mm256_or_pd(r0,vONE); r1 = _mm256_or_pd(r1,vONE); r2 = _mm256_or_pd(r2,vONE); r3 = _mm256_or_pd(r3,vONE); r4 = _mm256_or_pd(r4,vONE); r5 = _mm256_or_pd(r5,vONE); r6 = _mm256_or_pd(r6,vONE); r7 = _mm256_or_pd(r7,vONE); r8 = _mm256_or_pd(r8,vONE); r9 = _mm256_or_pd(r9,vONE); rA = _mm256_or_pd(rA,vONE); rB = _mm256_or_pd(rB,vONE); c++; } r0 = _mm256_add_pd(r0,r1); r2 = _mm256_add_pd(r2,r3); r4 = _mm256_add_pd(r4,r5); r6 = _mm256_add_pd(r6,r7); r8 = _mm256_add_pd(r8,r9); rA = _mm256_add_pd(rA,rB); r0 = _mm256_add_pd(r0,r2); r4 = _mm256_add_pd(r4,r6); r8 = _mm256_add_pd(r8,rA); r0 = _mm256_add_pd(r0,r4); r0 = _mm256_add_pd(r0,r8); // Prevent Dead Code Elimination double out = 0; __m256d temp = r0; out += ((double*)&temp)[0]; out += ((double*)&temp)[1]; out += ((double*)&temp)[2]; out += ((double*)&temp)[3]; return out; } void test_dp_mac_AVX(int tds,uint64 iterations){ double *sum = (double*)malloc(tds * sizeof(double)); double start = omp_get_wtime(); #pragma omp parallel num_threads(tds) { double ret = test_dp_mac_AVX(1.1,2.1,iterations); sum[omp_get_thread_num()] = ret; } double secs = omp_get_wtime() - start; uint64 ops = 48 * 1000 * iterations * tds * 4; cout << "Seconds = " << secs << endl; cout << "FP Ops = " << ops << endl; cout << "FLOPs = " << ops / secs << endl; double out = 0; int c = 0; while (c < tds){ out += sum[c++]; } cout << "sum = " << out << endl; cout << endl; free(sum); } int main(){ // (threads, iterations) test_dp_mac_AVX(8,10000000); system("pause"); }
Seconds = 57.4679 FP Ops = 1920000000000 FLOPs = 3.34099e+010 sum = 4.45305
理论 AVX 峰值为 8 flops * 4.4 GHz = 35.2 GFlops 。实际是 33.4 GFlops 。
Seconds = 111.119 FP Ops = 15360000000000 FLOPs = 1.3823e+011 sum = 35.6244
理论 AVX 峰值为 8 触发器 * 4 个内核 * 4.4 GHz = 140.8 GFlops。 实际是 138.2 GFlops 。
现在进行一些解释:
性能关键部分显然是内部循环内的 48 条指令。您会注意到它分为 4 块,每块 12 条指令。这 12 个指令块中的每一个都完全相互独立 - 平均需要 6 个周期来执行。
所以从发布到使用之间有 12 条指令和 6 个周期。乘法的延迟为 5 个周期,因此足以避免延迟停止。
需要规范化步骤来防止数据上溢/下溢。这是必需的,因为无操作代码将缓慢增加/减少数据的大小。
So it’s actually possible to do better than this if you just use all zeros and get rid of the normalization step. However, since I wrote the benchmark to measure power consumption and temperature, I had to make sure the flops were on “real” data, rather than zeros - as the execution units may very well have special case-handling for zeros that use less power and produce less heat.
Threads: 1
Seconds = 72.1116 FP Ops = 960000000000 FLOPs = 1.33127e+010 sum = 2.22652
Theoretical SSE Peak: 4 flops * 3.5 GHz = 14.0 GFlops. Actual is 13.3 GFlops.
Threads: 8
Seconds = 149.576 FP Ops = 7680000000000 FLOPs = 5.13452e+010 sum = 17.8122
Theoretical SSE Peak: 4 flops * 4 cores * 3.5 GHz = 56.0 GFlops. Actual is 51.3 GFlops.
My processor temps hit 76C on the multi-threaded run! If you runs these, be sure the results aren’t affected by CPU throttling.
Seconds = 78.3357 FP Ops = 960000000000 FLOPs = 1.22549e+10 sum = 2.22652
Theoretical SSE Peak: 4 flops * 3.2 GHz = 12.8 GFlops. Actual is 12.3 GFlops.
Seconds = 78.4733 FP Ops = 7680000000000 FLOPs = 9.78676e+10 sum = 17.8122
Theoretical SSE Peak: 4 flops * 8 cores * 3.2 GHz = 102.4 GFlops. Actual is 97.9 GFlops.