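I noticed that the count depends heavily on the machine I run it on.
I compiled with gcc 7.3, gcc 8.2 and clang 6.0, using -std=c++17 -O3.
On an i7-4790 (kernel 4.17.14-arch1-1-ARCH): ~3e8
But on a Xeon E5-2630 v4 (kernel 3.10.0-514.el7.x86_64): ~8e6
This is the difference I would like to understand, so I checked with perf stat -d.
On the i7: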
4999.419546 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
120 page-faults:u # 0.024 K/sec
19,605,598,394 cycles:u # 3.922 GHz (49.94%)
33,601,884,120 instructions:u # 1.71 insn per cycle (62.48%)
7,397,994,820 branches:u # 1479.771 M/sec (62.53%)
34,788 branch-misses:u # 0.00% of all branches (62.58%)
10,809,601,166 L1-dcache-loads:u # 2162.171 M/sec (62.41%)
13,632 L1-dcache-load-misses:u # 0.00% of all L1-dcache hits (24.95%)
3,944 LLC-loads:u # 0.789 K/sec (24.95%)
1,034 LLC-load-misses:u # 26.22% of all LL-cache hits (37.42%)
5.003180401 seconds time elapsed
4.969048000 seconds user
0.016557000 seconds sys
On the Xeon:
5001.000000 task-clock (msec) # 0.999 CPUs utilized
42 context-switches # 0.008 K/sec
2 cpu-migrations # 0.000 K/sec
412 page-faults # 0.082 K/sec
15,100,238,798 cycles # 3.019 GHz (50.01%)
794,184,899 instructions # 0.05 insn per cycle (62.51%)
188,083,219 branches # 37.609 M/sec (62.49%)
85,924 branch-misses # 0.05% of all branches (62.51%)
269,848,346 L1-dcache-loads # 53.959 M/sec (62.49%)
246,532 L1-dcache-load-misses # 0.09% of all L1-dcache hits (62.51%)
13,327 LLC-loads # 0.003 M/sec (49.99%)
7,417 LLC-load-misses # 55.65% of all LL-cache hits (50.02%)
5.006139971 seconds time elapsed
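What stands out is the low number of instructions per cycle on the Xeon, as well as the nonzero context switches, which I don't understand. However, I could not turn these diagnostics into an explanation.
To add a bit more weirdness to the problem, while debugging I also compiled statically on one machine and ran the binary on the other.
On the Xeon, the statically compiled executable gives a count about 10% lower, and it makes no difference whether it was compiled on the Xeon or on the i7.
Doing the same thing on the i7, the count actually drops from ~3e8 to ~2e7.
So in the end I am left with two questions:
Why do I see such a significant difference between the two machines?
Why does a statically linked executable perform worse, when I would expect the opposite?
Edit: after updating the kernel on the CentOS 7 machine to 4.18, we actually see an additional drop from ~8e6 to ~5e6.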
Interestingly, perf stat now shows different numbers:
5002.000000 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
119 page-faults:u # 0.024 K/sec
409,723,790 cycles:u # 0.082 GHz (50.00%)
392,228,592 instructions:u # 0.96 insn per cycle (62.51%)
115,475,503 branches:u # 23.086 M/sec (62.51%)
26,355 branch-misses:u # 0.02% of all branches (62.53%)
115,799,571 L1-dcache-loads:u # 23.151 M/sec (62.51%)
42,327 L1-dcache-load-misses:u # 0.04% of all L1-dcache hits (62.50%)
88 LLC-loads:u # 0.018 K/sec (49.96%)
2 LLC-load-misses:u # 2.27% of all LL-cache hits (49.98%)
5.005940327 seconds time elapsed
0.533000000 seconds user
4.469000000 seconds sys
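Interestingly, there are no more context switches and the instructions per cycle went up significantly, but the cycle count and the clock frequency are super low!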