So I'm trying to measure the latencies of the L1, L2, and L3 caches using C. I know their sizes, and I feel I conceptually understand how to do it, but I'm running into problems with my implementation. I'm wondering if some other hardware intricacies, such as prefetching, are causing trouble.
```c
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(){
    srand(time(NULL)); // Seed ONCE
    const int L1_CACHE_SIZE = 32768/sizeof(int);
    const int L2_CACHE_SIZE = 262144/sizeof(int);
    const int L3_CACHE_SIZE = 6587392/sizeof(int);
    const int NUM_ACCESSES = 1000000;
    const int SECONDS_PER_NS = 1000000000;
    int arrayAccess[L1_CACHE_SIZE];
    int arrayInvalidateL1[L1_CACHE_SIZE];
    int arrayInvalidateL2[L2_CACHE_SIZE];
    int arrayInvalidateL3[L3_CACHE_SIZE];
    int count=0;
    int index=0;
    int i=0;
    struct timespec startAccess, endAccess;
    double mainMemAccess, L1Access, L2Access, L3Access;
    int readValue=0;

    memset(arrayAccess, 0, L1_CACHE_SIZE*sizeof(int));
    memset(arrayInvalidateL1, 0, L1_CACHE_SIZE*sizeof(int));
    memset(arrayInvalidateL2, 0, L2_CACHE_SIZE*sizeof(int));
    memset(arrayInvalidateL3, 0, L3_CACHE_SIZE*sizeof(int));

    index = 0;
    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index]; //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
        count++; //divide overall time by this
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    mainMemAccess = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS)
                  + (endAccess.tv_nsec - startAccess.tv_nsec);
    mainMemAccess /= count;

    printf("Main Memory Access %lf\n", mainMemAccess);

    index = 0;
    count = 0;
    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index]; //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
        count++; //divide overall time by this
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    L1Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS)
             + (endAccess.tv_nsec - startAccess.tv_nsec);
    L1Access /= count;

    printf("L1 Cache Access %lf\n", L1Access);

    //invalidate L1 by accessing all elements of array which is larger than cache
    for(count=0; count < L1_CACHE_SIZE; count++){
        int read = arrayInvalidateL1[count];
        read++;
        readValue+=read;
    }

    index = 0;
    count = 0;
    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index]; //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
        count++; //divide overall time by this
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    L2Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS)
             + (endAccess.tv_nsec - startAccess.tv_nsec);
    L2Access /= count;

    printf("L2 Cache Acces %lf\n", L2Access);

    //invalidate L2 by accessing all elements of array which is larger than cache
    for(count=0; count < L2_CACHE_SIZE; count++){
        int read = arrayInvalidateL2[count];
        read++;
        readValue+=read;
    }

    index = 0;
    count = 0;
    clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    while (index < L1_CACHE_SIZE) {
        int tmp = arrayAccess[index]; //Access Value from L2
        index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
        count++; //divide overall time by this
    }
    clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
    L3Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS)
             + (endAccess.tv_nsec - startAccess.tv_nsec);
    L3Access /= count;

    printf("L3 Cache Access %lf\n", L3Access);

    printf("Read Value: %d", readValue);
}
```
I first access a value from the array I want data from. That should obviously come from main memory, since it's the first access. The array is small (less than a page), so it should be copied into L1, L2, and L3. I then access a value from the same array, which should now be in L1. Next I access all the values from an array the same size as the L1 cache, to invalidate the data I want to access (so now it should only be in L2/L3). Then I repeat the process for L2 and L3. The access times are clearly off though, which means I'm doing something wrong...
I think there may be issues with the timing of the clock (starting and stopping it takes some time in ns, and that will change depending on whether the relevant code/data is itself cached or not).
Can someone give me some pointers on what I might be doing wrong?
UPDATE 1: So I amortized the cost of the timer by doing lots of accesses, I fixed the sizes of my caches, and I also took the advice to use a more complex indexing scheme to avoid fixed strides. Unfortunately the times are still off. They all seem to be coming from L1. I'm thinking the problem may be the invalidation rather than the access. Would a random vs. LRU replacement scheme affect which data gets invalidated?
UPDATE 2: Fixed the memset (added the L3 memset so the data in L3 is invalidated too, making the first access really start from main memory) and the indexing scheme. Still no luck.
UPDATE 3: I could never get this approach to work, but there were some good suggested answers, and I've posted a couple of solutions of my own.
I also ran Cachegrind to view the hits/misses:
```
==6710== I   refs:      1,735,104
==6710== I1  misses:        1,092
==6710== LLi misses:        1,084
==6710== I1  miss rate:      0.06%
==6710== LLi miss rate:      0.06%
==6710==
==6710== D   refs:      1,250,696  (721,162 rd   + 529,534 wr)
==6710== D1  misses:      116,492  (  7,627 rd   + 108,865 wr)
==6710== LLd misses:      115,102  (  6,414 rd   + 108,688 wr)
==6710== D1  miss rate:       9.3% (    1.0%     +    20.5%  )
==6710== LLd miss rate:       9.2% (    0.8%     +    20.5%  )
==6710==
==6710== LL refs:         117,584  (  8,719 rd   + 108,865 wr)
==6710== LL misses:       116,186  (  7,498 rd   + 108,688 wr)
==6710== LL miss rate:        3.8% (    0.3%     +    20.5%  )

     Ir I1mr ILmr      Dr  D1mr  DLmr     Dw D1mw DLmw
      .    .    .       .     .     .      .    .    .  #include <stdio.h>
      .    .    .       .     .     .      .    .    .  #include <string.h>
      .    .    .       .     .     .      .    .    .  #include <time.h>
      .    .    .       .     .     .      .    .    .
      6    0    0       0     0     0      2    0    0  int main(){
      5    1    1       0     0     0      2    0    0      srand(time(NULL)); // Seed ONCE
      1    0    0       0     0     0      1    0    0      const int L1_CACHE_SIZE = 32768/sizeof(int);
      1    0    0       0     0     0      1    0    0      const int L2_CACHE_SIZE = 262144/sizeof(int);
      1    0    0       0     0     0      1    0    0      const int L3_CACHE_SIZE = 6587392/sizeof(int);
      1    0    0       0     0     0      1    0    0      const int NUM_ACCESSES = 1000000;
      1    0    0       0     0     0      1    0    0      const int SECONDS_PER_NS = 1000000000;
     21    2    2       3     0     0      3    0    0      int arrayAccess[L1_CACHE_SIZE];
     21    1    1       3     0     0      3    0    0      int arrayInvalidateL1[L1_CACHE_SIZE];
     21    2    2       3     0     0      3    0    0      int arrayInvalidateL2[L2_CACHE_SIZE];
     21    1    1       3     0     0      3    0    0      int arrayInvalidateL3[L3_CACHE_SIZE];
      1    0    0       0     0     0      1    0    0      int count=0;
      1    1    1       0     0     0      1    0    0      int index=0;
      1    0    0       0     0     0      1    0    0      int i=0;
      .    .    .       .     .     .      .    .    .      struct timespec startAccess, endAccess;
      .    .    .       .     .     .      .    .    .      double mainMemAccess, L1Access, L2Access, L3Access;
      1    0    0       0     0     0      1    0    0      int readValue=0;
      .    .    .       .     .     .      .    .    .
      7    0    0       2     0     0      1    1    1      memset(arrayAccess, 0, L1_CACHE_SIZE*sizeof(int));
      7    1    1       2     2     0      1    0    0      memset(arrayInvalidateL1, 0, L1_CACHE_SIZE*sizeof(int));
      7    0    0       2     2     0      1    0    0      memset(arrayInvalidateL2, 0, L2_CACHE_SIZE*sizeof(int));
      7    1    1       2     2     0      1    0    0      memset(arrayInvalidateL3, 0, L3_CACHE_SIZE*sizeof(int));
      .    .    .       .     .     .      .    .    .
      1    0    0       0     0     0      1    1    1      index = 0;
      4    0    0       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    772    1    1     514     0     0      0    0    0      while (index < L1_CACHE_SIZE) {
  1,280    1    1     768   257   257    256    0    0          int tmp = arrayAccess[index]; //Access Value from L2
  2,688    0    0     768     0     0    256    0    0          index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
    256    0    0     256     0     0      0    0    0          count++; //divide overall time by this
      .    .    .       .     .     .      .    .    .      }
      4    0    0       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
     14    1    1       5     1     1      1    1    1      mainMemAccess = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
      6    0    0       2     0     0      1    0    0      mainMemAccess /= count;
      .    .    .       .     .     .      .    .    .
      6    1    1       2     0     0      2    0    0      printf("Main Memory Access %lf\n", mainMemAccess);
      .    .    .       .     .     .      .    .    .
      1    0    0       0     0     0      1    0    0      index = 0;
      1    0    0       0     0     0      1    0    0      count=0;
      4    1    1       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    772    1    1     514     0     0      0    0    0      while (index < L1_CACHE_SIZE) {
  1,280    0    0     768   240     0    256    0    0          int tmp = arrayAccess[index]; //Access Value from L2
  2,688    0    0     768     0     0    256    0    0          index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
    256    0    0     256     0     0      0    0    0          count++; //divide overall time by this
      .    .    .       .     .     .      .    .    .      }
      4    0    0       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
     14    1    1       5     0     0      1    1    0      L1Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
      6    1    1       2     0     0      1    0    0      L1Access /= count;
      .    .    .       .     .     .      .    .    .
      6    0    0       2     0     0      2    0    0      printf("L1 Cache Access %lf\n", L1Access);
      .    .    .       .     .     .      .    .    .
      .    .    .       .     .     .      .    .    .      //invalidate L1 by accessing all elements of array which is larger than cache
 32,773    1    1  24,578     0     0      1    0    0      for(count=0; count < L1_CACHE_SIZE; count++){
 40,960    0    0  24,576   513   513  8,192    0    0          int read = arrayInvalidateL1[count];
  8,192    0    0   8,192     0     0      0    0    0          read++;
 16,384    0    0  16,384     0     0      0    0    0          readValue+=read;
      .    .    .       .     .     .      .    .    .      }
      .    .    .       .     .     .      .    .    .
      1    0    0       0     0     0      1    0    0      index = 0;
      1    1    1       0     0     0      1    0    0      count = 0;
      4    0    0       0     0     0      1    1    0      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    772    1    1     514     0     0      0    0    0      while (index < L1_CACHE_SIZE) {
  1,280    0    0     768   256     0    256    0    0          int tmp = arrayAccess[index]; //Access Value from L2
  2,688    0    0     768     0     0    256    0    0          index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
    256    0    0     256     0     0      0    0    0          count++; //divide overall time by this
      .    .    .       .     .     .      .    .    .      }
      4    1    1       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
     14    0    0       5     1     0      1    1    0      L2Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
      6    1    1       2     0     0      1    0    0      L2Access /= count;
      .    .    .       .     .     .      .    .    .
      6    0    0       2     0     0      2    0    0      printf("L2 Cache Acces %lf\n", L2Access);
      .    .    .       .     .     .      .    .    .
      .    .    .       .     .     .      .    .    .      //invalidate L2 by accessing all elements of array which is larger than cache
262,149    2    2 196,610     0     0      1    0    0      for(count=0; count < L2_CACHE_SIZE; count++){
327,680    0    0 196,608 4,097 4,095 65,536    0    0          int read = arrayInvalidateL2[count];
 65,536    0    0  65,536     0     0      0    0    0          read++;
131,072    0    0 131,072     0     0      0    0    0          readValue+=read;
      .    .    .       .     .     .      .    .    .      }
      .    .    .       .     .     .      .    .    .
      1    0    0       0     0     0      1    0    0      index = 0;
      1    0    0       0     0     0      1    0    0      count=0;
      4    0    0       0     0     0      1    1    0      clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
    772    1    1     514     0     0      0    0    0      while (index < L1_CACHE_SIZE) {
  1,280    0    0     768   256     0    256    0    0          int tmp = arrayAccess[index]; //Access Value from L2
  2,688    0    0     768     0     0    256    0    0          index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
    256    0    0     256     0     0      0    0    0          count++; //divide overall time by this
      .    .    .       .     .     .      .    .    .      }
      4    0    0       0     0     0      1    0    0      clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
     14    1    1       5     1     0      1    1    0      L3Access = ((endAccess.tv_sec - startAccess.tv_sec) * SECONDS_PER_NS) + (endAccess.tv_nsec - startAccess.tv_nsec);
      6    0    0       2     0     0      1    0    0      L3Access /= count;
      .    .    .       .     .     .      .    .    .
      6    1    1       2     0     0      2    0    0      printf("L3 Cache Access %lf\n", L3Access);
      .    .    .       .     .     .      .    .    .
      6    0    0       1     0     0      1    0    0      printf("Read Value: %d", readValue);
      .    .    .       .     .     .      .    .    .
      3    0    0       3     0     0      0    0    0  }
```
Sergey L... 26
I would rather try to use the hardware clock as a measure. The rdtsc instruction will tell you the current cycle count since the CPU was powered up. Also, it is better to use asm to make sure the same instructions are always used in both the measured and the dry runs. Using that and some clever statistics, I made this a long while ago:
```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int i386_cpuid_caches (size_t * data_caches) {
    int i;
    int num_data_caches = 0;
    for (i = 0; i < 32; i++) {
        // Variables to hold the contents of the 4 i386 legacy registers
        uint32_t eax, ebx, ecx, edx;

        eax = 4; // get cache info
        ecx = i; // cache id

        asm (
            "cpuid" // call i386 cpuid instruction
            : "+a" (eax) // contains the cpuid command code, 4 for cache query
            , "=b" (ebx)
            , "+c" (ecx) // contains the cache id
            , "=d" (edx)
        ); // generates output in 4 registers eax, ebx, ecx and edx

        // taken from http://download.intel.com/products/processor/manual/325462.pdf Vol. 2A 3-149
        int cache_type = eax & 0x1F;
        if (cache_type == 0) // end of valid cache identifiers
            break;

        char * cache_type_string;
        switch (cache_type) {
            case 1: cache_type_string = "Data Cache"; break;
            case 2: cache_type_string = "Instruction Cache"; break;
            case 3: cache_type_string = "Unified Cache"; break;
            default: cache_type_string = "Unknown Type Cache"; break;
        }

        int cache_level = (eax >>= 5) & 0x7;
        int cache_is_self_initializing = (eax >>= 3) & 0x1; // does not need SW initialization
        int cache_is_fully_associative = (eax >>= 1) & 0x1;

        // taken from http://download.intel.com/products/processor/manual/325462.pdf 3-166 Vol. 2A
        // ebx contains 3 integers of 10, 10 and 12 bits respectively
        unsigned int cache_sets = ecx + 1;
        unsigned int cache_coherency_line_size = (ebx & 0xFFF) + 1;
        unsigned int cache_physical_line_partitions = ((ebx >>= 12) & 0x3FF) + 1;
        unsigned int cache_ways_of_associativity = ((ebx >>= 10) & 0x3FF) + 1;

        // Total cache size is the product
        size_t cache_total_size = cache_ways_of_associativity * cache_physical_line_partitions * cache_coherency_line_size * cache_sets;

        if (cache_type == 1 || cache_type == 3) {
            data_caches[num_data_caches++] = cache_total_size;
        }

        printf(
            "Cache ID %d:\n"
            "- Level: %d\n"
            "- Type: %s\n"
            "- Sets: %d\n"
            "- System Coherency Line Size: %d bytes\n"
            "- Physical Line partitions: %d\n"
            "- Ways of associativity: %d\n"
            "- Total Size: %zu bytes (%zu kb)\n"
            "- Is fully associative: %s\n"
            "- Is Self Initializing: %s\n"
            "\n"
            , i
            , cache_level
            , cache_type_string
            , cache_sets
            , cache_coherency_line_size
            , cache_physical_line_partitions
            , cache_ways_of_associativity
            , cache_total_size, cache_total_size >> 10
            , cache_is_fully_associative ? "true" : "false"
            , cache_is_self_initializing ? "true" : "false"
        );
    }
    return num_data_caches;
}

int test_cache(size_t attempts, size_t lower_cache_size, int * latencies, size_t max_latency) {
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        perror("open");
        abort();
    }
    char * random_data = mmap(
          NULL
        , lower_cache_size
        , PROT_READ | PROT_WRITE
        , MAP_PRIVATE | MAP_ANON // | MAP_POPULATE
        , -1
        , 0
    ); // get some random data
    if (random_data == MAP_FAILED) {
        perror("mmap");
        abort();
    }

    size_t i;
    for (i = 0; i < lower_cache_size; i += sysconf(_SC_PAGESIZE)) {
        random_data[i] = 1;
    }

    int64_t random_offset = 0;
    while (attempts--) {
        // use processor clock timer for exact measurement
        random_offset += rand();
        random_offset %= lower_cache_size;
        int32_t cycles_used, edx, temp1, temp2;
        asm (
            "mfence\n\t"        // memory fence
            "rdtsc\n\t"         // get cpu cycle count
            "mov %%edx, %2\n\t"
            "mov %%eax, %3\n\t"
            "mfence\n\t"        // memory fence
            "mov %4, %%al\n\t"  // load data
            "mfence\n\t"
            "rdtsc\n\t"
            "sub %2, %%edx\n\t" // subtract cycle count
            "sbb %3, %%eax"     // subtract cycle count
            : "=a" (cycles_used)
            , "=d" (edx)
            , "=r" (temp1)
            , "=r" (temp2)
            : "m" (random_data[random_offset])
        );
        // printf("%d\n", cycles_used);
        if (cycles_used < max_latency)
            latencies[cycles_used]++;
        else
            latencies[max_latency - 1]++;
    }

    munmap(random_data, lower_cache_size);

    return 0;
}

int main() {
    size_t cache_sizes[32];
    int num_data_caches = i386_cpuid_caches(cache_sizes);

    int latencies[0x400];
    memset(latencies, 0, sizeof(latencies));

    int empty_cycles = 0;

    int i;
    int attempts = 1000000;
    for (i = 0; i < attempts; i++) { // measure how much overhead we have for counting cycles
        int32_t cycles_used, edx, temp1, temp2;
        asm (
            "mfence\n\t"        // memory fence
            "rdtsc\n\t"         // get cpu cycle count
            "mov %%edx, %2\n\t"
            "mov %%eax, %3\n\t"
            "mfence\n\t"        // memory fence
            "mfence\n\t"
            "rdtsc\n\t"
            "sub %2, %%edx\n\t" // subtract cycle count
            "sbb %3, %%eax"     // subtract cycle count
            : "=a" (cycles_used)
            , "=d" (edx)
            , "=r" (temp1)
            , "=r" (temp2)
            :
        );
        if (cycles_used < sizeof(latencies) / sizeof(*latencies))
            latencies[cycles_used]++;
        else
            latencies[sizeof(latencies) / sizeof(*latencies) - 1]++;
    }

    {
        int j;
        size_t sum = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum += latencies[j];
        }
        size_t sum2 = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum2 += latencies[j];
            if (sum2 >= sum * .75) {
                empty_cycles = j;
                fprintf(stderr, "Empty counting takes %d cycles\n", empty_cycles);
                break;
            }
        }
    }

    for (i = 0; i < num_data_caches; i++) {
        test_cache(attempts, cache_sizes[i] * 4, latencies, sizeof(latencies) / sizeof(*latencies));

        int j;
        size_t sum = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum += latencies[j];
        }
        size_t sum2 = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum2 += latencies[j];
            if (sum2 >= sum * .75) {
                fprintf(stderr, "Cache ID %i has latency %d cycles\n", i, j - empty_cycles);
                break;
            }
        }
    }

    return 0;
}
```
The output on my Core2Duo:
```
Cache ID 0:
- Level: 1
- Type: Data Cache
- Total Size: 32768 bytes (32 kb)

Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Total Size: 32768 bytes (32 kb)

Cache ID 2:
- Level: 2
- Type: Unified Cache
- Total Size: 262144 bytes (256 kb)

Cache ID 3:
- Level: 3
- Type: Unified Cache
- Total Size: 3145728 bytes (3072 kb)

Empty counting takes 90 cycles
Cache ID 0 has latency 6 cycles
Cache ID 2 has latency 21 cycles
Cache ID 3 has latency 168 cycles
```
Can you write how you compiled this? I get "error: 'asm' operand has impossible constraints". (2 upvotes)
Leeor.. 8
OK, there are several problems with your code:
As you mentioned, your measurement takes a long time. In fact, it very likely takes far longer than the single access itself, so it's not measuring anything useful. To mitigate that, access multiple elements and amortize (divide the overall time by the number of accesses). Note that to measure the latency, you want these accesses to be serialized, otherwise they can be performed in parallel and you would only measure the throughput of unrelated accesses. To achieve that, you could add a false dependency between the accesses.
For example, initialize the array to zeros, and do:
```c
clock_gettime(CLOCK_REALTIME, &startAccess); //start clock
for (int i = 0; i < NUM_ACCESSES; ++i) {
    int tmp = arrayAccess[index]; //Access Value from Main Memory
    index = (index + i + tmp) & 1023;
}
clock_gettime(CLOCK_REALTIME, &endAccess); //end clock
```
..and of course remember to divide the time by NUM_ACCESSES.
Now, I've made the index deliberately complicated so that you avoid a fixed stride that might trigger prefetching (a bit of an overkill, you're not likely to notice an impact, but for the sake of demonstration...). You could probably settle for a simple index += 32, which would give you a stride of 128 bytes (two cache lines) and avoid the "benefit" of most simple adjacent-line/streaming prefetchers. I've also replaced % 1000 with & 1023, since & is faster, but it needs a power of two to work the same way - so just increase ACCESS_SIZE to 1024 and it will work.
Invalidating the L1 by loading other stuff is good, but the sizes look funny. You didn't specify your system, but 256000 seems pretty big for an L1. An L2 is usually 256k on many common modern x86 CPUs, for example. Also note that 256k is not 256000, but rather 256*1024 = 262144. The same goes for the second size: 1M is not 1024000, it's 1024*1024 = 1048576. Assuming that is indeed your L2 size (more likely an L3, but probably too small for that).
Your invalidating arrays are of type int, so each element is longer than a single byte (most likely 4 bytes, depending on the system). You're actually invalidating L1_CACHE_SIZE*sizeof(int) worth of bytes (and the same goes for the L2 invalidation loop).
memset receives the size in bytes, but your sizes are divided by sizeof(int).
Your invalidation reads are never used, and may get optimized out. Try to accumulate the reads into some value and print it at the end, to avoid this possibility.
The memset at the beginning also accesses the data, so your first loop is accessing data from the L3 (since the other two memsets were still effective in evicting it from L1 and L2, although only partially, due to the size error).
The strides may be too small, so you could get two accesses to the same cache line (L1 hits). Make sure they're spread out enough by adding 32 elements (x4 bytes) - that's two cache lines, so you also won't get any adjacent-cache-line prefetch benefit.
Since NUM_ACCESSES is larger than ACCESS_SIZE, you're essentially repeating the same elements and would probably get L1 hits for them (so the average time is skewed in favor of L1 access latency). Instead, try using the L1 size so you access the entire L1 (except for the skips) exactly once. For example, like this -
```c
index = 0;
while (index < L1_CACHE_SIZE) {
    int tmp = arrayAccess[index]; //Access Value from L2
    index = (index + tmp + ((index & 4) ? 28 : 36)); // on average this should give 32 element skips, with changing strides
    count++; //divide overall time by this
}
```
And don't forget to increase arrayAccess to the L1 size.
Now, with the changes above (more or less), I get something like this:
```
L1 Cache Access 7.812500
L2 Cache Acces 15.625000
L3 Cache Access 23.437500
```
This still looks a bit long, but that may be because it includes an additional dependency on arithmetic operations.