返回64位整数中所有设置位的最快方法是什么？

Question

问

返回64位整数中所有设置位的最快方法是什么？

鲁有军_644 发布于 2023-02-08 19:15

我需要一种快速的方法来获取64位整数中所有位的位置.例如,给定x = 123703,我想填充一个数组idx[] = {0, 1, 2, 4, 5, 8, 9, 13, 14, 15, 16}.我们可以假设我们先验地知道比特数.这将被称为10 ^ 12 - 10 ^ 15次,因此速度至关重要.到目前为止,我提出的最快答案是以下怪物,它使用64位整数的每个字节作为表的索引,这些表给出了在该字节中设置的位数以及这些位的位置:

int64_t x;            // this is the input
unsigned char idx[K]; // this is the array of K bits that are set
unsigned char *dst=idx, *src;
unsigned char zero, one, two, three, four, five;  // these hold the 0th-5th bytes
zero  =  x & 0x0000000000FFUL;
one   = (x & 0x00000000FF00UL) >> 8;
two   = (x & 0x000000FF0000UL) >> 16;
three = (x & 0x0000FF000000UL) >> 24;
four  = (x & 0x00FF00000000UL) >> 32;
five  = (x & 0xFF0000000000UL) >> 40;
src=tab0+tabofs[zero ]; COPY(dst, src, n[zero ]);
src=tab1+tabofs[one  ]; COPY(dst, src, n[one  ]);
src=tab2+tabofs[two  ]; COPY(dst, src, n[two  ]);
src=tab3+tabofs[three]; COPY(dst, src, n[three]);
src=tab4+tabofs[four ]; COPY(dst, src, n[four ]);
src=tab5+tabofs[five ]; COPY(dst, src, n[five ]);

其中COPY是一个复制最多8个字节的switch语句,n是一个字节中设置的位数的数组,并tabofs给出了偏移量tabX,它保存了第X个字节中设置位的位置. ~~__builtin_ctz()在我的Xeon E5-2609上,这比基于循环的非循环方法快3倍.~~(见下文.)我目前正在x按照字典顺序迭代一组给定的位数.

有没有更好的办法？

编辑:添加了一个例子(我后来修复了).完整代码可在此处获取:http://pastebin.com/79X8XL2P.注意:带有-O2的GCC似乎可以优化它,但英特尔的编译器(我曾经编写它)不会...

另外,让我提供一些额外的背景来解决下面的一些评论.目标是对N个可能的解释变量范围内的每个可能的K变量子集进行统计检验; 现在的具体目标是N = 41,但我可以看到一些项目需要N到45-50.该测试基本上涉及分解相应的数据子矩阵.在伪代码中,这样的事情:

double doTest(double *data, int64_t model) {
  int nidx, idx[];
  double submatrix[][];
  nidx = getIndices(model, idx);  // get the locations of ones in model
  // copy data into submatrix
  for(int i=0; i

我为英特尔Phi板编写了一个这样的版本,它应该在大约15天内完成N = 41的情况,其中大约5-10%的时间花在一个天真的getIndices()所以,所以马上就可以节省更快的版本一天或更长时间.我正在为NVidia Kepler开发一个实现,但遗憾的是我遇到的问题(很少的小矩阵运算)并不适合硬件(非常大的矩阵运算).也就是说,本文提出了一种解决方案,通过积极地展开循环并在寄存器中执行整个分解,似乎在我的大小的矩阵上实现了数百个GFLOPS/s,并且需要注意矩阵的维度在编译时定义.(这个循环展开应该有助于减少开销并改善Phi版本中的矢量化,因此getIndices()将变得更加重要!)所以现在我认为我的内核看起来应该更像:

double *data;  // move data to GPU/Phi once into shared memory
template double doTestUnrolled(int *idx) {
  double submatrix[K][K];
  // copy data into submatrix
  #pragma unroll
  for(int i=0; i(submatrix);
  return the_answer;
}


Phi版本在一个`cilk_for'循环中从model = 0到2 ^ N(或者更确切地说是一个用于测试的子集)解决每个模型,但现在为了批处理GPU工作并分摊内核启动开销我必须按字典顺序迭代每个K = 1到41位的模型编号(如doynax所述).  

编辑2:   现在假期结束了,这里有一些使用icc版本15的我的Xeon E5-2602的结果.我用来测试的代码在这里:http://pastebin.com/XvrGQUat.我对具有完全K位设置的整数执行位提取,因此在下表的"基本"列中测量的词典迭代有一些开销.这些进行2 ^ 30次,N = 48(必要时重复).

"CTZ"是一个循环,它使用gcc内在函数__builtin_ctzll来获得最低位集:

for(int i=0; i

Mark是Mark的无分支循环: 

for(int i=0; i>= 1;
} 


Tab1是我原始的基于表格的代码,带有以下副本宏:

#define COPY(d, s, n) \
switch(n) { \
case 8: *(d++) = *(s++); \
case 7: *(d++) = *(s++); \
case 6: *(d++) = *(s++); \
case 5: *(d++) = *(s++); \
case 4: *(d++) = *(s++); \
case 3: *(d++) = *(s++); \
case 2: *(d++) = *(s++); \
case 1: *(d++) = *(s++); \
case 0: break;        \
}


Tab2与Tab1的代码相同,但复制宏只移动8个字节作为单个副本(从doynax和LưuVĩnhPhúc中获取想法......但请注意,这不能确保对齐):

#define COPY2(d, s, n) { *((uint64_t *)d) = *((uint64_t *)s); d+=n; }


结果如下.我想我最初声称Tab1比CTZ快3倍只适用于大K(我在测试的地方).Mark的循环比我原来的代码更快,但是在COPY2宏中删除分支会占用K> 8的蛋糕.

 K    Base    CTZ   Mark   Tab1   Tab2
001  4.97s  6.42s  6.66s 18.23s 12.77s
002  4.95s  8.49s  7.28s 19.50s 12.33s
004  4.95s  9.83s  8.68s 19.74s 11.92s
006  4.95s 16.86s  9.53s 20.48s 11.66s
008  4.95s 19.21s 13.87s 20.77s 11.92s
010  4.95s 21.53s 13.09s 21.02s 11.28s
015  4.95s 32.64s 17.75s 23.30s 10.98s
020  4.99s 42.00s 21.75s 27.15s 10.96s
030  5.00s 100.64s 35.48s 35.84s 11.07s
040  5.01s 131.96s 44.55s 44.51s 11.58s

    
  




  
  
  

    

      

        doynax..
         7
      
      
我认为这里表现的关键是关注更大的问题,而不是微观优化从随机整数中提取比特位置.

根据您的示例代码和之前的SO问题判断,您将枚举所有按顺序设置K位的字,并从中提取位索引.这大大简化了事情.

如果是这样,那么不是每次迭代重建位位置而是直接递增位数组中的位置.一半时间这将涉及单循环迭代和增量.

这些方面的东西:

// Walk through all len-bit words with num-bits set in order
void enumerate(size_t num, size_t len) {
    size_t i;
    unsigned int bitpos[64 + 1];

    // Seed with the lowest word plus a sentinel
    for(i = 0; i < num; ++i)
        bitpos[i] = i;
    bitpos[i] = 0;

    // Here goes the main loop
    do {
        // Do something with the resulting data
        process(bitpos, num);

        // Increment the least-significant series of consecutive bits
        for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
            bitpos[i] = i;
    // Stop on reaching the top
    } while(++bitpos[i] != len);
}

// Test function
void process(const unsigned int *bits, size_t num) {
    do
        printf("%d ", bits[--num]);
    while(num);
    putchar('\n');
}


没有特别优化,但你得到了一般的想法.

2 个回答

这里有一些非常简单的东西可能会更快 - 没有测试就无法知道.很大程度上取决于设置的位数与未设置的数量.您可以展开这个以完全删除分支,但是对于今天的处理器,我不知道它是否会加速.
```
unsigned char idx[K+1]; // need one extra for overwrite protection
unsigned char *dst=idx;
for (unsigned char i = 0; i < 50; i++)
{
    *dst = i;
    dst += x & 1;
    x >>= 1;
}
```
PS问题中的示例输出是错误的,请参阅http://ideone.com/2o032E
2023-02-08 19:17 回答

enlend_443

我认为这里表现的关键是关注更大的问题,而不是微观优化从随机整数中提取比特位置.

根据您的示例代码和之前的SO问题判断,您将枚举所有按顺序设置K位的字,并从中提取位索引.这大大简化了事情.

如果是这样,那么不是每次迭代重建位位置而是直接递增位数组中的位置.一半时间这将涉及单循环迭代和增量.

这些方面的东西:

// Walk through all len-bit words with num-bits set in order
void enumerate(size_t num, size_t len) {
    size_t i;
    unsigned int bitpos[64 + 1];

    // Seed with the lowest word plus a sentinel
    for(i = 0; i < num; ++i)
        bitpos[i] = i;
    bitpos[i] = 0;

    // Here goes the main loop
    do {
        // Do something with the resulting data
        process(bitpos, num);

        // Increment the least-significant series of consecutive bits
        for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
            bitpos[i] = i;
    // Stop on reaching the top
    } while(++bitpos[i] != len);
}

// Test function
void process(const unsigned int *bits, size_t num) {
    do
        printf("%d ", bits[--num]);
    while(num);
    putchar('\n');
}

没有特别优化,但你得到了一般的想法.

2023-02-08 19:17 回答

MrBatista

撰写答案