Vector sum with threads and no speedup

Posted by 请让我来打酱油 on 2023-01-16 08:18

I have a C++ program that essentially performs some matrix calculations. For these I use LAPACK/BLAS and, depending on the platform, usually link against MKL or ACML. Many of these matrix calculations operate on different, independent matrices, so I use std::thread to run the operations in parallel. However, I noticed that I get no speedup when using more threads. I traced the problem back to the daxpy BLAS routine: if two threads use this routine in parallel, each thread takes twice as long, even though the two threads operate on different arrays.
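
To make the setup concrete, the pattern described above looks roughly like the following. This is only a minimal sketch, assuming the CBLAS interface to daxpy (y := alpha*x + y) as shipped with e.g. MKL or OpenBLAS; it is not the original code, and the array size and thread count are purely illustrative.

// Hedged sketch (not the original program): two threads each call daxpy on their
// own, completely independent arrays, mirroring the setup described above.
// Assumes a CBLAS interface such as the one provided by MKL or OpenBLAS.
#include <cblas.h>
#include <cstddef>
#include <thread>
#include <vector>

void axpy_worker(std::size_t dim) {
    std::vector<double> x(dim, 1.0), y(dim, 2.0);   // data private to this thread
    for (int iter = 0; iter < 1000; ++iter)
        cblas_daxpy(static_cast<int>(dim), 1.0, x.data(), 1, y.data(), 1);  // y += 1.0 * x
}

int main() {
    const std::size_t dim = 936 * 936;
    std::thread t1(axpy_worker, dim);   // both threads work on disjoint memory,
    std::thread t2(axpy_worker, dim);   // yet each one slows down
    t1.join();
    t2.join();
}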

The next thing I tried was to write a new, simple routine that performs the vector addition, to replace the daxpy routine. With one thread this new routine is as fast as the BLAS routine, but when compiled with gcc it suffers from the same problem as the BLAS routine: doubling the number of threads running in parallel also doubles the time each thread needs, so there is no speedup. With the Intel C++ compiler, however, the problem disappears: the time a single thread needs stays constant as the number of threads increases.

However, I also need to compile on systems that do not have the Intel compiler. So my questions are: why is there no speedup with gcc, and is there any way to improve gcc's performance?

I wrote a small program to demonstrate the effect:

// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread

#include <iostream>
#include <thread>
#include <vector>

#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"

void simplesum(double* a, double* b, std::size_t dim);

int main() {

    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };

        std::vector<std::size_t> times(num_threads, 0);

        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];

            for (std::size_t i {0}; i < dim; ++i){   // initialise the full arrays
                pA[i] = i;
                pB[i] = 2*i;
            }   

            boost::posix_time::ptime now1 = 
                boost::posix_time::microsec_clock::universal_time();    

            for (std::size_t n{0}; n < 1000; ++n){
                simplesum(pA, pB, dim);
            }

            boost::posix_time::ptime now2 = 
                boost::posix_time::microsec_clock::universal_time(); 
            boost::posix_time::time_duration dur = now2 - now1; 
            times[tid] += dur.total_milliseconds(); 
            delete[] pA;
            delete[] pB;
        };

        std::vector<std::thread> mythreads;

        // start threads
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }

        // wait for threads to finish
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << " Thread " << n+1 << " of " << num_threads 
                      << "  took " << times[n]<< "msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim){

    // element-wise vector sum: a[i] += b[i]
    for(std::size_t i{0}; i < dim; ++i)
    {
        a[i] += b[i];
    }
}

Output with gcc:

Thread 1 of 1  took 532msec
Thread 1 of 2  took 1104msec
Thread 2 of 2  took 1103msec
Thread 1 of 3  took 1680msec
Thread 2 of 3  took 1821msec
Thread 3 of 3  took 1808msec
Thread 1 of 4  took 2542msec
Thread 2 of 4  took 2536msec
Thread 3 of 4  took 2509msec
Thread 4 of 4  took 2515msec

Output with icc:

Thread 1 of 1  took 663msec
Thread 1 of 2  took 674msec
Thread 2 of 2  took 674msec
Thread 1 of 3  took 681msec
Thread 2 of 3  took 681msec
Thread 3 of 3  took 681msec
Thread 1 of 4  took 688msec
Thread 2 of 4  took 689msec
Thread 3 of 4  took 687msec
Thread 4 of 4  took 688msec

So with icc the time a single thread needs for the computation stays constant (as I expected; my CPU has 4 physical cores), while with gcc the time per thread increases. Replacing the simplesum routine with BLAS::daxpy yields the same results for icc and gcc (not surprisingly, since most of the time is spent in the library), and those results are almost identical to the gcc numbers quoted above.
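
For reference, the drop-in daxpy variant of simplesum mentioned above would be little more than a one-line wrapper. This is only a sketch assuming the CBLAS interface; the name simplesum_daxpy is purely illustrative.

// Hypothetical wrapper: performs a += 1.0 * b via BLAS instead of the hand-written loop.
#include <cblas.h>
#include <cstddef>

void simplesum_daxpy(double* a, double* b, std::size_t dim) {
    cblas_daxpy(static_cast<int>(dim), 1.0, b, 1, a, 1);
}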
