C++ multithreading with openMP: poor performance despite localized variables (false sharing?) -
i have run weird openmp problem.
the task take vector of strings , split each element contained k-mers (all contained substrings of length k). should parallelize trivially along elements of vector, k-merification procedure happens independently each element. want store results in map/set stl data structure (std::map<long long, std::map<std::string, std::set<unsigned int> > > local_forreturn
) , , allocate thread-local variable that.
the achieved parallelization behaviour, however, surprisingly bad - top on linux shows cpu usage of ~ 200%, despite running 40 threads on 40 core machine. (and have tested #omp critical
section not bottleneck).
my hunch might related false sharing, actual data contained in localized map/set stl classes end on heap. however, have neither idea of how test intuition, or how reduce false sharing stl constructs (if problem). appreciate ideas!
complete code:
#include <string> #include <assert.h> #include <set> #include <map> #include <vector> #include <omp.h> #include <iostream> int threads = 40; int k = 31; std::string generaterandomsequence(int length); char randomnucleotide(); std::vector<std::string> partitionstringintokmers(std::string str, int k); int main(int argc, char *argv[]) { // generate test data std::vector<std::string> requiredseq; for(unsigned int = 0; < 10000; i++) { std::string seq = generaterandomsequence(20000); requiredseq.push_back(seq); } // variable contain final result std::map<long long, std::map<std::string, std::map<unsigned int, int> > > forreturn; omp_set_num_threads(threads); std::cerr << "data generated, start parallel processing\n" << std::flush; // split workload (ie requiredseq) according number of threads long long max_i = requiredseq.size() - 1; long long chunk_size = max_i / threads; #pragma omp parallel { assert(omp_get_num_threads() == threads); long long thisthread = omp_get_thread_num(); long long firstpair = thisthread * chunk_size; long long lastpair = (thisthread+1) * chunk_size - 1; if((thisthread == (threads-1)) && (lastpair < max_i)) { lastpair = max_i; } std::map<long long, std::map<std::string, std::map<unsigned int, int> > > local_forreturn; for(long long seqi = firstpair; seqi <= lastpair; seqi++) { const std::string& seq_sequence = requiredseq.at(seqi); const std::vector<std::string> kmersinsegment = partitionstringintokmers(seq_sequence, k); for(unsigned int kmeri = 0; kmeri < kmersinsegment.size(); kmeri++) { const std::string& kmerseq = kmersinsegment.at(kmeri); local_forreturn[seqi][kmerseq][kmeri]++; } } #pragma omp critical { forreturn.insert(local_forreturn.begin(), local_forreturn.end()); } } return 0; } std::string generaterandomsequence(int length) { std::string forreturn; forreturn.resize(length); for(int = 0; < length; i++) { forreturn.at(i) = randomnucleotide(); } return forreturn; } char randomnucleotide() { char nucleotides[4] = {'a', 'c', 'g', 't'}; int n = rand() % 4; assert((n >= 0) && (n <= 3)); return nucleotides[n]; } std::vector<std::string> partitionstringintokmers(std::string str, int k) { std::vector<std::string> forreturn; if((int)str.length() >= k) { forreturn.resize((str.length() - k)+1); for(int = 0; <= (int)(str.length() - k); i++) { std::string kmer = str.substr(i, k); assert((int)kmer.length() == k); forreturn.at(i) = kmer; } } return forreturn; }
Comments
Post a Comment