Cheap and Secure Web Hosting Provider : See Now

[Solved]: How do we analyze Algorithms with parallelism features

, , No Comments
Problem Detail: 

As I am comparing between two algorithms, I was wondering which is the best approach to use to compare between the both. Big-O seems not the perfect metric for me, as the it's based on the worst-case, while most of cases in the area that I am working on is optimized. The second though is, I am using "Python-Maps" (and other parallelism features, but let's take just "map no python as an example" to simplify the question) to implement the first algorithm, while I don't use parallelism features in the second algorithm.

What is the approach to compare, abstractly, the both algorithms ?. Can you please give any hint ?

Asked By : user3378649

Answered By : Wandering Logic

The conceptual/abstract model I use when developing a parallel algorithm is (loosely) Leslie Valiant's Bulk Synchronous Parallel model.

You have $P$ processors (where $P$ is typically some small number like 4, 8, 16, maybe 128, but certainly $P \ll N$). You are trying to design an algorithm where you spend most of your time computing in parallel, then you use a barrier to synchronize the processors, then do some communication between processors, synchronize again, and then go on to the next parallel computation step. It is called bulk synchronous because you can think of the algorithm as a series of steps where the work inside each step is completely asynchronous, but the steps happen one at a time and in order.

There are a handful of different things you need to worry about that all interact and have inescapable tradeoffs.

  1. Algorithm efficiency. You should have a best sequential algorithm you are using as a baseline. Often the best sequential algorithm won't parallelize well (because of all the rest of this list) so your parallel algorithm may require substantially more work. For example: There are often clever sequential algorithms that are $O(N \log N)$ where the only realistic parallel algorithms are $O(N^2)$. Alpha-Beta pruning tree search algorithms are another case where you may need to do (significantly) more work to avoid synchronization/communication overheads.
  2. Work independence. You need to choose an algorithm where, in each bulk step, there are approximately $O(N)$ (or more) independent pieces of work to do.
  3. Load imbalance. During each step of the algorithm you want to break the work into $P$ nearly-equal chunks. If the chunks are unequal then each bulk synchronous step will take the amount of time for the largest chunk and many of your processors will spend a significant portion of their time idle. Examples of load imbalance problems happen with:
    • divide and conquer algorithms (the first step has to happen on one processor, then you get two (perhaps unequal chunks) so the second step happens on two processors. It takes $O(\log P)$ steps to get all $P$ processors working, and the amount of work they are doing may not be similar, let alone the same.)
    • hashing algorithms (it depends on whether your hash function is distributing the data evenly among hash buckets or whether there is some bias. The bias ends up as load imbalance.)
    • sparse graphs and sparse matrices. (Similar to the hashing problem. It may be easy to divide up the number of nodes in your graph evenly, but the number of edges per node determines the amount of work so two chunks with similar numbers of nodes may have very different amounts of work involved.)
  4. Synchronization overhead. Synchronization is "pure overhead". It is added cost for parallelization that simply isn't required in the sequential algorithm.
  5. Data communication overhead. This one is quite tricky because it interacts with a problem (cache locality) that is quite difficult in the sequential case as well.
  6. Reductions. Often some of your work looks unavoidably sequential, but involves associative operators that can be mostly parallelized (with some cleverness, and assuming that your hardware supports it.) Similarly you sometimes have operators that are (a) associative, (b) commutative and (c) applied sparsely, in which case it may be possible to parallelize a chunk of code using fine-grained mutual-exclusion (locks).
  7. SIMD hardware limitations. The Bulk Synchronous Parallel model is single-program-multiple-data (SPMD). In each chunk of work between barrier synchronizations each processor has something like $O(N/P)$ work to do, and that work may require a substantial amount of data-dependent branching. But most modern parallel hardware exposes a substantial fraction of its parallelism only with single-instruction-multiple-data (SIMD) short-vector units. This means that you have to go beyond the work independence in step (2) above. You have to find work to do that is not just independent but where there are few (hopefully none) data-dependent branches for each chunk of work.
  8. Accelerator data marshalling. This is starting to get better, but for the moment many parallel accelerators (GPGPUs) require you to move data back and forth between the accelerator and the processor at each barrier synchronization point. This means that you need to find more work independence to make the cost of using the accelerator break even.

That's a lot to keep in mind, but you can often think about it a little more simply.

  1. Rewrite the sequential code using the least clever algorithm you can think of.
  2. Look for the loop nest where you are spending most of your time (use a profiler.)
  3. Choose the outer-most loop where the iterations are independent of one another.
  4. Turn that loop into a parallel-for (using whatever method your programming language and/or libraries provide.)
  5. Hope that everything else works out.
Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/30764

3.2K people like this

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback