First 5 minutes of hell

Parallel time:

  • = a total time elapsed, consists of two parts:
    • computation time = time spent on actual computation
    • communication and synchronization time = overhead (e.g. distributing data partitions onto all processors)

Parallel cost:

  • = total cost of the parallel computation
    • the cost is optimal if it is asymptotically the same as the sequential time (so the is p-times faster)
    • the sequential time is also a lowerbound for the parallel cost, because if the cost would be less than sequential time, that would mean finding better sequential algorithm (by just simulating the parallel one sequentially)

Parallel speedup:

  • = how much faster the parallel algorithm is?
    • if , the parallel algorithm does not make sense (it is slower than sequential solution)
    • the ideal speedup is : if I use processors, the parallel algorithm will be p-times faster than sequential
    • the goal is linear speedup: if increases times, the total computation time () should decrease times
    • superlinear speedup (): can occur, when:
      • multiple processors (with their main memory units) can cumulatively hold all data in the main memory to avoid swapping entirely (each processing smaller chunk of data at once)
      • state-space search anomaly: multiple processors exploring different parts of the state-space could find the solution faster

Parallel efficiency:

    • if the total parallel cost is equal to the sequential time, we have 100% efficiency (but this is often not possible due to communication/synchronization overhead)
    • it could also be expressed as speedup per core
    • constant efficiency means that the efficiency does not degrade to 0 when and grow (it should be bounded below by a positive constant

Parallel performance optimality:

  • all three expressions are equal:
    • parallel algorithm is cost optimal
    • it has linear speedup
    • it has a constant efficiency

Prerequisities

Sequential time complexity
  • parameters which influence the time complexity: size of the data, algorithm used, what problem do I solve
  • every problem has a sequential lower bound (it doesn’t have to be known)
    • so the worst-time run of any sequential algorithm cannot be better than this
    • typical (trivial) lowerbound is the size of the data (because for solving the problem we need to read the data)
    • this lower bound cannot be beaten in the future
  • upper bound is set by the complexity of the worst-case run of the best known sequential algorithm
    • e.g. sorting (we know the fundamental lower bound and we have algorithms that reach it)
    • e.g. matrix multiplication (trivial lower bound is , but we didn’t discover that fast algorithms (yet)) - or the true lower bound is different (we don’t know that also)
  • sequential optimality
    • a sequential algorithm for problem is optimal iff (the best known algorithm matches the lower bound)

Parallel time

  • parameters: same as sequential + number of processors, threads, processes etc.

    • - amount of data
    • - number of processors/threads/processes (depending on the context)
      • it says nothing about their organization etc., it is simplified
  • definition: = the time elapsed from the beginning of a parallel computation until the last thread finishes the execution

    • consists of two components:
      • computational time: arithmetic, logical, memory operations
      • communication and synchronization overhead: shared data access, message passing, barrier synchronizations, mutual exclusion, etc.
  • the whole parallel programming is about a trade-off with the speed gain and costs (increased overhead costs, distributed data etc.)

    • we need to account for overhead latencies related to communication and synchronization
    • e.g. broadcast part in array search (distributing the information to all computing nodes what to search for)
    • balancing the computation itself against the overhead costs imposed by distributing both computation and data
  • example - parallel search in unsorted shared array

    • given , what is the best choice of to minimize time?
      • small : local search dominates
      • big : communication overhead dominates
  • parallel time lower bound :
    • it’s only theoretical minimum parallel time (counting with zero overhead)

Parallel cost

  • = processor-time product
  • definition:
- the cost we need to pay for the parallel computation
	- I rent 8 CPUs for 6 hours, I will pay for 48 CPU-hours
- this is the simplest calculation (the cores are assigned statistically (based on previous costs) and I have to pay even for idle cores)
  • the best I can do is sequential complexity (in terms of the cost)
    • we cannot have a parallel algorithm that has a lower cost than a sequential complexity
      • that would imply that by simulating the parallel algorithm sequentially would yield a faster sequential algorithm - that’s a contradiction
      • the work cannot vanish, it is only distributed (so the total work (= parallel cost) cannot be lower than the sequential work, only higher (adding some overhead))
    • = the total work done is asymptotically the same as the best sequential algorithm
  • cost optimality
    • means you are not wasting parallel resources - the total work done by all processors is asymptotically the same as the best sequential algorithm

Parallel speedup

  • how many times faster the parallel execution is compared to the best known sequential algorithm
  • definition:
- $SU(n)$ is the sequential upper bound (the worst-case time for the fastest sequential algorithm (if the parallel time is worse than $SU(n)$, it does not make sense to parallelize the algorithm))
  • the best speedup is : ( times faster with processors), in reality, it’s often less (due to overhead costs)
      • the speedup cannot be greater than (you cannot do better than times faster with processors (in the asymptotic sense))
  • linear speedup is the goal (if increases times, the total computation time () should decrease times)
    • it depends on the degree of data independence (more independent = more options to paralellize)
  • superlinear speedup (speedup exceeds )
    • better speedup caused by hardware limitations with the sequential approach
      • memory effect: where sequential algorithm requires more main memory, resulting to disk swapping (and cumulative memory of the parallel system holds everything in RAM, avoiding swapping entirely)
      • state-space search anomaly: parallel search may explore a different part of the search tree first and find the solution faster by luck
    • this is rather an exception

Parallel efficiency

  • = a relative utilization of computational resources during a parallel computation
  • it will never be 100 % due to overhead (communication and synchronization overhead)
  • if we scale and the efficiency remains the same, we have an optimal parallel algorithm
  • definition:

Lemma: is the speedup per core:

Constant efficiency
  • Definition (Constant efficiency): Given a constant , a parallel algorithm has constant efficiency if , i.e., asymptotically .
  • e.g. if efficiency is , it is not optimal as it goes to 0 as grows

Parallel performance optimality theorem

  • parallel optimality
    • these are equivalent:
      • parallel algorithm is cost-optimal
      • it has linear speedup
      • it has a constant efficiency Theorem (Parallel performance optimality): The following three conditions are equivalent for a parallel algorithm:

These three measures express the same thing about the quality of a parallel algorithm. Verifying any one of them establishes optimal parallel performance.


Summary of definitions and relationships

MeasureDefinitionBound
Sequential lower boundfundamental limit
Sequential upper boundbest known algorithm
Parallel timetime until last thread finishes
Parallel time lower boundtheoretical minimum
Parallel cost
Parallel speedup
Parallel efficiency

The equivalence theorem: cost-optimal linear speedup constant efficiency.


Potential exam questions

  1. Define and . What is the trivial lower bound? When is a sequential algorithm optimal vs. merely the best known?
  2. Define parallel time . What two components does it consist of on a real parallel computer?
  3. Define parallel cost . Prove that (Lemma 5).
  4. Define cost optimality. Why does Lemma 5 imply that cost optimality is equivalent to ?
  5. Define parallel speedup . Prove that (Lemma 9).
  6. Define linear speedup. Is considered linear speedup? Why or why not?
  7. Under what circumstances can superlinear speedup occur? Does this violate the theoretical bound ?
  8. Define parallel efficiency . Show that .
  9. State and prove Theorem 16 (parallel performance optimality) - the equivalence of cost optimality, linear speedup, and constant efficiency.
  10. Define the parallel time lower bound . Compute it for comparison-based sorting with .
  11. Given the parallel search example with , what happens if is too large relative to ?