First 5 minutes of hell

The sequential algorithm has complexity (first polynomial has coefficients, second polynomial has coefficients). The final C[n+m+1] array has to be initialized to zeros first, so running sums can take place.

  1. Parallelization of the outer i-loop
  • the outer loop is split to different threads and each thread reads the inner array sequentially
  • atomic update is needed for writing to the array (the threads may may meet each other at the same memory cell of the thread)
  1. Parallelization of the inner j-loop
  • here are two ways:
    • sequential outer loop and team of threads is initialized within each outer-loop (fork-join overhead)
    • team of threads created once, each thread running outer loop in parallel and only inner-loop iterations are split
  • since the inner-loop iterations are split, the i+j combinations are unique and there is no need for atomicity
  • false sharing is unavoidable in both variants, because the written areas of C[i+j] are shifted only 1 position to the right (outer loop moves it synchronously for all threads), so they will often invalidate each other’s cache blocks
  1. Parallelization via output array decomposition
  • this approach uses a different view on the problem, we utilize the convolution formula C[k] = sum over (l) of A[l] * B[k-l]
    • we separate the output into disjoint areas and each thread computes different areas
    • this way false sharing and atomic operations are entirely eliminated
    • the process visually:
      • take array B, reverse it and slide it over array A (and multiply overlapping elements)
      • for , only first element from A and first element from (reversed) B overlap
      • for , first two elements overlap
      • this way, we can easily calculate the separately
  • and for A[l] and B[k-l] to be valid for a fixed k (set by the outer loop), they need to be in those bounds:
    • l >= max(0, k - n)
    • l <= min(k, m)
    • it creates the diamond shape (we need to distribute the iterations (and those disjoint areas) in a way that there is a similar load on all threads)
  • next improvements:
    • align the chunk size with the cache block size (cache_line_size / sizeof(element))
      • eliminates false sharing
    • compute everything locally and write only the final result to the shared memory (eliminates frequent writes into the shared memory)

Problem definition

Input: Polynomials and , where and . Output: Polynomial , where:

Polynomials are represented as arrays of coefficients (A[m+1], B[n+1], C[m+n+1]). The classical sequential algorithm has O(nm) complexity.


Sequential algorithm (Code PM-Seq)

int A[m+1], B[n+1], C[n+m+1];
 
for (int k = 0; k <= m + n; k++)
    C[k] = 0;
 
for (int i = 0; i <= m; i++)
    for (int j = 0; j <= n; j++)
        C[i+j] += A[i] * B[j];

The core operation C[i+j] += A[i]*B[j] accumulates products into running sums - this is why C must be initialized to zero first. The multiplication combines all pairs of coefficients whose exponents sum to the same value, producing a diagonal-like summation pattern.


Strategy 1: Parallelization of the outer i-loop (Code PM-A)

Access pattern (for m = n = 5, p = 3)
  • Array A is split among threads by schedule(static): τ₀ gets A[0..1], τ₁ gets A[2..3], τ₂ gets A[4..5].
  • All threads read all of B sequentially.
  • Write regions in C overlap: τ₀ writes to C[0..6], τ₁ to C[2..8], τ₂ to C[4..10].
Incorrect version (race condition)
#pragma omp parallel for schedule(static)
for (int i = 0; i <= m; i++)
    for (int j = 0; j <= n; j++)
        C[i+j] += A[i]*B[j];  // INCORRECT - atomicity needed!

Different threads can simultaneously update the same element of C (e.g., thread with i=1, j=3 and thread with i=3, j=1 both write to C[4]). Without atomic protection, the results are completely useless.

Corrected version
#pragma omp parallel for schedule(static)
for (int i = 0; i <= m; i++)
    for (int j = 0; j <= n; j++) {
        #pragma omp atomic update
        C[i+j] += A[i]*B[j];
    }
Properties of PM-A

Strategy 2: Parallelization of the inner j-loop (Codes PM-B1, PM-B2)

Key insight

All threads share the same i value (guaranteed by the barrier at the end of each parallel for). Each thread gets a different subset of j values. Since i is fixed and j values are disjoint across threads, the sums i + j are also disjoint across threads - no race conditions, no need for atomic operations.

Variant PM-B1: fork-join per outer iteration
int A[m+1], B[n+1], C[m+n+1];
 
for (int i = 0; i <= m; i++)
    #pragma omp parallel for schedule(static)
    for (int j = 0; j <= n; j++)
        C[i+j] += A[i]*B[j];

In each iteration of the outer loop, p new threads are forked and joined. The overhead is m × T_barr (one fork-join per outer iteration).

Access pattern (for m = n = 5, p = 3, showing iteration i = 2)
  • All threads read the same A[2].
  • B is split: τ₀ reads B[0..1], τ₁ reads B[2..3], τ₂ reads B[4..5].
  • C writes are disjoint: τ₀ writes C[2..3], τ₁ writes C[4..5], τ₂ writes C[6..7].
Variant PM-B2: single thread team
int A[m+1], B[n+1], C[m+n+1];
 
#pragma omp parallel
for (int i = 0; i <= m; i++)
    #pragma omp for schedule(static)
    for (int j = 0; j <= n; j++)
        C[i+j] += A[i]*B[j];

The thread team is created only once. All threads iterate the outer i-loop synchronously (each with its own copy of i), with a barrier at the end of each #pragma omp for. This eliminates the fork/join overhead of PM-B1 - just one fork at the beginning and one join at the end. The array B is divided equally among threads in each iteration and will be cached.

Properties of PM-B1/B2
  • Atomicity: Not needed - write regions are disjoint within each outer iteration.
  • Synchronization: PM-B1 has m × T_barr (repeated fork-join). PM-B2 has m × T_barr (barriers only, no fork-join overhead).
  • False sharing: Unavoidable in both variants, for a fundamentally different reason than PM-A. As i increments by 1, each thread’s write region C[i+j] shifts one position to the right. Threads are continuously moving through cache blocks - a cache block that was private to one thread in iteration i may become shared in iteration i+1. The write regions are mobile, sweeping across the array, making it impossible to fix cache blocks to specific threads.

False sharing is unavoidable here. The threads are running through cache blocks freely, and there is no way to fix cache blocks to threads because the write regions are mobile.

  • Load balance: Good - each thread performs ⌈(n+1)/p⌉ iterations of equal cost per outer iteration.

Strategy 3: Parallelization via output decomposition (Codes PM-C1 → PM-C3)

This approach completely restructures the algorithm: instead of iterating over input arrays and accumulating into C, each thread is responsible for computing specific output elements C[k] directly using the convolution formula.

  • e.g. C[3] will be calculated by thread 2 (so it will be the only thread writing into this element)
PM-C1: basic output decomposition with schedule(static)
int A[m+1], B[n+1], C[m+n+1];
 
#pragma omp parallel for schedule(static)
for (int k = 0; k <= (m + n); k++)
    for (int l = max(0, k - n); l <= min(k, m); l++)
        C[k] += A[l] * B[k-l];
Properties of PM-C1
  • Atomicity: Not needed - each thread writes to a disjoint part of C.
  • Synchronization: Minimal - only 1 implicit barrier.
  • False sharing: Possible at chunk boundaries (standard problem for any static distribution of contiguous output elements).
  • Load balance: Poor with schedule(static). The number of terms in the inner sum depends on k:
kContributing [i, j] pairs# terms
0[0,0]1
1[0,1] + [1,0]2
2[0,2] + [1,1] + [2,0]3
m (= n)[0,m] + [1,m-1] + ... + [m,0]m+1
2m[m,m]1

For m = n, the number of terms grows linearly from 1 to m+1 and then decreases back to 1. With schedule(static), the middle thread gets roughly twice the load of the edge threads.

Why schedule(dynamic, 1) is not the answer

Dynamic scheduling with default chunk_size = 1 would balance the load but introduces false sharing (single-element chunks cause neighboring threads to write to the same cache block). Dynamic scheduling also has higher runtime overhead than static.

PM-C2: balanced output decomposition with schedule(static, X)

Since the load profile is known and predictable (linear growth then linear decay), optimal partitioning can be precomputed statically rather than relying on dynamic scheduling. The key is to use chunk_size = X = cache_line_size / sizeof(int) with proper alignment of C:

int A[m+1], B[n+1], alignas(cache_line_size) int C[m+n+1];
int X = cache_line_size / sizeof(int);
 
#pragma omp parallel for schedule(static, X)
for (int k = 0; k <= (m + n); k++)
    for (int l = max(0, k - n); l <= min(k, m); l++)
        C[k] += A[l] * B[k-l];
  • is an index of the final array
  • needs to run between 0 and , which is in fact (because ) because the array A is of size
  • it creates the diamond-like shape
    • left side of the diamond (where is low):
      • the runs from 0 (we are still in the array) to (we are still in the array)
    • right side of the diamond (where is large):
      • the runs from (beyond A’s range)

This works because schedule(static, X) distributes chunks of size X in round-robin fashion across threads. The symmetric load profile (grows then shrinks) means each thread gets a mix of cheap chunks (near edges) and expensive chunks (near center), naturally balancing the total work. At the same time, each chunk occupies exactly one cache block (given alignment), eliminating false sharing. The requirement is that m, n ≫ Xp (the polynomials must be sufficiently large).

PM-C3: local accumulation optimization

This further optimization accumulates the inner sum in a local register variable s, writing to shared memory only the final result:

int A[m+1], B[n+1], alignas(cache_line_size) int C[m+n+1];
int X = cache_line_size / sizeof(int);
 
#pragma omp parallel for schedule(static, X)
for (int k = 0; k <= (m + n); k++) {
    int s = 0;
    for (int l = max(0, k - n); l <= min(k, m); l++)
        s += A[l] * B[k-l];
    C[k] = s;
}

The general principle: perform maximum computation locally (in registers/stack) and write to shared memory only when the final result is ready. This minimizes shared memory writes and reduces memory bus traffic.

Visualization of PM-C2/C3 load distribution

For p = 3, the computation can be visualized as a diamond shape (reflecting the triangular growth and decay of per-index cost). Using schedule(static, X), horizontal stripes of width X are assigned round-robin to threads. Each color (thread) covers approximately 1/3 of the total area, illustrating balanced load distribution.


Comprehensive comparison of all strategies

AspectPM-A (outer i-loop)PM-B1/B2 (inner j-loop)PM-C2/C3 (output decomposition)
Atomic neededYesNoNo
Race conditionsYes (overlapping writes to C)No (disjoint writes per iteration)No (disjoint writes globally)
Synchronization1 barrierm × T_barr (B1: fork-join, B2: barriers only)1 barrier
False sharingYes (can eliminate by inflating C)Unavoidable (mobile write regions)Eliminated by alignment + schedule(static, X)
Load balanceGood (uniform iterations)Good (uniform iterations)Poor with static, good with static,X (round-robin over symmetric profile)
Memory overheadNone (or p × (m+n+1) if inflating C)NoneNone
Thread creation1 fork-joinB1: m fork-joins, B2: 1 fork-join1 fork-join
Best variant within strategy-PM-B2 (single team)PM-C3 (local accumulation)

Key design lessons from polynomial multiplication

The polynomial multiplication example demonstrates several important principles:

  • Output decomposition is often the best strategy: By making each thread responsible for computing disjoint output elements, race conditions and atomic operations are eliminated entirely. This is a general technique applicable whenever the output can be partitioned.
  • Load balance and false sharing can conflict: schedule(dynamic, 1) balances load but causes false sharing. schedule(static) avoids false sharing overhead but may imbalance load. schedule(static, X) with round-robin distribution over a symmetric load profile achieves both goals simultaneously.
  • Mobile write regions defeat false sharing elimination: When thread write regions shift with each outer iteration (as in PM-B), no static assignment of cache blocks to threads is possible.
  • Minimize shared memory writes: Accumulating partial results in local variables (PM-C3) and writing only the final result to shared memory reduces bus traffic and potential for false sharing.
  • Avoid repeated fork-join: Creating a single thread team (PM-B2 over PM-B1) eliminates overhead proportional to the number of outer iterations.

Potential exam questions

  1. State the formula for the coefficients of the product polynomial . What is the sequential complexity of classical polynomial multiplication?
  2. Write OpenMP code for parallelization of the outer i-loop (PM-A). Why are atomic operations required? Draw the access pattern for C with m = n = 5, p = 3.
  3. Write OpenMP code for parallelization of the inner j-loop (PM-B1). Why are atomic operations not needed here? What is the synchronization overhead?
  4. Explain the difference between PM-B1 and PM-B2. Why is PM-B2 faster? Write both codes.
  5. Explain why false sharing is unavoidable in PM-B1/B2. What is fundamentally different about the cause of false sharing here compared to PM-A?
  6. Write OpenMP code for the output decomposition approach (PM-C1). Why is schedule(static) insufficient for good performance? Use the table of [i,j] contributions for m = n = 5 to explain the load imbalance.
  7. Why is schedule(dynamic, 1) a poor choice for PM-C, despite solving the load imbalance problem?
  8. Explain how schedule(static, X) with X = cache_line_size / sizeof(int) and alignment of C simultaneously achieves load balance and eliminates false sharing in PM-C2. What assumption on m, n is needed?
  9. Write the PM-C3 code with local accumulation. What general principle does this optimization illustrate?
  10. Compare all three strategies (PM-A, PM-B, PM-C) in terms of atomicity requirements, synchronization overhead, false sharing, and load balance. Which is the best overall and why?
  11. For m = n = 5 and p = 3, how many terms does each thread compute under schedule(static) in PM-C1? Quantify the load imbalance.
  12. Describe the diamond-shaped visualization of PM-C2/C3 load distribution. Why does round-robin assignment of cache-line-sized chunks naturally balance the symmetric load profile?